BPE chooses the pair with the largest count and merges it into a new token. A stated lexicographic tie-break makes the rule deterministic even when counts tie.


Merge the most frequent pair

The argmax pair is (u,g) with count 5. Ties are broken by choosing the lexicographically smallest pair among those with the maximum count, though this round has a strict maximum, so the tie-break is not needed.

$$\operatorname{argmax} = (u, g), \quad \text{count} = 5$$
[Figure: Round one merge. The most frequent pair is merged deterministically. Tie-break: lexicographically smallest pair among max counts. Current corpus: hug (freq=3, symbols h u g), pug (freq=2, symbols p u g). Pair counts: (h,u)=3, (p,u)=2, (u,g)=5. Chosen merge: (u,g) -> ug, count=5. After merge: hug (freq=3, symbols h ug), pug (freq=2, symbols p ug).]
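The counting and selection step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `corpus`, `count_pairs`, and `choose_pair` are hypothetical, and "lexicographically smallest" is assumed to mean Python's default ordering on tuples of strings.

```python
from collections import Counter

# Corpus from the example: word -> (current symbols, frequency).
corpus = {"hug": (["h", "u", "g"], 3), "pug": (["p", "u", "g"], 2)}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighting each pair by its word's frequency."""
    counts = Counter()
    for symbols, freq in corpus.values():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def choose_pair(counts):
    """Pick the pair with the largest count; on ties, the smallest pair in tuple order."""
    # Assumption: the tie-break compares pairs with Python's tuple ordering.
    return min(counts, key=lambda pair: (-counts[pair], pair))

pair_counts = count_pairs(corpus)
best = choose_pair(pair_counts)
print(best, pair_counts[best])
```

Sorting on the key `(-count, pair)` folds the argmax and the tie-break into one comparison, so the result does not depend on the order in which pairs were inserted into the counter.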

The new token

Merging (u,g) creates the token ug. The segmentations become h ug with frequency 3 and p ug with frequency 2.

$$(u, g) \rightarrow ug$$
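Applying the merge to each word's symbol list might look like the sketch below. The helper name `merge_pair` is hypothetical, and the left-to-right, non-overlapping scan is an assumption; the example words contain only one occurrence of the pair, so the scan order does not affect the result here.

```python
def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` (scanned left to right) with one joined token."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])  # e.g. "u" + "g" -> "ug"
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Corpus from the example: word -> (current symbols, frequency).
corpus = {"hug": (["h", "u", "g"], 3), "pug": (["p", "u", "g"], 2)}
after = {
    word: (merge_pair(symbols, ("u", "g")), freq)
    for word, (symbols, freq) in corpus.items()
}
print(after)
```

Word frequencies are carried through unchanged; only the segmentation of each word is rewritten.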

Summary

After one merge, the corpus segmentation has changed. The next round counts adjacent pairs in the new segmentation, where ug is a single symbol.

$$h\ ug, \quad p\ ug$$
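As a rough check of what the next round would see, the pair counts can be recomputed on the merged segmentation. This sketch only extrapolates from the numbers above; the names `after_merge`, `next_counts`, and `next_best` are hypothetical, and the tie-break is again assumed to follow Python's tuple ordering.

```python
from collections import Counter

# Segmentation after the first merge, taken from the example above.
after_merge = {"hug": (["h", "ug"], 3), "pug": (["p", "ug"], 2)}

next_counts = Counter()
for symbols, freq in after_merge.values():
    for pair in zip(symbols, symbols[1:]):
        next_counts[pair] += freq

# Same selection rule: largest count first, then smallest pair in tuple order.
next_best = min(next_counts, key=lambda pair: (-next_counts[pair], pair))
print(dict(next_counts), next_best)
```

Under these assumptions the only remaining pairs are (h,ug) with count 3 and (p,ug) with count 2, so the second round also has a strict maximum.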