Byte-pair encoding starts from character symbols and builds larger tokens by merging adjacent pairs. This first lesson shows only the toy corpus and its integer frequencies.


Subword tokens

BPE starts from character symbols and repeatedly merges the most frequent adjacent symbol pair. The source here is a tiny corpus with integer frequencies.

$$\text{start from characters}$$

[Figure: Initial BPE corpus. The source words and frequencies are visible. Tie-break: lexicographically smallest pair among max counts. Current corpus: hug freq=3, symbols h u g; pug freq=2, symbols p u g.]
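As a minimal sketch of this starting state, the corpus can be held as a mapping from symbol sequences to integer frequencies. The function name `initial_corpus` and the dictionary layout are illustrative assumptions, not something the lesson prescribes.

```python
def initial_corpus(word_freqs):
    """Split each word into single-character symbols, keeping its frequency.

    Hypothetical helper: the tuple-of-symbols key is an assumed representation.
    """
    return {tuple(word): freq for word, freq in word_freqs.items()}


# Toy corpus from the lesson: hug appears 3 times, pug appears 2 times.
corpus = initial_corpus({"hug": 3, "pug": 2})

for symbols, freq in corpus.items():
    print("".join(symbols), f"freq={freq}", "symbols=" + " ".join(symbols))
```

Using tuples as keys keeps each word's symbols immutable and hashable, so a later merge step can build a new mapping instead of editing rows in place.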

The shown corpus

The word hug has frequency 3 and symbols h u g. The word pug has frequency 2 and symbols p u g.

$$\text{hug: }h\ u\ g\text{ freq }3,\quad \text{pug: }p\ u\ g\text{ freq }2$$

Summary

The first register is the displayed corpus itself. The next step counts adjacent pairs from these exact rows.

$$\text{visible corpus}\rightarrow\text{pair counts}$$
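The step from visible corpus to pair counts could look roughly like the sketch below. It assumes each adjacent pair is weighted by its word's frequency, and it applies the tie-break stated in the figure (the lexicographically smallest pair among those with the maximum count). The names `count_pairs` and `best_pair` are hypothetical.

```python
from collections import Counter

# Corpus rows exactly as displayed: symbols and integer frequency.
corpus = {("h", "u", "g"): 3, ("p", "u", "g"): 2}


def count_pairs(corpus):
    """Count adjacent symbol pairs, weighting each by its word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def best_pair(counts):
    """Pick the most frequent pair; ties go to the lexicographically smallest."""
    max_count = max(counts.values())
    return min(pair for pair, count in counts.items() if count == max_count)


pair_counts = count_pairs(corpus)
print(dict(pair_counts))       # {('h', 'u'): 3, ('u', 'g'): 5, ('p', 'u'): 2}
print(best_pair(pair_counts))  # ('u', 'g')
```

On this corpus the pair (u, g) wins outright with count 3 + 2 = 5, so the tie-break is not exercised; it only matters when two pairs share the maximum count.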