Subword Tokens

Byte-pair encoding (BPE) starts from character symbols and builds larger tokens by merging adjacent pairs. This first lesson shows only the toy corpus and its integer frequencies; no pairs are counted or merged yet.


At each step, BPE counts the adjacent symbol pairs in the corpus and merges the most frequent one into a single new symbol. The input here is a tiny corpus in which every word carries an integer frequency.

\text{start from characters}

The shown corpus

The word hug has frequency 3 and symbols h u g. The word pug has frequency 2 and symbols p u g.

\text{hug: }h\ u\ g\text{ freq }3,\quad \text{pug: }p\ u\ g\text{ freq }2
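
As a minimal sketch, the two corpus rows can be held in a plain Python dictionary. The name `corpus` and the (symbols, frequency) layout are illustrative assumptions, not something the lesson prescribes.

```python
# Toy corpus from the lesson: word -> (character symbols, integer frequency).
# The variable name and tuple layout are assumptions made for illustration.
corpus = {
    "hug": (("h", "u", "g"), 3),
    "pug": (("p", "u", "g"), 2),
}

# Print each row in the same form the lesson displays it.
for word, (symbols, freq) in corpus.items():
    print(f"{word}: {' '.join(symbols)} freq {freq}")
```

Storing the symbols as a tuple, rather than the raw string, keeps the structure valid once later merges produce symbols longer than one character.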

Summary

The starting state is the displayed corpus itself: two words, each split into character symbols and paired with an integer frequency. The next step counts adjacent pairs from these exact rows.

\text{visible corpus}\rightarrow\text{pair counts}
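
The move from the visible corpus to pair counts could be sketched as follows. This is a hedged preview rather than the lesson's own implementation: the helper `count_pairs` and the dictionary layout are hypothetical, and each pair is assumed to be weighted by the frequency of the word it appears in.

```python
from collections import Counter

# Toy corpus from the lesson: word -> (character symbols, integer frequency).
corpus = {
    "hug": (("h", "u", "g"), 3),
    "pug": (("p", "u", "g"), 2),
}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighting each by its word's frequency."""
    counts = Counter()
    for symbols, freq in corpus.values():
        # zip pairs every symbol with its right-hand neighbour.
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

pair_counts = count_pairs(corpus)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Under these assumptions the pair (u, g) occurs in both words, so its count is 3 + 2 = 5, while (h, u) has count 3 and (p, u) has count 2.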