Byte-pair encoding (BPE) starts from character symbols and builds larger tokens by repeatedly merging adjacent pairs. This first lesson shows only the toy corpus and its integer word frequencies.
Subword tokens
BPE starts from character symbols and repeatedly merges the most frequent adjacent symbol pair. The source here is a tiny corpus with integer frequencies.
The shown corpus
The word hug has frequency 3 and symbols h u g. The word pug has frequency 2 and symbols p u g.
hug: h u g, freq 3
pug: p u g, freq 2
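The two corpus rows can be written down as a small data structure. The following is a minimal sketch, not the lesson's own code: the names `word_freqs` and `corpus` are hypothetical, and storing each word as a tuple of symbols is an assumed representation chosen for illustration.

```python
# Toy corpus from the lesson: word -> integer frequency.
word_freqs = {"hug": 3, "pug": 2}

# Split each word into its character symbols, the BPE starting point.
# Assumed representation: tuple of symbols -> frequency.
corpus = {tuple(word): freq for word, freq in word_freqs.items()}

# Print one row per word: the symbols separated by spaces, then the frequency.
for symbols, freq in corpus.items():
    print(" ".join(symbols), "freq", freq)
```

Tuples are used here because they are hashable, so a symbol sequence can serve directly as a dictionary key.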
Summary
The starting state is the displayed corpus itself: each word split into character symbols, together with its frequency. The next step counts adjacent symbol pairs from these exact rows.
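As a preview of that next step, here is a hedged sketch of how adjacent pairs could be counted from the two rows. The helper name `count_pairs` and the tuple-keyed `corpus` dictionary are assumptions made for this example. Each pair is weighted by the frequency of the word it occurs in.

```python
from collections import Counter

# The two corpus rows: symbols of each word -> word frequency.
corpus = {("h", "u", "g"): 3, ("p", "u", "g"): 2}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pair_counts = Counter()
    for symbols, freq in corpus.items():
        # zip pairs each symbol with its right-hand neighbour.
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts

pair_counts = count_pairs(corpus)
```

With this corpus the pair u g occurs in both words, so its count is 3 + 2 = 5, while h u counts 3 and p u counts 2.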