CSV and Text
Count Token Frequencies
Split a sentence into words with .split() and count how many times each
word appears. The replay shows counts seeding new words at 1 and
incrementing on repeats — the word 'the' appears three times and 'cat'
twice.
By hand
The Pythonic way
Counter(text.split()) tokenises and tallies in a single expression. The
result is a Counter (a dict subclass) holding the same counts as the loop.
naive.py
text = 'the cat sat on the mat the cat'
words = text.split()
counts = {}
for w in words:
    if w in counts:
        counts[w] = counts[w] + 1
    else:
        counts[w] = 1
print('RESULT:', {k: counts[k] for k in sorted(counts)})
library.py
from collections import Counter
text = 'the cat sat on the mat the cat'
counts = Counter(text.split())
print('RESULT:', {k: counts[k] for k in sorted(counts)})
RESULT: {'cat': 2, 'mat': 1, 'on': 1, 'sat': 1, 'the': 3}
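Because Counter is a dict subclass, the two results can be compared directly. The check below is a minimal sketch rather than part of the lesson's replay: the names by_hand and by_library are illustrative, and dict.get is used as a compact stand-in for the if/else seeding.

```python
from collections import Counter

text = 'the cat sat on the mat the cat'

# Hand-rolled tally: .get(w, 0) seeds unseen words at 0 before adding 1.
by_hand = {}
for w in text.split():
    by_hand[w] = by_hand.get(w, 0) + 1

by_library = Counter(text.split())

# Convert to a plain dict so the comparison is dict-to-dict.
same = dict(by_library) == by_hand
# Unlike a plain dict, a Counter returns 0 for a missing key instead of raising KeyError.
missing = by_library['dog']
print('SAME:', same, 'MISSING:', missing)
```

One practical difference is visible in the last lookup: a Counter reports 0 for a word it never saw, whereas the hand-built dict would raise KeyError.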
Implementation notes
- The mechanism is identical to python-data-basics/frequency-count (ch05). The distinction is the input: ch05 counts a pre-split label list; this lesson first tokenises a raw text string with .split(). Real NLP pipelines add lowercasing and punctuation stripping before counting.
- .split() with no argument splits on any whitespace run and ignores leading/trailing whitespace, which makes it equivalent to .strip().split().
- Counter.most_common(n) returns the n highest-frequency tokens, useful for finding stop-words or topic keywords.
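The notes above can be sketched together in a few lines. This is an illustration under stated assumptions, not the lesson's own code: the sample sentence and the normalise helper are hypothetical, and trimming string.punctuation from token edges is only one simple reading of "punctuation stripping".

```python
import string
from collections import Counter

# .split() with no argument collapses whitespace runs and ignores the ends.
raw = '  the cat\t sat \n on the mat  '
assert raw.split() == raw.strip().split()


def normalise(text):
    """Hypothetical helper: lowercase, then trim punctuation from token edges."""
    tokens = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in tokens if w]  # drop tokens that were pure punctuation


text = 'The cat sat on the mat. The cat!'
counts = Counter(normalise(text))

# most_common(2) returns (token, count) pairs, highest count first.
top_two = counts.most_common(2)
print('TOP 2:', top_two)
```

Without the normalisation step, 'The' and 'the' or 'cat' and 'cat!' would be tallied as different tokens, which is why real pipelines clean the text before counting.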