Duplicates
Find Exact Duplicates
Flag each element that is a repeat of an earlier occurrence. By hand, keep
a seen dict and mark each element True if it is already present. With
pandas, DataFrame.duplicated() returns a boolean Series where True
marks every row after its first occurrence.
By hand
With pandas
df.duplicated() scans rows and marks every row after the first occurrence
of each value as True. The result is a boolean Series (dtype: bool).
naive.py
names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
seen = {}
flags = []
for name in names:
    flags.append(name in seen)
    seen[name] = True
print('RESULT:', flags)
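The loop above flags every repeat after the first occurrence. As a rough sketch of how the same by-hand approach could be varied, the block below flags every copy of a repeated value (including the first), and separately flags all copies except the last. The variable names flag_all and flag_but_last are illustrative and not part of the original listing.

```python
from collections import Counter

names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']

# Flag every copy, including the first: count occurrences up front,
# then mark any name that appears more than once.
counts = Counter(names)
flag_all = [counts[name] > 1 for name in names]

# Flag all but the last occurrence: walk the list in reverse so the
# last occurrence is seen first, then restore the original order.
seen = set()
flag_but_last = []
for name in reversed(names):
    flag_but_last.append(name in seen)
    seen.add(name)
flag_but_last.reverse()

print('all copies:', flag_all)
print('all but last:', flag_but_last)
```

A set is used for seen here because only membership matters; the dict in naive.py works the same way.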
library.py
import pandas as pd
from dalib.display import set_display
set_display()
names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
df = pd.DataFrame({'name': names})
flags = df.duplicated()
result = flags.tolist()
print('index:', flags.index.tolist())
print('dtype:', flags.dtype)
print('values:', flags.tolist())
print('RESULT:', result)
index: [0, 1, 2, 3, 4, 5]
dtype: bool
values: [False, False, True, False, True, False]
RESULT: [False, False, True, False, True, False]
Implementation notes
- duplicated() uses keep='first' by default: the first occurrence is False (not a duplicate) and all later ones are True. Use keep='last' to leave the last occurrence unflagged instead, or keep=False to flag every copy including the first.
- Pass subset=['col1', 'col2'] to check duplicates only on specific columns rather than all columns.
- Cross-reference: drop-duplicates-keep-first (this chapter) to remove the flagged rows in one step rather than just identifying them.
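The keep and subset options described above can be compared side by side. The following is a minimal sketch, assuming a second column named city that is hypothetical and added only so that subset has something to narrow down; it is not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave'],
    # Hypothetical extra column, used only to illustrate subset=
    'city': ['Oslo', 'Rome', 'Lima', 'Oslo', 'Rome', 'Lima'],
})

# Compare on the name column only, default keep='first'
first = df.duplicated(subset=['name']).tolist()

# The last occurrence is the one left unflagged
last = df.duplicated(subset=['name'], keep='last').tolist()

# Every copy of a repeated name is flagged, including the first
every = df.duplicated(subset=['name'], keep=False).tolist()

# No subset: a row counts as a duplicate only if all columns match
all_cols = df.duplicated().tolist()

print('first:   ', first)
print('last:    ', last)
print('every:   ', every)
print('all_cols:', all_cols)
```

With all columns compared, only the second ('Bob', 'Rome') row is flagged, because the two Alice rows differ in city.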