Find Exact Duplicates - Data Cleaning Step by Step

Flag each element that repeats an earlier occurrence. By hand, keep a seen dict and, for each element, record True if it is already in the dict and False otherwise, then add it to the dict. With pandas, DataFrame.duplicated() returns a boolean Series in which True marks every row that repeats an earlier row.

By hand

With pandas

df.duplicated() scans the rows in order and marks every row after the first occurrence of each distinct row as True. The result is a boolean Series (dtype: bool) that shares the DataFrame's index.

naive.py

names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
seen = {}
flags = []
for name in names:
    flags.append(name in seen)
    seen[name] = True
print('RESULT:', flags)

library.py

import pandas as pd
from dalib.display import set_display
set_display()

names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
df = pd.DataFrame({'name': names})
flags = df.duplicated()
result = flags.tolist()
print('index:', flags.index.tolist())
print('dtype:', flags.dtype)
print('values:', flags.tolist())
print('RESULT:', result)

index: [0, 1, 2, 3, 4, 5]
dtype: bool
values: [False, False, True, False, True, False]
RESULT: [False, False, True, False, True, False]
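The by-hand approach also extends from single values to whole rows. The sketch below is illustrative only: flag_repeats is a hypothetical helper name, and the two-column data is made up for the example. It treats each row as a tuple, tracks the tuples already seen in a set, and compares the flags with what df.duplicated() returns for the same frame.

```python
import pandas as pd

def flag_repeats(rows):
    """Return True for each row that equals an earlier row (hypothetical helper)."""
    seen = set()
    flags = []
    for row in rows:
        flags.append(row in seen)  # True only if an identical row came before
        seen.add(row)
    return flags

# Made-up sample data: only the third row repeats an earlier row exactly.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'age': [30, 25, 30, 26],
})

# name=None makes itertuples yield plain tuples, which are hashable.
by_hand = flag_repeats(df.itertuples(index=False, name=None))
with_pandas = df.duplicated().tolist()

print('by hand:', by_hand)
print('pandas: ', with_pandas)
```

On this data both lists should agree. The two Bob rows differ in age, so neither is flagged: a row counts as a duplicate only when every column matches.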

Implementation notes

duplicated() uses keep='first' by default: the first occurrence is False (not a duplicate) and all later ones are True. Use keep='last' to leave the last occurrence unflagged and mark the earlier copies True instead, or keep=False to flag every copy, including the first.
Pass subset=['col1', 'col2'] to check duplicates only on specific columns rather than all columns.
Cross-reference: see drop-duplicates-keep-first (this chapter) to remove the flagged rows in one step rather than just identifying them.
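
The keep and subset options from the notes above can be compared side by side. This is a minimal sketch on made-up data: the city column and its values are assumptions added for illustration and are not part of the original example.

```python
import pandas as pd

# Made-up data: rows 0 and 2 are identical; the two Bob rows differ in city.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave'],
    'city': ['Oslo', 'Rome', 'Oslo', 'Oslo', 'Lima', 'Rome'],
})

# Default keep='first': flag each full-row repeat after its first occurrence.
first_flags = df.duplicated().tolist()

# keep='last': the last copy stays False and the earlier copies are flagged.
last_flags = df.duplicated(keep='last').tolist()

# keep=False: flag every member of a duplicate group, including the first.
all_flags = df.duplicated(keep=False).tolist()

# subset=['name']: compare on the name column only, ignoring city.
name_flags = df.duplicated(subset=['name']).tolist()

print('keep=first:', first_flags)
print('keep=last: ', last_flags)
print('keep=False:', all_flags)
print('subset:    ', name_flags)
```

With subset=['name'], the second Bob row is flagged even though its city differs. Comparing on all columns treats it as a distinct row.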