Duplicates
Drop Duplicates — Keep First
Remove duplicate entries, keeping only the first occurrence of each value.
By hand, walk the list and collect each name into result only when it has
not been seen before. With pandas, DataFrame.drop_duplicates() performs
the same operation in one call.
With pandas
df.drop_duplicates() returns a new DataFrame with duplicate rows removed,
keeping the first occurrence. The snapshot shows the shape before and after
so the number of dropped rows is immediately visible.
naive.py
names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
seen = set()    # names already encountered
result = []     # first occurrences, in original order
for name in names:
    if name not in seen:
        result.append(name)
        seen.add(name)
print('RESULT:', result)
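As an aside, the same keep-first behaviour can be sketched more compactly with a dict, because Python dicts preserve insertion order (guaranteed since Python 3.7). This is an alternative formulation rather than the recipe's own listing, and the variable name deduped is chosen here only for illustration.

```python
names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']

# dict.fromkeys builds one key per distinct value, in first-seen order;
# converting the keys back to a list gives the deduplicated names.
deduped = list(dict.fromkeys(names))
print('RESULT:', deduped)
```

The explicit loop remains the better choice when the "seen" test is more involved than plain equality, for example when comparing names case-insensitively.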
library.py
import pandas as pd
from dalib.display import set_display
set_display()
names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
df = pd.DataFrame({'name': names})
clean = df.drop_duplicates()
result = clean['name'].tolist()
print('columns:', df.columns.tolist())
print('shape before:', df.shape)
print('shape after:', clean.shape)
print('RESULT:', result)
columns: ['name']
shape before: (6, 1)
shape after: (4, 1)
RESULT: ['Alice', 'Bob', 'Carol', 'Dave']
Implementation notes
- drop_duplicates() uses keep='first' by default. Pass keep='last' to retain the last occurrence, or keep=False to drop every row that has a duplicate, so that only rows occurring exactly once remain.
- The returned DataFrame preserves the original index labels. Use .reset_index(drop=True) if a clean 0-based index is needed.
- Pass subset=['col'] to deduplicate on a specific column while keeping all other columns.
- Cross-reference: find-exact-duplicates (this chapter) to inspect which rows would be dropped before committing to the removal.
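The options in these notes can be compared side by side with a minimal sketch. It reuses the name list from the example and, as an assumption for illustration only, adds a hypothetical visit column so that subset has something to act on. The site's display helper is left out so the snippet depends on pandas alone.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave'],
    'visit': [1, 2, 3, 4, 5, 6],  # hypothetical column, not in the original example
})
names_only = df[['name']]

# keep='first' is the default; 'last' and False change which rows survive.
first = names_only.drop_duplicates()
last = names_only.drop_duplicates(keep='last')
unique_only = names_only.drop_duplicates(keep=False)

print('first:', first['name'].tolist(), 'index:', first.index.tolist())
print('last:', last['name'].tolist(), 'index:', last.index.tolist())
print('keep=False:', unique_only['name'].tolist())

# The surviving rows keep their original labels; reset for a 0-based index.
renumbered = first.reset_index(drop=True)
print('renumbered index:', renumbered.index.tolist())

# Every (name, visit) pair is distinct, so the full-row comparison drops nothing.
print('full-row dedupe shape:', df.drop_duplicates().shape)

# subset compares only the listed columns but returns all columns.
by_name = df.drop_duplicates(subset=['name'])
print('subset dedupe shape:', by_name.shape)
print('visits kept:', by_name['visit'].tolist())
```

With keep='last' the surviving rows still appear in their original order, so the names come out as Alice, Carol, Bob, Dave rather than in first-seen order.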