Drop Duplicates — Keep First - Data Cleaning Step by Step

Remove duplicate entries, keeping only the first occurrence of each value. By hand, walk the list and collect each name into result only when it has not been seen before. With pandas, DataFrame.drop_duplicates() performs the same operation in one call.

By hand

With pandas

df.drop_duplicates() returns a new DataFrame with duplicate rows removed, keeping the first occurrence. The snapshot shows the shape before and after so the number of dropped rows is immediately visible.

naive.py

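names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
seen = set()   # names already encountered
result = []    # first occurrences, in original order
for name in names:
    # Keep a name only the first time it appears.
    if name not in seen:
        result.append(name)
        seen.add(name)
print('RESULT:', result)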

library.py

import pandas as pd
from dalib.display import set_display
set_display()

names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']
df = pd.DataFrame({'name': names})
clean = df.drop_duplicates()
result = clean['name'].tolist()
print('columns:', df.columns.tolist())
print('shape before:', df.shape)
print('shape after:', clean.shape)
print('RESULT:', result)

columns: ['name']
shape before: (6, 1)
shape after: (4, 1)
RESULT: ['Alice', 'Bob', 'Carol', 'Dave']
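
As a quick check that the two approaches agree, the by-hand loop can be wrapped in a function and compared with the pandas result. This is only a sketch: the function name drop_duplicates_keep_first is hypothetical and chosen for illustration, and the display helper from library.py is left out because it does not affect the result.

```python
import pandas as pd


def drop_duplicates_keep_first(values):
    """Return values in original order, keeping only the first occurrence of each.

    Hypothetical helper name, used here for illustration only.
    """
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


names = ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave']

by_hand = drop_duplicates_keep_first(names)
with_pandas = pd.DataFrame({'name': names}).drop_duplicates()['name'].tolist()

print('by hand:    ', by_hand)
print('with pandas:', with_pandas)
print('same result:', by_hand == with_pandas)
```

Both lists should come out in the same order, because both approaches keep the first occurrence and preserve the original ordering of the surviving values.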

Implementation notes

drop_duplicates() uses keep='first' by default. Pass keep='last' to retain the last occurrence instead, or keep=False to drop every row that has a duplicate, so that no copy of a duplicated row survives.
The returned DataFrame preserves the original index labels. Use .reset_index(drop=True) if a clean 0-based index is needed.
Pass subset=['col'] to compare rows on that column only. The surviving rows still carry all of their other columns.
Cross-reference: see find-exact-duplicates (this chapter) to inspect which rows would be dropped before committing to the removal.
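
The notes above can be tried out on a small two-column frame. This is a minimal sketch under stated assumptions: the 'city' column and its values are hypothetical sample data added so that subset= has something to ignore, and DataFrame.duplicated() is used here as one way to preview the rows that would be removed.

```python
import pandas as pd

# Hypothetical sample data: 'city' is an illustrative extra column,
# not part of the original example.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Dave'],
    'city': ['Oslo', 'Rome', 'Paris', 'Lima', 'Rome', 'Kyiv'],
})

# Default: rows are compared on all columns, so only the repeated
# ('Bob', 'Rome') row counts as a duplicate.
full = df.drop_duplicates()

# Compare on 'name' only; surviving rows keep every column.
first = df.drop_duplicates(subset=['name'])               # keep='first' is the default
last = df.drop_duplicates(subset=['name'], keep='last')   # keep the last occurrence
none = df.drop_duplicates(subset=['name'], keep=False)    # drop all duplicated names

# The result keeps the original index labels; reset_index(drop=True)
# renumbers the rows from 0.
renumbered = first.reset_index(drop=True)

# duplicated() flags the rows that drop_duplicates() would remove.
would_drop = df[df.duplicated(subset=['name'])]

print('full index:      ', full.index.tolist())
print('first index:     ', first.index.tolist())
print('last index:      ', last.index.tolist())
print('keep=False index:', none.index.tolist())
print('renumbered index:', renumbered.index.tolist())
print('would drop index:', would_drop.index.tolist())
```

Printing the index rather than the names makes it easy to see which physical rows survive under each setting, and that the labels are only renumbered once reset_index(drop=True) is applied.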