Duplicates
Dedup by Key — Keep Latest
Keep the most recent row per business key. Given parallel lists of ids,
dates, and values, where the same id can appear more than once, return the
record with the largest date for each id. By hand, scan the rows and
overwrite a dict entry whenever a newer date is found. With pandas, sort by
date, then call drop_duplicates('id', keep='last') so the row kept for each
id is the most recent.
By hand
Walk the three lists in parallel and keep a dict keyed by id that stores the
(date, value) pair seen so far. Overwrite an entry only when the current
row's date is strictly greater than the stored one, then sort the surviving
(id, value) pairs by id so the output order is deterministic.
With pandas
Sort the full DataFrame by date ascending, then call
drop_duplicates('id', keep='last'). After sorting, the last occurrence of
each id is the one with the latest date, so keep='last' retains exactly
the right row. A final sort_values('id') makes the output order
deterministic.
naive.py
ids = ['A', 'B', 'A', 'B', 'C']
dates = ['2024-01-10', '2024-01-05', '2024-01-15', '2024-01-20', '2024-01-08']
values = [100, 200, 110, 220, 300]
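# Map each id to the (date, value) pair with the latest date seen so far.
latest = {}
for i in range(len(ids)):
    rid, d, v = ids[i], dates[i], values[i]
    # ISO date strings compare chronologically, so a plain > is enough.
    if rid not in latest or d > latest[rid][0]:
        latest[rid] = (d, v)
# Sort by id for a deterministic output order.
result = sorted((k, v[1]) for k, v in latest.items())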
print('RESULT:', result)
library.py
import pandas as pd
from dalib.display import set_display
set_display()
ids = ['A', 'B', 'A', 'B', 'C']
dates = ['2024-01-10', '2024-01-05', '2024-01-15', '2024-01-20', '2024-01-08']
values = [100, 200, 110, 220, 300]
df = pd.DataFrame({'id': ids, 'date': dates, 'value': values})
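# Sort ascending by date so the newest row for each id comes last, then keep that row.
clean = df.sort_values('date').drop_duplicates('id', keep='last')
# Re-sort by id so the output order is deterministic.
clean = clean.sort_values('id')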
result = list(zip(clean['id'].tolist(), clean['value'].tolist()))
print('columns:', clean.columns.tolist())
print('shape before:', df.shape)
print('shape after:', clean.shape)
print('RESULT:', result)
columns: ['id', 'date', 'value']
shape before: (5, 3)
shape after: (3, 3)
RESULT: [('A', 110), ('B', 220), ('C', 300)]
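One case the sample data does not exercise is two rows for the same id sharing the same date. The sketch below is an illustration, not part of the original example: it uses made-up rows and passes kind='stable' to sort_values so that tied rows keep their original relative order, which means keep='last' retains the tied row that appeared later in the input. Whether that is the right tie-break depends on the data.

```python
import pandas as pd

# Hypothetical data: id 'A' has two rows with the same date.
df = pd.DataFrame({
    'id':    ['A', 'A', 'B'],
    'date':  ['2024-01-15', '2024-01-15', '2024-01-05'],
    'value': [110, 111, 200],
})

clean = (
    df.sort_values('date', kind='stable')   # ties keep their input order
      .drop_duplicates('id', keep='last')   # last row per id survives
      .sort_values('id')                    # deterministic output order
)
result = list(zip(clean['id'].tolist(), clean['value'].tolist()))
print('RESULT:', result)
```

Note that the by-hand version uses a strict > comparison, so on a tie it keeps the row it saw first. Changing the comparison to >= would make it agree with the stable-sort variant shown here.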
Implementation notes
- ISO date strings (YYYY-MM-DD) sort lexicographically in the same order as chronologically, so string comparison is valid for date ordering without parsing.
- drop_duplicates(subset, keep='last') relies on row order; always sort_values by the date column first so "last" really means "latest".
- The two-step pattern (sort_values → drop_duplicates(keep='last')) is the standard pandas idiom for latest-per-key deduplication.
- Cross-reference: drop-duplicates-keep-first (this chapter) for the simpler case where any occurrence is equally valid and you just need one.
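The first note, on ISO date strings, can be checked with a small standard-library sketch. The ISO list is the sample data from this page. The MM/DD/YYYY list is a hypothetical counter-example, added only to show that string comparison stops being safe once the format is not year-first.

```python
from datetime import date, datetime

# ISO 8601 dates: string order and calendar order agree.
iso = ['2024-01-10', '2024-01-05', '2024-01-15', '2024-01-20', '2024-01-08']
iso_matches = sorted(iso) == sorted(iso, key=date.fromisoformat)

# Hypothetical MM/DD/YYYY dates: string order puts January 2024 before
# December 2023, which is chronologically wrong.
us = ['01/10/2024', '12/05/2023']
us_by_string = sorted(us)
us_by_date = sorted(us, key=lambda s: datetime.strptime(s, '%m/%d/%Y'))

print('ISO string order is chronological:', iso_matches)
print('MM/DD/YYYY sorted as strings:', us_by_string)
print('MM/DD/YYYY sorted as dates:  ', us_by_date)
```

If the date column is not in ISO format, one option in pandas is to convert it with pd.to_datetime before calling sort_values, so the sort compares real timestamps instead of strings.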