Dedup by Key — Keep Latest - Data Cleaning Step by Step

Keep the most recent row per business key. Given parallel lists of ids, dates, and values, where the same id can appear more than once, return the record with the latest date for each id. By hand, scan the rows and overwrite a dict entry whenever a newer date is found. With pandas, sort by date and then call drop_duplicates('id', keep='last'), so the row that survives for each id is the most recent one.

By hand

With pandas

Sort the full DataFrame by date ascending, then call drop_duplicates('id', keep='last'). After sorting, the last occurrence of each id is the one with the latest date, so keep='last' retains exactly the right row. A final sort_values('id') makes the output order deterministic.

naive.py

ids    = ['A', 'B', 'A', 'B', 'C']
dates  = ['2024-01-10', '2024-01-05', '2024-01-15', '2024-01-20', '2024-01-08']
values = [100, 200, 110, 220, 300]
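latest = {}  # id -> (date, value) for the newest row seen so far
for rid, d, v in zip(ids, dates, values):
    # ISO date strings compare correctly as plain text
    if rid not in latest or d > latest[rid][0]:
        latest[rid] = (d, v)
# Drop the dates and sort by id for a deterministic result
result = sorted((k, v[1]) for k, v in latest.items())
print('RESULT:', result)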
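
The same scan can be wrapped in a reusable function. This is a minimal sketch rather than part of the original listing: the name latest_per_key is hypothetical, and it assumes the dates are zero-padded ISO strings so that plain string comparison orders them correctly. When two rows share an id and a date, it keeps the first one seen, matching the strict > in naive.py.

```python
def latest_per_key(ids, dates, values):
    """Return sorted (id, value) pairs, keeping the value with the largest date per id.

    Hypothetical helper; assumes dates are zero-padded ISO strings (YYYY-MM-DD).
    """
    latest = {}  # id -> (date, value) of the newest row seen so far
    for rid, d, v in zip(ids, dates, values):
        # Strict '>' means the first row wins when two rows share a date.
        if rid not in latest or d > latest[rid][0]:
            latest[rid] = (d, v)
    return sorted((rid, v) for rid, (_, v) in latest.items())


ids    = ['A', 'B', 'A', 'B', 'C']
dates  = ['2024-01-10', '2024-01-05', '2024-01-15', '2024-01-20', '2024-01-08']
values = [100, 200, 110, 220, 300]
result = latest_per_key(ids, dates, values)
print('RESULT:', result)
```

Returning a sorted list of pairs keeps the output comparable with the RESULT line printed by the pandas version.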

library.py

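import pandas as pd
# dalib is not part of pandas; set_display() presumably only adjusts how output is printed.
# If dalib is not installed, these two lines can likely be removed without changing the result.
from dalib.display import set_display
set_display()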

ids    = ['A', 'B', 'A', 'B', 'C']
dates  = ['2024-01-10', '2024-01-05', '2024-01-15', '2024-01-20', '2024-01-08']
values = [100, 200, 110, 220, 300]
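df = pd.DataFrame({'id': ids, 'date': dates, 'value': values})
# Sort ascending by date so the last row per id is the latest, then keep that row
clean = df.sort_values('date').drop_duplicates('id', keep='last')
# Re-sort by id so the output order is deterministic
clean = clean.sort_values('id')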
result = list(zip(clean['id'].tolist(), clean['value'].tolist()))
print('columns:', clean.columns.tolist())
print('shape before:', df.shape)
print('shape after:', clean.shape)
print('RESULT:', result)

columns: ['id', 'date', 'value']
shape before: (5, 3)
shape after: (3, 3)
RESULT: [('A', 110), ('B', 220), ('C', 300)]
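
The sample data happens to list each id's rows in date order already, so the sort looks optional here. The sketch below uses made-up rows (not from the listing above) in which the newest 'A' record comes first, to illustrate what keep='last' returns with and without the preceding sort_values call.

```python
import pandas as pd

# Made-up rows where the newest 'A' record appears first in the input.
df = pd.DataFrame({
    'id':    ['A', 'A', 'B'],
    'date':  ['2024-01-15', '2024-01-10', '2024-01-05'],
    'value': [110, 100, 200],
})

# Without sorting, keep='last' keeps the last row in input order (the older 'A').
unsorted_pick = df.drop_duplicates('id', keep='last')

# Sorting by date first makes "last" coincide with "latest".
sorted_pick = df.sort_values('date').drop_duplicates('id', keep='last')

wrong = dict(zip(unsorted_pick['id'], unsorted_pick['value'].tolist()))
right = dict(zip(sorted_pick['id'], sorted_pick['value'].tolist()))
print('without sort:', wrong)
print('with sort:', right)
```

Without the sort, 'A' maps to 100 (the older row); with it, 'A' maps to 110.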

Implementation notes

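Zero-padded ISO date strings (YYYY-MM-DD) sort lexicographically in chronological order, so they can be compared as strings without parsing. Other formats, such as DD/MM/YYYY or dates without zero padding, do not have this property and must be parsed first.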
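drop_duplicates(subset, keep='last') relies on row order; always call sort_values on the date column first so "last" really means "latest". If two rows share both id and date, which one survives depends on the sort; pass kind='stable' to sort_values to preserve their input order.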
The two-step pattern (sort_values → drop_duplicates(keep='last')) is the standard pandas idiom for latest-per-key deduplication.
Cross-reference: drop-duplicates-keep-first (this chapter) for the simpler case where any occurrence is equally valid and you just need one.
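
The notes on date format and sort stability can be illustrated with a short sketch. The rows below are made up for illustration and use day/month/year strings, which do not sort chronologically as text. The sketch parses them with pd.to_datetime before sorting and passes kind='stable' so that rows with equal dates keep their input order. Treat it as one possible approach under those assumptions, not as the chapter's prescribed method.

```python
import pandas as pd

# Made-up rows with day/month/year strings, which do NOT sort chronologically as text.
df = pd.DataFrame({
    'id':    ['A', 'A', 'B'],
    'date':  ['15/01/2024', '02/03/2024', '20/01/2024'],
    'value': [100, 110, 220],
})

# As text, '15/01/2024' > '02/03/2024', so the January row is wrongly treated as latest.
by_text = df.sort_values('date').drop_duplicates('id', keep='last')

# Parse to real timestamps first; kind='stable' keeps input order among equal dates.
parsed = df.assign(date=pd.to_datetime(df['date'], format='%d/%m/%Y'))
by_time = parsed.sort_values('date', kind='stable').drop_duplicates('id', keep='last')

text_pick = dict(zip(by_text['id'], by_text['value'].tolist()))
time_pick = dict(zip(by_time['id'], by_time['value'].tolist()))
print('string sort:', text_pick)
print('parsed sort:', time_pick)
```

With string sorting, 'A' resolves to the January value 100; after parsing, it resolves to the March value 110.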