Joining Records
Inner Join by Key
Join two tables by a shared key, keeping only rows where the key exists in
both. A first pass builds an index dict from the smaller table; a second
pass walks the larger table and appends matched pairs, silently skipping any
key that is absent from the index. Order 3 (key 'x') has no matching
customer and is dropped.
By hand
The Pythonic way
Build the index with a dict comprehension, then produce result with a list
comprehension that folds the lookup and the if k in index filter into one
expression.
naive.py
order_ids = [1, 2, 3, 4, 5]
order_ckeys = ['a', 'b', 'x', 'c', 'a']
cust_keys = ['a', 'b', 'c', 'd']
cust_names = ['alice', 'bob', 'carol', 'dave']
index = {}
for k, name in zip(cust_keys, cust_names):
index[k] = name
result = []
for oid, k in zip(order_ids, order_ckeys):
if k in index:
result.append((oid, index[k]))
print('RESULT:', result)
library.py
order_ids = [1, 2, 3, 4, 5]
order_ckeys = ['a', 'b', 'x', 'c', 'a']
cust_keys = ['a', 'b', 'c', 'd']
cust_names = ['alice', 'bob', 'carol', 'dave']
index = {k: name for k, name in zip(cust_keys, cust_names)}
result = [(oid, index[k]) for oid, k in zip(order_ids, order_ckeys) if k in index]
print('RESULT:', result)
RESULT: [(1, 'alice'), (2, 'bob'), (4, 'carol'), (5, 'alice')]
Implementation notes
- Building the index first makes the join O(n + m) overall: O(m) to build the index, then O(n) for the order pass. A nested-loop approach would be O(n × m).
- This is the same mechanism as
pd.merge(orders, customers, on='key', how='inner')in thepython-pandastrack — mechanism here, API there. - An outer join variant would use
index.get(k, None)and keep all orders, substitutingNonefor unmatched customers. - Key
'x'in order 3 has no entry in the customer table; theif k in indexguard drops it silently, which is the defining behaviour of an inner join.