Decision Trees
Gini Impurity
Measure node impurity: Gini = 1 − Σpₖ² where pₖ = count_k/n per class.
A counting loop tallies labels into a dict; a second loop sums squared
proportions. Library: NumPy 1 − np.sum((counts/n)²) to confirm.
RESULT: Gini impurity (rounded).
By hand
With NumPy
np.unique(labels, return_counts=True) returns sorted unique labels and
their counts in one call. 1 - np.sum((counts/n)**2) applies the formula.
naive.py
labels = ['A', 'A', 'B', 'A', 'B', 'B']
n = len(labels)
counts = {}
for lbl in labels:
counts[lbl] = counts.get(lbl, 0) + 1
sq_sum = 0.0
for lbl in counts:
p = counts[lbl] / n
sq_sum = sq_sum + p * p
gini = round(1 - sq_sum, 4)
print('RESULT:', gini)
library.py
import numpy as np
from dalib.display import set_display
set_display()
labels = ['A', 'A', 'B', 'A', 'B', 'B']
n = len(labels)
_, counts = np.unique(labels, return_counts=True)
gini = round(float(1 - np.sum((counts / n) ** 2)), 4)
print('counts:', counts.tolist())
print('RESULT:', gini)
counts: [3, 3]
RESULT: 0.5
Implementation notes
- Gini=0 means a pure node (one class only); maximum is 1−1/k for k equally represented classes. For binary: max=0.5 at a 50/50 split — this example.
- Decision trees choose the split that most reduces Gini from parent to the weighted average of child Ginis (Gini gain).
- Gini uses no logarithm, making it cheaper to compute than entropy.
Cross-reference:
entropy-information-gain(this chapter) for the log-based alternative; both drive the same split-selection logic. - Cross-reference:
threshold-probabilities(ch04) for how class counts relate to predicted probabilities.