Measure node impurity: Gini = 1 − Σpₖ² where pₖ = count_k/n per class. A counting loop tallies labels into a dict; a second loop sums squared proportions. Library: NumPy 1 − np.sum((counts/n)²) to confirm. RESULT: Gini impurity (rounded).

By hand

With NumPy

np.unique(labels, return_counts=True) returns sorted unique labels and their counts in one call. 1 - np.sum((counts/n)**2) applies the formula.

naive.py
labels = ['A', 'A', 'B', 'A', 'B', 'B']
n = len(labels)
counts = {}
for lbl in labels:
    counts[lbl] = counts.get(lbl, 0) + 1
sq_sum = 0.0
for lbl in counts:
    p = counts[lbl] / n
    sq_sum = sq_sum + p * p
gini = round(1 - sq_sum, 4)
print('RESULT:', gini)
library.py
import numpy as np
from dalib.display import set_display
set_display()

labels = ['A', 'A', 'B', 'A', 'B', 'B']
n = len(labels)
_, counts = np.unique(labels, return_counts=True)
gini = round(float(1 - np.sum((counts / n) ** 2)), 4)
print('counts:', counts.tolist())
print('RESULT:', gini)
counts: [3, 3]
RESULT: 0.5

Implementation notes

  • Gini=0 means a pure node (one class only); maximum is 1−1/k for k equally represented classes. For binary: max=0.5 at a 50/50 split — this example.
  • Decision trees choose the split that most reduces Gini from parent to the weighted average of child Ginis (Gini gain).
  • Gini uses no logarithm, making it cheaper to compute than entropy. Cross-reference: entropy-information-gain (this chapter) for the log-based alternative; both drive the same split-selection logic.
  • Cross-reference: threshold-probabilities (ch04) for how class counts relate to predicted probabilities.