Gini Impurity - Machine Learning Basics

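Gini impurity measures how mixed the class labels in a node are: Gini = 1 − Σₖ pₖ², where pₖ = countₖ/n is the proportion of class k among the n labels. The by-hand version tallies the labels into a dict with one loop, then sums the squared proportions with a second loop. The NumPy version confirms the result with 1 - np.sum((counts/n)**2). RESULT: the Gini impurity, rounded to 4 decimal places.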

By hand

With NumPy

np.unique(labels, return_counts=True) returns sorted unique labels and their counts in one call. 1 - np.sum((counts/n)**2) applies the formula.

naive.py

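labels = ['A', 'A', 'B', 'A', 'B', 'B']
n = len(labels)
# First loop: tally how many times each label occurs
counts = {}
for lbl in labels:
    counts[lbl] = counts.get(lbl, 0) + 1
# Second loop: sum the squared class proportions p_k = count_k / n
sq_sum = 0.0
for lbl in counts:
    p = counts[lbl] / n
    sq_sum += p * p
# Gini = 1 - sum of squared proportions, rounded for display
gini = round(1 - sq_sum, 4)
print('RESULT:', gini)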

library.py

import numpy as np
from dalib.display import set_display
set_display()

labels = ['A', 'A', 'B', 'A', 'B', 'B']
n = len(labels)
_, counts = np.unique(labels, return_counts=True)
gini = round(float(1 - np.sum((counts / n) ** 2)), 4)
print('counts:', counts.tolist())
print('RESULT:', gini)

counts: [3, 3]
RESULT: 0.5
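
The two scripts above compute the same quantity, so the formula can be wrapped in one reusable function. The following is a minimal sketch, not part of the original scripts: the name gini_impurity and the use of collections.Counter are assumptions made for illustration. It re-checks the six-label example and adds two boundary cases, a single-class list and three equally represented classes.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the classes present in labels.

    Hypothetical helper for illustration; it is not defined in naive.py
    or library.py.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)  # class label -> number of occurrences
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

example = ['A', 'A', 'B', 'A', 'B', 'B']
print('example:', round(gini_impurity(example), 4))                 # 0.5
print('pure node:', round(gini_impurity(['A', 'A', 'A']), 4))       # 0.0
print('three balanced:', round(gini_impurity(['A', 'B', 'C']), 4))  # 0.6667
```

Raising on an empty list is a design choice in this sketch; the formula is undefined when n is 0, so failing loudly seemed safer than returning a default.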

Implementation notes

A Gini of 0 means a pure node (one class only); the maximum is 1 − 1/k, reached when k classes are equally represented. For binary labels the maximum is 0.5 at a 50/50 split, which is the case in this example.
Decision trees choose the split with the largest Gini gain: the parent's Gini minus the average of the child Ginis, weighted by each child's share of the parent's samples.
Gini uses no logarithm, making it cheaper to compute than entropy. Cross-reference: entropy-information-gain (this chapter) for the log-based alternative; both drive the same split-selection logic.
Cross-reference: threshold-probabilities (ch04) for how class counts relate to predicted probabilities.
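
The Gini gain described in the notes above can be sketched as follows. This is an illustrative example under stated assumptions: gini_impurity and gini_gain are hypothetical helper names, the split is binary, both children are non-empty, and the candidate split of the six labels is made up for the demonstration.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2); assumes labels is non-empty."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, left, right):
    """Parent Gini minus the size-weighted average of the two child Ginis."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

parent = ['A', 'A', 'B', 'A', 'B', 'B']   # Gini 0.5, as in the example
# Hypothetical candidate split, chosen only for illustration
left = ['A', 'A', 'A', 'B']               # Gini 0.375
right = ['B', 'B']                        # Gini 0.0 (pure)
print('gain:', round(gini_gain(parent, left, right), 4))  # 0.25
```

A tree builder would evaluate this gain for every candidate split and keep the largest. A split that separates the classes perfectly would recover the full parent impurity of 0.5 here.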