Entropy and Information Gain - Machine Learning Basics

Entropy = −Σpₖ·log₂(pₖ) measures node impurity in bits. Information gain = entropy(parent) − weighted average of child entropies for a proposed split. Three inline counting loops compute each entropy. Library: NumPy helper ent() to confirm. RESULT: information gain (rounded).

By hand

With NumPy

Helper ent() uses np.unique for counts and −np.sum(p·np.log2(p)) for entropy in bits. abs() guards against −0.0 on pure (single-class) nodes.

naive.py

import math
parent = ['A', 'A', 'B', 'B']
left   = ['A', 'A', 'B']
right  = ['B']
n_p = len(parent)
cnt_p = {}
for lbl in parent:
    cnt_p[lbl] = cnt_p.get(lbl, 0) + 1
h_p = 0.0
for lbl in cnt_p:
    p = cnt_p[lbl] / n_p
    h_p = h_p - p * math.log2(p)
n_l = len(left)
cnt_l = {}
for lbl in left:
    cnt_l[lbl] = cnt_l.get(lbl, 0) + 1
h_l = 0.0
for lbl in cnt_l:
    p = cnt_l[lbl] / n_l
    h_l = h_l - p * math.log2(p)
n_r = len(right)
cnt_r = {}
for lbl in right:
    cnt_r[lbl] = cnt_r.get(lbl, 0) + 1
h_r = 0.0
for lbl in cnt_r:
    p = cnt_r[lbl] / n_r
    h_r = h_r - p * math.log2(p)
ig = h_p - (n_l / n_p) * h_l - (n_r / n_p) * h_r
print('RESULT:', round(ig, 4))

library.py

import numpy as np
from dalib.display import set_display
set_display()

def ent(labels):
    n = len(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / n
    return abs(float(-np.sum(p * np.log2(p))))

parent = ['A', 'A', 'B', 'B']
left   = ['A', 'A', 'B']
right  = ['B']
h_p = ent(parent)
h_l = ent(left)
h_r = ent(right)
ig = h_p - (len(left) / len(parent)) * h_l - (len(right) / len(parent)) * h_r
print('h_parent:', round(h_p, 4))
print('h_left:', round(h_l, 4))
print('h_right:', round(h_r, 4))
print('RESULT:', round(ig, 4))

h_parent: 1.0
h_left: 0.9183
h_right: 0.0
RESULT: 0.3113

Implementation notes

Entropy=0 → pure node; entropy=log₂(k) → maximally mixed k equal classes. Binary maximum = 1 bit at 50/50; parent here is 50/50 → H=1.0.
IG=0.3113 means this split removes 0.3113 bits of impurity. The right child is pure (H=0) but has only 1 sample, so its weight (1/4) is small.
log₂(1)=0 in the pure child: the h_r loop computes −1·log₂(1)=0 correctly without a special-case branch.
Weights (n_left/n, n_right/n) penalise unequal splits — a large impure child contributes more to the weighted average than a small pure one.
Cross-reference: gini-impurity (this chapter) for the log-free impurity alternative; the split-selection logic is the same, only the measure differs.