Decision Trees
Best Split (Gini Scan)
Find the best 1-feature split by scanning candidate thresholds and choosing
the lowest weighted Gini impurity. For each threshold t: split into left
(X<t) and right (X≥t), compute weighted Gini. Library:
DecisionTreeClassifier(max_depth=1) reads tree_.threshold[0].
RESULT: best threshold.
By hand
With scikit-learn
DecisionTreeClassifier(max_depth=1) fits a single split and stores the
chosen threshold in clf.tree_.threshold[0].
naive.py
X = [1, 2, 3, 5, 7, 8]
y = [0, 0, 0, 1, 1, 1]
n = len(X)
thresholds = [2.5, 4.0, 6.0]
best_t = None
best_wg = 1.0
for t in thresholds:
left_y = [y[i] for i in range(n) if X[i] < t]
right_y = [y[i] for i in range(n) if X[i] >= t]
nl = len(left_y)
nr = len(right_y)
c0l = left_y.count(0)
gl = 1 - (c0l / nl) ** 2 - ((nl - c0l) / nl) ** 2
c0r = right_y.count(0)
gr = 1 - (c0r / nr) ** 2 - ((nr - c0r) / nr) ** 2
wg = (nl / n) * gl + (nr / n) * gr
if wg < best_wg:
best_wg = wg
best_t = t
print('RESULT:', best_t)
library.py
from sklearn.tree import DecisionTreeClassifier
from dalib.display import set_display
set_display()
X = [[1], [2], [3], [5], [7], [8]]
y = [0, 0, 0, 1, 1, 1]
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(X, y)
threshold = round(float(clf.tree_.threshold[0]), 4)
print('sklearn threshold:', threshold)
print('RESULT:', threshold)
sklearn threshold: 4.0
RESULT: 4.0
Implementation notes
- sklearn's split condition is feature ≤ threshold (left), > threshold
(right). The naive uses strict
<— equivalent here because no training sample equals 4.0 (the midpoint (3+5)/2=4.0 lies between feature values). - Candidate thresholds are midpoints between adjacent sorted feature values. Three candidates [2.5, 4.0, 6.0] cover the informative region; symmetric outer candidates give higher impurity and are omitted.
- Weighted Gini = (n_left/n)·Gini_left + (n_right/n)·Gini_right. At t=4.0 both children are pure (Gini=0) so weighted Gini=0 — the global minimum.
- Cross-reference:
gini-impurity(this chapter) for the per-node formula;decision-stump-predict(this chapter) applies the found threshold.