Evaluation Metrics
Precision, Recall, and F1
Compute precision, recall, and F1 from given TP, FP, and FN counts: precision P = TP/(TP+FP), recall R = TP/(TP+FN), and F1 = 2PR/(P+R). The library version calls sklearn's precision_score, recall_score, and f1_score on a y_true/y_pred pair that yields the same counts. RESULT: (precision, recall, f1), each rounded to 4 decimal places.
By hand
With scikit-learn
precision_score, recall_score, and f1_score each take y_true and y_pred.
The default average='binary' reports the score for the positive class only,
which is label 1 unless pos_label says otherwise.
naive.py
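# Counts for the y_true/y_pred pair used in library.py:
# 2 true positives, 1 false positive, 2 false negatives.
tp = 2
fp = 1
fn = 2
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
# F1 uses the unrounded precision and recall; rounding happens only at print.
f1 = 2 * precision * recall / (precision + recall)
print('RESULT:', (round(precision, 4), round(recall, 4), round(f1, 4)))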
library.py
from sklearn.metrics import precision_score, recall_score, f1_score
from dalib.display import set_display
set_display()
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0]
p = round(float(precision_score(y_true, y_pred)), 4)
r = round(float(recall_score(y_true, y_pred)), 4)
f = round(float(f1_score(y_true, y_pred)), 4)
print('RESULT:', (p, r, f))
RESULT: (0.6667, 0.5, 0.5714)
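The hard-coded counts in naive.py can be checked against the lists in library.py. The following is a minimal sketch, not part of either script: the helper names confusion_counts and safe_div are hypothetical, and returning 0.0 on a zero denominator is an assumption made here for illustration (scikit-learn's scorers expose a zero_division parameter for the same situation).

```python
# Same data as library.py.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0]


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a single positive label (hypothetical helper)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def safe_div(num, den):
    # Assumption: report 0.0 instead of raising when the denominator is zero.
    return num / den if den else 0.0


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)

print('counts:', (tp, fp, fn, tn))
print('RESULT:', (round(precision, 4), round(recall, 4), round(f1, 4)))
```

On this data the sketch prints counts of (2, 1, 2, 1) and the same RESULT tuple as both scripts. The guard matters only for degenerate inputs, such as a classifier that never predicts the positive label, where naive.py would raise ZeroDivisionError.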
Implementation notes
- F1 is computed from the raw (unrounded) precision and recall so floating-point
rounding doesn't compound. The rounded values appear only in the final print.
- Precision answers "of all predicted positives, how many were correct?"; recall
answers "of all actual positives, how many did we find?". F1 is their harmonic
mean. The harmonic mean punishes a metric that is very high on one axis and
very low on the other more than the arithmetic mean would.
- Accuracy = (TP+TN)/n = (2+1)/6 = 0.5 on this data, slightly below F1 = 0.5714.
The two can diverge because F1 ignores true negatives, so neither number should
be read on its own when classes are imbalanced. Cross-reference:
accuracy-score and confusion-matrix-counts (this chapter).
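To make the harmonic-mean and accuracy notes concrete, here is a small sketch using only the standard library. The lopsided pair (0.99, 0.01) is an invented illustration, not a result from this chapter; the counts are the ones used in naive.py plus the single true negative in the library.py data.

```python
from statistics import harmonic_mean

# Counts from this section's data: TP=2, FP=1, FN=2, TN=1.
tp, fp, fn, tn = 2, 1, 2, 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = harmonic_mean([precision, recall])
arithmetic = (precision + recall) / 2

# Illustrative lopsided case (made-up values): high on one axis, low on the other.
lopsided_f1 = harmonic_mean([0.99, 0.01])
lopsided_arithmetic = (0.99 + 0.01) / 2

# Accuracy counts true negatives, which F1 ignores.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print('balanced-ish:', round(f1, 4), 'vs arithmetic', round(arithmetic, 4))
print('lopsided:    ', round(lopsided_f1, 4), 'vs arithmetic', round(lopsided_arithmetic, 4))
print('accuracy:    ', accuracy)
```

For this section's data the two means are close (about 0.5714 against 0.5833), while for the lopsided pair the harmonic mean falls to roughly 0.02 and the arithmetic mean stays at 0.5. Accuracy comes out at 0.5, matching the figure in the note above.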