Deep Learning

These notes focus primarily on evaluating binary classification models.

Accuracy, Precision, Recall and F1-score

One possible measure to evaluate binary classification models is Accuracy (1.1), i.e., the fraction of predictions we got right:

$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} …(1.1) $$

in which:

TP: True Positive, which is an outcome where the model correctly predicts the positive class.

TN: True Negative, which is an outcome where the model correctly predicts the negative class.

FP: False Positive, which is an outcome where the model incorrectly predicts the positive class.

FN: False Negative, which is an outcome where the model incorrectly predicts the negative class.
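As a minimal sketch of (1.1), the four counts can be tallied from paired labels and predictions. The function names `confusion_counts` and `accuracy`, the 0/1 label encoding, and the toy data are illustrative assumptions, not part of the original notes.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP, FN for a binary problem (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Equation (1.1): fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Toy data, assumed for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```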

However, in many cases, accuracy is a poor or misleading metric:

This happens most often when different kinds of mistakes have different costs.
A typical case is class imbalance, where positives or negatives are extremely rare.
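The class-imbalance problem can be illustrated with a small sketch. The 10-in-1000 split and the always-negative "model" are assumptions chosen to make the effect obvious.

```python
# Assumed toy data: 10 positives among 1000 examples.
y_true = [1] * 10 + [0] * 990

# A degenerate "model" that always predicts the majority (negative) class.
y_pred = [0] * 1000

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
acc = correct / len(y_true)

# Number of actual positives the model managed to find.
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(acc)              # 0.99
print(positives_found)  # 0
```

The accuracy looks excellent even though the model never identifies a single positive example.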

For class-imbalanced problems, it is useful to separate out the different kinds of errors using metrics such as Precision (1.2) and Recall (1.3):

$$ Precision = \frac{TP}{TP+FP}…(1.2) $$

$$ Recall = \frac{TP}{TP+FN}…(1.3) $$

Precision answers the question of what proportion of positive identifications was actually correct. Recall answers the question of what proportion of actual positives was identified correctly.
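A minimal sketch of (1.2) and (1.3) follows. Returning 0.0 when the denominator is zero is a convention assumed here, since the formulas are undefined in that case, and the counts are made up for illustration.

```python
def precision(tp, fp):
    """Equation (1.2): share of positive predictions that were correct."""
    denom = tp + fp
    return tp / denom if denom else 0.0  # assumed convention for 0/0


def recall(tp, fn):
    """Equation (1.3): share of actual positives that were found."""
    denom = tp + fn
    return tp / denom if denom else 0.0  # assumed convention for 0/0


# Assumed counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```

Here the model is right 80% of the time when it predicts positive, but it finds only half of the actual positives.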

If model A has better precision and better recall than model B, then model A is probably better.

The F1 score (also called F-score or F-measure) (1.4) is the harmonic mean of precision and recall.

$$ F_1=\Bigg(\frac{precision^{-1}+recall^{-1}}{2} \Bigg)^{-1} …(1.4) $$
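Expanding (1.4) with the definitions (1.2) and (1.3) gives two equivalent forms that are often easier to compute. This is a short algebraic sketch, not an additional definition:

$$ F_1 = \frac{2}{\frac{TP+FP}{TP}+\frac{TP+FN}{TP}} = \frac{2 \cdot precision \cdot recall}{precision+recall} = \frac{2TP}{2TP+FP+FN} $$

For example, with precision 0.8 and recall 0.5, $F_1 = \frac{2 \cdot 0.8 \cdot 0.5}{0.8+0.5} \approx 0.615$, which is lower than the arithmetic mean of 0.65 because the harmonic mean is pulled toward the smaller value.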

The importance of the F1 score depends on the scenario. Let's assume the target variable is a binary label.

Balanced class: In this situation, the F1 score can effectively be ignored; the misclassification rate is key.
Unbalanced class, but both classes are important: If the class distribution is highly skewed (such as 80:20 or 90:10), then a classifier can get a low misclassification rate simply by always choosing the majority class. In such a situation, I would choose the classifier that gets high F1 scores on both classes, as well as a low misclassification rate. A classifier that gets low F1 scores should be passed over.
Unbalanced class, but one class is more important than the other: E.g., in fraud detection, it is more important to correctly label a fraudulent instance than a non-fraudulent one. In this case, I would pick the classifier that has a good F1 score on the important class only. Note that the F1 score is available per class.
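The per-class F1 score mentioned above can be sketched by treating each class in turn as the "positive" one. The helper `f1_for_class`, the label strings, and the toy data are assumptions for illustration. The code uses the form 2TP / (2TP + FP + FN), which is algebraically equal to (1.4).

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score with `cls` treated as the positive class (hypothetical helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Returning 0.0 when the class never appears is an assumed convention.
    return 2 * tp / denom if denom else 0.0


# Assumed toy data: 2 fraudulent and 8 legitimate transactions.
y_true = ["fraud", "fraud"] + ["ok"] * 8
# One fraud caught, one fraud missed, one legitimate transaction flagged.
y_pred = ["fraud", "ok", "fraud"] + ["ok"] * 7

print(f1_for_class(y_true, y_pred, "fraud"))  # 0.5
print(f1_for_class(y_true, y_pred, "ok"))     # 0.875
```

In this example the classifier is 80% accurate overall and scores well on the majority class, yet its F1 score on the class that matters is only 0.5.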

ROC Curve and AUC

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.

The ROC is also known as a relative operating characteristic curve, because it is a comparison of two operating characteristics (TPR and FPR) as the criterion changes.

This curve plots two parameters:

True Positive Rate (TPR), a.k.a. sensitivity, recall, or hit rate.
False Positive Rate (FPR), a.k.a. fall-out.
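In terms of the counts defined earlier, the true positive rate is TP / (TP + FN) and the false positive rate is FP / (FP + TN). The sketch below sweeps the threshold over every distinct score and records one (FPR, TPR) point per threshold. The function name `roc_points`, the rule "predict positive when score >= threshold", and the toy scores are assumptions; the code also assumes 0/1 labels with at least one example of each class.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    # Start at (0, 0): a threshold above every score predicts no positives.
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / neg, tp / pos))
    return points


# Assumed toy data: model scores for 3 positives and 3 negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add positive predictions, so both rates are non-decreasing along the curve, which ends at (1, 1).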

AUC stands for “Area under the ROC Curve”. That is, AUC measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1).

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
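The two views of AUC, as an area and as a ranking probability, can be checked against each other on a small example. Both function names, the toy data, and the choice to count tied scores as half a win are assumptions of this sketch. It also assumes 0/1 labels with at least one example of each class.

```python
def auc_by_ranking(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


def auc_by_area(y_true, scores):
    """Trapezoidal area under the ROC curve from (0,0) to (1,1)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    area = 0.0
    prev_fpr, prev_tpr = 0.0, 0.0
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        fpr, tpr = fp / neg, tp / pos
        # Add the trapezoid between the previous ROC point and this one.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area


# Assumed toy data: one positive (score 0.6) ranks below one negative (0.7).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(auc_by_ranking(y_true, scores))  # 8 of 9 pairs ranked correctly
print(auc_by_area(y_true, scores))     # same value, up to float rounding
```

Of the nine positive-negative pairs, only the pair (0.6, 0.7) is ranked incorrectly, so both computations give 8/9.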

“An optimum observer required to give a yes or no answer simply chooses an operating level and concludes that the receiver input arose from signal plus noise only when this level is exceeded by the output of his likelihood ratio receiver. Associated with each such operating level are conditional probabilities that the answer is a false alarm and the conditional probability of detection. Graphs of these quantities called receiver operating characteristic, or ROC, curves are convenient for evaluating a receiver. If the detection problem is changed by varying, for example, the signal power, then a family of ROC curves is generated. Such things as betting curves can easily be obtained from such a family.”

Source: Peterson, W., Birdsall, T., Fox, W. (1954). The theory of signal detectability. Transactions of the IRE Professional Group on Information Theory, 4(4), pp. 171-212.