riccardo

# How to evaluate AI solutions with two numbers

Updated: Aug 1, 2018

The number of business processes and decisions that technologies like machine learning can augment and automate keeps growing. With automation, however, comes the need to challenge and evaluate a possible implementation effectively during the design phase. Only a real test can give us exact answers, but we need to formulate the right questions in order to find them.

Given the many upcoming business applications, we want to focus on machine learning for classification: customer marketing profiling, loan applications, operational improvements, etc. However, we will treat it as an ordinary discussion of probabilities.

When discussing possible solutions with vendors, accuracy (the share of correct classifications out of the total) is often used as the base metric, and it is often quoted above 90%, sometimes close to 99%. That is in part because of a human flaw: we tend to think and act in a binary way much more than we believe (see behavioral finance). We are much more sensitive to probabilities like 0% and 100%, and we tend to map anything in between onto those two extremes.

Accuracy is often incomplete information. 99% accuracy could be the unfortunate result of missing the only positive case of a deadly disease out of 100 patients. Similarly, a 99% accurate test could be almost useless if applied to an incidence rate of 0.1% (more on this later). Sometimes we may be facing weak technology; other times the problem itself is tough because of its low incidence rate. Obviously, nobody wants to buy bad solutions, but even good ones may not be easy to evaluate at first sight.

We cannot just count the correct classifications out of the total; we need an extra level of detail. At the risk of oversimplifying by considering a world with only binary classifications, a good starting point is the following four categories: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). From them we can calculate the following indicators:

Precision = TP / (TP + FP)

Recall (or Sensitivity, approximating slightly) = TP / (TP + FN)

Specificity = TN / (TN + FP)
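As a minimal sketch of the three indicators above, the following Python function computes them from the four counts. The function name and the counts in the usage lines are hypothetical and chosen only for illustration; they do not come from any real test.

```python
def classification_indicators(tp, fp, tn, fn):
    """Return precision, recall and specificity from TP, FP, TN and FN counts."""

    def ratio(numerator, denominator):
        # An indicator is undefined when its denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical counts for 1,000 classified cases (illustration only).
indicators = classification_indicators(tp=80, fp=20, tn=880, fn=20)
for name, value in indicators.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, precision and recall are both 0.800 while specificity is about 0.978, which already shows that the indicators answer different questions.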

In general, Precision and Recall trade off against each other: if we want higher precision we have to accept more false negatives, while if we want higher recall we have to accept more false positives. We need to think about the specific application and what we really need. We found the following useful examples online: we may want high Recall for cancer screening, accepting some false positives in order to miss very few actual positive cases (that is, very few false negatives); on the contrary, we may focus on Precision when recommending streaming videos, where the priority is to show users only relevant content even though we may miss some of it.

__Some useful tips:__

*We do not always need all the indicators, but using only one of them can create dangerous situations, such as reaching maximum Recall (Sensitivity) simply by classifying every case as positive. In general, two of them are used together, either Precision + Recall or Sensitivity + Specificity (we mention Recall and Sensitivity separately here because, as we said above, without the approximation of only binary classifications they are slightly different concepts). With the new extra level of detail, what we previously called Accuracy would basically be: (TP + TN) / (TP + TN + FP + FN).*
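To make the tip concrete, here is a small sketch that reuses the 100-patient scenario mentioned earlier (one positive case out of 100) and compares two degenerate classifiers: one that always answers "negative" and one that always answers "positive". The helper name `summarize` and the two scenarios are our own illustration, not part of any library.

```python
def summarize(tp, fp, tn, fn):
    """Accuracy, recall and precision from the four counts (None if undefined)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
    }


# 100 patients, exactly one of them positive.
# Classifier A always says "negative": it misses the only positive case.
always_negative = summarize(tp=0, fp=0, tn=99, fn=1)
# Classifier B always says "positive": it catches the case but flags everyone.
always_positive = summarize(tp=1, fp=99, tn=0, fn=0)

print("always negative:", always_negative)
print("always positive:", always_positive)
```

Classifier A reaches 99% accuracy with zero Recall, while classifier B reaches 100% Recall with only 1% Precision. Neither number alone reveals the problem; a pair of indicators does.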

We will now go through a common example where we want to miss very few positive cases, even though we may have to re-test many false positives (high Recall). The common example is the HIV test, where the incidence rate is also low relative to the population. We will assume the test identifies a true positive with 99.7% probability (Recall / Sensitivity) and a true negative with the same 99.7% probability (Specificity), on a population of 300MM with an incidence rate of 0.5% (average positive cases). If a patient tests positive the first time, the probability that they are actually positive is roughly 60% (62.5% before rounding): *99.7% x 0.5% x 300MM / [99.7% x 0.5% x 300MM + (100% - 99.7%) x (100% - 0.5%) x 300MM], where 99.7% x 0.5% x 300MM counts the correct TP given by the Sensitivity, while (100% - 99.7%) x (100% - 0.5%) x 300MM counts the FP left over by the Specificity*.
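The calculation above can be sketched in a few lines of Python. The function name is our own; the inputs are the figures assumed in the text. Note that the population size (300MM) appears in every term and cancels out, so the sketch works directly with rates.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability of being truly positive given one positive test result."""
    true_positives = sensitivity * prevalence                # share of TP in the population
    false_positives = (1 - specificity) * (1 - prevalence)   # share of FP in the population
    return true_positives / (true_positives + false_positives)


# Figures assumed in the text: 99.7% sensitivity and specificity, 0.5% incidence.
ppv = positive_predictive_value(sensitivity=0.997, specificity=0.997, prevalence=0.005)
print(f"Probability of being positive after one positive test: {ppv:.1%}")
```

With these inputs the result is about 62.5%, in line with the rounded figure of roughly 60% used in the discussion.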

We may not be sure what to do with a roughly 60% probability of being positive; the result may be one of the numerous false positives. However, if a repeated test confirmed the positive result, the probability of it being correct would rise from roughly 60% to about 99.8%, and we could stop there (same calculation as above, with the probability from the first test replacing the 0.5% incidence rate as prior information).
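A short sketch of this two-step reasoning follows. It assumes, and this is an assumption of ours rather than something the example guarantees, that the two tests are independent given the patient's true condition, so that the result of the first test can simply be reused as the prior of the second. The function name is hypothetical.

```python
def update_on_positive(prior, sensitivity, specificity):
    """Probability of being positive after one more positive test result."""
    numerator = sensitivity * prior
    return numerator / (numerator + (1 - specificity) * (1 - prior))


probability = 0.005  # start from the 0.5% incidence rate
history = []
for _ in range(2):   # two consecutive positive results
    probability = update_on_positive(probability, sensitivity=0.997, specificity=0.997)
    history.append(probability)

for test_number, value in enumerate(history, start=1):
    print(f"After positive test {test_number}: {value:.1%}")
```

The first pass gives about 62.5% and the second about 99.8%, which is why a confirmed positive is so much more informative than a single one.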

In the opposite situation, if the test returned a negative the first time, we would be almost certain of the negative result right after the first test, with a probability of more than 99.99%: *99.7% x (100% - 0.5%) x 300MM / [99.7% x (100% - 0.5%) x 300MM + (100% - 99.7%) x 0.5% x 300MM]*. The negative result is immediately reliable because the residual of the Sensitivity (100% - 99.7%) is now applied to a much smaller population, equal to 0.5% x 300MM.
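The mirror-image calculation for a negative result can be sketched the same way. As before, the function name is ours and the inputs are the figures assumed in the text; the population size cancels out.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """Probability of being truly negative given one negative test result."""
    true_negatives = specificity * (1 - prevalence)    # share of TN in the population
    false_negatives = (1 - sensitivity) * prevalence   # share of FN in the population
    return true_negatives / (true_negatives + false_negatives)


npv = negative_predictive_value(sensitivity=0.997, specificity=0.997, prevalence=0.005)
print(f"Probability of being negative after one negative test: {npv:.3%}")
```

The result is about 99.998%: the few false negatives come from the small positive group, so they barely dent the very large group of true negatives.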

A second test is not always possible, and it does not always improve our prediction. Submitting different blood samples to an HIV test at different times is very different from re-submitting the same questionnaire answers to a system evaluating loan applications. This is a simplified discussion, but it serves to emphasize the importance of the specific application.

To conclude, we can understand the performance of a solution much better by distinguishing TP, FP, TN and FN and the derived indicators. We can also identify the appropriate tradeoff for our specific application by examining how a possible second test improves the probability of our findings. Moreover, the first result of a test may have a very different probability of being correct depending on whether it agrees with the condition of the majority of the population, and on the associated incidence rate.

*Note: the probabilities above are usually calculated with an explicit conditional or Bayesian formulation, but we tried to express them intuitively.*
