Quantification bias

Why do many new systems seem to receive disproportionate scrutiny and criticism, even when they are objectively better than alternatives?

This is a cognitive bias that I’ve encountered over the years and struggled to label; it’s distinct from status quo bias, which is our tendency to prefer keeping things the same rather than introducing uncertainty. For now, I’ll call this effect the quantification bias. In a nutshell, quantification bias occurs when an incumbent system or practice could be replaced by a new system whose performance can be measured quantitatively in a way the older system’s cannot. Because its errors are visible and countable, the new system is held to a higher standard than the incumbent and is judged harshly for those errors, even if it performs objectively better than the incumbent.

Examples of this might be replacing human decision-making with an automated system or replacing a heuristic-based system with a machine learning model. The new system often brings with it a method of evaluation that allows users to easily quantify its performance on a number of dimensions (think false positives, false negatives, precision, recall, area under an ROC curve, etc.).
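To make those dimensions concrete, here is a minimal sketch of how a few of them fall out of four counts. The class name, the function name, and the counts themselves are hypothetical and chosen only for illustration; they do not describe any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary decision (hypothetical container)."""
    tp: int  # truly high-risk, flagged high-risk
    fp: int  # truly low-risk, flagged high-risk
    tn: int  # truly low-risk, flagged low-risk
    fn: int  # truly high-risk, flagged low-risk


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a rate is undefined.
    return numerator / denominator if denominator else float("nan")


def summarize(c: ConfusionCounts) -> dict:
    """Compute the standard rates from raw counts."""
    return {
        "precision": _ratio(c.tp, c.tp + c.fp),
        "recall": _ratio(c.tp, c.tp + c.fn),
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
        "false_negative_rate": _ratio(c.fn, c.fn + c.tp),
    }


# Made-up counts: 500 truly low-risk people, 100 of them flagged.
example = ConfusionCounts(tp=80, fp=100, tn=400, fn=20)
metrics = summarize(example)
```

The point is less the arithmetic than the fact that it is possible at all: once decisions and outcomes are logged, every one of these numbers is a single division away, which is exactly what an unlogged human process does not offer.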

One area where I see this happening quite often is the discussion of AI/ML regulation. A number of passed and proposed laws (in the US) require regular auditing of AI/ML systems to ensure that performance has not degraded, meets acceptable standards, and is not biased. Boutique firms (e.g., 1, 2, 3, but this is not an exhaustive list) are quickly forming to support these auditing requirements, and forward-thinking tech companies are already auditing their own models for bias and fairness.

To be clear, this is a welcome development. Measuring the accuracy of systems, interrogating systemic errors, and reducing bias are positive improvements for society! My read of the policy environment that is producing these requirements, however, is that it is distrustful of machine learning, of “algorithms”, and of AI. That distrust is certainly well-supported; one doesn’t need to be an AI ethics expert to have encountered numerous examples of these systems making bad decisions with outsized effects on marginalized members of society (Amazon example, ProPublica, skin color, home sales).

That being said, in many cases (such as lending, hiring, medicine, and so on), the AI systems in question are replacing processes that are rife with human biases. There is usually no similar proactive auditing requirement for humans, nor is there a requirement that humans submit to regulators all of the information that they used to make a decision, a clause in many proposed regulations of AI. Nor would such a requirement even be workable, as any information submitted would be subject to ex post reasoning and justification by the humans involved.

The fact that humans are biased and may not fully understand or be able to articulate the exact methods they used to arrive at a conclusion does not in itself mean that they are producing results that are inferior somehow to an AI system. It does mean, however, that humans are held to a different standard in evaluating or quantifying potential errors or trends in their decision-making. Consider, for example, that in many places in the US, judges are elected by popular vote with no requirements that the candidates demonstrate a baseline level of jurisprudential expertise or that they are unbiased in how they arrive at decisions.

Thankfully, this is not an intractable problem; the answer is not to reject these standards for quantifiable systems, but to hold incumbent systems to the same standards. One positive improvement would be to require quantifying the degree of improvement over existing systems that a proposed system offers. Making an apples-to-apples comparison of human decision-making systems and AI systems requires evaluating them in the same way and holding them to the same standard. This will be difficult for some systems — after all, one of the appeals of ML systems is that one can, to some extent, “replay” history using different decision thresholds. However, there are often analogues that can be used for comparison.
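As a rough illustration of what “replaying” history can look like, the sketch below re-scores the same logged cases at several thresholds and reports the false positive rate at each. The scores, the labels, and the function name are all invented for the example; a real audit would use logged model scores and verified outcomes.

```python
def false_positive_rate(scores, labels, threshold):
    """Share of truly low-risk cases flagged as high-risk at `threshold`.

    `labels` holds True for cases that were truly high-risk.
    A case is flagged when its score is >= threshold.
    """
    negatives = [s for s, is_high_risk in zip(scores, labels) if not is_high_risk]
    if not negatives:
        return float("nan")  # rate is undefined with no low-risk cases
    flagged = sum(1 for s in negatives if s >= threshold)
    return flagged / len(negatives)


# Hypothetical historical scores and ground-truth outcomes.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.2, 0.55, 0.65]
labels = [False, False, False, False, True, True, True, False, True, False]

# "Replay" the same history under three different decision thresholds.
replay = {t: false_positive_rate(scores, labels, t) for t in (0.3, 0.5, 0.7)}
```

A human decision process has no such knob to turn after the fact, which is why a comparison usually has to rely on an analogue (such as appeal outcomes) rather than a like-for-like replay.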

A well-structured quantitative assessment should not leave readers wondering “compared to what”, should provide a baseline, and should always offer both relative and absolute measurements. For instance, the statement that “the system incorrectly classified individuals as high-risk 20% of the time” could seem alarming and garner headlines — but compare that to the following:

“The new system has a false positive rate of 20%: out of 500 individuals who were truly low-risk, 100 were deemed high-risk by the system. In comparison, the incumbent system saw close to 35% of high-risk judgments overturned on appeal. The new system offers a roughly 43% relative improvement in false positives (15 percentage points in absolute terms).”
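The arithmetic behind that statement is short enough to spell out. This is a sketch using only the numbers quoted above; the function and variable names are made up, and it adopts the quote's own assumption that the incumbent's overturn-on-appeal rate is a fair stand-in for a false positive rate.

```python
def relative_improvement(baseline_rate: float, new_rate: float) -> float:
    """Fractional reduction in an error rate relative to the baseline."""
    return (baseline_rate - new_rate) / baseline_rate


new_fpr = 100 / 500       # 100 of 500 truly low-risk people flagged
incumbent_rate = 0.35     # share of high-risk judgments overturned on appeal

absolute_gain = incumbent_rate - new_fpr                       # in rate units
relative_gain = relative_improvement(incumbent_rate, new_fpr)  # 15/35

print(f"absolute: {absolute_gain * 100:.0f} percentage points")
print(f"relative: {relative_gain * 100:.0f}%")
```

The relative figure is 15/35, a little under 43%. Reporting it next to the absolute gap of 15 percentage points and the raw counts is what lets a reader answer “compared to what” without guessing.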

This isn’t a tech booster defense of ML systems for high-risk decisions, but a request that we begin to hold existing systems to the same standards and make informed decisions about tradeoffs.

Related: Beware Isolated Demands for Rigor