Monday, December 15, 2025

Researchers reveal bias in a widely used measure of algorithm performance

Santa Fe Institute

When scientists test algorithms that sort or classify data, they often turn to a trusted tool called Normalized Mutual Information (NMI) to measure how well an algorithm’s output matches reality. But according to new research, that tool may not be as reliable as many assume.

In a paper published in Nature Communications, SFI Postdoctoral Fellow Max Jerdee, Alec Kirkley (University of Hong Kong), and SFI External Professor Mark Newman (University of Michigan) show that NMI — one of the most widely used metrics in data science and network research — can produce biased results. “Normalized mutual information has been used or referenced in thousands of papers in the decades since it was first proposed,” Newman says, “but it turns out that it can give incorrect results, and the errors are large enough to change scientific conclusions in some cases.”

Suppose researchers are developing algorithms to classify medical conditions based on patient symptoms. One model might reliably detect diabetes but lump every case together, while another is better at distinguishing type 1 from type 2 but misses the diabetes diagnosis entirely 10% of the time, giving it a higher error rate. In situations like this, researchers need a way to say which model’s predictions give more information about the true condition. Mutual information helps with that, measuring how much a model’s output reduces the uncertainty about the correct classifications. Researchers often normalize that measure so it falls between 0 and 1, which makes different problems easier to compare. Yet Jerdee and colleagues found that this normalization introduces two major biases.
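For readers who want to see the calculation concretely, here is a minimal sketch in Python of how such a comparison might look, using scikit-learn’s off-the-shelf NMI score (the labels and models below are invented for illustration and are not taken from the paper):

# Hypothetical example: scoring two toy classifiers with standard NMI.
from sklearn.metrics import normalized_mutual_info_score

# True conditions: 0 = no diabetes, 1 = type 1, 2 = type 2 (made-up data)
truth   = [0]*10 + [1]*5 + [2]*5
# Model A: flags every diabetes case but never separates type 1 from type 2
model_a = [0]*10 + [1]*5 + [1]*5
# Model B: separates the types but misses some diabetes cases entirely
model_b = [0]*10 + [1]*4 + [0] + [2]*4 + [0]

print("Model A NMI:", normalized_mutual_info_score(truth, model_a))
print("Model B NMI:", normalized_mutual_info_score(truth, model_b))

Which model comes out ahead rests on exactly this kind of score, which is why a bias in the measure itself matters.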

First, it can reward algorithms that over-divide data, inventing extra categories and appearing more accurate than they are. Second, commonly used normalization methods can introduce a further bias toward artificially simple algorithms. Both effects can distort comparisons, especially in complex problems where the “true” grouping is not straightforward.
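The first effect is easy to reproduce with standard NMI. In the illustrative sketch below (again using scikit-learn’s implementation with its default normalization, not the paper’s corrected measure), a labeling made of pure random guesses scores near zero when it uses two groups, but markedly higher when the same guesswork is spread over fifty groups:

# Illustrative sketch: standard NMI tends to reward partitions with many
# groups, even when those groups carry no real information.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 50)              # two true classes, 100 items

guess_2  = rng.integers(0, 2,  size=100)   # uninformative guess, 2 groups
guess_50 = rng.integers(0, 50, size=100)   # uninformative guess, 50 groups

print(normalized_mutual_info_score(truth, guess_2))   # close to 0
print(normalized_mutual_info_score(truth, guess_50))  # substantially larger

The extra groups add nothing real, yet they inflate the score — the over-division bias the authors describe.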

To address these issues, the team developed an asymmetric, reduced version of the mutual information metric that eliminates both sources of bias. When they applied their measure to popular community-detection algorithms, they found that while standard NMI can point researchers to different “best” algorithms depending on how it’s calculated, their revised measure offers a more consistent and trustworthy comparison.

By correcting this metric, the authors hope to improve the reliability of comparisons in any field where clustering or classification plays a central role. “Scientists use NMI as a kind of yardstick to compare algorithms,” Jerdee says. “But if the yardstick itself is bent, you might draw the wrong conclusion about which method performs better.”

Read the paper “Normalized mutual information is a biased measure for classification and community detection” in Nature Communications (date TBD). DOI: 10.1038/s41467-025-66150-8
