Here's a summary of the key themes and opinions expressed in the Hacker News discussion about the Amsterdam fraud detection model, with direct quotations and attributions:
The Problematic Definition of "Fairness" and "Unbiased"
Many commenters took issue with the very definition of "fairness" being used, particularly the idea of equal outcomes across groups. They argued that this ignores underlying realities about differing rates of fraud in different populations.
- djoldman: "'Unbiased' and 'fair' models are generally somewhat ironic."
- wongarsu: "So if I want to make a model to recommend inkjet printers then a quarter of all recommendations should be for HP printers? After all, a quarter of all sold printers are HP. As you say, that would be a crappy model. But in my opinion that would also be hardly a fair or unbiased model. That would be a model unfairly biased in favor of HP, who barely sell anything worth recommending."
- djohnston: "Sorry, this is retarded right? Why would you assume that all groupings of people commit welfare fraud at the same rate?"
- BonoboIO: "Yes it is. This is some ideal world thinking, that has nothing to do with reality and is easily falsifiable, but only if you want to see the real world."
- throwawayqqq11: "Being flagged as potential fraud based on e.g. ethnicity is what you want to eliminate, so you have to start with the assumption of an even distribution."
- ordu: "If some group have higher rate of welfare fraud, the fair/unbiased system must keep false positives for that group at the same level as for general population."
The Inherent Trade-offs in Fairness Metrics
Several contributors highlighted the unavoidable trade-offs between different fairness definitions, and between fairness and overall performance. Achieving one type of fairness can worsen others, and improving fairness for one group can harm another; a short sketch after the quotes below illustrates the tension.
- tbrownaw: "If different groups have different rates of whatever you're predicting, it is not possible to have all of the different ways of measuring performance agree on whether your model is fair or not."
- tripletao: "They correctly note the existence of a tradeoff... Ideally, a model would be fair in the senses that: 1. In aggregate over any nationality, people face the same probability of a false positive. 2. Two people who are identical except for their nationality face the same probability of a false positive. In general, it's impossible to achieve both properties."
- tbrownaw: "These are two sets of unavoidable tradeoffs: focusing on one fairness definition can lead to worse outcomes on others. Similarly, focusing on one group can lead to worse performance for other groups."
The Importance of Ground Truth and Addressing Underlying Issues
Many commenters pointed out that the central problem is often the lack of reliable "ground truth" data. It's difficult to assess bias and fairness without knowing the actual rates of fraud within different groups. Furthermore, simply reweighting a flawed model doesn't address the root causes of inequity.
- wongarsu: "A big part of the difficulty of such an attempt is that we don't know the ground truth... So how are we supposed to judge if the new model is unbiased? This seems fundamentally impossible without improving our ground truth in some way."
- talkingtab: "My take away is that the factors the city of Amsterdam is using to predict fraud are probably not actually predictors... One has to wonder if the study is more valid a predictor of the implementers' biases than that of the subjects."
Model Performance and Practical Implications
A major concern was that the "fairness" interventions degraded the model's performance, resulting in more investigations with no tangible benefit. The focus on fairness metrics overshadowed the fundamental need for the model to be effective at detecting fraud; a brief sketch after the quotes below shows what the missing outcome figures would quantify.
- BonoboIO: "The article talks a lot about fairness metrics but never mentions whether the system actually catches fraud. Without figures for true positives, recall, or financial recoveries, its effectiveness remains completely in the dark."
- wongarsu: "...the model’s performance in the pilot also deteriorated. Crucially, the model was meant to lead to fewer investigations and more rejections. What happened instead was mostly an increase in investigations , while the likelihood to find investigation worthy applications barely changed in comparison to the analogue process."
Concerns About Data Integrity and Methodology
Some commenters raised questions about the rigor of the city's analysis and reporting. The inability to replicate results, together with discrepancies in the data, cast doubt on the validity of the experiment; a short sketch of reproducible splitting follows the quote below.
- 3abiton: "A more concerning limitation is that when the city re-ran parts of its analysis, it did not fully replicate its own data and results. For example, the city was unable to replicate its train and test split... It just cast a big doubt on the whole experiment."
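As background on why an unreplicable train/test split is a red flag, here is a minimal sketch (assuming a scikit-learn-style workflow, which the article does not specify): a split is only re-derivable when the random seed, and the underlying data snapshot, are pinned.

```python
# Why a train/test split may not replicate: unpinned randomness.
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(1000).reshape(-1, 1)  # stand-in for the applications dataset

# Unpinned seed: two runs disagree, so downstream metrics are hard to reproduce.
a1, _ = train_test_split(data, test_size=0.2)
a2, _ = train_test_split(data, test_size=0.2)
print(np.array_equal(a1, a2))  # almost certainly False

# Pinned seed: the split (and thus the reported results) can be re-derived exactly.
b1, _ = train_test_split(data, test_size=0.2, random_state=0)
b2, _ = train_test_split(data, test_size=0.2, random_state=0)
print(np.array_equal(b1, b2))  # True
```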
Positive Takeaways: Responsible Development and Deployment
Despite the criticisms, some commenters praised the city of Amsterdam for their responsible approach. They highlighted the importance of ethical considerations, scientific iteration, and the willingness to abandon a project that fails to meet its ethical guardrails.
- thatguymike: "Congrats Amsterdam: they funded a worthy and feasible project; put appropriate ethical guardrails in place; iterated scientifically; then didn’t deploy when they couldn’t achieve a result that satisfied their guardrails. We need more of this in the world."
The Complexity of Proxy Variables
Commenters pointed out the complex relationship between model features and protected characteristics: features can act as proxies for sensitive attributes, reintroducing bias even when those attributes are excluded. A sketch after the quotes below illustrates the effect.
- tomp: "it's not unusual that relevant features are correlated with protected features. In the specific example above, being an immigrant is likely correlated with not knowing the local language, therefore being underemployed and hence more likely to apply for benefits."
- tbrownaw: "But the model designers were aware that features could be correlated with demographic groups in a way that would make them proxies."