In 2016, ProPublica caused a stir when it evaluated the performance of software used in criminal proceedings. The software, which estimates a suspect’s likelihood of committing new crimes, was found to produce different results for black and white suspects.
The significance of that discrepancy is still up for debate, but two researchers from Dartmouth College have raised a more fundamental question: Is the software any good? The answer they came up with was “not especially,” because its performance could be matched by recruiting people on Mechanical Turk or by running a simple analysis that took only two factors into account.
Software and Bias
The software in question is called COMPAS, for Correctional Offender Management Profiling for Alternative Sanctions. It takes into account a wide range of factors about suspects and uses them to evaluate whether those individuals are likely to commit further crimes and to help identify intervention options. COMPAS is highly integrated into the judicial process (see this document from the California Department of Corrections for an idea of its importance). Perhaps most importantly, however, it sometimes influences sentencing, which may be based on the idea that people likely to commit more crimes should be incarcerated for longer.
ProPublica’s evaluation of the software focused on arrests in Broward County, Florida. It found that the software had similar accuracy when it came to predicting whether black and white suspects would re-offend. But false positives (cases where the software predicted a new crime that never happened) were twice as likely to involve black suspects. False negatives, cases where the software predicted that suspects would remain crime-free but they did not, were twice as likely to involve whites.
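To see how equal accuracy and unequal error rates can coexist, here is a minimal Python sketch. The counts are invented purely for illustration and are not ProPublica's data; the helper name `error_rates` and the group labels are hypothetical. It assumes the false positive rate is the share of people who did not re-offend but were flagged, and the false negative rate is the share of people who did re-offend but were not flagged.

```python
from collections import Counter


def error_rates(records):
    """Compute accuracy and error rates from (predicted, actual) boolean pairs.

    predicted: the tool said the person would re-offend.
    actual:    the person did re-offend.
    """
    counts = Counter((bool(p), bool(a)) for p, a in records)
    tp, fp = counts[(True, True)], counts[(True, False)]
    fn, tn = counts[(False, True)], counts[(False, False)]
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Among people who did NOT re-offend, how many were flagged anyway?
        "false_positive_rate": fp / (fp + tn),
        # Among people who DID re-offend, how many were missed?
        "false_negative_rate": fn / (fn + tp),
    }


# Invented counts (assumption): 100 people per group, 50 re-offenders in each.
group_a = ([(True, True)] * 40 + [(True, False)] * 20
           + [(False, True)] * 10 + [(False, False)] * 30)
group_b = ([(True, True)] * 30 + [(True, False)] * 10
           + [(False, True)] * 20 + [(False, False)] * 40)

rates_a = error_rates(group_a)
rates_b = error_rates(group_b)
print("group A:", rates_a)
print("group B:", rates_b)
```

In this toy case both groups are scored with the same overall accuracy, yet group A's false positive rate is double group B's and the false negative rates are reversed, which is the shape of the pattern described above.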
But by other measures, the software showed no indication of bias (including, as noted above, overall accuracy). So the significance of these findings has remained a matter of debate.
Dartmouth researchers Julia Dressel and Hany Farid decided to focus not on bias but on overall accuracy. To do this, they took data on 1,000 defendants and extracted their age, gender, and criminal histories. The defendants were split into groups of 20, and Mechanical Turk was used to recruit people who were asked to guess the probability that each of the 20 individuals would commit a new crime within the next two years.
Wisdom of Mechanical Turks
Averaged across participants, these people had an accuracy of 62 percent. That’s not far from COMPAS’ accuracy, which was 65 percent. In this test, multiple individuals evaluated each defendant, so the authors pooled their responses and took the majority opinion as the decision. This brought the accuracy to 67 percent, edging out COMPAS. Other measurements of the Mechanical Turks’ accuracy suggested they were just as good as the software.
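The pooling step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the raters, their guesses, and the outcomes are invented, the function names are hypothetical, and ties are treated as a prediction of no new crime.

```python
def majority_vote(votes):
    """Return True if more than half of the boolean votes predict re-offending.

    Assumption: an exact tie counts as a prediction of no new crime.
    """
    return sum(votes) * 2 > len(votes)


def fraction_correct(predictions, outcomes):
    """Share of predictions that match the observed outcomes."""
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)


# Invented data (assumption): five defendants, three raters.
outcomes = [True, False, True, False, True]
raters = [
    [True, False, True, True, False],
    [True, True, False, False, True],
    [False, False, True, False, True],
]

# Accuracy of each rater alone, then the average across raters.
individual = [fraction_correct(r, outcomes) for r in raters]
mean_individual = sum(individual) / len(individual)

# Pool the raters: one majority decision per defendant.
pooled = [majority_vote(votes) for votes in zip(*raters)]
pooled_accuracy = fraction_correct(pooled, outcomes)

print("mean individual accuracy:", mean_individual)
print("pooled accuracy:", pooled_accuracy)
```

Because the raters in this toy example make their mistakes on different defendants, the majority decision is right more often than a typical individual. That is the general reason pooling can lift accuracy, though the size of the gain depends on how independent the errors are.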
The results were also similar in that there was no significant difference in accuracy between their evaluations of black and white suspects. The same was true when the authors presented a similar set of records to a new set of people, this time with the suspect’s race included. So in terms of overall accuracy, these inexperienced folks were about as good as the software.
But they were also about as bad, in that they were more likely to produce false positives when the defendant was black, though not to the same extent as COMPAS (a false positive rate of 37 percent for blacks, compared to 27 percent for whites). The false negative rate, where suspects were predicted not to re-offend but did, was also higher for whites (40 percent) than for blacks (29 percent). Those figures are remarkably similar to COMPAS’ error rates. Including race data on the suspects made no significant difference.
If the algorithm could be matched by what is almost certainly a bunch of amateurs, Dressel and Farid reasoned, it might be because it’s not particularly good. So they ran a series of simple statistical tests (linear regressions) using different combinations of the data they had on each defendant. They found that they could match COMPAS’ performance with just two factors: the age of the suspect and the total number of previous convictions.
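For a sense of how little machinery a two-factor predictor needs, here is a minimal Python sketch. It is not the researchers' code or data: it fits an ordinary least-squares line to a dozen invented defendants and treats a score of 0.5 or more as a prediction of re-offending. All names (`fit_linear`, `linear_score`, `rule_accuracy`, `TOY_DEFENDANTS`) and the 0.5 threshold are assumptions made for illustration.

```python
def fit_linear(rows):
    """Least-squares fit of outcome ~ b0 + b_age*age + b_priors*priors.

    Solves the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination.
    """
    X = [[1.0, float(age), float(priors)] for age, priors, _ in rows]
    y = [float(label) for _, _, label in rows]
    n = 3
    # Augmented matrix (X^T X | X^T y).
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def linear_score(weights, age, priors):
    """Fitted score for one defendant; higher means more likely to re-offend."""
    b0, b_age, b_priors = weights
    return b0 + b_age * age + b_priors * priors


def rule_accuracy(weights, rows, threshold=0.5):
    """Share of rows where (score >= threshold) matches the recorded outcome."""
    hits = sum((linear_score(weights, age, priors) >= threshold) == bool(label)
               for age, priors, label in rows)
    return hits / len(rows)


# Invented toy data (assumption): (age, prior convictions, re-offended 1/0).
TOY_DEFENDANTS = [
    (20, 4, 1), (20, 4, 1), (20, 4, 1),
    (20, 0, 1), (20, 0, 1), (20, 0, 0),
    (40, 4, 1), (40, 4, 0), (40, 4, 0),
    (40, 0, 0), (40, 0, 0), (40, 0, 0),
]

weights = fit_linear(TOY_DEFENDANTS)
print("intercept, age weight, priors weight:", weights)
print("accuracy on the toy data:", rule_accuracy(weights, TOY_DEFENDANTS))
```

On this made-up data the fitted age weight comes out negative and the priors weight positive, so younger defendants with more prior convictions score higher. The accuracy is measured on the same rows used for fitting, so it says nothing about how such a rule would perform on real cases.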
This is not as big a shock as it might seem. Dressel and Farid make much of the claim that COMPAS supposedly takes into account 137 different factors when making its prediction. A statement from Equivant, the company that makes the software, points out that those 137 are only for evaluating interventions; prediction of recidivism uses only six factors. (The rest of the statement amounts to “this shows that our software is pretty good.”) Dressel and Farid also recognize that re-arrest is an imperfect measure of future criminal activity, as some crimes do not lead to arrests, and there are significant racial biases in arrests.
What to make of all this comes down to whether you’re comfortable with a process that is wrong about a third of the time influencing things like how much time people spend in prison. At the moment, however, there is no evidence of anything more effective.
Science Advances, 2017. DOI: 10.1126/sciadv.aao5580.