The good, the bad and the ugly big data

Helge Helguson Neumann
ESST MA Student

There is a hurricane on its way. You run to the store to stock up on essentials, preparing for the worst. What do you buy? Strawberry Pop-Tarts, apparently.

Using big data, Walmart found out that Americans buy seven times as many Pop-Tarts whenever a storm is brewing. Also using big data, Google guides you away from traffic jams by collecting millions of cellphone signals. Facebook recommends top stories based on its vision of you: a vision consisting of codified numbers generated by the actions of you and your kind. The atoms of cyber-you. With these data, companies are able to make sense of a world that does not even make sense to us. But is it all put to good use? Here’s a good, a bad and an ugly example of how big data can be used.

Big data works as follows: you take a huge amount of data and use a computer to run algorithms that find some sort of pattern in the mess. It is used to answer trivial questions, such as fast-food preferences at different times of the day (at night, women like Thai food, men like Turkish, and everybody loves pizza), but also to make morally fraught choices, like categorizing people. The data itself is not dangerous, but our interpretation and use of it can have dire consequences. When we choose to use big data for big decisions, we need to take into account that big data can be accurate and helpful, but also unfair and racist.
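To make that concrete, here is a minimal sketch of that kind of pattern-finding, in the spirit of the Walmart example. Every record and number below is invented for illustration; nothing comes from Walmart’s actual systems.

```python
# A minimal sketch of pattern mining: given (hypothetical) transaction
# records, compare how often an item sells under a storm warning versus
# on an ordinary day. All data here is made up.

from collections import Counter

# Each record: (item, storm_warning_in_effect)
transactions = [
    ("pop_tarts", True), ("pop_tarts", True), ("milk", True),
    ("pop_tarts", False), ("milk", False), ("bread", False),
    ("pop_tarts", True), ("bread", True), ("milk", False),
]

def lift(item):
    """Ratio of the item's purchase rate during storms to its baseline rate."""
    storm = [t for t in transactions if t[1]]
    calm = [t for t in transactions if not t[1]]
    storm_rate = Counter(i for i, _ in storm)[item] / len(storm)
    calm_rate = Counter(i for i, _ in calm)[item] / len(calm)
    return storm_rate / calm_rate

print(f"Pop-Tarts storm lift: {lift('pop_tarts'):.1f}x")
```

A lift of 2.4 means Pop-Tarts sell 2.4 times as often under a storm warning as on a calm day in this made-up data; real systems run the same comparison across millions of transactions.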

“Big data” has become a buzzword in recent years, largely due to its promising possibilities for firms and customers, governments and citizens. It helps Walmart stock up on Pop-Tarts when a storm is brewing. It helps the Norwegian government reveal fraudulent behaviour in the welfare system. But maybe most promising is its use within the health sector. Using big data from search engines and deep-learning artificial intelligence, Google is showing promising results in predicting cancer before doctors can. How? By combing through tons and tons of search data for patterns linking search terms to cancer diagnoses. This is perhaps the biggest advantage of big data: the ability to find interesting and useful correlations where we previously didn’t know there were any.

Not all big data reveals these sorts of correlations. In some cases the data is barely correlated at all, yet it is still used for decision-making. Back in 2012, Sarah Wysocki started her job as a teacher in Washington, D.C. After some time, she was evaluated by her superiors and scored highly. She was motivating, good at teaching, and the kids liked her. Two weeks later she was fired. According to a recently implemented teacher evaluation system, IMPACT, she was not suited to be a teacher. IMPACT was put in place alongside human evaluation to score teachers on statistical measures, in order to better decide which teachers to hire and fire. Unfortunately for Wysocki, the evaluation paid more attention to the data than to the person. The data said, in particular, that she was not effective enough. Yet when you look at the data, Wysocki had reason to be upset. A scatterplot of the “effectiveness” scores showed little sign of a consistent pattern and looked more like a starry night in the desert. The data did not reveal a meaningful correlation: r = 0.25, about the same as the correlation between height and ice-cream preferences.
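To get a feel for how weak r = 0.25 is, here is a small sketch with purely synthetic data (nothing below comes from the IMPACT system): two scores constructed to share just enough signal to correlate at roughly 0.25, checked with a plain Pearson formula.

```python
# Synthetic illustration of how little r = 0.25 actually says.
# None of this is IMPACT data; the numbers are generated on the spot.

import random
import statistics

random.seed(0)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two "effectiveness" scores that correlate at roughly 0.25:
# score2 is a quarter shared signal, the rest independent noise.
score1 = [random.gauss(0, 1) for _ in range(1000)]
score2 = [0.25 * x + (1 - 0.25**2) ** 0.5 * random.gauss(0, 1) for x in score1]

print(f"r = {pearson_r(score1, score2):.2f}")
# Plotting score1 against score2 gives the "starry night" scatter the
# article describes: barely any visible trend.
```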

Misusing big data can have a severe impact on individuals, costing them their jobs or excluding them from insurance policies. It is all based on the atoms of cyber-you: what you are, what you have done, and, even more importantly, what people similar to you have done. When those characteristics include being black, poor, and American, you are, according to big data, in trouble.

American police departments have started to use predictive analytics based on big data sets. However, the statistics fed into those data sets are not always trustworthy. For example, while black and white Americans smoke marijuana at the same rate, black smokers are four times as likely to be arrested for it. Similarly, the police patrol poor, black neighbourhoods more frequently. More patrols lead to more arrests, and more arrests mean more points in the data sets. And once in prison, chances are you will return.
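That feedback loop can be shown with a toy simulation. Everything below is invented: two neighbourhoods with identical underlying crime rates, patrols sent wherever the recorded numbers are highest, and crime entering the data set only where officers are present.

```python
# A toy feedback-loop simulation; all numbers are invented.
import random

random.seed(1)

true_rate = {"A": 0.5, "B": 0.5}   # identical underlying crime rates
recorded = {"A": 12, "B": 10}      # a small initial imbalance in the data

for year in range(20):
    # The "predictive" step: patrol hardest where the data shows most crime.
    target = max(recorded, key=recorded.get)
    for hood in true_rate:
        patrol_hours = 80 if hood == target else 20
        # Crime only enters the data set where officers are present to see it.
        recorded[hood] += sum(
            random.random() < true_rate[hood] for _ in range(patrol_hours)
        )

print(recorded)  # the initial two-arrest gap has grown enormously
```

Both neighbourhoods offend at exactly the same rate, yet after twenty simulated years the data “confirms” that the more-patrolled one is the problem area.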

Courtrooms in America increasingly use big data to estimate a defendant’s likelihood of committing future crimes. ProPublica, a nonprofit investigative newsroom, dug into the numbers and found some disturbing news. The predictions were heavily biased against black people, systematically rating black defendants as more likely to reoffend. Not only were the predictions barely more accurate than a coin toss; they were also twice as likely to falsely flag black defendants as future criminals. Recently, American courtrooms have taken it a step further by using such algorithms in sentencing. And because the algorithms are proprietary, the defendant cannot question the score, inspect the algorithm, or see what data has been fed into it.

Big data carries an inherent risk of making history repeat itself. We use data from past experiences to predict the future, and in doing so we nudge the future in that direction. If Walmart finds out we bought Pop-Tarts during the last storm, it will put Pop-Tarts by the cashier before the next storm, making us more likely to buy them. If the police build their data sets from poor, black neighbourhoods, chances are those same neighbourhoods will be targeted next. In the end, big data is only numbers. It is up to us to interpret those numbers, and to decide which ones to use and which to discard.

Photo: © posteriori/Shutterstock