Interpret Big Data with Caution

One caution when employing big data (or any other data-reliant technique) is that practitioners tend to be overconfident in their understanding of the inputs and their interpretation of the outputs. It sounds like a fundamental point, but if one does not have a strong understanding of what the incoming data signifies, the interpretation of the output is highly likely to be biased. The same holds for sampling in general: if the sample is not representative of the larger whole, bias will occur. For example:

“Consider Boston’s Street Bump smartphone app, which uses a phone’s accelerometer to detect potholes without the need for city workers to patrol the streets. As citizens of Boston download the app and drive around, their phones automatically notify City Hall of the need to repair the road surface.” [1]

One would be tempted to conclude that the data feeding into the app reasonably represents all of the potholes in the city. In actuality, the data represented potholes in areas inhabited by young, affluent smartphone owners. The city runs the risk of neglecting areas where older, less affluent residents without smartphones experience potholes, and those areas make up a significant portion of the city.
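
To make the sampling problem concrete, here is a minimal sketch in Python. Every number in it is hypothetical and invented for illustration; none of it comes from the Street Bump app or from Boston. It assumes two areas with the same number of real potholes, and assumes each pothole gets reported with a probability equal to the share of local drivers running the app.

```python
# Hypothetical figures, invented for illustration only; not Street Bump data.
neighborhoods = {
    # name: (actual potholes, assumed share of drivers running the app)
    "affluent_area": (100, 0.60),
    "less_affluent_area": (100, 0.15),
}


def expected_reports(potholes, app_share):
    """Expected number of reported potholes, assuming each pothole is
    reported with probability equal to the local app share."""
    return potholes * app_share


# Expected reports per area under the assumption above.
reports = {
    name: expected_reports(potholes, share)
    for name, (potholes, share) in neighborhoods.items()
}

total_reports = sum(reports.values())
total_potholes = sum(potholes for potholes, _ in neighborhoods.values())

# What the app data suggests versus what is actually on the streets.
report_share = {name: r / total_reports for name, r in reports.items()}
true_share = {
    name: potholes / total_potholes
    for name, (potholes, _) in neighborhoods.items()
}

for name in neighborhoods:
    print(
        f"{name}: {true_share[name]:.0%} of potholes, "
        f"{report_share[name]:.0%} of reports"
    )
```

Under these made-up numbers, each area holds 50% of the potholes, yet the affluent area produces 80% of the reports and the other area only 20%. A city that allocated repairs in proportion to reports would send four times as many crews to one area, even though the need is identical.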

“As we move into an era in which personal devices are seen as proxies for public needs, we run the risk that already existing inequities will be further entrenched. Thus, with every big data set, we need to ask which people are excluded. Which places are less visible? What happens if you live in the shadow of big data sets?” [2]

[1] http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html

[2] https://hbr.org/2013/04/the-hidden-biases-in-big-data/
