Big Data as statistical masturbation

Infinite Book Tunnel

It’s just possible that there is a looming crisis in yet another technological sector whose proponents have leaped too far ahead, and too soon, promising all kinds of things they are unable to deliver. It strange how we keep ramming our head into this same damned wall, but this next crisis is perhaps more important than deflated hype at other times, say our over optimism about the timeline for human space flight in the 1970’s, or the “AI winter” in the 1980’s, or the miracles that seemed just at our fingertips when we cracked the Human Genome while pulling riches out of the air during the dotcom boom- both of which brought us to a state of mania in the 1990’s and early 2000’s.

The thing that separates a potentially new crisis in the area of so-called “Big-Data” from these earlier ones is that, literally overnight, we have reconstructed much of our economy, national security infrastructure and in the process of eroding our ancient right privacy on it’s yet to be proven premises. Now, we are on the verge of changing not just the nature of the science upon which we all depend, but nearly every other field of human intellectual endeavor. And we’ve done and are doing this despite the fact that the the most over the top promises of Big Data are about as epistemologically grounded as divining the future by looking at goat entrails.

Well, that might be a little unfair. Big Data is helpful, but the question is helpful for what? A tool, as opposed to a supposedly magical talisman has its limits, and understanding those limits should lead not to our jettisoning the tool of large scale data based analysis, but what needs to be done to make these new capacities actually useful rather than, like all forms of divination, comforting us with the idea that we can know the future and thus somehow exert control over it, when in reality both our foresight and our powers are much more limited.

Start with the issue of the digital economy. One model underlies most of the major Internet giants- Google, FaceBook and to a lesser extent Apple and Amazon, along with a whole set of behemoths who few of us can name but that underlie everything we do online, especially data aggregators such as Axicom. That model is to essentially gather up every last digital record we leave behind, many of them gained in exchange for “free” services and using this living archive to target advertisements at us.

It’s not only that this model has provided the infrastructure for an unprecedented violation of privacy by the security state (more on which below) it’s that there’s no real evidence that it even works.

Just anecdotally reflect on your own personal experience. If companies can very reasonably be said to know you better than your mother, your wife, or even you know yourself, why are the ads coming your way so damn obvious, and frankly even oblivious? In my own case, if I shop online for something, a hammer, a car, a pair of pants, I end up getting ads for that very same type of product weeks or even months after I have actually bought a version of the item I was searching for.

In large measure, the Internet is a giant market in which we can find products or information. Targeted ads can only really work if they are able refract in their marketed product’s favor the information I am searching for, if they lead me to buy something I would not have purchased in the first place. Derek Thompson, in the piece linked to above points out that this problem is called Endogeneity, or more colloquially: “hell, I was going to buy it anyway.”

The problem with this economic model, though, goes even deeper than that. At least one-third of clicks on digital ads aren’t human beings at all but bots that represent a way of gaming advertising revenue like something right out of a William Gibson novel.

Okay, so we have this economic model based on what at it’s root is really just spyware, and despite all the billions poured into it, we have no idea if it actually affects consumer behavior. That might be merely an annoying feature of the present rather than something to fret about were it not for the fact that this surveillance architecture has apparently been captured by the security services of the state. The model is essentially just a darker version of its commercial forbearer. Here the NSA, GCHQ et al hoover up as much of the Internet’s information as they can get their hands on. Ostensibly, their doing this so they can algorithmically sort through this data to identify threats.

In this case, we have just as many reasons to suspect that it doesn’t really work, and though they claim it does, none of these intelligence agencies will actually look at their supposed evidence that it does. The reasons to suspect that mass surveillance might suffer similar flaws as mass “personalized” marketing, was excellently summed up in a recent article in the Financial Times Zeynep Tufekci when she wrote:

But the assertion that big data is “what it’s all about” when it comes to predicting rare events is not supported by what we know about how these methods work, and more importantly, don’t work. Analytics on massive datasets can be powerful in analysing and identifying broad patterns, or events that occur regularly and frequently, but are singularly unsuited to finding unpredictable, erratic, and rare needles in huge haystacks. In fact, the bigger the haystack — the more massive the scale and the wider the scope of the surveillance — the less suited these methods are to finding such exceptional events, and the more they may serve to direct resources and attention away from appropriate tools and methods.

I’ll get to what’s epistemologically wrong with using Big Data in the way used by the NSA that Tufekci rightly criticizes in a moment, but on a personal, not societal level, the biggest danger from getting the capabilities of Big Data wrong seems most likely to come through its potentially flawed use in medicine.

Here’s the kind of hype we’re in the midst of as found in a recent article by Tim Mcdonnell in Nautilus:

We’re well on our way to a future where massive data processing will power not just medical research, but nearly every aspect of society. Viktor Mayer-Schönberger, a data scholar at the University of Oxford’s Oxford Internet Institute, says we are in the midst of a fundamental shift from a culture in which we make inferences about the world based on a small amount of information to one in which sweeping new insights are gleaned by steadily accumulating a virtually limitless amount of data on everything.

The value of collecting all the information, says Mayer-Schönberger, who published an exhaustive treatise entitled Big Data in March, is that “you don’t have to worry about biases or randomization. You don’t have to worry about having a hypothesis, a conclusion, beforehand.” If you look at everything, the landscape will become apparent and patterns will naturally emerge.

Here’s the problem with this line of reasoning, a problem that I think is the same, and shares the same solution to the issue of mass surveillance by the NSA and other security agencies. It begins with this idea that “the landscape will become apparent and patterns will naturally emerge.”

The flaw that this reasoning suffers has to do with the way very large data sets work. One would think that the fact that sampling millions of people, which we’re now able to do via ubiquitous monitoring, would offer enormous gains over the way we used to be confined to population samples of only a few thousand, yet this isn’t necessarily the case. The problem is the larger your sample size the greater your chance at false correlations.

Previously I had thought that surely this is a problem that statisticians had either solved or were on the verge of solving. They’re not, at least according to the computer scientist Michael Jordan, who fears that we might be on the verge of a “Big Data winter” similar to the one AI went through in the 1980’s and 90’s. Let’s say you had an extremely large database with multiple forms of metrics:

Now, if I start allowing myself to look at all of the combinations of these features—if you live in Beijing, and you ride bike to work, and you work in a certain job, and are a certain age—what’s the probability you will have a certain disease or you will like my advertisement? Now I’m getting combinations of millions of attributes, and the number of such combinations is exponential; it gets to be the size of the number of atoms in the universe.

Those are the hypotheses that I’m willing to consider. And for any particular database, I will find some combination of columns that will predict perfectly any outcome, just by chance alone. If I just look at all the people who have a heart attack and compare them to all the people that don’t have a heart attack, and I’m looking for combinations of the columns that predict heart attacks, I will find all kinds of spurious combinations of columns, because there are huge numbers of them.

The actual mathematics of sorting out spurious from potentially useful correlations from being distinguished is, in Jordan’s estimation, far from being worked out:

We are just getting this engineering science assembled. We have many ideas that come from hundreds of years of statistics and computer science. And we’re working on putting them together, making them scalable. A lot of the ideas for controlling what are called familywise errors, where I have many hypotheses and want to know my error rate, have emerged over the last 30 years. But many of them haven’t been studied computationally. It’s hard mathematics and engineering to work all this out, and it will take time.

It’s not a year or two. It will take decades to get right. We are still learning how to do big data well.

Alright, now that’s a problem. As you’ll no doubt notice the danger of false correlation that Jordan identifies as a problem for science is almost exactly the same critique Tufekci made against the mass surveillance of the NSA. That is, unless the NSA and its cohorts have actually solved the statistical/engineering problems Jordan identified and haven’t told us, all the biggest data haystack in the world is going to lead to is too many leads to follow, most of them false, and many of which will drain resources from actual public protection. Perhaps equally troubling: if security services have solved these statistical/engineering problems how much will be wasted in research funding and how many lives will be lost because medical scientists were kept from the tools that would have empowered their research?

At least part of the solution to this will be remembering why we developed statistical analysis in the first place. Herbert I. Weisberg with his recent book Willful Ignorance: The Mismeasure of Uncertainty has provided a wonderful, short primer on the subject.

Statistical evidence, according to Weisberg was first introduced to medical research back in the 1950’s as a protection against exaggerated claims to efficacy and widespread quackery. Since then we have come to take the p value .05 almost as the truth itself. Weisberg’s book is really a plea to clinicians to know their patients and not rely almost exclusively on statistical analyses of “average” patients to help those in their care make life altering decisions in terms of what medicines to take or procedures to undergo. Weisberg thinks that personalized medicine will over the long term solve these problems, and while I won’t go into my doubts about that here, I do think, in the experience of the physician, he identifies the root to the solution of our Big Data problem.

Rather than think of Big Data as somehow providing us with a picture of reality, “naturally emerging” as Mayer-Schönberger quoted above suggested we should start to view it as a way to easily and cheaply give us a metric for the potential validity of a hypothesis. And it’s not only the first step that continues to be guided by old fashioned science rather than computer driven numerology but the remaining steps as well, a positive signal followed up by actual scientist and other researchers doing such now rusting skills as actual experiments and building theories to explain their results. Big Data, if done right, won’t end up making science a form of information processing, but will instead be used as the primary tool for keeping scientist from going down a cul-de-sac.

The same principle applied to mass surveillance means a return to old school human intelligence even if it now needs to be empowered by new digital tools. Rather than Big Data being used to hoover up and analyze all potential leads, espionage and counterterrorism should become more targeted and based on efforts to understand and penetrate threat groups themselves. The move back to human intelligence and towards more targeted surveillance rather than the mass data grab symbolized by Bluffdale may be a reality forced on the NSA et al by events. In part due to the Snowden revelations terrorist and criminal networks have already abandoned the non-secure public networks which the rest of us use. Mass surveillance has lost its raison d’etre.

At least it terms of science and medicine, I recently saw a version of how Big Data done right might work. In an article for Qunta and Scientific American by Veronique Greenwood she discussed two recent efforts by researchers to use Big Data to find new understandings of and treatments for disease.

The physicist (not biologist) Stefan Thurner has created a network model of comorbid diseases trying to uncover the hidden relationships between different, seemingly unrelated medical conditions. What I find interesting about this is that it gives us a new way of understanding disease, breaking free of hermetically sealed categories that may blind us to underlying shared mechanisms by medical conditions. I find this especially pressing where it comes to mental health where the kind of symptom listing found in the DSM- the Bible for mental health care professionals- has never resulted in a causative model of how conditions such as anxiety or depression actually work and is based on an antiquated separation between the mind and the body not to mention the social and environmental factors that all give shape to mental health.

Even more interesting, from Greenwood’s piece, are the efforts by Joseph Loscalzo of Harvard Medical School to try and come up with a whole new model for disease that looks beyond genome associations for diseases to map out the molecular networks of disease isolating the statistical correlation between a particular variant of such a map and a disease. This relationship between genes and proteins correlated with a disease is something Loscalzo calls a “disease module”.

Thurner describes the underlying methodology behind his, and by implication Loscalzo’s, efforts to Greenwood this way:

Once you draw a network, you are drawing hypotheses on a piece of paper,” Thurner said. “You are saying, ‘Wow, look, I didn’t know these two things were related. Why could they be? Or is it just that our statistical threshold did not kick it out?’” In network analysis, you first validate your analysis by checking that it recreates connections that people have already identified in whatever system you are studying. After that, Thurner said, “the ones that did not exist before, those are new hypotheses. Then the work really starts.

It’s the next steps, the testing of hypotheses, the development of a stable model where the most important work really lies. Like any intellectual fad, Big Data has its element of truth. We can now much more easily distill large and sometimes previously invisible patterns from the deluge of information in which we are now drowning. This has potentially huge benefits for science, medicine, social policy, and law enforcement.

The problem comes from thinking that we are at the point where our data crunching algorithms can do the work for us and are about to replace the human beings and their skills at investigating problems deeply and in the real world. The danger there would be thinking that knowledge could work like self-gratification a mere thing of the mind without all the hard work, compromises, and conflict between expectations and reality that goes into a real relationship. Ironically, this was a truth perhaps discovered first not by scientists or intelligence agencies but by online dating services. To that strange story, next time….

Utopia or Dystopia

where past meets future

Big Data as statistical masturbation

2 comments on “Big Data as statistical masturbation”

Leave a comment Cancel reply

Share this:

Related

2 comments on “Big Data as statistical masturbation”

Leave a comment Cancel reply