Data Mining Sucks

And by “suck” I mean to say that data mining of the type probably being used by the NSA isn’t likely to be very effective in finding that terrorist needle in the American haystack. At least, that’s the assessment of a recent report issued by the Cato Institute, Effective Counterterrorism and the Limited Role of Predictive Data Mining (downloadable as a .pdf file here).

The authors of the study are Jeff Jonas, chief scientist with IBM’s Entity Analytic Solutions Group, “one of the country’s leading practitioners of the dark art of data analysis” and the author of software commonly used by major casinos and the CIA to discern hidden patterns in databases; and Jim Harper of the Cato Institute. Here are a few relevant excerpts from the report:

…Though data mining has many valuable uses, it is not well suited to the terrorist discovery problem. It would be unfortunate if data mining for terrorist discovery had currency within national security, law enforcement and technology circles because pursuing this use of data mining would waste taxpayer dollars, needlessly infringe on privacy and civil liberties, and misdirect the valuable time and energy of the men and women in the national security community[…]

One of the fundamental underpinnings of predictive data mining in the commercial sector is the use of training patterns. Corporations that study consumer behavior have millions of patterns that they can draw upon to profile their typical or ideal consumer. Even when data mining is used to seek out instances of identity and credit card fraud, this relies on models constructed using many thousands of known examples of fraud per year.

Terrorism has no similar indicia. With a relatively small number of attempts every year and only one or two major terrorist incidents every few years—each one distinct in terms of planning and execution—there are no meaningful patterns that show what behavior indicates planning or preparation for terrorism. Unlike consumers’ shopping habits and financial fraud, terrorism does not occur with enough frequency to enable the creation of valid predictive models.

Jonas and Harper make the point that unless data mining gives you not just more information but more useful information, it will simply waste valuable resources, both time and money, that could be better spent elsewhere. Here is how they define that term:

“Useful information” is information that puts the analyst in a position to act appropriately in a given context. It is the usefulness of the result — the fact that it can be used effectively for a given purpose — that establishes the value of any given algorithm.

The authors of this report don’t believe that data mining to find terrorist suspects is going to produce any “useful information,” and they make a good case for that claim. Instead, they argue that predictive data mining to find terrorists will provide information, all right, lots of information, but much of that information will be so vague and ambiguous, or simply flat-out wrong, as to make it effectively useless:

Unlike consumers’ shopping habits and financial fraud, terrorism does not occur with enough frequency to enable the creation of valid predictive models. Predictive data mining for the purpose of turning up terrorist planning using all available demographic and transactional data points will produce no better results than the highly sophisticated commercial data mining done today. The one thing predictable about predictive data mining for terrorism is that it would be consistently wrong.

Without patterns to use, one fallback for terrorism data mining is the idea that any anomaly may provide the basis for investigation of terrorism planning. Given a “typical” American pattern of Internet use, phone calling, doctor visits, purchases, travel, reading, and so on, perhaps all outliers merit some level of investigation. This theory is offensive to traditional American freedom, because in the United States everyone can and should be an “outlier” in some sense. More concretely, though, using data mining in this way could be worse than searching at random; terrorists could defeat it by acting as normally as possible.

Treating “anomalous” behavior as suspicious may appear scientific, but, without patterns to look for, the design of a search algorithm based on anomaly is no more likely to turn up terrorists than twisting the end of a kaleidoscope is likely to draw an image of the Mona Lisa. […]

What data mining is most likely to create is millions of “false positives” in which people would be wrongly identified as possible terrorist suspects merely on the basis of their “anomalous” behavior with respect to online activities, financial transactions, purchases, etc.

Assuming a 99 percent accuracy rate, searching our population of nearly 300,000,000, some 3,000,000 people would be identified as potential terrorists.

The expense incurred in attempting to narrow down this list, separating actual suspects from those falsely identified, would prove counterproductive. It would result in lost time and a misallocation of resources. Efforts directed toward tracking down genuine leads would be swamped by the effort needed to follow the many false leads that predictive data mining of this scale is guaranteed to produce.
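To make the false-positive arithmetic concrete, here is a minimal sketch in Python. It is an illustration only and does not come from the Cato report. The population and the 99 percent accuracy rate are the figures used above; the number of actual terrorists (1,000) is a purely hypothetical assumption, and the sketch further assumes that "99 percent accurate" means the system is wrong about 1 percent of innocent people and 1 percent of real terrorists alike.

```python
# Illustrative base-rate arithmetic; not taken from the Cato report.
POPULATION = 300_000_000   # "nearly 300,000,000", the figure used in the post
ACCURACY = 0.99            # the post's assumed accuracy rate
ACTUAL_TERRORISTS = 1_000  # HYPOTHETICAL number, chosen only for illustration


def screen(population: int, accuracy: float, actual: int) -> dict:
    """Count who gets flagged if the same error rate applies to everyone."""
    innocents = population - actual
    # Innocent people wrongly flagged as suspects.
    false_positives = round(innocents * (1 - accuracy))
    # Real terrorists correctly flagged.
    true_positives = round(actual * accuracy)
    flagged = false_positives + true_positives
    return {
        "false_positives": false_positives,
        "true_positives": true_positives,
        "flagged": flagged,
        # Fraction of flagged people who are actually terrorists.
        "share_real": true_positives / flagged,
    }


if __name__ == "__main__":
    result = screen(POPULATION, ACCURACY, ACTUAL_TERRORISTS)
    for name, value in result.items():
        print(f"{name}: {value}")
```

Under these assumptions, roughly 3,000,000 people are flagged and only 990 of them are real, so an investigator would have to clear about three thousand innocent people for every genuine lead. Changing the hypothetical count of actual terrorists barely moves the number of false positives, because that number is driven almost entirely by the size of the innocent population.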

If you have the time, please read the entire report. It’s an eye opener in more ways than one.

3 Comments

rdf on December 13, 2006 at 3:42 pm

Data mining is just the latest in a long series of programs by governments to spy on their citizens. The objective is always the same, regardless of the reasons expressed publicly – to control political opposition to the regime in power.

Even when the original motivation was to anticipate outside threats, once the apparatus is in place it quickly becomes redirected. In the present case the US government isn’t so stupid as not to realize they are not going to find a needle in a haystack, especially when they don’t know it is a needle they are looking for. Just in the past few years we have seen many cases of spying against citizen groups exposed, from the DoD keeping tabs on Quakers to the NYC police photographing people at public rallies. And these are only the ones we know about. Who is to say what the FBI, CIA and NSA are really up to.

I have a short essay on this topic, mostly from an historical perspective:

Surveillance vs Civil Liberties

I can only repeat the tag line from my essay:

Only an open society can be a free society.

dada on December 13, 2006 at 4:59 pm

it’s not about homeland security, it’s not about stopping or interdicting terrorist acts, it’s about power…it’s nothing more than a publicly funded criminal, i.e., illegal, enterprise in service of those who control the dollars and the politicians.

joe and suzie six-pack have nothing to worry about…anyone with an opposing viewpoint or opinion, and an avenue/venue for expressing same, does.

this is the remnant, improved, re-invigorated and much more potent, of a system of intelligence gathering for political purposes that became SOP during the Nixon years…it’s just gone hi-tech.

the short version of my 2¢

Kidspeak on December 14, 2006 at 2:32 am

On Dec. 3, Clive Thompson, writing in the NYT magazine, had a very provocative article on modern methods of doing “data mining”: Open Source Spying. It is not about the linear, find-a-needle-in-a-haystack model advocated by the likes of the Bushite administration and used in traditional intelligence agency work, but rather about examining the topics, and the traffic in those topics, out there on the internet. The article describes the methods in a more or less understandable fashion, at least to me, given that I’ve had a lot of stat training and know how the techniques work in terms of looking at human behavior.

I highly recommend the article, though it is behind the unfortunate “wall” that the Times has set up. If you have access, it’s an interesting read that speaks to the issues raised in this diary.

About The Author

Steven D
