And by “suck” I mean to say that data mining of the type probably being used by the NSA isn’t likely to be very effective in finding that terrorist needle in the American haystack. At least, that’s the assessment of a recent report issued by the Cato Institute, Effective Counterterrorism and the Limited Role of Predictive Data Mining (downloadable as a .pdf file here).
The authors of the study are Jeff Jonas, chief scientist with IBM’s Entity Analytic Solutions Group, “one of the country’s leading practitioners of the dark art of data analysis” and the author of software commonly used by major casinos and the CIA to discern hidden patterns in databases; and Jim Harper of the Cato Institute. Here are a few relevant excerpts from the report:
…Though data mining has many valuable uses, it is not well suited to the terrorist discovery problem. It would be unfortunate if data mining for terrorist discovery had currency within national security, law enforcement and technology circles because pursuing this use of data mining would waste taxpayer dollars, needlessly infringe on privacy and civil liberties, and misdirect the valuable time and energy of the men and women in the national security community[…]
One of the fundamental underpinnings of predictive data mining in the commercial sector is the use of training patterns. Corporations that study consumer behavior have millions of patterns that they can draw upon to profile their typical or ideal consumer. Even when data mining is used to seek out instances of identity and credit card fraud, this relies on models constructed using many thousands of known examples of fraud per year.
Terrorism has no similar indicia. With a relatively small number of attempts every year and only one or two major terrorist incidents every few years—each one distinct in terms of planning and execution—there are no meaningful patterns that show what behavior indicates planning or preparation for terrorism. Unlike consumers’ shopping habits and financial fraud, terrorism does not occur with enough frequency to enable the creation of valid predictive models.
Jonas and Harper make the point that unless data mining can give you not just more information but more useful information, it will simply waste valuable resources, both time and money, that could be better spent elsewhere. As they define that term …
“Useful information” is information that puts the analyst in a position to act appropriately in a given context. It is the usefulness of the result — the fact that it can be used effectively for a given purpose — that establishes the value of any given algorithm.
The authors of this report don’t believe that data mining to find terrorist suspects is going to produce any “useful information,” and they make a good case for that claim. Instead, they argue that predictive data mining to find terrorists will provide information, all right, lots of information, but much of that information will be so vague and ambiguous, or simply flat-out wrong, as to make it effectively useless:
Predictive data mining for the purpose of turning up terrorist planning using all available demographic and transactional data points will produce no better results than the highly sophisticated commercial data mining done today. The one thing predictable about predictive data mining for terrorism is that it would be consistently wrong.
Without patterns to use, one fallback for terrorism data mining is the idea that any anomaly may provide the basis for investigation of terrorism planning. Given a “typical” American pattern of Internet use, phone calling, doctor visits, purchases, travel, reading, and so on, perhaps all outliers merit some level of investigation. This theory is offensive to traditional American freedom, because in the United States everyone can and should be an “outlier” in some sense. More concretely, though, using data mining in this way could be worse than searching at random; terrorists could defeat it by acting as normally as possible.
Treating “anomalous” behavior as suspicious may appear scientific, but, without patterns to look for, the design of a search algorithm based on anomaly is no more likely to turn up terrorists than twisting the end of a kaleidoscope is likely to draw an image of the Mona Lisa. […]
What data mining is most likely to create is millions of “false positives” in which people would be wrongly identified as possible terrorist suspects merely on the basis of their “anomalous” behavior with respect to online activities, financial transactions, purchases, etc.
Assuming a 99 percent accuracy rate, searching our population of nearly 300,000,000, some 3,000,000 people would be identified as potential terrorists.
The expense incurred in attempting to sort this list of identified suspects into actual suspects and those falsely identified would prove counterproductive. It would result in lost time and a misallocation of resources. Efforts directed toward tracking down genuine leads would be swamped by the effort needed to follow the many false leads that predictive data mining of this scale is guaranteed to produce.
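For anyone who wants to check the arithmetic behind that 3,000,000 figure, here is a rough sketch in Python. The population and the 99 percent accuracy rate come from the report excerpt above. Everything else is my own assumption for illustration: the function name, the idea that "accuracy" applies equally to innocent people and to actual plotters, and especially the count of 1,000 real targets, which is a made-up number and not something the report claims.

```python
def screening_outcome(population, accuracy, true_targets):
    """Estimate the results of screening a whole population.

    Assumes (my simplification, not the report's) that each person is
    classified correctly with probability `accuracy`, whether or not
    they are a real target.
    Returns (false_positives, true_positives, precision).
    """
    innocents = population - true_targets
    false_positives = innocents * (1 - accuracy)   # innocent people flagged
    true_positives = true_targets * accuracy       # real targets flagged
    flagged = false_positives + true_positives
    precision = true_positives / flagged if flagged else 0.0
    return false_positives, true_positives, precision


POPULATION = 300_000_000  # "nearly 300,000,000", from the report excerpt
ACCURACY = 0.99           # "99 percent accuracy rate", from the report excerpt
TRUE_TARGETS = 1_000      # hypothetical value, NOT from the report

fp, tp, precision = screening_outcome(POPULATION, ACCURACY, TRUE_TARGETS)
print(f"falsely flagged:   {fp:,.0f}")
print(f"correctly flagged: {tp:,.0f}")
print(f"share of flagged people who are real targets: {precision:.4%}")
```

Under these assumptions the falsely flagged group comes out at roughly three million, in line with the report's figure, while the real targets make up only a tiny fraction of everyone flagged. Change TRUE_TARGETS to whatever number you find plausible; as long as it is small next to the population, the false leads still dominate.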
If you have the time, please read the entire report. It’s an eye-opener in more ways than one.