What is it
Data mining – Wikipedia, the free encyclopedia
Data mining has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” Like artificial intelligence, it is an umbrella term and is used with varied meanings in a wide range of contexts.
What does it do
A simple example of data mining is its use in a retail sales department. If a store tracks the purchases of a customer and notices that he buys a lot of silk shirts, the data mining system will make a correlation between that customer and silk shirts. In this case, the data mining system used by the retail store discovered new information about the customer that was previously unknown to the company.
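The store’s correlation logic can be sketched in a few lines of Python. The purchase log, customer names, and the `item_share` helper below are all made up for illustration; they are not from any real system:

```python
# Hypothetical purchase log the store has tracked: (customer, item) pairs.
purchases = [
    ("joe", "silk shirt"), ("joe", "silk shirt"), ("joe", "silk tie"),
    ("joe", "socks"), ("ann", "jeans"), ("ann", "socks"), ("bob", "jeans"),
]

def item_share(log, customer, keyword):
    """Fraction of a customer's purchases that match a keyword."""
    mine = [item for who, item in log if who == customer]
    hits = sum(1 for item in mine if keyword in item)
    return hits / len(mine)

# Joe's purchases are 75% silk goods, so the system flags the
# correlation "joe -> silk" -- whether or not Joe actually wears silk.
print(item_share(purchases, "joe", "silk"))  # 0.75
```

Note that the correlation is purely statistical: the code has no way to know *why* Joe buys silk, which is exactly the weakness the next section is about.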
What Data mining doesn’t do: Filtering Noise
Except that maybe this is an error: the customer in the example, let’s call him Joe, bought the shirts as a gift for his father-in-law. Joe is even allergic to silk. And therein lies one of the drawbacks of data mining: it produces thousands of false positives.
Looking for the oddball: Are there needles in those haystacks?
The opposite strategy is used when searching for terrorists among the gobs of mundane data about the millions of people currently on US soil: data such as the fact that one of the terrorists had been in the country for less than two years, had 30 credit cards and a quarter million dollars’ debt. Or the number of Middle Eastern men taking flight lessons in the last two years, as Coleen Rowley noticed with her built-in data mining tool – her brain. But this kind of mining also yields thousands of false positives, i.e., thousands of innocent civilians get characterized as terrorists.
Thousands of false positives
Sure enough, someone in the blogosphere / MSM brought that up in the context of Able Danger:
Eric Umansky: Thousands of False Positives on Able Danger ?
The NYT’s Philip Shenon, who has done some of the Able Danger reporting, was interviewed Friday on WNYC. There, host Mike Pesca raised the false positives question. Here’s what Shenon said:
“I understand from others at the Pentagon that one of the problems here is that Able Danger came up with names not just of Atta and three others, it came up with a tremendous number of names of very decent American citizens.”
That sounds like a whole lot more than the “60” the Times suggested…Friday (i.e. same day as the radio interview). Are Shenon and Jehl on the same page?
Shenon’s comment matches my personal experience: as I said in my blog, data mining brings up a lot of false positives – in the hundreds with the relatively low volume of data I worked with. Thousands of false positives are very plausible with the gobs of data that Able Danger must have dealt with.
Using Data Mining to Find Terrorists: False positives
Those limitations are not only the musings of yours truly; they have also been raised by data mining experts such as Herb Edelstein, an internationally recognized expert in data mining, data warehousing and CRM who consults for both computer vendors and users. A popular speaker and teacher, he is also a co-founder of The Data Warehousing Institute.
False positives. Given the difficulty of developing good signatures and the small number of terrorists relative to the population of the United States, there are likely to be an enormous number of innocent people identified as potential terrorists (false positives). The more you try to avoid false positives, the more likely you are to miss many true positives. Unlike a direct mail campaign where the cost of a false positive is only a few dollars at worst, the costs in identifying terrorists – in dollars, time and wasted opportunity – are staggering. Suppose we had a collection of algorithms that has a false positive rate of only 0.1 percent – extraordinarily good for a problem of this complexity. That would mean 220,000 false positives! There are not enough investigators to investigate every false positive. Even if there were, the dollar cost would be in the billions, as would the cost of the resulting lawsuits. More importantly, the resources and amount of calendar time expended in these mostly useless investigations would likely leave many true terrorists free. Even if we concentrated only on non-citizens, we would still have more than 20,000 false positives to be vetted.
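Edelstein’s arithmetic is easy to verify with a back-of-the-envelope calculation. A minimal sketch, using only the figures from his quote (220 million people, 20 million non-citizens, a 0.1 percent false positive rate):

```python
# Back-of-the-envelope check of the false positive arithmetic quoted
# above: even an extraordinarily good 0.1% false positive rate,
# applied to a huge population, flags a huge number of innocents.
population = 220_000_000     # rough US figure used in the quote
noncitizens = 20_000_000     # the narrower non-citizen pool, per the quote
fp_rate = 0.001              # 0.1 percent false positive rate

print(int(population * fp_rate))    # 220000 innocent people flagged
print(int(noncitizens * fp_rate))   # 20000 even in the narrower pool
```

The point of the sketch is that the false positive count scales with the population screened, not with the number of actual terrorists – the classic base rate problem.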
Seeing what is not there: Data Dredging
Again from Wiki:
Data dredging… implies imposing patterns (and particularly causal relationships) on data where none exist. This imposition of irrelevant, misleading or trivial attribute correlation is more properly criticized as “data dredging” in the statistical literature… [it] implies scanning the data for any relationships, and then when one is found coming up with an interesting explanation. (This is also referred to as “overfitting the model”.)… A more significant danger is finding correlations that do not really exist. Investment analysts appear to be particularly vulnerable to this. [but gamblers are worse] “There have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it.”
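The roulette-watcher effect is easy to reproduce: scan enough random “indicators” against a random target and an impressive-looking pattern will turn up. A minimal sketch – every number below is pure synthetic noise, so any correlation it finds is by definition spurious:

```python
import random

random.seed(7)

# One random "target" series and 1,000 purely random "indicators":
# with this many comparisons, some correlation will look strong by chance.
n, cols = 30, 1000
target = [random.gauss(0, 1) for _ in range(n)]

def corr(xs, ys):
    """Pearson correlation of two equal-length series."""
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

best = max(
    abs(corr([random.gauss(0, 1) for _ in range(n)], target))
    for _ in range(cols)
)
print(round(best, 2))  # a strong-looking correlation, found in pure noise
```

Sadly enough, like the roulette watchers, the scan usually finds its “pattern” – which is why a correlation dredged from many candidates needs out-of-sample confirmation before anyone should believe it.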
Separating the Wheat from the Shaft
Of course, there are modern data mining programs that have sophisticated noise filtering and sample bias techniques. The Pentagon’s “Able Danger” system is likely to have used those `leading edge’ data mining techniques, such as the ones mentioned in the paper Data Mining For Very Busy People (copy here – pdf, lots of hairy math and statistics). Its authors illustrate how even a simple real-life problem can yield too many results, and how some state-of-the-art algorithms can sift through those. But then they miss true positives, as you can see in the paper’s golf example:
In the golf example, the best outcome is playing lots of golf. outlook = overcast always appears when the golfer plays lots of golf and never when the golfer plays no or some golf. The treatment is outlook = overcast, and this test finds four of the six best outcomes [but it loses 2 true positives: golfers play lots of golf on sunny days that aren’t too hot or too windy]
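The treatment-learner idea in the quote can be mimicked in a few lines. The dataset below is invented to mirror the 4-of-6 outcome described above; it is not copied from the paper:

```python
# A tiny dataset in the spirit of the paper's golf example:
# 6 of the days are "lots of golf" days, and the candidate
# treatment is outlook == "overcast".
days = [
    ("overcast", "lots"), ("overcast", "lots"),
    ("overcast", "lots"), ("overcast", "lots"),
    ("sunny", "lots"), ("sunny", "lots"),   # the 2 missed true positives
    ("sunny", "some"), ("rainy", "none"),
    ("rainy", "some"), ("sunny", "none"),
]

# Apply the treatment: keep only the overcast days, count best outcomes.
treated = [outcome for outlook, outcome in days if outlook == "overcast"]
caught = treated.count("lots")
total_best = sum(1 for _, outcome in days if outcome == "lots")
print(caught, total_best)  # 4 6 -- precise, but 2 true positives lost
```

The treatment is perfectly precise (every overcast day is a “lots of golf” day), yet it still misses a third of the best outcomes – the same precision-versus-recall trade-off Edelstein describes for terrorist screening.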
http://www.texasturkey.us/images/MINING_S.jpg
More from the expert on Using Data Mining to Find Terrorists
Data Mining In Depth: Using Data Mining to Find Terrorists
It was recently reported that a few days after the September 11 attacks, FBI agents visited one of the largest providers of consumer data. They did so to see if the 9/11 terrorists were in the database and quickly found five of them. One of the terrorists had been in the country for less than two years, had 30 credit cards and a quarter million dollars’ debt with a payment schedule of $9,800 per month. Mohammed Atta, the ringleader, had also been here less than two years and had 12 addresses under the names Mohammed Atta, Mohammed J. Atta, J. Atta and others. Surely, their report speculated, with patterns like this, we can use the databases we presently have to ferret out terrorists in our midst. Unfortunately, the answer is, “It depends.” There are limitations in using these so-called patterns of the agents’ observations. We need to ask, first, how the records were found and, second, if the observed characteristics are indeed repeated patterns or merely isolated instances. Because I am not privy to any knowledge other than what was published in the report, my analysis is based on surmise.
More than likely, the FBI started their search with database queries using the suspected terrorists’ names and likely variants. They found the terrorists’ records and then noticed the number of credit cards, addresses and the amount of debt. However, they probably would not have known in advance to look for these attributes. Furthermore, the terrorists’ records probably didn’t show that they had been in the country for only two years; that is knowledge the FBI brought to the search.
We also don’t know how easily the observations generalize to other terrorists or how many non-terrorists have these same attributes. Combing the database for people who have a number of credit cards, big debts or multiple addresses would undoubtedly yield both criminals (most of whom aren’t terrorists) and perfectly innocent folks.
The large number of addresses for Atta may be an even more difficult screening criterion to use, considering that we don’t know the names of unknown terrorists, let alone their aliases. It would be nearly impossible to conduct an aggregation across the hundreds of millions of individuals in this database to calculate the number of addresses, especially because all a terrorist would have to do to defeat such a search is use different aliases.
As I indicated last month in Data Mining In Depth: TIAin’t, we don’t have enough known terrorists or a consistent set of behaviors to use data mining to build predictive models. Thus, it would not be particularly productive to search for a signature.