What is it
Data mining – Wikipedia, the free encyclopedia
Data mining has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” Data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.
What does it do
A simple example of data mining is its use in a retail sales department. If a store tracks the purchases of a customer and notices that the customer buys a lot of silk shirts, the data mining system will make a correlation between that customer and silk shirts. In this case, the data mining system used by the retail store discovered new information about the customer that was previously unknown to the company.
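As a rough illustration of the retail example, here is a minimal Python sketch. The purchase log, the customer names and the threshold of three purchases are all made up for illustration; a real system would work on far more data and use a proper statistical measure rather than a raw count.

```python
from collections import Counter

# Hypothetical purchase log of (customer, item) pairs.
# Every name and item here is invented for illustration.
purchases = [
    ("joe", "silk shirt"),
    ("joe", "silk shirt"),
    ("joe", "silk shirt"),
    ("joe", "socks"),
    ("ann", "silk shirt"),
    ("ann", "jeans"),
]


def frequent_pairs(log, min_count=3):
    """Return the (customer, item) pairs seen at least min_count times."""
    counts = Counter(log)
    return {pair: n for pair, n in counts.items() if n >= min_count}


associations = frequent_pairs(purchases)
print(associations)  # {('joe', 'silk shirt'): 3}
```

Note that the tally only captures what was bought, not why it was bought.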
What Data mining doesn’t do: Filtering Noise
Except that maybe this is an error: the customer in the example, let’s call him Joe, bought the shirts as a gift for his father-in-law. Joe is even allergic to silk. And therein lies one of the drawbacks of data mining: It produces thousands of false positives.
Looking for the oddball: Are there needles in those haystacks?
The opposite strategy is used when searching for terrorists among the gobs of mundane data about the millions of people who are currently on US soil. The search looks for data such as the fact that one of the terrorists had been in the country for less than two years, had 30 credit cards and a quarter million dollars’ debt, or the number of Middle Eastern men taking flight lessons in the last 2 years, which is what Coleen Rowley did with her built-in data mining tool, her brain. But this kind of mining also yields thousands of false positives, i.e., thousands of innocent civilians get characterized as terrorists.
Thousands of false positives
Sure enough, someone in the blogosphere / MSM brought that up in the context of Able Danger:
Eric Umansky: Thousands of False Positives on Able Danger?
The NYT’s Philip Shenon, who has done some of the Able Danger reporting, was interviewed Friday on WNYC. There, host Mike Pesca raised the false positives question. Here’s what Shenon said:
“I understand from others at the Pentagon that one of the problems here is that Able Danger came up with names not just of Atta and three others, it came up with a tremendous number of names of very decent American citizens.”
That sounds like a whole lot more than the “60” the Times suggested…Friday (i.e. same day as the radio interview). Are Shenon and Jehl on the same page?
Shenon’s comment matches my personal experience: As I said in my blog, data mining brings up a lot of false positives, in the hundreds even with the relatively low volume of data I worked with. Thousands of false positives is very plausible with the gobs of data that Able Danger must have dealt with.
Using Data Mining to Find Terrorists: False positives
Those limitations are not only the musings of yours truly, but have also been raised by data mining experts such as Herb Edelstein, an internationally recognized expert in data mining, data warehousing and CRM, consulting to both computer vendors and users. A popular speaker and teacher, he is also a co-founder of The Data Warehousing Institute.
False positives. Given the difficulty of developing good signatures and the small number of terrorists relative to the population of the United States, there are likely to be an enormous number of innocent people identified as potential terrorists (false positives). The more you try to avoid false positives, the more likely you are to miss many true positives. Unlike a direct mail campaign where the cost of a false positive is only a few dollars at worst, the costs in identifying terrorists – in dollars, time and wasted opportunity – are staggering. Suppose we had a collection of algorithms that has a false positive rate of only 0.1 percent – extraordinarily good for a problem of this complexity. That would mean 220,000 false positives! There are not enough investigators to investigate every false positive. Even if there were, the dollar cost would be in the billions, as would the cost of the resulting lawsuits. More importantly, the resources and amount of calendar time expended in these mostly useless investigations would likely leave many true terrorists free. Even if we concentrated only on non-citizens, we would still have more than 20,000 false positives to be vetted.
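The arithmetic behind Edelstein's figures can be checked with a few lines of Python. This is only a sketch: the population of 220 million is the one implied by his numbers (220,000 false positives at 0.1 percent), while the count of 1,000 actual terrorists and the perfect detection rate are purely hypothetical assumptions, chosen to be generous to the algorithm.

```python
def screening_outcome(population, true_targets, false_positive_rate, detection_rate):
    """Estimate how a screening algorithm performs on a whole population.

    Returns (false_positives, true_positives, precision), where precision
    is the share of flagged people who really are targets.
    """
    innocents = population - true_targets
    false_positives = innocents * false_positive_rate
    true_positives = true_targets * detection_rate
    flagged = false_positives + true_positives
    precision = true_positives / flagged if flagged else 0.0
    return false_positives, true_positives, precision


# 220 million is implied by the quoted figures; 1,000 targets and a
# 100 percent detection rate are hypothetical, best-case assumptions.
fp, tp, precision = screening_outcome(
    population=220_000_000,
    true_targets=1_000,
    false_positive_rate=0.001,
    detection_rate=1.0,
)
print(f"false positives: {fp:,.0f}")
print(f"true positives:  {tp:,.0f}")
print(f"precision:       {precision:.2%}")
```

Even under these best-case assumptions, fewer than 1 in 200 of the people flagged would be a real target; the rest are innocent people who still have to be investigated.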
Seeing what is not there: Data Dredging
Again from Wiki:
Data dredging.. implies imposing patterns (and particularly causal relationships) on data where none exist. This imposition of irrelevant, misleading or trivial attribute correlation is more properly criticized as “data dredging” in the statistical literature… [it] implies scanning the data for any relationships, and then when one is found coming up with an interesting explanation. (This is also referred to as “overfitting the model”.) .. A more significant danger is finding correlations that do not really exist. Investment analysts appear to be particularly vulnerable to this. [but gamblers are worse] “There have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it.”
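Data dredging is easy to reproduce. The sketch below is not taken from any of the quoted sources: it generates pure random noise (the sample size, the number of candidate attributes and the 0.5 cutoff are arbitrary choices of mine), then scans it for the attribute that best "explains" a random target. With enough attributes to try, a strong-looking correlation turns up even though none really exists.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_samples=20, n_features=1000, cutoff=0.5, seed=0):
    """Scan random attributes for correlation with a random target.

    Returns (best, strong): the largest absolute correlation found and
    the number of attributes whose absolute correlation beats the cutoff.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best, strong = 0.0, 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = abs(pearson(feature, target))
        best = max(best, r)
        if r > cutoff:
            strong += 1
    return best, strong


best, strong = dredge()
print(f"strongest correlation found in pure noise: {best:.2f}")
print(f"attributes with |r| > 0.5: {strong} out of 1000")
```

Every one of those "hits" is a false pattern, which is exactly what the roulette watchers in the quote keep finding.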
Separating the Wheat from the Chaff
Of course there are modern data mining programs that have sophisticated noise filtering and sample bias techniques. The Pentagon’s “Able Danger” system is likely to have used “leading edge” data mining techniques such as the ones described in the paper Data Mining For Very Busy People (copy here; pdf, lots of hairy math and statistics). The authors illustrate how many simple real-life problems can yield too many results, and how some state-of-the-art algorithms can sift through those. But then they miss true positives, as you can see in the paper’s golf example:
In the golf example, the best outcome is playing lots of golf. outlook = overcast always appears when the golfer plays lots of golf and never when the golfer plays no or some golf. The treatment is outlook = overcast, and this test finds four of the six best outcomes [but it lost 2 true positives: golfers play lots of golf on sunny days that aren’t too hot or too windy]
http://www.texasturkey.us/images/MINING_S.jpg
More from the expert on Using Data Mining to Find Terrorists
Data Mining In Depth: Using Data Mining to Find Terrorists
It was recently reported that a few days after the September 11 attacks, FBI agents visited one of the largest providers of consumer data. They did so to see if the 9/11 terrorists were in the database and quickly found five of them. One of the terrorists had been in the country for less than two years, had 30 credit cards and a quarter million dollars’ debt with a payment schedule of $9,800 per month. Mohammed Atta, the ringleader, had also been here less than two years and had 12 addresses under the names Mohammed Atta, Mohammed J. Atta, J. Atta and others. Surely, their report speculated, with patterns like this, we can use the databases we presently have to ferret out terrorists in our midst. Unfortunately, the answer is, “It depends.”
There are limitations in using these so-called patterns of the agents’ observations. We need to ask, first, how the records were found and, second, if the observed characteristics are indeed repeated patterns or merely isolated instances. Because I am not privy to any knowledge other than what was published in the report, my analysis is based on surmise.
More than likely, the FBI started their search with database queries using the suspected terrorists’ names and likely variants. They found the terrorists’ records and then noticed the number of credit cards, addresses and the amount of debt. However, they probably would not have known in advance to look for these attributes. Furthermore, the terrorists’ records probably didn’t show that they had been in the country for only two years; that is knowledge the FBI brought to the search.
We also don’t know how easily the observations generalize to other terrorists or how many non-terrorists have these same attributes. Combing the database for people who have a number of credit cards, big debts or multiple addresses would undoubtedly yield both criminals (most of whom aren’t terrorists) and perfectly innocent folks.
The large number of addresses for Atta may be an even more difficult screening criterion to use, considering that we don’t know the names of unknown terrorists, let alone their aliases. It would be nearly impossible to conduct an aggregation across the hundreds of millions of individuals in this database to calculate the number of addresses, especially because all a terrorist would have to do to defeat such a search is use different aliases.
As I indicated last month (Data Mining In Depth: TIAin’t), we don’t have enough known terrorists or a consistent set of behaviors to use data mining to build predictive models. Thus, it would not be particularly productive to search for a signature.
CAN WE TRUST EITHER THEM OR WELDON?
Alert TKS reader “KH” sends in another piece of the puzzle.
In Sunday’s Bergen Record, columnist Mike Kelly quoted an Able Danger team member as saying Mohammed Atta lived in the Wayne Inn in New Jersey for a year before the 9/11 attacks.
KH found another Bergen Record article, from Friday, June 20, 2003:
Mohamed Atta, the mastermind of the Sept. 11 terror attacks, lived in a Wayne motel for about a year before the attacks, state police said Thursday.
The revelation – the first time authorities have placed Atta in the motel – came after state police Superintendent Joseph R. Fuentes testified Thursday afternoon before the Assembly Homeland Security Committee about his agency’s efforts to combat terrorism.
In his testimony, Fuentes confirmed what officials had said previously: New Jersey was one of the launching points for the Sept. 11 attacks and that some of the 19 hijackers moved through New Jersey hotels and used New Jersey-based criminal networks to obtain identification documents.
Fuentes would not go into detail, but Sgt. Kevin Rehmann, a state police spokesman, later said Fuentes was referring to Atta. Rehmann said Atta had lived in the Wayne Motor Inn on Route 23 for a year.
“The Record has previously reported that several of the terrorists were seen in Passaic County and that two unidentified terrorists stayed at the Wayne motel.
Two days after the attacks, FBI agents searched the Wayne Motor Inn’s guest records and took copies of several receipts, hotel employees said at the time.”
Excellent article with added links to references.
~~~
Thank you for this diary. I am among the “technologically impaired” and you explained data mining and its downfalls so clearly that I got it! It seems to relate to the problem that the human brain wants to see patterns and cause and effect, whether they are there or not. (If B follows A, then A caused B.) It’s hard to resist, and added to the other problems with data mining, it adds up to huge problems.