Earlier this week I was asked to comment once again on potential issues and concerns around big data. This time, the concerns were around bad analytics being applied to big data. In an article recently published on searchCIO.com, a benign example of bad analytics being applied to big data resulted in the funding of a research grant where no correlation of facts actually existed. Other articles point to the potential for people being wrongly excluded from vital benefits such as healthcare, or for the government making egregiously bad decisions based upon poor analysis (as if that has never happened before :) ). Below are some of my general thoughts on the topic for your amusement:
- Data and Information Are Not Synonymous Terms. Data are facts; information is a fact (or facts) in context. Removing context from data can obscure its meaning as effectively as encrypting it. For an example, take the 10-digit number 3015553078. Standing alone as a datum, without context, this number has no meaning. If we were to give it context by, say, adding commas (3,015,553,078) or by segmenting it in two (30155 53078), the datum takes on some level of significance. Only by adding the proper context, though -- in this case, (301) 555-3078 -- can we extract the proper meaning (or information) behind the datum provided.
- Intelligence Requires Data and Information. Intelligence is a collection of information which has political and/or military value. By analyzing data and information we can extract hidden information of significance and relevance. In the above example, for instance, if you were given the information that (a) 301 is a Maryland area code and (b) I used to live in Maryland, you might be able to conclude that the aforementioned telephone number used to be mine.
- Big Data Collection Risks Removing Too Much Context. This is especially the case with unstructured data. In many cases the only context searched for is a cross-reference between an individual and certain terms. The more those terms come up, the more an individual is assumed to meet certain criteria. For a real-world example, I harken back to the late '80s/early '90s. Around this time, law enforcement officials in Dade County began stopping individuals traveling north on I-95 on suspicion of narcotrafficking. Based upon their data, most overland drug couriers were (a) dark-skinned males (b) between the ages of 20 and 30 (c) driving late-model luxury cars who (d) made it a point not to speed. Based upon this confluence of data, I was once pulled over for such a stop...despite being in military uniform with my West Point ring proudly on display.
- Data Analytics Is A Starting Point, Not An End Point. Using the example from the previous point, even I can understand why I was pulled over; what continues to annoy me to this day about that situation is that the officer insisted upon doing a full search of the vehicle despite my offering both positive military ID and a set of valid military orders. As I fit the selection criteria for a profile stop, the officer felt it reasonable to ignore all other information being presented and delay my journey north for over an hour. This, to me, epitomizes the problem with big data analytics. Even the best-written search strings and heuristic models will get it wrong. While the best models can achieve as much as a 98% accuracy rate, a 2% error rate spread across 1 million selectees still amounts to 20,000 erroneous results. If these results pertained to, say, healthcare coverage, the impact could be tremendous.
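For anyone who wants to check the arithmetic behind that last point, here is a minimal sketch in Python. The function name `expected_errors` and the extra accuracy figures (99% and 99.9%) are my own illustrative assumptions, not numbers from any real model; the only inputs taken from the discussion above are the 98% accuracy rate and the pool of 1 million selectees.

```python
def expected_errors(population: int, accuracy: float) -> int:
    """Expected number of erroneous results for a given pool size and accuracy.

    Hypothetical helper for illustration: it assumes every selectee has the
    same, independent chance of being handled incorrectly.
    """
    error_rate = 1.0 - accuracy
    return round(population * error_rate)


selectees = 1_000_000  # pool size used in the example above

# 0.98 comes from the text; 0.99 and 0.999 are assumed values for comparison.
for accuracy in (0.98, 0.99, 0.999):
    errors = expected_errors(selectees, accuracy)
    print(f"{accuracy:.1%} accurate -> {errors:,} erroneous results")
```

Even at an assumed 99.9% accuracy, a pool this size still leaves a four-figure number of people on the wrong end of the result, which is why I treat the output as a starting point for a human being rather than a verdict.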
My bottom line with big data analytics is this: utilizing the data to narrow the pool through which a human being must search may (note the word) be sensible and proper depending upon the context. Utilizing a data search query to make the final decision without human intervention is short-sighted and will lead to potentially life-changing errors. As pundits who advocate big data continue to extol the potential efficiency gains of the associated technologies, we as security professionals must ensure that we do not lose sight of the dangers associated with their irresponsible use.
My two cents...