Dark Reading is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Perimeter

10/24/2012
10:32 AM
Adrian Lane
Adrian Lane
Commentary
50%
50%

When Data Errors Don't Matter

Does bad data break 'big data' analysis?

I ran across this short video comparing MySQL to MongoDB, and it really made me laugh. A tormented MySQL engineer is arguing platform choices with a Web programming newbie who only understands big data at a buzzword level. Do be careful if you watch the video with the sound on because the latter portion is not child-friendly, but this comical post captures the essence of the argument relational DB architects have against NoSQL: Big data systems fail system architects' criteria for data accuracy and consistency. Their reasoning is if the data's not accurate, who care's whether it's "Web scale?" It's garbage in, garbage out, so why bother?

But I think the question deserves more attention. In fact, I ask the question: Does some bad data in a big data cluster matter?

I think that the answer is, "No, it does not."

There are two reasons for this.

Data in the aggregate:
Most of the big data analytics are basing decisions across billions on records. Trends and decisions are not a simple "X=Y" comparison, but billions of "X=Y" comparisons. Decisions are made across the aggregate to show trends and provide a likelihood of an event. Big data clusters are not used to produce an accurate ATM statement, but rather to predict a person's potential interest in a specific product based upon prior Web search history. It's less about binary outcomes and more like fuzzy-logic.

Data velocity:
Most of the clusters I've seen in operation pour new data in at furious rate -- terabytes of data every day. Queries may favor more recent events, or they may balance their predictions on current and historic trend data. In either case, if you get some bad data into the cluster due to a hardware of software issue, it's likely to cause a short-term dip in accuracy. Tomorrow a whole new batch of data will offset, or overwrite, or mute the impact of yesterday's bad data. Data velocity and volume greatly reduce the impact of data corruption of a handful of records.

And that's the essence of big data analytics -- it's not so much about specific data points as it is metatrends.

Keep in mind that if there is one thing that's consistent with big data systems it's inconsistency. These systems are incredibly diverse in features and functions. It's dangerous to pigeonhole big data into a specific set of value statements because there are some 120 different NoSQL systems, each with add-on packages that provide near limitless functional variations. While the Web programmer newbie in the video above may not have a clue, application developers who work with big data have tuned out the relational database dogma for good reason. There are, in fact, ACID-compliant databases built on a Hadoop framework. These provide transactional consistency -- granted, in different ways than many relational platforms -- but the options exist. There are cases where relational databases are a must-have, but the decision to choose one over the other is far more complex that what's commonly portrayed.

And let's not forget that most relational systems have their own issues with data accuracy. The handful of studies I've seen on data accuracy in relational platforms -- during the past 12 years or so -- finds about 25 percent of the data stored to be inaccurate. Data entry errors, data "aging" issues where information becomes inaccurate over time, errors when collecting information, errors when aggregating and correlating, errors when loading data into the relational format, as well as other problems do exist in relational environments. This is not due to the hardware or software, but it's simply due due to how data is collected and processed between systems. It's a set of issues not often discussed, as relational databases are excellent at transactional consistency, but still have unreliable data that affects analysts even more than it does with big data clusters.

Adrian Lane is an analyst/CTO with Securosis LLC, an independent security consulting practice. Special to Dark Reading. Adrian Lane is a Security Strategist and brings over 25 years of industry experience to the Securosis team, much of it at the executive level. Adrian specializes in database security, data security, and secure software development. With experience at Ingres, Oracle, and ... View Full Bio

Comment  | 
Print  | 
More Insights
Comments
Newest First  |  Oldest First  |  Threaded View
I 'Hacked' My Accounts Using My Mobile Number: Here's What I Learned
Nicole Sette, Director in the Cyber Risk practice of Kroll, a division of Duff & Phelps,  11/19/2019
DevSecOps: The Answer to the Cloud Security Skills Gap
Lamont Orange, Chief Information Security Officer at Netskope,  11/15/2019
Attackers' Costs Increasing as Businesses Focus on Security
Robert Lemos, Contributing Writer,  11/15/2019
Register for Dark Reading Newsletters
White Papers
Video
Cartoon Contest
Current Issue
Navigating the Deluge of Security Data
In this Tech Digest, Dark Reading shares the experiences of some top security practitioners as they navigate volumes of security data. We examine some examples of how enterprises can cull this data to find the clues they need.
Flash Poll
Rethinking Enterprise Data Defense
Rethinking Enterprise Data Defense
Frustrated with recurring intrusions and breaches, cybersecurity professionals are questioning some of the industrys conventional wisdom. Heres a look at what theyre thinking about.
Twitter Feed
Dark Reading - Bug Report
Bug Report
Enterprise Vulnerabilities
From DHS/US-CERT's National Vulnerability Database
CVE-2012-2079
PUBLISHED: 2019-11-22
A cross-site request forgery (CSRF) vulnerability in the Activity module 6.x-1.x for Drupal.
CVE-2019-11325
PUBLISHED: 2019-11-21
An issue was discovered in Symfony before 4.2.12 and 4.3.x before 4.3.8. The VarExport component incorrectly escapes strings, allowing some specially crafted ones to escalate to execution of arbitrary PHP code. This is related to symfony/var-exporter.
CVE-2019-18887
PUBLISHED: 2019-11-21
An issue was discovered in Symfony 2.8.0 through 2.8.50, 3.4.0 through 3.4.34, 4.2.0 through 4.2.11, and 4.3.0 through 4.3.7. The UriSigner was subject to timing attacks. This is related to symfony/http-kernel.
CVE-2019-18888
PUBLISHED: 2019-11-21
An issue was discovered in Symfony 2.8.0 through 2.8.50, 3.4.0 through 3.4.34, 4.2.0 through 4.2.11, and 4.3.0 through 4.3.7. If an application passes unvalidated user input as the file for which MIME type validation should occur, then arbitrary arguments are passed to the underlying file command. T...
CVE-2019-18889
PUBLISHED: 2019-11-21
An issue was discovered in Symfony 3.4.0 through 3.4.34, 4.2.0 through 4.2.11, and 4.3.0 through 4.3.7. Serializing certain cache adapter interfaces could result in remote code injection. This is related to symfony/cache.