Daily Bulletin

Big Data analyses depend on starting with clean data points

  • Written by: The Conversation
Image: What you get out is what you put in. (Keys image via www.shutterstock.com)

Popularly referred to as “Big Data,” mammoth sets of information about almost every aspect of our lives have triggered great excitement about what we can glean from analyzing these diverse data sets. Benefits range from better investment of resources, whether for government services or for sales promotions, to more effective medical treatments. However, real insights can be obtained only from data that are accurate and complete, so it’s critical to keep in mind how the data were collected.

Data scientists know the importance of accurate and complete data. After all, if the data themselves are unreliable, you’ll wind up drawing invalid conclusions from your analysis.

Image: Oh, did I press that? (Marcin Wichary, CC BY)

Avoiding that pitfall is expensive: a major cost in most data analysis projects is data preparation and cleaning – that is, finding and correcting errors in the data. These errors include incorrect values, missing entries, aliasing (where information about two distinct entities has been merged in error, for example because two people have the same name) and multiple entry (where information about the same entity is split up, for example because one person’s name has been spelled in different ways). When data sets are small, the analyst can manually examine and validate each entry. With large data sets, we have to rely on computer-executed algorithms. The development of such algorithms is now a subfield in its own right.
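To make two of these error types concrete, here is a minimal sketch of how software might flag missing entries and possible multiple entries. The record layout, the field names and the 0.85 similarity threshold are all assumptions chosen for illustration; real cleaning tools use far more careful matching.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Jon Smith", "city": "Plainfield"},
    {"id": 2, "name": "John Smith", "city": "Plainfield"},
    {"id": 3, "name": "Maria Lopez", "city": ""},
]

def missing_entries(rows, fields):
    """Return (id, field) pairs where a required field is empty or absent."""
    return [(r["id"], f) for r in rows for f in fields if not r.get(f)]

def possible_multiple_entries(rows, threshold=0.85):
    """Return id pairs whose names are similar enough to be one person."""
    pairs = []
    for a, b in combinations(rows, 2):
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score >= threshold:  # the threshold is an arbitrary assumption
            pairs.append((a["id"], b["id"]))
    return pairs

print(missing_entries(records, ["name", "city"]))
print(possible_multiple_entries(records))
```

Note that a similarity check of this kind only suggests candidates for a human or a stricter rule to review. It cannot detect aliasing, where two different people really do share the same name.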

The old truism “garbage in, garbage out” is more apt than ever in this era of complex and gargantuan data sets – and the sometimes weighty consequences of trusting what they seem to imply.

How inaccuracies creep in

Errors in data can arise for a variety of reasons. For example, users often make mistakes when filling in web forms. Data cleaning software can verify that the zip code matches the street address, and possibly even correct it. Similarly, if the state has been entered along with the town in the city field (for example, “Plainfield, NJ” for city), data cleaning can move the state entry to the correct field. Or if a street has only house numbers 1–80, data cleaning software can flag a house number entered as “125” as erroneous. Many inadvertent errors can be caught, and possibly fixed, by clever software.
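The two form checks described above can be sketched as simple rules. This is only an illustration: the street name, the `STREET_RANGES` lookup table and the field names are hypothetical, and a real system would draw on an authoritative address database.

```python
import re

# Hypothetical reference data: valid house-number range per street.
STREET_RANGES = {"Elm Street": (1, 80)}

def clean_address(entry):
    """Return a cleaned copy of the entry plus a list of flagged problems."""
    cleaned = dict(entry)
    problems = []

    # Move a trailing two-letter state code out of the city field,
    # but only if the state field is still empty.
    match = re.fullmatch(r"(.+?),\s*([A-Z]{2})", cleaned.get("city", "").strip())
    if match and not cleaned.get("state"):
        cleaned["city"], cleaned["state"] = match.group(1), match.group(2)

    # Flag (do not silently fix) house numbers outside the known range.
    bounds = STREET_RANGES.get(cleaned.get("street"))
    number = cleaned.get("house_number")
    if bounds is not None and isinstance(number, int):
        low, high = bounds
        if not low <= number <= high:
            problems.append(f"house number {number} outside {low}-{high}")
    return cleaned, problems

entry = {"house_number": 125, "street": "Elm Street",
         "city": "Plainfield, NJ", "state": ""}
cleaned, problems = clean_address(entry)
print(cleaned["city"], cleaned["state"], problems)
```

The sketch corrects the misplaced state automatically because the fix is unambiguous, but only flags the house number, since the software cannot know what the right number should be.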

Bad data entry isn’t the only source of inaccuracies. One common place where errors arise is in linking data across data sets. Unless both data sets use a unique identifier – such as a social security number – with each entry, it is challenging to match entries across data sets: there are likely to be entries that wind up linked even though they should be distinct, and entries that are not linked even though they correspond.
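A small sketch shows why the unique identifier matters. The two data sets, their field names and the name-normalization rule below are assumptions made up for this example; they are not taken from any particular linkage tool.

```python
def link_by_key(left, right, key):
    """Exact join on a shared unique identifier."""
    index = {row[key]: row for row in right}
    return [(l, index[l[key]]) for l in left if l[key] in index]

def normalize(name):
    """Lowercase and drop punctuation so trivial spelling variants match."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())

def link_by_name(left, right):
    """Fallback join on normalized name; may over-link or under-link."""
    index = {}
    for row in right:
        index.setdefault(normalize(row["name"]), []).append(row)
    return [(l, r) for l in left for r in index.get(normalize(l["name"]), [])]

# Hypothetical data sets with no shared identifier.
customers = [{"name": "J. Smith"}, {"name": "Ann Lee"}]
orders = [
    {"name": "j smith", "total": 40},
    {"name": "J Smith", "total": 15},   # possibly a different J Smith
    {"name": "Anne Lee", "total": 7},   # probably the same person as Ann Lee
]

links = link_by_name(customers, orders)
print(len(links))
```

Here the name-based join links “J. Smith” to both Smith orders, even though they may belong to different people, and fails to link “Ann Lee” to “Anne Lee”. With a reliable identifier, `link_by_key` avoids both kinds of mistake.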

Another frequent source of mistakes is when computer software creates table entries based on other, more complex, data. For example, if you write a review of a restaurant, this may be condensed into one of a few buckets (e.g., loved/liked/hated) along a few simple axes (e.g., ambiance, food taste, service, value for money). The condensed form is amenable to quantitative analysis, which the original text form is not. But errors can be made in the process of condensing.
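To see how condensing can go wrong, consider a deliberately naive sketch that assigns a review to a bucket by keyword lookup. The word lists and the `bucket` function are invented for illustration, and the sketch ignores the separate axes; production systems use much more sophisticated text analysis, but they can fail in similar ways.

```python
# Hypothetical keyword lists; real systems are far more elaborate.
LOVED = {"love", "loved", "amazing", "excellent"}
LIKED = {"good", "nice", "fine", "pleasant"}
HATED = {"hate", "hated", "awful", "terrible", "bad"}

def bucket(review):
    """Condense free text into loved/liked/hated by naive keyword lookup."""
    words = set(review.lower().replace(".", " ").replace(",", " ").split())
    if words & HATED:
        return "hated"
    if words & LOVED:
        return "loved"
    if words & LIKED:
        return "liked"
    return "unknown"

print(bucket("The service was excellent."))
print(bucket("The food was not good."))
```

The second review is filed under “liked” because the lookup sees the word “good” and misses the negation. The table entry is well formed and ready for analysis, yet it says the opposite of what the reviewer wrote.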

Image: If the data aren’t good, neither are the interpretations. (Pete Birkinshaw, CC BY)

At least don’t motivate people to lie

Dirty data are almost impossible to clean when errors are due to intentional user choice as opposed to inadvertent causes. Suppose you enter your neighbor’s address as yours: clever software cannot catch this lie without knowing more about you. After all, the address entered is technically a valid entry; it’s just not correct.

If we are to trust the results of analysis, we must ensure that the data collection procedures at least don’t give users an incentive to cheat.

Consider web forms that routinely ask us to fill out information about ourselves. Many users enter a bogus email address in these forms, perhaps for fear of possible spam mail. Some websites confirm the email address entered, for instance, by sending a verification link that the user has to click. But such verification is expensive and unfriendly. The complementary approach is for the website to develop a reputation for trustworthiness so that users are willing to share their email addresses without worrying about the potential for misuse.
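The verification-link approach can be sketched in a few lines. Everything here is an assumption for illustration: the in-memory `pending` store, the function names and the `example.com` URL stand in for whatever a real website would use, and actually sending the email is left out.

```python
import secrets

pending = {}      # token -> email address awaiting confirmation
verified = set()  # addresses whose owners clicked the link

def start_verification(email):
    """Create a one-time token and return the link to email to the user."""
    token = secrets.token_urlsafe(16)
    pending[token] = email
    return f"https://example.com/verify?token={token}"  # hypothetical URL

def confirm(token):
    """Mark the address as verified if the token is known and unused."""
    email = pending.pop(token, None)
    if email is None:
        return False
    verified.add(email)
    return True

link = start_verification("reader@example.com")
token = link.split("token=")[1]
print(confirm(token), confirm(token))
```

Even this small sketch hints at the cost: the site must store state, send mail and handle users who never click. It also proves only that the address works, not that it is the one the user actually reads.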

In fact, people (and businesses and other entities) will provide correct and complete data only if they feel they can trust the data collection. The US Census Bureau is able to collect high-quality data because it can assure citizens that what they report in the census will be used only for statistical reporting, and not for tax collection or any other government purpose. Catching tax cheats may be desirable, and census data could obviously enhance the government’s ability to identify them. Even so, laws in most countries prevent such use of census data, because the moment citizens know census data can be used for tax computation, they will be motivated to lie to the census-taker.

Image: Could big data have helped prevent the Germanwings plane crash? (Emmanuel Foudrot/Reuters)

Big data can’t outsmart high-stakes incentives to lie

Maybe you don’t really care whether or not you get the right targeted weekly email highlighting sales of possible interest to you at a local chain store. But there are certainly other instances where the stakes for big data accuracy are much higher.

For instance, take the current spotlight on German privacy laws centered on the mental health of pilot Andreas Lubitz. He allegedly crashed a plane intentionally into the Alps in March 2015, killing 150 people. Given his mental health, he probably should not have been flying an airplane. Some people argue that his employer, Lufthansa, parent company of Germanwings, should have had complete access to Lubitz’s mental health record and thus been able to keep him out of the cockpit before he had a chance to bring down a flight.

But weakening privacy laws would not reveal to authorities the true mental health of people like Lubitz. Rather, it would make it less likely that the official health record is a reliable record of fact. Someone like Lubitz, who is keen to fly and dreams of becoming a pilot, would likely do everything possible to hide any disqualifying condition from his official medical record if he knew it could be used against him. The incentive for omission and falsehood would undermine the ability to collect and use a reliable data set. In this case, privacy would be sacrificed without any safety payoff. Much better to keep the medical record data clean, and qualify pilots through tests run outside the formal medical system.

It’s great for us as a society to make use of all the data resources we have. But it’s important not to ruin the quality of this data resource in our enthusiasm to use it, even with good intentions. Unless we are careful about how we deploy these big data sets, we’ll collect data of poor quality, particularly where there are individual points of concern, such as Lubitz’s health record. The inferences we draw from big data are only as good as the individual data points we feed in.

H V Jagadish's research on Big Data is funded in part by the National Science Foundation and the National Institutes of Health.

Read more http://theconversation.com/big-data-analyses-depend-on-starting-with-clean-data-points-43687
