Read The Times Australia

Daily Bulletin

Topology looks for the patterns inside big data

  • Written by: The Conversation
imageWhat good is all this data if we can't figure out how to analyze it?Elif Ayiter, CC BY-NC-ND

Big data gets much attention from media, industry and government. Companies and labs generate massive amounts of data associated with everything from weather to cell phone usage to medical records, and each data set may involve hundreds of variables.

These sets are so large and complex that traditional methods of looking for patterns within them can’t make much headway. Often touted as a silver bullet, data analytics certainly have the potential to make inroads into once intractable problems. But we have to be able to figure out what we’re looking at.

imageA regression line can show the relationship between height and weight in a group of people.Jake, CC BY

Statistics 101 invariably contains a lecture or two about linear regression – finding the best line that fits a set of points scattered in a plane. These graphs often show up in articles about climate change, for example, where temperature and other weather data are plotted against time, or in economic forecasts where employment or GDP history is used to extrapolate into the future.

But what if the set of points doesn’t lie near a line but instead forms something like a circle?

imageA collection of points on the circle (left), and the best-fitting line (right).Kevin Knudson, CC BY-NC-ND

Clearly regression isn’t useful in this context, but we know that only because we can see that the points form a circle.

Now imagine instead a collection of points lying on a circle in a higher-dimensional space. In three dimensions we might be able to see the circle, but if we have more variables, as often happens when examining large data sets, we are in trouble. How could we detect the circle? Better: how could we tell a computer to find the circle?

These are the types of questions arising from the growth of big data – and algebraic topology provides some answers.

imageSpheres, cubes, look the same to me.Phil Shirley, CC BY-NC-ND

How to make sense of big data spatially

Topology is sometimes called “rubber sheet geometry.” To a topologist, a sphere and a cube are the same thing. Imagine a cube made from flexible material; inserting a straw and blowing into the cube would puff it out into a sphere. Operations like this are called deformations, and two objects are considered to be the same if one can be deformed to the other.

Topologists study spaces by assigning algebraic objects called invariants to them. They may be as simple as an integer, but they are often more complicated algebraic structures. For data analysis, the invariant of choice is persistent homology.

Ordinary homology measures the number of “holes” that cannot be filled in a space. Let’s think about a sphere again. If we draw a loop on the sphere, it bounds a 2-dimensional disc on the surface; that is, we can fill in any loop on the sphere and so there are no 2-dimensional “holes.” By contrast, the surface of the sphere itself bounds a 3-dimensional “hole” that cannot be filled.

imageA closed loop on the surface of a sphere; it bounds a disc and therefore does not add to the first Betti number.Peter Saveliev, CC BY-NC-ND

The Betti numbers of a space count the number of such unfillable holes of each dimension. A sphere has second Betti number equal to 1 (because its interior cannot be filled) and first Betti number equal to 0 (because any loop bounds a disc on the sphere). The zero-th Betti number counts the number of pieces a space has; in the case of the sphere we have one piece. There are higher-dimensional versions of this as well for more complicated spaces.

The problem with using ordinary homology for data analysis is that if we compute the homology of a discrete set of data points, we will be disappointed. There are no holes, only a collection of disconnected points. The zero-th Betti number will count how many points there are, but as there are no loops or spheres in such a set the higher Betti numbers will all be 0. This is where persistent homology enters the story.

We need to take our discrete set of points and join them together. Imagine putting a small ball of radius r around each point in our data set. If r is very small, then none of the balls will intersect and the Betti numbers of all the balls in the set are the same as for the discrete set.

imageBalls of increasing radius around data points.Kevin Knudson, CC BY-NC-ND

However, if we allow r to grow, then eventually the balls will begin to touch and we will likely get nontrivial higher Betti numbers. In the animation, we see that once r reaches a certain threshold, the balls around the top three points intersect in pairs and therefore contain the triangle joining the three points. Moreover, we cannot fill in the triangle since there’s a small gap in the middle; this means the first Betti number is 1 at that stage. But as r gets a bit bigger, then all three balls intersect at once and we can fill in the triangle; the first Betti number then drops to 0.

imageBarcode associated to the above data. The 0-th Betti number at top decreases from 4 to 0. The first Betti number at bottom shows the emergence of two short-lived 1-dimensional homology classes.Kevin Knudson, CC BY-NC-ND

Persistent homology tracks these numbers as the radius grows; the plot of these numbers against the parameter r is called a barcode. Long bars suggest features in the data that may be significant (they persist, hence the terminology). Short bars often arise from noise in the data and may be disregarded (or not – context is important).

So what we’ve done is pass from a discrete collection of points to a sequence of more complicated spaces (one for each r) which hopefully model the data much better than a simple linear regression might.

imageA circle persists in the spaces as the radius of the balls increases.Kevin Knudson, CC BY-NC-ND

In the above animation, we show how a few points on a circle might be modeled in this way. We have suppressed the balls around the points, connecting two points when their associated balls overlap, forming triangles when three intersect and so on. A circle persists for quite a long time, leading us to guess that our data lie near a circle.

imageTopological data analysis led to a new way to compress digital photos.Peter Forret, CC BY-NC

Applications beyond the theory

Gunnar Carlsson of Stanford University is one of the pioneers of topological data analysis. One of his group’s first successes was the discovery of the topology of the space of natural images. This data set consists of several million 3-pixel by 3-pixel patches sampled from black and white digital photographs. Each pixel is described by a number between 0 and 255, measuring its grayscale value; each 3-by-3 patch then corresponds to a point in a 9-dimensional space, each coordinate giving the numerical value of the associated pixel. After tossing out the constant patches and doing some normalizations, this space lies inside a 7-dimensional sphere. At first glance, the set appears to fill out the sphere, but structure emerges by restricting attention to areas where the points pack closer together.

imageA Klein bottle is like a Mobius strip: it has no boundary.Tttrung, CC BY-SA

Carlsson and his collaborators showed that the data actually lie on a Klein bottle, a nonorientable 2-dimensional surface embedded in the sphere. They were able to push this further to find a compression algorithm for photos that is slightly better than the industry standard JPEG 2000. Carlsson has published an excellent survey of this work.

In light of this success, Carlsson and some of his colleagues founded AYASDI, a company with a growing roster of clients in banking, finance, government and other industries. They use these and other techniques to analyze diabetes, breast cancer and cardiopulmonary disease data. The results are encouraging – certain subgroups of patients with high survival rates, invisible using traditional statistical methods, may be found via these techniques.

The real promise of these methods, however, lies in the possibility of tailoring treatments and solutions to individuals. Analysis of large data sets lets us know, for example, that a drug once thought to be 80% effective is actually 100% effective on 80% of patients, identifiable via some marker. Topological data analysis provides another tool to advance these analytics, often identifying features that were hidden before.

Kevin Knudson does not work for, consult to, own shares in or receive funding from any company or organisation that would benefit from this article, and has no relevant affiliations.

Authors: The Conversation

Read more http://theconversation.com/topology-looks-for-the-patterns-inside-big-data-39554

Business News

Inside the Icon: The BridgeMuseum Officially Opens at the Sydney Harbour Bridge

A bold new way to experience one of Australia’s most recognisable landmarks has arrived, with BridgeClimb Sydney officially opening the all-new BridgeMuseum.  Located inside the Sydney Harbour Brid...

Daily Bulletin - avatar Daily Bulletin

Is Your Brand Showing Up in AI Search? Most Melbourne Brands Aren't.

The New Front Door Nobody Told You About Something changed. Quietly. Without a press release. The way buyers find businesses in Australia has been rewired. Not replaced, rewired. Google isn't dead...

Daily Bulletin - avatar Daily Bulletin

How Australian Businesses Can Measure SEO ROI

SEO can feel vague when you are staring at a dashboard full of numbers that do not clearly connect to revenue. The key is to measure the right signals in the right order, then tie them back to outcome...

Daily Bulletin - avatar Daily Bulletin

How Commercial Roller Shutters Improve Site Security Without Slowing Operations

Security upgrades can be frustrating when they make everyday work harder. A door that takes too long to open, creates bottlenecks at shift change, or fails at the worst time can turn “better protectio...

Daily Bulletin - avatar Daily Bulletin

Why a Document Destruction Service Still Matters for Modern Businesses

Businesses generate large volumes of information every day, from staff records and contracts to invoices, reports and customer files. While attention often focuses on how documents are stored, the way...

Daily Bulletin - avatar Daily Bulletin

Bicycle Rack Safety and Space-Smart Storage

Bike storage problems usually show up as small annoyances first: tangled handlebars, scratched frames, and bikes that topple when you pull one out. Over time, those issues become safety risks, especia...

Daily Bulletin - avatar Daily Bulletin

How to Tell if a Childcare Centre Is a Good Fit for Your Child

Choosing childcare can feel like you’re making a huge decision with limited information. Tours are short, centres are often on their best behaviour, and your child might act differently in a new space...

Daily Bulletin - avatar Daily Bulletin

Car Import Timeline: What Usually Happens at Each Stage

Importing a car into Australia can feel confusing because multiple agencies and checkpoints are involved, and the timeline is shaped as much by paperwork quality as it is by shipping speed. The most u...

Daily Bulletin - avatar Daily Bulletin

Portable Toilet Hygiene Standards Explained: Clean vs Sanitised vs Disinfected

In portable toilet servicing, the words clean, sanitised, and disinfected often get used as if they mean the same thing. They don’t. And that difference matters because a unit can look tidy and still ...

Daily Bulletin - avatar Daily Bulletin

The Daily Magazine

Gold Migration Lawyers in Liquidation: How the Closure Affects Your ART Appeal

If your appeal was with Gold Migration Lawyers, a recent change to how the Tribunal decides cases ...

The pressure cooker: life in urban Australia in 2026

Australian cities have always been demanding. Long commutes, rising housing costs, busy schedules a...

What Actually Makes a Good Criminal Lawyer in Melbourne

Most people only think about this question once. That is usually too late. Most people charged wi...

Why Working With A Chatswood Tutor Can Improve Academic Performance

Academic expectations continue increasing for students across primary school, high school, and senio...

Is It Worth Getting Solar Panels in Melbourne?

The real question is not whether solar works in Melbourne. It works. The question is what it is co...

How A Diploma Of Project Management Builds Practical Skills For Modern Work Environments

Developing the ability to plan, execute, and deliver outcomes efficiently is a key requirement in to...

How to Choose the Right Football for Every Level

Choosing a football may seem straightforward, but the right option depends on who will be using it a...

What to Ask a Wedding Photographer Before You Book

Booking a wedding photographer can feel deceptively simple: you like the photos, you like the vibe...

Why Stress Relief For Dogs Is Essential For Emotional Balance And Long-Term Wellbeing

Managing emotional health is just as important as physical care when it comes to pets, which is why ...