According to our website, Unbounded Analytics “provides infinite flexibility to generate new insights from all unsampled, high-cardinality data.” Great! But wait… what does high cardinality data even mean?
At first glance data cardinality is one of those buzzwords whose only purpose is to make you look smarter in a smug way. Everyone hates buzzwords – myself included. In fact, I was on a Zoom call the other day and found out that the phrase “data cardinality is ambiguous” is actually the first suggested search in Google when you type, “Data Cardinality.” To that end, let’s double click into data cardinality with the following primer on the term that’s got everyone consulting Dr. Google.
While researching data cardinality for this post, I came across multiple articles that went down into the technical weeds way too quickly. That's not helpful to most folks, myself included, so we aren't going to do that. Before we begin, though, it's worth pointing out that data cardinality can mean slightly different things depending on context, so let's break those meanings out one by one.
Data Modeling and Data Cardinality
If you're looking to build some kind of data model, data cardinality describes the relationship between rows in two sets of data. In the table below, we have two columns: one with numbers and the other with letters. You may notice that every number has a unique corresponding letter, making for a one-to-one relationship:
An example of one-to-one data cardinality, where each individual letter in column I corresponds to a unique number in column II. In this kind of relationship, it can be said that the data cardinality is high in either column.
There are three basic categories of data cardinality: one-to-one, one-to-many, and many-to-one. The latter two are pictured below:
An example of one-to-many data cardinality, where a single letter in column I corresponds to multiple numbers in column II. This data model is normally used for elements such as product categories, where one category is reused for multiple products (e.g. headphones, speaker, laptop).
An example of many-to-one data cardinality, where different letters in column I correspond to the same number in column II. While one-to-many can be used to assign a product category, many-to-one can be used to assign multiple invoices to products, as in many invoices may refer to the same product.
And finally, there's also the infamous many-to-many relationship. You either hate it or you love it. There are, however, real use cases, such as authors writing books: one book can have multiple authors, and one author can write multiple books. Don't get me started on the database management part of that!
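A many-to-many relationship is usually modeled with a third "junction" table that links the two sides together. Here's a minimal sketch in Python using dictionaries and tuples (the author and book names are made up for illustration):

```python
# Hypothetical data: two tables keyed by ID.
authors = {1: "Ada", 2: "Grace"}
books = {10: "Databases 101", 20: "Time Series in Practice"}

# The junction table: each row links one author to one book.
# An author can appear in many rows, and so can a book --
# that's exactly what makes the relationship many-to-many.
author_book = [(1, 10), (2, 10), (2, 20)]

# Books written by author 2 ("Grace"):
books_by_grace = [books[b] for a, b in author_book if a == 2]
print(books_by_grace)  # ['Databases 101', 'Time Series in Practice']
```

In a relational database, that junction table would be its own table with two foreign keys, one pointing at each side of the relationship.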
Data in Databases
You may have heard the term data cardinality used differently – particularly when someone refers to high or low data cardinality – so let's talk about what that means. When referring to databases, data cardinality broadly refers to the total number of unique (or, in database terms, distinct) values in a dataset. Unlike in the previous context, where cardinality described the relationship between two datasets, here it's simply a count of unique values (or N, if you want to be super mathy). If we have a very small set of data, like in our pictures above, the data cardinality can be at most the number of entries. For further clarification, see the picture below:
In this example, the count of data cardinality is five, since there are five unique entries. If we add another E to the data set though, the number of items grows to six, while the data cardinality stays at five.
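That "count of distinct values" definition is easy to see in code. Here's a minimal sketch of the example above, using a set to count unique entries:

```python
# Six items in total, but only five unique values --
# the extra "E" doesn't change the cardinality.
data = ["A", "B", "C", "D", "E", "E"]

print(len(data))       # 6 -- total number of items
print(len(set(data)))  # 5 -- the cardinality stays at five
```

This is exactly what a SQL `COUNT(DISTINCT column)` does under the hood: it counts unique values, not rows.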
Data cardinality gets a little bit trickier with time series databases. Here's where the big issues start coming into play and why you've probably been hearing this term thrown around so much. With time series data, there is usually a bunch of additional data attached to the main data being examined (called "tags" or "metadata"). For instance, if I were tracking solar power generation over time, I would want to know the wattage of power I was generating, but also at what times I was generating that amount of power, and maybe even the date. In this example, we have a dataset that looks like the below:
| Time    | Wattage   | Date      |
| ------- | --------- | --------- |
| 1:00 pm | 160 watts | 2/21/2021 |
| 2:00 pm | 120 watts | 2/22/2021 |
| 3:00 pm | 100 watts | 2/23/2021 |
In the pictured example, our data cardinality would be the total number of possible combinations, which in this case is 27 (3 timestamps x 3 wattage readings x 3 dates = 27). A cardinality of 27 might still be pretty small, but imagine if we were tracking solar generation over the course of one year, with 10,000 timestamps, 5,000 wattage readings, and 365 dates – over 18 billion possible combinations. If you're anything like me, that might still seem a little abstract. Frankly, who cares? So let's talk about the consequences of having high data cardinality at a higher level in a few quick sentences.
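The math above can be sketched in a few lines: the worst-case cardinality of a time series is the product of the unique values in each tag column. Here's a minimal version using the hypothetical solar readings from the table:

```python
from math import prod

# Hypothetical time-series rows: (time, wattage, date).
rows = [
    ("1:00 pm", "160 watts", "2/21/2021"),
    ("2:00 pm", "120 watts", "2/22/2021"),
    ("3:00 pm", "100 watts", "2/23/2021"),
]

# Count unique values per column, then multiply them together.
unique_counts = [len({row[i] for row in rows}) for i in range(3)]
worst_case = prod(unique_counts)
print(worst_case)  # 3 * 3 * 3 = 27 possible combinations

# Scaled up to a year of data:
print(10_000 * 5_000 * 365)  # 18,250,000,000 combinations
```

Note this is the worst case: the actual cardinality is the number of combinations that really occur, which is usually smaller, but databases often have to plan for the cross product.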
Here's the deal: having a huge number of unique values makes for a slow and expensive database – pretty simple stuff – and different databases have different methods of dealing with high data cardinality.
Data and Application Monitoring
When monitoring applications, data cardinality becomes a bit of a double-edged sword. To have a firm understanding of a given environment, more data is always better, but it can become tricky to handle due to its size. Typically, the data with the highest cardinality is the most useful for troubleshooting or getting to the bottom of issues. For instance, to debug a particular application, you would want as much specific data as possible to help pinpoint where the problem is.
High Cardinality Data
Let's say that you're a website host and you're experiencing a cyberattack, but you don't know it yet. Your company has some big name banks for customers – the kind of big names that have a lot of transactions and money running through their websites on a daily basis. At the moment, one of your key applications is suddenly slowing down and you don't know why. Some of your customers are reporting that they're losing money, which is causing them to become – err, let's use the term unhappy. In that scenario, you would probably want to figure out why that application is slowing down pretty quickly. You would need as much unique data as possible about that application to pinpoint exactly where your issue is: the kind of data that creates a lengthy chart with as many different permutations as possible. That's called high cardinality data.
| IP Address | Time of access | Latency (in milliseconds) | Task status | Error |
| ---------- | -------------- | ------------------------- | ----------- | ----- |
As in the sample above, high cardinality data is highly specific and unique – though a real dataset would include many more dimensions and many more rows than pictured.
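To make that concrete, here's a minimal sketch (with entirely made-up log records and IP addresses) of why a high-cardinality field like IP address is so useful for pinpointing a slowdown:

```python
from collections import defaultdict

# Made-up access-log records; a real system would have millions.
logs = [
    {"ip": "203.0.113.5",  "latency_ms": 40,   "status": "ok"},
    {"ip": "203.0.113.5",  "latency_ms": 55,   "status": "ok"},
    {"ip": "198.51.100.7", "latency_ms": 4800, "status": "error"},
    {"ip": "198.51.100.7", "latency_ms": 5200, "status": "error"},
]

# Group latencies by the high-cardinality field (IP address).
by_ip = defaultdict(list)
for record in logs:
    by_ip[record["ip"]].append(record["latency_ms"])

# Average latency per IP reveals which source is slow.
averages = {ip: sum(v) / len(v) for ip, v in by_ip.items()}
slowest = max(averages, key=averages.get)
print(slowest)  # 198.51.100.7 -- the source of the slowdown
```

With a low-cardinality field (say, just "ok"/"error"), you could see that something was wrong, but not where it was coming from – that's the troubleshooting value of high-cardinality data.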
Querying that kind of data requires a massive amount of computing power, so it can become problematic if a piece of software isn't equipped to handle it. Different pieces of software have different ways of handling high cardinality data. To see how Instana handles high cardinality data, check out our Unbounded Analytics documentation.
Medium Cardinality Data
For an example of medium cardinality data, let's take a lower pressure situation where you're running a parking garage that holds a large number of cars. You're monitoring an application that keeps track of the makes and models of cars inside your garage. The data your application generates is not necessarily unique since it does repeat, but there are still a good number of car makes and models out there, so your chart of data will be fairly large even if it does repeat. Data in this situation could be seen as medium cardinality. Handling and monitoring medium cardinality data shouldn't be much of a problem for most systems, but it might not be useful for any in-depth activity. For example, if a car were stolen out of your garage and you had to figure out which one it was, that might not be such an easy task with only make and model to go on if you have a lot of cars in your garage.
In this sample of medium cardinality data, some of the data may repeat, but the table of data may still be massive, stretching far beyond the six rows in this sample.
Low Cardinality Data
Low cardinality data is the easiest to understand and process, but also sometimes the least useful. Low cardinality data could come from an application that keeps a simple yes/no count from a single-question survey on a website. It's fairly simplistic and has its uses, but that's about it.
A sample of low cardinality data, where there may be only two binary opinions. While the above chart might be longer than the pictured sample, there will still only be two possible answers.
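Putting the three levels side by side makes the distinction obvious. Here's a minimal sketch with made-up columns – survey answers (low), car models (medium), and IP addresses (high) – where cardinality is just the count of unique values in each:

```python
# Hypothetical columns illustrating low, medium, and high cardinality.
survey = ["yes", "no", "yes", "yes", "no"]            # only 2 possible values
car_models = ["Civic", "Model 3", "Civic", "F-150"]   # repeats, but many possible values
ips = ["203.0.113.5", "198.51.100.7", "192.0.2.9"]    # nearly every value unique

for name, column in [("survey", survey), ("car_models", car_models), ("ips", ips)]:
    print(f"{name}: {len(set(column))} unique out of {len(column)} rows")
```

No matter how many rows the survey column grows to, its cardinality stays at two – while the IP column's cardinality keeps growing with the data.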
So there you have it: that's the SparkNotes version of data cardinality. Next time someone says something has high or low data cardinality, you'll know what they're talking about – either in order to understand them, or in order to call bullshit on them for using a buzzword without knowing what they're talking about. In the meantime, if you have any questions, feel free to reach out – I'm most easily found on Twitter @iamcippino. Also feel free to check out Instana's play-with application, a pre-configured online observability sandbox, to see how Instana handles high-cardinality data.