What is Data Sampling and Why Should You Avoid It?

Published: January 13, 2016 | Updated: March 20, 2019 | Category: Analytics

If you are serious about growing your business, then you know how critical data-driven decisions are. You need solid numbers and reliable insights. No wonder data sampling is considered a real pain point in the web analytics world.

Sampled data may simply not be good enough. It threatens the accuracy of your reports by increasing the chance of bias, excluding data that might be necessary and leading you astray.

If you want to know more about the importance of data for your strategy, have a look at our post: 7 Steps to Make Data-Based Decisions with Web Analytics

Fortunately, there are some ways to overcome issues with data sampling and ensure your reports always provide you with complete data. And today we’ll walk you through some possible solutions so your analysis won’t be hurt by unreliable reporting.

What is Data Sampling?

First, let’s start with a definition. Sampling is a common statistical technique used, for instance, in political or opinion polls. If a researcher wants to determine the most popular way of commuting to work in the US, they do not need to talk to every American. Instead, they can select a representative group of 1,000 people, hoping it will be enough to make the results accurate.
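To put a rough number on "hoping it will be enough", here is a minimal sketch of the textbook margin-of-error formula for a proportion measured on a simple random sample. It assumes the normal approximation, a 95% confidence level (z = 1.96) and the worst-case 50/50 split; the function name is only illustrative.

```python
import math

def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# How the uncertainty shrinks as the sample grows.
for n in (100, 1_000, 10_000):
    print(f"n={n:>6}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under these assumptions a poll of 1,000 people is accurate to roughly plus or minus 3 percentage points, but only if the group really is a random, representative one.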

In web analytics, sampling works in a very similar way. Only a subset of your traffic data is selected and analyzed, and that sample is used to estimate the overall results. But can you really be sure that your software will choose a representative set of your traffic?

Unfortunately, most web analytics tools automatically start sampling data once you reach a particular limit of actions tracked on your website.
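The idea of estimating overall results from a subset can be sketched in a few lines. This is a simplified illustration with made-up traffic, not the algorithm of any particular analytics tool: each session is kept with a fixed probability, and the count found in the sample is scaled back up.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical traffic: 100,000 sessions, roughly 2% of which convert.
sessions = [random.random() < 0.02 for _ in range(100_000)]

def estimate_conversions(all_sessions, sample_rate, rng):
    """Keep each session with probability sample_rate, then scale the count up."""
    sampled = [converted for converted in all_sessions if rng.random() < sample_rate]
    return sum(sampled) / sample_rate

actual = sum(sessions)
estimate = estimate_conversions(sessions, sample_rate=0.10, rng=random.Random(7))
print(f"actual conversions:          {actual}")
print(f"estimated from a 10% sample: {estimate:.0f}")
```

The estimate is usually close to the real figure but rarely equal to it, and the gap changes every time a different sample is drawn.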

What Are Data Sampling Methods?

Samples come in different shapes and sizes, as there are various data sampling methods. You can basically divide them into probability and non-probability sampling.

Probability sampling is an approach in which the sample is drawn from a larger population by a random, statistically grounded procedure. In its simplest form, you pick random numbers that correspond to points (users) in the data set, so that everyone in the set has an equal chance of being selected.

This particular method provides the best chance to create a truly representative sample of the population.

On the other hand, non-probability sampling is an approach where the sample is determined by the subjective judgment of an analyst. This means that not every point in the population has a chance of being selected.

Applying this method makes it less likely that the sample accurately reflects the larger population.
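The difference between the two approaches is easy to see on a toy data set. The sketch below uses invented session durations and compares a simple random sample (one form of probability sampling) with a convenience sample that takes only the first rows (one form of non-probability sampling). All numbers are hypothetical.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical data set: session durations in seconds, ordered by time of day.
# Night-time sessions (first half) are shorter than daytime ones (second half).
night = [rng.uniform(10, 60) for _ in range(5_000)]
day = [rng.uniform(60, 300) for _ in range(5_000)]
population = night + day

# Probability sampling: every session has the same chance of being selected.
random_sample = rng.sample(population, 500)

# Non-probability (convenience) sampling: the analyst just takes the first 500 rows.
convenience_sample = population[:500]

print(f"population mean:    {statistics.mean(population):.1f}")
print(f"random sample mean: {statistics.mean(random_sample):.1f}")
print(f"convenience mean:   {statistics.mean(convenience_sample):.1f}")
```

Because the convenience sample contains only night-time sessions, its average duration lands far below the real one, while the random sample stays close to it.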

Issues with Sampled Data

We’ve already mentioned that web-data sampling may result in far less accurate reports. The best way to understand it is to get some first-hand experience. Have you ever tried to compare reports based on sampled and complete traffic? You can do this by comparing full Piwik PRO reports with sampled Google Analytics reports (provided the number of sessions has hit the limit).

It is a good idea to see what kind of discrepancies you may be facing and then decide if the issue is serious or not. Even if your discrepancy is less than 10%, which is not bad at all, it may not stay this way forever.

It is better to take proper steps to get full and clean reports. Note that a discrepancy of around 5% between different tools is considered acceptable and depends on numerous factors.
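When comparing two reports for the same site and date range, the discrepancy can be expressed as a relative difference. A small sketch with hypothetical session counts follows; the helper name and the figures are illustrative only.

```python
def discrepancy_pct(reference: float, other: float) -> float:
    """Relative difference of `other` from `reference`, in percent."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return abs(other - reference) / reference * 100

# Hypothetical session counts for the same site and date range.
unsampled_sessions = 120_000
sampled_sessions = 111_600

print(f"{discrepancy_pct(unsampled_sessions, sampled_sessions):.1f}%")  # prints 7.0%
```

Running the same comparison on several metrics, not only on total sessions, gives a better picture of where sampling distorts the reports most.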

In tools such as Google Analytics, your data is aggregated and delivered to you as a random data set, which means you cannot be sure if your reports are displaying the overall traffic and meaningful trends, or if the selected set is missing the point. When you invest a substantial amount of money and time into analyzing your reports, they should be accurate.

After all, you decide on the direction and next steps of development of your business based on this information. If you do not have this knowledge, you could just go with your gut, which may actually be a better idea than working with unreliable data.

Always make sure your analytics tool provides solid data and try to avoid sampling. Otherwise, you may be missing out on some crucial information.

Data Sampling in Google Analytics

Google Analytics is a commonly used web traffic analytics tool. Unfortunately, numerous users run into difficulties related to data sampling, even if they don’t get much traffic. Sampling kicks in automatically once the number of sessions in the date range you are analyzing exceeds 500,000 (Google Analytics Standard, the free version) or 100 million (Analytics 360).

Google Analytics 360 comes at a price of $150,000 per year, which is a lot compared to other solutions. The bad news is that even this investment may not solve the problem of data sampling in Google Analytics.

You know that you have a sampling problem when you see a yellow bar at the top right of your report saying, “The report is based on x visits (x% of visits).”

If you get reports based on 100% of sessions, this does not concern you. The lower the sample size, the bigger the sampling issue you face. When you notice your sample is below 10%, you can be quite sure that these reports are not of much use.

You may be able to monitor some ups and downs in your stats, but that’s it. Data sampling damages all detailed reports, which means your metrics may be way off the mark (by as much as 80%) and the numbers may not match up with reality.
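Why detailed reports suffer most can be shown with an idealized model. Assume each session is kept independently with a fixed probability and the sampled count is scaled back up; this is a simplification, not a description of how Google Analytics actually samples. Under that assumption the typical relative error of a count depends on the sample rate and on how many sessions fall into the segment being reported.

```python
import math

def relative_standard_error(true_count: int, sample_rate: float) -> float:
    """Relative standard error of a scaled-up count when each session is
    kept independently with probability sample_rate."""
    return math.sqrt((1 - sample_rate) / (true_count * sample_rate))

# Large totals survive sampling well; small segments do not.
for true_count in (100_000, 1_000, 50):
    for rate in (0.5, 0.1, 0.01):
        rse = relative_standard_error(true_count, rate)
        print(f"count={true_count:>7}  sample={rate:>4.0%}  typical error ~{rse:.0%}")
```

In this model a site-wide total of 100,000 sessions stays within a few percent even at a 1% sample, while a segment with only 50 sessions (say, purchases of one product) can easily be off by 40% or more at a 10% sample.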

Google Analytics provides unsampled data when it comes to visits and page views, but when you analyze detailed metrics, such as those in e-commerce reports, you may notice that the numbers don’t add up. You cannot read and interpret this kind of information properly without having complete data.

Does Piwik PRO Sample Your Data?

By default, Piwik PRO does not sample your data. One of the greatest advantages of Piwik PRO over other analytics tools is that you get unsampled data at all times. Moreover, when you self-host, the only data limit is your server capacity.

If you decide to rely on Piwik PRO Cloud, the limit is 500 million actions per month. Very few websites reach this number of actions, and there are plans suitable for websites of all shapes and sizes.

To avoid sampling issues, many will suggest you go for GA 360. You could, if you can afford to do so, but we suggest trying out Piwik PRO first. With 100% of your data, you can be 100% confident your reports are correct. Piwik PRO is much more affordable, free of sampling issues, and allows you to completely own your data. So why not give it a go?

Get in touch with us

Author:

Karolina Gawron, Content Marketer

Content manager at Piwik PRO

Author:

Karolina Matuszewska, Content Marketer

Content Marketer at Piwik PRO. Specializing in issues of on-site and off-site personalization. Transforming technical jargon into engaging and informative articles for digital marketers and web analytics specialists.
