This post was originally published in January 2016.
Data sampling is a standard practice in several major analytics platforms, and it has its advantages in certain situations. However, applying it automatically, without being aware of the consequences of working on a sample, can cause problems, including inaccurate reports.
In this article, we will show you how data sampling works. We’ll also help you understand when it works well and when it doesn’t.
What is data sampling in analytics?
Sampling is a common statistical technique used, for instance, in political and opinion polls. If a researcher wants to determine the most popular way of commuting to work in the US, they don’t need to talk to every American citizen. Instead, they can select a representative group of, say, 1,000 people and trust that it’s large enough to produce accurate results.
In web analytics, sampling works in a very similar way. Only a subset of your traffic is selected and analyzed, and that sample is used to estimate the overall results.
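To see how estimating from a subset works, here’s a minimal Python sketch with made-up session data; the numbers and the 5% sampling rate are our assumptions, not how any particular platform works:

```python
import random

# Made-up raw data: a pageview count for each of 1,000,000 sessions.
sessions = [random.randint(1, 10) for _ in range(1_000_000)]

# Analyze a 5% simple random sample instead of every session...
sample = random.sample(sessions, k=len(sessions) // 20)

# ...and scale the sample result back up to estimate the overall total.
estimated_total = sum(sample) * (len(sessions) / len(sample))

print(f"Estimated total pageviews: {estimated_total:,.0f}")
print(f"Actual total pageviews:    {sum(sessions):,}")
```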
What are data sampling methods?
Samples come in different shapes and sizes, as there are various data sampling methods. You can divide them into probability and non-probability sampling.
- Probability sampling is an approach in which samples are chosen using statistical methods. You select random numbers corresponding to points (users) in the data set, so every point has a known, non-zero chance of being selected; in the simplest variant, simple random sampling, every point has an equal chance.
This method provides the best chance of creating a representative sample.
- Non-probability sampling is an approach in which the sample is determined by the subjective judgment of an analyst. It means that not all points of the population have a chance of being selected.
With this method, it’s less likely that the sample accurately represents the larger population (both approaches are sketched below).
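Here’s a quick Python sketch of the difference; the visitor list and the 500-person sample are invented for illustration:

```python
import random

# Made-up population: 10,000 visitors, each tagged with a traffic source.
population = [
    {"id": i, "source": random.choice(["search", "social", "direct"])}
    for i in range(10_000)
]

# Probability sampling: drawn at random, so every visitor has an equal
# chance of ending up in the sample.
probability_sample = random.sample(population, k=500)

# Non-probability sampling: the analyst picks a convenient subset, here
# simply the first 500 visitors. Everyone after them has no chance of
# being selected, so the sample can easily end up skewed.
non_probability_sample = population[:500]
```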
Issues with sampled data
Data sampling is designed to speed up reporting in web analytics, but depending on the circumstances and sampling approach, it may cause issues.
Issues with sampled data might include the following:
- Unrepresentative samples: Some tools, like Google Analytics, have limits on data collection and start sampling past a certain point, no matter how much traffic you have. This means that the bigger your website grows, the less accurate your reports become.
For example, if your site generates 60 million hits per month and 60,000 visits per day, sampling can limit you to 10 million hits per month and 10,000 visits per day or less. This makes it hard to obtain a faithful representation of your data, and the more your website grows, the more skewed your reports become.
- Performance versus accuracy trade-offs:
- A larger sample takes more time to process, so your reports load more slowly, but the results are more accurate.
- A smaller sample speeds up report loading, but at the cost of accuracy. If you limit the sample too much, you might not see actual patterns and could miss opportunities you would have noticed with the whole picture.
- Sampling errors: Every sample carries some error. Errors may occur due to high variation in a particular metric within a given date range, or due to an overall low volume of a given metric in proportion to visits. For example, if your site has an extremely low transaction count compared to total visits, sampling may cause significant discrepancies (see the sketch after this list).
That said, there will always be some mismatch between analytics platforms; according to some experts, a discrepancy of up to 5% is acceptable.
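Here’s a minimal Python sketch of both effects, sample size and metric rarity, using invented numbers rather than any platform’s real mechanics. Run it a few times and watch the smallest sample swing around the true rate while the largest stays close:

```python
import random

N = 1_000_000        # total visits in the date range (invented)
TRUE_RATE = 0.002    # a rare metric: only 0.2% of visits convert (invented)
visits = [random.random() < TRUE_RATE for _ in range(N)]

for sample_size in (1_000, 10_000, 100_000):
    sample = random.sample(visits, k=sample_size)
    estimated = sum(sample) / sample_size
    print(f"sample size {sample_size:>7,}: "
          f"estimated rate {estimated:.3%} vs. true rate {TRUE_RATE:.3%}")
```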
Web data sampling may result in far less accurate reports and can hide crucial insights from your data, directly impacting business efficiency.
Data sampling in Google Analytics
Google Analytics platforms, including Universal Analytics, Google Analytics 360 (GA 360), and Google Analytics 4 (GA4), use probability sampling: a random subset of your data is selected and used to calculate the report. A few things changed with the launch of GA4, but the concept remains the same.
Google Analytics samples your reports based on the number of sessions, and each version has a different session limit. Default reports are unsampled, but if you apply ad hoc queries, such as secondary dimensions or segments, your data gets sampled after reaching the following thresholds:
- In Universal Analytics, it occurs when your ad hoc reports hit 500,000 sessions at the property level for the chosen date range.
- Google Analytics 360 has a much higher sampling threshold. You won’t have to worry about sampling unless your ad hoc reports hit 100 million sessions at the view level for the chosen date range.
Sampling in Google Analytics 4
As in Universal Analytics and GA 360, standard reports in GA4 are unsampled. Sampling occurs in explorations, the advanced analyses used for funnels, paths, cohorts, segment overlap, and more, when the queried data exceeds 10 million events (1 billion events in the case of GA4 360).
With data sampling, your reports aren’t a fully accurate representation of user behavior. Google displays what percentage of the available data a given report is based on. If that percentage falls below 70-80%, you shouldn’t fully trust the numbers you’re getting.
You may also like: 6 key Google Analytics limitations
Data sampling: Good or bad?
If you are trying to analyze data that needs to be precise, like your website’s conversion rate or total revenue, then data sampling can cause problems. It’s always better to work on a complete data set when it doesn’t impede the speed of your reporting.
But it’s unfair to disregard sampled data entirely. Sometimes you can’t avoid sampling. If a report covers a vast number of events or sessions, it may take a very long time to generate, or it may exceed the time limit and not generate at all.
Many factors affect your reports’ performance:
- Dimension cardinality. Cardinality is the number of unique values a dimension can contain. For example, the ‘Mobile’ dimension in Google Analytics has two values, ‘Yes’ and ‘No’, so its cardinality is two. High-cardinality dimensions produce far more report rows (see the sketch after this list).
- Vast data amount. The data volume used for report calculation affects the report’s speed. This might be caused, for example, by a high volume of sessions or a time range that includes extended periods, such as several years of data.
- Number of applied filters. For example, you can use filters to exclude traffic from particular IP addresses, include data from specific subdomains or directories, or convert dynamic page URLs to readable text strings. More filters mean lower performance.
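As a rough illustration of the cardinality point above, here’s a Python sketch with a made-up data structure (not any platform’s internals):

```python
# Made-up report data: each row is one session with a few dimensions.
sessions = [
    {"mobile": "Yes", "country": "US", "page": "/pricing"},
    {"mobile": "No",  "country": "DE", "page": "/blog"},
    {"mobile": "Yes", "country": "US", "page": "/"},
]

# Cardinality of a dimension = the number of unique values it contains.
mobile_cardinality = len({s["mobile"] for s in sessions})  # 2: "Yes"/"No"
page_cardinality = len({s["page"] for s in sessions})      # grows with the site

# A report combining dimensions can contain up to the product of their
# cardinalities in rows, which is why high-cardinality dimensions make
# reports slower to calculate.
max_report_rows = mobile_cardinality * page_cardinality
print(max_report_rows)  # 2 * 3 = 6 for this tiny example
```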
Let’s say you manage a popular website. You’ve generated a custom report and selected a date range that includes 5 million sessions. Instead of building the custom report from all of those sessions, you might use half of them and still get valuable insights.
Now your analytics platform only needs to calculate figures based on half of the data, and the report loads more quickly.
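A toy benchmark of the same idea in Python; a plain sum stands in for real report aggregation, and all numbers are invented:

```python
import random
import time

# Invented data: a pageview count for each of the 5 million sessions above.
sessions = [random.randint(1, 10) for _ in range(5_000_000)]
half = random.sample(sessions, k=len(sessions) // 2)

for label, data in (("full data set", sessions), ("50% sample", half)):
    start = time.perf_counter()
    total = sum(data)  # stand-in for real report aggregation
    elapsed = time.perf_counter() - start
    print(f"{label}: total {total:,} computed in {elapsed:.3f}s")
```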
Most web analytics platforms automatically start sampling data once you reach a particular limit of actions tracked on your website. But you can choose an analytics platform that reports 100% of your data by default and lets you apply sampling only when necessary.
Read more about data sampling and how it weakens your reporting:
Raw data and sampled data: How to ensure accurate data
Does Piwik PRO sample your data?
By default, Piwik PRO doesn’t sample your data. You get unsampled data at all times, unless you decide sampling is necessary.
In Piwik PRO, sampling serves one purpose: improving report performance. The sample is drawn from the entire data set, and the larger the sample, the more accurate the results.
If you experience problems loading reports, you can enable data sampling and choose the sample size. Data is sampled by visitor ID, so the context of a session isn’t lost. This means you can still use funnel reports, which analyze users’ paths across sessions and require complete paths for accurate results.
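Piwik PRO doesn’t publish its implementation, but sampling by visitor ID is typically done along these lines; the function name, hashing scheme, and event structure below are our illustration, not Piwik PRO’s code:

```python
import hashlib

def keep_visitor(visitor_id: str, sample_percent: int) -> bool:
    # Hashing the visitor ID keeps either all of a visitor's events or
    # none of them, so sessions and funnel paths stay intact.
    digest = hashlib.sha256(visitor_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # a number in the range 0..65535
    return bucket < 65536 * sample_percent // 100

# A 10% sample still contains every event of the visitors it includes.
events = [
    {"visitor_id": "a1", "action": "view /pricing"},
    {"visitor_id": "a1", "action": "add to cart"},
    {"visitor_id": "b2", "action": "view /"},
]
sampled = [e for e in events if keep_visitor(e["visitor_id"], 10)]
```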
No data is removed: it’s still collected even if traffic limits are exceeded, and you can use it later, for example, after upgrading to a paid plan.
Conclusion
Sampled data may not be good enough for accurate data analysis. Yet, sampling can be particularly useful with data sets that are too large to analyze as a whole.
Always make sure your analytics platform provides solid data, and use sampling only when working with the full data set affects the load time of your reports. Otherwise, you may miss out on information that could be critical for your business.