Privacy by design in practice: How “just enough” data beats “just in case” collection

Written by Małgorzata Poddębniak, Matt Gershoff

Published March 09, 2026

This article is based on Matt Gershoff's session at Piwik PRO Day 2025. Matt is the CEO and co-founder of Conductrics, an experimentation and customer insight platform built for complex, data-centric enterprises. Conductrics unifies full-stack experimentation, transparent machine learning, and direct customer feedback into a single, disciplined workflow – built with privacy-by-design principles at its core.

In analytics, there’s a persistent belief that more data is always better. But what if the opposite is true? What if collecting less data – deliberately, intentionally, with a clear purpose – actually makes your analytics more effective while reducing compliance risk?

This is the central question behind privacy by design, a framework that’s transforming how organizations approach data collection in a world of GDPR, HIPAA, and increasingly strict privacy regulations.

Collecting more data "just in case" may feel safer, but according to Matt Gershoff, it's one of the biggest sources of unnecessary compliance risk, analytical noise, and wasted organizational resources in the analytics industry today. His approach of "just enough" data collection is more intentional, more aligned with privacy regulation, and often more analytically effective.

In this article, you’ll learn:

  • Why data minimization principles improve analytics quality, not just compliance
  • How to calculate the marginal value of data before collecting it
  • Practical techniques for privacy-preserving analytics, like K-anonymity and data binning
  • How to implement privacy by design without sacrificing analytical power

What is Privacy by Design?

Privacy by Design is an approach that embeds privacy and data protection into systems, processes, and technologies from the outset rather than adding them as an afterthought.

Why “just in case” data collection creates more problems than it solves

Matt Gershoff describes what he calls a data maximalist culture – the instinct to collect data for all possible future questions rather than for specific, defined ones.

The shadow objective behind this approach is maximum optionality: keeping every possible door open, regardless of whether you’ll ever walk through it.

Data maximalism comes with hidden costs, which can be especially severe for organizations operating in regulated industries, such as healthcare, finance or government. The most notable risks include:

  • Compliance exposure: The more granular and identifiable your data, the greater your risk under GDPR, HIPAA, and other privacy regulations. Healthcare organizations paid over $100 million in HIPAA fines between 2023 and 2025 due to tracking violations, with individual penalties reaching up to $2.1 million.
  • Analytical noise: More data doesn’t necessarily mean better answers. Most additional data points add complexity without improving the signal.
  • The gambler’s fallacy: Teams believe “just one more data point” will unlock massive value, when the payoff rarely arrives.

Understanding Privacy by Design: What GDPR Article 25 actually requires

The alternative is Privacy by Design – a framework developed by Dr. Ann Cavoukian in 1997 and later codified in Article 25 of the GDPR. While many organizations view it as a compliance checkbox, it’s actually a methodology for building better systems from the ground up.

The framework’s second principle, privacy as the default, has three practical implications:

  • Minimize what you collect. Only gather personally identifiable information (PII) that is directly needed to answer the question at hand. If aggregate data will do, don’t collect individual-level data.
  • Start non-identifiable by default. Your systems should begin with the least identifying data possible and increase granularity only when there’s a specific, justified reason to do so.
  • Reduce linkability. Don’t connect all of your data simply because you can. Many analytical questions can be answered on partitioned, decoupled datasets without ever linking back to an individual.

Data minimization in practice: Calculating the marginal value of data

Before adding a new data point, Gershoff recommends asking: What would we lose if we didn’t have it?

Here is an example: You’re tracking customer spend down to the cent. What if you removed cents and kept only whole euros or dollars? In most cases, your analysis wouldn’t meaningfully change.

You can test this approach by taking one sensitive or high-granularity data field, creating a binned version, and running your analysis on both.

Gershoff’s results for this test are as follows:

  • Binning raw, normally distributed data into just 10 groups retained 0.98 correlation with the original dataset
  • Almost no meaningful information was lost
  • Identifiability dropped significantly 
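You can reproduce this kind of test yourself. The sketch below (a minimal illustration with simulated data, not Gershoff's actual code) bins normally distributed values into 10 equal-width groups, replaces each value with its bin midpoint, and measures how much correlation with the raw data survives:

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(loc=100, scale=15, size=10_000)  # e.g. customer spend

# Bin into 10 equal-width groups and replace each value with its bin midpoint
edges = np.linspace(raw.min(), raw.max(), 11)
bin_idx = np.clip(np.digitize(raw, edges) - 1, 0, 9)
midpoints = (edges[:-1] + edges[1:]) / 2
binned = midpoints[bin_idx]

corr = np.corrcoef(raw, binned)[0, 1]
print(f"correlation raw vs. binned: {corr:.3f}")
```

On data like this, the correlation typically lands at 0.98 or above – the binned version carries nearly all the signal while each individual's exact value disappears into a group of thousands.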

Why simpler analytics models often outperform complex ones

Gershoff draws on physicist Murray Gell-Mann’s concept of effective complexity: the measure of actual structure, or regularity, that exists within a system, distinct from random noise.

Analysts often overestimate the complexity of what they’re looking for, when in fact:

  • Most patterns in customer behavior are simple
  • Most A/B tests yield small, clean differences, or none at all
  • The belief that vast datasets will reveal vast structures is usually unfounded

Why does it matter for data collection? Complexity and data volume are inherently related. When you assume the solution will be complex, you collect complex data to match. When you allow for simple solutions, you can collect less and build leaner, easier-to-audit systems.

K-anonymity explained: How to preserve privacy without losing analytical power

K-anonymity is a privacy-preserving approach where individuals are grouped into equivalence classes – each person is indistinguishable, on the recorded attributes, from at least K-1 others.

Think “Where’s Waldo?”: If everyone in a group looks the same from a data perspective, you can’t single out any individual.
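In code, the check is straightforward: group records by their quasi-identifying attributes and verify that every group has at least K members. This is a minimal sketch; the field names and records are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears at least k times in the dataset."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

records = [
    {"age_band": "30-39", "region": "North", "spend": 120},
    {"age_band": "30-39", "region": "North", "spend": 95},
    {"age_band": "40-49", "region": "South", "spend": 200},
]

# The (40-49, South) group has only one member, so this fails for k=2
print(is_k_anonymous(records, ["age_band", "region"], k=2))  # False
```

In practice, datasets that fail the check are repaired by coarsening the quasi-identifiers (wider age bands, larger regions) until every group reaches size K.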

The analytical advantage here is that most A/B testing methodologies rely on linear regression under the hood. And linear regression works perfectly well on aggregated, K-anonymous data.

You can use K-anonymous data to run:

  • Standard A/B tests
  • Multivariate tests
  • Variance reduction methods like CUPED
  • Heterogeneous treatment effect analyses
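The key insight is that the standard A/B estimate depends only on group-level aggregates, not on individual rows. A small sketch with simulated data (hypothetical numbers, purely illustrative) shows that the per-variant means and counts are all you need:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
variant = rng.integers(0, 2, n)              # 0 = control, 1 = treatment
outcome = 10 + 2 * variant + rng.normal(0, 5, n)

# Individual-level estimate: difference in group means
# (equivalent to the OLS slope on the variant dummy)
individual_effect = outcome[variant == 1].mean() - outcome[variant == 0].mean()

# Aggregated, k-anonymous view: one row per variant with mean and count
agg = {v: (outcome[variant == v].mean(), int((variant == v).sum()))
       for v in (0, 1)}
aggregated_effect = agg[1][0] - agg[0][0]

print(np.isclose(individual_effect, aggregated_effect))  # True
```

The aggregated table never exposes an individual outcome, yet it yields the identical treatment-effect estimate – which is why the techniques listed above remain available on K-anonymous data.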

This is particularly significant for organizations in healthcare or other regulated sectors. HIPAA’s de-identification standards map closely to K-anonymity principles: when data cannot reasonably identify an individual, it falls outside the scope of protected health information (PHI).

Building intentional data collection: A practical framework

Running through Gershoff’s entire talk is intentional design – the idea that data collection should be deliberate, purposeful, and customer-oriented.

It reframes privacy compliance not as a constraint on analytics, but as a discipline that makes analytics better. When you’re forced to articulate why you’re collecting a data point, you naturally eliminate noise.

Questions to ask before collecting any new data point

Before adding any data point to your tracking setup, ask:

  • What specific question does this answer? If you can’t name it, you probably don’t need the data.
  • What would you lose if you had less granularity? Try a binned version before committing to raw, high-fidelity storage.
  • Can this be stored at the task level rather than the individual level? Many analytical problems can be solved on aggregated data without ever touching PII.
  • What is the marginal value of this data point? If you’d pay very little for it on an open market, that’s a signal.
  • Does the complexity of your solution match the likely complexity of the answer? Most patterns are simple. Start there.
  • Could this data point create compliance exposure? Consider whether it could be linked to protected health information, financial data, or other regulated categories.

How Piwik PRO enables privacy by design analytics

The principles Gershoff describes aren’t just theoretical — they’re how Piwik PRO’s platform is architected from the ground up.

Privacy by design approach

Rather than defaulting to maximum data retention, Piwik PRO gives organizations control over what they collect, how long they store it, and how it’s linked — with privacy settings configured at the site or app level rather than as an afterthought.

Flexible data collection methods

Organizations can choose between full consent-based tracking, cookieless analytics, or anonymous tracking depending on their specific compliance requirements and analytical needs. Whether you’re a healthcare provider subject to HIPAA, a financial institution navigating GLBA, or an organization managing GDPR compliance, you can adjust the platform to your specific regulatory requirements.

Learn more about different data collection options: Anonymous tracking: How to do useful analytics without personal data

Anonymous tracking for complete insights

Organizations that don’t need to identify individual users can collect complete, useful analytics data without processing any personal data, meeting their analytical needs and privacy requirements. This becomes especially valuable under emerging frameworks like the EU’s Digital Omnibus, which allows first-party analytics without consent when certain data minimization criteria are met.

Built-in data minimization

Piwik PRO offers:

  • Configurable data retention periods by data type
  • Session-level custom dimensions that don’t require persistent user identifiers
  • Aggregation options for reporting without exposing individual-level data
  • First-party data ownership with no third-party processing

This means organizations can start with minimal data collection and increase granularity only when there's a demonstrated business need, rather than collecting as much as possible and trying to connect the data points afterwards.

The future is built on intentional data practices

The argument behind collecting less data isn’t just a compliance story or a practical guide to staying on the right side of GDPR or HIPAA. It goes deeper into analytical quality and organizational integrity.

When teams collect data without a defined purpose, they accumulate noise that makes the data harder to interpret. They build systems that are difficult to audit, explain to regulators, and maintain. And they miss the opportunity to understand customers well enough to actually serve them better.

With the EU’s Digital Omnibus proposals, the emergence of state-level privacy laws across the US, and increasing regulatory scrutiny globally, organizations that adopted privacy-by-design principles early will find themselves with a significant competitive advantage. Not because they were forced to, but because they recognized that less, collected deliberately, is analytically more powerful than more, collected just in case.

Ready to implement privacy by design analytics?

Get started with privacy-friendly analytics today or see Piwik PRO in action and learn how we support data minimization without sacrificing analytics insights.