The ultimate guide to data anonymization in analytics [updated]


Written by Karolina Lubowicka

Published December 13, 2018

In the face of GDPR, many companies are looking for ways to process and utilize personal data without violating the new rules.

This is all quite difficult, as GDPR significantly limits the ways in which personal data can be collected and processed. One of the biggest challenges is the high bar the regulation sets for acquiring a visitor’s consent.

The two main obstacles are:

1) Under GDPR, consent can serve as a valid basis for processing user data only if it is a freely given, specific, informed and unambiguous indication of the data subject’s agreement to the processing of personal data relating to him or her.

If you want to dig deeper into the details of GDPR consent, we advise you to read this blog post:
How Consent Manager Can Help You Obtain GDPR-Compliant Consents From Your Users

2) GDPR has no grandfather provision allowing for the continued use of data collected using non-compliant methods prior to the date of GDPR’s entry into force. In practice, this means that all data collected before GDPR should be removed from databases if it doesn’t meet all the requirements (and most probably it doesn’t).

What’s more, the definition of personal data has broadened drastically, and now includes cookies and many other online identifiers used in web analytics. You can read more about it here:
What Is PII, non-PII, and Personal Data?

Every company wanting to process analytics data has to adjust their approach to meet the demands of the new law. We tackle this topic on our blog here:
How Will GDPR Affect Your Web Analytics Tracking?

Another option is seeking other legal bases allowing us to process data and use historical analytics databases without going into a gray area.

One of the most favorable methods seems to be data anonymization. It may prove a good strategy for retaining the benefits while mitigating the risks involved in dealing with user data.

The key benefits of data anonymization

Companies that use this technique can benefit from one very important fact – anonymous data is not personal data for the purposes of GDPR.

According to Recital 26 of GDPR: The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

Under the provision cited above, anonymous data doesn’t require any additional safeguards to ensure its security. Among other things, this means that:

  • you don’t need to get consent to process it
  • you can use it for other purposes than the ones it was originally collected for (you can even sell it!)
  • it can be stored for an indefinite period of time
  • it can be exported internationally

In other words, you can use it freely for virtually every purpose you want to.

PII vs personal data

Learn how to recognize PII and personal data to stay away from privacy issues

What’s more, data anonymization is a great way to prove that you’re making all possible efforts to ensure the security of your users’ data. According to data privacy experts, this technique can be treated as:

  • part of a privacy by design strategy
  • part of a risk minimization strategy
  • a way to prevent personal data security breaches
  • part of a data minimization strategy

These advantages, however, come at a price: anonymization is a very complicated and demanding process. It requires a lot of preparation and the use of specialized techniques. The benefits you receive are more like a reward for your hard work than low-hanging fruit.

What exactly is data anonymization?

Data anonymization is the use of one or more techniques designed to make it impossible – or at least more difficult – to identify a particular individual from stored data related to them.

According to London’s Global University, Anonymisation is the process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified.

An individual may be directly identified from their name, address, postcode, telephone number, photograph or image, or some other unique personal characteristic.

An individual may be indirectly identifiable when certain information is linked together with other sources of information, including, their place of work, job title, salary, their postcode or even the fact that they have a particular diagnosis or condition.

Which kinds of data should be anonymized

In the case of anonymization performed to align with the demands of GDPR, that would mean anonymizing every piece of information that can be classified as personal data.

Since, as we’ve already mentioned, the definition of personal data in GDPR is very broad, that will include such information as:

  • login details
  • device IDs
  • IP addresses
  • cookies
  • browser type
  • device type
  • plug-in details
  • language preference
  • time zones
  • screen size, screen color depth, system fonts
  • … and much more

That’s quite a long list, isn’t it?

What’s particularly important in the case of anonymization is that, according to the Article 29 Working Party’s Opinion 05/2014 on Anonymisation Techniques, it shouldn’t be treated as a single unified approach to data protection.

It’s rather a set of different techniques and methods used to permanently mask the original content of the dataset.

There’s also a very limited list of techniques that could be considered as providing a sufficient level of security. Among the approved anonymization techniques, the Article 29 Working Party lists two types of procedures: randomization and generalization.

Here you can find a short description of techniques encompassed by their scope.

Randomization:

Noise Addition: where personal identifiers are expressed imprecisely, for instance:

height: 180 cm → height: 320 cm
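As a rough illustration of noise addition, the sketch below perturbs each value with a random offset. The function name `add_noise`, the Gaussian distribution and the `scale` value are assumptions made for this example, not something prescribed by the Article 29 Working Party; the right amount of noise depends on your data and on how much precision you can afford to lose.

```python
import random


def add_noise(values, scale=3.0, seed=None):
    """Return a new list where each value has independent Gaussian noise added.

    `scale` is the standard deviation of the noise (hypothetical default).
    `seed` only makes the example reproducible; leave it as None in practice.
    """
    rng = random.Random(seed)
    return [round(v + rng.gauss(0.0, scale), 1) for v in values]


heights_cm = [180, 165, 172, 191]
noisy = add_noise(heights_cm, scale=3.0, seed=42)
print(noisy)  # values close to, but not equal to, the originals
```

The original list is left untouched, so the raw values still have to be deleted or protected separately.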

Substitution/Permutation: where personal identifiers are shuffled within a table or replaced with random values, for instance:

ZIP: 10120 → ZIP: postcode
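Permutation can be sketched in a few lines: the values of one column are shuffled across records, so the overall distribution stays the same but a value no longer belongs to the row it came from. The helper name `permute_column` and the sample records are hypothetical.

```python
import random


def permute_column(records, column, seed=None):
    """Shuffle one column across all records, leaving other columns in place."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    # Build new dicts so the input records are not modified.
    return [{**r, column: v} for r, v in zip(records, values)]


records = [
    {"zip": "10120", "pages_viewed": 4},
    {"zip": "20500", "pages_viewed": 9},
    {"zip": "94105", "pages_viewed": 2},
    {"zip": "60601", "pages_viewed": 7},
]
shuffled = permute_column(records, "zip", seed=7)
for row in shuffled:
    print(row)
```

Note that a shuffle can leave some values in their original position, which is one reason permutation alone is rarely enough.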

Differential Privacy: where personal identifiers of one data set are compared against an anonymized data set held by a third party, with instructions to employ a noise function, and an acceptable amount of data leakage is defined.
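One widely used way to implement the noise function is the Laplace mechanism, where the acceptable leakage is expressed as a privacy parameter usually called epsilon. The sketch below applies it to a simple counting query. The names `laplace_noise` and `private_count`, the sample data and the epsilon value are assumptions for illustration; this is not a production-grade implementation.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


visits = [
    {"converted": True},
    {"converted": False},
    {"converted": True},
    {"converted": True},
    {"converted": False},
]
print(private_count(visits, lambda v: v["converted"], epsilon=1.0))
```

A smaller epsilon means more noise and stronger protection; a larger one means more accurate but less private answers.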

Generalization:

Aggregation/K-Anonymity: where personal identifiers are generalized into a range or group, for instance:

Age: 30 → Age: 20-35
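To make this more concrete, here is a minimal sketch that generalizes ages into ranges and then checks whether every combination of quasi-identifiers occurs at least k times. The fixed ten-year bins, the function names and the sample records are assumptions for this example; real data sets usually need bins chosen per attribute.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Map an exact age to a range label, e.g. 30 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


raw = [
    {"age": 30, "country": "PL"},
    {"age": 34, "country": "PL"},
    {"age": 38, "country": "PL"},
    {"age": 52, "country": "DE"},
    {"age": 57, "country": "DE"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]

print(is_k_anonymous(raw, ["age", "country"], k=2))
print(is_k_anonymous(generalized, ["age", "country"], k=2))
```

With exact ages every record is unique; after generalization each group contains at least two records.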

L-Diversity: where personal identifiers are first generalized, then each attribute within an equivalence class is made to occur at least n times, for instance: properties are assigned to personal identifiers, and each property is made to occur with a dataset, or partition, a minimum number of times.
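A simple form of this check (often called distinct l-diversity) can be sketched as follows: within each equivalence class, the sensitive attribute must take at least l different values. The function name `is_l_diverse` and the sample records are hypothetical.

```python
from collections import defaultdict


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        classes[key].add(r[sensitive])
    return all(len(values) >= l for values in classes.values())


patients = [
    {"age": "30-39", "diagnosis": "flu"},
    {"age": "30-39", "diagnosis": "asthma"},
    {"age": "50-59", "diagnosis": "flu"},
    {"age": "50-59", "diagnosis": "flu"},
]
print(is_l_diverse(patients, ["age"], "diagnosis", l=2))
```

Here the data set is 2-anonymous, yet everyone in the 50-59 group has the same diagnosis, so the check fails for l=2. This is exactly the kind of gap that k-anonymity on its own does not catch.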

The most common threats in anonymization

However, each of the techniques described above has its own pitfalls, especially when tested against the three most common risks involved in anonymizing data. Those risks are:

Singling out

The possibility to isolate some or all records which identify an individual in the dataset

Linkability

The ability to link at least two records concerning the same data subject or a group of data subjects (either in the same database or in two different databases)

Inference

The possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.
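The singling-out risk in particular is easy to estimate on your own data. The sketch below counts what share of records is unique for a given combination of attributes; the function name `singled_out_fraction` and the sample records are assumptions made for this example.

```python
from collections import Counter


def singled_out_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination is unique in the data set."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


people = [
    {"zip": "10120", "gender": "F", "birth_year": 1985},
    {"zip": "10120", "gender": "F", "birth_year": 1985},
    {"zip": "10120", "gender": "M", "birth_year": 1990},
    {"zip": "20500", "gender": "F", "birth_year": 1985},
]
print(singled_out_fraction(people, ["zip", "gender", "birth_year"]))
print(singled_out_fraction(people, ["gender"]))
```

The more attributes you combine, the larger the share of records that can be isolated, even when no single attribute looks identifying.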

As you can see in the table below, every technique has its own set of strengths and weaknesses:

Technique | Is Singling Out still a risk? | Is Linkability still a risk? | Is Inference still a risk?
Noise Addition | Yes | May not | May not
Substitution | Yes | Yes | May not
Aggregation or K-anonymity | No | Yes | Yes
L-diversity | No | Yes | May not
Differential privacy | May not | May not | May not

Source: Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques

For these reasons, it’s highly advisable to use not one but a combination of several anonymization techniques in concert to prevent your data set from being re-identified. However, even that approach doesn’t necessarily translate into total data security.

Because there are now so many different public datasets available to cross-reference, any set of records with a decent amount of information on someone’s actions has a good chance of matching identifiable public records.

87% of the American population can be uniquely identified by a combination of just their ZIP code, gender, and date of birth!
Latanya Sweeney, 2000

That’s why, even when applying anonymization processes, it’s important to limit the amount of anonymized data disclosed to the public and to stick to the data minimization approach. In this way you minimize the risk of this data set being matched with any kind of public records.

We’re aware that anonymization techniques and the threats involved in applying them to your data are a much broader topic, impossible to tackle in a single blog post. That’s why we’ve put together a list of valuable guides shedding some more light on the technical aspects of data anonymization:

We hope they’ll prove useful!

What other options do you have

The techniques presented above are typically applied to data sets that contain personal data. This means that you’ll need to collect consents to use it anyway.

However, you should be aware that there’s also a way to collect data that is anonymous from the very start. You’ll successfully escape the obligation to collect data subject consents before you begin processing the data.

For that, you’ll need analytics software that allows for data anonymization (unfortunately, that won’t be the case with Google Analytics – here you can read why).

This is how we approached anonymization in Piwik PRO:

… And a few words of technical explanation:

When anonymous data collection is enabled for a visitor, the “user is anonymous” (UIA) parameter is added to the tracker, and the following Piwik PRO instance settings apply to anonymous visitors:

  • Geolocation is fully or partially disabled. Depending on the instance settings Piwik PRO will record just the country information or nothing at all. An associated web server will see the same masked IP address.
  • No “fingerprinting” data is used to identify returning users. Characteristics of the visitor’s device or browser (operating system, browser information, language settings, etc.) are not compared in an attempt to identify users.
  • One cookie identifier (“Visitor ID”) is stored in the visitor’s browser. This cookie’s duration is set to 30 minutes, after which it is deleted automatically by the browser. Therefore the online identifier describes a visit and not a visitor.
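To give a feel for what IP masking can look like in general, here is a minimal sketch that zeroes the host part of an address. This is not Piwik PRO’s implementation: the function name `mask_ip` and the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions chosen for the example.

```python
import ipaddress


def mask_ip(address, ipv4_prefix=24, ipv6_prefix=48):
    """Return the address with everything after the given prefix set to zero."""
    ip = ipaddress.ip_address(address)
    prefix = ipv4_prefix if ip.version == 4 else ipv6_prefix
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(mask_ip("203.0.113.57"))           # 203.0.113.0
print(mask_ip("2001:db8:abcd:1234::1"))  # 2001:db8:abcd::
```

A shorter prefix hides more of the address but also makes any remaining geolocation coarser.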

This kind of data anonymization allows you to collect at least some information about user behavior (e.g. number of visitors, page views, conversions and time spent on the site) without asking for consent.

If you want to learn more about the Piwik PRO Data Anonymization module, be sure to check this page out.

Disadvantages of data anonymization

Although data anonymization has some very strong advantages, don’t forget about its drawbacks.

It’s important to remember that if you want to anonymize new data collected from your website, then you’ll either need to obtain consent to collect personal data (like cookies, IP addresses, and device ID) and then apply anonymization techniques, or only collect anonymous data from the start.

In the latter case, this data would be limited to pageviews, as most other analytics metrics and reports (unique pageviews, unique visitors, user location, etc.) require personal data.

However safe this approach may sound, it also deprives you of all the valuable insights you can gain with more detailed information about your customers.

Stripping every common identifier from your data makes it impossible to cultivate a more personalized approach towards your clients and visitors – for instance, by serving them with tailored messaging and dedicated offers or recommendations.

Statistics prove that personalization is an increasingly successful marketing tactic. What’s more, consumers are keen to share their personal data with companies if the data will be used for their own benefit:

79% of consumers say they are only likely to engage with an offer if it has been personalized to reflect previous interactions the consumer has had with the brand. (Marketo)

57% of consumers are okay with providing personal information (on a website) as long as it’s for their benefit and is being used in responsible ways. (Janrain)

That’s why it’s worth sacrificing your historical data set in some cases and going the extra mile to provide your users with enhanced levels of security and transparency.

This will help them be relaxed about sharing their personal details with you. Then you can use this data to provide them with the level of personalization and customer experience they desire.

First-party data is one of the biggest assets in every marketer’s arsenal. We’ve written a lot about it in these blog posts:

You can do it by asking your users for consent to process their data and storing all the information received in alignment with the new EU data privacy law – something we’ve written a lot about on our blog in the GDPR section. Be sure to check it out!

Anonymous analytics – final thoughts

Anonymization is definitely one of the greatest ways to ensure the safety of data you collect. This extra measure of security lets you freely exploit your data collection in ways that wouldn’t be legally allowed when it comes to non-anonymized data.

However, there are also some considerable benefits of using personal data in its pure (original) form. That’s why you really need to think through the pros and cons of each option before making a final decision.

But no matter what method you choose, remember that storing your data in a safe environment is also of paramount importance.

For instance, Piwik PRO Analytics allows you to store your data at a location of your choice – using your own infrastructure, in a third-party database, or in our own secure private cloud with servers located in the EU and the USA.

What’s more, our software enables you to apply additional security measures to your data, like SAML Authentication or Audit Log, and you can take advantage of professional data security advice and support.

If you’d like to learn more, feel free to contact us anytime!