Saving Analytical Data Without Violating GDPR – Part 2: Aggregation and Anonymization

In a previous post, we reviewed two GDPR anonymization options – minimization and masking. In this installment we discuss two additional options.

Aggregation

Another way to comply with GDPR is to group data so that individual records no longer exist and cannot be distinguished from other records in the same grouping. This may be accomplished through a single aggregation of the data into the most commonly consumed set or, more often, by creating multiple aggregations of the data for different use cases.

For this strategy to work, the data set must exclude any data elements that allow the identity of a record to be derived, whether directly (national identification number, name, passport ID, etc.) or indirectly (region, area code, etc.). This can be somewhat complicated, because indirect identification depends on the size and dimensionality of the data set as well as on background or publicly available data. For thousands of daily sales records across a country, aggregation may easily be sufficient, but for mobile telephone location data in a large metro area it would be very ineffective.
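As a rough illustration of the idea, the sketch below aggregates hypothetical sales records by product and month, drops the identifying fields, and suppresses any group that is too small to hide an individual. The field names, the sample rows, and the minimum group size of 3 are our own assumptions for the example, not values prescribed by GDPR.

```python
from collections import defaultdict

# Hypothetical detail records; field names and values are illustrative only.
SALES = [
    {"name": "A. Jones", "passport_id": "X100", "product": "widget", "month": "2018-05", "amount": 10.0},
    {"name": "B. Smith", "passport_id": "X200", "product": "widget", "month": "2018-05", "amount": 20.0},
    {"name": "C. Brown", "passport_id": "X300", "product": "widget", "month": "2018-05", "amount": 30.0},
    {"name": "D. White", "passport_id": "X400", "product": "gadget", "month": "2018-05", "amount": 99.0},
]

# Assumed threshold: groups smaller than this are withheld from the output.
MIN_GROUP_SIZE = 3


def aggregate_sales(records, min_group_size=MIN_GROUP_SIZE):
    """Group by (product, month), keeping only counts and totals.

    Direct identifiers (name, passport_id) never reach the output, and
    groups with fewer than min_group_size records are suppressed.
    """
    groups = defaultdict(lambda: {"count": 0, "total": 0.0})
    for rec in records:
        key = (rec["product"], rec["month"])
        groups[key]["count"] += 1
        groups[key]["total"] += rec["amount"]
    return {k: v for k, v in groups.items() if v["count"] >= min_group_size}


summary = aggregate_sales(SALES)
```

Here the single gadget sale is dropped from the summary, because a group of one would still point at one customer. The right threshold and grouping columns depend on the data set and on what other data an attacker could combine it with.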

The potential downside of this strategy is that the usefulness of the data for broad analytical purposes may have to be reduced to provide adequate anonymization. For a more technical explanation of this type of aggregation, take a look at the following publication on l-diversity and privacy-centric data mining algorithms: A Comprehensive Review on Privacy Preserving Data Mining.

Anonymization

If data must be maintained at a detail level, then anonymization of personal data may be the best solution available. Anonymization is generally achieved through encryption or a one-way hash algorithm. If the organization hashes all the key values of a record together with the personal data contained in that record, the resulting hash key allows for dynamic reporting and aggregation on the data set without exposing the personal data.
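A minimal sketch of such a hash key, using Python's standard hashlib, might look like the following. The column names (customer_key, name, card_number), the sample rows, and the separator character are assumptions made for the example; the point is only that reporting can group on the hash instead of on the personal data.

```python
import hashlib
from collections import defaultdict

# Assumed delimiter so that ("ab", "c") and ("a", "bc") hash differently.
SEPARATOR = "\x1f"


def record_hash(key_values, personal_values):
    """Return a SHA-256 hex digest over the key values plus personal data."""
    parts = [str(v) for v in (*key_values, *personal_values)]
    return hashlib.sha256(SEPARATOR.join(parts).encode("utf-8")).hexdigest()


def anonymize(rows):
    """Replace the key and personal columns of each row with one hash key."""
    return [
        {
            "customer_hash": record_hash(
                [row["customer_key"]], [row["name"], row["card_number"]]
            ),
            "amount": row["amount"],
        }
        for row in rows
    ]


def total_by_hash(rows):
    """Aggregate amounts per hash key, as a report would."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_hash"]] += row["amount"]
    return dict(totals)


# Hypothetical detail rows with made-up names and test card numbers.
ROWS = [
    {"customer_key": 1001, "name": "Jane Example", "card_number": "4111111111111111", "amount": 25.0},
    {"customer_key": 1001, "name": "Jane Example", "card_number": "4111111111111111", "amount": 15.0},
    {"customer_key": 1002, "name": "John Sample", "card_number": "5500000000000004", "amount": 40.0},
]

hashed_rows = anonymize(ROWS)
totals = total_by_hash(hashed_rows)
```

Because the same inputs always produce the same digest, both of Jane's purchases roll up under one customer_hash, while the stored rows no longer carry her name or card number.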

When using an anonymization strategy of this type, the company will need to hash all of the personal data concatenated as a single field to effectively defend against rainbow table attacks. In cases where surrogate keys are used, hashing them into the string as well introduces elements that are more difficult to derive and will further degrade the effectiveness of rainbow-table-style attacks. Creating a hash on just one field (credit card number or social security number) is not effective: the number of possible values is small enough that rainbow tables can recover the original trivially.
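To make the single-field weakness concrete, here is a small sketch that recovers a hashed value simply by hashing every possible candidate. The 5-digit code is a hypothetical stand-in chosen so the example runs quickly; real identifiers are longer, but the same enumeration (or a precomputed table) applies whenever the space of valid values is small. The names and sample values are assumptions for the example.

```python
import hashlib


def sha256_hex(text):
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def recover_by_enumeration(target_digest, digits=5):
    """Hash every zero-padded number of the given length.

    Returns the matching candidate, or None if no candidate matches.
    """
    for n in range(10 ** digits):
        candidate = str(n).zfill(digits)
        if sha256_hex(candidate) == target_digest:
            return candidate
    return None


# A hash of one low-entropy field falls to simple enumeration.
single_field_digest = sha256_hex("04217")
recovered = recover_by_enumeration(single_field_digest)

# The same field concatenated with a surrogate key and a name (all
# hypothetical values) is not in the enumerated space, so the search fails.
combined_digest = sha256_hex("\x1f".join(["88231", "Jane Example", "04217"]))
not_recovered = recover_by_enumeration(combined_digest)
```

The first search returns the original code; the second returns None, because an attacker would now have to guess the surrogate key and the name as well.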

When selecting a hash, organizations need to have “due regard to the state of the art,” so careful consideration should be given to selecting an algorithm that is computationally infeasible to invert (high preimage resistance). Algorithms that meet this criterion include SHA-256, SHA-512, BLAKE2s, and BLAKE2b. MD5 can also be used, but only for small input sets (strings under 50 characters or so), and it is the weaker choice: practical collision attacks against it already exist, and it may become vulnerable to inversion on very advanced hardware within the next couple of years.
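All four of the recommended algorithms ship with Python's standard hashlib module, so switching between them can be a one-line change. The sketch below wraps that choice in a small helper; the helper name and the allow-list are our own assumptions, not part of any library.

```python
import hashlib

# Algorithms named in this post; all are guaranteed to be present in hashlib.
CANDIDATES = ("sha256", "sha512", "blake2s", "blake2b")


def digest_with(algorithm, data):
    """Hash bytes with one of the allowed algorithms and return hex text."""
    if algorithm not in CANDIDATES:
        raise ValueError(f"algorithm not on the allow-list: {algorithm}")
    return hashlib.new(algorithm, data).hexdigest()


# Default digest size of each candidate, in bits.
DIGEST_BITS = {name: hashlib.new(name).digest_size * 8 for name in CANDIDATES}

for name in CANDIDATES:
    print(name, DIGEST_BITS[name], "bits")
```

Keeping the algorithm behind an allow-list like this makes it harder for a weaker hash such as MD5 to slip into a pipeline by accident.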

Closing Notes

GDPR is a complex subject area with wide-ranging impacts across the business environment. In this and our previous post, we addressed only one small part of the landscape (analytical data) and a subset of the GDPR requirements, specifically de-identification or anonymization. For more information on GDPR and other helpful resources, see this post or visit our website.

No single software program, vendor, or strategy will make an organization GDPR compliant on its own. Companies should consult with their legal and information systems teams to verify that whatever measures are taken align with the organization’s overall GDPR strategy.

Authors

Don Loden