Introduction
Sometimes a company, a researcher, or the government needs to share data publicly. In an age where the GDPR dictates strong data protection, people resort to anonymizing the data, since anonymous data falls outside the scope of the GDPR.
Yet making data anonymous isn’t that easy. Simply de-identifying it (for example by following the HIPAA guidelines) isn’t enough.
And so k-anonymity was born. The underlying problem is that quasi-identifiers like age or zip code can be exploited in a linkage attack: combining quasi-identifiers, which on their own aren’t identifiers, can still identify people.
The solution is to adjust the data in such a manner that the quasi-identifiers of each record cannot be distinguished from those of at least k − 1 other records.
For more details on this:
- For the more technical audience: the original paper can be found here: https://www.win.tue.nl/~jhartog/CourseVerif/Papers/10.1.1.90.4099.pdf.
- Or if you want the idea in a nutshell: the Wikipedia article on k-anonymity.
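To make the definition concrete, here is a minimal sketch of a check for k-anonymity: a table is k-anonymous when every combination of quasi-identifier values occurs at least k times. The records and the `is_k_anonymous` helper below are made-up examples, not part of any standard library.

```python
from collections import Counter

def is_k_anonymous(records, k):
    """Return True if every quasi-identifier tuple occurs at least k times."""
    counts = Counter(map(tuple, records))
    return all(c >= k for c in counts.values())

# Each record is an (age range, zip prefix) quasi-identifier tuple
records = [("18-30", "47*"), ("18-30", "47*"),
           ("31-50", "53*"), ("31-50", "53*")]
print(is_k_anonymous(records, 2))  # True: each combination appears twice
```

Adding a single record with a unique combination would immediately break 2-anonymity, which is exactly what a linkage attack exploits.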
Converting data
K-anonymity is reached by converting the data in three ways:
- Generalization
- Generalize numerical data into intervals, e.g. age → [18-30], [31-50], [51, …[
- Suppression
- Certain values of the attributes are replaced by an asterisk ‘*’.
- Adding noise to data
- Add random noise to the data, e.g. Laplace noise ( https://en.wikipedia.org/wiki/Laplace_distribution ), as used in differential privacy ( https://en.wikipedia.org/wiki/Differential_privacy ).
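The three conversions can be sketched in a few lines of numpy. The interval bounds, zip codes, and noise scale below are illustrative choices, not prescribed values:

```python
import numpy as np

ages = np.array([19, 27, 34, 45, 52, 61])
zips = ["4700", "4701", "5300", "5302", "5310", "6000"]

# Generalization: map each age onto a coarse interval
bins = [18, 31, 51, 120]
labels = ["18-30", "31-50", "51+"]
generalized = [labels[np.digitize(a, bins) - 1] for a in ages]

# Suppression: keep only a zip prefix and replace the rest with '*'
suppressed = [z[:2] + "**" for z in zips]

# Noise: perturb the ages with Laplace noise
noisy = ages + np.random.laplace(loc=0.0, scale=2.0, size=ages.shape)
```

Note that noise addition trades accuracy for privacy in a different way than the other two: the published values are no longer true, rather than merely less precise.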
In this blog post I explain a simple new way of scrambling data.
n/k clustering
Assume you have n records you want to convert. To reach k-anonymity, the records need to fall into equivalence classes with at least k elements each.
One way of doing this is to convert the data into the mean values of floor(n/k) clusters. The idea is quite simple, so here is a proof of concept:
```python
import numpy
from sklearn.cluster import KMeans

k = 2            # the k we want for k-anonymity
nrElements = 10  # n, the number of records

# Generate some random integer ages in [18, 87]
vals = numpy.floor((numpy.random.rand(nrElements, 1) * 70) + 18).astype(int)

# Cluster the values into floor(n/k) clusters
kmeans = KMeans(nrElements // k).fit(vals)
clusters = kmeans.cluster_centers_
clusters.sort(axis=0)

# Replace each value with its closest cluster centroid
converted = list(map(
    lambda y: int(min(clusters, key=lambda x: abs(x - y))[0]),
    vals))
```
- First, we generate some random age values as an example.
- KMeans clustering calculates the centroids.
- Finally, we convert each original value into the closest centroid value.
Example values:
[[35], [47], [18], [82], [37], [74], [38], [66], [23], [67]]
Clusters:
[[20.5], [36.666666666666664], [47.0], [66.5], [78.0]]
Converted example:
[36, 47, 20, 78, 36, 78, 36, 66, 20, 66]
The remarkable thing about the result is that it may not be k-anonymous. For instance, the value 47 is unique. This may well be a sign that the record is too different to be included. The suggestion would then be to suppress the value or skip the entire record altogether.
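That suppression step can be sketched directly on the converted example above: count how many records map to each centroid value, and suppress any value carried by fewer than k records.

```python
from collections import Counter

# The converted example values from above
converted = [36, 47, 20, 78, 36, 78, 36, 66, 20, 66]
k = 2

counts = Counter(converted)
# Suppress any centroid value that fewer than k records share
anonymized = [v if counts[v] >= k else "*" for v in converted]
print(anonymized)  # 47 appears only once, so it becomes '*'
```

The alternative, dropping the whole record, keeps the remaining values exact but shrinks the published dataset.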
Conclusion
While n/k clustering doesn’t guarantee a k-anonymous solution, it gives a good idea of which values may be too risky to publish.
Finding an optimal generalization may seem simple, but it is equivalent to the subset problem, which is proven to be NP-hard.