Sensitive data combinations

This post is my first attempt to tackle the thorny issue of data which is not core personally-identifiable information (PII) but which, in some combinations, is enough to identify an individual. I’ll call this type of data combination-PII (or combo-PII), and such a combination in a specific search a ‘profile’, for this purpose of this discussion.

Combo-PII is reference data that describes living persons

This type of data is usually called ‘reference’ data by database specialists. This is the background data that structures our picture of a person using categories, such as the city and country we live in, our age range, consumer choices (e.g., electricity provider), and similar data. Each of these values, taken by itself, is not enough to identify a person. Many such values taken together can, in some cases, either identify the data subject with certainty, or narrow the number of possibilities enough for subject to be guessed, or combined with other data to produce a match.

This type of data I will call combo-PII. It is the kind of data whose PII-ness is variable, depending on the nature of the data. It is the kind of data that is identical for most data subjects, and so is aggregated into totals and sub-totals for analytic and regulatory reporting. This type of data normally forms the ‘dimension’ values for data-warehouse and related reporting.

Staff who are tasked with data privacy should be able to look up any data element in the data inventory (link) to see how it is classified (non-PII, PII, combo-PII, possibly other categories as well).

A simple example

For example, suppose our data controller asks for a report showing data subjects, with aggregated counts, totals, averages, etc. grouped by the subject’s (Belgian) commune of residence, gender, and year of birth. Suppose further that within this group we have a data subject who is male, born in 1990, and lives in a large commune like Antwerp. The aggregated line of this report is likely to include hundreds or thousands of matches. If, instead, the subject lives in a small commune with a few thousand residents, most of whom are retired (meaning that residents born in 1990 are relatively few), our subject will be in a small group of matches.

Consider the following about our data subject: Antwerp commune has a 2017 population of 520,504. Assuming that gender and birth years are evenly distributed over the population (at 0.5 and 0.0125, respectively), the number of matches for these 3 reference values is 3253.

If instead the commune is Mesen, population 1049, the number of possible matches drops to 7. Our data subject is one of 7 people, and it won’t take much to figure out which one he is. One more piece of information might be enough. The analytic process is analogous to a detective  who puts clues together to narrow a field of suspects.

You might not even need to narrow the field down. If you’re deciding whether to display an online ad based on a profile, having your group narrowed down to 7 viewers may be enough to justify the cost of showing the ad to all of them.

My point here is that a given combo-PII profile can only be categorized as PII based on the content of the data-subject’s record and the underlying context of the data: does this particular record produce a relatively large or small number of matches?

If matches are few, one approach is to block exposure of the record if the match-count falls below some pre-determined threshold, as a compromise between legitimate data exploitation and the need for privacy. This approach is outlined in the example below.

The threshold is useful, since it can be set based on data sensitivity. For example, super-PII (things like protected witnesses, convicted criminals, undercover police officers) might have a higher threshold than normal PII. Reports that are kept confidential and are seen by only a few people might have a lower threshold, given the lower risk of exposure. There can be a different hard-coded threshold for each case, and you could even have an algorithm to set the threshold dynamically.

Below is a list of Belgian communes (gemeente is Dutch for commune) and populations, edited to show only a few of the largest and smallest ones.

Gemeente                    Population_2017
--------------------------  -------------
Antwerpen                   520504
Gent                        259083
Charleroi                   201256
[--- cut ---]
Martelange                  1814
Herbeumont                  1619
Daverdisse                  1429
Mesen                       1049

I should note that, in your GDPR-compliant database, tables such as population by commune will be highly re-usable and will be among the most-frequently consulted tables in the database. Any time that you have to estimate the sensitivity of a profile which includes a person’s commune, you can use this table. Not only are such tables re-usable, but you won’t need very many of them, possibly no more than a dozen, to sensitize most of your PII.

We can run a query on this data to give us some information about the selectivity of the profile (commune, gender, year-of-birth) for each commune. Note that the right-most column displays “(null)”, meaning ‘no value available, for the rows which have a small number of matches. This demonstrates that you can omit displaying data based on its dynamically-determined sensitivity (here that value is a dummy, simply the count of matches from the third column, but the principle is the same).

Here’s the result:

Gemeente    Population_2017  possible_matches  exposed_value  combo_PII_reduction_factor
----------  ---------------  ----------------  -------------  --------------------------  
Antwerpen   520504           3253              3253           0.00625                     
Gent        259083           1619              1619           0.00625                     
Charleroi   201256           1258              1258           0.00625                     
Martelange  1814             11                (null)         0.00625                     
Herbeumont  1619             10                (null)         0.00625                     
Daverdisse  1429             9                 (null)         0.00625                     
Mesen       1049             7                 (null)         0.00625    

One might logically ask how these omitted values are useful. In the above example, our report could still include a single figure for the number of null values, so that aggregate counts would still reflect the number of records omitted for privacy reasons (but not their details).

I’ll include the SQL (SQLite version 3.8.11) at the end. The SQL there has a hard-coded threshold of 15, meaning that records with fewer than 15 matches are not shown. Here data from selectively-suppressed rows (as is shown above in column “exposed value”) is shown as NULL, but in practice it will likely be a label that indicates that the record was suppressed for privacy reasons..

Note also that combo-PII reduction factor is constant. This because I have used constants for the proportions of genders and age groups. In a real-life case you would normally try to have actual population data for these factors. See Appendix B for more details.

I assumed constant distributions for gender and birth year. Gender will likely hover around 50-50, but the distribution of birth years is unlikely to be constant. If you had additional data, for example, birth year by commune, that would make your selectivity estimates more accurate. That is why I have those values (commented out) in the sample SQL, as placeholders for improvement. (A table of gender breakdowns of the above populations is available at the link given in Appendix C.)

Combo-PII is a tricky business

Combo-PII is a tricky for several reasons, among them:

  • nobody wants to expose PII, embarrass his or her organization, nor suffer a fine from the DPA
  • data monetization is more cost-effective the more specifically it is targeted; there is a financial incentive to identify target data subjects or at least narrow them down to a small group. Nobody wants to have to tell marketing or customer service that they can’t use this data at all.

Thus we have competing priorities, and need to find a balance that we can explain and justify as being objectively based. The dividing line separating adequate privacy from legitimate use is not cut-and-dried, but depends on several factors:

  • how visible is the data? For example, data displayed in internal reports for analysts could have more leeway than those made available to the public
  • how selective is the data? Knowing that a data subject lives in a small village is provides more information than knowing that he or she lives in Brussels

Below I propose a simple approach that can be used to fine-tune the selectivity of combo-PII to adjust for factors like visibility and selectivity.

Mask the selective, expose the non-selective

A one-size-fits-all categorization is not workable for combo-PII, since it does not take such factors as selectivity and audience into consideration. The approach put forward in this post outlines a compromise that protects PII while enabling organizations to protect sensitive data while remaining free to use non-sensitive data for legitimate purposes.

The approach outlined here requires some custom development in the form of queries for each data access that the application makes, with expose/block decisions made on-the-fly by the database’s SQL code. These SQL queries can be stored as virtual tables, called ‘views’, in the database. This extra effort yields several benefits:

  • these views will normally follow a standard structure and can be easily coded and maintained more easily than most types of code
  • these views can be substituted for existing tables or views transparently into existing SQL-based applications, resulting in minimal code disturbance
  • as mentioned above, the threshold setting allows you to fine-tune the exposure to fit the circumstances
  • if your organization is audited, you have a defensible strategy (assuming you keep your statistical data updated) which takes each individual’s vulnerability into account, with customizable thresholds for maximum granularity

Appendix A – Re-creating this demonstration

The code below is included for two reasons.

  • Those who work with SQL may be interested in trying this example and extending it (as suggested below for gender, in Appendix B)
  • Those with no interest in SQL can verify that the code to implement the proposed approach is simple and brief

If you want to try this code yourself, do the following:

Create or open a SQLite3 database (SQLite is free and can be queried from the command line, plus there are several free utilities which provide an GUI for running queries). If you have SQLite already but are unsure of the version, type

 select sqlite_version();

The result should start with 3, followed by a decimal and some more numbers (that is, you need at least version 3).

Once you have access to SQL, create the table and enter the rows used in this demonstration by executing the following statements in SQLite:

Create the table:

 POP_2017 NUM

Insert the sample rows:


Now run the query:

WITH constants
AS (
/* define constants so as to avoid using literals */
 SELECT 15 AS max_matches
 ,"%d" AS format_str
 ,0 AS decimals
 /* genders, plus birth_years, could have been put into
 constants in this example, but I break them out to show
 that they could be used to provide more accurate 
 estimates if you had, for example, breakdowns of gender
 ratios or age ranges by communue
AS (
 SELECT 0.5 AS gender_ratio
AS (
 SELECT 0.0125 AS pop_born_same_year
AS (
/* Here's the refinement: we take gender ratio times age-cohort ratio to reduce
 our pool of eligible data subjects
 SELECT genders.gender_ratio * birth_years.pop_born_same_year AS reduction_factor
 FROM birth_years
 ,POP_2017 AS Population_2017
 ,printf(c.format_str, ROUND(POP_2017 * pop.reduction_factor, c.decimals)) AS possible_matches
 /* The next statement is the point of the exercise
 Here I use a CASE statement to decide, for each row, if the combination of reference fields
 is too selective, that is, if it risks exposing enough data to identify an individual. If it is
 too selective, I don't expose the data. 
 WHEN c.max_matches < POP_2017 * pop.reduction_factor
 THEN printf(c.format_str, ROUND(POP_2017 * pop.reduction_factor, c.decimals))
 END AS exposed_value
 /* the remaining fields are merely for illustration */
 ,pop.reduction_factor as combo_PII_reduction_factor
-- ,genders.gender_ratio AS gender_reduction_factor
-- ,printf(c.format_str, ROUND(POP_2017 * genders.gender_ratio, c.decimals)) AS pool_of_same_sex
-- ,birth_years.pop_born_same_year AS birth_year_reduction_factor
-- ,printf(c.format_str, ROUND(POP_2017 * birth_years.pop_born_same_year, c.decimals)) AS 
-- pool_born_same_year
 ,population AS pop
 ,constants c
where Gemeente in ('Antwerpen', 'Gent', 'Charleroi', 'Martelange', 'Herbeumont', 'Daverdisse', 'Mesen')

Appendix B – Sharpening the accuracy of the estimates

As noted above, I simplified the example by using assumptions for the proportion of men and women as well as the distribution of birth-year cohorts in my example. In a production system you would use actual population data, if available. For example, the breakdown for the gender ratio by commune for the sample communes used above is as follows:

----------  ------  --------  ------------  
Antwerpen   MAN     259334    0.498         
Antwerpen   VROUW   261170    0.502         
Charleroi   MAN     98113     0.488         
Charleroi   VROUW   103143    0.512         
Daverdisse  MAN     703       0.492         
Daverdisse  VROUW   726       0.508         
Gent        MAN     128848    0.497         
Gent        VROUW   130235    0.503         
Herbeumont  MAN     828       0.511         
Herbeumont  VROUW   791       0.489         
Martelange  MAN     934       0.515         
Martelange  VROUW   880       0.485         
Mesen       MAN     538       0.513         
Mesen       VROUW   511       0.487

With a few extra lines of SQL the actual gender ratio for the data subject’s commune could be applied instead of the assumed value (0.5) used in the sample code. In this simple example the actual data would not likely have changed the results, but with other categories of data, the actual values can make a large difference. In any case, a finely-tuned sensitivity estimation process should help to convince the data-protection authorities that you have made a strong effort to protect the privacy of your data subjects while adapting the protection level to the circumstances.

Appendix C – Source for this data

The data for this example was taken from this Belgian government site link.

It is also likely that the Belgian government or private data services can provide additional breakdowns for most of the categories of PII.

Leave a Reply