The Data Inventory, Part 3

In my previous post about building a data inventory (link), I showed a basic 3-table schema that stores the names, locations, and other metadata of the personally identifiable information (PII) in a set of databases. This post shows how one might go about recording the actual PII values held in those databases.

Challenges

The technical difficulty of collecting all of a person’s information and purging it on request is well known (link). One consolation is that the information to be purged on request is the same information that would be exported and handed over to the data subject, so a single system can be used to identify it for both purposes.

Note that not all information the company holds is equivalent. In a future post I plan to cover personal data that may be held in what I call ‘recovery’ files, such as database transaction logs, backup files, audit trails, and other structured files. These files cannot be readily queried, nor can individual records be deleted from them (at least, not without compromising the ability to audit the database or to restore it), which in itself provides a justification for holding the data. Such data still needs to be protected, however, since a skilled hacker might be able to extract PII from such files. I cover an end-to-end solution that does protect such files in an earlier post (link).

For this post, however, I plan to cover only the data that is stored in databases and can be queried programmatically by applications or ad-hoc by database users.

Design Goals

We want to be able to respond to GDPR requests from data subjects, for example:

  • information
    • what data you store about me
    • what you do with that data
    • what is your reason for storing this data
    • where and how this data is stored
    • who may access this data, under what circumstances
  • extract
  • purge
    • You may be hit with a large number of purge requests at once, in which case a manual process will be slow and labor-intensive
    • You may need to retain part of the data for justified reasons (e.g., the data subject is bringing a lawsuit against you), so you need to ensure that the justifications are associated with the corresponding data

We want to ensure that we identify all PII of the requesting data subject, and only that person’s data. That is, we want to be able to:

  • detect if this person is present in multiple databases
  • detect if this person is present in different roles (e.g., a hospital may have the person both as a former patient and also as the contact person for a different patient)
  • detect different persons who may be confused with each other (e.g., due to same name)
  • ensure that the data about this person is consistent across all instances (data quality)

Data-subject-interaction management

We want to be able to associate supporting GDPR information directly with the concerned data item in a granular way, such as:

  • consents and revocations
  • legal/business justification
  • purpose(s) of storing

The schema shown in this post records the above items together with their associated consents, justifications, and so forth. You may find it easier to use a separate cloud-based application for managing this aspect (e.g., link).

Possible approaches

I will show here an approach that provides a high degree of granularity, but there are many approaches that could be used, such as:

  • column-by-column tracking (the one to be shown here)
    • the most granular (e.g., consent evidence can be attached to individual data items)
    • can pre-calculate sensitivity, if desired (link)
    • sensitivity can be based on the entire dataset, regardless of where it is stored
    • can store justification(s)
    • can easily add new categories if the need emerges
    • all values are stored as text, but you can use the datatype information to convert to the native format if needed
    • if you have unstructured data (images, word processing, etc.) in a non-relational (key-value) store, you can store the key with this method
    • a flexible choice where different types of database are in use
    • disadvantage: probably the most trouble to set up, owing to its high granularity
  • row-by-row (e.g., local replica table, ID in remote DB, local XML, other unstructured)
    • keeps the entire row together in one data-inventory entry
    • may be more convenient than column-by-column if you use only one database product
  • person-by-person
    • link all of a person’s information together, with the person at the core of the schema
    • possibly the best choice for a mix of structured and unstructured data, for example, a social media application
  • unstructured/NoSQL
    • may be a good choice where you have a lot of unstructured data, such as images, scans, or biometrics (fingerprints, retinal scans)
    • also a good choice for free-form text (tweets, comments, surveys) used for sentiment analysis, tagging, and other big-data analyses

A column-by-column method

Starting with the schema in the previous data-inventory post (link), I have added 3 additional tables, just enough to illustrate the concept. Here’s a logical diagram of the schema:

A simple data-inventory schema

The first of the new tables is:

VALUES BY COLUMN – This table has a row for each individual PII value: one row per PII-bearing column per source-table row. Each row is linked back to its parent entry in DATA INVENTORY LOCAL. (The other two new tables, JUSTIFIED BY and its companion table of standard justifications, are introduced later in this post.)

Here’s how the working table (the one the application uses) looks for the first few columns:

SELECT
EMPLOYEE_ID,
FIRST_NAME,
LAST_NAME,
EMAIL
FROM
HR.EMPLOYEES
WHERE EMPLOYEE_ID IN (100, 101);

EMPLOYEE_ID  FIRST_NAME  LAST_NAME  EMAIL     
-----------  ----------  ---------  --------  
100          Steven      King       SKING     
101          Neena       Kochhar    NKOCHHAR  

Here are the corresponding entries in DATA INVENTORY LOCAL:

SELECT
ID,
DATABASE_ID,
OWNER,
TABLE_NAME,
COLUMN_NAME,
DATA_TYPE,
LEN
FROM
KEN.DATA_INVENTORY_LOCAL
WHERE ID IN (5, 6, 8);

ID  DATABASE_ID  OWNER  TABLE_NAME  COLUMN_NAME  DATA_TYPE  LEN  
--  -----------  -----  ----------  -----------  ---------  ---  
5   1            HR     EMPLOYEES   FIRST_NAME   VARCHAR2   20   
6   1            HR     EMPLOYEES   HIRE_DATE    DATE       7    
8   1            HR     EMPLOYEES   LAST_NAME    VARCHAR2   25   

and now some actual values from VALUES BY COLUMN:

SELECT
LOCAL_ROW_ID,
ID,
INVENTORY_ID,
COLUMN_VALUE
FROM
KEN.VALUES_BY_COLUMN
ORDER BY
LOCAL_ROW_ID;

LOCAL_ROW_ID  ID  INVENTORY_ID  COLUMN_VALUE         
------------  --  ------------  -------------------  
100           5   6             2003-06-17 00:00:00  
100           3   8             King                 
100           1   5             Steven               
101           6   6             2005-09-21 00:00:00  
101           4   5             Neena                
101           2   8             Kochhar 

I have placed the column LOCAL_ROW_ID first, and ordered by this column, in order to keep together the rows pertaining to one data subject. Note that the column INVENTORY_ID is a link back to DATA_INVENTORY_LOCAL, where we can find the column name, table name, datatype, and so forth. (We will use the datatype to convert HIRE_DATE from its string representation here back to a DATE datatype if, for example, we need to use date arithmetic to determine how long the person has been employed at this company.)
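To illustrate, here is one way such a reconstruction query might look. This is a sketch only: the TO_DATE format mask assumes the text representation shown above, and NATIVE_DATE is an illustrative alias, not part of the schema.

SELECT
DIL.COLUMN_NAME,
VBC.COLUMN_VALUE,
-- convert only DATE columns back to a native DATE
CASE
WHEN DIL.DATA_TYPE = 'DATE'
THEN TO_DATE(VBC.COLUMN_VALUE, 'YYYY-MM-DD HH24:MI:SS')
END AS NATIVE_DATE
FROM
KEN.VALUES_BY_COLUMN VBC
JOIN KEN.DATA_INVENTORY_LOCAL DIL
ON DIL.ID = VBC.INVENTORY_ID
WHERE VBC.LOCAL_ROW_ID = 100;

Date arithmetic such as MONTHS_BETWEEN(SYSDATE, NATIVE_DATE) / 12, applied in an enclosing query, would then give the employee’s years of service.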

Now that we have our columns broken out individually, we can add table JUSTIFIED BY to record the justifications (there may be more than one) for holding this data item. JUSTIFIED BY, for its part, can refer to a table of standard justifications, such as might be formulated by the company’s legal advisers. This granular approach gives us the ability to have different justifications for different data items of a single person. For example, suppose that one of these employees leaves the company; we might need to retain that person’s hire date in order to confirm dates of employment to the tax authorities, but we might no longer have a justification for keeping that person’s date of birth.
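For concreteness, here is a minimal sketch of what these two tables might look like in Oracle syntax. The names and columns are illustrative only, and the foreign keys assume that ID is the primary key of VALUES BY COLUMN:

CREATE TABLE KEN.STANDARD_JUSTIFICATION (
ID NUMBER PRIMARY KEY,
-- standard wording, e.g., as formulated by the legal advisers
DESCRIPTION VARCHAR2(400) NOT NULL
);

CREATE TABLE KEN.JUSTIFIED_BY (
-- the individual data item being justified
VALUE_ID NUMBER NOT NULL REFERENCES KEN.VALUES_BY_COLUMN (ID),
-- the standard justification that applies to it
JUSTIFICATION_ID NUMBER NOT NULL REFERENCES KEN.STANDARD_JUSTIFICATION (ID),
-- a data item may carry more than one justification
PRIMARY KEY (VALUE_ID, JUSTIFICATION_ID)
);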

Using the same approach as we used for justification, we could also add additional tables to capture consent (or revocation) information, such as date and time, evidence of consent (such as the record returned by the “I consent” button that was clicked), even (if desired) the legal terms and conditions that were shown to the user at the time of consent (especially useful if you change your terms and conditions over time, and don’t want to ask the user for new consent each time).
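A consent table along the same lines might look like this; again a sketch only, with illustrative column choices:

CREATE TABLE KEN.CONSENT_EVENT (
ID NUMBER PRIMARY KEY,
-- the individual data item the event applies to
VALUE_ID NUMBER NOT NULL REFERENCES KEN.VALUES_BY_COLUMN (ID),
-- a consent or its later revocation
EVENT_TYPE VARCHAR2(10) NOT NULL CHECK (EVENT_TYPE IN ('CONSENT', 'REVOKE')),
EVENT_TIME TIMESTAMP NOT NULL,
-- evidence, e.g., the record returned by the "I consent" button
EVIDENCE CLOB,
-- which version of the terms and conditions was shown at the time
TERMS_VERSION VARCHAR2(40)
);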

Keeping it all up-to-date

The reader experienced with data replication will have noticed that the schema proposed above will be a lot of work to keep synchronized with the application data. The ideal approach to this problem is to set up some kind of automatic replication. A modern way to accomplish this is via messages in a publish/subscribe model, for example:

  • Case 1 – the application changes an item of personal data. It sends a message to the data inventory, which has subscribed to all such changes; the data inventory then applies the change to its copy of that item (see the sketch after this list).
  • Case 2 – the data inventory changes data in response to a data-subject request for correction, clarification, or purging. It sends a message to each application storing the changed data item, and each application applies the change.
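Setting up full publish/subscribe plumbing is beyond the scope of this post, but a database trigger can stand in for Case 1 as a simpler, synchronous sketch: the application’s change is applied directly to the inventory’s copy of the item. The trigger below is illustrative only, covers a single column (FIRST_NAME), and assumes the tables shown earlier; a real implementation would publish a message here instead.

CREATE OR REPLACE TRIGGER HR.EMPLOYEES_PII_SYNC
AFTER UPDATE OF FIRST_NAME ON HR.EMPLOYEES
FOR EACH ROW
BEGIN
  -- apply the application's change to the inventory's copy of the item
  UPDATE KEN.VALUES_BY_COLUMN VBC
  SET VBC.COLUMN_VALUE = :NEW.FIRST_NAME
  WHERE VBC.LOCAL_ROW_ID = :NEW.EMPLOYEE_ID
  AND VBC.INVENTORY_ID =
    (SELECT DIL.ID
     FROM KEN.DATA_INVENTORY_LOCAL DIL
     WHERE DIL.OWNER = 'HR'
     AND DIL.TABLE_NAME = 'EMPLOYEES'
     AND DIL.COLUMN_NAME = 'FIRST_NAME');
END;
/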

These flows are oversimplified, as is the model. An application may object that it cannot purge the data for some business reason; if this happens only rarely, it might be resolved manually. If it happens more often, the application might send a counter-message providing a justification, which the data inventory could add to its justifications for keeping the data item.

This mini-model is only for illustration; a real data-inventory application will be much more complex, with many more tables and categories. That said, the above approach is far better than maintaining your inventory as a flat model on, say, a spreadsheet (I have seen sites offering template spreadsheets for just this purpose). A relational solution enables you to capture more data, more efficiently, with flexibility for the future.
