The Data Inventory, Part 1

Let’s get started on something concrete. One of the first things you’ll need to launch your privacy-compliance effort is an inventory of what data you are currently storing. This inventory will be at the core of your efforts, and will be the reference point for stakeholders. In this article I suggest a basic approach to get started using a single table. Future posts will add more tables to provide additional information, so that in the end we have a small schema for our inventory. 

Assumptions

First, I’m going to make a few assumptions. They might not all apply to your data holdings, in which case you may not need all of the information that I will lay out in my planned data-inventory series of posts. Here are my assumptions for this post:

  • you have data ‘silos’, meaning that you have different applications with their own databases
  • these silos may be storing the same types of data, or even multiple copies of the same values (e.g., the name and address of a given person)
  • you prefer to start simply, for example, with a spreadsheet, rather than jumping into a database
  • most of your critical data resides in (or at least has its master copy in) an RDBMS product, such as Oracle, Sybase ASE/IQ, MS SQL Server, Postgres, or MySQL)

Possible inventory elements

Our first data inventory will consist of a single table, with one row per data item. This will make it easy to capture in a spreadsheet (though we will need to move to a database in later stages, for reasons which will become apparent).

Here are the columns that we need to capture:

  • database host – the physical server where is the data stored
  • database name – within the database host, which database is it
  • database schema – within the database, which database user owns the the table in question
  • data-element local name – this is the name that the database uses to identify the data item
  • data-element global name – a given data item, such as a person’s family name or birth date, might not have the same column name, length, format, or even data type across all databases. For this reason we want a global label so that we can identify all instances of a particular PII-element, regardless of what it is called in its storage system
  • data-element comment – I rarely see this feature, which is present in all commercial RDBMS products, taken advantage of, but if you do have comments, by all means capture them
  • data type – this is the data type (e.g., text, date, number, currency)
  • data format – some kinds of data, such as telephone numbers or dates, are rejected unless they conform to a defined format
  • primary key – is the data item (part of) a primary key
  • foreign key – is the data item (part of) a foreign key
  • optionality – is the data item mandatory (that is, must it be filled in to be accepted in the database)
  • table rows – how many rows are in the table that holds this data element
  • different values (cardinality) – how many distinct values
  • blank values – how many of the values of this data item are empty. (For mandatory data elements, this value should be zero)
  • master copy – do you believe this to be the master (most reliable) copy of this data
  • PII category – what is your first take on whether this data item is PII (I suggest: ‘PII’, ‘not-PII’, and ‘combo-PII’) where:
  • PII – things you are sure are PII (name, address, phone, age, sex, IP address)
  • not-PII – your company’s product information, publicly-available information (e.g., a supplier’s published address)
  • combo-PII – this term applies to data items that are individually not PII but, in combination with PII or other combo-PIIs, might be used to identify a person (much more on this later)

The alert reader will notice that this structure cannot capture some variations, for example:

  • the database might reside on a virtual server, which in turn resides on a physical host or even a cloud host
  • the data item might not reside in an RDBMS, but in an XML or big-data store, in which case some values above might not be applicable

Note that there are RDBMS internal schemas, such as the one which owns the catalog or data dictionary, that normally will hold no PII. We can exclude these by default.

What we can do with this simple structure

Although such a table is basic, it already permits us to answer some basic questions about our PII:

  • What data items do we have on this host, database, or schema
  • How many rows of PII do we have
  • How many data items do we have that contain PII-likely strings (e.g., ‘name’, ‘phone’, ‘street’)
  • How many data items contain a small percentage of distinct values (probably not PII by itself, for reasons we’ll see in a future post)
  • How many data items contain a large percentage of distinct values (probably PII)

Such results will be helpful in creating your initial impact assessment.

Extracting the information

Below is an example of how to start extracting the information to populate your starter table. This example is from Oracle 11g Express, using the EMPLOYEES table in the provided sample schema HR. The structure of the equivalent query in other RDMSs will be similar, though the table and column names will likely vary.

SELECT

    C.OWNER,

    C.TABLE_NAME,

    C.COLUMN_NAME,

    C.DATA_TYPE,

    C.DATA_LENGTH as LEN,

    C.DATA_PRECISION as DECIMALS,

    C.NULLABLE,

    C.NUM_DISTINCT AS DIFF_VALUES,

    T.NUM_ROWS,

    C.NUM_NULLS

FROM

    SYS.ALL_TAB_COLUMNS C

    INNER JOIN SYS.ALL_TABLES T ON T.TABLE_NAME = C.TABLE_NAME

WHERE

    C.OWNER = 'HR'

AND C.TABLE_NAME = 'EMPLOYEES'

ORDER BY

    C.COLUMN_NAME;

Gives us the result:

OWNER  TABLE_NAME  COLUMN_NAME     DATA_TYPE  LEN  DECIMALS  NULLABLE  DIFF_VALUES  NUM_ROWS  NUM_NULLS 

-----  ----------  --------------  ---------  ---  --------  --------  -----------  --------  --------- 

HR     EMPLOYEES   COMMISSION_PCT  NUMBER     22   2         Y         7            107       72        

HR     EMPLOYEES   DEPARTMENT_ID   NUMBER     22   4         Y         11           107       1         

HR     EMPLOYEES   EMAIL           VARCHAR2   25   (null)    N         107          107       0         

HR     EMPLOYEES   EMPLOYEE_ID     NUMBER     22   6         N         107          107       0         

HR     EMPLOYEES   FIRST_NAME      VARCHAR2   20   (null)    Y         91           107       0         

HR     EMPLOYEES   HIRE_DATE       DATE       7    (null)    N         98           107       0         

HR     EMPLOYEES   JOB_ID          VARCHAR2   10   (null)    N         19           107       0         

HR     EMPLOYEES   LAST_NAME       VARCHAR2   25   (null)    N         102          107       0         

HR     EMPLOYEES   MANAGER_ID      NUMBER     22   6         Y         18           107       1         

HR     EMPLOYEES   PHONE_NUMBER    VARCHAR2   20   (null)    Y         107          107       0         

HR     EMPLOYEES   SALARY          NUMBER     22   8         Y         58           107       0

What can we tell about this data at a glance? Most of it looks like PII, with at least one especially-sensitive element (SALARY). We can also see that the data is pretty complete, judging from the low number of NULL values (except for COMMISSION_PCT, presumably reflecting the fact that most employees are not on commission).

I have used a simple join to include the number of rows in the table (column NUM_ROWS). Similar joins can be added to include additional information, such as column comments and whether the data element is part of a primary or foreign key.

In future posts we’ll add more tables around this basic table to include more information.

Enriching the data inventory

The extraction was the easy part; the harder part will be to assign the data things like PII-categorization and enterprise name, since these labels have to be chosen, assigned and managed by you, the data controller. Determining whether a particular data element is the master copy will require some knowledge of the system that stores the data.

This is where the hard work comes in. You will need to have a process for choosing data-element labels, as these will be used to identify and shield PII. Give the choices some thought since, once deployed, it will be inconvenient to change them.

PII-categorization and the ‘badge’

“Data protection by design and by default” is the title of Article 25 of the GDPR. I suggest that, as an operational discipline, PII be your default category for all data elements. This means that all data defaults to private (that is, forbidden for many uses). The only exception is for data elements that come under a recognized exception (which I will call their ‘badge’, after the badges that most of need to enter our workplace. The data inventory can serve as the basis for a data-badge management system.

Leave a Reply