The Data Inventory, Part 2

In The Data Inventory, Part 1, we saw how a data inventory can be started with a single-table format in a spreadsheet. In this post we will use relational tables to modify and enhance this design to include additional information about our data to include a global view of all our databases, plus a global view of our data items.

In addition to the Oracle schema that was summarized in Part 1, I have added the columns of a table of customers from SQLite (using the sample Chinook schema). This is our result:

ID  DATABASE_ID  OWNER    TABLE_NAME  COLUMN_NAME     COLUMN_GLOBAL_ID  DATA_TYPE     LEN     DECIMALS  NULLABLE  DIFF_VALUES  NUM_ROWS  NUM_NULLS  

--  -----------  -------  ----------  --------------  ----------------  ------------  ------  --------  --------  -----------  --------  ---------  

1   1            HR       EMPLOYEES   COMMISSION_PCT  1                 NUMBER        22      2         Y         7            107       72         

2   1            HR       EMPLOYEES   DEPARTMENT_ID   4                 NUMBER        22      4         Y         11           107       1          

3   1            HR       EMPLOYEES   EMAIL           3                 VARCHAR2      25      (null)    N         107          107       0          

4   1            HR       EMPLOYEES   EMPLOYEE_ID     4                 NUMBER        22      6         N         107          107       0          

5   1            HR       EMPLOYEES   FIRST_NAME      5                 VARCHAR2      20      (null)    Y         91           107       0          

6   1            HR       EMPLOYEES   HIRE_DATE       6                 DATE          7       (null)    N         98           107       0          

7   1            HR       EMPLOYEES   JOB_ID          4                 VARCHAR2      10      (null)    N         19           107       0          

8   1            HR       EMPLOYEES   LAST_NAME       8                 VARCHAR2      25      (null)    N         102          107       0          

9   1            HR       EMPLOYEES   MANAGER_ID      4                 NUMBER        22      6         Y         18           107       1          

10  1            HR       EMPLOYEES   PHONE_NUMBER    10                VARCHAR2      20      (null)    Y         107          107       0          

11  1            HR       EMPLOYEES   SALARY          11                NUMBER        22      8         Y         58           107       0          

16  2            Chinook  Customers   Address         13                NVARCHAR(70)  (null)  (null)    Y         (null)       (null)    (null)     

17  2            Chinook  Customers   City            14                NVARCHAR(40)  (null)  (null)    Y         (null)       (null)    (null)     

15  2            Chinook  Customers   Company         12                NVARCHAR(80)  (null)  (null)    Y         (null)       (null)    (null)     

19  2            Chinook  Customers   Country         17                NVARCHAR(40)  (null)  (null)    Y         (null)       (null)    (null)     

12  2            Chinook  Customers   CustomerId      4                 INTEGER       (null)  (null)    N         (null)       (null)    (null)     

23  2            Chinook  Customers   Email           3                 NVARCHAR(60)  (null)  (null)    N         (null)       (null)    (null)     

22  2            Chinook  Customers   Fax             (null)            NVARCHAR(24)  (null)  (null)    Y         (null)       (null)    (null)     

13  2            Chinook  Customers   FirstName       5                 NVARCHAR(40)  (null)  (null)    N         (null)       (null)    (null)     

14  2            Chinook  Customers   LastName        8                 NVARCHAR(20)  (null)  (null)    N         (null)       (null)    (null)     

21  2            Chinook  Customers   Phone           10                NVARCHAR(24)  (null)  (null)    Y         (null)       (null)    (null)     

20  2            Chinook  Customers   PostalCode      18                NVARCHAR(10)  (null)  (null)    Y         (null)       (null)    (null)     

18  2            Chinook  Customers   State           15                NVARCHAR(40)  (null)  (null)    Y         (null)       (null)    (null)     

24  2            Chinook  Customers   SupportRepId    16                INTEGER       (null)  (null)    Y         (null)       (null)    (null)

We can note a few things here:

  • Table and column name conventions differ across database products; SQLite names do not default to all-caps, as they do in Oracle
  • SQLite’s catalog display does not split datatype name, length, and precision into separate columns, as does Oracle
  • SQLite does not give us a quick-and-easy way to add information like the number of table rows
  • I have added a new column, DATABASE_ID, as the second column
  • I have added a new column, COLUMN_GLOBAL_ID, just after COLUMN_NAME

Regarding the first three items, the lesson is that different database products have differing conventions about how their catalog data is arranged. It might be helpful to have information like datatype, column width, and so forth, but maybe we don’t really need it for a data inventory geared to PII. After all, a person’s date of birth may be stored as a DATE datatype (Oracle) or just a formatted string (SQLite). It’s PII either way, so knowing the datatype is not critical.

Regarding DATABASE_ID, this is an ID-type value, one which we make up as a reference to a row in another table, DATABASES. In this case the referred table looks like this:

ID  DATABASE_NAME  HOST_NAME    DB_PRODUCT          

--  -------------  -----------  ------------------  

1   OraXE          acer_laptop  Oracle Express 11g  

2   SQLite01       acer_laptop  SQLite3

We now can capture our source database for each schema. Because databases generally have multiple schemas, this structure allows us to capture information about all the schemas in a particular database without having to duplicate the common information over and over. We are free to add useful information about each database, such as the exact version, its function (dev/test/prod), and so forth.

More to the point, we are now able to answer questions like:

  • What databases do we have on host X?
  • What Oracle Express 11g databases do we have?

and so forth. Depending on our needs, we can add more tables, such as a table of hosts or installed database software.

Our second new ID-type column is COLUMN_GLOBAL_ID, which refers to a new table, DATA_INVENTORY_GLOBAL, which looks like this:

COLUMN_NAME_GLOBAL        GLBL_PII              OWNER    TABLE_NAME  COLUMN_NAME     

------------------------  --------------------  -------  ----------  --------------  

COMMISSION_PCT            PII                   HR       EMPLOYEES   COMMISSION_PCT  

EMAIL                     PII                   Chinook  Customers   Email           

EMAIL                     PII                   HR       EMPLOYEES   EMAIL           

GENERATED_ID              NOT PII               HR       EMPLOYEES   EMPLOYEE_ID     

GENERATED_ID              NOT PII               HR       EMPLOYEES   DEPARTMENT_ID   

GENERATED_ID              NOT PII               Chinook  Customers   CustomerId      

GENERATED_ID              NOT PII               HR       EMPLOYEES   MANAGER_ID      

GENERATED_ID              NOT PII               HR       EMPLOYEES   JOB_ID          

FIRST_NAME                PII                   HR       EMPLOYEES   FIRST_NAME      

FIRST_NAME                PII                   Chinook  Customers   FirstName       

HIRE_DATE                 COMBO                 HR       EMPLOYEES   HIRE_DATE       

LAST_NAME                 PII                   HR       EMPLOYEES   LAST_NAME       

LAST_NAME                 PII                   Chinook  Customers   LastName        

PHONE_NUMBER              PII                   HR       EMPLOYEES   PHONE_NUMBER    

PHONE_NUMBER              PII                   Chinook  Customers   Phone           

SALARY                    PII                   HR       EMPLOYEES   SALARY          

COMPANY_NAME              COMBO                 Chinook  Customers   Company         

ADDRESS                   PII                   Chinook  Customers   Address         

CITY                      COMBO                 Chinook  Customers   City            

STATE                     COMBO                 Chinook  Customers   State           

EXPRESS_SERVICE_CONTRACT  NOT PII               Chinook  Customers   SupportRepId    

COUNTRY                   COMBO                 Chinook  Customers   Country         

POSTAL_CODE               COMBO                 Chinook  Customers   PostalCode

With a table like this we can see where we have specific kinds of data, particularly PII or possible PII (the COMBO category). A phone number or a last name is always PII, regardless of technical details (what it’s called in the database, how it’s stored, etc). We can always find our last-name data elements by searching for the global name “LAST_NAME” and see that element’s privacy classification. (Note that we have to choose the global-element names and assign privacy categories to them ourselves.)

It’s helpful to be able to retrieve our PII data elements whenever we need to, such as to create reports for the DPA or DPO, to create impact assessments, and so forth.

Our software can also consult such a database before revealing data, in order to know what rules must be applied to the data (whether or not this piece of data has the ‘badge’ proposed in Part 1 to authorize its exposure). In principle,  any code that potentially exposes PII can check the applicable badge for each data item every time that it is invoked. In a future post we will see an example of such functionality.

This strategy has several benefits, including:

  • privacy rules need not be hard-coded, but can be looked up
  • a data element might have its rule changed at the global level, then automatically distributed and followed by each application
  • if PII has a justification for exposure, this structure can be further augmented to show the justification reasons (e.g., data-subject consent)
  • the sensitivity of combo elements could be decided by algorithm, based on the actual content of the data (a technique to be illustrated in a future post)
  • this approach, using a global system to manage and propagate privacy rules, is part of a strategy of privacy by default/design (and will hopefully impress any auditors who come to visit)

We now we have a mini-schema that looks like this:

3-table data inventory

A final caveat is in order. Your data inventory, while not itself constituting PII, must be well secured. If the database comes into the wrong hands it could provide a road map to the most sensitive data, and as such is a security risk.

Leave a Reply