In The Data Inventory, Part 1, we saw how a data inventory can be started with a single-table format in a spreadsheet. In this post we will use relational tables to modify and enhance this design to include additional information about our data to include a global view of all our databases, plus a global view of our data items.
In addition to the Oracle schema that was summarized in Part 1, I have added the columns of a table of customers from SQLite (using the sample Chinook schema). This is our result:
ID DATABASE_ID OWNER TABLE_NAME COLUMN_NAME COLUMN_GLOBAL_ID DATA_TYPE LEN DECIMALS NULLABLE DIFF_VALUES NUM_ROWS NUM_NULLS -- ----------- ------- ---------- -------------- ---------------- ------------ ------ -------- -------- ----------- -------- --------- 1 1 HR EMPLOYEES COMMISSION_PCT 1 NUMBER 22 2 Y 7 107 72 2 1 HR EMPLOYEES DEPARTMENT_ID 4 NUMBER 22 4 Y 11 107 1 3 1 HR EMPLOYEES EMAIL 3 VARCHAR2 25 (null) N 107 107 0 4 1 HR EMPLOYEES EMPLOYEE_ID 4 NUMBER 22 6 N 107 107 0 5 1 HR EMPLOYEES FIRST_NAME 5 VARCHAR2 20 (null) Y 91 107 0 6 1 HR EMPLOYEES HIRE_DATE 6 DATE 7 (null) N 98 107 0 7 1 HR EMPLOYEES JOB_ID 4 VARCHAR2 10 (null) N 19 107 0 8 1 HR EMPLOYEES LAST_NAME 8 VARCHAR2 25 (null) N 102 107 0 9 1 HR EMPLOYEES MANAGER_ID 4 NUMBER 22 6 Y 18 107 1 10 1 HR EMPLOYEES PHONE_NUMBER 10 VARCHAR2 20 (null) Y 107 107 0 11 1 HR EMPLOYEES SALARY 11 NUMBER 22 8 Y 58 107 0 16 2 Chinook Customers Address 13 NVARCHAR(70) (null) (null) Y (null) (null) (null) 17 2 Chinook Customers City 14 NVARCHAR(40) (null) (null) Y (null) (null) (null) 15 2 Chinook Customers Company 12 NVARCHAR(80) (null) (null) Y (null) (null) (null) 19 2 Chinook Customers Country 17 NVARCHAR(40) (null) (null) Y (null) (null) (null) 12 2 Chinook Customers CustomerId 4 INTEGER (null) (null) N (null) (null) (null) 23 2 Chinook Customers Email 3 NVARCHAR(60) (null) (null) N (null) (null) (null) 22 2 Chinook Customers Fax (null) NVARCHAR(24) (null) (null) Y (null) (null) (null) 13 2 Chinook Customers FirstName 5 NVARCHAR(40) (null) (null) N (null) (null) (null) 14 2 Chinook Customers LastName 8 NVARCHAR(20) (null) (null) N (null) (null) (null) 21 2 Chinook Customers Phone 10 NVARCHAR(24) (null) (null) Y (null) (null) (null) 20 2 Chinook Customers PostalCode 18 NVARCHAR(10) (null) (null) Y (null) (null) (null) 18 2 Chinook Customers State 15 NVARCHAR(40) (null) (null) Y (null) (null) (null) 24 2 Chinook Customers SupportRepId 16 INTEGER (null) (null) Y (null) (null) (null)
We can note a few things here:
- Table and column name conventions differ across database products; SQLite names do not default to all-caps, as they do in Oracle
- SQLite’s catalog display does not split datatype name, length, and precision into separate columns, as does Oracle
- SQLite does not give us a quick-and-easy way to add information like the number of table rows
- I have added a new column, DATABASE_ID, as the second column
- I have added a new column, COLUMN_GLOBAL_ID, just after COLUMN_NAME
Regarding the first three items, the lesson is that different database products have differing conventions about how their catalog data is arranged. It might be helpful to have information like datatype, column width, and so forth, but maybe we don’t really need it for a data inventory geared to PII. After all, a person’s date of birth may be stored as a DATE datatype (Oracle) or just a formatted string (SQLite). It’s PII either way, so knowing the datatype is not critical.
Regarding DATABASE_ID, this is an ID-type value, one which we make up as a reference to a row in another table, DATABASES. In this case the referred table looks like this:
ID DATABASE_NAME HOST_NAME DB_PRODUCT -- ------------- ----------- ------------------ 1 OraXE acer_laptop Oracle Express 11g 2 SQLite01 acer_laptop SQLite3
We now can capture our source database for each schema. Because databases generally have multiple schemas, this structure allows us to capture information about all the schemas in a particular database without having to duplicate the common information over and over. We are free to add useful information about each database, such as the exact version, its function (dev/test/prod), and so forth.
More to the point, we are now able to answer questions like:
- What databases do we have on host X?
- What Oracle Express 11g databases do we have?
and so forth. Depending on our needs, we can add more tables, such as a table of hosts or installed database software.
Our second new ID-type column is COLUMN_GLOBAL_ID, which refers to a new table, DATA_INVENTORY_GLOBAL, which looks like this:
COLUMN_NAME_GLOBAL GLBL_PII OWNER TABLE_NAME COLUMN_NAME ------------------------ -------------------- ------- ---------- -------------- COMMISSION_PCT PII HR EMPLOYEES COMMISSION_PCT EMAIL PII Chinook Customers Email EMAIL PII HR EMPLOYEES EMAIL GENERATED_ID NOT PII HR EMPLOYEES EMPLOYEE_ID GENERATED_ID NOT PII HR EMPLOYEES DEPARTMENT_ID GENERATED_ID NOT PII Chinook Customers CustomerId GENERATED_ID NOT PII HR EMPLOYEES MANAGER_ID GENERATED_ID NOT PII HR EMPLOYEES JOB_ID FIRST_NAME PII HR EMPLOYEES FIRST_NAME FIRST_NAME PII Chinook Customers FirstName HIRE_DATE COMBO HR EMPLOYEES HIRE_DATE LAST_NAME PII HR EMPLOYEES LAST_NAME LAST_NAME PII Chinook Customers LastName PHONE_NUMBER PII HR EMPLOYEES PHONE_NUMBER PHONE_NUMBER PII Chinook Customers Phone SALARY PII HR EMPLOYEES SALARY COMPANY_NAME COMBO Chinook Customers Company ADDRESS PII Chinook Customers Address CITY COMBO Chinook Customers City STATE COMBO Chinook Customers State EXPRESS_SERVICE_CONTRACT NOT PII Chinook Customers SupportRepId COUNTRY COMBO Chinook Customers Country POSTAL_CODE COMBO Chinook Customers PostalCode
With a table like this we can see where we have specific kinds of data, particularly PII or possible PII (the COMBO category). A phone number or a last name is always PII, regardless of technical details (what it’s called in the database, how it’s stored, etc). We can always find our last-name data elements by searching for the global name “LAST_NAME” and see that element’s privacy classification. (Note that we have to choose the global-element names and assign privacy categories to them ourselves.)
It’s helpful to be able to retrieve our PII data elements whenever we need to, such as to create reports for the DPA or DPO, to create impact assessments, and so forth.
Our software can also consult such a database before revealing data, in order to know what rules must be applied to the data (whether or not this piece of data has the ‘badge’ proposed in Part 1 to authorize its exposure). In principle, any code that potentially exposes PII can check the applicable badge for each data item every time that it is invoked. In a future post we will see an example of such functionality.
This strategy has several benefits, including:
- privacy rules need not be hard-coded, but can be looked up
- a data element might have its rule changed at the global level, then automatically distributed and followed by each application
- if PII has a justification for exposure, this structure can be further augmented to show the justification reasons (e.g., data-subject consent)
- the sensitivity of combo elements could be decided by algorithm, based on the actual content of the data (a technique to be illustrated in a future post)
- this approach, using a global system to manage and propagate privacy rules, is part of a strategy of privacy by default/design (and will hopefully impress any auditors who come to visit)
We now we have a mini-schema that looks like this:
A final caveat is in order. Your data inventory, while not itself constituting PII, must be well secured. If the database comes into the wrong hands it could provide a road map to the most sensitive data, and as such is a security risk.