What parts of the GDPR are most relevant to you?

One way to become familiar with the legislation is to read it from beginning to end. Consisting of 88 pages of PDF and over 55,000 words, the GDPR is not a fast read. Nor, having read it, are you likely to hold it in your head. What if you could skip to the parts that are most interesting for you? This post suggests a simple approach to doing just that.

The Python code used to produce my first post on examining the text of the GDPR (link) can, with only slight modification, produce a list of all the words used in the document, plus a count of each word’s occurrences in the GDPR. (Remember from that post that ‘stopwords’, those very common words with little meaning in themselves, are excluded.) The main difference between the two posts is that here I examine the least frequently occurring nouns, while in that post I looked at the most frequently occurring verbs.

For convenience, here is the code.


from operator import itemgetter

from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import pos_tag

# NLTK's default English stop words
# Stop words are high-occurrence, low-semantic-value words ('the', 'very', etc.)
english_stopwords = set(stopwords.words('english'))

# Open, read, and tokenize the GDPR
# 'tokenize' means to split a text into its words and symbols
# the file CELEX_32016R0679_EN_TXT.txt is a text extract of the PDF
# version of the GDPR, made using Adobe Reader's 'save as/text'
with open(r"c:\temp\CELEX_32016R0679_EN_TXT.txt", 'r') as file_pointer:
    gdpr_words = word_tokenize(file_pointer.read())

# Remove single-character tokens, which are mostly punctuation
gdpr_words = [word for word in gdpr_words if len(word) > 1]

# Remove numbers
gdpr_words = [word for word in gdpr_words if not word.isnumeric()]

# Lowercase all gdpr_words (the stopwords are lowercase as well)
gdpr_words = [word.lower() for word in gdpr_words]

# Remove stop words from the text
gdpr_words = [word for word in gdpr_words if word not in english_stopwords]

# Tag each word with its part of speech, then count each (word, tag) pair
gdpr_words_pos = pos_tag(gdpr_words)
tag_freq_distn = FreqDist(gdpr_words_pos)

# Frequency distribution of all words, whatever their part of speech
# (not used below)
freq_distn = FreqDist(gdpr_words)

# Make a list of the nouns only, whether singular ('NN') or plural ('NNS')
# Each entry is a tuple: (word, count, tag)
freq_distn_nouns = []
for word in tag_freq_distn:
    if word[1] in ['NN', 'NNS']:
        noun_freq = (word[0], tag_freq_distn[word], word[1])
        freq_distn_nouns.append(noun_freq)

# Sort by frequency, then alphabetically within each frequency
freq_distn_nouns_sorted = sorted(freq_distn_nouns, key=itemgetter(1, 0))

# Print one comma-separated line per noun: word,count,tag
for word in freq_distn_nouns_sorted:
    print(word[0] + ',' + str(word[1]) + ',' + word[2])

Here we will consider those words that occur only once in the GDPR. Because these words are not repeated, we can infer that they were specifically chosen, as opposed to being part of the repetitive, formulaic phrasings that normally characterize legal and bureaucratic documents. In other words, these particular terms are more likely to be significant in the minds of the framers of the GDPR, and thus a good place to find significant portions of the legislation.
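The listing above prints every noun together with its count, so narrowing the output to the once-only words takes one more filter. The following is a minimal sketch of that step, not the code used to produce the list in this post. It relies only on the standard library; the function name `once_only` and the sample tokens are hypothetical, chosen purely for illustration.

```python
from collections import Counter


def once_only(words):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count == 1)


# Hypothetical sample; in practice this would be the cleaned list of GDPR nouns
sample = ["data", "controller", "data", "mankind", "abuse", "controller"]
print(once_only(sample))
```

If you are already working with an NLTK `FreqDist`, its `hapaxes()` method should give the same kind of result without a hand-written filter.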

The list of one-time words, arranged alphabetically, begins as follows:

absolute
abuse
accession
accomplishment
accredit
accurate
adaptation
adaptations

The entire list contains 476 words and is given at the end of this post. The list could be a bit shorter if plural forms were reduced to their singular equivalents (e.g., ‘adaptation’ and ‘adaptations’, above), a common language-processing step known as ‘stemming’ (or, when it maps each word to its dictionary form, ‘lemmatization’). I opted to keep things simple here, however.
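To give a feel for what folding plurals would do to the list, here is a rough sketch. It is deliberately crude and is not a real stemmer: it maps a word ending in ‘s’ to its singular only when that singular also appears in the text. The function name `fold_plurals` and the sample tokens are hypothetical. A proper treatment would use something like NLTK's `PorterStemmer` or `WordNetLemmatizer` instead.

```python
from collections import Counter


def fold_plurals(words):
    """Map a word ending in 's' to its singular when that singular also occurs."""
    vocabulary = set(words)
    folded = []
    for word in words:
        singular = word[:-1]
        if word.endswith("s") and singular in vocabulary:
            folded.append(singular)
        else:
            folded.append(word)
    return folded


# Hypothetical sample tokens
sample = ["adaptation", "adaptations", "abuse", "process", "data", "data"]
counts = Counter(fold_plurals(sample))
once = sorted(word for word, count in counts.items() if count == 1)
print(once)
```

Note the effect on the once-only list: after folding, ‘adaptation’ has a count of two, so both forms drop out of the list rather than collapsing into a single entry.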

Let’s look at five of the first words in this list and the GDPR text surrounding them (using the search facility in the PDF reader to locate each term):

absolute

The processing of personal data should be designed to serve mankind. The right to the protection of personal data is not an absolute right; it must be considered in relation to its function in society and be balanced against other fundamental rights, in accordance with the principle of proportionality.

abuse

2. In particular, any legislative measure referred to in paragraph 1 shall contain specific provisions at least, where relevant, as to:

[…]

(d) the safeguards to prevent abuse or unlawful access or transfer;

accession

[T]he Commission should take account of obligations arising from the third country’s or international organisation’s participation in multilateral or regional systems in particular in relation to the protection of personal data, as well as the implementation of such obligations. In particular, the third country’s accession to the Council of Europe Convention of 28 January 1981 for the Protection of Individuals with regard to the Automatic Processing of Personal Data and its Additional Protocol should be taken into account.

accredited

[T]he monitoring of compliance with a code of conduct pursuant to Article 40 may be carried out by a body which has an appropriate level of expertise in relation to the subject-matter of the code and is accredited for that purpose by the competent supervisory authority.

accurate

1. Personal data shall be:

[…]

(d) accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that personal data that are inaccurate, having regard to the purposes for which they are processed, are erased or rectified without delay (‘accuracy’);

I would argue that each of these sections of the GDPR is interesting and will be important to many data controllers and processors. I suggest looking through the list for terms of particular interest to your firm (e.g., ‘banking’, ‘cancer’, ‘chromosomal’, ‘conscience’, ‘consumer’) and looking up the clause which uses each one.
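If you would rather not search the PDF by hand for every term, the lookup can be approximated in code. The sketch below is one possible approach, assuming you have the GDPR as plain text (as in the listing earlier in this post). The function name `show_context`, the `width` parameter, and the sample text are hypothetical; the sample sentences are taken from the passage quoted under ‘absolute’ above.

```python
import re


def show_context(text, term, width=60):
    """Return snippets of `text` surrounding each whole-word match of `term`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse line breaks and runs of spaces so each snippet is one line
        snippets.append(" ".join(text[start:end].split()))
    return snippets


sample_text = (
    "The processing of personal data should be designed to serve mankind. "
    "The right to the protection of personal data is not an absolute right; "
    "it must be considered in relation to its function in society."
)
for snippet in show_context(sample_text, "mankind", width=20):
    print(snippet)
```

NLTK users may prefer the `concordance()` method of `nltk.Text`, which serves a similar purpose.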

Using a process like this, you can not only gain a better understanding of the legislation as it relates to your organization, but also incorporate those sections that are most critical for your firm’s business into your overall effort, for example:

  • sections that are vague, subjective, or unclear can be inventoried and submitted to your legal counsel for clarification
  • sections that you will rely on to justify your data processing can be linked to the relevant items in your data inventory (link)
  • sections that strengthen your justification, such as the ‘should’ or ‘may’ recitals from the GDPR (link), should also be linked to the data inventory; in the event of an auditor’s challenge, you can argue in your favor that your process supports the broad goals of the GDPR and direct the auditor’s attention to the precise clause supporting your position.

Of course, the technique outlined here is only approximate. After all, some one-time words will be insignificant, while some highly-significant words will occur more than once (and hence not appear on this list). While admitting that this is true, I would argue that such a list is a good place to start, or is at least better than trying to read (and perhaps annotate) the entire document in one go.

One could take this approach further, collecting lists of words which occur 2, 3, or 4 times, etc. One could even compile an index of the GDPR, using the technique of my previous post on this subject (link), showing all the words that appear in it along with the number of times each one appears.
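As a sketch of that extension, the snippet below groups words by how many times they occur, up to a chosen ceiling. It is an illustration under assumptions rather than a finished tool: the function name `words_by_count`, the `max_count` parameter, and the sample tokens are all hypothetical.

```python
from collections import Counter, defaultdict


def words_by_count(words, max_count=4):
    """Group words by number of occurrences, keeping counts up to max_count."""
    groups = defaultdict(list)
    for word, count in Counter(words).items():
        if count <= max_count:
            groups[count].append(word)
    # Order the groups by count and the words within each group alphabetically
    return {count: sorted(group) for count, group in sorted(groups.items())}


# Hypothetical sample tokens
sample = ["data", "data", "data", "abuse", "mankind", "right", "right"]
for count, group in words_by_count(sample).items():
    print(count, ", ".join(group))
```

Raising `max_count` to a very large number turns the result into a simple frequency index of every word in the text.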

To save the reader the trouble of running the Python listing near the top of this post, which may require installing some libraries and other tiresome details, the entire list of one-time nouns is reproduced here:

absolute
abuse
accession
accomplishment
accredit
accurate
adaptation
adaptations
addressees
adequate
administrations
advertising
affect
aggregate
alter
amendments
and/or
applicable
appointment
appraise
archiving
arrange
assess
assists
attacks
attainment
attitudes
auditor
authenticity
avoid
awareness-raising
ban
banking
become
behaviour
behavioural
benefit
binding
bis
border
boxes
branch
broader
brought
brussels
budgets
burdens
cancer
cause
cessation
chamber
change
chromosomal
circumvention
clarify
client
coherence
commence
comment
compatibility
compelling
completion
component
compulsory
conclusion
concrete
confer
confidence
confirms
conformity
conscience
consensus
considerations
consists
constitute
constraints
consular
consumer
contain
content
contexts
continue
continuity
contracts
contradict
controls
convergence
coordinate
copyright
correlation
correspondence
cost-effectiveness
counselling
counterparts
cover
credentials
credit
criterion
currency
customer
customers
danger
decides
definitions
deletes
denmark
deoxyribonucleic
depression
deprivation
destination
detail
detecting
determinants
determination
determines
device
diagnoses
diagnosis
difference
dignity
disclosures
discussion
discussions
dismiss
dismisses
disputes
disseminate
dissemination
dissuasive
distort
distribution
divergences
dna
doubts
drawn
economies
education
eea
efficiency
eligibility
eliminate
email
emergencies
emergency
employer
employers
en
encompass
enterprise
enters
entity
envisages
epidemics
erase
errors
establishes
estonia
euratom
events
everyone
evidence
examine
exchanges
exclude
exclusion
execute
expansion
expedient
expert
experts
expiration
explanation
extend
extensions
facilitate
fair
fairness
fall
family
favour
feasibility
features
files
financing
formation
formats
forms
formulation
foundation
foundations
fragmentation
geneva
globalisation
handle
health-care
help
hereby
high-quality
hindrance
history
holders
home
hospital
humanity
i.e
idem
identifier
images
imbalance
impede
implement
inaccuracies
inaccurate
inactivity
incentives
income
inconveniences
indefinite
indicate
indications
indiscriminate
industry
informs
infringe
initiative
inspection
instruction
integration
intend
interconnection
interference
interlocutor
intermediary
internet
interpretation
intervals
interventions
introduce
introduction
invalid
iv
january
journal
journalism
judicial
june
kept
laboratories
lack
laundering
lawyer
lay
laying
lays
legislator
levels
liaison
libraries
light
lines
links
losses
m.
making
mandates
mankind
manmade
margin
markets
materialise
medicine
medium
memorandum
mental
merit
merits
met
misconduct
morbidity
moreover
mortality
moves
nationality
news
nomination
non-compliance
norms
notion
notions
obstacle
occupations
offers
officers
oj
one-stop-shop
onward
opposes
originate
overridden
override
overrides
oversight
ownership
pages
parliaments
participate
particularity
partnerships
penalty
pensions
perception
permits
photographs
physicians
physiology
plaintiff
platform
population
positions
post
pre-
precautions
preclude
prepare
preserve
press
preventing
preventive
prevents
producers
produces
professional
profits
prohibit
prohibitions
project
projects
proliferation
protects
protocols
proves
publicity
publish
pursuance
pursuit
qualities
quantity
questions
races
range
rate
re-use
reach
reaction
reading
realisation
reappointment
reciprocity
recognise
recognition
reconciliation
recourse
recruitment
refers
refusal
regions
registers
registration
reimbursement
reject
relates
relationships
repeal
replacement
replication
reporting
represent
representation
reprimand
resides
resignation
resilience
resist
resource
respects
retention
retirement
retrieval
return
returns
revocation
ribonucleic
rna
roles
ruling
samples
schulz
science
secrets
secure
securities
see
seeks
segment
sensitivity
sentence
servers
serves
set
settings
signature
significance
signifies
silence
size
skills
software
solutions
specification
specificities
specify
split
sport
start
store
strengthen
submissions
submits
subsequent
subsidiarity
subsidiary
sums
supervise
supply
surveillance
surveys
symbol
tags
tax-evasion
taxation
tenders
territories
test
teu
text
theories
title
towards
traces
translation
translations
transmitting
trial
tribunals
trust
uncertainty
undergone
unemployment
uniformity
units
urgency
user
values
verifications
viii
vis
visualisation
vital
vitro
vote
war
warnings
whereby
wishes
withdrawn
works
