One way to become familiar with the legislation is to read it from beginning to end. Consisting of 88 pages of PDF and over 55,000 words, the GDPR is not a fast read. Nor, having read it, are you likely to hold it in your head. What if you could skip to the parts that are most interesting for you? This post suggests a simple approach to doing just that.
The Python code used to produce my first post on examining the text of the GDPR (link) can, with only slight modification, produce a list of all the words used in the document, plus a count of each word’s occurrences. (Recall from that post that ‘stopwords’, those very common words with little meaning in themselves, are excluded.) The main difference between the two posts is that here I examine the least frequently occurring nouns, whereas there I looked at the most frequently occurring verbs.
For convenience, here is the code.
    from operator import itemgetter

    from nltk import FreqDist, word_tokenize
    from nltk.corpus import stopwords
    from nltk.tag import pos_tag

    # NLTK's default English stop words
    # Stop words are high-occurrence, low-semantic-value words ('the', 'very', etc.)
    english_stopwords = set(stopwords.words('english'))

    # Open, read, and tokenize the GDPR
    # 'tokenize' means to split a text into its words and symbols
    # the file CELEX_32016R0679_EN_TXT.txt is a text extract of the PDF
    # version of the GDPR, made using Adobe Reader's 'save as/text'
    with open(r"c:\temp\CELEX_32016R0679_EN_TXT.txt", 'r') as file_pointer:
        gdpr_words = word_tokenize(file_pointer.read())

    # Remove single-character tokens, which are mostly punctuation
    gdpr_words = [word for word in gdpr_words if len(word) > 1]

    # Remove numbers
    gdpr_words = [word for word in gdpr_words if not word.isnumeric()]

    # Lowercase all gdpr_words (the stopwords are lowercase as well)
    gdpr_words = [word.lower() for word in gdpr_words]

    # Remove stop words from the text
    gdpr_words = [word for word in gdpr_words if word not in english_stopwords]

    # do part of speech tagging; each item becomes a (word, tag) pair
    gdpr_words_pos = pos_tag(gdpr_words)
    tag_freq_distn = FreqDist(gdpr_words_pos)

    # Calculate frequency distribution
    freq_distn = FreqDist(gdpr_words)

    # make a list of the nouns only, whether singular ('NN') or plural ('NNS')
    # each entry is a (word, count, tag) tuple
    freq_distn_nouns = []
    for word, tag in tag_freq_distn:
        if tag in ['NN', 'NNS']:
            noun_freq = (word, tag_freq_distn[(word, tag)], tag)
            freq_distn_nouns.append(noun_freq)

    # sort by frequency, then alphabetically
    freq_distn_nouns_sorted = sorted(freq_distn_nouns, key=itemgetter(1, 0))
    for word, count, tag in freq_distn_nouns_sorted:
        print(word + ',' + str(count) + ',' + tag)
Here we will consider those words that occur only once in the GDPR. Because these words are not repeated, we can infer that they were specifically chosen, as opposed to being part of the repetitive, formulaic phrasings that normally characterize legal and bureaucratic documents. In other words, these particular terms are more likely to be significant in the minds of the framers of the GDPR, and thus a good place to find significant portions of the legislation.
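As a minimal sketch of that filtering step, assuming the nouns are held as (word, count, tag) tuples like those the script is meant to build, the once-only nouns can be picked out and sorted alphabetically. The `once_only_nouns` helper and the sample counts are illustrative; they are not taken from an actual run over the GDPR.

```python
def once_only_nouns(noun_freqs):
    """Return the alphabetically sorted words whose count is exactly 1."""
    return sorted({word for word, count, _tag in noun_freqs if count == 1})


# Illustrative sample in the (word, count, tag) shape; not real GDPR counts
sample = [
    ("data", 120, "NNS"),
    ("abuse", 1, "NN"),
    ("accession", 1, "NN"),
    ("controller", 45, "NN"),
    ("accuracy", 1, "NN"),
]

print(once_only_nouns(sample))
```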
The list of one-time words, arranged alphabetically, begins as follows:
The entire list contains 476 words and is given at the end of this post. The list could be a bit shorter if plural versions of words were reduced to their singular equivalents (e.g., ‘adaptation’ and ‘adaptations’, above), a common language-processing step known as ‘stemming’ (or, in its dictionary-based form, ‘lemmatization’). I opted to keep things simple here, however.
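To give a rough idea of what merging plurals would involve, here is a small sketch. The `crude_singular` helper is a hypothetical stand-in that only strips a trailing ‘s’; it is not a real stemmer, and a library stemmer or lemmatizer would handle irregular forms (and words such as ‘analysis’) far better. The sample counts are made up for illustration.

```python
from collections import Counter


def crude_singular(word):
    """Strip one trailing 's' (but not 'ss'); a deliberately naive rule."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def merge_plurals(word_counts):
    """Add together the counts of words that share the same crude singular form."""
    merged = Counter()
    for word, count in word_counts.items():
        merged[crude_singular(word)] += count
    return merged


# Illustrative counts only
sample_counts = Counter({"adaptation": 1, "adaptations": 1, "abuse": 1, "access": 3})
merged = merge_plurals(sample_counts)
print(merged)
```

Note that in this sample the merged ‘adaptation’ ends up with a count of 2, so it would no longer qualify as a one-time word at all.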
Let’s look at the first five words in this list and the GDPR text surrounding them (using the search facility in the PDF reader to locate each term):
The processing of personal data should be designed to serve mankind. The right to the protection of personal data is not an absolute right; it must be considered in relation to its function in society and be balanced against other fundamental rights, in accordance with the principle of proportionality.
2. In particular, any legislative measure referred to in paragraph 1 shall contain specific provisions at least, where relevant, as to:
(d) the safeguards to prevent abuse or unlawful access or transfer;
[T]he Commission should take account of obligations arising from the third country’s or international organisation’s participation in multilateral or regional systems in particular in relation to the protection of personal data, as well as the implementation of such obligations. In particular, the third country’s accession to the Council of Europe Convention of 28 January 1981 for the Protection of Individuals with regard to the Automatic Processing of Personal Data and its Additional Protocol should be taken into account.
[T]he monitoring of compliance with a code of conduct pursuant to Article 40 may be carried out by a body which has an appropriate level of expertise in relation to the subject-matter of the code and is accredited for that purpose by the competent supervisory authority.
1. Personal data shall be:
(d) accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that personal data that are inaccurate, having regard to the purposes for which they are processed, are erased or rectified without delay (‘accuracy’);
I would argue that each of these sections of the GDPR is interesting and will be important to many data controllers and processors. I suggest looking through the list for terms of particular interest to your firm (e.g., ‘banking’, ‘cancer’, ‘chromosomal’, ‘conscience’, ‘consumer’) and looking up the clause which uses each one.
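As an alternative to the PDF reader’s search box, the lookup could be sketched in code. This assumes the GDPR text has already been read into a single string; the `find_contexts` helper and its `width` parameter are illustrative names, and the sample text is stitched together from two of the passages quoted above.

```python
import re


def find_contexts(text, term, width=40):
    """Return a snippet of up to `width` characters either side of each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end].strip())
    return snippets


# Two short passages quoted earlier in this post, used here as stand-in text
sample_text = (
    "(d) the safeguards to prevent abuse or unlawful access or transfer; "
    "The right to the protection of personal data is not an absolute right;"
)

for snippet in find_contexts(sample_text, "abuse", width=30):
    print(snippet)
```

Matching on whole words keeps a search for ‘abuse’ from also returning words that merely contain it.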
Using a process like this, you can not only gain a better understanding of the legislation as it relates to your organization, but also incorporate those sections that are most critical for your firm’s business into your overall compliance effort. For example:
- sections that are vague, subjective, or unclear can be inventoried and submitted to your legal counsel for clarification
- sections that you will rely on to justify your data processing can be linked to the relevant items in your data inventory (link)
- sections that strengthen your justification, such as the ‘should’ or ‘may’ provisions of the GDPR (link), should also be linked to the data inventory; in the event of an auditor’s challenge, you can argue in your favor that your process supports the broad goals of the GDPR and direct the auditor’s attention to the precise clause supporting your position.
Of course, the technique outlined here is only approximate. Some one-time words will be insignificant, while some highly significant words will occur more than once (and hence not appear on this list). Even so, I would argue that such a list is a good place to start, or is at least better than trying to read (and perhaps annotate) the entire document in one go.
One could take this approach further, collecting lists of words which occur 2, 3, or 4 times, etc. One could even compile an index of the GDPR, using the technique of my previous post on this subject (link), showing all the words that appear in it along with the number of times each one appears.
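A minimal sketch of that extension, assuming the cleaned word list is available as a plain Python list: group the words by how many times each occurs, so that the lists for 1, 2, 3, or more occurrences all come out of one pass. The `words_by_frequency` helper and the sample words are illustrative only.

```python
from collections import Counter, defaultdict


def words_by_frequency(words):
    """Map each occurrence count to the alphabetized list of words with that count."""
    counts = Counter(words)
    grouped = defaultdict(list)
    for word, count in counts.items():
        grouped[count].append(word)
    # Order the result by count, lowest first
    return {count: sorted(group) for count, group in sorted(grouped.items())}


# Illustrative word list; not real GDPR data
sample_words = ["data", "abuse", "data", "controller", "data", "controller", "accuracy"]
index = words_by_frequency(sample_words)
for count, group in index.items():
    print(count, group)
```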
To save the reader the trouble of running the Python code above, which may require some library imports and other tiresome details, the entire list of one-time nouns is reproduced here: