Considering the GDPR as a whole

Should versus shall

Sitting down to read through the GDPR is not a casual undertaking, but initial skim-throughs left me wondering about the word should, which one encounters often in the text of the legislation. It seemed odd to me that legislation should merely suggest behaviors and outcomes; I had assumed that legislation is a recital of what you must (or must not) do.

It might be useful to compare the frequency of words like ‘should’ and ‘shall’ (known to English grammar as modal or auxiliary verbs) in the GDPR in order to understand the intentions of its creators. What are they trying to convey with their use of these different modal verbs?

This type of analysis is easy using Python’s Natural Language Toolkit (NLTK). I used the script below to generate the word-frequency counts and dispersion plot that follow. (The lines starting with a ‘#’ sign are comments that explain what the code is doing. The file ending in “.txt” contains the text of the GDPR, extracted from the legislation using Adobe Reader’s ‘save as other/text’ command.)

import matplotlib.pyplot as plt
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
from nltk.draw.dispersion import dispersion_plot

# Requires NLTK's tokenizer and stopwords data packages (see nltk.download)

# NLTK's default English stopwords, minus 'should' so that it gets counted
default_stopwords = set(stopwords.words('english'))
should = {'should', 'shouldn'}
default_stopwords = default_stopwords - should

# Read and tokenize the GDPR text (raw string, so "\t" is not read as a tab)
with open(r"c:\temp\CELEX32016R0679ENTXT.txt", 'r') as fp:
    words = word_tokenize(fp.read())

# Remove single-character tokens (mostly punctuation)
words = [word for word in words if len(word) > 1]

# Remove numbers
words = [word for word in words if not word.isnumeric()]

# Lowercase all words (default_stopwords are lowercase too)
words = [word.lower() for word in words]

# Remove stopwords
words = [word for word in words if word not in default_stopwords]

# Calculate frequency distribution
fdist = FreqDist(words)

# Output top 20 words
for word, frequency in fdist.most_common(20):
    print('"{}" has {} occurrences'.format(word, frequency))

# Plot where each word occurs along the filtered word list
dispersion_plot(words, ["should", "shall", "may", "might", "must", "required"])
plt.show()  # newer NLTK versions return the plot without displaying it

The script produces the output below, showing the most common words and the number of occurrences of each:

“data” has 1249 occurrences
“processing” has 628 occurrences
“personal” has 611 occurrences
“shall” has 479 occurrences
“article” has 463 occurrences
“supervisory” has 446 occurrences
“controller” has 442 occurrences
“should” has 422 occurrences
“authority” has 383 occurrences
“subject” has 382 occurrences
“member” has 325 occurrences
“regulation” has 315 occurrences
“union” has 307 occurrences
“protection” has 285 occurrences
“state” has 238 occurrences
“referred” has 222 occurrences
“processor” has 218 occurrences
“may” has 217 occurrences
“public” has 203 occurrences
“purposes” has 203 occurrences

We see that should, may, and shall are in the top 20 words (note that stopwords have been removed), and are the only verbs (unless you count ‘referred’). Is there a pattern?

The dispersion plot produced by the above code shows a clear boundary.

We see that the shoulds are concentrated in the first part, but that the shalls take over somewhat before the half-way mark (in terms of word offset; note that these word-offsets don’t correspond to the actual text, given that stopwords and punctuation have been removed). Unlike either should or shall, however, may is scattered evenly over the entire document.

This switch-over point between should and shall falls at the beginning of Chapter I, where the recitals end and the numbered articles begin. This is where the GDPR stops enumerating goals, sentiments, and housekeeping matters (such as the relation of the legislation to other EU and member-state laws), and gets down to what you must do.
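The switch-over can also be estimated numerically rather than read off the plot. The sketch below is one possible approach, not part of my original script: it looks for the word offset that leaves the fewest “misplaced” words, meaning shalls before the split plus shoulds after it. The function name switch_over_index and the toy word list are my own inventions for illustration; in practice you would pass in the filtered words list from the script above.

```python
def switch_over_index(tokens, first="should", second="shall"):
    """Return (index, misplaced) for the split that best separates the words.

    'Misplaced' counts occurrences of `second` before the split plus
    occurrences of `first` at or after it. Ties go to the earliest split.
    """
    total_first = sum(1 for t in tokens if t == first)
    best_index = 0
    best_errors = total_first  # split at 0: every `first` is misplaced
    errors = total_first
    for i, token in enumerate(tokens):
        # Moving the split from i to i + 1 puts tokens[i] on the left side.
        if token == first:
            errors -= 1
        elif token == second:
            errors += 1
        if errors < best_errors:
            best_errors = errors
            best_index = i + 1
    return best_index, best_errors


# Toy word list standing in for the filtered GDPR words (hypothetical data).
sample = ["data", "should", "should", "may", "shall", "should",
          "shall", "may", "shall", "shall"]
index, misplaced = switch_over_index(sample)
print("split at offset", index, "with", misplaced, "misplaced word(s)")
```

A small number of misplaced words relative to the total would support the impression of a sharp boundary; a large number would suggest the two verbs are more interleaved than the plot makes them look.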

The role of should

How are we to make sense of all these shoulds? We know you should sort your trash and floss your teeth, but how does should fit into legislation? Do you have to do it, or is it just a suggestion?

The interpretation that makes the most sense to me is that the should-clauses are desirable but non-mandatory actions. They express the broad policy goals of the GDPR, goals whose best implementation has to depend on the particular industry and technology in question. By implementing so as to support as many of the shoulds as possible, we can justify our decisions and fortify our position in the event that we have to defend those decisions.

A strategy for the legislation as a whole

If you’ve read this far you may be wondering how any of this can be useful to you. While it is always useful to ponder key passages of the GDPR in isolation (as does most of the commentary I have seen thus far), I propose that organizations consider incorporating the GDPR into their models (functional, data, procedural) so that every data exposure points to its justification or legal purpose (or badge link) in one or more sections of the legislation. For example, a particular data element’s exposure may be justified by both data-subject consent and by legitimate business purpose. If challenged, you need to be able to find all justifications.

Just as every element of a functional or data model has to be traceable to a stated requirement, so must every data exposure be justified by at least one specific clause of the GDPR. In effect, the GDPR constitutes a new set of requirements, independent of (and sometimes in conflict with) our system’s original requirements. It has been dropped into our laps and we have to find a way to integrate it as effectively, cheaply, and non-disruptively as possible.
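To make the idea of traceability concrete, here is a minimal sketch of how a data inventory might record justifications. Everything in it is hypothetical: the DataExposure class, its field names, and the sample entries are illustrations rather than a recommended schema. The two article references correspond to the consent and legitimate-interests grounds mentioned above, but which clauses actually apply to a given exposure is a question for your own legal analysis.

```python
from dataclasses import dataclass, field


@dataclass
class DataExposure:
    """One place where a data element is exposed (hypothetical model)."""
    element: str
    exposed_to: str
    justifications: list = field(default_factory=list)  # GDPR references


def unjustified(exposures):
    """Return the exposures that cite no clause of the GDPR at all."""
    return [e for e in exposures if not e.justifications]


# Made-up inventory entries for illustration only.
inventory = [
    DataExposure("email_address", "newsletter service",
                 ["Article 6(1)(a)", "Article 6(1)(f)"]),
    DataExposure("date_of_birth", "analytics export"),
]

for exposure in unjustified(inventory):
    print("No justification recorded for", exposure.element)
```

Keeping the justifications as a list means a single exposure can point to several clauses at once, so that all of them can be found if the exposure is challenged.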

Equally important, we have to be prepared to show that we have complied with the GDPR’s shalls when the auditor comes knocking, when there is a complaint or request from a data subject, or when (heaven forbid) a data incident occurs. We must also perform the reverse operation, which is to consider every should, may, and must in the regulation and ask whether it applies to our system and, if so, how we can best incorporate it into our practice.
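A first pass at this reverse operation could be automated by pulling out every sentence that contains one of the modal verbs, giving a checklist to review by hand. The sketch below assumes a very naive sentence splitter (a period or semicolon followed by whitespace), which is only an approximation for legal text; the function name modal_checklist and the sample sentences are made up for illustration and are not quotations from the GDPR.

```python
import re

MODALS = ("shall", "should", "may", "must")


def modal_checklist(text, modals=MODALS):
    """Group sentences by the modal verbs they contain.

    Sentences are split naively on '.' or ';' followed by whitespace.
    A sentence containing several modals appears under each of them.
    """
    checklist = {modal: [] for modal in modals}
    for sentence in re.split(r"(?<=[.;])\s+", text.strip()):
        lowered = sentence.lower()
        for modal in modals:
            if re.search(r"\b" + modal + r"\b", lowered):
                checklist[modal].append(sentence)
    return checklist


# Made-up sentences for illustration, not quotations from the GDPR.
sample = ("The controller shall keep a record. "
          "Processing should be transparent. "
          "Member States may add further rules.")
for modal, sentences in modal_checklist(sample).items():
    print(modal, len(sentences))
```

Each sentence in the resulting checklist can then be marked as applicable or not applicable to the system at hand, with a note on how it is addressed.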

In addition to the shalls, our analysis will help us to show that our compliance measures further the goals set out by the shoulds, and that we have made good-faith efforts to do our part in the may department (for example, participating in industry groups to establish codes of conduct, as set out in Article 40).

In future posts I plan to expand the Data Inventory to include references to applicable articles of the GDPR and to incorporate the inventory into a defensive strategy.
