Searching documents with Oracle Text, part 1

As I have mentioned in passing, it is my firm belief that an essential part of GDPR compliance is documentation. If your documentation is sketchy, out-of-date, or vague (consisting, say, of emails and slideshows), how will you show that you have privacy by design, and that you enforce and verify this requirement at every stage in the development process?

Once you’ve created all these documents, how will you find what you’re looking for? Have you ever wasted time trying to find information among hundreds of files in a shared folder or Sharepoint? If so, you already know that these tools have limited ability to search inside of documents.

What I plan to do in this series of posts is to introduce Oracle Text (OT), an astonishingly powerful tool for searching not only text files, but also common business file formats such as MS Office and PDFs. Not only can it search files in your local filesystems, but also web pages and files at the end of URLs.

This search capability offers a variety of sophisticated abilities. In addition to more-common search modes (wildcards, logical connectives), OT offers capabilities that I have previously seen only on dedicated document-management systems. Some of the capabilities, which I illustrate in today’s post, include finding:

  • 2 (or more) words occurring near each other
  • stemmed searching (stemming in linguistics refers to recognizing that a single word may take different grammatical forms, such as plurals of nouns, conjugation of verbs, etc.)
  • fuzzy searching, in which the search terms need not match the target’s terms precisely
  • the number of times the search term was matched within the target
  • plus combinations of all the above

This list is not complete,  as I will be learning to use this tool as I go along, and will offer further capabilities as soon as I can put together small demonstrations to illustrate them.

An important thing to understand is that, although these examples use text that is stored in a database table, OT normally will index and search files in their filesystem or web location (the file pointer or web URL is one of the columns in the OT table). OT only accesses the files when it is creating or synchronizing its indexes..

What is searched in OT queries is not the documents themselves, but the indexes. This means that performance can be very good, even if the files being searched are very numerous or large (though initial index creation may take some time).

What follows is a minimal demo of some of the simpler capabilities of OT. Statements are separated by the string ‘========================================'

for easier reading. For the SQL queries, each query is followed by its output. Each query is preceded by a brief comment (prefixed by ‘rem’, for remark) to indicate what feature is demonstrated in that query. I have also bolded a few terms in the query and result to emphasize the key point of the query.

These statements were run on Oracle 11g Express, which is available for free download. If you decide to install Oracle 11g, a good IDE is SQL Developer, also free, from Oracle. (If you wish to install Oracle and try this demonstration, there are one or two configuration steps you’ll need to take to configure your user. These are well covered on the web.)


CREATE TABLE
TEXT_SEARCH
(
ID INTEGER,
ARTICLE VARCHAR2(20),
SEARCH_TEXT VARCHAR2(1000)
);
========================================
INSERT
INTO
TEXT_SEARCH
(
ID,
ARTICLE,
SEARCH_TEXT
)
VALUES
(
2,
'2',
'The cat and the dog regarded each other with suspicion'
);
========================================
INSERT
INTO
TEXT_SEARCH
(
ID,
ARTICLE,
SEARCH_TEXT
)
VALUES
(
1,
'1',
'The swift brown dog ran after the lazy fox'
);
========================================
CREATE INDEX
TEXT_SEARCH_CIDX
ON
TEXT_SEARCH
(
SEARCH_TEXT
)
INDEXTYPE IS CONTEXT;
========================================
 SQL> COLUMN SEARCH_TEXT FORMAT A60
SQL> SET ECHO ON
SQL> exec ctxsys.ctx_ddl.sync_index('text_search_cidx');

PL/SQL procedure successfully completed.
========================================
SQL> rem Try a simple 'is in' query
SQL> SELECT
id,
score(1) AS score,
SEARCH_TEXT
FROM
text_search
WHERE
CONTAINS(search_text,'dog', 1) > 0;

ID SCORE SEARCH_TEXT
---------- ---------- ------------------------------------------------------------
2 3 The cat and the dog regarded each other with suspicion
1 3 The swift brown dog ran after the lazy fox
========================================
SQL> rem Contains both dog and cat
SQL> SELECT
id,
SEARCH_TEXT
FROM
text_search
WHERE
CONTAINS(search_text,'dog and cat') > 0;

ID SEARCH_TEXT
---------- ------------------------------------------------------------
2 The cat and the dog regarded each other with suspicion
========================================
SQL> rem Contains dog, but not cat (~ is the NOT operator)
SQL> SELECT
id,
search_text
FROM
text_search
WHERE
CONTAINS(search_text,'dog~cat') > 0;

ID SEARCH_TEXT
---------- ------------------------------------------------------------
1 The swift brown dog ran after the lazy fox
========================================
SQL> rem Wildcard search; '%' is the wildcard character
SQL> SELECT
id,
search_text
FROM
text_search
WHERE
CONTAINS(search_text,'suspici%') > 0;

ID SEARCH_TEXT
---------- ------------------------------------------------------------
2 The cat and the dog regarded each other with suspicion
========================================
SQL> rem Has dog or cat (or both), plus a wildcard match
SQL> SELECT
id,
SEARCH_TEXT
FROM
text_search
WHERE
CONTAINS(search_text,'(dog or cat) and suspici%') > 0;

ID SEARCH_TEXT
---------- ------------------------------------------------------------
2 The cat and the dog regarded each other with suspicion
========================================
SQL> rem We can look for cat occurring 'near' suspicion
SQL> rem The definition of NEAR has a wide default
SQL> SELECT
id,
SEARCH_TEXT
FROM
text_search
WHERE
CONTAINS(search_text,'NEAR((cat, suspicion))') > 0;

ID SEARCH_TEXT
---------- ------------------------------------------------------------
2 The cat and the dog regarded each other with suspicion
========================================
SQL> rem We can also look for wildcards using NEAR
SQL> SELECT
id,
SEARCH_TEXT
FROM
text_search
WHERE
CONTAINS(search_text,'NEAR((cat, suspici%))') > 0;
ID SEARCH_TEXT
---------- ------------------------------------------------------------
2 The cat and the dog regarded each other with suspicion
========================================
SQL> rem If we specify a limit of 3 words for the NEAR match,
SQL> rem then no match
SQL> SELECT
id
FROM
text_search
WHERE
CONTAINS(search_text,'NEAR((cat, suspici%),3)') > 0;

no rows selected
 
========================================
SQL> rem If we want two terms next to each other (any order will match)
SQL> rem Then define NEAR to be 1
SQL> SELECT
id, search_text
FROM
text_search
WHERE
CONTAINS(search_text,'NEAR((fox, lazy),1)') > 0;

ID SEARCH_TEXT
---------- ------------------------------------------------------------
1 The swift brown dog ran after the lazy fox
========================================
SQL> rem Stemmed search on 'run' ($run) looks for different forms of a word
SQL> rem Here 'ran' is a form of 'run', so it is found
SQL> SELECT search_text
FROM text_search
WHERE CONTAINS (search_text, 'FUZZY ($run, 1, 2)', 1) > 0;

SEARCH_TEXT
------------------------------------------------------------
The swift brown dog ran after the lazy fox
========================================
SQL> rem Stemmed search on 'fexes'; the singular 'fox' also matches
SQL> rem Here 'ran' is a form of 'run', so it is found
SQL> SELECT search_text
FROM text_search
WHERE CONTAINS (search_text, 'FUZZY ($foxes, 1, 2)', 1) > 0;

SEARCH_TEXT
------------------------------------------------------------
The swift brown dog ran after the lazy fox

Now, you may be thinking something like “this is way to complicated for me to use”. and it’s true that the maximum power of OT can be gained only with ad-hoc queries by someone adept at both SQL and text-searching techniques. What these examples are intended to do is to show you what power is there; whether you choose to make some functionality available with a GUI, or to use an expert when you have an exceptional need to search documents, it’s good to know what is possible.

I’ll close this post by mentioning that this is far from everything that OT has to offer. I plan at least one more post on this topic, as soon as I can get a demo working.

Leave a Reply