As Published In

Oracle Magazine
September/October 2003
Technology SQL

First Expressions

By Jonathan Gennick

Oracle Database 10g brings regular expressions to SQL.

Regular expressions represent a powerful tool for describing and manipulating text data. Indeed, from humble roots in early UNIX utilities such as qed and ed, regular expressions have gone on to become almost ubiquitous, supported by a wide variety of programming and scripting languages, text editors, and now by Oracle Database 10g.

What Are Regular Expressions?

Regular expressions are all about patterns. They allow you to describe patterns in textual data. Once you can describe a pattern, you can search for it, and you can manipulate it.

Patterns are common when working with text. Indeed, text patterns have been an organizing concept for centuries. Without them our lives would be much harder. In the United States, the following pattern for addresses is common:

     name
     street number street name
     city state ZIP code

Once you've internalized this pattern, you can take in an address at a glance. Your mind becomes accustomed to the location of each data element, so the need for labeling the data elements goes away. You recognize a city name not because it's labeled "city name" but because of its position in the pattern.

Recognizing a pattern is the first step. Describing the pattern comes next. Then, when you can describe a pattern, you can manipulate it. Given a number of addresses fitting the pattern just shown, you could extract a list of cities and corresponding ZIP codes and use that list to construct the beginnings of a ZIP code directory.

It's up to you to recognize patterns. To describe the patterns that you recognize, you can use regular expressions. They provide a concise notation that you can use to describe virtually any pattern found in textual data. Oracle's new regular expression functions provide the manipulation capabilities, enabling you to search for patterns, extract text matching a pattern, and transform one pattern into another.

Patterns are fuzzy. Doubtless you can think of countless variations on, or exceptions to, the address pattern shown earlier. People who live in apartments have apartment numbers as part of their address. Canadians don't have ZIP codes; they have "postal codes," and their format differs significantly from that of ZIP codes. When working with regular expressions, it's important to know your data and to be aware of variations that may break the pattern you are working with. A big part of working with regular expressions is awareness of your data and choosing an expression that deals with the realities inherent in that data.

Supporting Expression

The newest Oracle Database release, Oracle Database 10g, implements regular expression support in the form of the following four new functions, which work equally well in SQL and PL/SQL:

  • REGEXP_LIKE
  • REGEXP_REPLACE
  • REGEXP_INSTR
  • REGEXP_SUBSTR

REGEXP_LIKE is a Boolean function, or predicate in SQL, enabling you to identify rows for which a column contains text matching a given pattern. It's designed primarily for use in the WHERE clause of a SELECT statement.

Suppose you have the following table containing information about Michigan's state parks:

CREATE TABLE michigan_park (
    park_name VARCHAR2 (40),
    park_phone VARCHAR2 (15),
    description VARCHAR2 (500)
);

Many parks have several phone numbers, and you've discovered, somewhat to your chagrin, that users have taken to recording many of these phone numbers in the description column. You'd like to get a list of parks for which that's the case. Knowing that local U.S. phone numbers typically fit the pattern xxx-xxxx, you decide to use REGEXP_LIKE to search for phone numbers in the park descriptions. Listing 1 shows the results.

The following is the call to REGEXP_LIKE from Listing 1:

WHERE REGEXP_LIKE(description, 
  '...-....');

The first parameter is the string you wish to search; the description column. The second parameter represents the pattern to search for. In regular expressions, a period (.) represents "any character," so this expression calls for any three characters, followed by a dash, followed by any four characters. The function returns a value of TRUE whenever the text to be searched contains text that matches the pattern given. In this case, the pattern match can occur anywhere in the column value.

The three-dash-four pattern and the results in Listing 1 provide a good example of the sort of fuzziness you must accept when working with regular expressions. For one thing, this pattern for phone numbers is not widely used outside the U.S. I can use the pattern here only because I know my data relates to parks in the state of Michigan, which is part of the U.S., and thus I expect those parks to have U.S. phone numbers. But wait. My pattern isn't entirely successful. Why is Muskallonge Lake State Park in the output? There's no phone number in the description for that park.

Muskallonge Lake State Park appears in Listing 1 because I chose a pattern that was a loose match for what I really wanted. I didn't specify that my pattern, except for the dash, needed to consist only of digits. Do you see the string "217-acre" in the description? There you have it: three characters, followed by a dash, followed by four characters. REGEXP_LIKE gave me what I asked for. The function doesn't know anything of the semantics, or meaning, of the data. It doesn't know that I'm looking for a phone number, nor does it know that "217-acre" isn't one.

Does the presence of a park description without a phone number in Listing 1 present a problem? It depends on what I plan to do with the results. If I'm just curious about phone numbers and there aren't too many erroneous matches, I can just skip over them when I read the report. But if I plan to manipulate those phone numbers in any way, then perhaps I should tighten up my pattern a bit. To begin, I can require three and four digits, rather than characters, as follows:

WHERE REGEXP_LIKE(description, 
  '[0-9]{3}-[0-9]{4,4}');

I can stop here, but I shouldn't. Looking at the data in Listing 1, you can see that some users evidently prefer to separate the digit groups in a phone number by using periods. To maximize the effectiveness of my query, I should allow for either a dash (-) or a period (.) between digit groups, as follows:

WHERE REGEXP_LIKE(description, 
  '[0-9]{3}[-.][0-9]{4,4}');

Let's look at this expression one piece at a time:

[0-9] Matches any digit.

{3} Repeats the previous element three times.

[-.] Matches either a dash or a period.

[0-9] Matches any digit.

{4,4} The previous element must occur a minimum of four times and a maximum of four times.

This seems like a reasonably tight expression, in that I wouldn't expect very many false positives (or false negatives). Listing 2 shows the results. However, using regular expressions to work with text is always an inexact science. If a description said to "see pages 997-1012 of the guidebook," that description would match the expression I've just developed. Regular expressions can only match patterns; they can't discern meaning.

What About Indexes?

Standard indexes are not used for regular expression evaluation. However, if you have a regular expression that you use frequently, you can create a functional index to support it. Otherwise, if the only predicates you supply in the WHERE clause of a query involve regular expressions, the result will be a full table scan. Keep this in mind when querying large tables. It may be that a regular expression is the only viable solution to a given query, but it's always good to make a conscious decision before unleashing a full table scan on a large table.

Don't let the fact that regular expression evaluation may not use indexes stop you from using regular expressions in your queries. Regular expressions are much too valuable to leave out of your toolbox. If a regular expression is what you need to solve a problem at hand, use it. Just be aware of the potential overhead. Also consider that regular expressions can be very useful in contexts for which indexes aren't relevant. For example, having seen the diversity of phone numbers in the description column, I might choose to constrain the park_phone column to a single format, the (xxx) xxx-xxxx area code and phone number pattern used in the U.S., as shown in Listing 3.

The caret (^) and dollar sign ($) characters at either end of Listing 3's regular expression are anchors that serve to prevent leading and trailing spaces from being stored in the park_phone column. The caret anchors the expression to the beginning of the column value, while the dollar sign anchors the expression to the end. That anchoring leaves no room for leading or trailing spaces.

Using Backreferences
Figure 1: Using Backreferences

Manipulating Text

While useful for searching text, regular expressions really shine when you need to do more than just search. Consider the phone numbers shown in Listing 2, which are presented in a wide variety of formats. It would be a difficult task to regularize the formatting using conventional methods; I'd need to write a small program to parse each description. However, it's a much simpler task using the new REGEXP_REPLACE function. The UPDATE statement in Listing 4 transforms most phone numbers into the (xxx) xxx-xxxx format.

REGEXP_REPLACE searches the description column for text matching the pattern described in the second parameter. Whenever a match is found, the text is replaced by the expression in the third parameter. The key to understanding this replacement is to understand the concept of a backreference. A backreference is a numbered reference to the value matched by a parenthetical element, called a subexpression, in a regular expression. For example, the \1 you see in Listing 4 is a backreference to the area code. Figure 1 illustrates how backreferences relate to values matched by the different subexspressions.

Again, there's ambiguity in this solution. The 800 number shown for Mackinac Island in Listing 2 won't match the pattern given in Listing 4, and thus the number will remain in the description as 800-44-PARKS. In this case that may be good, because a change to (800) 44P-ARKS would spoil the marketing impact of the number.
Next Steps

READ
about Oracle Database 10g
oracle.com/database

about regular expressions
Mastering Regular Expressions
by Jeffrey Friedl
(O'Reilly & Associates, oreilly.com)

Oracle Regular Expression Pocket Reference
by Jonathan Gennick and Peter Linsley
(O'Reilly & Associates, oreilly.com)

DOWNLOAD data for this article
Samples.zip

Sometimes you want not only to search for a pattern in a string but to extract and display the text matching that pattern. Say that I wanted to display the acreage for each park. I can begin by attempting to identify parks for which I have that information. To do that, I can use REGEXP_LIKE in the WHERE clause of my query. Next, I want to display the acreage, not the entire description. One solution to this problem is to use REGEXP_INSTR, which returns the location of a pattern in a string. Listing 5 shows REGEXP_INSTR used in conjunction with SUBSTR to return the acreage information for parks whose descriptions include that data.

The regular expression used in Listing 5 is divided by a vertical bar (|) into two parts. Think of that vertical bar as an or operator. Text can match either part and be considered a match. The first part of the expression matches text in the form "xxx acres", while the second part matches "xxx-acre". Rather than require the acreage value to be entirely digits, I've used the rather loose expression [^ ]+ to match any sequence of characters delimited by spaces. I did that to allow for values such as "60,000" and "40,000+".

The final parameter to REGEXP_INSTR in Listing 5 is an 'i', which specifies case-insensitive matching. The parameter before that is either a 0 or a 1, depending on whether I want the position of the first character of the substring or the position of the first character following the substring. The other two numeric parameters specify the character at which to begin searching (1) and the occurrence of the pattern that I want to find (I wanted the first). The information returned by the three REGEXP_INSTR calls is used to compute the proper starting position and length for a call to SUBSTR, which extracts the acreage information I'm after.

The solution in Listing 5 is a bit convoluted and involves repeating the regular expression four times. There's actually a much more convenient way to return text matching an expression, and that's to use the REGEXP_SUBSTR function. This function is similar to SUBSTR in that it returns a substring of a large string. The difference is that with REGEXP_SUBSTR, you identify that substring using a regular expression rather than by character position and length. Listing 6 shows the much simpler REGEXP_SUBSTR solution to the acreage problem.

Listing 6 also shows a more concise way to specify the pattern for acreage values. The expression uses [- ] to allow for either a space or a dash between the number and the word "acre" (or "acres"). The question mark (?) following the "s" in "acres" makes that "s" optional. There's almost always more than one way to specify a pattern. In this case, there's a bit of trade-off between clarity and length.

Sometimes you learn a lot about your data when attempting to manipulate it using a regular expression. It took me four tries to work up to the regular expression for acreage shown in Listing 6. After each try, I looked at my results, learned a bit more about my data, and refined my expression. This iterative process is a natural part of working with regular expressions. You need to become familiar with a representative sample of your data, and it's important to consider variations you might encounter in the pattern you are seeking.

Freedom of Expression

Regular expressions offer a great deal of power for searching and manipulating text. To take advantage of that power, you need to master the language of regular expressions. Don't be intimidated by its seemingly arcane syntax. You won't regret the investment of time learning to write regular expressions, and the syntax becomes easier to read the more you work with it. Before long, you'll wonder how you ever got along without them.

Jonathan Gennick (Jonathan@Gennick.com) is an experienced Oracle DBA and an Oracle Certified Professional. He is a coauthor of the new Oracle Regular Expression Pocket Reference (O'Reilly & Associates).

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy