17.2 Advanced Function Options

By now we've introduced you to all but one of Oracle's regular expression functions. However, you've only seen them used in their simplest form. Their full syntax is:

REGEXP_INSTR(source_string, expression 

            [, position [, occurrence

            [, return_option 

            [, match_parameter]]]])



REGEXP_LIKE (source_string, expression 

            [, match_parameter])



REGEXP_REPLACE(source_string, expression

              [, replace_string

              [, position [, occurrence

              [, match_parameter]]]])



REGEXP_SUBSTR(source_string, expression

             [, position [, occurrence

             [, match_parameter]]])

If you're familiar with Oracle's string-manipulation functions, you'll have no trouble discerning the purpose of the REGEXP functions:

REGEXP_INSTR: Returns the character position at which text can be found matching a regular expression in a target string.
REGEXP_LIKE: Discerns whether a given string contains text matching an expression. This is a Boolean function, returning TRUE or FALSE, or NULL.
REGEXP_REPLACE: Performs a regular expression search-and-replace operation, which you'll learn about in Section 17.3.
REGEXP_SUBSTR: Extracts text matching a regular expression from a string.

The parameters to these functions are as follows:

source_string: The string to be searched.
expression: A regular expression describing the pattern of text that you seek.
replace_string: A string generating the replacement text to be used in a search-and-replace operation.
position: The character position within source_string at which to begin a search. This defaults to 1.
occurrence: The occurrence of the pattern you wish to locate. This defaults to 1, giving you the first possible match.
return_option: Valid only for REGEXP_INSTR, and determines whether the beginning or ending character position is returned for text matching a pattern. The default is 0, for the beginning. Use 1 to return the ending position.
match_parameter: A text string through which you may specify options to vary the behavior of the regular expression matching engine.

The match_parameter deserves a bit of explanation. It's a character string that you build using one or more of the following letters: i, c, n, m. One use is to specify whether matching is case-sensitive. By default, your NLS_SORT setting drives this aspect of regular expression behavior. You can override that default on a call-by-call basis, using i to get case-insensitive matching, and c to get case-sensitive matching. Following is a simple demonstration that works with our example data set. When the i option is used, the match succeeds, even though the case really does not match:

SELECT emp_id, text

FROM employee_comment

WHERE REGEXP_LIKE(text, 'CEDAR LAKE', 'c');



no rows selected



SELECT emp_id, text

FROM employee_comment

WHERE REGEXP_LIKE(text, 'CEDAR LAKE', 'i');



    EMP_ID TEXT

---------- ---------------------------------------------

      7499 1105 McConnell Court

           Cedar Lake MI 48812

           Home: (989) 387-4321

           Cell: (237) 438-3333

      7782 Academy Apartments, #138, Cedar Lake MI 48812

      7788 #1 Water Tower Lane

           Cedar Lake MI 48812

The n option is useful when working with data, such as ours, that contains embedded newline (usually chr(10)) characters. By default, the period (.) does not match newlines, which is why the following query brings back only the first line of comment text:

SELECT emp_id, REGEXP_SUBSTR(text, '.*') text

FROM employee_comment

WHERE emp_id = 7698;



    EMP_ID TEXT

---------- ---------------

      7698 Marion Blake

Using 'n' for the match parameter, you can have the period match newlines, which in the case of this query, results in the entire comment being returned:

SELECT emp_id, REGEXP_SUBSTR(text, '.*', 1, 1, 'n') text

FROM employee_comment

WHERE emp_id = 7698;



    EMP_ID TEXT

---------- -----------------------

      7698 Marion Blake

           N1375 Lakeshore Drive

           Vestaburg MI 48891

           (989) 388-1234

This example also illustrates the important point that specifying a match parameter forces you to also specify any preceding, optional parameters that you would otherwise skip.

The final match option is m, which changes the definition of line with respect to the ^ and $ metacharacters. By default, line means the entire target string, so an expression such as ^.*$, together with the n option, will bring back all characters in the target string:

SELECT emp_id, REGEXP_SUBSTR(text, '^.*$', 1, 1, 'n') text

FROM employee_comment

WHERE emp_id = 7788;



    EMP_ID TEXT

---------- --------------------

      7788 #1 Water Tower Lane

           Cedar Lake MI 48812

The n option is necessary here, because otherwise the period would not match the newline. If the period did not match the newline, the expression would fail to match at that point, and it would be impossible to match all characters between the beginning and end of the string.

However, using the m option causes the definition of line to change from the entire target string, to any line within that string, where lines are delimited by newline characters. The following example removes n, replacing it with m:

SELECT emp_id, REGEXP_SUBSTR(text, '^.*$', 1, 1, 'm') text

FROM employee_comment

WHERE emp_id = 7788;



    EMP_ID TEXT

---------- --------------------

      7788 #1 Water Tower Lane

You can specify multiple match parameters in any order. For example, 'in' means the same as 'ni'. If you specify conflicting options, such as 'ic', the last option, 'c' in this case, is the one that takes precedence.

Try our last example in this section using 'mn' as the match parameter string. You'll get back the entire target string. Why? Because of the greediness rule. When 'mn' is used, the first line is a match, but the entire string is a longer match.