SQL for Pattern Matching in Oracle 12c
Oracle 12c is out!
And with it, a myriad of new features that we’ll be learning and playing with in the months and years to come. Paraphrasing Iggy Fernandez’s blog: “So many Oracle manuals, so little time…” The new features abound, and we need to cherry pick some interesting ones to delve into.
There are the basic ones, like installing the new database software, which Yury Velikanov promptly explained just one day after the 12c release. And there are others, buried in the manuals, that may go unnoticed by the majority for some time. They may not be essential for a DBA or developer, but sometimes they can save us a great deal of work and time.
One of these “buried” features of Oracle 12c that caught my interest was SQL for Pattern Matching. It’s a extension to the syntax of the
SELECT statement, using the
MATCH_RECOGNIZE clause, that allows us to identify patterns across sequences of rows.
It’s just another way of doing the same things
In SQL, there are always many different ways to solve a single problem.
MATCH_RECOGNIZE just adds to the set of possible solutions. However, when the problem at hand is to detect patterns in sequences of rows, it may simplify the job immensely and save us a lot of time and lines of code. That doesn’t mean, though, that it wasn’t possible to do pattern matching before; it probably just required more work and convoluted queries.
So, let’s look at one example. I’ll look for patterns in the archived log history of a database rather than venturing through stock market tickers, or other fields’ data that I’m less familiar with. The idea, though, can be applied to a variety of scenarios.
Let’s say that we want to find out periods of high archived log generation for a database. We want to know in which periods of 24 hours the database generated more than 30gb, for example. An easy way to do this is to use the query below, which all DBAs have certainly written before:
select trunc(completion_time) day, sum(blocks*block_size/1024/1024) mbytes from v_archived_log group by trunc(completion_time) having sum(blocks*block_size/1024/1024) > 30000;
DAY MBYTES --------- -------- 03-JUL-13 31,666 04-JUL-13 30,078
The query above shows all the days when the database generated more than 30gb of archived logs. There’s a problem, though. What if, for example, in a particular occasion the database generated 40gb of archived logs between noon the previous day and noon the current day, being quiet the rest of the time?
This query could miss this fact since there would be only 20gb of archived logs each day.
To write a query that considers any period of 24 hours, rather than calendar days, is more complicated. Using MATCH_RECOGNIZE we can solve that with query below:
SELECT * FROM v_archived_log MATCH_RECOGNIZE ( ORDER BY completion_time MEASURES to_char(FIRST (A.completion_time),'yyyy-mm-dd hh24:mi:ss') AS start_time, to_char(LAST (A.completion_time),'yyyy-mm-dd hh24:mi:ss') AS end_time, sum(A.blocks*A.block_size/1024/1024) as mbytes ONE ROW PER MATCH AFTER MATCH SKIP PAST LAST ROW PATTERN (Y+ Z) SUBSET A = (Y, Z) DEFINE Y AS (Y.completion_time - FIRST(Y.completion_time)) <= 1, Z AS (Z.completion_time - FIRST(Y.completion_time)) <= 1 and sum(A.blocks*A.block_size)/1024/1024 >= 30000 );
With the query above, we now can see that there were more periods of high activity than we thought before:
START_TIME END_TIME MBYTES ------------------- ------------------- -------- 2013-06-19 05:00:06 2013-06-20 05:00:01 30,424 2013-06-20 05:12:25 2013-06-21 05:08:07 30,163 2013-06-25 05:04:12 2013-06-26 05:04:04 30,529 2013-07-02 18:51:42 2013-07-03 18:45:55 30,221 2013-07-03 18:52:42 2013-07-04 18:43:14 31,352
No way! It’s too complicated
It does looks scary at first sight! But if we break it down into smaller pieces, it gets easier to understand.
MATCH_RECOGNIZE allows us to search for a pattern in a sequence of rows. It’s important that the sequence of rows be ordered, otherwise the results wouldn’t be deterministic. The order of the rows is specified by the
ORDER BY clause within
In this example, the pattern that we want to find is “a sequence of rows spanning a period of not more than 24 hours for which the sum of the archived log sizes is greater than 30gb“. It is specified as a regular-expression in the
PATTERN clause as: “
Y+ Z“, which means “one or more rows that match the
Y condition, followed by one row that matches the
Z condition”, where
Z are the conditions specified in the
Y specifies that the timestamps of any rows labelled as
Y must not be more than 24 hours apart. There’s no restriction on the maximum number of
Y rows, just that they all be within a 24-hour period.
The only row that matches the
Z condition must also be within 24 hours from the first occurrence of
Y and must be such that the sum of the archived log volumes of all
Z‘s are greater than 30gb. The
SUBSET clause defines the set
A as the union of all
Z‘s, to simplify the calculation of the total volume.
For each match found in the sequence of rows,
MATCH_RECOGNIZE can return either all rows within the match, i.e. all rows that matched the
Z conditions, or a single row summarizing it. In this case, since we just want the sum of all the archived log volumes, I specified that I wanted only
ONE ROW PER MATCH. I also told
MATCH_RECOGNIZE to resume the search after each match by starting from the first row after the match (
AFTER MATCH SKIP clause, with the
PAST LAST ROW option). This clause also permits resuming from the next row or from a specific occurrence of a pattern variable (conditions defined above).
Finally, in the
MEASURES clause we define the value that will be calculated for each match found. Here we use the
LAST navigation functions to retrieve the start and end dates for each match and calculate the sum of the archived logs volume.
The uses for
MATCH_RECOGNIZE are many, from finding patterns in audit trails and logs to analyzing stock market variations. This is another nice tool for the utility belt of DBAs and developers, and the fact that it uses the power of regular-expressions for specifying the patterns is a great touch and gives it a great potential.
Do you see potential interesting uses of this new feature? I’d love to hear them!
If you’re interested, I’d suggest reading the manual to learn all the available options. Learning and mastering regular-expressions is also paramount and it’s a skill that has many other applications: