As Published In

Oracle Magazine
January/February 2004
Technology SQL

Announcing the New MODEL

By Jonathan Gennick

The new MODEL clause brings spreadsheet calculations to the database.

You're writing a query that returns computed column values. As input to your calculations, you need values from multiple rows. What do you do? This is actually a fairly common problem. Solutions to it often involve inscrutable combinations of self-joins and subqueries.

Oracle Database 10g enables a new approach. Using the SELECT statement's new MODEL clause, you can treat relational data as a multidimensional array to which you can apply spreadsheet-like calculations. The result is a query that is easier to develop, easier to understand, and easier to modify.

The Scenario

Suppose you are a book publisher projecting sales into 2004 based on sales of books in print in both 2002 and 2003. You publish in both the United States and Canada, and you have the table of monthly sales data shown in the following:

CREATE TABLE book_sales (
    isbn          	VARCHAR2(13),
    country       	VARCHAR2(2),
    year          	NUMBER,
    month         	NUMBER,
    gross_revenue  	NUMBER,
    return_revenue 	NUMBER,
    net_revenue    	NUMBER,
    gross_units    	NUMBER,
    return_units   	NUMBER,
    net_units     	NUMBER);

You want to project 2004 sales using the following formula:

s4 = ((s3-s2)/s2)*s3+s3

This formula computes the percentage increase or decrease between a given month in 2002 and that same month in 2003, and then applies that percentage to the 2003 sales to get a projection for 2004 sales for that month. Your underlying assumption is that whatever trend was established from 2002 to 2003 will continue into 2004.

To add some real-world complexity to your forecasting task, not all books were in print as of January 2002, leaving you without a complete 24 months of sales history for some titles. For purposes of executing your forecasting formula, when a book has no sales history for a given month, you want to use the average sales of the book during all months of that year for which you do have sales history. Furthermore, you don't have complete information for 2003 yet; you don't have fourth-quarter 2003 sales data. To deal with that, you want to project the trend for each book from the second to third quarters of 2003 into the fourth quarter, spreading those projected sales evenly over October, November, and December. Lastly, your sales history from outside the U.S. is recorded in local currencies, yet you wish to forecast sales for all countries in terms of U.S. dollars.

Enter the MODEL Clause

The calculations I've just described are typical of what you might use when modeling sales in a spreadsheet. You're building a model to use in forecasting 2004 sales, and fortunately, you're running Oracle Database 10g. Having just read about the new MODEL clause, you decide to apply it to your forecasting problem.

To use the MODEL clause, you need to conceptually form your data into a multidimensional array. Each result set row becomes a cell in that array.

When you create a model, begin by thinking about how and whether to partition your model into separate arrays. Partitioning is optional, but it gives the database a point at which to parallelize the work, and it makes formulas much easier to write. You have sales data for two countries, and each country's forecast is independent of the others, so you should partition by country:

MODEL
   PARTITION BY (country c)

Next, you need to dimension your data. By this, I mean that you need to decide which values will combine to uniquely identify a row in your array. For book sales data, you should dimension by ISBN number, year, and month. For each country, those values uniquely identify a row in that country's array:

MODEL
   PARTITION BY (country c)
   DIMENSION BY (isbn i, year y, month m)

Each cell in a model holds one or more values, but the forecasting formula requires just the net revenue value, so list that one column as your measure:

MODEL
   PARTITION BY (country c)
   DIMENSION BY (isbn i, year y, month m)
   MEASURES (s.net_revenue r)

Finally, you can begin to write rules, which define the calculations you want to perform using your measures. As an example, let's focus on forecasting January 2004's net revenue for one book. You reference that cell in a formula as follows:


r['1-56592-948-9',2004,1]

The cell reference begins with the alias specified in the MEASURES clause. Next comes a list of dimension values enclosed within square brackets. This example shows positional referencing, for which you list one value for each dimension, in the order corresponding to the DIMENSION BY clause.

Positional referencing is concise, and it allows you to insert new cell values, which then translate into new rows in the query's result set. By default, the MODEL clause supports what Oracle calls upsert semantics: When you specify dimensions for a previously nonexistent cell and assign a value to that cell, the database combines the dimension, measure, and partition data to create a new row in the result set.

To assign a value to a cell, use notations such as the following:

r['1-56592-948-9',2004,1] = value
r['1-56592-948-9',2004,1] = expression

Listing 1 shows a complete MODEL clause, including a rule to forecast 2004 sales for one book. The rule implements the forecasting formula shown earlier. In the rule, you'll see cell references on both sides of the equal sign, as data from 2002 and 2003 is used to compute an estimate for 2004. When speaking of rules, it's common to use the terms "left side" and "right side" to refer to the parts of the rule to the left and the right of the assignment operator.

For the query in Listing 1, the rule will be evaluated once for each partition—in other words, once per country. Thus you'll get a row for January 2004 Canadian sales of the book in question and another row for U.S. sales in that same month.

When you execute a query such as that in Listing 1, the last clauses evaluated are MODEL, SELECT, and, finally, ORDER BY. You can specify column aliases in the MODEL clause, and I've done that because those aliases let me specify calculations more concisely. Those aliases become the column names that the SELECT clause sees, and you can further alias them in the SELECT clause.

Looping Through Measures

One of your requirements is to fill in missing monthly sales data, so that you have a dense array with data in each cell. You can do that by using FOR loops, such as the following three on the left side of the rule in Listing 2:

  • FOR i IN (SELECT isbn FROM book)— to iterate over each ISBN number
  • FOR y FROM 2002 TO 2003 INCREMENT 1—to iterate over the two years
  • FOR m FROM 1 TO 12 INCREMENT 1— to iterate over the twelve months

Key to making FOR loops useful is the CV() function, which enables access to the current value of a dimension. The CV() functions on the right capture the values through which the FOR loops iterate. Thus, the cell reference

[CV(), CV(), CV()]

becomes

[' 0-596-00441-9', 2002, 1]
[' 0-596-00441-9', 2002, 2]
[' 0-596-00441-9', 2002, 3]
...

Upsert semantics ensure that new result set rows get created as necessary. Execute the query in Listing 2, and you'll see monthly sales rows for each book covering the entire 24-month period. You'll see those rows even though they don't all exist in the database. You'll see them for each country, because the rule is applied automatically and independently to each partition in the model.

The new IS PRESENT predicate in Listing 2's CASE statement tests whether net revenue data exists for a month. IS PRESENT evaluates to true if a cell exists in the initial set of rows evaluated by the MODEL clause. If a value for r exists, that value is copied to nr. Otherwise, nr is set to the average sales for the book and year in question. I could have set r itself to the result of the CASE statement, but creating new values for r potentially changes the results of aggregate functions such as AVG(r). By using nr, I preserve the original r, or net revenue, values for use in further calculations.

The AVG function in Listing 2 introduces another type of cell addressing. The basic format to use in executing an aggregate function over a range of measures is:

AVG(expression)[dimension, 
dimension...]

Listing 2 uses CV() for the first two dimensions to reference the current book and year values. Then it uses the predicate m BETWEEN 1 AND 12 to consider all values for the current year and book when computing the average.

The AVG function specifies an expression within parentheses. That expression, in this case simply a column reference, is evaluated for each cell in the specified range. The AVG function then returns the average of the values from those evaluations.

Rules, Meet Partition Outer Join

When using the MODEL clause, you need to be cognizant of the number of rules that you write. MODEL queries are limited to a maximum of 10,000 rules. The rule count is independent of the number of partition values returned by a query, however, so if a model defines three rules, it has three rules whether the query returns two or 20,000 partition values.

Note that a FOR loop is like a macro and is expanded into many single-cell rules when a query is compiled. The FOR loop in Listing 2, for example, generates 24 rules per book, one rule for each month over a two-year period.

If you create a model that might approach the rule limit, there are different approaches you can take to reduce your rule count. One approach is to remove one or more FOR loops from the rules and add the FOR loop's dimension column into the PARTITION BY clause.

Another approach is to take the densification task outside the MODEL clause using Oracle Database 10g's new partition outerjoin syntax to create the needed rows before you generate the MODEL array.

In my example, I used the partition outerjoin approach. Listing 3 uses the following outer join to eliminate gaps in monthly sales data:

PARTITION BY (s.country, s.isbn)
RIGHT OUTER JOIN month mo 
ON (s.month = mo.m AND s.year=mo.y)

In Listing 3, the month table contains 36 rows, one for each month of the years 2002, 2003, and 2004. The PARTITION BY clause on line three specifies that the join be performed separately for each combination of country and ISBN number, making the join easier to write and understand. The right outerjoin ensures a result for each month. There's no longer a need to create new rows from the MODEL clause for years 2002 through 2003, so the cell reference on the left side of the rule can be changed from using FOR loops to using the ANY predicate:

nr[ANY, ANY, ANY] = ...

These three ANY predicates cause the rule to execute for each combination of book, year, and month. The results from Listing 3 are the same as from Listing 2, but without the explosion of rules caused by the FOR loops.

Projecting the 2004 Sales

You're almost done. But first, there's that requirement to project quarter four of 2003 based on the trend from quarters two and three. Listing 4 shows a rule to accomplish that projection. The new rule may look a bit complex, but it's merely an expansion of the s4 = ((s3-s2)/s2)*s3+s3 formula given earlier, and it uses the BETWEEN predicate to isolate cells for a given quarter. For example, the following expression returns total sales for the current book in second quarter 2003:

SUM(nr)[CV(), 2003, m BETWEEN 4 and 6])

Listing 4 also shows the rule to project 2004 sales, which is an expansion of that same formula given earlier.
Next Steps

DOWNLOAD sample data and code
otn.oracle.com/oramag/oracle/04-jan/modeld.html
otn.oracle.com/oramag/oracle/04-jan/modelc.html

READ MODEL queries white paper
otn.oracle.com/oramag/oracle/04-jan/model.html

Currency Conversion

Your last requirement to fulfill is currency conversion. For this, you can add a reference table containing currency conversion multipliers to your model. In Listing 5, look at the REFERENCE clause following the MODEL keyword.

A reference table is simply another multidimensional array. I named the array money and dimensioned it by country code. You can now get the conversion factor for any given country via a reference to money.to_us[c]. To make the country code available for each cell in the model, Listing 5 adds it to the list of measures. It's OK for a partition or dimension column to also be a measure. Listing 5 uses the following nested cell reference to get the proper conversion factor for each cell:

money.to_us[cc[CV(),CV(),CV()]]

The innermost cell reference, the one for the alias cc, returns the current country code, which is then used as an index into the money array to retrieve the conversion factor by which net revenue can be converted into U.S. dollars.

Getting into the New MODEL

The MODEL clause provides a different way to work with relational data. Now you can perform spreadsheet-like calculations within the database itself. Using rules, you can clearly express intent in a way that isn't possible otherwise. You can also more easily respond to changes in business requirements. By combining MODEL with other features such as partition outerjoins, you'll be able to develop creative solutions to many query problems.

Jonathan Gennick (Jonathan@Gennick.com) is an experienced Oracle DBA and an Oracle Certified Professional. He runs the Oracle-Article e-mail list, which you can learn about at http://gennick.com. Gennick recently coauthored the Oracle Regular Expression Pocket Reference (O'Reilly & Associates, 2003).




Please rate this document:

Excellent Good Average Below Average Poor

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy