NoCoug SQL Challenge Entry

May 15, 2009 / By André Araújo

Tags: , , ,

I love puzzles. So when I heard about the NoCoug SQL Challenge I felt tempted to give it a go.

The Northern California Oracle Users Group (NoCoug) has challenged us to find a good way to calculate the probability of getting different sums for x throws of a n-sided die using only SQL. The probabilities for the faces of a single die are stored in a table and that’s all you need to start playing with the problem. The SQL Challenge rules can be found on the NoCoug website, along with some other relevant information.

After working out my very first solution, I read the rules and found it wasn’t fit for the challenge, as it used non-SQL extensions (SQL*Plus). So I started again, this time using pure SQL. I came up with a few options but wasn’t happy with them from a performance perspective. They needed more sweating.

I find that walking is very good for thinking. Whenever I can, and when weather permits, I walk home from work at the end of the day. The distance between work and home is about 6 km, which takes me around one hour to cover. After you’ve done it a few times the walking becomes automatic and  you don’t have to think about it anymore; obstacles, kerbs, corners, street-crossings are all handled in auto-pilot mode. Then, as you don’t have anything else to do, you think.

During the days I was working on the challenge solution, the SQL query used to occupy my thoughts along most of my (almost) daily walk. I decided to use only standard SQL features and wanted to be able to run the solution on both Oracle and SQL Server. While walking, I started thinking on ways to improve my initial solution’s performance without having to resort to non-standard tricks.

Joins of joins

The most straightforward way to calculate the probabilities for the possible sums in x throws of the die (whose face’s probabilities are stored in the die table) is to write a x-way self-cross-join of the table. For example, the probabilities for the sums in 3 throws of the die is given by:

select
  d1.face_value + d1.face_value + d2.face_value total_value,
  sum(d1.probability * d2.probability * d3.probability) probability
from
  die d1
  cross join die d2
  cross join die d3
group by d1.face_value + d1.face_value + d2.face_value

The problem is that the complexity of the query escalates when a high number of throws is involved; imagine writing and executing a 1024 join of that table. Also, the number of rows generated by the multiple cross-joins increases exponentially every time a new join is added to the query. It’s just not feasible to run the query above for a large number of throws. For a 20-sided die, for example, the query above yields 58 rows after the GROUP BY aggregation (totals ranging from 3 to 60). Before the aggregation can be performed, though, all the joins must be calculated, which generates 203 = 8000 combinations.

To improve performance, I needed a way to reduce the number of joins required to calculate the probabilities and the cost of those joins.

Reusing calculations

Let’s suppose I pre-calculate the probabilities for n throws of the die. To calculate the chances for n+1 throws, I don’t need to start from scratch again—I can start from the probabilities for n throws and perform a Cartesian product (or a “cross-join”, in other words) of that, with the probabilities for one die. Similarly, to calculate the probability for 2n throws, I can use the probabilities for n throws and do a “self-cross-join.”

Using this principle, I thought I could pre-calculate probabilities for some chosen number of throws, and then use those results as building blocks to obtain the probabilities for any other numbers. The most obvious “sizes” for the building blocks would be powers of 2, since it’s easy to find a combination of them to generate any other number using binary arithmetics.

Thus, if the probabilities for 1, 2, 4, 8, … 2n throws of a die are pre-calculated, we can easily combine them to calculate probabilities for any number of throws up to 2(n+1)-1.

Combining the building blocks

I wanted to use those building blocks together in a single parameterised SQL so I could generate probabilities for any given number of throws without having to rewrite the query every time. To do that, I used some binary arithmetic and logic to manipulate the joins between the building blocks.

Let’s say T1, T2 and T4 are the pre-calculated probabilities for the throw of 1, 2, and 4 dice, respectively. The probability for 7 throws of the die is given by:


            T1
cross join
            T2
cross join
            T4

Or, alternatively:

            (select face_value, probability from T1)
cross join
            (select face_value, probability from T2)
cross join
            (select face_value, probability from T4)

The query above, though, is not parameterised and only calculates probabilities for 7 throws of the die. Using a parameter (&&throws) we can manipulate which building blocks will be involved in the join:

            (select face_value, probability from T1
             where bitand(1, &&throws) != 0)
cross join
            (select face_value, probability from T2
             where bitand(2, &&throws) != 0)
cross join
            (select face_value, probability from T4
             where bitand(4, &&throws) != 0)

The query is now parameterised. We have a big problem, however: for &&throws = 2, for example, bitand(1, &&throws) = 0, which causes the first select to return no rows, which in consequence causes the whole cross-join to return no rows. And this is definitely not what we want.

To avoid this problem, whenever one of the building blocks is not to be used, instead of just suppressing it, we need to replace it with a neutral element—an element that could be used in the cross-join without altering the values. This element is the select shown below:

select 0 face_value, 1 probability from dual

The value 0 for face_value won’t change the sums for the dice faces, and the value 1 for probability won’t alter the probability products calculate in the cross join. Adding this element to the previous query we then have:

            (select face_value, probability     from T1
             where bitand(1, &&throws) != 0
             union all
             select 0 face_value, 1 probability from dual
             where bitand(1, &&throws) = 0)
cross join
            (select face_value, probability     from T2
             where bitand(2, &&throws) != 0
             union all
             select 0 face_value, 1 probability from dual
             where bitand(2, &&throws) = 0)
cross join
            (select face_value, probability     from T4
             where bitand(4, &&throws) != 0
             union all
             select 0 face_value, 1 probability from dual
             where bitand(4, &&throws) = 0)

With this query, setting the &&throws parameter to any value ranging between 1 and 7 will correctly choose which of the pre-calculated building blocks should be included in the calculation. With one query and one parameter we can then have probabilities calculated for 7 different numbers of throws.

Reducing cardinality

Another factor affecting the performance of the probabilities calculation is the cardinality of the subsequent joins, which escalate exponentially. With a  20-sided die, for example, after n throws you’ll have 20n rows in the result set. For 3 throws, that means 8000 rows in the resulting set.

For 3 throws of a 20-sided die, though, we can only have sums ranging from 3 (1+1+1) to 60 (20+20+20). That means that, in those 8000 rows we have only 58 distinct summed values. Probabilities for any given sum may appear several times because it may be obtained by summing different combinations of the die’s values. For example, using a 20-sided die, the sum 18 can be obtained in the following ways (among many, many others): 1+1+16, 1+2+15, 1+3+14, etc.

To help performance, I decided to perform an aggregation (GROUP BY) after every join to reduce the number of rows before the execution of the subsequent join. The aggregation operation itself has a cost but it pays off, preventing the cardinality of the joins from increasing exponentially.

Applying this idea to the query above we have:

            (select face_value, probability     from T1
             where bitand(1, &&throws) != 0
             union all
             select 0 face_value, 1 probability from dual
             where bitand(1, &&throws) = 0)
cross join
            (select T2.face_value + T4.face_value face_value,
                    sum(T2.probability * T4.probability) probability
             from (
                   (select face_value, probability     from T2
                    where bitand(2, &&throws) != 0
                    union all
                    select 0 face_value, 1 probability from dual
                    where bitand(2, &&throws) = 0)
                cross join
                   (select face_value, probability     from T4
                    where bitand(4, &&throws) != 0
                    union all
                    select 0 face_value, 1 probability from dual
                    where bitand(4, &&throws) = 0)
                  )
             group by T2.face_value + T4.face_value
            )

The query

The final solution is the combination of the techniques described above. The query was written with standard SQL and makes use of Subquery Factoring (a.k.a. Common Table Expressions, a.k.a. the WITH clause) to calculate the building blocks’ probabilities. The probabilities for a 2n-throws building block is calculated based on the probabilities previously calculated for 2(n-1).

Limitations

Each written query, like the one below, can calculate probabilities up to a certain number of throws (max # of throws = 2Q - 1, where Q is the number of T* subqueries used in the query). This number can be easily increased by adding new subqueries to the query. The maximum number of  throws increases exponentially for linear increases in the size of the query.

When testing on Oracle (10.2.0.4 and 11.1.0.7), I found that the maximum Q I could use for a successful execution was 9 (query below). For queries with more than 9 subqueries I got an ORA-600. I couldn’t  find an existing solution for the problem in Metalink.

Code

-- First International NoCoug SQL Challenge
-- Author: Andre Araujo (araujo@pythian.com)
-- Objective: Calculate the probability of getting certain sums for N throws of a dice.

undefine throws

with
T1 as (
  -- Results for the throw of 1 die
  select d1.face_value face_value, sum(d1.probability) probability
  from die d1
  group by d1.face_value
)
,
T2 as (
  -- Results for the throw of 2 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T1 d1
  cross join T1 d2
  group by d1.face_value + d2.face_value
)
,
T4 as (
  -- Results for the throw of 4 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T2 d1
  cross join T2 d2
  group by d1.face_value + d2.face_value
)
,
T8 as (
  -- Results for the throw of 8 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T4 d1
  cross join T4 d2
  group by d1.face_value + d2.face_value
)
,
T16 as (
  -- Results for the throw of 16 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T8 d1
  cross join T8 d2
  group by d1.face_value + d2.face_value
)
,
T32 as (
  -- Results for the throw of 32 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T16 d1
  cross join T16 d2
  group by d1.face_value + d2.face_value
)
,
T64 as (
  -- Results for the throw of 64 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T32 d1
  cross join T32 d2
  group by d1.face_value + d2.face_value
)
,
T128 as (
  -- Results for the throw of 128 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T64 d1
  cross join T64 d2
  group by d1.face_value + d2.face_value
)
,
T256 as (
  -- Results for the throw of 256 dice
  select d1.face_value + d2.face_value face_value, sum(d1.probability * d2.probability) probability
  from T128 d1
  cross join T128 d2
  group by d1.face_value + d2.face_value
)
-- The select below combines the results above to achieve the wanted results.
-- The bitand checks ensure only necessary selects are executed.
select T1.face_value + T2.face_value face_value, sum(T1.probability * T2.probability) probability
from
  (select face_value, probability from T1    where bitand(1, &&throws)    != 0 union all select 0, 1 from dual where bitand(1, &&throws)    = 0) T1
cross join
  (select T2.face_value + T4.face_value face_value, sum(T2.probability * T4.probability) probability
   from (select face_value, probability from T2    where bitand(2, &&throws)    != 0 union all select 0, 1 from dual where bitand(2, &&throws)    = 0) T2
cross join
  (select T4.face_value + T8.face_value face_value, sum(T4.probability * T8.probability) probability
   from (select face_value, probability from T4    where bitand(4, &&throws)    != 0 union all select 0, 1 from dual where bitand(4, &&throws)    = 0) T4
cross join
  (select T8.face_value + T16.face_value face_value, sum(T8.probability * T16.probability) probability
   from (select face_value, probability from T8    where bitand(8, &&throws)    != 0 union all select 0, 1 from dual where bitand(8, &&throws)    = 0) T8
cross join
  (select T16.face_value + T32.face_value face_value, sum(T16.probability * T32.probability) probability
   from (select face_value, probability from T16   where bitand(16, &&throws)   != 0 union all select 0, 1 from dual where bitand(16, &&throws)   = 0) T16
cross join
  (select T32.face_value + T64.face_value face_value, sum(T32.probability * T64.probability) probability
   from (select face_value, probability from T32   where bitand(32, &&throws)   != 0 union all select 0, 1 from dual where bitand(32, &&throws)   = 0) T32
cross join
  (select T64.face_value + T128.face_value face_value, sum(T64.probability * T128.probability) probability
   from (select face_value, probability from T64   where bitand(64, &&throws)   != 0 union all select 0, 1 from dual where bitand(64, &&throws)   = 0) T64
cross join
  (select T128.face_value + T256.face_value face_value, sum(T128.probability * T256.probability) probability
   from (select face_value, probability from T128  where bitand(128, &&throws)  != 0 union all select 0, 1 from dual where bitand(128, &&throws)  = 0) T128
cross join
        (select face_value, probability from T256  where bitand(256, &&throws)  != 0 union all select 0, 1 from dual where bitand(256, &&throws)  = 0) T256
--
-- Grouping after each join ensures rows with equal sums are combined, reducing the cardinality for the next join.
--
group by T128.face_value + T256.face_value) T128
group by T64.face_value  + T128.face_value) T64
group by T32.face_value  + T64.face_value)  T32
group by T16.face_value  + T32.face_value)  T16
group by T8.face_value   + T16.face_value)  T8
group by T4.face_value   + T8.face_value)   T4
group by T2.face_value   + T4.face_value)   T2
group by T1.face_value   + T2.face_value
order by 1;

Portability

The code above doesn’t use any special, non-standard SQL features, and can be easily ported to other databases with minor changes.

Running it against SQL Server, for example, all you need to do is to replace the BITAND function with the bitwise AND operator, and replace the use of the substitution variable with a TSQL variable.

Testing

I ran the tests on an Oracle 10.2.0.4 instance. The results looked quite good and the solution seemed to scale as expected. I compared it to the performance of n straightforward joins of the die table (using GROUP BYs to reduce cardinality as explained above) and the results were quite similar.

The proposed solution, though, has the advantage of reduced size of the query, and that the same query can be used to calculated the probabilities for a different number of throws.

The results below were run for a 20-sided die. I ran tests for all the number of throws between 1 and 511. The values shown are the elapsed time average over 10 runs of the query for each number of throws. The complete test took 30 hours to complete.

Test results for a 20-sided die

15 Responses to “NoCoug SQL Challenge Entry”

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>