Data Warehousing Best Practices: Comparing Oracle to MySQL pt 1

Posted in: Technical Track

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

These are my notes from the session, which include comparisons of how Oracle works (which Maria gave) and how MySQL works (which I researched to figure out the differences; that research is why this blog post took a month after the conference to write). Note that I am not an expert on data warehousing in either Oracle or MySQL, so these are more concepts to think about than hard-and-fast advice. In some places I still have questions, and I am happy to have folks comment and contribute what they know.

One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it) from Forrester saying, “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”

I thought that was an interesting way of pointing that out — most people do not understand why 3NF is not good enough for data warehousing, and I have had a hard time explaining why a star or snowflake schema should be used. Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: compared to a star schema, in a snowflake schema, you have more than one fact table and maybe some dimensions that are not used often.

From Maria and the slides:
“Oracle says model what will suit your business best. Don’t get lost in academia. Most schemas are not 100% according to the theoretical models. Some examples: 3NF schema with denormalized attributes to avoid costly joins, star schema with multiple hierarchies in the same fact table.”

Data warehousing has a 3-step approach —

1) data sources -> staging layer (temp loading layer)
2) staging layer (temp loading layer)-> foundation (logical, data store) layer
3) foundation (logical, data store) layer -> access and performance layer

The foundation layer is usually 3NF; the access layer is usually a star or snowflake schema. As for the data sources, they can be varied. You would hope that they are in 3NF (and if they are, you can skip the first two steps), but they are not always that way.
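As a sketch of steps 2 and 3, a 3NF foundation layer might feed a denormalized star-schema fact table in the access layer. All table and column names here are invented for illustration:

```sql
-- Foundation layer (3NF): hypothetical normalized tables
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    region      VARCHAR(30)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,            -- references customers(customer_id)
    order_date  DATE,
    amount      DECIMAL(10,2)
);

-- Access layer (star): one denormalized fact table built from the foundation
CREATE TABLE fact_sales (
    order_date  DATE,
    region      VARCHAR(30),
    amount      DECIMAL(10,2)
);

INSERT INTO fact_sales (order_date, region, amount)
SELECT o.order_date, c.region, o.amount
  FROM orders o
  JOIN customers c ON c.customer_id = o.customer_id;
```

In practice the access layer would also carry surrogate keys into proper dimension tables; this only shows the direction of the data flow.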

The 3 P’s of best practice for data warehousing (on Oracle) are power, partitioning, and parallelism. The goal of the data warehousing environment is to minimize the amount of data accessed and to use the most efficient joins, so it is less index-focused than OLTP tuning. This may be based on Oracle’s way of doing joins; I am not sure whether it applies to MySQL as well.

Power: The weakest link in the chain (the 3 steps above) will define the throughput, so make sure your hardware configuration is balanced. Maria mentioned that as DBAs, “most of the time we don’t have control over this, but we’re still bound to the SLAs.”

This includes hardware that immediately comes to mind, such as the number of CPUs/cores, speed of the CPUs, amount of RAM, and speed of disk, as well as what we may not think of immediately: speed of network switches, speed of disk controllers, and the number and speed of host bus adapters (HBAs). Notes on HBAs: know the number of HBA ports you have. A 4 Gbit/sec HBA does about 400 MB/sec; a 2 Gbit/sec HBA does about 200 MB/sec. Make sure there is enough HBA capacity to sustain the CPU throughput (i.e., make sure the HBA is not the bottleneck). Also consider the speed at which all the components talk to each other: if you have a 4 Gbit machine but a 2 Gbit switch, you end up with 2 Gbit throughput. Upgrade the network at the same time you upgrade machines.

Because we are talking about data warehousing, it is often not possible to eliminate disk I/O, so the goal is to have the fastest I/O throughput possible. Data warehouses need to be sized on I/O throughput, not on the number of I/O operations (IOPS).

I made a post earlier about how to determine I/O throughput for a system, which used information from this session. Justin Swanhart already pointed out that this is based on the fact that Oracle can do hash joins while MySQL can only do nested-loop joins. I wonder, though, whether there is really no case in MySQL where I/O throughput is a more useful metric than IOPS.

Expensive disk arrays are usually sized for IOPS, not throughput, and because they are expensive, the disk array is shared throughout the company. A DBA needs to ask: how many connections into the storage array do I have? How many disk controllers do I have? Where are my physical disks, and which controllers are they hanging off of?

A typical 15k RPM disk can do about 25-35 MB/sec (per disk) of random I/O. Disk manufacturers will throw out numbers like 200-300 MB/sec, but that is sequential I/O on the leading edge of the drive. For example, at 30 MB/sec of random I/O per disk, sustaining 1 GB/sec of throughput takes on the order of 33 disks. Make sure all your LUNs are not coming off the same set of disks, so that you are not conflicting on disk seeks.

Continue to part 2, partitioning.



4 Comments

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 2 (partitioning) | The Pythian Blog
July 29, 2010 4:00 pm

[…] Data Warehousing Best Practices: Comparing Oracle to MySQL, part 1 (introduction and power) […]

Roland Bouman
July 29, 2010 7:55 pm

Hi! I noticed two things:

“A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”

This sounds like bollocks to me :)

While it is true that many people draw star schemas from a (more or less) 3NF normalized data warehouse as departmental data marts, aggregation is not an intrinsic property of star schemas at all. Aggregation may be added for performance reasons, but ideally the fact table has one row for the lowest possible level of interest – aggregation should be done as much as possible by the reporting and/or OLAP tool, so, at query time.

“Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: compared to a star schema, in a snowflake schema, you have more than one fact table and maybe some dimensions that are not used often.”

Star and snowflake schemas are both examples of a dimensional model. In a dimensional model, you have one fact table which contains columns for the metrics (aka measures), and columns that reference the dimension tables which provide the context for the metrics. While querying these dimensional models, the fact table is joined to one or more dimension tables, and the metric columns are typically aggregated, using several columns from the dimension tables for grouping. One datamart can contain multiple fact tables, and typically the fact tables in one datamart share (a number of) dimensions.

The difference between star and snowflakes is that in a star schema, each dimension is implemented in a single dimension table. The star schema dimension table is denormalized and contains all data for all levels of all hierarchies along which the dimension can be rolled up to get aggregated results from the fact table.

In a snowflake, you still have a single fact table, which still points to the dimension tables. The only difference is that in a snowflake, dimensions are implemented using multiple tables, typically a single dimension table is used at each level of each hierarchy of the dimension.

Example: an order line fact table will have a link to an order date dimension (among many others). The order date dimension should allow users to view the data in the fact table at different levels of time: year, quarter, month, day are examples of such levels; another example would be year, week, weekday.

Note that Y-Q-M-D and Y-W-D are two distinct and non-overlapping hierarchies of a single date dimension – two ways to partition time.

Now in a star schema, the date dimension would have columns for all levels of all hierarchies (Y, Q, M, day-in-month, W, day-in-week). This is necessarily a denormalized table because of the functional dependencies between the levels of the hierarchies.

In a snowflake, we would still have the same fact table; the only difference would be that each level of each hierarchy would get its own table. So, at the lowest level of the dimension you’d have one day table. The day table would point to a day-in-week table for that level of the Y-W-D hierarchy, and also to a month table for the Y-Q-M-D hierarchy.

So basically, both are dimensional models but in a snowflake the dimension tables are normalized (at least each level gets its own table).
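To sketch Roland’s date-dimension example in DDL (all table and column names here are invented for illustration): a star schema keeps every level of every hierarchy in one denormalized table, while a snowflake gives each level its own table:

```sql
-- Star: one denormalized date dimension, all hierarchy levels in one table
CREATE TABLE dim_date (
    date_id     INT PRIMARY KEY,
    day         DATE,
    day_in_week TINYINT,
    week        TINYINT,
    month       TINYINT,
    quarter     TINYINT,
    year        SMALLINT
);

-- Snowflake: one table per level; the day table links into both hierarchies
CREATE TABLE dim_day (
    date_id  INT PRIMARY KEY,
    day      DATE,
    month_id INT,   -- up the Y-Q-M-D hierarchy, references dim_month
    week_id  INT    -- up the Y-W-D hierarchy, references dim_week
);

CREATE TABLE dim_month (
    month_id   INT PRIMARY KEY,
    month      TINYINT,
    quarter_id INT  -- references a dim_quarter table, and so on up to year
);

CREATE TABLE dim_week (
    week_id INT PRIMARY KEY,
    week    TINYINT,
    year_id INT     -- references a dim_year table
);
```

The fact table itself is identical in both cases; only the dimension side changes.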

Justin Swanhart
July 30, 2010 12:31 pm

The three Ps aren’t applicable to MySQL:

Power) All queries are single threaded, so the core speed can be more important than the number of cores

Partitioning) MySQL partitioning only does partition elimination, and it doesn’t do it very well. MySQL 5.5 is supposed to make some improvements here.

Parallelism) In a single query, there is none.

I never suggest using a SAN for MySQL. SANs are great for shared-everything databases (RAC) but not for shared-nothing ones like MySQL. RAC has OCFS, a clustered filesystem designed to work with technology like SANs.

A star/snowflake schema always has one fact table, since by definition the schema is about one factual subject (like sales). There can be dimensions that are shared between different fact tables (conformed dimensions), like the date/time dimensions, perhaps lists of customers, etc.

A snowflake schema contains one or more normalized dimension tables, that is, the hierarchical information isn’t stored redundantly in a single table.

An example of a dimension in a star schema might be:

create table dim_product (
  product_id int,
  category_name char(10),
  sub_category_name char(10),
  base_price decimal(6,3)
);

but in a snowflake:

create table dim_product (
  product_id int,
  base_price decimal(6,3)
);

create table dim_product_category (
  product_id int,
  category_name char(10),
  category_type enum('top','sub')
);
Sheeri Cabral
July 30, 2010 12:58 pm

Justin, Roland — thank you! You have done a very good job of showing how Oracle and MySQL are different. This is what community is about, and why I have those caveats.

