Good SQL Querying

Jul 10, 2008 / By Sheeri Cabral

Tags: ,

By “Good SQL Querying”, I am not referring to “how to make your queries more perfomant.” I am about to go on a mini-rant about how to make readable and self-documenting SQL queries.

One practice that will get me instantly going on a rant is using a comma join. There is NO reason to do the following:

-- uses the sakila sample database
SELECT first_name, last_name, address
FROM customer,address;

What kind of join did the original author intend? A CROSS JOIN? Or did they really want an INNER JOIN and forget the WHERE clause?

The answer: you do not know for sure; you can only guess. Had the query been

SELECT first_name,last_name,address
FROM customer INNER JOIN address;

you would know that the author intended an INNER JOIN; had the query been

SELECT first_name,last_name,address
FROM customer CROSS JOIN address;

you would know that the author intended a CROSS JOIN. I advocate using INNER JOIN instead of JOIN because like the comma, JOIN is ambiguous.

For full disclosure, the MySQL EXPLAIN plan is exact same for the above queries. The difference is purely semantic, but I think it a very important difference, because it makes the query author’s intentions clear. There are issues with the comma join not having the same precedence as INNER JOIN, but that is for more complex queries. To wit:

SELECT first_name,last_name,address
FROM customer, address INNER JOIN city;

will actually do address INNER JOIN city first, then join with customer. This was changed to be more like the SQL standard, so it’s only in MySQL 5.0.12 and higher that it acts this way.

My feeling is that a comma join should NEVER be used in MySQL. There is no such thing as a “comma join”, really what you mean is an “inner join” or a “left/right join” or “cross join” (or “natural join”).

I know that naysayers will say that

SELECT first_name,last_name,address
FROM customer INNER JOIN address;

actually does a CROSS JOIN (Cartesian product), so it is folly to rely only on the name of the join. And I agree. However, if you name your joins appropriately for what you want and intend them to be, it is much easier to catch mistakes.

And while we are on the topic of good SQL query techniques…..I would like to rant a bit about join criteria. “Join criteria” is the technical term for the operations that determine how tables get joined. In the following query:

SELECT first_name, last_name, address
FROM customer INNER JOIN address
WHERE customer.address_id=address.address_id
AND last_name="Smith";

customer.address_id=address.address_id is the join criteria. last_name="Smith" is actually the filter criteria.
“Filter criteria” is the technical term for the operations that determine which rows are returned in the result. “Select criteria” is the technical term for the operations that determine which fields (columns) are returned in the result.

In the example, the WHERE clause has the join criteria and the filter criteria. There is a more appropriate place for the join criteria — the FROM clause, where the joins are actually specified. This would be:

SELECT first_name, last_name, address
FROM customer INNER JOIN address ON customer.address_id=address.address_id
WHERE last_name="Smith";

The ON keyword means that the part that follows determines the join criteria. If the columns in the tables to be joined have the same name, you can use the USING keyword:

SELECT first_name, last_name, address
FROM customer INNER JOIN address USING (address_id)
WHERE last_name="Smith";

When you look at the difference between that and

SELECT first_name, last_name, address
FROM customer INNER JOIN address
WHERE customer.address_id=address.address_id
AND last_name="Smith";

You can see that the former query very clearly states “Give me the first_name, last_name and address from the result of joining the customer and address tables together, using the address_id field; but only where the last_name is Smith.”

For more examples, compare my idea of “good”:

SELECT first_name, last_name, address, city, country
FROM customer INNER JOIN address USING (address_id)
INNER JOIN city USING (city_id)
INNER JOIN country USING (country_id)
WHERE last_name="Smith";

with my idea of “not-so-good”:

SELECT first_name, last_name, address, city, country
FROM customer,address,city,country
WHERE customer.address_id=address.address_id
AND last_name="Smith"
AND address.city_id=city.city_id
AND city.country_id=country.country_id;

The WHERE clause is overrun by join criteria, and you really have to think about it to figure out what is the join criteria vs. filter criteria in the query. The “good” example has all the join criteria in the FROM clause.

Now, an example where it really helped me to debug when the query was written in “good” format (note that this no longer uses the sakila sample database):

SELECT day_date,COUNT(sales)
FROM day_tbl LEFT JOIN sales_tbl
WHERE sales_date=day_date
GROUP BY day_date;

What I wanted was a list of days and the number of sales per date, and if there were no sales on that date, show the date with a count of 0. However, that is not what the query returns. The query returns only dates that have at least one sale.

The problem is that the query puts the join criteria in the WHERE clause. Even though there’s a LEFT JOIN, there is technically no join criteria in this query, only a filter. No matter whether you specify LEFT JOIN, INNER JOIN or CROSS JOIN, putting the join criteria in the WHERE clause will return the same results — the equivalent of an INNER JOIN.

I cannot get the results I want unless I put the join criteria where it belongs — in the FROM clause.

SELECT COUNT(sales),day_date
FROM day_tbl LEFT JOIN sales_tbl ON sales_date=day_date
GROUP BY day_date;

This actually returns the results I want — the join is based on the join criteria, and there is no filtering of results going on.

One note: syntax requires that the USING clause use parentheses:

USING (a[,b...])

The ON clause, however, allows you to use parentheses, but it does not require their use. I prefer to use them, because it more clearly delineates which operations are the join criteria. In some examples above I have used them, in others I have not.

Here is an example of using parentheses in the ON clause to avoid confusion:

SELECT COUNT(sales),day_date
FROM day_tbl LEFT JOIN sales_tbl ON (sales_date=day_date
AND day_date BETWEEN '2008-01-01' and '2008-12-31')
WHERE amount>500.00
AND type='paid'
GROUP BY day_date;

22 Responses to “Good SQL Querying”

  • To start with, you should never write a query involving columns from multiple tables without pre-qualifying all columns with either the table name or a table alias.

    MySQL is very lax in this area, and for example in 4.x will accept this syntax when the same column name may occur across tables. When migrating for example to MySQL 5, this then fails as MySQL 5 works to be more ANSI compliant.

    INNER JOIN I feel is unnecessary syntax, general practice from different RDBMS experience shows that table1,table2 implies an INNER JOIN. It is more relevant when queries contain LEFT JOIN for example, when infact it should be an INNER JOIN.

  • Sheeri Cabral says:

    My only problem with qualifying columns is that using the full table name makes the query somewhat more unreadable/unmanageable, and using an alias makes EXPLAINs a bit more ambiguous.

    I have the practice of qualifying columns when there is ambiguity only.

    Right, the problem is that while *other* RDBMS systems use a comma to mean “INNER” join, MySQL does not. Therefore, examples that use comma joins are perfectly acceptable for other systems, NOT for MySQL.

  • Bill Karwin says:

    Hi, great post. The “comma-join” was the standard way to do a join in ANSI SQL-89. It quickly became apparent that other types of joins would be useful, but by the time SQL-92 defined the JOIN keywords and its modifiers, Oracle and Sybase had invented their own proprietary outer join operators that developers could use in the WHERE clause.

    Unfortunately, despite the fact that the SQL standard decided on syntax for outer joins, and all RDBMS brands have supported the standard syntax for close to ten years, a lot of developers still use “comma-join” syntax.

    I guess they are learning SQL from books and examples written in the dark ages, or written by authors who themselves learned “comma-join” syntax before SQL-92 syntax became widely supported.

  • Justin Swanhart says:

    Sheeri,

    Actually “,” means CROSS JOIN. When you do:

    select a.c1, b.c2
    from a, b
    where a.c1 = b.c1

    you are actually asking for a cartesian product filtered on the predicate a.c1 = b.c1. Databases optimize this by not actually doing a cross product, but instead using an more optimal join algorithm such as nested loops, hash join or merge join based on the given predicate. If the predicate is not provided, then the entire cross product is output.

  • Justin Swanhart says:

    Also, for readability of JOIN clauses, instead of parenthesis, I prefer indentation:

    SELECT a.c1,
           a.c2,
           b.c1,
           c.cX 
      FROM a
      JOIN b
        ON a.c1 = b.c1
       AND a.status > 0 
      JOIN c
        ON c.c1 = a.c1
       AND c.c2 = b.c2
       AND c.cX IN (1,5,9)
    
  • Rob Wultsch says:

    Justin,
    I use Baron’s coding standards (http://www.xaprb.com/blog/2006/04/26/sql-coding-standards/), after coming to many of the same conclusions. I think it is a touch cleaner than what you suggest.

  • Joe Izenman says:

    I started working on a SQL Coding Standards document for my company last week, and the first two things I wrote down while brainstorming were “Ban comma joins” and “Use ON clause”.

    We ran into a an epic hassle last year of going through multiple clients’ code to strip out comma joins in queries that were mixing them with explicit LEFT and INNER JOINs, because we were migrating them to a server with MySQL 5.0. I would rather avoid that in future.

  • rudy says:

    i have a real problem with join queries that use qualifiers only on ambiguous columns — they require that you know, intuitively or otherwise, in order to grasp what’s going on, exactly which table contains which columns!!

    and that’s putting far too great a cost on the very debatable benefit of cleaner sql

    qualifying every column does not make the query “somewhat more unreadable/unmanageable” when you don’t know which table the columns are coming from

    let’s take your last query as an example:

    SELECT COUNT(sales),day_date
    FROM day_tbl LEFT JOIN sales_tbl ON (sales_date=day_date
    AND day_date BETWEEN ‘2008-01-01′ and ‘2008-12-31′)
    WHERE amount>500.00
    AND type=’paid’
    GROUP BY day_date;

    i’ll bet you a dollar that the amount and type columns come from the right table, the sales_tbl table (oh, and embedding “tbl” into a table name is also super annoying, but that’s a different rant for another day)

    and you know what? that LEFT OUTER JOIN don’t hunt — it’s actually an inner join, because of those conditions on columns of the right table in the WHERE clause

  • Sheeri Cabral says:

    Rudy — actually it’s still a LEFT JOIN, because each day_date for 2008 in day_tbl is represented in the result, even if there is no corresponding value in sales_tbl.

    Try it out yourself. I don’t like embedding _tbl into a table name, actually, but “day” and “date” are reserved words and I’d rather write day_tbl than `day`.

    You’re right that you want the columns to be intuitive. That’s a schema design issue, and I try to make it obvious which table a column comes from. That’s why the schema has “sales_date” instead of just “created” or something else ambiguous.

  • Justin Swanhart says:

    I find, that if I want to use a reserved word for a table, pluralizing that word often works out well.

    days and dates for example

  • rudy says:

    if, for a specific day, there is no matched row in the sales table, then the columns in the intermediate result produced by the LEFT OUTER JOIN that would’ve come from the sales table will be NULL, right? i mean, that’s how a LEFT OUTER JOIN works

    so what happens in the WHERE clause? one of those columns has to be greater than 500, and another of those columns has to be equal to ‘paid’

    NULL is not greater than anything, and NULL is not equal to anything

    therefore, all the unmatched rows will be filtered out

    in effect, it’s an inner join

    unless, of course, “amount” and “type” are actually in the day_tbl ;o)

  • Arjen Lentz says:

    And for optimum evil, try NATURAL [INNER|LEFT|RIGHT] JOIN.
    Unless there is a very strict and sane identifier naming rule in place with the schema design, it’ll break things in oh so many ways at a future time you won’t expect ;-)

  • Great post Sheeri, and the number of comments certainly shows that you hit a nerve :-)

    Agree with everything you say, and very happy that you also took time to mention USING .. way too many people are unaware of just how powerful that is, not only when writing it but also in helping someone understand what’s going on when they read the query 6 months from now.

    My personal addition would be a rant about those people that insist on aliasing every single table and column with a shorthand name, because “it makes it easier to understand and write”.

  • Hi Sheeri!

    nice post. Few comments:

    “There is no such thing as a “comma join”, really what you mean is an “inner join” or a “left/right join” or “cross join” (or “natural join”).”

    This suggests that NATURAL JOIN is a proper type. It is not, you can have inner and outer NATURAL joins. I don’t regard LEFT/RIGHT as different types either – it’s just syntax, there is not difference in the concept. So, I end up with three proper join types: cross join, inner join, outer join. (I don’t write caps now because I’m referring to the concepts – not the actual syntax).

    One of the things I expected you’d write about is LEFT vs RIGHT joins. Personally I always only use LEFT – never RIGHT. It doesn’t really matter which one you use but I really don’t see the point of using both.

    “Now, an example where it really helped me to debug when the query was written in “good” format (note that this no longer uses the sakila sample database):”

    In MySQL 5.1 and MySQL 5.0 you get a syntax error when you leave out the ON clause for LEFT and RIGHT join (insanely enough, not for INNER JOIN). So I don’t see what how the formatting could’ve helped there.

    Your remark about “join criteria” vs “filter criteria” is nice and I tend to think about it that way too. That is to say, I used to call all conditions that relate tables by comparing their columns as the ‘join condition’. It turns out you can’t really rigorously define it.
    consider

    SELECT City.Name, Country.Name, Country.IndepYear
    FROM City LEFT JOIN Country
    ON City.ID = Country.Capital
    AND Country.IndepYear IS NOT NULL
    WHERE City.Name IN (‘Kabul’, ‘Oranjestad’, ‘Qandahar’)

    (“all cities, and if they are the capital of an independent country, the country name and year of independence”)

    Here “Country.IndepYear IS NOT NULL” does not realy relate a city to a country, it is more like an extra condition to be applied when matching capital cities to their country. We can’t move it to the WHERE clause, as that would change the meaning of the query (effectively turning it into an inner join of “all capital cities of independent countries”)

    So, what are we to call this “Country.IndepYear IS NOT NULL” part? The only sensible thing I can think of is to call the entire thing

    City.ID = Country.Capital AND Country.IndepYear IS NOT NULL

    the join condition, and not make any judgment of the nature of the expression.

    (although I still feel that for inner joins it is best to keep the ‘filtering’ ones separate from the ‘join’ ones, but it’s just a habit.)

    kind regards,

    Roland

  • Hi, Nice post. Completely agree with you on the part about ” join criteria vs. filter criteria “. Thats the main reason I shifted from using “comma joins” to ” left/right/inner join” and this has changed the way I think and write queries.

  • Sheeri Cabral says:

    Rudy, you’re absolutely correct, my apologies. The filtering queries will make it act like an INNER JOIN. This is a great example of debugging too — you can see, because I wrote LEFT JOIN, that I intended a LEFT JOIN, so you can see my mistake. The filters such as WHERE amount>500 should be moved to the join criteria in the FROM claus.

  • rudy says:

    thanks sheeri

    maybe some day i can also make you see the benefit of qualifying all columns in join queries — after all, you’re not doing it for yourself, you’re doing it for whoever has to maintain the query later, eh

    ;o)

  • Toby says:

    Nitpicks: “performant” in your first line has a typo; but it’s an ugly word, what about “faster”? Also “criteria” is plural, so “criteria where it belongs” and other cases are incorrect.

  • Gints Plivna says:

    I’ve tried to summarize various join types here http://www.gplivna.eu/papers/sql_join_types.htm
    I’d like to add that natural join syntax is very dangerous and actaully asks for trouble as soon as table structure changes.
    Speaking about join criteria vs filter criteria – it doesn’t really matter for inner joins where these are written. However for outer joins it really does matter where another criteria are written – result changes – see chapter 6.4 Difference between predicates in ON clause vs. predicates in WHERE clause in my article.
    And speaking cross join vs inner join – the biggest problem is that MySQL had mixed them up, and historically cannot probably so easy enforce cross join like real cartesian join raising error when on clause is applied and inner join like normal inner join. I think the situation is the same like old outer join operators in Oracle (+) or MS SQL (*= and =*).
    And speaking about explicit JOIN keyword or putting tables in where clause – there are benefits for both approaches and both approaches are according to standard, so I personally think that’s the matter of each project – whether they like one or another syntactic type. That’s more matter of taste :)

  • Use SQL to Create a Calendar in Mutliple Languages « oracle fusion identity says:

    […] Sheeri Cabral […]

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>