Source Controlling the Database Schema

Feb 12, 2008 / By Sheeri Cabral

In a linkage of biblical proportions, Log Buffer #83 pointed me to Tom Kyte’s reiteration which pointed me to Coding Horror’s rant about source controlling the database schema.

Now, for starters, I agree with Tom’s sarcasm and Coding Horror’s rant — the database schema really should be source controlled in the same place as the application code, because otherwise how do you know what changes happened when, and which version of the code goes with which version of the schema?

The problem I have is this — being a purist (and, by the way, *not* a programmer, so there could very well be concepts I’m missing), I really want to source control the DDL/GRANT statements and so on that are applied to the production database, because that way anyone can see exactly what was run, and I can take a schema backup and compare it to the source-controlled version.
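
For the MySQL case, here is a minimal sketch of what that schema backup could start from (the database name myapp is a placeholder); the idea would be to spool this output to a file and diff it against the checked-in copy:

    -- Sketch only: snapshot the live schema's structure from information_schema
    -- so the output can be saved and diffed against the source-controlled DDL.
    -- 'myapp' is a placeholder database name.
    SELECT table_name, column_name, column_type, is_nullable, column_default
    FROM   information_schema.columns
    WHERE  table_schema = 'myapp'
    ORDER  BY table_name, ordinal_position;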

But then you have this problem: you have the old schema plus the code *to be run*, and then you have the new schema. What do you do with that file of code once it has been run? It’s now old, but you don’t want to delete it.

If the answer is “don’t source control the change statements, just the differences,” then you’re requiring folks either to figure out how to get from point A to point B on their own, or to start with sample data only.
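
One common way out (not from the posts linked above; just a sketch with invented names) is to keep every numbered change script in the repository and have the database itself record which ones have been applied:

    -- Hypothetical bookkeeping table for schema changes; all names invented.
    CREATE TABLE schema_version (
        version     INT          NOT NULL PRIMARY KEY,
        script_name VARCHAR(255) NOT NULL,
        applied_at  DATETIME     NOT NULL
    );

    -- Each checked-in change script ends by recording itself, so "old" scripts
    -- are never deleted: together they are the path from point A to point B.
    INSERT INTO schema_version (version, script_name, applied_at)
    VALUES (4, '004_add_orders_index.sql', NOW());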

Plus, sometimes the DDL/GRANT statements are complex, or there are a lot of them to be run in a particular order, such that I *want* to source control them for backup and revision’s sake — even though SQL is declarative, there are still a few ways to skin each cat. How do I resolve that? (A separate and/or private repository?)

And if all that is the case, what are the security ramifications of having the schema, the code, and the data all in one place (the repository)? I’ve worked at places that have *real* three-part protection for reversibly encrypted data — a user-supplied password that wasn’t anywhere in the code (as far as I know it was passed along by word of mouth, and wasn’t even in documentation); the encrypted data itself; and the code with the encryption/decryption algorithms. Sure, in the repository you might have test scripts, but where and how do you back up the repository? In the same manner as the data?

I guess folks have to be careful about any real data they back up, making sure it’s kept somewhere separate from wherever their repository is backed up?

7 Responses to “Source Controlling the Database Schema”

  • Lukas says:

This is indeed a huge issue that most people try hard to ignore, simply accepting the consequences. It’s been accepted as a fact of life by most programmers/DBAs, it seems.

I have put a fair bit of thought into this topic, but for the most part I have only gotten to the point of documenting the different approaches and their pros and cons. You can read a somewhat recent posting on the topic here:
    http://pooteeweet.org/blog/918

GRANT and user management stuff I have not really had any luck putting on the map here at all yet. I guess you could come up with a system that makes it possible to parameterize the execution (or that asks for values for any placeholders that cannot be determined automatically).

  • Sheeri Cabral says:

    Lukas,

Actually, GRANT and user management is, in my eyes, a simpler issue — if a GRANT statement is issued twice, there isn’t really a problem, whereas if an ALTER TABLE statement is issued twice, you’ll either get an error (i.e., if you’re adding a column) or a success that ends up doing something bad, like duplicating an index. :-\
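
    To make the asymmetry concrete, here is a quick MySQL sketch (table, column, and account names invented for illustration):

        -- Issuing a GRANT twice is harmless:
        GRANT SELECT ON myapp.* TO 'reporter'@'localhost';
        GRANT SELECT ON myapp.* TO 'reporter'@'localhost';  -- no error, nothing changes

        -- Issuing the same ALTER TABLE twice is not:
        ALTER TABLE orders ADD COLUMN notes TEXT;
        ALTER TABLE orders ADD COLUMN notes TEXT;  -- ERROR 1060: Duplicate column name 'notes'

        -- And an unnamed index silently "succeeds" the second time:
        ALTER TABLE orders ADD INDEX (customer_id);
        ALTER TABLE orders ADD INDEX (customer_id);  -- creates a second, redundant index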

  • Fred says:

    Source control on the database schema, while a laudable idea, is not without its problems. For starters, if you need to go back to an earlier version of the schema, how do you do that? With the code, it’s easy — just recompile an earlier snapshot and you’re done. But with a schema there is also associated data, perhaps millions of rows, which would have to be transformed as well. There is no straightforward means of going back to an earlier schema unless one is willing to forget about the data. And saving just the ALTER statements, for instance, does not allow one to step back incrementally to an older schema at all.
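
    To make that concrete, here is a hypothetical up/down pairing (file names invented); the reverse script restores the structure, but not the data the column held:

        -- 004_add_notes.up.sql: the forward change
        ALTER TABLE orders ADD COLUMN notes TEXT;

        -- 004_add_notes.down.sql: the structural reverse. Any data the column
        -- held is gone, which is exactly the problem described above.
        ALTER TABLE orders DROP COLUMN notes;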

    So, you’re kinda stuck. The only useful purpose of putting the schema under version control is to allow one to view its history. Putting all the data under version control as well would be prohibitively expensive for anything but trivial datasets.

    Yet, not all is lost. Here lie possibilities for new and more powerful tools specifically tailored to the DBA’s need to keep track of historical changes to the schema. Time for me to cogitate on this a bit.

  • Heidi Schmidt says:

    For tracking which statements were run when… I think more in terms of auditing (fine-grained or basic auditing, or Oracle Vault).
    With auditing, I’ve seen that it is more important to audit the unexpected case than to audit every possible instance; the sheer volume of changes would dull the eyes looking for an exception to the pattern.
    It helps to have the business rules on hand to identify which is the exception and which is the rule.

    Source control is looking up, though. In Oracle there are two ways you could go about doing source control of the schema
    (and probably a third, if you SQL-traced the Change Management Pack’s dbms_job session to see what package or SQL statements it ran, and worked to take the OEM Grid Control GUI out of the middle).

    Oracle Grid Control has the Change Management Pack built in (it has been around in some shape or form since 8i), which allows you to create a dbms_job that captures a baseline of anything and everything (schema, users, tablespaces, tables, views, etc.).
    Then you can capture and publish that report to end users, in either HTML or ASCII format.
    I’ve used it to hunt down the changes that occur from week to week.

    Or you can do wonders with command-line control. In 10g, a little DBMS_METADATA and a bit of tweaking of your SQL*Plus settings will spool the DDL out to a file and automate a check-in.
    Then CVS or Subversion diffs become the tool that shows what has changed; a rough sketch follows below.
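
    A minimal SQL*Plus sketch of that approach, assuming 10g’s DBMS_METADATA; the schema name SCOTT and the spool path are placeholders:

        -- Sketch only: spool each object's DDL to a file for check-in.
        SET LONG 100000 PAGESIZE 0 LINESIZE 200 TRIMSPOOL ON FEEDBACK OFF
        SPOOL /tmp/scott_schema.sql
        SELECT DBMS_METADATA.GET_DDL(object_type, object_name, owner)
        FROM   all_objects
        WHERE  owner = 'SCOTT'
        AND    object_type IN ('TABLE', 'VIEW', 'INDEX');
        SPOOL OFF
        -- The spooled file then gets checked into CVS/Subversion, and plain
        -- diffs show what changed between captures.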

    Consistently, though, the capture has to be done the same way for both options. Even more so with version control software, because deltas are determined line by line, not at the higher object-to-object level the Change Management Pack can yield.

    Heidi

  • Sheeri Cabral says:

    Fred — I think there’s probably a limited dataset to which that applies — for instance, if you have a list of “active countries” or a lookup table of static values, then yes, you want to keep that under revision control.

    The point is to be able to say “go here, and give me a working application at that point.” You don’t necessarily need a lot of *user*-supplied data, though you may need some static tables. If an application needs millions of rows just to function, it’s probably a specialized application.
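
    For instance, a purely invented static lookup table like this could live in the repository right next to the DDL:

        -- Hypothetical static lookup table; both the CREATE and its seed rows
        -- belong under revision control with the schema.
        CREATE TABLE active_countries (
            iso_code CHAR(2)     NOT NULL PRIMARY KEY,
            name     VARCHAR(64) NOT NULL
        );
        INSERT INTO active_countries (iso_code, name)
        VALUES ('US', 'United States'), ('CA', 'Canada');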

  • [...] missed one post, though — Sheeri Cabral’s item on source controlling the database schema here on Pythian’s blog. Sheeri (a.k.a., the MySQL SheBA) also finally caves, and tells why [...]

  • Sheeri, we are actually dealing with this very issue at OLB here. What we will probably do in the interim is take an LVM snapshot of the entire MySQL database and “version control” that at the *release* level — so when we do a *release* we’ll be able to completely duplicate what went out to QA.

    Going from QA to production is trickier, but what we are planning to do is create a canned set of ALTER/GRANT statements to operate on the *prior* database snapshot; this way we have complete reproducibility, at least in our dev/QA environments. There should NEVER be a need to back-rev the production DB further than the immediately prior release, and even then that should be a rare event.
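
    A hypothetical example of what one of those canned release scripts might look like (file and object names invented, not the actual schema):

        -- release_042.sql: applied to the *prior* snapshot to produce this
        -- release; everything here is illustrative only.
        ALTER TABLE customers ADD COLUMN last_login DATETIME;
        ALTER TABLE orders    ADD INDEX idx_status (status);
        GRANT SELECT, INSERT, UPDATE ON myapp.* TO 'app_user'@'10.0.0.%';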
