After a post asserting that, as a rule, one must always version their database work, Scott Allen detailed an approach to getting the most out of database versioning. Allen presented a comprehensive, practical approach covering creation of a baseline, the use of change scripts to manage schema revisions, control of programmatic database objects (views, stored procedures, functions, and triggers), and branching and merging.
Allen began the series after a post in which he laid out what his experience has shown to be the three rules for development with relational databases. The rules are:
1. Never use a shared database server for development work.
Like many conveniences in software development, a shared database is a tar pit waiting to fossilize a project. Developers overwrite each other's changes. The changes I make on the server break the code on your development machine. Remote development is slow and difficult. Avoid using a shared database at all costs, as they ultimately waste time and help produce bugs.
2. Always Have a Single, Authoritative Source For Your Schema
Everyone should know where the official schema resides, and have a frictionless experience in getting a fresh database setup. I should be able to walk up to a computer, get the latest from source control, build, and run a simple tool to setup the database (in many scenarios, the build process can even setup a database if none exists, so the process is one step shorter).
3. Always Version Your Database
The common goal is to propagate changes from development, to test, and ultimately to production in a controlled and consistent manner. A second goal is to have the ability to recreate a database at any point in time. This second goal is particularly important if you are shipping software to clients. If someone finds a bug in build 20070612.1 of your application, you must be able to recreate the application as it appeared in that build - database and all.
Allen states that the goal of versioning databases is to be able to push out changes in a consistent, controlled, testable, and reproducible manner. Many agile coaches would agree that achieving this goal is paramount for an agile team to be effective.
Having established the importance of versioning databases, Allen continues with a series of four posts describing his preferred approach to achieving it.
The first of these describes what Allen asserts is the starting point for versioning a database: generating a baseline schema. Ultimately, the baseline is a script or collection of scripts containing all the SQL commands required to generate the application's database from scratch. This includes SQL commands to create every object (table, constraint, function, view, index, etc.), commands to populate lookup tables with static data, and commands to insert any bootstrap data needed by the application. Once the baseline is created and proven accurate, Allen says to "commit the file(s) to source control" and "consider your schema baselined!"
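For illustration only, a fragment of such a baseline might look like the following sketch; the tables and data are hypothetical, and SQL Server's T-SQL dialect is assumed.

```sql
-- Hypothetical baseline excerpt: object creation plus static lookup data.
CREATE TABLE CustomerStatus
(
    StatusId INT          NOT NULL,
    Label    NVARCHAR(50) NOT NULL,
    CONSTRAINT PK_CustomerStatus PRIMARY KEY (StatusId)
);

CREATE TABLE Customer
(
    CustomerId INT           IDENTITY(1,1) NOT NULL,
    Name       NVARCHAR(100) NOT NULL,
    StatusId   INT           NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY (CustomerId),
    CONSTRAINT FK_Customer_Status FOREIGN KEY (StatusId) REFERENCES CustomerStatus (StatusId)
);

-- Lookup and bootstrap data belong in the baseline as well.
INSERT INTO CustomerStatus (StatusId, Label) VALUES (1, 'Active');
INSERT INTO CustomerStatus (StatusId, Label) VALUES (2, 'Suspended');
```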
To create this baseline, Allen advises using a tool that can reverse-engineer the scripts from an existing database (as opposed to writing each by hand), and describes his preferred structure for the resulting script files:
I like to keep all of the SQL needed to create tables, constraints, defaults, and primary indexes in a single file. Any views, stored procedures, and functions are scripted one per file.
If you go the multiple file approach, make sure to write a batch file, shell script, application, or some other form of automation that can automatically locate and run all of the script files required to install the database. Human intervention in this process is a step backwards.
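One way to meet that requirement in a SQL Server environment (an assumption, as Allen does not prescribe a specific tool here) is a master script executed with sqlcmd, whose :r directive reads and runs each referenced file; the file names below are made up.

```sql
-- Hypothetical master script, run with: sqlcmd -S <server> -d <database> -i install.sql
-- In SQLCMD mode the :r directive reads and executes the referenced script file.
:r .\baseline\01_tables_and_constraints.sql
:r .\views\CustomerSummary.sql
:r .\procs\GetCustomer.sql
:r .\functions\FormatName.sql
```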
Allen stresses the importance of including in this baseline a table that will record any future changes to the schema, then proceeds in his next three posts to describe how these changes should occur.
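The change-tracking table itself can be small. Only the SchemaChangeLog name appears later in Allen's description of the update tool; the columns below are an assumed shape that supports the version-numbered scripts discussed later.

```sql
-- Assumed shape for the table that records every change script applied to a database.
CREATE TABLE SchemaChangeLog
(
    ScriptId     INT           IDENTITY(1,1) NOT NULL,
    MajorVersion INT           NOT NULL,
    MinorVersion INT           NOT NULL,
    ScriptNumber INT           NOT NULL,
    ScriptName   NVARCHAR(255) NOT NULL,
    DateApplied  DATETIME      NOT NULL DEFAULT (GETDATE()),
    CONSTRAINT PK_SchemaChangeLog PRIMARY KEY (ScriptId)
);
```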
First, Allen discusses the mechanism he uses to manage changes to schema objects other than views, stored procedures, and functions: change scripts. This approach specifies that each change (or set of related changes) be represented as an "incremental" update in a newly created script file, similar to Ruby Migrations. In a nutshell, as a team discovers the database requires a change, they create and test a new script that alters the database in the desired fashion (including any necessary data migration) and commit it to source control. Once published, the script is never to be changed again.
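A change script under this scheme contains the schema change, any data migration, and a record of its own execution. The sketch below is hypothetical and reuses the assumed SchemaChangeLog columns; the 01.00.0046 numbering follows the convention Allen uses in his branching example later in the article.

```sql
-- Hypothetical change script 01.00.0046: add a column, migrate data, log the change.
-- Once committed to source control, this file is never edited again.
ALTER TABLE Customer ADD Email NVARCHAR(255) NULL;
GO

UPDATE Customer SET Email = '' WHERE Email IS NULL;   -- illustrative data migration

INSERT INTO SchemaChangeLog (MajorVersion, MinorVersion, ScriptNumber, ScriptName)
VALUES (1, 0, 46, '01.00.0046_add_customer_email.sql');
```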
Allen's approach for making updates to views, stored procedures, and functions is the complete opposite of his approach for other database objects: one "create" script file per object, updated in place as changes to the object are required. As to why he prefers this, Allen states:
The simple reason is to find problems as early as possible. If someone commits a schema change and the change removes a column used by a view, you'll find out there is an error early – hopefully before the build escapes for testing. Likewise, if someone checks in a view but forgets to publish a schema change the view needs, someone else is going to show up at their desk a few minutes later asking why they are breaking the software.
A second reason is to avoid some rare errors I've seen. Some databases have a tendency to screw up execution plans when the schema changes underneath a view. Dropping everything and starting over avoids this problem, which is really hard to track down.
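Each view, stored procedure, or function then lives in its own script, written so it can always be dropped and recreated; a sketch with hypothetical object names:

```sql
-- One script per object: drop it if it exists, then recreate it from scratch.
IF OBJECT_ID('dbo.CustomerSummary', 'V') IS NOT NULL
    DROP VIEW dbo.CustomerSummary;
GO

CREATE VIEW dbo.CustomerSummary
AS
    SELECT c.CustomerId, c.Name, s.Label AS Status
    FROM   dbo.Customer c
    JOIN   dbo.CustomerStatus s ON s.StatusId = c.StatusId;
GO
```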
Allen strongly stresses the importance of using an automated tool to make optimal use of the strategies listed above:
The magic happens when a developer, tester, or installer updates from source control and runs a tool that updates their local database. The tool uses a three step process:
1. The tool applies new schema changes by comparing the available schema change files to the SchemaChangeLog records in the database.
2. The tool will DROP all stored procedures, views, and functions in the database.
3. The tool will run all of the scripts needed to add the views, stored procedures, and functions back into the database.
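The tool itself is external code, but the first two steps reduce to straightforward queries against the database. The following T-SQL sketch only illustrates what such a tool might execute; the exact mechanics are an assumption, not Allen's implementation.

```sql
-- Step 1 (sketch): find the latest change script already applied, so newer files can be run.
SELECT MAX(ScriptNumber) AS LastApplied
FROM   SchemaChangeLog
WHERE  MajorVersion = 1 AND MinorVersion = 0;

-- Step 2 (sketch): generate a DROP statement for every view, stored procedure, and function.
SELECT 'DROP ' + CASE o.type WHEN 'V' THEN 'VIEW '
                             WHEN 'P' THEN 'PROCEDURE '
                             ELSE 'FUNCTION ' END
       + QUOTENAME(SCHEMA_NAME(o.schema_id)) + '.' + QUOTENAME(o.name) AS DropStatement
FROM   sys.objects o
WHERE  o.type IN ('V', 'P', 'FN', 'IF', 'TF');

-- Step 3 simply re-runs the per-object create scripts, as in the view example above.
```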
As for the benefits of using this strategy, particularly the automated tool, Allen gives these examples:
Since the schema change scripts are in source control, you can recreate your database as it looked at any point in time. Is a customer reporting a bug on build 3.1.5.6723? Pull the source code tagged or labeled with that version number and run the baseline, then all schema change scripts included in the tag. You now have the same database and have a much better chance to recreate the bug. Also, changes move from development to test, and ultimately into production in a consistent, orderly, and reproducible manner.
Allen completes his series with a discussion about how the approach he describes handles branching and merging, as both are likely realities of any application that lives past its first version. Allen suggests Branch For Release as his preferred branching strategy, and explains that he re-baselines the database with each new release. He presents an example describing this, and adds a scenario in which a defect is found in an older, branched release that requires a schema change. On the branched release, a new script for the change is created with no problem. The question then posed is how to reflect the change on the current release, now residing on the mainline:
To get this fix into the mainline, there are two options. Well, actually there are an infinite number of options to consider depending on how you apply your updates, but here are two options:
1. Merge the schema change script into the mainline as 01.00.0046, and fix the 2.0 baseline script to incorporate this change.
2. Write a new schema change script, 02.00.0003, that has the same changes as change 46 in the branch.
With option #1 you have to be careful because any database that updated to v2.0 will not take the 46th change script from the branch (unless you write your tools differently than I do). You have to force people to run this script manually, or you go around destroying any existing v2.0 databases (which at this point should only be on development and test machines anyway). This is not a great option, but if you are not deep into 2.0 it is sometimes viable.
Option #2 is a bit friendlier. The v1.0 databases will pick up the fix from 01.00.0046. The v2.0 databases will pick up the fix from 02.00.0003. You have to be careful though, to write the 02.00.0003 change script so that it won't try to reapply changes if the 01.00.0046 script ran.
In other words, databases installed from the v2.0 baseline script need to apply the 02.00.0003 script, but production-type databases that have been around since 1.0 will use the 01.00.0046 script, and you don't want 02.00.0003 to create an error by making changes that are already in place when the database is eventually updated to v2.0.
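The guard Allen describes can be as simple as checking the change log before doing any work. A sketch of the 02.00.0003 script, again assuming the hypothetical SchemaChangeLog columns and the column added in the earlier change-script example:

```sql
-- Hypothetical change script 02.00.0003: reapply the v1.0 fix only if 01.00.0046 never ran here.
IF NOT EXISTS (SELECT 1 FROM SchemaChangeLog
               WHERE MajorVersion = 1 AND MinorVersion = 0 AND ScriptNumber = 46)
BEGIN
    ALTER TABLE Customer ADD Email NVARCHAR(255) NULL;   -- same change made by 01.00.0046
END
GO

INSERT INTO SchemaChangeLog (MajorVersion, MinorVersion, ScriptNumber, ScriptName)
VALUES (2, 0, 3, '02.00.0003_backport_customer_email.sql');
```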
For more on this and related subjects, see Scott Ambler's take on agile database development.