Microsoft Sync Framework is a data synchronization platform from Microsoft for building offline-enabled applications. It can be used for occasionally connected clients that need to connect to a central server, for peer-to-peer applications, and for any application where data needs to be synchronized between multiple data stores. There are built-in mechanisms for conflict detection and handling.
Microsoft first announced the Sync Framework CTP in late 2007, suggesting that it would make building sync applications much easier than before, allow a large amount of reuse of synchronization logic through providers, and be an overall improvement in the data-anywhere philosophy.
It has been more than three years since then, and it looks like the promise is being delivered: Microsoft has made it a comprehensive and robust framework that supports synchronization of data in a number of ways, and has built a number of solutions on top of it, especially with SQL Server, SQL Azure and Windows Azure Storage.
Although most of Microsoft's marketing efforts for Sync Framework revolve around its own database products, it must be said that the framework itself can work with any data store that has ADO.NET capability. That includes open-source RDBMSs like PostgreSQL and MySQL, file systems, object-oriented databases, and lots more. What's even more interesting is that it is possible to build a synchronization solution across different datastore types - for instance between PostgreSQL and SQL Server, or Access and MySQL. This makes the framework extremely powerful.
In this article, I will stick to using the Sync Framework to build a simple client-server synchronization for a non-SQL Server database, PostgreSQL. The technique can easily be extended to any other database, like Oracle, MySQL or DB2, but each database will require its own flavor of customization depending on supported data types, procedural language syntax and other details.
This solution is based on an old sample provided by Microsoft for utilizing the Sync Framework - since SQL Server 2005 had no built-in change tracking mechanism, the sample provided a simple client-server architecture we could use. The original SQL Server version can be downloaded here.
(The version shows as deprecated because SQL Server 2008 has its own change tracking mechanism; however, for non-MS databases like MySQL or PostgreSQL, this sample is still very useful.)
I have adapted this to work with PostgreSQL. You can download the modified sample from here: Download.
First, let us understand the synchronization architecture for a simple client-server deployment: the server is the central repository of data, and the client needs a local cache or copy of the server data (either full or partial). The architecture is based on three premises:
- There is a change tracking mechanism present in both the client and server databases: anchor values to track the last time synchronization happened (in each client), timestamp columns, and tombstone tables. These can be used to get the new inserts/updates/deletes that occurred after the last anchor value was set.
- Each datastore is accessed using a SyncProvider in the sync application. The SyncProvider defines all the methods required to access the change tracking mechanism of the datastore.
- A SyncAgent or a SyncOrchestrator coordinates synchronization between the two SyncProviders and also takes care of conflict handling. We will be using a SyncAgent in our example, wired up roughly as in the sketch after this list.
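To make these roles concrete, here is a minimal sketch of how the pieces fit together, assuming the provider class names from the attached sample (PostgreSQLClientSyncProvider and DBServerSyncProvider) and the orders table used later in the article:

```csharp
using System;
using Microsoft.Synchronization.Data;

// Wire the two providers to a SyncAgent and synchronize one table.
SyncAgent agent = new SyncAgent();
agent.LocalProvider = new PostgreSQLClientSyncProvider();  // client datastore
agent.RemoteProvider = new DBServerSyncProvider();         // server datastore

// Register the table to sync; changes flow in both directions.
SyncTable ordersTable = new SyncTable("orders");
ordersTable.SyncDirection = SyncDirection.Bidirectional;
agent.Configuration.SyncTables.Add(ordersTable);

SyncStatistics stats = agent.Synchronize();
Console.WriteLine("{0} change(s) uploaded, {1} downloaded",
    stats.TotalChangesUploaded, stats.TotalChangesDownloaded);
```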
The change tracking mechanism
The change tracking almost invariably happens through a combination of triggers and timestamp columns - for instance, triggers to store a last-update timestamp on updates and to keep a record of deleted rows and their deletion timestamp in a tombstone table. There is then a set of functions (stored procedures, if you come from a SQL Server background) required to get the incremental inserts, updates and deletes since the last set anchor, a function to get the new anchor value (generally the current timestamp), and functions to apply inserts, updates and deletes from the other datastore to this datastore.
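As an illustration, here is a sketch of the delete-tracking side for an orders table. The names and columns are illustrative (the attached sample contains the full versions), and timestamps are stored as bigint values for reasons explained in the anchor discussion further down:

```sql
-- Tombstone table: one row per deleted order, with the deletion time.
CREATE TABLE orders_tombstone (
    order_id         bigint PRIMARY KEY,
    delete_timestamp bigint NOT NULL
);

-- Trigger function: remember what was deleted and when, so the
-- incremental-delete function can replay the deletion on the other side.
CREATE OR REPLACE FUNCTION orders_ondelete() RETURNS trigger AS $$
BEGIN
    INSERT INTO orders_tombstone (order_id, delete_timestamp)
    VALUES (OLD.order_id,
            (extract(epoch FROM clock_timestamp()) * 1000000)::bigint);
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_delete_tracking
    AFTER DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_ondelete();
```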
For a simple change tracking mechanism, the following objects are required for each table in your database. The attached sample project shows how each of these objects can look in a PostgreSQL database.

A set of functions:

- <tablename>_incrinserts
- <tablename>_incrupdates
- <tablename>_incrdeletes
- <tablename>_applyinserts
- <tablename>_applyupdates
- <tablename>_applydeletes
- <tablename>_getupdateconflicts
- <tablename>_getdeleteconflicts

A tombstone table:

- <tablename>_tombstone

And insert, update and delete triggers:

- <tablename>_oninsert
- <tablename>_onupdate
- <tablename>_ondelete

The last two functions, getupdateconflicts and getdeleteconflicts, are required to fetch conflicting rows so that they can be taken up for conflict resolution in the sync logic.
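For example, an incremental-inserts function can look roughly like this (the signature is illustrative; in the sample the two anchor values arrive through the framework's @sync_last_received_anchor and @sync_new_received_anchor session parameters):

```sql
-- Rows created after the last anchor and up to the new anchor are
-- inserts the other datastore has not seen yet.
CREATE OR REPLACE FUNCTION orders_incrinserts(lastanchor bigint, newanchor bigint)
RETURNS SETOF orders AS $$
    SELECT *
    FROM orders
    WHERE create_timestamp > $1
      AND create_timestamp <= $2;
$$ LANGUAGE sql;
```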
An anchor table is required at the database level to hold the sync anchors. This is required only in the client database and not in the server:

- anchor

And a couple more functions to retrieve the new anchor and the last sent and received anchor values:

- sp_get_new_anchor
- sp_get_last_sent_anchor
- sp_get_last_received_anchor
Anchor
The anchor table in the client database has one row for each of the data tables being synced. There are three columns: table name, last received anchor and last sent anchor. The received and sent anchors record when data was last received from the server and sent to the server, respectively. They are then used to fetch only the incremental changes after that point, both from the server and from the client.
The sent anchor is a timestamp value from the client database and the received anchor is a timestamp value from the server database. This is a small detail, but it makes a difference if there is a clock difference between the machines hosting the server and the client.
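A minimal sketch of the anchor table and one of its accessor functions, with illustrative column names (see the client script in the attached sample for the real definitions):

```sql
CREATE TABLE anchor (
    table_name           varchar(100) PRIMARY KEY,
    last_received_anchor bigint,
    last_sent_anchor     bigint
);

-- Returns the last received anchor for a given table (NULL on first sync).
CREATE OR REPLACE FUNCTION sp_get_last_received_anchor(tname varchar)
RETURNS bigint AS $$
    SELECT last_received_anchor FROM anchor WHERE table_name = $1;
$$ LANGUAGE sql;
```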
SyncProviders
The actual synchronization is handled by the SyncEngine and the SyncProviders. The SyncProviders handle retrieving and applying sync data for the particular database implementation, and abstract the framework logic from that implementation. This is why, for the Sync Framework to work with a PostgreSQL database, all you need is a SyncProvider customized to work with it.
In the attached sample, PostgreSQLClientSyncProvider is one such provider. It inherits from ClientSyncProvider and connects the client PostgreSQL database to the main sync application. The other provider used is DBServerSyncProvider - it connects the server database to the sync application.
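As a flavor of what the client provider does internally, here is a simplified fragment of one of its anchor accessors, wired to the sp_get_last_received_anchor function shown earlier (the _connectionString field and error handling are elided details; see the download for the real implementation):

```csharp
// Inside PostgreSQLClientSyncProvider:
public override SyncAnchor GetTableReceivedAnchor(string tableName)
{
    using (NpgsqlConnection conn = new NpgsqlConnection(_connectionString))
    {
        conn.Open();
        NpgsqlCommand cmd = new NpgsqlCommand(
            "SELECT sp_get_last_received_anchor(@tname)", conn);
        cmd.Parameters.Add(new NpgsqlParameter("tname", tableName));
        object result = cmd.ExecuteScalar();
        long anchorValue = (result == null || result is DBNull) ? 0L : (long)result;
        // SyncAnchor wraps an opaque byte[]; the bigint anchor is serialized into it.
        return new SyncAnchor(BitConverter.GetBytes(anchorValue));
    }
}
```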
SyncEngine and SyncAgent
The SyncEngine initializes the SyncProviders and the SyncAgent to initiate the synchronization. The SyncAgent orchestrates the flow of data between the two SyncProviders. While initializing the SyncProviders, the SyncEngine also feeds database schema information to them through SyncAdapters (one SyncAdapter for each table) and provides the get-incremental and apply commands that the providers need.
One question is how the SyncAgent knows what data to pass as parameter values when the apply commands are processed. This is simple: all the data retrieved by the incremental commands is stored in the sync session as DataTables. Whenever the Sync Framework comes across a parameter, it searches the sync session for a column with the same name in the DataTable for the same table. If one exists, it uses that column's values for the corresponding parameter. This is why the parameter names in the apply commands and the column names in the incremental selects must match. You can try it out if you want: if you change the parameter names or give aliases to the columns in the incremental select functions, the synchronization won't work as expected.
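Here is a simplified sketch of how a SyncAdapter for the orders table can be set up (adapted loosely from SyncEngine.cs in the sample; the function signatures are illustrative, and serverSyncProvider stands for the DBServerSyncProvider instance). Note how the @order_id parameter of the delete command matches the order_id column returned by orders_incrdeletes - that is what lets the agent bind the values automatically:

```csharp
using Microsoft.Synchronization.Data;
using Npgsql;
using NpgsqlTypes;

SyncAdapter ordersAdapter = new SyncAdapter("orders");

// Incremental deletes: read from the tombstone table between anchors.
NpgsqlCommand incrDeletes = new NpgsqlCommand(
    "SELECT * FROM orders_incrdeletes(@sync_last_received_anchor, " +
    "@sync_new_received_anchor)");
incrDeletes.Parameters.Add(
    new NpgsqlParameter("@sync_last_received_anchor", NpgsqlDbType.Bigint));
incrDeletes.Parameters.Add(
    new NpgsqlParameter("@sync_new_received_anchor", NpgsqlDbType.Bigint));
ordersAdapter.SelectIncrementalDeletesCommand = incrDeletes;

// Apply deletes: @order_id is filled from the order_id column of the
// rows the other side's incremental-delete command returned.
NpgsqlCommand applyDeletes = new NpgsqlCommand(
    "SELECT orders_applydeletes(@order_id)");
applyDeletes.Parameters.Add(new NpgsqlParameter("@order_id", NpgsqlDbType.Bigint));
ordersAdapter.DeleteCommand = applyDeletes;

serverSyncProvider.SyncAdapters.Add(ordersAdapter);
```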
Now that we have covered the concepts, it is time to dig into the actual code:
First, we replaced all the SqlCommand, SqlTransaction and SqlParameter objects with NpgsqlCommand, NpgsqlTransaction and NpgsqlParameter objects. For this you need a reference to the Npgsql dll, which can be downloaded from here.
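The swap is mostly mechanical, since ADO.NET's provider model gives the SQL Server classes direct Npgsql equivalents (the connection string values here are placeholders, not the sample's):

```csharp
using Npgsql;

// SqlConnection  -> NpgsqlConnection
// SqlCommand     -> NpgsqlCommand
// SqlTransaction -> NpgsqlTransaction
// SqlParameter   -> NpgsqlParameter
var conn = new NpgsqlConnection(
    "Server=localhost;Port=5432;Database=ordersdb;User Id=postgres;Password=*****;");
var cmd = new NpgsqlCommand("SELECT count(*) FROM orders", conn);
```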
For this example, we have taken only one table - orders. order_id is the primary key and hence has to be unique.
To ensure that this id remains unique even in disconnected mode, we use sequences starting from different values in the client and the server databases. This design ensures we don't have to worry about too many insert conflicts: the primary key sequences produce completely different ranges on the client and the server. To have a sizeable range, we define this column as bigint.
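A sketch of the idea (the exact start values and columns in the attached scripts may differ):

```sql
-- In the server database:
CREATE SEQUENCE orders_order_id_seq START WITH 1;

-- In the client database, start far enough away that the two id
-- ranges can never collide while disconnected:
-- CREATE SEQUENCE orders_order_id_seq START WITH 1000000000;

CREATE TABLE orders (
    order_id         bigint PRIMARY KEY DEFAULT nextval('orders_order_id_seq'),
    order_details    varchar(200),
    create_timestamp bigint,
    update_timestamp bigint
);
```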
The anchor value is another tricky thing to consider - it is generally serialized after retrieval and deserialized before being set. Since the datatypes used in SQL Server for this are not available in PostgreSQL, we need a slightly different strategy. Here I am using the timestamp converted into a bigint as the anchor value.
This gives two advantages: since bigint values can easily be compared with each other, it keeps our stored procedures simple; and since the datatype is supported by Npgsql, it is very easy to retrieve into our sync logic.
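A minimal version of the new-anchor function under this scheme (microseconds since epoch here; any monotonic conversion works, as long as the tracking triggers use the same one):

```sql
CREATE OR REPLACE FUNCTION sp_get_new_anchor() RETURNS bigint AS $$
    SELECT (extract(epoch FROM clock_timestamp()) * 1000000)::bigint;
$$ LANGUAGE sql;
```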
All the procedures defined in the scripts are self-explanatory, and with the above concepts you should understand them easily. These stored procedures are then used in both the client and the server sync providers, and synchronization is invoked.
To run the sample application,
- Prepare the databases - create two databases using the client and server scripts.
- Download and install the Sync Framework
- Open the solution PostgreSQLProviderSample.sln
- Modify lines 80 and 87 in the SyncEngine.cs file to point the connection strings to the above databases (illustrative connection strings are shown after this list)
- Run the code in debug mode
- Click the "Make Random Insert" button against both the server and the client databases to create random rows.
- Click Synchronize to synchronize the rows.
- Try random updates and deletes too and then synchronize to see how the changes flow from client to server and vice-versa.
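For reference, the two connection strings will look something like this (host, database names and credentials are placeholders for your own setup):

```csharp
string clientConnectionString =
    "Server=localhost;Port=5432;Database=sync_client;User Id=postgres;Password=*****;";
string serverConnectionString =
    "Server=localhost;Port=5432;Database=sync_server;User Id=postgres;Password=*****;";
```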
If you look at the ids of the rows generated on the server and the client, you will notice that they come from entirely different series. This is by design - without it, if inserts happen on the client as well as the server, the chances of conflicts in your database are much higher, which unnecessarily increases the complexity of the synchronization mechanism. Another way to achieve this is by using a GUID column, but that can have performance implications.
So there you have it - a sample application that utilizes Microsoft Sync Framework to synchronize data across two PostgreSQL databases! Try it out and let me know what you think!