AtomServer – The Power of Publishing for Data Distribution

Consider this – you work for a company with a federation of fully independent web sites, implemented in half a dozen different programming languages, on several different platforms. Each independent website has its own database system and schema, and is managed by teams with varying skill sets, located in eight sites throughout the United States and Europe. And the company is growing.

Your job? Enable these disparate systems to share crucial data conveniently and rapidly amongst themselves.

Your design criteria are:

  • High Traffic Capability – the service would need to move approximately 1M pieces of data a day at launch

  • Transactional Correctness – the service must be accurate as the authoritative source of data for all clients

  • Resiliency – the service must be easy to upgrade with seamless data republishing when formats change

  • Loose Coupling – with so many systems, each must be able to manage itself independently

  • Adoption – the system must have a low barrier to entry for clients implemented in a variety of languages (Java, C#, PHP, Ruby, and ColdFusion)

  • Adaptability – the system must support many different types of data and be extensible to add new types of data on demand

We were faced with exactly this problem about a year ago at Homeaway.com, where we both work. And it didn’t take long to recognize two design tenets – first, that a distributed, publish-subscribe service is a great way to address resiliency and loose coupling of subsystems, and second, that building RESTful services (as opposed to heavyweight protocols like SOAP) is a natural solution for systems that need high scalability, extensibility, and ease of adoption. These two principles led us directly to Atom – a RESTful publishing protocol – and to a new breed of data service called an Atom Store. We’ve spent the last year implementing an Atom Store for Homeaway. And from that real-world implementation we have extracted the open source Atom Store framework, named AtomServer (http://www.atomserver.org), described in this article.

Atom

Atom is comprised of two specifications – the Atom Syndication Format, which defines an XML-based language to describe web feeds, and the Atom Publishing Protocol, which describes a RESTful HTTP protocol for retrieving and manipulating such feeds.

Atom was conceived as a replacement for RSS (Rich Site Summary), which generally contains human authored text, such as blog entries. Consequently, the internal structure of an Atom entry or feed (the XML elements and attributes) conveys the semantics of publishing such as authors, languages, titles and so on. Don’t let this fool you; Atom entries are well suited for carrying all sorts of data as their payload.

Atom entries are the individual records of data, and Atom feeds are lists of entries. Because Atom is a RESTful protocol, resources are accessed by executing HTTP methods on URIs that identify resources – in this case, entries and feeds. For example, retrieving a feed of blog entries might be accomplished by doing a GET on a URI like http://your-atomserver/entries/myblog, and the response might look like:

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="alternate" href="http://your-blogserver/MyBlog"/>
  <updated>2007-04-14T20:00:39Z</updated>
  <title>My Weblog</title>
  <entry>
    <title>First Post</title>
    <author><name>Chris Berry</name></author>
    <link rel="edit" href="http://your-blogserver/MyBlog/1234"/>
    <updated>2007-04-14T20:00:00Z</updated>
    <id>1234</id>
  </entry>
  <entry>
    <title>Next Post</title>
    <author><name>Bryon Jacob</name></author>
    <link rel="edit" href="http://your-blogserver/MyBlog/1235"/>
    <updated>2007-05-01T17:00:00Z</updated>
    <id>1235</id>
  </entry>
</feed>
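As a rough illustration of how little a client needs to consume such a feed, here is a minimal Python sketch using only the standard library. The `parse_feed` function and the `SAMPLE_FEED` constant are names made up for this example, and the sample is a trimmed copy of the feed above; a real client would fetch the XML over HTTP instead.

```python
import xml.etree.ElementTree as ET

# Namespace of the Atom Syndication Format.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed copy of the feed shown above (illustrative only).
SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Weblog</title>
  <updated>2007-04-14T20:00:39Z</updated>
  <entry>
    <title>First Post</title>
    <link rel="edit" href="http://your-blogserver/MyBlog/1234"/>
    <updated>2007-04-14T20:00:00Z</updated>
    <id>1234</id>
  </entry>
  <entry>
    <title>Next Post</title>
    <link rel="edit" href="http://your-blogserver/MyBlog/1235"/>
    <updated>2007-05-01T17:00:00Z</updated>
    <id>1235</id>
  </entry>
</feed>"""


def parse_feed(xml_text):
    """Return one dict per entry with its id, title, updated time and edit link."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        edit_link = None
        for link in entry.findall("atom:link", ATOM_NS):
            if link.get("rel") == "edit":
                edit_link = link.get("href")
        entries.append({
            "id": entry.findtext("atom:id", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", namespaces=ATOM_NS),
            "edit": edit_link,
        })
    return entries


entries = parse_feed(SAMPLE_FEED)
```

Each dict carries the entry's `edit` link, which is the URI a client would later write to when changing that entry.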

Atom also supports the notion of categories that can be applied to an entry. Categories are arbitrary string tags that can be applied to an entry for the purposes of marking it as part of some group. Feed URIs can then be modified to incorporate filtering – returning only the entries from a feed that have a particular category applied to them.

There are a couple of terms that have special meaning in the context of Atom. Entries are grouped together in a fixed two-level hierarchy of workspaces and collections. Workspaces contain some number of collections, and collections contain some number of entries. The URI to a feed consists of the workspace and collection of the feed:

http://your-atom-server/workspace/collection

and the URI to an entry consists of the workspace, collection, and EntryId. This identifier for the entry must be unique within the owning collection:

http://your-atom-server/workspace/collection/entryid

Atom, like RSS, provides the basis for a web syndication framework. There are a large number of existing clients that understand Atom, including browsers, newsreaders, and programmatic clients in practically every popular language. Additionally, because Atom is actually just a small set of conventions on top of HTTP and XML, it is easily accessible by any web-ready programming platform.

Extensions and Layered Protocols

The core Atom protocol describes the basic operations for manipulating feeds and entries, methods for error reporting and handling, and specifically provides for the concept of extensions. Atom extensions are additional XML elements and attributes that can appear in Atom XML documents, and additional HTTP request parameters that can be applied to a URI to modify the behavior of a server that supports Atom. Two important extensions are:

  • OpenSearch – defines a protocol for searching, including methods to introspect the server to determine what kinds of searches are supported. An OpenSearch enabled service returns search results as an Atom feed, with individual results represented as entries

  • Feed Paging – addresses pagination for time-based data by defining next and previous link types for Atom feeds which clients can use to page forward and back through a multi-page feed

Probably the most visible and influential use of Atom is GData, Google’s web API for accessing the data from their many services. GData incorporates the core Atom spec and OpenSearch, as well as a number of custom extensions to cover additional features not addressed by those specifications.

When you no longer limit Atom entries to web content like blogs or news feeds, and extend Atom to the management of general data, you have an Atom Store; a generic data store of inter-linked Atom entries, which you can edit using the Atom Publishing Protocol, and then search over using OpenSearch. AtomServer grew out of a desire to leverage this strategy for distributing access to our own data.

Distributed Control Over Data Access

One of the benefits to our use of the Atom protocol is the inherently distributed nature of the system. In AtomServer, we combined the notion of pagination, limiting the number of entries that we return with each request, with the mechanism for polling the server for feed updates. Clients should start at the beginning of the feed, and continue requesting pages of data until there are no more. Then, the client should periodically check for new data (i.e., whether there is another page of data past the last one that was successfully processed).

AtomServer itself simply marks each entry with an incrementing counter on every change, and clients are required to store the last value of the counter that was processed for each feed they read. On subsequent polls for a feed, the start-index of the next page of data is set to the end-index of the previous page. This method remains reliable when AtomServer handles a high volume of rapidly changing data, where multiple entries can change within the same second and a timestamp alone would not distinguish them.

When you pull a feed with no start-index, you start at 0 by default. For example,

GET http://your-atom-server/widgets/acme

will return an extension tag, named endIndex in the http://atomserver.org/namespaces/1.0/ namespace, as a child of the feed element. This will contain the last index on the retrieved page:

<as:endIndex>23</as:endIndex>

That number should be passed as the start-index query parameter on the next poll:

GET http://your-atom-server/widgets/acme?start-index=23

When the requested page has no new data to offer, the server will return a 304 NOT MODIFIED response. This signals that it is advisable to wait for the given configured polling interval before asking for the next page of data.
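The polling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `read_new_entries` and `make_fake_server` are hypothetical names, the fake server stands in for a real HTTP client, and we assume that `start-index` is exclusive (the server returns entries whose counter is greater than the value passed), which is what the `endIndex` example above suggests.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
AS = "http://atomserver.org/namespaces/1.0/"  # namespace of the endIndex extension


def read_new_entries(fetch_page, feed_uri, start_index=0):
    """Request pages from start_index until the server answers 304.

    fetch_page(uri) is supplied by the caller and returns (status, body).
    Returns (entries, last_index); the client stores last_index and passes
    it back as start_index on its next poll.
    """
    entries = []
    index = start_index
    while True:
        status, body = fetch_page(f"{feed_uri}?start-index={index}")
        if status == 304:  # no data past this index yet
            return entries, index
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        root = ET.fromstring(body)
        entries.extend(root.findall(f"{{{ATOM}}}entry"))
        # The next page starts where this one ended.
        index = int(root.findtext(f"{{{AS}}}endIndex"))


def make_fake_server(total, page_size):
    """Hypothetical stand-in for AtomServer: entries carry counters 1..total."""
    def fetch_page(uri):
        start = int(uri.rsplit("=", 1)[1])
        ids = list(range(start + 1, total + 1))[:page_size]
        if not ids:
            return 304, ""
        items = "".join(f"<entry><id>{i}</id></entry>" for i in ids)
        body = (f'<feed xmlns="{ATOM}" xmlns:as="{AS}">'
                f"<as:endIndex>{ids[-1]}</as:endIndex>{items}</feed>")
        return 200, body
    return fetch_page


fetch = make_fake_server(total=5, page_size=2)
entries, last_index = read_new_entries(fetch, "http://your-atom-server/widgets/acme")
```

A real client would call `read_new_entries` again after its configured polling interval, passing the stored `last_index`, and would only persist the new index after the returned entries have been processed successfully.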

Managing Entry Identity with POST and PUT

In Atom, it is essential that each entry have an ID that is unique within its owning workspace and collection – these three components together make the entry’s URI a unique identifier for the entry, differentiable from the other entries in the service. AtomServer supports creation of new entries using two different HTTP methods – POST and PUT.

When an entry is created with a POST, the URI that is used is the URI to the collection in which the new entry is to be inserted. In this case, AtomServer is responsible for assigning an Entry Id to the new Entry, and providing that ID back to the POST caller in the response body.

POST http://your-atom-server/widgets/acme

When an entry is created with a PUT instead, the responsibility for assigning the Entry Id is placed on the client doing the PUT. The URI to which the PUT is made is the URI to the entry that is to be created.

PUT http://your-atom-server/widgets/acme/1000.xml

Updating an existing entry is done by making a PUT to the Entry’s URI – in that sense creating an Entry with a PUT is like a “lazy” update. If there is no such Entry, it is created, otherwise the existing entry is updated.
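To make the difference between the two creation styles concrete, here is a small Python sketch that builds (but does not send) the corresponding requests with the standard library. The function name `build_create_request` is made up for this example, the `.xml` suffix simply mirrors the PUT example above, and the `Content-Type` value is the standard Atom entry media type rather than something AtomServer is known to require.

```python
from urllib.request import Request


def build_create_request(base_uri, workspace, collection, entry_xml, entry_id=None):
    """Build the request that creates an entry.

    Without entry_id: POST to the collection URI, and the server assigns the ID.
    With entry_id: PUT to the entry URI, and the client has chosen the ID.
    """
    collection_uri = f"{base_uri}/{workspace}/{collection}"
    if entry_id is None:
        method, uri = "POST", collection_uri
    else:
        method, uri = "PUT", f"{collection_uri}/{entry_id}.xml"
    return Request(
        uri,
        data=entry_xml.encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )


ENTRY = '<entry xmlns="http://www.w3.org/2005/Atom"><title>Widget</title></entry>'

post_req = build_create_request("http://your-atom-server", "widgets", "acme", ENTRY)
put_req = build_create_request("http://your-atom-server", "widgets", "acme", ENTRY,
                               entry_id=1000)
```

Passing either object to `urllib.request.urlopen` would send it; that step is left out here because it needs a running server.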

Guaranteeing Data Integrity with Optimistic Concurrency

In order to guarantee consistent, predictable data in the highly distributed world of Atom, AtomServer uses Optimistic Concurrency to manage writes to the system. Optimistic concurrency states that a writer to the AtomServer must know the current revision number of the resource he is editing, and that he should write to the resource assuming that he will be able to complete the write operation, but be able to gracefully handle the case where someone has written the resource in the meantime.

For example, assume that systems A and B both want to make changes to the Acme Widgets in a given data feed. A comes along, and asks for the current representation of Acme Widget 123:

GET http://your-atom-server/widgets/acme/123.xml

A receives the following “edit link” in the response:

<link href="/widgets/acme/123.xml/2" rel="edit"/>

And now A sets about making its edits to the representation of 123.xml. While A is doing its work, B comes along and requests the current version of 123.xml, and gets the same response. B’s edits take less time than A’s, so B immediately writes its changes back to the edit link:

PUT http://your-atom-server/widgets/acme/123.xml/2

This succeeds, returning a 200 OK, letting B know that its edit has been committed successfully to the AtomServer. Now, A finishes its edit and tries to write to the same edit link, but the revision number has been updated due to B’s edit. Consequently, A will receive a 409 CONFLICT HTTP error, indicating that someone has changed the resource he is attempting to update since he last refreshed his view of the resource from the server. In this case, A should GET the resource again, this time getting a new edit link, and repeat the process. Note that this allows A to make his changes to a copy of /widgets/acme/123.xml that already contains B’s changes, so the system prevents A from overwriting B’s changes blindly.

In many systems, there will be only a single, authoritative writer for a given set of data. In those cases, to reduce the overhead of managing Optimistic Concurrency, it is possible to override optimistic concurrency by providing an asterisk (*) as the revision number:

PUT http://your-atom-server/widgets/acme/123.xml/*

However, it is important to use this feature only when a client knows for certain that it will be the only writer to a given resource.
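The write-and-retry cycle described above can be modeled without a server. The following Python sketch is a simplified, in-memory stand-in and not AtomServer's implementation: `InMemoryEntry`, `Conflict`, and `update_with_retry` are hypothetical names, the revision is treated as an opaque number taken from the edit link, and the `Conflict` exception plays the role of the 409 CONFLICT response.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 CONFLICT response."""


class InMemoryEntry:
    """Toy model of one entry guarded by a revision number."""

    def __init__(self, content, revision=2):
        self.content = content
        self.revision = revision

    def get(self):
        # A real server would return the revision inside the edit link.
        return dict(self.content), self.revision

    def put(self, content, revision):
        # "*" overrides the check, as described for single-writer systems.
        if revision != "*" and revision != self.revision:
            raise Conflict(f"expected revision {self.revision}, got {revision}")
        self.content = dict(content)
        self.revision += 1


def update_with_retry(entry, edit, max_attempts=5):
    """GET, apply edit, PUT; on a conflict, re-read and try again."""
    for _ in range(max_attempts):
        content, revision = entry.get()
        try:
            entry.put(edit(content), revision)
            return
        except Conflict:
            continue  # someone else wrote first; refresh and reapply
    raise RuntimeError("gave up after repeated conflicts")


# Replay the A/B scenario from the text.
widget = InMemoryEntry({"color": "red"})
a_content, a_rev = widget.get()
b_content, b_rev = widget.get()
widget.put({**b_content, "size": "big"}, b_rev)  # B writes first and succeeds
try:
    widget.put({**a_content, "color": "blue"}, a_rev)  # A's revision is stale
    a_conflicted = False
except Conflict:
    a_conflicted = True
update_with_retry(widget, lambda c: {**c, "color": "blue"})  # A retries
```

Because A's retry starts from a fresh read, its change is applied on top of B's instead of replacing it.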

Category Queries

In Atom, Categories are specified on entries as a pair of values – the Scheme and the Term. The Scheme is essentially a “namespace” of categories, and the Term is the specific value within that namespace.

Borrowing from GData’s extensions to Atom, AtomServer allows for a special feed syntax that lets clients filter a feed by the categories applied to entries. For example, to get a feed of only the Acme Widgets that had been tagged as having the color “red,” the request could be:

GET http://your-atom-server/widgets/acme/-/(urn:colors)red

If multiple categories are specified, only entries that have ALL of the given categories (a Boolean AND) will be received. For example,

GET http://your-atom-server/widgets/acme/-/(urn:colors)red/(urn:size)big

will return all of the big, red Acme Widgets. An arbitrary combination of categories using AND and OR, in prefix notation, can also be specified:

GET http://your-atom-server/widgets/acme/-/OR/(urn:colors)red/AND/(urn:size)big/(urn:colors)blue

This will return all of the Acme Widgets that are either red, or are both big and blue. These feeds should be treated by the client exactly the same as any other feed – they can be polled and paginated just like a feed without categories.
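For clients that assemble these URIs programmatically, a small helper keeps the prefix notation readable. The Python sketch below is illustrative only: all function names are made up, expressions are modeled as nested tuples, AND and OR are assumed to take exactly two operands, and no URL escaping is applied. The `matches` function is a local restatement of the filter semantics, not something the server exposes.

```python
def category(scheme, term):
    """Format one category as (scheme)term."""
    return f"({scheme}){term}"


def to_path(expr):
    """Render a category string or an (op, left, right) tuple in prefix notation."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    return f"{op}/{to_path(left)}/{to_path(right)}"


def category_feed_uri(feed_uri, expr):
    """Append a category expression to a feed URI after the /-/ separator."""
    return f"{feed_uri}/-/{to_path(expr)}"


def all_of_uri(feed_uri, categories):
    """Listing categories one after another means an implicit AND."""
    return f"{feed_uri}/-/" + "/".join(categories)


def matches(expr, tags):
    """Evaluate the same expression against the set of categories on one entry."""
    if isinstance(expr, str):
        return expr in tags
    op, left, right = expr
    if op == "AND":
        return matches(left, tags) and matches(right, tags)
    if op == "OR":
        return matches(left, tags) or matches(right, tags)
    raise ValueError(f"unknown operator {op!r}")


FEED = "http://your-atom-server/widgets/acme"
red = category("urn:colors", "red")
blue = category("urn:colors", "blue")
big = category("urn:size", "big")

red_or_big_blue = ("OR", red, ("AND", big, blue))
uri = category_feed_uri(FEED, red_or_big_blue)
```

Here `uri` is the feed URI for widgets that are either red, or both big and blue, and `all_of_uri(FEED, [red, big])` produces the simpler form for big, red widgets.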

AtomServer

AtomServer is an off-the-shelf implementation of an Atom Store. It is implemented as a Java web application, and should deploy into any J2EE Servlet Container. Under the covers, AtomServer uses the Apache Project’s open-source implementation of the Atom Protocol, called Abdera, to process the RESTful verbs and XML vocabulary of Atom.

Abdera is an excellent library for adding an Atom front-end to an existing application. AtomServer, by contrast, is a full-fledged Atom Store. Out of the box, it provides all of the components needed to store and interact with the Atom metadata, as well as the contents of the Atom entries themselves.

AtomServer’s protocol borrows from GData’s design wherever appropriate. In some cases we’ve made slightly different decisions to improve URL readability, to simplify query structures, or to implement features not covered by GData’s specification.

AtomServer manages all of the Atom metadata associated with an entry in a relational database, and stores the actual entry content either in a relational database or on a file system, depending on your particular needs. AtomServer automatically handles all of the aspects of the Atom protocol (URI interpretation, parsing of the Atom elements and extensions, update timestamps, entry categorization), so you only need to publish changes to the server and poll at intervals for feed changes.

AtomServer is easy to use. It deploys either as a simple WAR file, or alternatively, as a standalone server, running within an embedded Jetty Server. Most applications should be able to use AtomServer by simply providing a very small amount of configuration – a few Spring Beans that configure that application's Atom workspaces and the content storage.

Conclusion

AtomServer has several important, advanced features that we have not yet covered. We’ve run out of space for those features this time, but in an upcoming article we will dive into:

  • An auto-tagger for Atom categories – an easily configured mechanism to "auto tag" entries when they are either created or updated. An XPathAutoTagger is built in, which allows you to XPath into your content and conditionally associate Atom categories with it.

  • Batch operations – full support for operating on a mixed "batch" of create, update, or delete requests.

  • Aggregate feeds – the powerful ability to join together disparate entries, from different collections or workspaces, into an aggregate entry using Atom categories, and to request feeds of these aggregates. So instead of forcing you to deal with several feeds and tie the information back together yourself, you can listen to a single aggregate feed, which will reflect the changes to any of its parts.

AtomServer is real. It is in live production use at our company, handling more than a million requests a day, with several million entries in our store. Building on a RESTful specification such as Atom while leveraging the design of existing services like GData has ensured a solid foundation on which to build. We hope that you will pick up a copy and tell us what you think. You can get AtomServer from http://www.atomserver.org, and be up and running in minutes.
