Introduction
This series is an investigation of how the Web and its related technologies can help solve many of the information management, data integration and software architectural problems that have been plaguing us for years. The goal is to achieve some of the promises of the Service-Oriented Architecture vision in a scalable and flexible way. These ideas do not represent the only way to successfully build modern systems and there are places they do not apply. The series is intended, however, to pitch a consistent, coherent and viable vision based on the choices that have made the Web such a success. In some ways, it will seem to fly in the face of the guidance the software industry has been providing. In other ways, it represents some of the best thinking by an amalgam of researchers and practitioners who have spent more time thinking about these problems than most of us can imagine.
In the first article of this series, we took a deeper dive into the REpresentational State Transfer (REST) architectural style as a unifying basis for the subsequent discussion. Non-RESTful systems can certainly participate in this vision, but there are specific benefits to taking an information-focused approach. The clients of our systems rarely care what tools we use to satisfy their needs unless there is a compelling optimization or productivity gain from a particular choice. They do care, however, about being able to browse information and control the context in which it is used.
The benefits afforded by RESTful systems are flexibility, discoverability, productivity, scalability and a consistent, simple interaction style. Fully embracing the style is not easy, but it does have some clear rewards. While we usually focus on invoking a service through its uniform interface, we often forget the URL's other role as an identifier. The following URL:
http://someserver.com/report/2010/01/05
certainly looks like an interface to some kind of a reporting system we could invoke via HTTP. It is also a unique identifier for a specific report. Keep in mind that URLs are also URIs, so they serve the dual purpose of identifying resources and being resolvable. If we wanted to comment on the report, provide a rating, indicate authorship or attach any other kind of metadata, this URL is a useful handle with which to do so, even if we never resolve the reference. It is a global name that uniquely identifies an instance of the reporting service. In this way, the URIs of the Web (usually URLs) unify access to documents, data, services and, as we shall see, concepts. This consistent naming scheme is the start. Being able to resolve the references on demand in a content-negotiated way is a powerful next step. Now that we have names for all of our various information resources, we need a way to describe them collectively, via arbitrary data, through a common mechanism.
Resource Metadata
As an example of describing these resources, we might like to say:
- Brian Sletten created the service http://someserver.com/report/2010/01/05.
- Brian Sletten is a Person.
- The service http://someserver.com/report/2010/01/05 was published on 2010-01-05.
- The service http://someserver.com/report/2010/01/05 is a Payroll Reporting Service.
These facts represent different kinds of relationships, but they all follow the same pattern: Associate a value with a subject through some kind of a relationship. How you choose to identify and represent the subjects, relationships and values will have important consequences, so let's dig into the topic a bit more.
The publication date statement seems relatively straightforward; a date is a date is a date. The authorship statement is a little more complicated. We could simply use the name "Brian Sletten" as the value of the statement. In some circumstances, this might be sufficient. But what if someone wants to contact the person responsible for creating the service? Without more information, we are putting the burden on them to track down the contact information. It would be just as easy for us to use a URL (assuming one existed) to refer to the creator. Something like http://server/employee/bsletten would do nicely. It allows us to disambiguate references to "Brian Sletten" in a global context. A side benefit of minting resource locators for people is that they become resolvable content like anything else. Once we know who created the service, we can resolve the reference to retrieve things like contact information.
This is where we bump into REST again. We can define a default response format of HTML for the person reference if the user is coming in through a browser, which makes it easy for another human to find what she is looking for. However, we can also support content negotiation to request the information back as XML, JSON or RDF. Whether we express information as data about the resource (retrieved when we resolve the reference) or as metadata (stored externally about the resource) will be resource-specific. In some cases, you may wish to do both. Giving resolvable URLs to non-network resources such as people and concepts has traditionally been problematic: are you referring to the person or a document about the person? We will discuss some solutions in subsequent articles. For now, we'll just assume that we have a mechanism for resolving the ambiguity.
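To make the negotiation concrete, here is a minimal sketch in Java of a client asking for RDF instead of HTML; the employee URL is the made-up example from above, and we are assuming a server that actually honors the Accept header:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PersonLookup {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://server/employee/bsletten");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // A browser would send Accept: text/html; we ask for RDF/XML instead
        conn.setRequestProperty("Accept", "application/rdf+xml");
        System.out.println("Content-Type: " + conn.getContentType());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}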
Another benefit of using a URL to represent "Brian Sletten" is that we can hang statements on it, such as that he is a Person, he has an email address, etc. In this way, we can grow a graph of related facts organically. We connect facts about one subject to other subjects (as values in that relationship). We can imagine asking questions about the values that are related to other subjects as we discover references to the new subjects. A graph is a much more extensible model than a table, so we can easily support new facts about any of our nodes over time. As we learn new things about resources, we can attach them to our model. This has the benefit of allowing us to accumulate knowledge from a variety of sources. It also allows a decentralized system where anyone can express facts about anything. Whether we choose to include their facts in our consideration is up to us.
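As a sketch of how such a graph grows, here is how those two facts might be asserted with Apache Jena (one common RDF toolkit for Java, not the only option); the URLs are this article's made-up examples, and the Person and mbox terms come from a vocabulary we will meet later in the article:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class GrowingGraph {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource brian = model.createResource("http://server/employee/bsletten");
        Resource person = model.createResource("http://xmlns.com/foaf/0.1/Person");
        Property mbox = model.createProperty("http://xmlns.com/foaf/0.1/mbox");
        // Each call adds one fact to the graph; new facts can arrive at any time
        brian.addProperty(RDF.type, person);
        brian.addProperty(mbox, "bsletten@example.com");
        model.write(System.out, "TURTLE");
    }
}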
The final value we identified above, Payroll Reporting Service, is a term that means something to our organization. Presumably there are other kinds of services and they are organized in a meaningful fashion (e.g. a Payroll Reporting Service is a type of a Reporting Service). We will need additional mechanisms for organizing terms into these relationships but we will return to that in the next article.
We must first consider the relationships that connect these values to the subject. If we were using a relational database to capture these results, we might create column names such as "creator", "publish_date" or "instance_of". We would likely use a technology like Hibernate to map these column names into an object model so we could write software to use the information. This approach might work for a small number of relationships, but it clearly will not scale up to support arbitrary metadata; we cannot keep adding columns anytime someone says something new about a resource. We would have lots of shallow tables with only a few entries.
Assuming we did head down that path, however, we would run into further problems when we consider merging multiple databases of resource metadata. Perhaps the engineering and marketing departments have different IT staff who capture the metadata differently. In order to allow a common view across these systems, we will probably have to write a new layer, create a merged database or do something else painful and harder than it needs to be. Doing so once or twice is possible. Doing it for every new integration is unthinkable. Once we contemplate partner integration strategies of services across organizational boundaries we can pretty much just give up.
Advocates of the WS-* technology stacks might jump up and say that they have solved the problem with common schema systems, UDDI metadata repositories and the like, but they really have not. The problem is that any approach that involves a common model outside of a particular community of interest is likely to fail. Forcing partners to use the same schemas ignores the fact that, collectively, we simply do not see the world the same way. Nor do we have the same information needs. Common models become least-common-denominator models; invariably, someone's needs are left on the floor. This is not to say that the WS-* technology stack does not solve some problems; it simply does not solve this one.
But again, let us assume we have committed to a strategy that relies on some kind of common model. Inevitably, we will have to integrate with another partner who has not committed to the same strategy. How do we align our metadata then? How do we explain what the terms mean? How do we connect our terms and relationships to their terms and relationships?
Finally, once we decide to move beyond the realm of metadata, how does actual data fit in? Do we have to start from scratch with a new common data model? How do we connect our metadata to our data? By now it should be clear that many of the technology choices of the past twenty years or so have simply entered the fray at the wrong level. We attempt to model domains and processes and people and data separately using insufficient and rigid abstractions. We ignore realities about how information is produced and consumed. Information does not have a format and rarely has boundaries. There is no explicit distinction between data and metadata. Perhaps most fundamentally, we do not and usually will not agree to a consistent world view. Any top-down IT initiative that ham-handedly tries to get around this reality is doomed to fail, as we have seen time and time again.
We need another strategy that combines the efficiencies of top-down efforts with the reality of organic, bottom-up perspectives. We need a data model that frees us from the constraints of a particular schema, language, object model, product or world view. We need to encourage people to agree where they do, but allow them to disagree as well. We need to let them care about different things and be fully supported by any modeling activities. Where we cannot support them centrally, we need to allow their needs to be met on the edge. This is not to say we are heading toward anarchy; where we need to validate and restrict, we certainly can.
The W3C's Semantic Web initiative is a collection of technologies that helps provide just these features. We will explore the larger goals of the effort over time, but for now, we will focus on the Resource Description Framework (RDF) and SPARQL for their capacity to describe, link and query resource metadata.
RDF
RDF is a building block technology. It represents an extensible data model that supports the Open World Assumption. It is useful as a metadata model for describing information resources. As we shall see in later articles, however, it provides a mechanism for expressing data as well. When we build upon a consistent naming scheme (URIs and URLs) and the Web Architecture of loosely-coupled resources negotiated into different forms, RDF gives us the ability to link and describe all of these resources in powerful ways.
At a basic level, you can imagine a series of facts expressed in RDF. We can accumulate these facts over time and from a variety of sources into a model backed by a directed graph. Whatever form the source data is in, we can usually convert it into RDF relationships and add them to our knowledge base. In this way, we sidestep the need for schema integration efforts.
RDF "facts" are expressions that relate the subject to a value. The subjects are either URI-addressed resources or unnamed blank nodes. The values can be literal values (strings, dates, numbers) or other URI-addressed nodes. The subject of one statement might be the value of another. Let's consider the facts we wanted to express about the reporting reporting service above. The subject is easy http://someserver.com/report/2010/01/05
. The date value is also easy: '2010-01-05'
. How we wish to connect them requires a bit more thought. We want to refer to the publication date of the service. We could make up our own term but we do not need to. There is a widely used metadata specification called Dublin Core (http://dublincore.org/) that involves publication metadata. It was developed by a group of librarians interested in standard terms for referring to journals, books, online articles, etc. We are not required to use their terms, but there is no reason not to.
By browsing around their site, we see that there are quite a few useful terms such as title, description and license. To indicate when the service was published, the 'date' term seems promising. By clicking through the link, we see a description of the term, how it applies and what its intent is. This collection of RDF terms is referred to as a vocabulary. An RDF vocabulary is usually domain-specific (in this case, publication metadata). Each term is described so as to be useful to both software and humans.
What we quickly see is that, in addition to being well described, the term is grounded in a URL: http://purl.org/dc/terms/date. This is a globally unique name for the term. If we choose to use this term, others will know exactly what the term means even if they have never seen it before. Besides the human-readable description, there is a machine-readable description as well. We can get to it by resolving the URL for the term. By clicking through the URL above, we get redirected to http://dublincore.org/2008/01/14/dcterms.rdf#date. This redirection is an important aspect of URI curation, but we will return to that in a future article; it is not necessary for simply publishing RDF vocabularies.
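Sketched at the wire level, the exchange looks roughly like this (headers trimmed, and the exact status code may differ):

GET /dc/terms/date HTTP/1.1
Host: purl.org

HTTP/1.1 302 Found
Location: http://dublincore.org/2008/01/14/dcterms.rdf#date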
What we get back when we resolve an RDF vocabulary is usually an RDF/XML serialization of an RDF model. In this way, RDF is used to describe itself. Don't get too stressed about how to interpret the complex model yet, but if you are curious, it is a series of facts encoded in hierarchical XML entity relationships.
<rdf:RDF xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dcam="http://purl.org/dc/dcam/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  ...
  <rdf:Description rdf:about="http://purl.org/dc/terms/date">
    <rdfs:label xml:lang="en-US">Date</rdfs:label>
    <rdfs:comment xml:lang="en-US">A point or period of time associated with an event in the lifecycle of the resource.</rdfs:comment>
    <dcterms:description xml:lang="en-US">Date may be used to express temporal information at any level of granularity. Recommended best practice is to use an encoding scheme, such as the W3CDTF profile of ISO 8601 [W3CDTF].</dcterms:description>
    <rdfs:isDefinedBy rdf:resource="http://purl.org/dc/terms/"/>
    <dcterms:issued>2008-01-14</dcterms:issued>
    <dcterms:modified>2008-01-14</dcterms:modified>
    <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
    <dcterms:hasVersion rdf:resource="http://dublincore.org/usage/terms/history/#dateT-001"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
    <rdfs:subPropertyOf rdf:resource="http://purl.org/dc/elements/1.1/date"/>
  </rdf:Description>
  ...
</rdf:RDF>
Terms relate to an outer rdf:Description resource referenced by an rdf:about or rdf:ID attribute. Again, we'll dig deeper into these ideas later, but for now, try to read it as follows. Find the things that are being talked rdf:about (remember, the full URI expands within the rdf namespace grounded at http://www.w3.org/1999/02/22-rdf-syntax-ns#). Here, this indicates the subjects of the RDF facts. As an example, we can unwind the relationships expressed about the dcterms:date term:
- <http://purl.org/dc/terms/date> rdfs:label "Date" .
- <http://purl.org/dc/terms/date> rdfs:comment "A point or period of time associated with an event in the lifecycle of the resource." .
- <http://purl.org/dc/terms/date> dcterms:issued "2008-01-14" .
- ...
By now, you are probably thinking this is incredible overkill and that a column in a database is a much simpler implementation strategy. That may be, but you lose a lot in the process. A database column means nothing outside of the database; you cannot even refer to the relationship outside of a SQL script or object-relational mapping (ORM) configuration file. Because you cannot refer to the relationship, you cannot connect that aspect of the record to another, unrelated database. Or information captured in a spreadsheet. Or a report. Or an RSS feed. Or an online service. Information comes to us in many forms these days and we simply cannot rely on relational database tables to be where we do all of our integration. We need a way to capture and represent how information is connected outside of any particular programming language or database technology in order to unify it all.
Another problem is that we generally cannot refer to our records outside of the context of our databases either. Certainly, we have identifiers, but how expressive is a long series of digits? Is that the primary key? A global key? In what context? One of the main values of a URL is that it grounds the identity into a global context. Any reference in any context can be connected to other references. You might assume that this means that we all need to use the same terms, but this is not the case. There are efficiencies to doing so, but we will learn how to get around this later in the series.
Fortunately, we are not suggesting you toss your relational databases. They are well-understood, ubiquitous, well-tooled and here to stay. The RDF model can still be used though. Logical references to records can be achieved by using a RESTful URL. Where we need to, we can map column names to RDF relationships (or just make them up on the fly -- this is not as crazy as it sounds). What we have accomplished is having a logical way of referring to information in whatever form it is, outside of the context of that form. When we come across a resource reference, we can query a metadata repository for more information about it or we can simply resolve it and see what we get back. The MIME type of what is returned will give us an indication of how to parse the response. We can discover references to new and related information within the responses. We can also discover alternative forms to retrieve the information in a more convenient format. The devil is still in the details, but this is the basic concept of the vision. We are embracing Webs of information, public and private.
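As a hint of how mechanical this can be, a toolkit like Jena will dereference a URL and choose a parser based on the MIME type of what comes back. A sketch, using the Dublin Core term URL from earlier:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class ResolveReference {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // Performs an HTTP GET; the Content-Type of the response
        // tells Jena which RDF parser to use
        model.read("http://purl.org/dc/terms/date");
        System.out.println("Parsed " + model.size() + " facts.");
    }
}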
Now that we understand more of the environment, how do we express an RDF fact? There are several RDF serialization formats we can use. We have already seen RDF/XML and it is pretty verbose. Expressing the fact in the Turtle format is deceptively simple:
<http://someserver.com/report/2010/01/05> <http://purl.org/dc/terms/date> "2010-01-05" .
We have a subject, a predicate and a value; a triple, or single fact. Keep in mind that RDF is a graph model; the serialization formats are simply how we store or transfer the information. The above triple would be parsed into a model that looks conceptually like this:

[Figure: a graph with a single arc, labeled dc:date, from the node http://someserver.com/report/2010/01/05 to the literal "2010-01-05"]
Now we can extend our knowledge base with some more metadata about the resource in question:
<http://someserver.com/report/2010/01/05> <http://purl.org/dc/terms/creator> <http://server/employee/bsletten> .
We now have two triples and our graph would look like:

[Figure: the report node now has two outgoing arcs: dc:date to the literal "2010-01-05" and dc:creator to the node http://server/employee/bsletten]
Our facts serialized in RDF/XML would look like this (yes, it is kind of ugly):
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://someserver.com/report/2010/01/05">
    <creator xmlns="http://purl.org/dc/terms/" rdf:resource="http://server/employee/bsletten"/>
    <date xmlns="http://purl.org/dc/terms/">2010-01-05</date>
  </rdf:Description>
</rdf:RDF>
The point in highlighting this is that RDF data can be incredibly fluid. It can exist in a native RDF format (such as RDF/XML) or it can be generated on the fly from another data source. It is also eminently portable. Whatever form the native data is stored in, once converted into RDF, it can be exported from one store and imported into another RDF-aware store trivially.
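As a sketch of that portability, assuming the facts above have been saved to a local file named facts.rdf, a few lines of Jena read the RDF/XML in and write the very same graph back out as Turtle:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class ConvertFormats {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("file:facts.rdf");       // parse the RDF/XML serialization
        model.write(System.out, "TURTLE");  // emit the same graph as Turtle
    }
}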
Let's add a few more facts to our knowledge base while we are at it.
<http://server/employee/bsletten> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://someserver.com/order/1234> <http://purl.org/dc/terms/creator> <http://server/dept/eng> .
<http://server/employee/bsletten> <http://xmlns.com/foaf/0.1/mbox> "bsletten@example.com" .
Here we have indicated that "Brian Sletten" is a Person. We have chosen the term 'Person' from the Friend-of-a-Friend (FOAF) vocabulary (http://foaf-project.org). This is another widely used collection of terms for describing social networks, professional interests, educational backgrounds, etc. We also indicate that a different service was created by the engineering department; we use a made-up URL for it so that our query examples include a Dublin Core 'creator' that is not a person. Finally, we added Brian's (fake) e-mail address. Our final knowledge base is shown in the following RDF/XML serialization:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://server/employee/bsletten">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <mbox xmlns="http://xmlns.com/foaf/0.1/">bsletten@example.com</mbox>
  </rdf:Description>
  <rdf:Description rdf:about="http://someserver.com/order/1234">
    <creator xmlns="http://purl.org/dc/terms/" rdf:resource="http://server/dept/eng"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://someserver.com/report/2010/01/05">
    <creator xmlns="http://purl.org/dc/terms/" rdf:resource="http://server/employee/bsletten"/>
    <date xmlns="http://purl.org/dc/terms/">2010-01-05</date>
  </rdf:Description>
</rdf:RDF>
Programmatically, you would read triples into a model and then query it for certain patterns. We will defer fuller code examples until the next article, but a small sketch of the loading half follows. To query, we need a query language, and fortunately SPARQL is a W3C standard for querying RDF graphs.
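The sketch again assumes Jena and the facts.rdf file saved above; it loads the model and prints every fact it contains:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

public class ListFacts {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("file:facts.rdf");
        // Walk every statement (triple) in the graph
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            System.out.println(stmt.getSubject() + " " +
                               stmt.getPredicate() + " " + stmt.getObject());
        }
    }
}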
Querying RDF
SPARQL is a recursive acronym that stands for the SPARQL Protocol and RDF Query Language. We will dive more deeply into it in part three of this series, but for now we will touch on the basics needed to extract information from our knowledge base. For our purposes, SPARQL allows you to express patterns that are matched against an RDF graph. The graph is either a "default graph" (specified outside of the query) or one or more named graphs (specified within the query). We are going to use Twinkle's SPARQL handling to keep things simple. If you want to follow along, go download Twinkle; it is a tool for querying SPARQL data sources.
Copy the final RDF/XML serialization above and paste it into a file such as /Users/brian/facts.rdf. Now run Twinkle:
java -jar twinkle.jar
You should see a Swing user interface show up with a default query window. In that window, select the File button and browse to the file you just saved out. Select it. This will now establish that data as the default graph for our queries. If you are comfortable with SQL, SPARQL will (eventually) feel very comfortable to you as well. We are going to match patterns to select results from the graph.
A basic query that lists all of the facts in a graph is:
select ?s ?p ?o where { ?s ?p ?o . }
Your results should look something like the following:

[Figure: Twinkle's result table, with one row per matching triple and columns for s, p and o]

We have asked SPARQL to find all triples that match a pattern of unbound variables (avoid doing this on large graphs!). The result set is a table with each triple spelled out. We used the names s, p and o to represent the unbound variables that we want to match, based on their positions in statements: subject, predicate and object. The query engine matches these patterns against the arcs and nodes of the graph. The names of the variables are less important than their positions in the patterns, although, as always, good variable names make queries easier to understand. A more constrained query might be something like:
select ?service where { ?service <http://purl.org/dc/terms/creator> ?creator . }
In this case, we have chosen different names for the variables and are only selecting for one of them. At this point, we do not care who the creator was; we just want to know any service that has a creator. Try modifying the query so you get both the service and the creator. Now, how about just the creator? You will quickly get tired of typing URLs. To make it easier, SPARQL has support for prefixes:
PREFIX dc: <http://purl.org/dc/terms/>
select ?service where { ?service dc:creator ?creator . }
The final query we will do for now shows how we can use the shape of the graph to ask more pointed questions. We have two creators in our knowledge base: one Person and one non-Person. To be accurate, we don't know what the non-Person creator is; we have only said in the text that it is the engineering department. As far as our facts are concerned, we know nothing else about it. If we wanted to ask for the e-mail address of anyone who is a Person and the creator of a service, we could use:
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?mbox where {
  ?s dc:creator ?c .
  ?c rdf:type foaf:Person .
  ?c foaf:mbox ?mbox .
}
There are some other syntactic conveniences we could have leveraged (one is shown below), but for now we just want to introduce the idea that we can query along the graph: for everything (?s) that has a dc:creator (?c), if that creator (?c) is a foaf:Person, grab its e-mail address if we know it.
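One such convenience: as in Turtle, the keyword 'a' abbreviates rdf:type, and a semicolon repeats the subject, so the same query can be written as:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
select ?mbox where {
  ?s dc:creator ?c .
  ?c a foaf:Person ;
     foaf:mbox ?mbox .
}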
Perhaps you can start to see how we can explore information spaces by querying how information resources are connected: people to publications to topics to geographic regions and on and on. In the process, we will discover what relationships exist and how things are related, and allow people to take advantage of and build upon what they know as never before.
Conclusion
We have touched on only the beginning of these concepts, but hopefully the idea of loosely-coupled information resources is starting to make some sense. We can keep the information in the form it is stored in natively, but access it in a different form, on demand. The technology stack associated with the Semantic Web is by no means perfect. The parties involved have spent years arguing over the most minute details. These disagreements continue to this day. But they have made some very important discoveries about how we can name and resolve information flexibly in a wide variety of contexts. These initial choices compound to create a vibrant ecosystem of information. We can get around many of the high-inertia obstacles that have thwarted our IT systems in the past. In the next article we will take another step down the Rabbit Hole to explore the deeper realities of these webs of data.