A taste of semantic web and RDFa

    The amount of data stored on the web is vast. Traditionally, most of it has been intelligible only to humans. The next step is the semantic web. Imagine software that can tap into this enormous source of data and perform queries for you, presenting the results in one single place. This becomes possible once we introduce semantics to web content.

    Semantics

    The idea is to make web content machine readable. For this, we need semantics. We already have markup languages that offer some semantics. The content on the web today mainly consists of HTML documents. A program can look at HTML code, extract text, links and images, and interpret some of the structure. There is, however, nothing that explains what a certain block of content is meant to represent: there are no tags defined for an Author or a Book. With XML this is possible by defining types in an XML schema. XML is indeed machine readable, but we need something more dynamic for the semantic web. We need to identify pieces of content and the relations between them.

    Triples

    Relations can be defined by so-called triples. A triple consists of a subject, a predicate and an object. Such a triple could be Michael knows John. Michael is the subject, knows is the predicate and John is the object. We need to identify these items and on the web it's natural to use URIs for this purpose.

    Michael has a homepage at michaelrocks.com. John doesn't have his own homepage, but he has a profile on his employer's website at fancycakes.com/John. Let's rewrite our triple with these identifiers: michaelrocks.com knows fancycakes.com/John. Okay! A program could almost understand this. It only needs to know what knows means. Let's say we want to find all the people who know the people Michael knows. The problem is that this is the vast web: Michael's website has a vocabulary in which this relation is named knows, but the site of John's employer might name it differently. To run this kind of query, we need to be able to compare relations, and that requires a common vocabulary. Their wise friend Jill has created a vocabulary that the two sites can share, so we can add an identifier to the predicate as well. From Michael's website we can extract the following triple: michaelrocks.com jill.net/vocabulary#knows fancycakes.com/John.
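
    To make the idea concrete, here is a minimal Python sketch of triples as plain tuples, with the query "who knows the people Michael knows" written as a set comprehension. The identifier example.org/Jane and the second triple are invented for illustration; only the first triple comes from the text above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Predicate identifier from Jill's shared vocabulary (taken from the text).
KNOWS = "jill.net/vocabulary#knows"

triples = [
    Triple("michaelrocks.com", KNOWS, "fancycakes.com/John"),
    # Hypothetical triple, as if extracted from some other website.
    Triple("example.org/Jane", KNOWS, "fancycakes.com/John"),
]


def objects_of(subject: str, predicate: str, data: list[Triple]) -> set[str]:
    """Return every object linked to `subject` through `predicate`."""
    return {t.obj for t in data if t.subject == subject and t.predicate == predicate}


def who_knows_contacts_of(person: str, data: list[Triple]) -> set[str]:
    """Return everyone (except `person`) who knows somebody `person` knows."""
    contacts = objects_of(person, KNOWS, data)
    return {
        t.subject
        for t in data
        if t.predicate == KNOWS and t.obj in contacts and t.subject != person
    }


print(who_knows_contacts_of("michaelrocks.com", triples))
```

    The comparison only works because both triples use the same predicate identifier, which is exactly what the shared vocabulary buys us.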

    RDF and RDFa

    W3C has defined a standard called RDF, commonly serialized as XML (RDF/XML), that lets us define such triples. We are, however, not going to dive into that. If Michael wanted to make his triples available on the web, he'd have to publish RDF files in addition to his HTML files. Duplicate data is not desirable. Also, there's no unified, standardized way of connecting HTML content with RDF files. This is where the W3C recommendation RDFa comes in. RDFa lets us define the triples directly in XHTML code by offering some extra attributes.

    In essence, RDFa lets us use attributes such as "about", "src", "property", "rel", "rev" and "content" on any element. Several of these attributes aren't new to HTML, but there they are restricted to certain elements. In fact, we can already create "triples" in plain HTML with the base, link and meta elements, but those apply to the whole HTML document. RDFa lets us add triples anywhere in the document. A simple library example is given below.

    <?xml version="1.0" encoding="ISO-8859-15"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
        "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:foaf="http://xmlns.com/foaf/0.1/"
        version="XHTML+RDFa 1.0"
        xml:lang="en">
      <head>
        <title>Library @ arenybakk.com</title>
      </head>
      <body>
      
        <div class="book" about="http://www.arenybakk.com/library/mybook">
          <h1 property="dc:title">My Book</h1>
          <div class="byline">
            <span property="dc:date" content="2010-07-10">July 2010</span>
            <a href="http://www.arenybakk.com/user?name=Are%20Nybakk" rel="foaf:maker">Are Nybakk</a>
          </div>
          <p property="dc:description" xml:lang="no">Litt om boka.</p>
        </div>
        
        <div class="book" about="http://www.somedude.com/abc">
          <h1 property="dc:title">A.B.C.</h1>
          <div class="byline">
            <span property="dc:creator">John Doe</span>
          </div>
          <p property="dc:description">An introduction to the english language.</p>
        </div>
        
      </body>
    </html>

    Notice that the XHTML document needs a special DTD in the doctype declaration and a version attribute on the html element. Also notice that vocabularies are referenced through XML namespaces. There don't seem to be many widely used vocabularies yet, but there are two popular ones, and both are used in this example: Dublin Core (dc) and Friend of a Friend (foaf).
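
    To illustrate how a program might read these attributes, here is a rough Python sketch built on the standard library's html.parser. It is not a conforming RDFa processor: it assumes well-formed XHTML-style markup and handles only "about", "property", "content" and "rel" with "href". The class name SimpleRDFaExtractor is made up for this example.

```python
from html.parser import HTMLParser


class SimpleRDFaExtractor(HTMLParser):
    """Collects (subject, predicate, object) triples from a small RDFa subset."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subjects = [None]  # subject in effect for each open element
        self._capture = None     # (depth, predicate, text parts) while reading a literal

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # "about" sets a new subject; otherwise inherit the enclosing one.
        subject = attrs.get("about", self._subjects[-1])
        self._subjects.append(subject)
        if "rel" in attrs and "href" in attrs:
            self.triples.append((subject, attrs["rel"], attrs["href"]))
        if "property" in attrs:
            if "content" in attrs:
                # An explicit "content" value wins over the element text.
                self.triples.append((subject, attrs["property"], attrs["content"]))
            else:
                self._capture = (len(self._subjects), attrs["property"], [])

    def handle_data(self, data):
        if self._capture:
            self._capture[2].append(data)

    def handle_endtag(self, tag):
        # Emit the literal when the element that started the capture closes.
        if self._capture and self._capture[0] == len(self._subjects):
            _, predicate, parts = self._capture
            self.triples.append((self._subjects[-1], predicate, "".join(parts).strip()))
            self._capture = None
        self._subjects.pop()


SAMPLE = """
<div about="http://www.arenybakk.com/library/mybook">
  <h1 property="dc:title">My Book</h1>
  <span property="dc:date" content="2010-07-10">July 2010</span>
  <a href="http://www.arenybakk.com/user?name=Are%20Nybakk" rel="foaf:maker">Are Nybakk</a>
</div>
"""

extractor = SimpleRDFaExtractor()
extractor.feed(SAMPLE)
extractor.close()
for triple in extractor.triples:
    print(triple)
```

    Note how the date triple takes its value from the "content" attribute rather than from the human-friendly text "July 2010".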

    Table view

    Well, that's all nice and stuff, but now let's imagine how this can be used. We have described relationships as triples, but we also know relations from the database world. Let us put the information from the XHTML file into a table. The columns correspond to the predicates, the leftmost cell in each row is the subject, and the remaining cells are the objects. That looks a lot like a table in a database, doesn't it?

    "URI""dc:title""dc:date""dc:creator""foaf:maker""dc:description"
    "http://www.arenybakk.com/mybook""My Book"01.10.2009"http://www.arenybakk.com/user?name=Are%20Nybakk""Litt om boka."
    "http://www.somedude.com/abc""A.B.C.""John Doe""An introduction to the english language."

    Queries

    There is a special query language for the semantic web called SPARQL. We're not going to explore it here, but we can write some SQL-like queries to illustrate how this data can be used. Below are three queries. The first merges data from the library with that of a book database for books written by John Doe. The second looks up entries in the book database by the author who wrote My Book. The third does the same as the second, but uses a different vocabulary (foaf:maker instead of dc:creator) for the lookup.

    -- 1. Books by John Doe, merged with the book database
    SELECT *
      FROM "http://www.arenybakk.com/books" a
        INNER JOIN "http://www.bookdatabase.com/books?all=true" b
          ON b."dc:creator" = a."dc:creator"
      WHERE a."dc:creator" = 'John Doe';

    -- 2. Book database entries by the creator of My Book
    SELECT a.uri
      FROM "http://www.bookdatabase.com/books?all=true" a
      WHERE a."dc:creator" = (
          SELECT "dc:creator"
            FROM "http://www.arenybakk.com/books"
            WHERE "dc:title" = 'My Book'
        );

    -- 3. Same lookup, using foaf:maker instead of dc:creator
    SELECT a.uri
      FROM "http://www.bookdatabase.com/books?all=true" a
      WHERE a."foaf:maker" = (
          SELECT "foaf:maker"
            FROM "http://www.arenybakk.com/books"
            WHERE "dc:title" = 'My Book'
        );
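
    The queries above are only illustrative, since no SQL engine accepts a URL as a table. To show that the first one is otherwise ordinary SQL, here is a small sketch using Python's built-in sqlite3 module. The table names library and bookdatabase, and all rows in bookdatabase, are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE library (uri TEXT, "dc:title" TEXT, "dc:creator" TEXT);
    CREATE TABLE bookdatabase (uri TEXT, "dc:title" TEXT, "dc:creator" TEXT);
""")

# Rows from the table view above (My Book has no dc:creator value).
conn.executemany("INSERT INTO library VALUES (?, ?, ?)", [
    ("http://www.arenybakk.com/library/mybook", "My Book", None),
    ("http://www.somedude.com/abc", "A.B.C.", "John Doe"),
])

# Hypothetical contents of the external book database.
conn.executemany("INSERT INTO bookdatabase VALUES (?, ?, ?)", [
    ("http://www.bookdatabase.com/books/1", "A.B.C.", "John Doe"),
    ("http://www.bookdatabase.com/books/2", "D.E.F.", "John Doe"),
    ("http://www.bookdatabase.com/books/3", "Other", "Jane Roe"),
])

# The first query from the text, with real table names.
rows = conn.execute("""
    SELECT a.uri, b.uri
      FROM library a
        INNER JOIN bookdatabase b
          ON b."dc:creator" = a."dc:creator"
      WHERE a."dc:creator" = 'John Doe'
      ORDER BY b.uri
""").fetchall()

for row in rows:
    print(row)
```

    The column names need double quotes because of the colon, and the join only works because both tables use the same predicate name for the author.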

    This is by no means a complete tutorial on either the semantic web or RDFa, but I hope it sheds some light on the subject. The idea seems very intriguing to me, and I'm sure it will be a hot topic in the years to come.