If you liked my last post, you should take a look at Kevin Kelly’s TED talk about the first and the next 5,000 days of the Web, which a friend pointed out to me. “Smarter” is the first tag Kevin uses to describe the next web, and given the timescale he chooses, it is a safe bet. In the short term, “ubiquitous” (Kevin’s second or third tag) is probably the more likely description of the Web.
Below, I continue exploring how Web 3.0, the Semantic Web and Linked Data interact. I shared my thoughts on Web 3.0 in the previous post, so now let's tackle the Semantic Web, and what it will take for it to really happen.
Semantic Web
What is the Semantic Web? Here again I’ll refer to a post I wrote not too long ago, in which I described it as “a web in which machines get the meaning of information and use that understanding to transform/organize/synthesize data intelligently on our behalf.” Definitions vary, but overall I think we all agree that the Semantic Web is an attempt at enabling machines to better understand and transform data. This is the overarching Objective, with a big O, of the Semantic Web.
In a world with a working Semantic Web, I should be able not only to find out, without launching a full web expedition, which Chinese restaurant within a 5-mile/km radius serves Peking Duck, but also to aggregate and filter information from various subprime real estate lenders by region and map it against mortgage default rates and lenders' pools of debt by risk level, all in a snap. That type of easy data transformation could help avoid a financial crisis of gigantic proportions, which, some would argue, is a handy benefit worth its weight in trillions of dollars.
Because I think the next step, the How, is usually where we get lost and diverge, I’d now like to break the Semantic Web down a little further, while hopefully keeping it simple. I’ll propose that the Semantic Web will really be enabled by two very different things:
- Linked Data (or other formats embedding links at the data level)
- Text Analysis and other technologies to structure data
I explain why I think that below.
But first, a bird’s eye view of the whole Web 3.0 landscape, which should help summarize my perspective on this space.
Linked Data
At the beginning there was unstructured data. And then men (women too, but mostly men, in their usual thirst for an edge over the competing tribes) decided that structuring it made it easier to find, read, and exchange. So they structured data, created formats, lists, tables, agreed on standards and ultimately stumbled upon a key discovery: the relational database, managed by a relational database management system (RDBMS for short) and based on the relational model. That great approach to structuring things opened the door to a whole new world: a database-powered world.
The problem with RDBMSs is that, for all their power and flexibility, they require you to create your tables and decide how they are interlinked before you have populated them, and often before you actually know all you’ll do with them. It’s like setting the walls of a house prior to inhabiting it: it makes complete sense only until you learn that your in-laws are moving in with you. At that point, aside from that urge to run away, you would love to be capable of reconfiguring the house (with a big wall in the middle, preferably). And that’s where it gets tricky, because you need to move the content out, hire contractors or do it yourself if you’re that kind, and then put it all back into the new walls. And that filthy yellow sofa you’ve had since grade 8 doesn’t quite fit in the new place anymore...
Another problem with RDBMSs is that everyone can define their tables one way or the other and, in fact, that’s what they do. In the absence of a meta-language to tell machines what is contained in those tables and how it all ties together, it’s virtually impossible for machines to make two different databases speak with each other. Not to mention billions of them.
So in a way, the RDBMS model is too constraining and too structured for many applications: it’s not modular, not elastic enough. That is one of the reasons why most of the data out there remains unstructured (I read somewhere that unstructured data represented over 95% of the data exchanged daily, which sounds about right, in a wrong kind of way of course). Now think of something that would be modular and elastic.
Enter Linked Data. What Linked Data really does is break down the walls of the RDBMS and offer a semi-structured way to create structured information. In some ways, it bridges the gap between unstructured data and structured data (RDBMS and others). It does that by using RDF, which embeds the linking directly into the data. With RDF, each statement becomes a triple: an association of three parts, each with a role, namely subject, predicate (verb) and object. The subject is an entity, the object is either another entity or a plain value, and the predicate is how they relate to each other.
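To make the subject-predicate-object idea concrete, here is a minimal sketch in Python. It is not real RDF: the `Triple` type, the pizza names and the predicates are all made up for illustration, and plain strings stand in for the URIs a real RDF store would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, how it relates (predicate), and to what (obj)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts; every name here is illustrative only.
triples = [
    Triple("pizza:Margherita", "hasPrice", "9.50"),
    Triple("pizza:Margherita", "hasTopping", "topping:Basil"),
    Triple("topping:Basil", "isA", "Herb"),
]


def describe(subject, store):
    """Return every (predicate, object) pair recorded for a subject."""
    return [(t.predicate, t.obj) for t in store if t.subject == subject]


# A new kind of fact needs no schema change: just append another triple.
triples.append(Triple("pizza:Margherita", "servedAt", "restaurant:Luigis"))

for predicate, obj in describe("pizza:Margherita", triples):
    print(predicate, obj)
```

The point of the sketch is the last step: adding a `servedAt` relationship did not require redefining any table, which is the flexibility the house-and-walls analogy above is about.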
For geekier readers, let me add this thought (other readers can skip ahead to where the geekiest part of this post ends): in a sense, what the RDBMS was doing was compressing RDF by removing duplicate predicates. In a single RDBMS table, every entry in a column relates to the entries in the other columns through the same predicate. If I have a type of pizza in the first column and its price in the second column, we know that every pair in those two columns is linked by a single predicate such as “price of”. And if price is stored in another table, those two tables can be linked by the same predicate once and for all. No need to repeat that predicate a thousand times in my data store. Yet that’s what RDF does, I believe (someone at some point will propose mechanisms to compress it out if they haven’t yet, but then we get back to a less machine-readable design!). This way, if you’re linking from outside to a specific pizza type, the information about just that pizza type comes embedded with a way to access its price too. Obviously, RDF comes at a price, since you now have all this duplicated information to store and process every time. That’s why we are seeing a lot of focus on scalability and processing cost in the industry.
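The duplication described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the table, its column names and the `rows_to_triples` helper are hypothetical, and the first column is assumed to identify the subject.

```python
from collections import Counter

# A relational-style table: each column name is stated once, in the header.
columns = ("name", "price", "size")
rows = [
    ("Margherita", "9.50", "medium"),
    ("Pepperoni", "11.00", "large"),
    ("Hawaiian", "10.50", "medium"),
]


def rows_to_triples(columns, rows):
    """Expand each row into one triple per non-key column.

    The first column is treated as the subject; every other column name
    becomes a predicate that is repeated for each row.
    """
    triples = []
    for row in rows:
        subject = row[0]
        for predicate, value in zip(columns[1:], row[1:]):
            triples.append((subject, predicate, value))
    return triples


triples = rows_to_triples(columns, rows)

# Count how often each predicate is spelled out in the triple form.
predicate_counts = Counter(predicate for _, predicate, _ in triples)
print(predicate_counts)
```

In the table, "price" appears once in the header; in the triple form it is repeated once per pizza. That repetition is the storage cost mentioned above, and it is also what lets each triple stand on its own when something links to a single pizza.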
In the Nov/Dec 2008 issue of Nodalities magazine (this is a link to the text version, but the magazine contains illustrations I highly recommend you take a look at), Bill Roberts of Swirrl provides a great example of the same information structured in an RDBMS and in RDF. I owe a grateful word to Bill and Talis, as this greatly clarified the relationship between those two relational models for me. End of the geekiest part of this post.
In sum, Linked Data offers a new way to establish linkages at the data level, as opposed to the document level we are mainly used to, and it does that in a more flexible manner than relational databases, which already linked data at the data level, but in a pre-defined manner.
To complete the picture, Linked Data also introduces universal pointers in the form of URIs. Those are public addresses that all instances of a similar concept on the web can point to, so that machines can infer the meaning in your document based on what they already know about that address. It also enables indexes to tie together all those related instances, since they all point to the same node. In theory. In practice there is no such thing as truly universal URIs yet, although DBpedia is probably the best-known repository of URIs. Building on my previous post about Information Overload, I suspect we are also going to need much better filters to reduce the linking noise once URIs become more mainstream. But that's another story.
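As a rough sketch of how shared URIs let an index tie related instances together, consider the Python below. The document addresses are hypothetical `example.org` placeholders, and the concept URIs merely follow DBpedia's `http://dbpedia.org/resource/...` naming pattern for illustration; nothing here is fetched or checked against DBpedia.

```python
from collections import defaultdict

# Concept URI following DBpedia's naming pattern (illustrative, not verified).
PEKING_DUCK = "http://dbpedia.org/resource/Peking_duck"

# Hypothetical documents, each listing the concept URIs it points to.
documents = {
    "http://example.org/restaurant-a/menu": [PEKING_DUCK],
    "http://example.org/food-blog/post-12": [
        PEKING_DUCK,
        "http://dbpedia.org/resource/Beijing",
    ],
    "http://example.org/restaurant-b/menu": [
        "http://dbpedia.org/resource/Pizza",
    ],
}


def build_index(documents):
    """Map each concept URI to the documents that point to it."""
    index = defaultdict(list)
    for doc, uris in documents.items():
        for uri in uris:
            index[uri].append(doc)
    return dict(index)


index = build_index(documents)

# Every document about the same concept is now reachable from one node.
for doc in index[PEKING_DUCK]:
    print(doc)
```

Because both the menu and the blog post point to the same address, the index groups them without having to guess that "Peking Duck" and "Beijing roast duck" mean the same thing. The caveat above still applies: this only works to the extent that publishers actually agree on the URI.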
The key idea of this post is that Linked Data offers a new medium to link structured data that is then more machine-readable. It does not by itself add any semantic meaning to the information, but it better carries that semantic information once you have it. So, while Linked Data is not semantic, creating links at the data level paves the way to a true Semantic Web.
(this post continues in part 3, to be published on Wednesday 13)
