Blog

Personal musings. Views are my own.

WTF is Linked Open Data?

I’ve been hearing about Linked Open Data for years. I’ve sat in on sessions at conferences and followed many discussions on Twitter and e-mail lists. At times, the tone of these conversations seemed like “this is such an awesome tool that nobody is using.” But I never really understood what it was. I was left still wondering “WTF is Linked Open Data?”

I think I figured it out, at least in a general sense. I’m sharing my discoveries here, so other confused developers can benefit.

Relationships

Databases are really good at storing fielded values, but the ways in which they link entities generally lack meaning. Linked Open Data (LOD) rethinks these connections, because a fundamental idea behind Linked Open Data is to describe how people, places and things relate to each other.

The way that I’ve worked with this concept recently has been with NoSQL-style RDF triples. Each entity has a collection of relationships described in the following way:

Thing | relates to | thing    

So for example:

The Old Guitarist | was created by | Pablo Picasso    

A linked open “database” will consist of a series of these triples describing how various pieces of data are connected to each other.
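
To make the triple idea concrete, here is a minimal Python sketch that stores triples as plain tuples and looks them up by pattern. It is only an illustration under my own assumptions: the two extra triples and the `match` helper are made up for this example and are not part of RDF or any RDF library.

```python
# Each triple is (subject, predicate, object), stored as plain strings.
triples = {
    ("The Old Guitarist", "was created by", "Pablo Picasso"),
    # The next two triples are extra examples added for illustration.
    ("Guernica", "was created by", "Pablo Picasso"),
    ("Pablo Picasso", "was born in", "Málaga"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Every subject that "was created by" Picasso in this tiny store.
works = [s for s, _, _ in match(triples, predicate="was created by", obj="Pablo Picasso")]
print(works)
```

The same `match` call answers questions in any direction, for instance "what do we know about Pablo Picasso?" by passing only `subject`.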

Identifiers

LOD requires that data sets be shared and mapped to one another to benefit from each other’s knowledge bases, because “open” data is “freely available for reuse in practical formats with no licensing requirements” (h/t Mia Ridge). To do this, we must give all our entities permanent identifiers that start with http:// and are available on the web. These “URL IDs”, or URIs (Uniform Resource Identifiers), should resolve to the data they represent, making that data human- and machine-retrievable.

If we take the triple format we described above, the example record might look like this:

http://data.museum.org/assets/The_Old_Guitarist | http://data.museum.org/relations/Created_by | http://data.museum.org/assets/Pablo_Picasso    

Those URIs will resolve to RDF documents describing those things and how they’re related to other things, allowing viewers to traverse our collection in many directions to many levels.
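
Here is a rough sketch of what "traversing in many directions to many levels" could look like in Python. Since data.museum.org is only an example address, the sketch uses an in-memory list as a stand-in for fetching each URI over HTTP. The `Born_in` relation, the `Malaga` asset, and the `describe` and `traverse` helpers are all hypothetical names I invented for the illustration.

```python
from collections import deque

BASE = "http://data.museum.org"

# Stand-in for the web: a real client would fetch each URI over HTTP instead.
# The Born_in relation and the Malaga asset are hypothetical.
graph = [
    (f"{BASE}/assets/The_Old_Guitarist", f"{BASE}/relations/Created_by", f"{BASE}/assets/Pablo_Picasso"),
    (f"{BASE}/assets/Pablo_Picasso", f"{BASE}/relations/Born_in", f"{BASE}/assets/Malaga"),
]


def describe(uri):
    """Return the (predicate, object) pairs whose subject is the given URI."""
    return [(p, o) for s, p, o in graph if s == uri]


def traverse(start, max_depth):
    """Breadth-first walk from start; map each reachable URI to its depth."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        uri = queue.popleft()
        if seen[uri] == max_depth:
            continue  # do not follow links beyond the requested depth
        for _, obj in describe(uri):
            if obj not in seen:
                seen[obj] = seen[uri] + 1
                queue.append(obj)
    return seen


start = f"{BASE}/assets/The_Old_Guitarist"
reachable = traverse(start, max_depth=2)
for uri, depth in reachable.items():
    print(depth, uri)
```

Starting from the painting, one hop reaches the artist and a second hop reaches the (hypothetical) birthplace record.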

Namespaces

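RDF serializations such as Turtle provide a shorthand for long URIs, so we can define reusable prefixes, or namespaces, to make our triples more readable. Note the trailing slash on each namespace: the name after the colon is appended directly to it.

@prefix musasset: <http://data.museum.org/assets/> .
@prefix musrelation: <http://data.museum.org/relations/> .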

musasset:The_Old_Guitarist | musrelation:Created_by | musasset:Pablo_Picasso    
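
Expanding a prefixed name back into a full URI is just string concatenation, as this small Python sketch shows. The `PREFIXES` table and the `expand` helper are hypothetical names of my own, and the namespaces are written with a trailing slash so the local name appends cleanly.

```python
# Prefix table for the two example namespaces, written with a trailing slash
# because the local name is appended directly to the namespace URI.
PREFIXES = {
    "musasset": "http://data.museum.org/assets/",
    "musrelation": "http://data.museum.org/relations/",
}


def expand(name, prefixes=PREFIXES):
    """Turn 'prefix:Local_name' into a full URI (KeyError on unknown prefix)."""
    prefix, local = name.split(":", 1)
    return prefixes[prefix] + local


short = ("musasset:The_Old_Guitarist", "musrelation:Created_by", "musasset:Pablo_Picasso")
full = tuple(expand(part) for part in short)
print(" | ".join(full))
```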

Connecting our collections

“Linked” data means our repositories are not silos but are mapped to one another. Instead of reinventing the wheel, I can connect the data I create about my museum’s artworks, creators and places to an already vetted data set, say DBpedia. In this case, my example would now look something like this:

musasset:The_Old_Guitarist | dbpedia:Created_by | dbpedia:Pablo_Picasso    

With my data connected to existing sets, I could merge my data with other museums’ data and search and browse our combined sets in tandem. For example, I can find works other museums have that were created by Picasso along with ours. I can also link my data to another museum’s data in meaningful ways:

musasset:The_Old_Guitarist | dbpedia:Created_with | othermus:A_specific_blue_oil_paint    
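
Why does sharing identifiers make merging easy? A small Python sketch, under my own assumptions, shows the idea: if two museums both point at the same `dbpedia:Pablo_Picasso` identifier, combining their data is a plain set union and one query covers both. The second museum's records and the `created_by` helper are invented for illustration.

```python
# My museum's triples, written with the shorthand from the examples above.
mine = {
    ("musasset:The_Old_Guitarist", "dbpedia:Created_by", "dbpedia:Pablo_Picasso"),
}

# A hypothetical second museum; these asset names are invented placeholders.
theirs = {
    ("othermus:Some_Picasso_Painting", "dbpedia:Created_by", "dbpedia:Pablo_Picasso"),
    ("othermus:Some_Other_Work", "dbpedia:Created_by", "dbpedia:Some_Other_Artist"),
}

# Triples are just statements, so merging two data sets is a set union.
combined = mine | theirs


def created_by(store, creator):
    """Return every subject linked to the creator, across all sources."""
    return sorted(s for s, p, o in store if p == "dbpedia:Created_by" and o == creator)


picassos = created_by(combined, "dbpedia:Pablo_Picasso")
print(picassos)
```

The query only works across both collections because both sides use the same identifier for Picasso. If each museum had minted its own, the union would still succeed but the search would miss half the results.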

In practice, however, connecting disparate data sets with each other in this flexible way can get tricky.

What if my museum’s notion of that blue paint differs from the other museum’s? What if their research yielded different conclusions, so their vocabulary doesn’t accurately reflect what we’re trying to say? Different institutions’ collections can vary tremendously, as can the ways we describe them. How do we merge our data sets while staying true to our institutional methods for describing our works?

There are plenty of standards out there that can help with this challenge. But the more flexible the standard gets, the more convoluted it can be to describe our collections. At the same time, I think it’s okay that there will never be one data set that represents all our viewpoints, that we won’t always agree on definitions for the same concepts. Humans have a huge variety of perspectives of our world and our histories, and different vocabularies can and should reflect that variation.

A-haaaa!

So that’s what I gather: Linked Open Data is a super-flexible way of describing how pieces of data relate to one another, and it provides a framework to connect our data sets with each other in meaningful ways.

Credit where it’s due

Thanks to David Henry and Jarred Moore from the Missouri History Museum for a workshop they facilitated on Linked Open Data at Museums and the Web 2014. Thanks to Mia Ridge for her feedback and for keeping #lodlam on my radar (check out a similar, more thorough overview she wrote two years ago). Thanks to Micah and Kyle for asking me to share what I’ve figured out.