…in the RDA Ontologies. Do we? After all, they’re a big part of the ‘Access’ in Resource Description and Access (RDA). But they’re not particularly semantically meaningful, especially if you have the component parts available. An Access Point is just a structured string. For instance, a ‘Publication Statement’ Access Point for “The Daytona daily news” might look like:

“Daytona Beach, Florida : Geo. F. Crouch, 1903-1926”

It has a formal syntactic structure, and semantics derived from adherence to that structure when the string is created:

“Place of Publication” : “Publisher’s Name”, “Date of Publication”

Note that the punctuation is part of the formal grammar that helps parse a grammatically correct statement into its semantically meaningful constituent parts.
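
As a rough illustration of that point, here is a minimal Python sketch that treats the punctuation as a tiny grammar and splits the sample statement back into its parts. The regular expression and the function name are my own assumptions, not anything RDA specifies, and real-world statements with missing or repeated parts would need more than this.

```python
import re

# Assumed grammar: "<place> : <publisher>, <date>".
# The date is taken to be everything after the LAST comma, because the
# place name (and possibly the publisher's name) may contain commas too.
STATEMENT = re.compile(r"^(?P<place>.+?) : (?P<publisher>.+), (?P<date>[^,]+)$")


def parse_publication_statement(statement: str) -> dict:
    """Split a publication statement into place, publisher and date strings."""
    match = STATEMENT.match(statement)
    if match is None:
        raise ValueError(f"not a well-formed publication statement: {statement!r}")
    return match.groupdict()


parts = parse_publication_statement(
    "Daytona Beach, Florida : Geo. F. Crouch, 1903-1926"
)
print(parts)
```

The parse only works because whoever created the string followed the agreed punctuation; a statement that ignores it simply fails to match.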

And this is the way we’ve been doing things forever (well, it seems like forever): semantics is derived from proper use of a syntax that everybody who is creating and using shared data has agreed upon in advance. And as long as everybody uses precisely the same syntax, this works great. It works really, really well with structured syntaxes like MARC21:

260 $a Daytona Beach, Florida
260 $b Geo. F. Crouch
260 $c 1903-1926

…and hierarchical syntaxes like XML:

<publicationStatement>
  <placeOfPublication>Daytona Beach, Florida</placeOfPublication>
  <publisherName>Geo. F. Crouch</publisherName>
  <dateOfPublication>1903-1926</dateOfPublication>
</publicationStatement>
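
To make the contrast concrete, here is a small Python sketch, using only the standard library, that reads the parts straight out of an XML record like this one and rebuilds the Access Point string from them. The element names come from the example; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the example record; element names follow the post.
RECORD = """
<publicationStatement>
  <placeOfPublication>Daytona Beach, Florida</placeOfPublication>
  <publisherName>Geo. F. Crouch</publisherName>
  <dateOfPublication>1903-1926</dateOfPublication>
</publicationStatement>
"""

root = ET.fromstring(RECORD.strip())

# The markup already names each part, so no punctuation parsing is needed.
fields = {child.tag: child.text for child in root}

# Rebuild the display string using the punctuation pattern shown earlier.
access_point = "{placeOfPublication} : {publisherName}, {dateOfPublication}".format(**fields)
print(access_point)
```

Here the semantics ride on the element names rather than on the punctuation, but they are still entirely a product of the agreed syntax.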

The RDA documents go so far as to call an Access Point an Element and its constituent parts Sub-Elements, again clearly thinking of this nice, syntactically defined semantics.

But what if your data says, semantically, that “Place of Publication” isn’t the Name of the place or the Label for the place, but a URI that identifies the Place itself: a resource rather than a string? The Access Point rules don’t let you stick a URI in the Publication Statement where a string is supposed to be.

What about “Publisher’s Name”? That’s clearly going to be a string no matter what — names tend not to be resources. But there’s probably a Publisher resource out there, somewhere, with a URI that identifies the Publisher and probably has a Name or a Label property that provides a string that you can stick in the Publication Statement.

We’ll just ignore “Date of Publication” for now, since that’s a very different can of worms: slimy, smelly worms.

At the moment, RDA doesn’t acknowledge the existence of the resources supplying the strings for an Access Point, and it lets substantial ambiguity sneak in with property names like “Place of Publication” rather than “Place of Publication Name”, the kind of ambiguity it did resolve with “Publisher’s Name”. But that ambiguity didn’t exist when all the data was strings: strings you used for indexing, and displayed to the user, and didn’t have to go fetch from somewhere because they were right there in that 260 field.

I listened to a radio program that referred to all of the money that everyone in the world has available to invest as “The Global Pool of Money” and I think that applies quite nicely to the Linked-Data notion of the Semantic Web — “The Global Pool of Data”.

The open world model of the Semantic Web assumes that you will never have all of the available data that describes a resource, and the RDF data model supports this. Resources often exist, outside of traditional library data, available from the Global Pool of Data, that can supply the necessary labels.

But of course we usually just have the labels. This is library data made for cards that need to be put in the correct order and read by a person. And the Global Pool of Data usually just has a bunch of resources. This is linked data, in no particular order at all, meant to be read by a machine.

So, specifying Access Points as pre-coordinated strings actually provides us with a major opportunity when defining the ontologies; several opportunities actually:

  • We can formalize each Access Point specification into what Dublin Core calls a Syntax Encoding Scheme (SES) and say that each Access Point has a datatype.
  • We can clarify the semantics of using a label rather than a resource for properties (sub-elements) like “Publisher’s Name”.
  • We can clarify the semantics of using a resource for “Place of Publication” and say that the label used in an Access Point must be the Name of the Place, which is distinctly different from the Place resource itself.

So, refined to use properties that are a bit more semantically clear, we have a slightly modified Publication Statement:

“Place of Publication Name” : “Publisher’s Name”, “Date of Publication”

…we tie these properties specifically to a FRBR Manifestation that RDA says must be what they describe, and in RDF the supporting ontology looks like:

rda:placeOfPublicationManifestation a owl:ObjectProperty
rda:PlaceOfPublicationName a owl:DatatypeProperty
rda:publisherManifestation a owl:ObjectProperty
rda:PublisherName a owl:DatatypeProperty

Here’s our sample instance again (by the way, this data is from a real linked data resource):

“Daytona Beach, Florida : Geo. F. Crouch, 1903-1926”

<http://chroniclingamerica.loc.gov/lccn/sn93063916>
  rda:publisherManifestation <http://???> (blank, but we know one must exist)
    rda:PublisherName "Geo. F. Crouch"
  rda:placeOfPublicationManifestation <http://dbpedia.org/resource/Daytona_Beach%2C_Florida>
    rda:PlaceOfPublicationName "Daytona Beach, Florida"
    rdfs:label "Daytona Beach, Florida"

This tiny chunk of data was gathered by hand and mapped (by me) from the existing resources and the labels supplied by those resources.
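
To show how the labels carried by those resources can be put back together into the pre-coordinated string, here is a minimal Python sketch. It holds the statements above as plain (subject, predicate, object) tuples rather than using a real RDF library. The `_:publisher` placeholder stands in for the Publisher URI we don't have, the helper names are made up for illustration, and the date is passed in as a plain string since we're ignoring it for now.

```python
# Triples as plain tuples; property names are the ones used in this post.
MANIFESTATION = "http://chroniclingamerica.loc.gov/lccn/sn93063916"
PLACE = "http://dbpedia.org/resource/Daytona_Beach%2C_Florida"
PUBLISHER = "_:publisher"  # placeholder: the real Publisher URI is unknown

triples = [
    (MANIFESTATION, "rda:publisherManifestation", PUBLISHER),
    (PUBLISHER, "rda:PublisherName", "Geo. F. Crouch"),
    (MANIFESTATION, "rda:placeOfPublicationManifestation", PLACE),
    (PLACE, "rda:PlaceOfPublicationName", "Daytona Beach, Florida"),
]


def value(subject, predicate):
    """Return the first object found for (subject, predicate)."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    raise KeyError((subject, predicate))


def publication_statement(manifestation, date):
    """Follow the object properties to the resources, then read their names."""
    place = value(manifestation, "rda:placeOfPublicationManifestation")
    publisher = value(manifestation, "rda:publisherManifestation")
    place_name = value(place, "rda:PlaceOfPublicationName")
    publisher_name = value(publisher, "rda:PublisherName")
    return f"{place_name} : {publisher_name}, {date}"


statement = publication_statement(MANIFESTATION, "1903-1926")
print(statement)
```

The object properties point at resources and the datatype properties supply the strings, so the Access Point becomes something you can derive rather than something you have to store.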

Someday there will be services that comb through linked data looking for missing data like that Publisher resource. They will search the Global Pool of Data for resources with labels matching that library data, expressed in RDA/RDF, fill in the missing pieces, and present the lucky cataloger, and ultimately the user, with all that rich linked data.

The eXtensible Catalog project is working on services that do just that kind of thing, so someday may not be too far off.
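
Just to make the idea of label matching concrete, here is a toy Python sketch. It is not how the eXtensible Catalog or any real service works: the "pool" is a hand-made dictionary, the example.org URIs are invented, and the matching is nothing more than a case-insensitive comparison of labels.

```python
def find_by_label(pool, label):
    """Return the URIs whose labels match, ignoring case and outer whitespace."""
    wanted = label.strip().casefold()
    return [
        uri
        for uri, labels in pool.items()
        if any(candidate.strip().casefold() == wanted for candidate in labels)
    ]


# Hypothetical stand-in for the Global Pool of Data: URI -> known labels.
# These example.org URIs are invented for illustration only.
pool = {
    "http://example.org/publisher/1": ["Geo. F. Crouch"],
    "http://example.org/publisher/2": ["Example Press"],
}

candidates = find_by_label(pool, "geo. f. crouch")
print(candidates)
```

A real service would have to cope with variant spellings, abbreviations and several candidates for one label, which is exactly where a cataloger's judgment comes back in.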


By Jon, November 3, 2009, 10:42 am (UTC-5)


Currently 1 comment

  1. Comment by Rob Styles

    Jon,

    I think the thrust of your argument is bang on the nail. It chimes strongly with work I did some time ago (and am building on further currently). events.linkeddata.org/ldow2008/slides/RobStyles_SemanticMarc.pdf

    There is a key difference in the semantics of data found in MARC records and the data we would like to publish as Linked Data. The data in the MARC record is a statement of what is printed on the book, not a statement of truth.

    So, where it says “Publisher Statement” that’s because it’s the statement made in the book, which all comes back to book-in-hand cataloging for the purpose of stock management and discovery within a library.

    This is key because the statement printed in the book will not change over time, whereas names and locations of publishers change as companies merge, split, go broke and re-form.

    What is needed is both: literal values for what is printed on the book-in-hand (hence tying it to the manifestation in most cases), and also properties referring to the organizations and people involved. We can start to build on data mining techniques and bring in external data to populate those properties.

    Thanks for prompting more thought about this.

    rob
