A couple of weeks ago I made a short presentation at a linked data session at the American Association of Law Libraries (AALL). Many of the audience members were people I’ve known since I was a baby librarian (this is the group where I started my career as a presenter, and they invite me back every couple of years to talk about this kind of stuff.) One of the questions from the audience was one I hear fairly often: “Who’s paying for this?” I always assume, perhaps wrongly, that the questioner is responding to pressure from an administrator, but in fact anyone with administrative and budget responsibilities–particularly in Technical Services, which has been under budget siege for decades–does and should think about costs.

What I said to her was that we were all going to pay for it, and it seems to me that this isn’t just a platitude–given that the culture of collaboration we have developed assures us that many (though certainly not all) of the costs associated with the change we contemplate will be shared, as in the effort noted in my previous post. But the costs are difficult to assess at this stage, because we don’t know how long the transition will be nor exactly what the ‘end result’ will look like. If indeed we have three options for change—metamorphosis, evolution, and revolution—it seems we’ve not yet made the decision on what it’s going to be. If there are still some hoping for metamorphosis—where everything happens inside the pupa and the result is much more attractive (but the DNA hasn’t changed), well, it may be too late for that option.

Evolution–defined as creative destruction and adaptation leading to a stronger result, with similar (but not identical) DNA–is much more what I’d like to see, and particularly if we look carefully at what we have–the MARC semantics, the SemWeb friendly RDA vocabularies, and the strong community culture in particular–and build on those assets, we have a fighting chance for a good result. The trick is that we don’t have millennia to accomplish this, we have a couple of years at best, if we work really quickly and keep our wits about us. The interesting thing about evolution is that when the environment changes and the affected species either adapt or disappear, it’s never entirely clear what that adaptation will look like prior to the point-of-no-return.

As for revolution–perhaps that’s the possible result where Google and its partners take over the things we used to do in libraries when we brought users and resources together. They’re doing metadata now (not as well as we do, I’m thinking) but if we keep trying to make our ‘catalogs’ work better instead of getting ourselves out there on the Web, I don’t think the result will be pretty.

By Diane Hillmann, August 7, 2012, 3:39 pm (UTC-5)

A few years ago, I wrote an article for a collection of writings in honor of Tom Turner, a talented metadata librarian at Cornell who sadly died too young. That article, “Looking back—looking forward: reflections of a transitional librarian” (Metadata and Digital Collections: Festschrift in honor of Tom Turner), although it meanders around a bit (she said after reading it over for the first time in a couple of years) is a pretty good view of where my head has been these past couple of decades. A lot has happened, and I’ve been lucky enough to have been in the thick of a lot of it.

One question that occasionally comes up is about what was in that kool-aid I drank that caused me to jump ship from many of the traditional library ways of thinking to something quite different. There’s not a simple answer to that. It was a combination of things, certainly, but perhaps best expressed towards the end of the article:

“In October of 2004, I attended a panel presentation where three experts were asked to inform library practitioners by providing “evaluation” information about a large Dublin Core-based metadata repository. It was, in a small way, yet another version of the blind men and the elephant. The first presenter provided large numbers of tables giving simple numeric tallies of elements used in the repository, with no more analysis than a relational database might reveal in ten minutes. The second provided results of a research project where users were carefully observed and questioned about what information they used when making decisions about what they were given in search results—i.e. useful data, and a good start on determining the usefulness of metadata, but with no attention paid to the metadata that was used behind the scenes, well before any user display was generated. The third presenter, a young computer scientist, relied almost entirely on tools developed for textual indexing, and, concluding that the diversity of the metadata was a problem, suggested that the leaders of the project should insist that all data providers follow stricter standards.

These presentations seemed sadly reflective of most attempts to approach the problems of creating and sharing metadata in the world beyond MARC. Traditional libraries built a strong culture of metadata sharing and an enormous shared investment in training and documentation around the MARC standard. The MARC development process codified the body of knowledge and practice that supported this culture of sharing and collaboration, building, in the process, a community of metadata experts who took their expertise into a number of specialized domains. We clearly are now at a critical juncture. Moving forward in both realms, traditional and “new” metadata, requires that we understand clearly where we have been and what has been the basis for our past success. To do that we need much better research and evaluation of our legacy and current models, a clearer articulation of short term and long term goals, and a strategy for attaining those goals that is openly endorsed and supported by stakeholders in the library community.”

Looking back, I’m not sure I can exactly pinpoint the moment when we stopped thinking clearly about the road ahead, but until very recently, it sure looked like we had. It was probably somewhere around the time when it seemed like RDA would never be finished in our lifetimes, and that, even if it were finished, it would be too much like AACR2 to even consider implementing. Somewhere around that time, CC:DA drafted a ‘no-confidence’ memo to the JSC, and I showed Don Chatham (ALA Publishing) a late draft of it while he was at the DC-2006 conference in Mexico. At that meeting, a few of us suggested to Don that it might be a good idea to have the JSC get together with DCMI and see what could be done, which ended up being the ‘London Meeting’ of five years ago that changed the conversation significantly. This spring a five year anniversary celebration was held in London, where a big part of the discussion was about just how big that impact was, in terms of where the library community is today. As usual, nothing like a good crisis to get things moving.

It will be no surprise to readers of this blog, but I spend a fair bit of time thinking and talking about what comes next, and some of the focus of that question, particularly lately, has to do with MARC. A big part of the problem of answering that “what’s next?” question is that we tend to use ‘MARC’ to refer not just to the specification but as a stand-in for traditional practices and library standards as a whole, and this muddies the conversation considerably. What too often happens is that the ‘good stuff’ of MARC, the semantics that represent years of debate about use cases and descriptive issues, doesn’t get separately, or properly, discussed. Those semantics ought really to be seen as a distillation of all those things we know about bibliographic description, tested by time and generations of catalogers, and very much worth retaining. How we express those semantics needs to be updated, for sure (many are already expressed in the RDA vocabularies), but differentiating between baby and bathwater is clearly a necessary part of moving ahead.

One of the things about the library community that few outsiders understand is the extent to which “the library community” has developed a culture of collaboration–to an extent I’ve never seen anywhere else. Librarians collectively build and contribute bibliographic and name authority records, submit subject and classification proposals, participate (passionately) in endless debates about interpretation of rules and policies, and most remarkably–write this stuff up and share it extensively, asking for comments (and most often getting them). Participating in other communities after having been part of the one built by librarians is very often a frustrating thing–my expectations of participation are clearly too high.

A great example of how this works is NACO (the Name Authority Cooperative Program), through which participants build and maintain good authority records. In my days as Authorities Librarian at Cornell, I helped increase Cornell’s participation in the program by a significant amount, an accomplishment of which I’m still quite proud. Recently I noted a new post to the RDA-L list about some PCC (Program for Cooperative Cataloging) committee work around authority records:

“The PCC Acceptable Headings Implementation Task Group (PCCAHITG) has successfully tested the programming code for Phase 1 in preparation for the LC/PCC Phased Implementation of RDA as described in the document entitled “The phased conversion of the LC/NACO Authority File to RDA” found at the Task Group’s page.”

The activities of this group display some of the important characteristics of successful community activity: the goals are clearly stated, and timelines spelled out; assumptions are tested and results analysed; conclusions and recommendations are written, explained and exposed for comment. This particular activity is noteworthy because it is a collaboration between the Library of Congress and a group of PCC member institutions, and is Phase 1 of a set of planned activities designed to move cooperatively built and maintained authority data into compliance with RDA.

As for the old broads? I was recently reminded of this wonderful quote from Bette Davis: “If you want a thing well done, get a couple of old broads to do it.” So true, so true …

By Diane Hillmann, August 5, 2012, 12:56 pm (UTC-5)

Metadata is produced and stored locally, published globally, consumed and aggregated locally, and finally integrated and stored locally. This is the create/publish/consume/integrate cycle.

Providing a framework for managing metadata throughout this cycle is the goal of the Dublin Core Abstract Model and the Dublin Core Application Profile (DCAM/DCAP).

The basic guidelines and requirements for this cycle are:

  • Metadata MUST be syntactically VALID and semantically COHERENT when it’s CREATED and PUBLISHED.
  • Globally PUBLISHED Metadata SHOULD be TRUE, based on the domain knowledge of the publisher.
  • PUBLISHERS of metadata MUST publish the semantics of the metadata, or reference publicly available semantics.
  • CONSUMERS of published metadata SHOULD assume that the global metadata is locally INVALID, INCOHERENT, and UNTRUE.
  • CONSUMED metadata MUST be checked for syntactic validity and semantic coherence before integration.
  • AGGREGATED metadata SHOULD indicate its PROVENANCE.
  • CONSUMED metadata MAY be considered TRUE based on its PROVENANCE.
  • Locally INTEGRATED global metadata MUST be syntactically VALID, semantically COHERENT, and SHOULD be TRUE based on local standards.
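The consumer-side guidelines above can be sketched in code. This is a minimal illustration, not an implementation of any DCAM/DCAP specification: the record layout, the check functions, and the set of locally known properties are all assumptions made for the example.

```python
# Hypothetical sketch of the consumer side of the cycle: treat incoming
# metadata as locally INVALID and INCOHERENT until it passes local checks,
# and keep PROVENANCE with whatever is integrated.

KNOWN_PROPERTIES = {"title", "creator", "date"}  # stand-in for local semantics

def is_syntactically_valid(record):
    # Minimal syntax rule: a record is a dict of property -> non-empty string.
    return (isinstance(record, dict)
            and all(isinstance(k, str) and isinstance(v, str) and v
                    for k, v in record.items()))

def is_semantically_coherent(record):
    # Minimal coherence rule: every property must be locally understood.
    return set(record) <= KNOWN_PROPERTIES

def integrate(consumed, provenance):
    """Integrate consumed metadata, recording provenance for each record."""
    accepted = []
    for record in consumed:
        # Guideline: assume global metadata is locally invalid until checked.
        if is_syntactically_valid(record) and is_semantically_coherent(record):
            accepted.append({"data": record, "provenance": provenance})
    return accepted

consumed = [
    {"title": "Cataloguing for beginners", "creator": "Pat"},
    {"title": ""},            # fails the syntax check (empty value)
    {"subject": "Cats"},      # fails the coherence check (unknown property)
]
print(len(integrate(consumed, "http://example.org/aggregator")))  # 1
```

The point of the sketch is the order of operations: nothing consumed from the global pool reaches the local store until it has been re-validated against local rules.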

The DCAM takes the RDF data model as its base because of that model’s simplicity and flexibility. The DCAM refines the RDF data model in order to support syntactic interoperability between the RDF data model and non-RDF data models.

A DCAP defines a distinct subset of the set of all things, together with the domain-specific knowledge of the properties of those things and the relationships those things have to other things. It expresses that knowledge through the DCAM and a set of related documentation (the Singapore Framework). A complete DCAP should provide the necessary domain-specific infrastructure to fully support the create/publish/consume/integrate cycle for any system, any domain, and any data model. A DCAP based on the DCAM should be able to be used by a machine to generate a valid data model in any modeling syntax, and any modeling syntax should be able to be expressed as a DCAP.

We strongly recommend that metadata be published and consumed using the RDF data model. The strength of the RDF data model for publishing and aggregating metadata lies in its extreme flexibility of expression, its self-describing semantics, and its assumption of validity. These strengths become weaknesses when creating and validating local metadata that must conform to a set of local restrictions. An effective DCAP will provide metadata creators with the ability to create locally valid metadata, publish it as RDF, validate consumed RDF against local standards, and ultimately integrate global metadata into the local knowledge base that contains local descriptions of the things being described.

There are many systems, many data models, many publish and subscribe models, many storage and validation models. There are many paths to integration. There are very few that provide a generic and neutral method for modeling and documenting inter-system metadata integration. The DCAM/DCAP specification has the potential to be one of those few.

By Jon, July 19, 2012, 11:54 am (UTC-5)

By-passing taggregations identifies those MARC21 variable data field tags whose level 0 RDF properties, representing individual subfields, can be used generally in data triples and mapping triples without losing information and the semantic coherency of the record. These tags have subfields which are independent of one another, with no need to keep them together if the tag is repeated in a record.

RDF graph of Music format ontology

This graph is another example of adding MARC fruit to the cornucopia by using the sub-property ladder. The MARC21 property used in the graph is a level 0 element. The MARC21 property’s description is equivalent to the ISBD property’s description, as hinted at in the ISBD scope note, but an unconstrained version is used to avoid the ISBD property’s domain. We could use the Web Ontology Language (OWL) ontological property owl:equivalentProperty to represent the relationship between the properties, but we can also use the rdfs:subPropertyOf property by applying it in both directions. That is, if two properties P1 and P2 are related as P1 rdfs:subPropertyOf P2 AND P2 rdfs:subPropertyOf P1, then P1 owl:equivalentProperty P2 and P2 owl:equivalentProperty P1.
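The bidirectional sub-property argument can be sketched very simply. The property URIs below are illustrative placeholders, not taken from the actual ontology in the graph:

```python
# Sketch of the "sub-property in both directions" reasoning: if P1 is a
# sub-property of P2 AND P2 is a sub-property of P1, the two properties
# behave as equivalent. The URIs here are made up for the example.

sub_property_of = {
    ("m21:Mexample_a", "isbd:Pexample"),
    ("isbd:Pexample", "m21:Mexample_a"),
}

def equivalent(p1, p2):
    # Equivalence holds when the sub-property relation runs both ways.
    return (p1, p2) in sub_property_of and (p2, p1) in sub_property_of

print(equivalent("m21:Mexample_a", "isbd:Pexample"))  # True
```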

Unfortunately, it is unsafe to use level 0 properties for dependent subfields in repeatable tags in this way, even if a specific record contains only one instance of such a tag. Triples from that record will cluster with triples from another record about the same bibliographical resource, either by sharing the same resource URI or an equivalence link. Taggregations are required to avoid semantic confusion, otherwise we wouldn’t know which “Materials specified” goes with which “Place of publication …” or “Name of publisher …” in the publication statement example given in Taggregations.
By Gordon Dunsire, July 18, 2012, 9:44 am (UTC-5)

If we were asked (as we sometimes are) what we’d like to see develop as a result of the BibFrame effort, the emphasis in our answer would have both technical and social aspects.

First, given the technologies developing in several different places and considering what we can do now to bring Linked Open Data into our somewhat closed world, some concrete suggestions: We have the ability to share metadata expressed not just in a single common ‘vocabulary’, but to share it using many different vocabularies, expressed and distributed using RDF, OWL/RDFS, RDFa, Microdata, and other tools; we have methods of specifying the use of these ‘semantic’ building blocks (DC Application Profiles and emerging provenance specifications from W3C and DCMI) that allow machines to use, process and distribute data in ways that do not require a central enabling node; and finally, we have technologies and strategies in place to map between existing and prospective metadata schemas that flatten the fences between communities quite thoroughly.

This is the post-MARC world of library-related metadata, where the format native to our metadata is no longer its most important characteristic; there’s no lossy crosswalking requirement to transform data to serve different needs; and there are fewer (eventually no) barriers to sharing. The tremendous value that MARC represents–the semantics built over many decades in response to an enormous corpus of use cases (all of which are fully documented on the MARBI site)–continues to be vitally important as we move into a different arena. But the mid-20th century requirements that dictated the MARC syntax, and the constricting consensus model that has been required to maintain it, no longer apply to the current and future requirements of the global library community.

The usual caveats apply here–some of these technologies aren’t entirely ready for prime time, but in the world we live in, ‘finished’ is more likely to be used for something defunct rather than a goal for standards and tool development. The areas we’re personally most familiar with (no surprises here) are the complementary domains of vocabulary management and mapping. The first of those is up and running (although the ‘continuous improvement’ engine is going full bore) as the Open Metadata Registry; the second, mapping, is an important interest of ours and has generated papers, articles, and presentations (see below for a selection), not to mention the usual plethora of posts to blogs (including this one) and discussion lists. We believe that the OMR and mapping capabilities under development work together to enable legacy data to shift into the open linked data world efficiently, effectively, and with a minimum of loss, in the process enriching the useful data options available for everyone.*

Often overlooked in the glitter of technology is the possibility of pulling together the communities that were torn asunder as a result of past technical limitations (and other reasons). In our common past, the library community built their data sharing conventions on a platform based on the consensus use of AACR2 and MARC. Some library communities–law and music the most prominent–were willing to compromise in order to work within the larger library community rather than strike out on their own. Others, like the art library and museum community and the archivists, did make the break, and have developed data standards that meet their needs better than the library consensus was able to do. But those specialized standards have resulted in yet more silos, in addition to the huge MARC one.

If the intent is to ‘replace MARC’ (as it is at some level), it’s about re-placing MARC and its built-in limitations from the center of our world back into the crowd, where it becomes one-among-many. Also of value in that shift is the ability to expand the data sharing environment MARC enabled a half century ago to a broader community of interest that includes museums, publishers, and archives. Meeting the goal of dissolving those silos and making our data easily understandable and reusable in a host of ways will help initiate that ‘Web of Data’ we’ve been anticipating for many years. As Eric Miller explained so well in his Anaheim BibFrame update: by moving towards a linked data world we actually look well beyond the Library/Archives/Museum worlds by definition–it’s a very big world out there. But by collaborating with the LAM community as a whole to get there, we reap a great many benefits, not the least of which are perspectives that are much broader in some significant ways than ours. Limiting our view merely to a ‘MARC Linked Data Model’ might be an important beginning step, but it falls short of where our vision needs to extend.

And the fact is that MARC will not be going away for a long time, if ever. There will be a lot of variation in how the transition is done by libraries, depending on institutional support, short term and long term needs, and existing partnerships. The process of moving MARC into the linked data world has already started. RDA and its RDF vocabularies were a start, as is the development of a complete RDF version of MARC, located at marc21rdf.info. Several years’ worth of pre-conferences, presentations and discussions, at ALA and beyond, have prepared the soil for these changes. But we need a plan, and some concrete steps to take–steps that include the groups who have been working in the trenches without a great deal of support, but making progress regardless. The BibFrame effort needs to be more than a playground for the technologists, because in most instances, the technology is not what’s holding us back–it’s the institutional inertia and the difficulties of finding ways forward that don’t pit us against one another. The plan we need balances the technical and social, the quick-win with the long-term momentum, and the need for speed with the public discussion that takes time and builds buy-in.

What we have in our sights is an opportunity to reverse the long term trend towards balkanized metadata communities and to make the future one for which there are fewer fences between, and more data exchanged among these three communities with obviously similar challenges and interests. We think the time has come to use the vastly changed technology environment to do that.

*It would be too easy for a response to this post to be in the form of “Oh, you’re just tooting your own horn here,” and indeed we are in some measure doing that. But we do this work because we believe it’s important. We don’t believe it’s important merely because it’s what we do. We believe in the value of the work we’ve done and will do, and we see a great deal of relevance for it as part of the BibFrame discussions.

Selection of papers, and presentations:

Jon Phipps’ presentation on mapping at ALA Anaheim
A Reconsideration of Mapping in a Semantic World / Gordon Dunsire, Diane Ileana Hillmann, Jon Phipps, Karen Coyle
Gordon Dunsire’s presentation at the recent London meeting “RDA, 5 years on”

Post written by Diane Hillmann and Jon Phipps.

By Diane Hillmann, July 4, 2012, 2:35 pm (UTC-5)

The methodology of treating MARC21 variable data field tags as aggregated statements in RDF is discussed in Taggregations. There are some circumstances when this approach is redundant and level 0 RDF properties based on individual subfields can be used directly in MARC21 data triples that are semantically complete and coherent, and in mapping triples relating MARC21 to other metadata schema.

We can by-pass the need for an aggregated statement when there are no semantic dependencies between the subfields of a tag, and thus net more low hanging MARC fruit. The most obvious case is when there is only one subfield in the tag: the contents of the tag are the same as the contents of the subfield. Note that the repeatable status of a tag is, generally, not relevant as there is no intrinsic semantic dependence between multiple occurrences of a tag in a record.

There appear to be no MARC21 tags with just one subfield, but I think it is reasonable (at this stage of analysis) to ignore the Linkage subfield ($6) and the Field link and sequence number subfield ($8), although further investigation is required.

Disregarding $6 and $8, the following tags have a single subfield and no indicators:

  • 018
  • 025
  • 038
  • 042
  • 254
  • 256
  • 263
  • 306
  • 508
  • 515
  • 525
  • 547
  • 550
  • 580

[Note: Linking Entry Fields (76X-78X) are excluded from this analysis, for the time being.]

For example, the level 0 property (m21:M515__a) for subfield a of tag 515 “Numbering Peculiarities Note” can be used directly for a data triple.

RDF graph of repeated instances of MARC21 tag 515

In this example, a MARC21 record for a resource (ex:1) has two occurrences of tag 515, and the same level 0 property can be used directly to express the data in RDF, without the need for an aggregated statement property for the tag.
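A rough sketch of that conversion, with the record represented as a simple list of (tag, subfield code, value) entries — the record layout is an assumption made for the example, and the property URI pattern follows the m21:M515__a form used above:

```python
# Illustrative sketch: two occurrences of tag 515 in one record become two
# data triples sharing the same subject and the same level 0 property,
# with no aggregated statement needed. Field values are invented examples.

record = {
    "id": "ex:1",
    "fields": [
        ("515", "a", "Report year ends with June 30."),
        ("515", "a", "Issues for 1999 lack numbering."),
    ],
}

def level0_triples(record):
    triples = []
    for tag, code, value in record["fields"]:
        # Level 0 property for the subfield; the double underscore follows
        # the pattern for undefined indicators, as in m21:M515__a above.
        prop = f"m21:M{tag}__{code}"
        triples.append((record["id"], prop, value))
    return triples

for t in level0_triples(record):
    print(t)
# ('ex:1', 'm21:M515__a', 'Report year ends with June 30.')
# ('ex:1', 'm21:M515__a', 'Issues for 1999 lack numbering.')
```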

Disregarding $6 and $8, the following tags have a single subfield and use one or both indicators:

  • 384
  • 511
  • 516
  • 522
  • 567
  • 653

Thus the appropriate level 0 property for subfield a of tag 384 “Key” can also be used for data triples. There are three properties available (m21:M384__a, m21:M3840_a, m21:M3841_a), each based on a different first indicator value (#, 0, 1).

RDF graph of separate instances of MARC21 tag 384 with different first indicators

In this example, the MARC21 records for three different resources (ex:1, ex:2, ex:3) have an occurrence of tag 384, but with different values for the first indicator.
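Selecting the right level 0 property from the indicator value is a simple lookup. The selection function below is a sketch of that idea; only the three property URIs named above are taken from the post:

```python
# Sketch of choosing the level 0 property for tag 384 "Key" by first
# indicator: '#' (blank), '0', and '1' each select a different property.
# The function itself is an illustrative assumption, not a published API.

def property_for_384(indicator1):
    # '#' represents a blank indicator, per MARC documentation conventions.
    suffix = {"#": "_", "0": "0", "1": "1"}[indicator1]
    return f"m21:M384{suffix}_a"

print(property_for_384("#"))  # m21:M384__a
print(property_for_384("0"))  # m21:M3840_a
print(property_for_384("1"))  # m21:M3841_a
```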

A few tags have multiple subfields with no apparent semantic dependencies between them:

  • 010
  • 027
  • 030
  • 035
  • 040*
  • 066
  • 074
  • 088

For example, the level 0 properties for subfield a and subfield z of tag 088 “Report Number” can be used directly for data triples because there is no need to keep a report number ($a) together with a cancelled or invalid report number ($z).

RDF graph of separate subfields of MARC21 tag 088

[* It is not clear from the MARC Bibliographic documentation for tag 040 if the sequence of repeats of subfield d ("Modifying agency") is significant; that is, if there is semantic dependency between repeated instances of this subfield. There is a hint that sequence is significant: "Subfield $d is not repeated when the same MARC code or name would occur in adjacent $d subfields" (emphasis in original). This seems to imply that successive modifying agencies are added in sequence, with the inference that the last has modified the record as it stood after the previous modification. But the only example given of a repeat of the subfield says "modified by ... and by ..." (rather than "... and then by ..."), implying that sequence does not matter. Such ambiguity in the documentation of a schema used by many thousands of libraries to create millions of bibliographic records is not helpful.]

 

 

By Gordon Dunsire, June 7, 2012, 8:20 am (UTC-5)

The technique described in Using the sub-property ladder works well to “dumb-up” raw, level 0 data from MARC21 fixed-length data fields to interoperate with metadata from other schemas. Unfortunately, it cannot be used with most MARC21 variable data fields (tags) and subfields. We cannot simply dumb-up a subfield to the level of its parent tag because most tags have more than one subfield; the meaning of a tag is a combination of the meanings of its subfields and tag-level data is a composite of subfield-level data.

There is another technique we can use to bridge the semantic gap between a subfield and its tag: tags generally can be treated as “aggregated statements”, where the value of a tag is a literal string, or statement, which is composed of the values of subfields.

For example, a record may contain a tag 260 (Publication, Distribution, etc.) with subfield a (Place of publication, distribution, etc.) = “Edinburgh :”, subfield b (Name of publisher, distributor, etc.) = “Castle Press,”, and subfield c (Date of publication, distribution, etc.) = “2012.”. The contents of the tag, “$aEdinburgh :$bCastle Press,$c2012.”, can be turned into a tag-level value, “Edinburgh : Castle Press, 2012.”, by substituting a space for each subfield delimiter ($) and code pair. We can then use a tag-level property with the label “Publication, Distribution, etc. (Imprint)” and URI “m21plus:T260” to publish the metadata statement “This resource – has Publication, Distribution, etc. – ‘Edinburgh : Castle Press, 2012.’” as an RDF triple.
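The substitution just described is mechanical enough to sketch in a few lines. Using '$' as a readable stand-in for the MARC subfield delimiter control character (an assumption for the example):

```python
import re

# A minimal sketch of the delimiter-substitution step described above:
# derive a tag-level value from raw tag 260 content by replacing each
# subfield delimiter ($) and code pair with a single space.

def tag_level_value(tag_content):
    # "$<code>" -> " ", then trim the space left by the first subfield.
    return re.sub(r"\$.", " ", tag_content).strip()

print(tag_level_value("$aEdinburgh :$bCastle Press,$c2012."))
# Edinburgh : Castle Press, 2012.
```

This works only because the MARC21 subfield values carry their own ISBD punctuation, which is exactly the limitation discussed next.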

The instructions for deriving the tag-level value or aggregated statement from the subfield values are known as a syntax encoding scheme (SES). This is part of the Dublin Core Abstract Model, allowing specific SESs to be used in an application profile. There can be many different ways of deriving the value; the example above works because MARC21 subfields contain embedded punctuation that delineates the component parts when the subfield encoding is removed. This simple SES allows a MARC21 record to conform to the syntax prescribed by the International Standard Bibliographic Description (ISBD) for compound statements. Unfortunately, this makes it difficult to apply any other SES to the subfields without first removing the punctuation.

It would be much better if the instructions for adding ISBD punctuation to MARC21 data were embedded in an SES. Then a different SES could produce “Published in 2012 by Castle Press in Edinburgh” rather than “Published in 2012. by Castle Press, in Edinburgh :”. This is the approach taken by ISBD itself, and there is clearly an opportunity here for collaboration between the MARC21 and ISBD communities. The same approach is envisaged for RDA.

The aggregated statement technique is also very useful when a MARC tag is repeated. Using tag 260 again as an example, a record may contain multiple publication statements for intervening publishers, where the tag’s first indicator has value “2”. If there are two such tags, then there may be two or more publication places and two or more publisher names, for example “$32001-2005$aEdinburgh :$bMudhut Publishing” and “$32006-$aEdinburgh :$bCastle Press” (subfield 3 is for Materials specified). A linked data representation of the record needs to keep the places, names, and dates correctly associated so that they don’t get mixed up, for example “Mudhut Publishing” with “2006-”. The tag-level RDF property (m21plus:T260) can be used with an aggregated statement to keep the level 0 data associated with the correct repeat of the tag, avoiding the use of blank nodes in the RDF graph of a specific record.

RDF graph of MARC21 Publication statement data

As the graph shows, the two Publication statements must have URIs so that they can link to the correct subfield values. The URIs identify the literal strings of the aggregated statements, and are instances of an SES; all SESs are sub-classes of the class of literal strings. A blank node, on the other hand, has no URI and uses a local identifier to make the links; such links appear broken in a non-local environment.
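The structure the graph describes can be sketched as follows. The URI scheme for the aggregated-statement nodes, the record layout, and the exact level 0 property URIs are illustrative assumptions; only m21plus:T260 is taken from the post:

```python
# Sketch of the repeated-tag case: each occurrence of tag 260 gets its own
# aggregated-statement node with a real (locally minted) URI, keeping the
# level 0 subfield values from different occurrences apart.

occurrences = [
    [("3", "2001-2005"), ("a", "Edinburgh :"), ("b", "Mudhut Publishing")],
    [("3", "2006-"), ("a", "Edinburgh :"), ("b", "Castle Press")],
]

def aggregate(resource, occurrences):
    triples = []
    for n, subfields in enumerate(occurrences, start=1):
        stmt = f"{resource}/260-{n}"   # URI for this aggregated statement
        triples.append((resource, "m21plus:T260", stmt))
        for code, value in subfields:
            # Illustrative level 0 property pattern for first indicator 2.
            triples.append((stmt, f"m21:M2602_{code}", value))
    return triples

triples = aggregate("ex:1", occurrences)
# "Mudhut Publishing" stays linked to "2001-2005", not to "2006-",
# because each hangs off its own statement URI.
```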

To sum up, it seems useful to represent MARC21 tags as RDF properties associated with a syntax encoding scheme. We intend to add these properties to the Open Metadata Registry. Specific encoding schemes can then be assigned using an application profile. There must be many examples of instructions for processing tag subfields for output and display which can form the basis of suitable encoding schemes.

By Gordon Dunsire, May 20, 2012, 5:08 pm (UTC-5)

I discussed the utility of the sub-property relationship in Getting to higher MARC branches, Netting more MARC fruit, and Adding MARC fruit to the cornucopia. Coincidentally, Bob DuCharme posted Simple federated queries with RDF which outlines the same technique and provides additional information on its use for resource discovery. Those posts are somewhat technical, and I tried to lighten up in my presentation Turtle dreaming at the recent Dublin Core Metadata Initiative (DCMI) seminar Five years on. This post is another attempt to demonstrate in a non-technical way (I hope) how useful and powerful the sub-property relationship can be.

A metadata attribute, like ‘title’, that is to be used for linked data in the Semantic Web is usually represented in Resource Description Framework (RDF) as a property. A property can be used as the predicate part of a triple: “Subject – predicate – object”, where ‘Subject’ is what the triple is about (e.g. a resource), ‘predicate’ is the aspect of the subject, and ‘object’ is the value of that aspect for the specified subject. For example:

“This resource – (has) title – ‘Using the sub-property ladder’”

is a single metadata statement in triple format. We can think of this as conforming to the triple template:

“Specified resource – (has) attribute – value”.

Note that prefixing the predicate with ‘has’ turns it into a verbal phrase and renders the statement in (near) natural language.

We can also make meta-metadata statements in triple format. These are ‘data about metadata’ rather than ‘data about data’, and are often referred to as ontological triples to distinguish them from data triples such as the example above. The triple template for one type of meta-metadata statement is:

“Specified RDF element – (has) relationship – Other specified RDF element”

Note that a relationship between metadata elements is also represented in RDF as a property. In particular, ‘sub-property’ is a pre-defined relationship between two RDF property elements, giving the ontological triple:

“Property 1 – (is) sub-property of – Property 2”

Furthermore, such relationships can embed semantic rules that can be processed automatically by software known as ‘semantic reasoners’ or just plain ‘reasoners’. The rule embedded in the sub-property relationship is: If “P1 – (is) sub-property of – P2”, then any data triple using P1 as its predicate can generate another data triple using P2 as its predicate, with the same subject and object. Let’s call this kind of ontological triple a mapping triple, because it effectively maps one property to another.
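The rule is simple enough to sketch in a few lines of code. The following Python snippet is a minimal illustration, not a real reasoner: triples are plain tuples, and the mapping keys are made-up property names rather than registered URIs. It applies the sub-property rule repeatedly until no new triples can be generated:

```python
# Minimal sketch of the sub-property entailment rule.
# A triple is a (subject, predicate, object) tuple; the mapping records,
# for each property, the list of its super-properties.

def entail(data_triples, sub_property_of):
    """Apply the sub-property rule until no new triples appear."""
    inferred = set(data_triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            for super_p in sub_property_of.get(p, []):
                if (s, super_p, o) not in inferred:
                    # Same subject and object, broader predicate.
                    inferred.add((s, super_p, o))
                    changed = True
    return inferred
```

Real reasoners work over full RDF graphs, of course, but the core of the sub-property rule is no more than this loop.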

Suppose we have two attributes, ‘title’ and ‘varying form of title’. We can create the mapping triple:

“‘varying form of title’ – (is) sub-property of – ‘title’”.

If we have a data triple:

“This resource – (has) varying form of title – ‘Pat presents cataloguing for beginners’”

then a reasoner will automatically generate the data triple:

“This resource – (has) title – ‘Pat presents cataloguing for beginners’”

In a similar way, we can create the mapping triple:

“‘title statement’ – (is) sub-property of – ‘title’”

and from the data triple:

“This resource – (has) title statement – ‘Cataloguing for beginners’”

generate:

“This resource – (has) title – ‘Cataloguing for beginners’”

So what? Further suppose that the ‘title’ attribute is from the DCMI metadata terms, and the ‘title statement’ and ‘varying form of title’ attributes are from the MARC21 tags 245 and 246 respectively. So a MARC21 record for the resource might contain the set of data triples:

“This resource – (has) 245 [title statement] – ‘Cataloguing for beginners’”
“This resource – (has) 246 [varying form of title] – ‘Pat presents cataloguing for beginners’”

A reasoner can generate the set of data triples:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

In other words, we have generated a DC record from a MARC21 record. Or we have generated a title index for the MARC21 record. Or both.
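As a sketch of what that conversion looks like in practice, here is the example in Python; the prefixed predicate names (m21:titleStatement, dc:title, and so on) are shorthand invented for this illustration, not registered URIs:

```python
# Sub-property mappings from the prose: MARC21 245 and 246 map to DC title.
SUB_PROPERTY_OF = {
    "m21:titleStatement": "dc:title",       # MARC21 245
    "m21:varyingFormOfTitle": "dc:title",   # MARC21 246
}

# Data triples from a MARC21 record.
marc_triples = [
    ("ThisResource", "m21:titleStatement", "Cataloguing for beginners"),
    ("ThisResource", "m21:varyingFormOfTitle",
     "Pat presents cataloguing for beginners"),
]

# One application of the rule: same subject and object, broader predicate.
dc_triples = [(s, SUB_PROPERTY_OF[p], o)
              for s, p, o in marc_triples if p in SUB_PROPERTY_OF]
# dc_triples now holds the two [DC] title statements: the beginnings of a
# DC record, or a title index, or both.
```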

Let’s add an RDA attribute and an ISBD attribute mapping to the mix:

“[RDA] ‘title proper’ – (is) sub-property of – [DC] ‘title’”
“[ISBD] ‘has title proper’ – (is) sub-property of – [DC] ‘title’”

The data triples:

“That resource – (has) [RDA] title proper – ‘Cataloguing for geeks’”
“Another resource – [ISBD] has title proper – ‘Does cataloguing have a future?’”

can generate the corresponding DC triples, and we end up with:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“That resource – (has) [DC] title – ‘Cataloguing for geeks’”
“Another resource – (has) [DC] title – ‘Does cataloguing have a future?’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

So now we have a title index to metadata from multiple heterogeneous sources. And the beginnings of a set of records in Dublin Core format.

Note that the attribute which is the sub-property must be entirely narrower in its semantics than the related super-property. If we create the mapping triple:

“‘title’ – (is) sub-property of – ‘varying form of title’”

then a reasoner will take the data triple “This resource – (has) title – ‘Cataloguing for beginners’” and generate:

“This resource – (has) 246 [varying form of title] – ‘Cataloguing for beginners’”

which is incorrect.

As a result, a data triple generated by a sub-property mapping triple is usually ‘dumber’ than the original data triple; detail is lost because the generated triple uses an attribute which is broader in meaning than the original. This ‘dumbing-up’ is necessary to produce interoperable metadata from different schemas – but data is not permanently lost because the original triple is still available for use in other applications. Needless to say, data triples created with broad attributes cannot be “smartened-down”, at least on their own.

The sub-property relationship can be chained. We can create a new attribute property, MARC21 ‘title’, which could be used in an application for making a title index to MARC21 records, as already mentioned. This new attribute is a super-property of all the MARC21 title-type attributes, and is also a sub-property of the DC ‘title’ attribute:

“[MARC21] ‘title statement’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘varying form of title’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘title’ – (is) sub-property of – [DC] ‘title’”

Doing this does not affect the previous mapping triples relating each MARC21 title-type attribute directly to the DC ‘title’ attribute, although it makes them redundant: this new set of mapping triples generates exactly the same data triples at the DC level from the MARC21 originals.
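The redundancy claim is easy to check in code. This Python sketch (again with illustrative prefixed names, not registered URIs) walks a sub-property chain upwards and shows that the chained and direct routes end at the same DC-level predicate:

```python
# Chained mappings (245 -> MARC21 title -> DC title) versus the earlier
# direct mapping (245 -> DC title). Prefixed names are illustrative only.

def closure(predicate, sub_property_of):
    """Follow the sub-property chain upwards from a predicate."""
    chain = [predicate]
    while chain[-1] in sub_property_of:
        chain.append(sub_property_of[chain[-1]])
    return chain

CHAINED = {"m21:titleStatement": "m21:title", "m21:title": "dc:title"}
DIRECT = {"m21:titleStatement": "dc:title"}

# Both routes entail the same DC-level predicate...
assert closure("m21:titleStatement", CHAINED)[-1] == "dc:title"
assert closure("m21:titleStatement", DIRECT)[-1] == "dc:title"
# ...but the chained route also supplies the intermediate MARC21 'title'
# predicate, useful for a title index to MARC21 records.
assert "m21:title" in closure("m21:titleStatement", CHAINED)
```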

Different applications can therefore re-use and, if necessary, augment the sub-property chains for each of the high-level core attributes found in most bibliographic metadata schemas, such as title, author/creator/agent, subject, target audience, etc. The chains form a net(work) of mappings, or map, which can automatically dumb up triples from any level of semantic granularity to any higher level.

We should only have to publish such maps or part-maps once, openly so that anyone can use them and add to them. If the professional communities develop the maps first, much effort will be saved and much authority imparted. This requires collaboration and action real soon now – the ISBD Review Group and the Joint Steering Committee for Development of RDA have started with the development of a mapping between the ISBD and RDA element sets.

These maps should remain valid forever, so the effort is worth expending. The original data triples use the original properties based on the schema attributes of the time, and they will remain valid “for their time”, in the same way that many catalogues are likely to contain records created under the Anglo-American Cataloguing Rules, with its ‘general material designation’ attribute, long after the successor standard RDA: resource description and access, with its ‘content type’ and ‘carrier type’ attributes, has been adopted.

And mappings from the MARC21 element sets will show, we hope, that it may not be necessary to convert the entire contents of every MARC21 record as a result of the Bibliographic Framework Transition Initiative!

But the professional communities lack a framework to help them collaborate as a super-community. A network of mappings is more (socially) efficient than an aggregation of one-to-one mappings between pairs of schemas. We need (name)spaces to add intermediary attribute properties and publish the mappings; we need protocols for managing semantic change as schemas evolve; we need lightweight protocols for authorizing mappings; we need systems for ensuring the long-term preservation of RDF element sets and mapping triples.

By Gordon Dunsire, May 12, 2012, 1:17 pm (UTC-5)

Why create “a separate property for every combination of tag, two indicators, and subfield”, as I mentioned in Low-hanging MARC fruit?

As Karen says in her comment on Getting to higher MARC branches, in the case of the Target audience data element there is no need for the RDF properties defined for each MARC21 “resource type” when the Target audience value vocabulary can be linked directly to the property defined at the higher level of the MARC21 resource.

The data element is one of several in the MARC 21 format for bibliographic data that are “defined the same in the specifications for more than one type of material” and share the same value vocabulary; establishing this requires inspection of the vocabulary for each type, as there seems to be no explicit indication in the text. We have used the prefix “common” in the vocabulary URIs to reflect this, as in the value URI of the object of the ex:1 data triple (“commonaud#j”). Examples include commonaud (MARC21-008: Target audience) itself, as well as commonfor (MARC21-008: Form of item) and commonnat (MARC21-008: Nature of contents).

Despite this, I think it is worth including the 006 equivalents of the Target audience element in the RDF graph, as they represent “special aspects of the item being cataloged that cannot be coded in field 008”. This ensures that any semantic relationship between the 006 and 008 fixed-length data elements is preserved in the data. The semantic relationship between “type” and “form” of material in the 006/008 relationship table is not specified, so I assume that it will be incorporated, along with usage guidelines, etc., in one or more application profiles. Application profiles can also ensure that values from the common Vocabulary Encoding Scheme are preserved in data triples entailed during dumb-up to the level of the resource as a whole.

This illustrates one of the intended benefits of the level 0 properties:

  • Maximum flexibility for inclusion in application profiles and mappings to external namespaces, of which m21plus is just an example.

Other benefits include:

  • Support of very simple algorithms for the production of data triples from existing MARC21 records. There is only a single decision point, based on the data in Leader 06 of the record.
  • Support for round-tripping data triples back to MARC21 encoded records.

Disadvantages of the level 0 properties include:

  • The need to create super-properties to support data triples at the level of the whole resource and for aggregated statements.
  • The need to add mappings between level 0 and super-properties.

We have developed a full set of m21plus super-properties for the fixed-length data fields, with rdfs:subPropertyOf mappings from the level 0 properties, and hope to load them into the OMR real soon now for use by applications.

We also expect to find more benefits and disadvantages in using the level 0 properties as we continue to investigate the MARC21 variable data fields.


By Gordon Dunsire, April 30, 2012, 3:37 am (UTC-5)

Netting more MARC fruit discussed the use of the rdfs:subPropertyOf property to allow MARC21 data to interoperate with triples based on similar properties in element sets from other metadata schemas. We can show the procedure in action using the Target audience property described in Getting to higher MARC branches.

A quick examination of the namespaces for Dublin Core terms, the FRBR entity-relationship model, ISBD, and RDA reveals properties for intended or target audience in each. The FRSAD (FR for Subject Authority Data) namespace also has a property labelled “has audience”, but it refers to the class Nomen which contains labels and identifiers for subject topics; it is about subject headings and classifications, not the resource itself, so has a completely different context and is not suitable for interoperating with MARC21’s Target audience data.

The relevant triples for each property are:

Dublin Core terms:

@prefix dct: <http://purl.org/dc/terms/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
dct:audience rdfs:comment "A class of entity for whom the resource is intended or useful." .
dct:audience rdfs:label "audience" .
dct:audience rdfs:range dct:AgentClass .

FRBR:

@prefix frbrer: <http://iflastandards.info/ns/fr/frbr/frbrer/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
frbrer:P3006 skos:definition "Relates a work to the class of user for which the work is intended, as defined by age group, educational level, or other categorization." .
frbrer:P3006 rdfs:label "has intended audience" .
frbrer:P3006 rdfs:domain frbrer:C1001 .

ISBD:

@prefix isbd: <http://iflastandards.info/ns/isbd/elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
isbd:P1091 skos:definition "Relates a resource to a note providing non-evaluative information as to the potential or recommended use of the resource and/or the intended audience." .
isbd:P1091 rdfs:label "has note on use or audience" .
isbd:P1091 rdfs:domain isbd:Resource .

RDA:

@prefix rda: <http://rdvocab.info/Elements/>.
@prefix rdafrbr: <http://rdvocab.info/uri/schema/FRBRentitiesRDA/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix unc: <http://.../>.
rda:intendedAudienceWork skos:definition "The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization." .
rda:intendedAudienceWork rdfs:label "Intended audience (Work)" .
rda:intendedAudienceWork rdfs:domain rdafrbr:Work .
rda:intendedAudienceWork rdfs:subPropertyOf unc:intendedAudience .

Unconstrained RDA:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix unc: <http://.../>.
unc:intendedAudience skos:definition "The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization." .
unc:intendedAudience rdfs:label "Intended audience" .

And for our high-level MARC21 property:

@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
m21plus:M00Aud skos:definition "The intellectual level of the target audience for which the material is intended." .
m21plus:M00Aud rdfs:label "Target audience" .

The MARC21 property has a narrower definition than any of the others. Taking into account the domain and range constraints leaves only the unconstrained RDA property as a candidate super-property:

m21plus:M00Aud rdfs:subPropertyOf unc:intendedAudience .

This allows triples from MARC21 records to have entailments which use the same unconstrained property as triples from RDA records; the FRBR-constrained RDA property is already a sub-property of the unconstrained property.

RDF graphs of MARC21 and RDA data triples and entailments.

Of course, the original MARC21 triple is itself an entailment from a level 0 triple.

We can use the same procedure to align and map the other properties we found for target or intended audience. The resulting RDF ontology is:

@prefix dct: <http://purl.org/dc/terms/>.
@prefix frbrer: <http://iflastandards.info/ns/fr/frbr/frbrer/>.
@prefix isbd: <http://iflastandards.info/ns/isbd/elements/>.
@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rda: <http://rdvocab.info/Elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix unc: <http://.../>.
dct:audience rdfs:subPropertyOf unc:intendedAudience .
frbrer:P3006 rdfs:subPropertyOf unc:intendedAudience .
m21plus:M00Aud rdfs:subPropertyOf unc:intendedAudience .
rda:intendedAudienceWork rdfs:subPropertyOf unc:intendedAudience .
unc:intendedAudience rdfs:subPropertyOf unc:P1091 .
isbd:P1091 rdfs:subPropertyOf unc:P1091 .

Note that the ISBD property is broader in definition than the unconstrained RDA property, but is itself constrained by its domain. So we need an unconstrained version of the ISBD property, which has the broadest semantics of all the related properties.

RDF graph of Target/intended audience ontology

This ontology allows all the data about the intended audience of a resource to be available as a single, common attribute; our MARC fruit is added to the basket automatically, once it has been plucked as a level 0 triple.
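A rough sketch of that automatic basket-filling in Python, for the curious: the prefixed names stand in for the full URIs of the Turtle above, and the audience value (‘juvenile’) is an arbitrary illustrative literal.

```python
# The Target/intended audience ontology as sub-property mappings.
SUB_PROPERTY_OF = {
    "dct:audience": "unc:intendedAudience",
    "frbrer:P3006": "unc:intendedAudience",
    "m21plus:M00Aud": "unc:intendedAudience",
    "rda:intendedAudienceWork": "unc:intendedAudience",
    "unc:intendedAudience": "unc:P1091",
    "isbd:P1091": "unc:P1091",
}

def entailed_triples(triple, sub_property_of):
    """Dumb a data triple up through every broader property."""
    s, p, o = triple
    triples = [triple]
    while p in sub_property_of:
        p = sub_property_of[p]
        triples.append((s, p, o))
    return triples

# A MARC21 Target audience triple is entailed at both broader levels:
audience = entailed_triples(("ThisResource", "m21plus:M00Aud", "juvenile"),
                            SUB_PROPERTY_OF)
```

Data recorded against any of the six properties ends up available, via the same walk, under the single common attribute at the top of the chain.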

By Gordon Dunsire, April 23, 2012, 4:14 pm (UTC-5)