Metadata is produced and stored locally, published globally, consumed and aggregated locally, and finally integrated and stored locally. This is the create/publish/consume/integrate cycle.

Providing a framework for managing metadata throughout this cycle is the goal of the Dublin Core Abstract Model and the Dublin Core Application Profile (DCAM/DCAP).

The basic guidelines and requirements for this cycle are:

  • Metadata MUST be syntactically VALID and semantically COHERENT when it’s CREATED and PUBLISHED.
  • Globally PUBLISHED Metadata SHOULD be TRUE, based on the domain knowledge of the publisher.
  • PUBLISHERS of metadata MUST publish the semantics of the metadata, or reference publicly available semantics.
  • CONSUMERS of published metadata SHOULD assume that the global metadata is locally INVALID, INCOHERENT, and UNTRUE.
  • CONSUMED metadata MUST be checked for syntactic validity and semantic coherence before integration.
  • AGGREGATED metadata SHOULD indicate its PROVENANCE.
  • CONSUMED metadata MAY be considered TRUE based on its PROVENANCE.
  • Locally INTEGRATED global metadata MUST be syntactically VALID and semantically COHERENT, and SHOULD be TRUE based on local standards.
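As a rough illustration of the consume and integrate steps, the sketch below re-checks every consumed statement locally and keeps its provenance. Everything in it is an assumption made for illustration: the `Integrated` record, the `integrate` function, and the caller-supplied `is_valid` and `is_coherent` checks are hypothetical names and are not part of DCAM/DCAP.

```python
from typing import Callable, Iterable, List, NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class Integrated(NamedTuple):
    """A locally accepted statement (hypothetical structure)."""
    triple: Triple
    provenance: str  # where the statement was consumed from
    trusted: bool    # MAY be considered true, based on provenance


def integrate(
    consumed: Iterable[Triple],
    source: str,
    is_valid: Callable[[Triple], bool],
    is_coherent: Callable[[Triple], bool],
    trusted_sources: Set[str],
) -> List[Integrated]:
    """Check consumed statements against local rules and record provenance."""
    accepted = []
    for triple in consumed:
        # Consumers assume nothing: each statement is re-checked locally
        # for syntactic validity and semantic coherence before integration.
        if is_valid(triple) and is_coherent(triple):
            accepted.append(
                Integrated(triple, source, source in trusted_sources)
            )
    return accepted
```

The two checks are passed in as functions because what counts as valid and coherent is a local decision; the sketch only fixes the order in which the guidelines are applied.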

The DCAM takes the RDF data model as its base because of that model's simplicity and flexibility. The DCAM refines the RDF data model in order to support syntactic interoperability between the RDF data model and non-RDF data models.

A DCAP defines a distinct subset of the set of all things and defines the domain-specific knowledge of the properties of those things, and the relationships those things have to other things. It expresses that knowledge through the DCAM and a set of related documentation (the Singapore Framework). A complete DCAP should provide the necessary domain-specific infrastructure to fully support the create/publish/consume/integrate cycle for any system, any domain, and any data model. A DCAP based on the DCAM should be able to be used by a machine to generate a valid data model in any modeling syntax, and any modeling syntax should be able to be expressed as a DCAP.

We strongly recommend that metadata be published and consumed using the RDF data model. The strength of the RDF data model for publishing and aggregating metadata lies in its extreme flexibility of expression, its self-describing semantics, and its assumption of validity. These strengths become weaknesses when creating and validating local metadata that must conform to a set of local restrictions. An effective DCAP will provide metadata creators with the ability to create locally valid metadata, publish it as RDF, validate consumed RDF against local standards, and ultimately integrate global metadata into the local knowledge base that contains local descriptions of the things being described.

There are many systems, many data models, many publish and subscribe models, many storage and validation models. There are many paths to integration. There are very few that provide a generic and neutral method for modeling and documenting inter-system metadata integration. The DCAM/DCAP specification has the potential to be one of those few.

By Jon, July 19, 2012, 11:54 am (UTC-5)

By-passing taggregations identifies those MARC21 variable data field tags whose level 0 RDF properties, representing individual subfields, can be used generally in data triples and mapping triples without losing information and the semantic coherency of the record. These tags have subfields which are independent of one another, with no need to keep them together if the tag is repeated in a record.

RDF graph of Music format ontology


This graph is another example of adding MARC fruit to the cornucopia by using the sub-property ladder. The MARC21 property used in the graph is a level 0 element. The MARC21 property’s description is equivalent to the ISBD property’s description, as hinted at in the ISBD scope note, but an unconstrained version is used to avoid the ISBD property’s domain. We could use the Web Ontology Language (OWL) ontological property owl:equivalentProperty to represent the relationship between the properties, but we can also use the rdfs:subPropertyOf property by applying it in both directions. That is, if two properties P1 and P2 are related as P1 rdfs:subPropertyOf P2 AND P2 rdfs:subPropertyOf P1, then P1 owl:equivalentProperty P2 and P2 owl:equivalentProperty P1.
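The two-way sub-property test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the property names `ex:P1` and `ex:P2` are placeholders, and the functions `superproperties` and `equivalent` are hypothetical helpers, not part of any RDF library.

```python
from typing import Dict, Set, Tuple

Mapping = Tuple[str, str]  # (sub-property, super-property)


def superproperties(mappings: Set[Mapping]) -> Dict[str, Set[str]]:
    """Return each property's super-properties, following chains."""
    closure: Dict[str, Set[str]] = {}
    for sub, sup in mappings:
        closure.setdefault(sub, set()).add(sup)
        closure.setdefault(sup, set())
    changed = True
    while changed:  # repeat until no new super-property is found
        changed = False
        for supers in closure.values():
            inherited: Set[str] = set()
            for s in supers:
                inherited |= closure[s]
            if not inherited <= supers:
                supers |= inherited
                changed = True
    return closure


def equivalent(p1: str, p2: str, mappings: Set[Mapping]) -> bool:
    """True if each property is a sub-property of the other."""
    closure = superproperties(mappings)
    return p2 in closure.get(p1, set()) and p1 in closure.get(p2, set())


both_ways = {("ex:P1", "ex:P2"), ("ex:P2", "ex:P1")}
one_way = {("ex:P1", "ex:P2")}
print(equivalent("ex:P1", "ex:P2", both_ways))  # mapped in both directions
print(equivalent("ex:P1", "ex:P2", one_way))    # mapped in one direction only
```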

Unfortunately, it is unsafe to use level 0 properties for dependent subfields in repeatable tags in this way, even if a specific record contains only one instance of such a tag. Triples from that record will cluster with triples from another record about the same bibliographical resource, either by sharing the same resource URI or an equivalence link. Taggregations are required to avoid semantic confusion; otherwise we wouldn’t know which “Materials specified” goes with which “Place of publication …” or “Name of publisher …” in the publication statement example given in Taggregations.

 

By Gordon Dunsire, July 18, 2012, 9:44 am (UTC-5)

If we were asked (as we sometimes are) what we’d like to see develop as a result of the BibFrame effort, the emphasis in our answer would have both technical and social aspects.

First, given the technologies developing in several different places and considering what we can do now to bring Linked Open Data into our somewhat closed world, some concrete suggestions: We have the ability to share metadata expressed not just in a single common ‘vocabulary’, but to share it using many different vocabularies, expressed and distributed using RDF, OWL/RDFS, RDFa, Microdata, and other tools; we have methods of specifying the use of these ‘semantic’ building blocks (DC Application Profiles and emerging provenance specifications from W3C and DCMI) that allow machines to use, process and distribute data in ways that do not require a central enabling node; and finally, we have technologies and strategies in place to map between existing and prospective metadata schemas that flatten the fences between communities quite thoroughly.

This is the post-MARC world of library-related metadata, where the format native to our metadata is no longer its most important characteristic; there’s no lossy crosswalking requirement to transform data to serve different needs; and there are fewer (eventually no) barriers to sharing. The tremendous value that MARC represents–the semantics built over many decades in response to an enormous corpus of use cases (all of which are fully documented on the MARBI site)–continues to be vitally important as we move into a different arena. But the mid-20th century requirements that dictated the MARC syntax, and the constricting consensus model that has been required to maintain it, no longer apply to the current and future requirements of the global library community.

The usual caveats apply here–some of these technologies aren’t entirely ready for prime time, but in the world we live in, ‘finished’ is more likely to be used for something defunct rather than a goal for standards and tool development. The areas we’re personally most familiar with (no surprises here) are the complementary domains of vocabulary management and mapping. The first of those is up and running (although the ‘continuous improvement’ engine is going full bore) as the Open Metadata Registry; the second, mapping, is an important interest of ours and has generated papers, articles, and presentations (see below for a selection), not to mention the usual plethora of posts to blogs (including this one) and discussion lists. We believe that the OMR and mapping capabilities under development work together to enable legacy data to shift into the open linked data world efficiently, effectively, and with a minimum of loss, in the process enriching the useful data options available for everyone.*

Often overlooked in the glitter of technology is the possibility of pulling together the communities that were torn asunder as a result of past technical limitations (and other reasons). In our common past, the library community built their data sharing conventions on a platform based on the consensus use of AACR2 and MARC. Some library communities–law and music the most prominent–were willing to compromise in order to work within the larger library community rather than strike out on their own. Others, like the art library and museum community and the archivists, did make the break, and have developed data standards that meet their needs better than the library consensus was able to do. But those specialized standards have resulted in yet more silos, in addition to the huge MARC one.

If the intent is to ‘replace MARC’ (as it is at some level), it’s about re-placing MARC and its built-in limitations from the center of our world back into the crowd, where it becomes one-among-many. Another value of that shift is the ability to expand the data sharing environment MARC enabled a half century ago to a broader community of interest that includes museums, publishers, and archives. Meeting the goal of dissolving those silos and making our data easily understandable and reusable in a host of ways will help initiate that ‘Web of Data’ we’ve been anticipating for many years. As Eric Miller explained so well in his Anaheim BibFrame update: by moving towards a linked data world we actually look well beyond the Library/Archives/Museum worlds by definition–it’s a very big world out there. But by collaborating with the LAM community as a whole to get there, we reap a great many benefits, not the least of which are perspectives that are much broader in some significant ways than ours. Limiting our view merely to a ‘MARC Linked Data Model’ might be an important beginning step, but it falls short of where our vision needs to extend.

And the fact is that MARC will not be going away for a long time, if ever. There will be a lot of variation in how the transition is done by libraries, depending on institutional support, short term and long term needs, and existing partnerships. The process of moving MARC into the linked data world has already started. RDA and its RDF vocabularies were a start, as is the development of a complete RDF version of MARC, located at marc21rdf.info. Several years’ worth of pre-conferences, presentations and discussions, at ALA and beyond, have prepared the soil for these changes. But we need a plan, and some concrete steps to take–steps that include the groups who have been working in the trenches without a great deal of support, but making progress regardless. The BibFrame effort needs to be more than a playground for the technologists, because in most instances, the technology is not what’s holding us back–it’s the institutional inertia and the difficulties of finding ways forward that don’t pit us against one another. The plan we need balances the technical and social, the quick-win with the long-term momentum, and the need for speed with the public discussion that takes time and builds buy-in.

What we have in our sights is an opportunity to reverse the long term trend towards balkanized metadata communities and to make the future one for which there are fewer fences between, and more data exchanged among these three communities with obviously similar challenges and interests. We think the time has come to use the vastly changed technology environment to do that.

*It would be too easy for a response to this post to be in the form of “Oh, you’re just tooting your own horn here,” and indeed we are in some measure doing that. But we do this work because we believe it’s important. We don’t believe it’s important merely because it’s what we do. We believe in the value of the work we’ve done and will do, and we see a great deal of relevance for it as part of the BibFrame discussions.

Selection of papers and presentations:

Jon Phipps’ presentation on mapping at ALA Anaheim
A Reconsideration of Mapping in a Semantic World / Gordon Dunsire, Diane Ileana Hillmann, Jon Phipps, Karen Coyle
Gordon Dunsire’s presentation at the recent London meeting “RDA, 5 years on”

Post written by Diane Hillmann and Jon Phipps.

By Diane Hillmann, July 4, 2012, 2:35 pm (UTC-5)

The methodology of treating MARC21 variable data field tags as aggregated statements in RDF is discussed in Taggregations. There are some circumstances when this approach is redundant and level 0 RDF properties based on individual subfields can be used directly in MARC21 data triples that are semantically complete and coherent, and in mapping triples relating MARC21 to other metadata schema.

We can by-pass the need for an aggregated statement when there are no semantic dependencies between the subfields of a tag, and thus net more low hanging MARC fruit. The most obvious case is when there is only one subfield in the tag: the contents of the tag are the same as the contents of the subfield. Note that the repeatable status of a tag is, generally, not relevant as there is no intrinsic semantic dependence between multiple occurrences of a tag in a record.

There appear to be no MARC21 tags with just one subfield, but I think it is reasonable (at this stage of analysis) to ignore the Linkage subfield ($6) and the Field link and sequence number subfield ($8), although further investigation is required.

Disregarding $6 and $8, the following tags have a single subfield and no indicators:

  • 018
  • 025
  • 038
  • 042
  • 254
  • 256
  • 263
  • 306
  • 508
  • 515
  • 525
  • 547
  • 550
  • 580

[Note: Linking Entry Fields (76X-78X) are excluded from this analysis, for the time being.]

For example, the level 0 property (m21:M515__a) for subfield a of tag 515 “Numbering Peculiarities Note” can be used directly for a data triple.

RDF graph of repeated instances of MARC21 tag 515


In this example, a MARC21 record for a resource (ex:1) has two occurrences of tag 515, and the same level 0 property can be used directly to express the data in RDF, without the need for an aggregated statement property for the tag.

Disregarding $6 and $8, the following tags have a single subfield and use one or both indicators:

  • 384
  • 511
  • 516
  • 522
  • 567
  • 653

Thus the appropriate level 0 property for subfield a of tag 384 “Key” can also be used for data triples. There are three properties available (m21:M384__a, m21:M3840_a, m21:M3841_a), each based on a different first indicator value (#, 0, 1).

RDF graph of separate instances of MARC21 tag 384 with different first indicators


In this example, the MARC21 records for three different resources (ex:1, ex:2, ex:3) have an occurrence of tag 384, but with different values for the first indicator.
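The property names used above suggest a simple naming pattern: the letter M, then the tag, the first indicator, the second indicator, and the subfield code, with an underscore standing in for a blank indicator. The sketch below builds names on that assumption. The pattern is inferred from the examples in this post (m21:M515__a, m21:M384__a, m21:M3840_a, m21:M3841_a) and the function `level0_property` is a hypothetical helper, so the registered properties remain the authority.

```python
def level0_property(tag: str, ind1: str, ind2: str, subfield: str) -> str:
    """Build a level 0 property name such as 'm21:M3840_a'.

    Assumed pattern: 'M' + tag + indicator 1 + indicator 2 + subfield code,
    where a blank indicator (written ' ', '#' or '') becomes '_'.
    """
    def slot(indicator: str) -> str:
        return "_" if indicator in ("", " ", "#") else indicator

    return f"m21:M{tag}{slot(ind1)}{slot(ind2)}{subfield}"


# Tag 515 has no indicators; tag 384 varies on the first indicator.
for ind1 in ("#", "0", "1"):
    print(level0_property("384", ind1, "#", "a"))
print(level0_property("515", "#", "#", "a"))
```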

A few tags have multiple subfields with no apparent semantic dependencies between them:

  • 010
  • 027
  • 030
  • 035
  • 040*
  • 066
  • 074
  • 088

For example, the level 0 properties for subfield a and subfield z of tag 088 “Report Number” can be used directly for data triples because there is no need to keep a report number ($a) together with a cancelled or invalid report number ($z).

RDF graph of separate subfields of MARC21 tag 088
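Putting the three lists together, a converter could emit level 0 data triples directly for by-passable tags and leave every other tag for an aggregated statement. The following is a minimal sketch under stated assumptions: the field representation, the `bypass_triples` function, and the inline naming pattern are all hypothetical, the report numbers in the usage example are invented, and tag 040 is deliberately left out because of the unresolved question about repeated $d noted below.

```python
from typing import Iterable, List, Tuple

# (tag, indicator 1, indicator 2, [(subfield code, value), ...])
Field = Tuple[str, str, str, List[Tuple[str, str]]]
Triple = Tuple[str, str, str]

# Tags listed in this post as having no dependencies between subfields.
BYPASS_TAGS = {
    "018", "025", "038", "042", "254", "256", "263", "306",
    "508", "515", "525", "547", "550", "580",  # one subfield, no indicators
    "384", "511", "516", "522", "567", "653",  # one subfield, indicators
    "010", "027", "030", "035", "066", "074", "088",  # independent subfields
}
IGNORED_SUBFIELDS = {"6", "8"}  # Linkage; Field link and sequence number


def bypass_triples(resource: str, fields: Iterable[Field]) -> List[Triple]:
    """Emit level 0 data triples for by-passable tags only."""
    triples = []
    for tag, ind1, ind2, subfields in fields:
        if tag not in BYPASS_TAGS:
            continue  # this tag needs an aggregated statement instead
        for code, value in subfields:
            if code in IGNORED_SUBFIELDS:
                continue
            # Assumed naming pattern; a blank indicator becomes '_'.
            prop = f"m21:M{tag}{ind1.strip() or '_'}{ind2.strip() or '_'}{code}"
            triples.append((resource, prop, value))
    return triples


record = [
    ("088", " ", " ", [("a", "R-1"), ("z", "R-0")]),  # invented values
    ("260", " ", " ", [("a", "Edinburgh :"), ("b", "Castle Press,")]),
]
for triple in bypass_triples("ex:1", record):
    print(triple)
```

Only the two 088 subfields produce triples here; the 260 field is skipped because its subfields depend on one another.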


[* It is not clear from the MARC Bibliographic documentation for tag 040 if the sequence of repeats of subfield d ("Modifying agency") is significant; that is, if there is semantic dependency between repeated instances of this subfield. There is a hint that sequence is significant: "Subfield $d is not repeated when the same MARC code or name would occur in adjacent $d subfields" (emphasis in original). This seems to imply that successive modifying agencies are added in sequence, with the inference that the last has modified the record as it stood after the previous modification. But the only example given of a repeat of the subfield says "modified by ... and by ..." (rather than "... and then by ..."), implying that sequence does not matter. Such ambiguity in the documentation of a schema used by many thousands of libraries to create millions of bibliographic records is not helpful.]

 

 

By Gordon Dunsire, June 7, 2012, 8:20 am (UTC-5)

The technique described in Using the sub-property ladder works well to “dumb-up” raw, level 0 data from MARC21 fixed-length data fields to interoperate with metadata from other schemas. Unfortunately, it cannot be used with most MARC21 variable data fields (tags) and subfields. We cannot simply dumb-up a subfield to the level of its parent tag because most tags have more than one subfield; the meaning of a tag is a combination of the meanings of its subfields and tag-level data is a composite of subfield-level data.

There is another technique we can use to bridge the semantic gap between a subfield and its tag: tags generally can be treated as “aggregated statements”, where the value of a tag is a literal string, or statement, which is composed of the values of subfields.

For example, a record may contain a tag 260 (Publication, Distribution, etc.) with subfield a (Place of publication, distribution, etc.) = “Edinburgh :”, subfield b (Name of publisher, distributor, etc.) = “Castle Press,”, and subfield c (Date of publication, distribution, etc.) = “2012.”. The contents of the tag, “$aEdinburgh :$bCastle Press,$c2012.” can be turned into a tag-level value, “Edinburgh : Castle Press, 2012.”, by substituting a space for each subfield indicator ($) and code pair. We can then use a tag-level property with the label “Publication, Distribution, etc. (Imprint)” and URI “m21plus:T260” to publish the metadata statement “This resource – has Publication, Distribution, etc. – ‘Edinburgh : Castle Press, 2012.’” as an RDF triple.
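The substitution described above is small enough to sketch directly. The function name `aggregate_statement` is hypothetical, and the sketch assumes that subfield codes are a single lowercase letter or digit and that "$" never occurs inside a subfield value.

```python
import re


def aggregate_statement(tag_content: str) -> str:
    """Replace each '$' + subfield code pair with a space.

    The leading space left by the first subfield is trimmed so that the
    result matches the tag-level value given in the text.
    """
    return re.sub(r"\$[a-z0-9]", " ", tag_content).strip()


statement = aggregate_statement("$aEdinburgh :$bCastle Press,$c2012.")
print(statement)  # Edinburgh : Castle Press, 2012.

# The resulting data triple, written as a plain tuple.
triple = ("ex:1", "m21plus:T260", statement)
```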

The instructions for deriving the tag-level value or aggregated statement from the subfield values are known as a syntax encoding scheme (SES). This is part of the Dublin Core abstract model, allowing specific SESs to be used in an application profile. There can be many different ways of deriving the value; the example above works because MARC21 subfields contain embedded punctuation that delineates the component parts when the subfield encoding is removed. This simple SES allows a MARC21 record to conform to the syntax prescribed by the International Standard Bibliographic Description (ISBD) for compound statements. Unfortunately, this makes it difficult to apply any other SES to the subfields without first removing the punctuation.

It would be much better if the instructions for adding ISBD punctuation to MARC21 data were embedded in an SES. Then a different SES could produce “Published in 2012 by Castle Press in Edinburgh” rather than “Published in 2012. by Castle Press, in Edinburgh :”. This is the approach taken by ISBD itself, and there is clearly an opportunity here for collaboration between the MARC21 and ISBD communities. The same approach is envisaged for RDA.
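To make the difficulty concrete, here is a sketch of a different SES applied to the same tag 260 content. It has to strip the embedded ISBD punctuation from each subfield before it can build its own statement. The function names are hypothetical, and the punctuation stripping is a crude assumption: it would, for instance, also remove the final full stop of an abbreviation such as "Inc.".

```python
import re
from typing import List, Tuple


def parse_subfields(tag_content: str) -> List[Tuple[str, str]]:
    """Split '$aEdinburgh :$bCastle Press,' into (code, value) pairs."""
    return re.findall(r"\$([a-z0-9])([^$]*)", tag_content)


def published_statement(tag_content: str) -> str:
    """An alternative SES: 'Published in <date> by <name> in <place>'."""
    values = {
        # Remove the trailing ISBD punctuation embedded in the subfield.
        code: value.strip().rstrip(" :;,.")
        for code, value in parse_subfields(tag_content)
    }
    return f"Published in {values['c']} by {values['b']} in {values['a']}"


print(published_statement("$aEdinburgh :$bCastle Press,$c2012."))
# Published in 2012 by Castle Press in Edinburgh
```

If the subfields held unpunctuated values, the `rstrip` step would disappear and each SES would simply add the punctuation or wording it needs.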

The aggregated statement technique is also very useful when a MARC tag is repeated. Using tag 260 again as an example, a record may contain multiple publication statements for intervening publishers, where the tag’s first indicator has value “2”. If there are two such tags, then there may be two or more publication places and two or more publisher names, for example “$32001-2005$aEdinburgh :$bMudhut Publishing” and “$32006-$aEdinburgh :$bCastle Press” (subfield 3 is for Materials specified). A linked data representation of the record needs to keep the places, names, and dates correctly associated so that they don’t get mixed up, for example “Mudhut Publishing” with “2006-”. The tag-level RDF property (m21plus:T260) can be used with an aggregated statement to keep the level 0 data associated with the correct repeat of the tag, avoiding the use of blank nodes in the RDF graph of a specific record.

RDF graph of MARC21 Publication statement data


As the graph shows, the two Publication statements must have URIs so that they can link to the correct subfield values. The URIs identify the literal strings of the aggregated statements, and are instances of an SES; all SESs are sub-classes of the class of literal strings. A blank node, on the other hand, has no URI and uses a local identifier to make the links; such links appear broken in a non-local environment.
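A rough sketch of that arrangement follows, with triples written as plain tuples. Several details are assumptions rather than anything prescribed here: the statement URI pattern (`ex:1/T260/1`), the level 0 property names for a first indicator of "2" (`m21:M2602_a` and so on), and the helper names are all hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def statement_triples(
    resource: str, occurrences: List[Dict[str, str]]
) -> List[Triple]:
    """Emit triples for repeated tag 260 fields with first indicator '2'."""
    triples = []
    for n, subfields in enumerate(occurrences, start=1):
        statement = f"{resource}/T260/{n}"  # hypothetical URI pattern
        # The resource links to the statement, not to the subfield values.
        triples.append((resource, "m21plus:T260", statement))
        for code, value in subfields.items():
            # Subfield values hang off the statement URI, so each value
            # stays with the repeat of the tag it came from.
            triples.append((statement, f"m21:M2602_{code}", value))
    return triples


def values_of(triples: List[Triple], statement: str) -> Dict[str, str]:
    """Collect the level 0 values attached to one statement URI."""
    return {p: o for s, p, o in triples if s == statement}


graph = statement_triples(
    "ex:1",
    [
        {"3": "2001-2005", "a": "Edinburgh :", "b": "Mudhut Publishing"},
        {"3": "2006-", "a": "Edinburgh :", "b": "Castle Press"},
    ],
)
print(values_of(graph, "ex:1/T260/1"))
print(values_of(graph, "ex:1/T260/2"))
```

Because each value is attached to a statement URI rather than to the resource, “Mudhut Publishing” can only be read together with “2001-2005”.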

To sum up, it seems useful to represent MARC21 tags as RDF properties associated with a syntax encoding scheme. We intend to add these properties to the Open Metadata Registry. Specific encoding schemes can then be assigned using an application profile. There must be many examples of instructions for processing tag subfields for output and display which can form the basis of suitable encoding schemes.

By Gordon Dunsire, May 20, 2012, 5:08 pm (UTC-5)

I discussed the utility of the sub-property relationship in Getting to higher MARC branches, Netting more MARC fruit, and Adding MARC fruit to the cornucopia. Coincidentally, Bob DuCharme posted Simple federated queries with RDF which outlines the same technique and provides additional information on its use for resource discovery. Those posts are somewhat technical, and I tried to lighten up in my presentation Turtle dreaming at the recent Dublin Core Metadata Initiative (DCMI) seminar Five years on. This post is another attempt to demonstrate in a non-technical way (I hope) how useful and powerful the sub-property relationship can be.

A metadata attribute, like ‘title’, that is to be used for linked data in the Semantic Web is usually represented in Resource Description Framework (RDF) as a property. A property can be used as the predicate part of a triple: “Subject – predicate – object”, where ‘Subject’ is what the triple is about (e.g. a resource), ‘predicate’ is the aspect of the subject, and ‘object’ is the value of that aspect for the specified subject. For example:

“This resource – (has) title – ‘Using the sub-property ladder’”

is a single metadata statement in triple format. We can think of this as conforming to the triple template:

“Specified resource – (has) attribute – value”.

Note that prefixing the predicate with ‘has’ turns it into a verbal phrase and renders the statement in (near) natural language.

We can also make meta-metadata statements in triple format. These are ‘data about metadata’ rather than ‘data about data’, and are often referred to as ontological triples to distinguish them from data triples such as the example above. The triple template for one type of meta-metadata statement is:

“Specified RDF element – (has) relationship – Other specified RDF element”

Note that a relationship between metadata elements is also represented in RDF as a property. In particular, ‘sub-property’ is a pre-defined relationship between two RDF property elements, giving the ontological triple:

“Property 1 – (is) sub-property of – Property 2”

Furthermore, such relationships can embed semantic rules that can be processed automatically by software known as ‘semantic reasoners’ or just plain ‘reasoners’. The rule embedded in the sub-property relationship is: If “P1 – (is) sub-property of – P2”, then any data triple using P1 as its predicate can generate another data triple using P2 as its predicate, with the same subject and object. Let’s call this kind of ontological triple a mapping triple, because it effectively maps one property to another.

Suppose we have two attributes ‘title’ and ‘varying form of title’. I can create the mapping triple:

“‘varying form of title’ – (is) sub-property of – ‘title’”.

If we have a data triple:

“This resource – (has) varying form of title – ‘Pat presents cataloguing for beginners’”

then a reasoner will automatically generate the data triple:

“This resource – (has) title – ‘Pat presents cataloguing for beginners’”
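For readers who prefer code to prose, the rule can be sketched as a toy ‘reasoner’ in Python. This is an illustration only: triples are plain tuples, the attribute labels stand in for real property URIs, and `apply_sub_property` is a hypothetical function that applies the rule once rather than a real reasoner.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Mapping = Tuple[str, str]       # (sub-property, super-property)


def apply_sub_property(data: Set[Triple], mappings: Set[Mapping]) -> Set[Triple]:
    """Return the data triples plus those generated by the mapping triples."""
    result = set(data)
    for subject, predicate, obj in data:
        for sub, sup in mappings:
            if predicate == sub:
                # Same subject and object, broader predicate.
                result.add((subject, sup, obj))
    return result


mappings = {("varying form of title", "title")}
data = {
    ("This resource", "varying form of title",
     "Pat presents cataloguing for beginners"),
}
for triple in sorted(apply_sub_property(data, mappings)):
    print(triple)
```

The original triple is kept alongside the generated one, which matters later: nothing is lost by generating the broader statement.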

In a similar way, we can create the mapping triple:

“‘title statement’ – (is) sub-property of – ‘title’”

and from the data triple:

“This resource – (has) title statement – ‘Cataloguing for beginners’”

generate:

“This resource – (has) title – ‘Cataloguing for beginners’”

So what? Further suppose that the ‘title’ attribute is from the DCMI metadata terms, and the ‘varying form of title’ and ‘title statement’ attributes are from the MARC21 tags 245 and 246. So a MARC21 record for the resource might contain the set of data triples:

“This resource – (has) 245 [title statement] – ‘Cataloguing for beginners’”
“This resource – (has) 246 [varying form of title] – ‘Pat presents cataloguing for beginners’”

A reasoner can generate the set of data triples:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

In other words, we have generated a DC record from a MARC21 record. Or we have generated a title index for the MARC21 record. Or both.

Let’s add an RDA attribute and an ISBD attribute mapping to the mix:

“[RDA] ‘title proper’ – (is) sub-property of – [DC] ‘title’”
“[ISBD] ‘has title proper’ – (is) sub-property of – [DC] ‘title’”

The data triples:

“That resource – (has) [RDA] title proper – ‘Cataloguing for geeks’”
“Another resource – [ISBD] has title proper – ‘Does cataloguing have a future?’”

can generate the corresponding DC triples, and we end up with:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“That resource – (has) [DC] title – ‘Cataloguing for geeks’”
“Another resource – (has) [DC] title – ‘Does cataloguing have a future?’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

So now we have a title index to metadata from multiple heterogeneous sources. And the beginnings of a set of records in Dublin Core format.

Note that the attribute which is the sub-property must be entirely narrower in its semantics than the related super-property. If we create the mapping triple:

“‘title’ – (is) sub-property of – ‘varying form of title’”

then we generate the data triple:

“This resource – (has) 246 [varying form of title] – ‘Cataloguing for beginners’”

which is incorrect.

As a result, a data triple generated by a sub-property mapping triple is usually ‘dumber’ than the original data triple; detail is lost because the generated triple uses an attribute which is broader in meaning than the original. This ‘dumbing-up’ is necessary to produce interoperable metadata from different schemas – but data is not permanently lost because the original triple is still available for use in other applications. Needless to say, data triples created with broad attributes cannot be “smartened-down”, at least on their own.

The sub-property relationship can be chained. We can create a new attribute property, MARC21 ‘title’, which could be used in an application for making a title index to MARC21 records, as already mentioned. This new attribute is a super-property of all the MARC21 title-type attributes, and is also a sub-property of the DC ‘title’ attribute:

“[MARC21] ‘title statement’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘varying form of title’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘title’ – (is) sub-property of – [DC] ‘title’”

Doing this does not affect the previous mapping triples relating each MARC21 title-type attribute directly to the DC ‘title’ attribute, although it makes them redundant because this new set of mapping triples generates exactly the same data triples at the DC level from the MARC21 originals.
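The redundancy can be checked with a small sketch that applies the sub-property rule repeatedly until nothing new is generated, once with the chained mappings and once with the direct ones. The labels stand in for real property URIs and the `entail` function is a hypothetical helper, not a full reasoner.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
Mapping = Tuple[str, str]  # (sub-property, super-property)


def entail(data: Set[Triple], mappings: Set[Mapping]) -> Set[Triple]:
    """Apply the sub-property rule until no new triples appear."""
    triples = set(data)
    while True:
        new = {
            (s, sup, o)
            for (s, p, o) in triples
            for (sub, sup) in mappings
            if p == sub
        } - triples
        if not new:
            return triples
        triples |= new


data = {
    ("This resource", "[MARC21] title statement",
     "Cataloguing for beginners"),
    ("This resource", "[MARC21] varying form of title",
     "Pat presents cataloguing for beginners"),
}
chained = {
    ("[MARC21] title statement", "[MARC21] title"),
    ("[MARC21] varying form of title", "[MARC21] title"),
    ("[MARC21] title", "[DC] title"),
}
direct = {
    ("[MARC21] title statement", "[DC] title"),
    ("[MARC21] varying form of title", "[DC] title"),
}


def dc_level(triples: Set[Triple]) -> Set[Triple]:
    return {t for t in triples if t[1] == "[DC] title"}


# Both sets of mapping triples yield the same DC-level data triples.
print(dc_level(entail(data, chained)) == dc_level(entail(data, direct)))
```

The chained version additionally produces the intermediate MARC21 ‘title’ triples, which is what a MARC21-only title index would use.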

Different applications can therefore re-use and, if necessary, augment the sub-property chains for each of the high-level core attributes found in most bibliographic metadata schemas, such as title, author/creator/agent, subject, target audience, etc. The chains form a net(work) of mappings, or map, which can automatically dumb-up triples from any level of semantic granularity to any higher level.

We should only have to publish such maps or part-maps once, openly so that anyone can use them and add to them. If the professional communities develop the maps first, much effort will be saved and much authority imparted. This requires collaboration and action real soon now – the ISBD Review Group and the Joint Steering Committee for Development of RDA have started with the development of a mapping between the ISBD and RDA element sets.

These maps should remain valid forever, so the effort is worth expending. The original data triples use the original properties based on the schema attributes at the time and they will be valid “for their time”, in the same way that many catalogues are likely to contain records created under the Anglo-American Cataloguing Rules, with its ‘general material designation’ attribute, long after the successor standard RDA: resource description and access has been adopted with its ‘content type’ and ‘carrier type’ attributes.

And mappings from the MARC21 element sets will show, we hope, that it may not be necessary to convert the entire contents of every MARC21 record as a result of the Bibliographic Framework Transition Initiative!

But the professional communities lack a framework to help them collaborate as a super-community. A network of mappings is more (socially) efficient than an aggregation of one-to-one mappings between pairs of schemas. We need (name)spaces to add intermediary attribute properties and publish the mappings; we need protocols for managing semantic change as schemas evolve; we need lightweight protocols for authorizing mappings; we need systems for ensuring the long-term preservation of RDF element sets and mapping triples.

By Gordon Dunsire, May 12, 2012, 1:17 pm (UTC-5)

Why create “a separate property for every combination of tag, two indicators, and subfield”, as I mentioned in Low-hanging MARC fruit?

As Karen says in her comment on Getting to higher MARC branches, in the case of the Target audience data element there is no need for the RDF properties defined for each MARC21 “resource type” when the Target audience value vocabulary can be linked directly to the property defined at the higher level of the MARC21 resource.

The data element is one of several in the MARC 21 format for bibliographic data that are “defined the same in the specifications for more than one type of material” and share the same value vocabulary; this requires inspection of the vocabulary for each type, as there seems to be no explicit indication in the text. We have used the prefix “common” in the vocabulary URIs to reflect this, as in the value URI of the object of the ex:1 data triple (“commonaud#j”). Examples include commonaud (MARC21-008: Target audience) itself, as well as commonfor (MARC21-008: Form of item) and commonnat (MARC21-008: Nature of contents).

Despite this, I think it is worth including the 006 equivalents of the Target audience element in the RDF graph, as they represent “special aspects of the item being cataloged that cannot be coded in field 008”. This ensures that any semantic relationship between the 006 and 008 fixed-length data elements is preserved in the data. The semantic relationship between “type” and “form” of material in the 006/008 relationship table is not specified, so I assume that it will be incorporated, along with usage guidelines, etc., in one or more application profiles. Application profiles can also ensure that values from the common Vocabulary Encoding Scheme are preserved in data triples entailed during dumb-up to the level of the resource as a whole.

This illustrates one of the intended benefits of the level 0 properties:

  • Maximum flexibility for inclusion in application profiles and mappings to external namespaces, of which m21plus is just an example.

Other benefits include:

  • Support of very simple algorithms for the production of data triples from existing MARC21 records. There is only a single decision point, based on the data in Leader 06 of the record.
  • Support for round-tripping data triples back to MARC21 encoded records.
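
The single decision point can be sketched roughly as follows. This is only an illustration under stated assumptions, not the conversion code used for the namespace: the function name is hypothetical, the Leader 06 table is my reading of the MARC 21 008 configurations by type of record (it ignores continuing resources, which also depend on Leader 07), and skipping blank and fill characters is a simplification. The property names and the m2100x and m21terms prefixes are the ones used in the examples elsewhere in these posts.

```python
# Hypothetical sketch: pick the level 0 property for 008 position 22
# (Target audience) using only the type-of-record code in Leader 06.

# Assumption: Leader 06 code -> 008 material configuration.
# Continuing resources (which also need Leader 07) are left out.
LEADER06_TO_008 = {
    "a": "BK", "t": "BK",                        # language material
    "m": "CF",                                   # computer file
    "c": "MU", "d": "MU", "i": "MU", "j": "MU",  # music and sound recordings
    "g": "VM", "k": "VM", "o": "VM", "r": "VM",  # visual materials
}


def target_audience_triple(subject, leader, field_008):
    """Return one (subject, property, value) triple, or None."""
    config = LEADER06_TO_008.get(leader[6])
    if config is None:
        return None  # no Target audience element at 008/22 in this sketch
    code = field_008[22]
    if code in (" ", "|"):
        return None  # assumption: blank and fill character yield no triple
    return (subject, f"m2100x:M008{config}22", f"m21terms:commonaud#{code}")


# Usage: a book record (Leader 06 = "a") with code "j" at 008/22.
leader = "00000nam a2200000 a 4500"
field_008 = " " * 22 + "j" + " " * 17
triple = target_audience_triple("ex:3", leader, field_008)
# triple == ("ex:3", "m2100x:M008BK22", "m21terms:commonaud#j")
```

Round-tripping would run the same table in reverse: the property name identifies the field, configuration and position, and the value URI carries the code.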

Disadvantages of the level 0 properties include:

  • The need to create super-properties to support data triples at the level of the whole resource and for aggregated statements.
  • The need to add mappings between level 0 and super-properties.

We have developed a full set of m21plus super-properties for the fixed-length data fields, with rdfs:subPropertyOf mappings from the level 0 properties, and hope to load them into the OMR real soon now for use by applications.

We also expect to find more benefits and disadvantages in using the level 0 properties as we continue to investigate the MARC21 variable data fields.

 

By Gordon Dunsire, April 30, 2012, 3:37 am (UTC-5)

Netting more MARC fruit discussed the use of the rdfs:subPropertyOf property to allow MARC21 data to interoperate with triples based on similar properties in element sets from other metadata schemas. We can show the procedure in action using the Target audience property described in Getting to higher MARC branches.

A quick examination of the namespaces for Dublin Core terms, the FRBR entity-relationship model, ISBD, and RDA reveals properties for intended or target audience in each. The FRSAD (FR for Subject Authority Data) namespace also has a property labelled “has audience”, but it refers to the class Nomen which contains labels and identifiers for subject topics; it is about subject headings and classifications, not the resource itself, so has a completely different context and is not suitable for interoperating with MARC21’s Target audience data.

The relevant triples for each property are:

Dublin Core terms:

@prefix dct: <http://purl.org/dc/terms/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
dct:audience rdfs:comment "A class of entity for whom the resource is intended or useful." .
dct:audience rdfs:label "audience" .
dct:audience rdfs:range dct:AgentClass .

FRBR:

@prefix frbrer: <http://iflastandards.info/ns/fr/frbr/frbrer/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
frbrer:P3006 skos:definition "Relates a work to the class of user for which the work is intended, as defined by age group, educational level, or other categorization." .
frbrer:P3006 rdfs:label "has intended audience" .
frbrer:P3006 rdfs:domain frbrer:C1001 .

ISBD:

@prefix isbd: <http://iflastandards.info/ns/isbd/elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
isbd:P1091 skos:definition "Relates a resource to a note providing non-evaluative information as to the potential or recommended use of the resource and/or the intended audience." .
isbd:P1091 rdfs:label "has note on use or audience" .
isbd:P1091 rdfs:domain isbd:Resource .

RDA:

@prefix rda: <http://rdvocab.info/Elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix unc: <http://.../>.
rda:intendedAudienceWork skos:definition "The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization." .
rda:intendedAudienceWork rdfs:label "Intended audience (Work)" .
rda:intendedAudienceWork rdfs:domain rdafrbr:Work .
rda:intendedAudienceWork rdfs:subPropertyOf unc:intendedAudience .

Unconstrained RDA:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix unc: <http://.../>.
unc:intendedAudience skos:definition "The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization." .
unc:intendedAudience rdfs:label "Intended audience" .

And for our high-level MARC21 property:

@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
m21plus:M00Aud skos:definition "The intellectual level of the target audience for which the material is intended." .
m21plus:M00Aud rdfs:label "Target audience" .

The MARC21 property has a narrower definition than any of the others. Taking into account the domain and range constraints leaves only the unconstrained RDA property as a candidate super-property:

m21plus:M00Aud rdfs:subPropertyOf unc:intendedAudience .

This allows triples from MARC21 records to have entailments which use the same unconstrained property as triples from RDA records; the FRBR-constrained RDA property is already a sub-property of the unconstrained property.

RDF graphs of MARC21 and RDA data triples and entailments.

Of course, the original MARC21 triple is itself an entailment from a level 0 triple.

We can use the same procedure to align and map the other properties we found for target or intended audience. The resulting RDF ontology is:

@prefix dct: <http://purl.org/dc/terms/>.
@prefix frbrer: <http://iflastandards.info/ns/fr/frbr/frbrer/>.
@prefix isbd: <http://iflastandards.info/ns/isbd/elements/>.
@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rda: <http://rdvocab.info/Elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix unc: <http://.../>.
dct:audience rdfs:subPropertyOf unc:intendedAudience .
frbrer:P3006 rdfs:subPropertyOf unc:intendedAudience .
m21plus:M00Aud rdfs:subPropertyOf unc:intendedAudience .
rda:intendedAudienceWork rdfs:subPropertyOf unc:intendedAudience .
unc:intendedAudience rdfs:subPropertyOf unc:P1091 .
isbd:P1091 rdfs:subPropertyOf unc:P1091 .

Note that the ISBD property is broader in definition than the unconstrained RDA property, but is itself constrained by its domain. So we need an unconstrained version of the ISBD property, which has the broadest semantic of all the related properties.
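
To see what the ontology buys us, here is a minimal sketch of how an application might collect statements from several schemas under one of the broad properties. The sub-property pairs are copied from the ontology above; the function names, the ex:5 and ex:6 subjects and their literal values are hypothetical and only serve the illustration.

```python
# Sub-property declarations copied from the ontology: (sub, super).
ONTOLOGY = [
    ("dct:audience", "unc:intendedAudience"),
    ("frbrer:P3006", "unc:intendedAudience"),
    ("m21plus:M00Aud", "unc:intendedAudience"),
    ("rda:intendedAudienceWork", "unc:intendedAudience"),
    ("unc:intendedAudience", "unc:P1091"),
    ("isbd:P1091", "unc:P1091"),
]


def closure(prop, ontology):
    """prop plus every property it is a direct or indirect sub-property of."""
    seen = {prop}
    todo = [prop]
    while todo:
        current = todo.pop()
        for sub, sup in ontology:
            if sub == current and sup not in seen:
                seen.add(sup)
                todo.append(sup)
    return seen


def statements_for(target, data, ontology):
    """(subject, object) pairs stated or entailed with the target property."""
    return sorted((s, o) for s, p, o in data if target in closure(p, ontology))


# Hypothetical data: one MARC21 triple, one RDA triple, one ISBD triple.
DATA = [
    ("ex:1", "m21plus:M00Aud", "m21terms:commonaud#j"),
    ("ex:5", "rda:intendedAudienceWork", "Juvenile"),
    ("ex:6", "isbd:P1091", "For ages 9-12"),
]

AUDIENCE = statements_for("unc:intendedAudience", DATA, ONTOLOGY)
USE_OR_AUDIENCE = statements_for("unc:P1091", DATA, ONTOLOGY)
```

In this sketch AUDIENCE holds the MARC21 and RDA statements, while USE_OR_AUDIENCE also picks up the ISBD note, because the unconstrained ISBD property sits at the top of the hierarchy.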

RDF graph of Target/intended audience ontology

This ontology allows all the data about the intended audience of a resource to be available as a single, common attribute; our MARC fruit is added to the basket automatically, once it has been plucked as a level 0 triple.

By Gordon Dunsire, April 23, 2012, 4:14 pm (UTC-5)

Getting to higher MARC branches showed how the RDFS subPropertyOf property can be used to entail, or infer, RDF data triples with a broader semantic or meaning than the original, to “dumb-up” or aggregate data based on level 0 properties. Only the property or predicate part of the triple is different; the subject URI and object URI or literal remain unchanged.

The Target audience example represents one of several categories we have identified for the level 0 properties for the fixed-length data fields in MARC. This one is characterized as two or more properties with the same definition using the same Vocabulary Encoding Scheme (VES).

A more complex category is characterized as two or more properties with the same definition using the same VES where the underlying attribute allows more than one value and the order of values is significant. An example is the 008 attribute Relief of maps:

RDF graph of MARC21 Relief of maps ontology (partial)

“sP” in the graph stands for rdfs:subPropertyOf. The graph does not show the level 0 properties for Relief (2), (3), and (4).

This ontology separates the semantics of the order of the data from the type of data while preserving the data values. It offers any application a property for the most important relief type specified on the item being described (m21plus:M00Rel1 “Relief (1)”), a property for the second most important relief type (m21plus:M00Rel2 “Relief (2)”), etc., a property for the relief type regardless of importance (m21plus:M008MP18-21 “Relief of Maps”), etc., and a property for the relief type regardless of importance or category of material (m21plus:M00Rel “Relief”).

Target audience and Relief are, of course, at the level of attribute or relationship found in many other schemas. The rdfs:subPropertyOf property can be used to establish mappings as semantic relationships between similar properties in multiple namespaces, as described in our paper A reconsideration of mapping in a semantic world.

While properties may be similar in different namespaces, care has to be taken to avoid semantic incoherency and the inadvertent addition of false information into entailed triples. This has been easy with the level 0 properties because the basic definitions are identical, and none of the properties is bound by a domain or range. The situation changes when properties from other namespaces are considered. Definitions may be the same in meaning, or broader or narrower. For example, the definition of m21plus:M00Aud, “The intellectual level of the target audience for which the material is intended”, is clearly narrower than the definition of the corresponding RDA property, “The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization”. If the RDA property is declared a sub-property of the MARC21 property, resulting entailments will give values which are out of scope, for example type of disability, even if the constraint of the MARC21 VES for target audience is ignored. Conversely, declaring the MARC21 property as a sub-property of the RDA property does not lead to such problems, as RDA can accommodate the VES URIs within its scope and range.

However, it turns out we are on risky ground if we do this, because the RDA property has the FRBR class Work as its domain. This means that in any entailed triple, the URI of the MARC21 resource which is the subject of the original triple is inferred to be a work (a member of the class Work). But unless the MARC21 data has been pre-FRBRized, other triples with the same subject URI will entail triples stating the same resource is also a manifestation, etc., which breaks the FRBR model. Even though the RDA namespace is currently using its own “weak” version of the FRBR classes, there is no guarantee that it won’t link to the FRBR namespace in the future. The situation is therefore best avoided, and what is required is a version of the RDA property that does not have a domain. Fortunately, this was anticipated by the DCMI/RDA Task Group (now the DCMI Bibliographic Metadata Task Group), and “unconstrained” versions of all the RDA properties were added when the namespace was created, as described in our paper RDA vocabularies: process, outcome, use.
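
The risk is easy to reproduce with a toy reasoner. The sketch below is an assumption-laden illustration rather than a real RDFS engine: it applies only the sub-property rule and the domain rule (if P has domain C and X P Y, then X is of type C), and the function name is hypothetical. The property and class names are the ones used in the RDA triples in the later post.

```python
# Domain declaration taken from the RDA triples: the FRBR-constrained property.
DOMAIN = {"rda:intendedAudienceWork": "rdafrbr:Work"}


def entail(data, sub_property_of, domain):
    """Apply the sub-property rule once, then the domain rule."""
    triples = set(data)
    # Sub-property rule: (x, p1, y) and p1 subPropertyOf p2 gives (x, p2, y).
    for s, p, o in data:
        if p in sub_property_of:
            triples.add((s, sub_property_of[p], o))
    # Domain rule: (x, p, y) and p rdfs:domain c gives (x, rdf:type, c).
    for s, p, o in list(triples):
        if p in domain:
            triples.add((s, "rdf:type", domain[p]))
    return triples


DATA = [("ex:1", "m21plus:M00Aud", "m21terms:commonaud#j")]

# Risky mapping: straight to the FRBR-constrained RDA property.
RISKY = entail(DATA, {"m21plus:M00Aud": "rda:intendedAudienceWork"}, DOMAIN)

# Safer mapping: to the unconstrained property, which has no domain.
SAFE = entail(DATA, {"m21plus:M00Aud": "unc:intendedAudience"}, DOMAIN)

RISKY_TYPES = {t for t in RISKY if t[1] == "rdf:type"}
SAFE_TYPES = {t for t in SAFE if t[1] == "rdf:type"}
```

With the risky mapping the MARC21 resource ex:1 is inferred to be a Work; with the unconstrained property no type is inferred at all.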

As a general procedure, then, a MARC21 high-level property can be mapped out to another namespace by using sub-property relationships, when the target property completely encompasses the MARC21 property and has no domain or range, or is completely encompassed by the MARC21 property and has no range (to avoid the VES issue). The entailed triples produced by the mappings allow maximum interoperability of MARC21 data with that from other schemas without transforming the data values.
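
The procedure can be written down as a small check. This is a hedged sketch: the Candidate class and mapping_for function are hypothetical names, and the "scope" judgement (whether the other property is broader or narrower than the MARC21 one) still has to be made by a person comparing definitions. The domains and ranges listed are the ones declared for each property in its own namespace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    scope: str                          # "broader" or "narrower" than the MARC21 property
    domain_class: Optional[str] = None  # declared rdfs:domain, if any
    range_class: Optional[str] = None   # declared rdfs:range, if any


def mapping_for(marc_property, candidate):
    """Return a safe sub-property triple, or None if no mapping is safe."""
    unconstrained = candidate.domain_class is None and candidate.range_class is None
    if candidate.scope == "broader" and unconstrained:
        return (marc_property, "rdfs:subPropertyOf", candidate.name)
    if candidate.scope == "narrower" and candidate.range_class is None:
        return (candidate.name, "rdfs:subPropertyOf", marc_property)
    return None


# All five candidates are broader in definition than m21plus:M00Aud.
CANDIDATES = [
    Candidate("dct:audience", "broader", range_class="dct:AgentClass"),
    Candidate("frbrer:P3006", "broader", domain_class="frbrer:C1001"),
    Candidate("isbd:P1091", "broader", domain_class="isbd:Resource"),
    Candidate("rda:intendedAudienceWork", "broader", domain_class="rdafrbr:Work"),
    Candidate("unc:intendedAudience", "broader"),
]

MAPPINGS = [m for m in (mapping_for("m21plus:M00Aud", c) for c in CANDIDATES) if m]
```

Only the unconstrained RDA property survives the check, which is the mapping the next post arrives at.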

By Gordon Dunsire, April 20, 2012, 9:39 am (UTC-5)

In the previous blog I discussed Low-hanging MARC fruit in the MARC21 fixed-length data fields 006, 007, and 008. These fields also contain useful data that hangs slightly higher up, but can be reached with a short ladder. The ladder rungs are constructed using the RDF Schema subPropertyOf property. This is an ontological property which takes the RDF class Property as its domain and range; in other words, it links two properties, each an instance of that class:

P1 rdfs:subPropertyOf P2 – where P1 and P2 are specific properties.

The subPropertyOf property contains the inference rule or entailment that:

If P1 rdfs:subPropertyOf P2, and X P1 Y, then X P2 Y

That is, if property P1 is a sub-property of property P2, then a machine can entail a triple using property P2 with the same subject and object as any triple using property P1.
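
As a minimal sketch of that rule, assuming triples are held as plain Python tuples (the function name is hypothetical, and a real application would more likely hand this to an RDFS reasoner):

```python
def entail_subproperty(data, sub_property_of):
    """Apply the rule once.

    data            -- iterable of (subject, property, object) triples
    sub_property_of -- iterable of (P1, P2) pairs, read as P1 rdfs:subPropertyOf P2
    Returns only the new triples, not the ones already in data.
    """
    data = set(data)
    entailed = set()
    for x, p1, y in data:
        for sub, sup in sub_property_of:
            if sub == p1:
                entailed.add((x, sup, y))  # same subject and object, broader property
    return entailed - data


# Usage with the abstract names from the rule.
NEW_TRIPLES = entail_subproperty([("X", "P1", "Y")], [("P1", "P2")])
# NEW_TRIPLES == {("X", "P2", "Y")}
```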

There is a lot of semantic overlap in the MARC21 fields. For example, field 006 positions 01-17 relate to positions 18-34 in one of the field 008 configurations; they use the same values. 006 is used in cases when an item has multiple characteristics that cannot be coded in field 008. There is no semantic difference between the 006 and 008 data – a multi-component item may be catalogued as a whole using 008 for the main component and 006 for other components, or each component may be catalogued separately with its own 008 field.

We can aggregate this data by declaring sub-property relationships between corresponding 006 and 008 “level 0” properties and a new common super-property.

For example, create a new property M00Aud with label “Target audience”, and declare M006a05 (“Target audience of Language material”), M006t05 (“Target audience of Manuscript language material”) and M008BK22 (“Target audience of Books”) as sub-properties:

@prefix m2100x: <http://marc21rdf.info/elements/00X/>.
@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
m21plus:M00Aud rdfs:label "Target audience" .
m2100x:M006a05 rdfs:subPropertyOf m21plus:M00Aud .
m2100x:M006t05 rdfs:subPropertyOf m21plus:M00Aud .
m2100x:M008BK22 rdfs:subPropertyOf m21plus:M00Aud .

A machine can use this RDF graph to entail new triples from existing data:

ex:1 m2100x:M006a05 m21terms:commonaud#j .
=> ex:1 m21plus:M00Aud m21terms:commonaud#j .
ex:2 m2100x:M006t05 m21terms:commonaud#e .
=> ex:2 m21plus:M00Aud m21terms:commonaud#e .
ex:3 m2100x:M008BK22 m21terms:commonaud#g .
=> ex:3 m21plus:M00Aud m21terms:commonaud#g .

RDF graphs of data triples and entailments (dotted lines)

Here, three different resources (ex:1, ex:2, ex:3) have target audience data stored in three different MARC21 fixed-length fields. The entailed triples store the data using a common property that encompasses the semantic of the level 0 properties by discarding their differences, which are the material categories. Each entailed triple states “This resource has target audience …”, dropping the distinction of material category which is unnecessary for this metadata attribute.

Using the entailed triples, we only need to process the higher-level property to create, for example, a “Target audience” index for a set of MARC21 records, rather than having to gather the data from the level 0 properties every time.
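
A rough sketch of such an index, using the three example triples and sub-property declarations from above. The function names are hypothetical, and an in-memory dictionary stands in for whatever index structure a real system would use.

```python
from collections import defaultdict

# Sub-property declarations from the example graph: level 0 -> higher level.
SUB_PROPERTY_OF = {
    "m2100x:M006a05": "m21plus:M00Aud",
    "m2100x:M006t05": "m21plus:M00Aud",
    "m2100x:M008BK22": "m21plus:M00Aud",
}

# The three level 0 data triples from the example.
DATA = [
    ("ex:1", "m2100x:M006a05", "m21terms:commonaud#j"),
    ("ex:2", "m2100x:M006t05", "m21terms:commonaud#e"),
    ("ex:3", "m2100x:M008BK22", "m21terms:commonaud#g"),
]


def with_entailments(data, sub_property_of):
    """Original triples followed by the triples entailed from them."""
    entailed = [(s, sub_property_of[p], o) for s, p, o in data if p in sub_property_of]
    return list(data) + entailed


def build_index(triples, prop):
    """Map each value of one property to the resources that carry it."""
    index = defaultdict(list)
    for s, p, o in triples:
        if p == prop:
            index[o].append(s)
    return dict(index)


# Only the higher-level property is consulted when indexing.
AUDIENCE_INDEX = build_index(with_entailments(DATA, SUB_PROPERTY_OF), "m21plus:M00Aud")
```

The indexing step never mentions a level 0 property, so adding further material categories only means adding sub-property declarations.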

We can go further. The same value vocabulary for Target audience is used for other categories of material:

  • M006c05 (“Target audience of Notated music”)
  • M006d05 (“Target audience of Manuscript notated music”)
  • M006g05 (“Target audience of Projected medium”)
  • M006i05 (“Target audience of Nonmusical sound recording”)
  • M006j05 (“Target audience of Musical sound recording”)
  • M006k05 (“Target audience of Two-dimensional nonprojectable graphic”)
  • M006m05 (“Target audience of Computer file or Electronic resource”)
  • M006o05 (“Target audience of Kit”)
  • M006r05 (“Target audience of Three-dimensional artifact or naturally occurring object”)
  • M008CF22 (“Target audience of Computer Files”)
  • M008MU22 (“Target audience of Music”)
  • M008VM22 (“Target audience of Visual Materials”)

So we can declare sub-property relationships between each of these level 0 properties and the higher-level “Target audience” property, and generate the entailed triples.

Note that we could create an intermediary rung on our ladder, say M00BKAud “Target audience (Language material)”, to aggregate data at the material category level, and then declare a sub-property relationship with M00Aud to aggregate to the category-free level. There is no specific use-case for this at the moment. If the need arises, this can be done without affecting the existing sub-property relationships and entailments, because the subPropertyOf property is transitive: P1 rdfs:subPropertyOf P2 and P2 rdfs:subPropertyOf P3 entails P1 rdfs:subPropertyOf P3.
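
A small sketch of why the extra rung is harmless. The function name is hypothetical, and placing M00BKAud in the m21plus namespace is my assumption, since the property does not exist yet; the code simply follows sub-property links transitively.

```python
def super_properties(prop, sub_property_of):
    """Every property reachable from prop by following rdfs:subPropertyOf."""
    found = set()
    todo = [prop]
    while todo:
        current = todo.pop()
        for sub, sup in sub_property_of:
            if sub == current and sup not in found:
                found.add(sup)
                todo.append(sup)
    return found


# Existing declaration: one rung straight to the category-free property.
EXISTING = [("m2100x:M008BK22", "m21plus:M00Aud")]

# Hypothetical intermediary rung added later, alongside the existing one.
WITH_RUNG = EXISTING + [
    ("m2100x:M008BK22", "m21plus:M00BKAud"),
    ("m21plus:M00BKAud", "m21plus:M00Aud"),
]

BEFORE = super_properties("m2100x:M008BK22", EXISTING)
AFTER = super_properties("m2100x:M008BK22", WITH_RUNG)
```

BEFORE contains only m21plus:M00Aud; AFTER contains it as well, plus the new rung, so every entailment that held before still holds.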

Our ladder “dumbs-up” the level 0 data; each sub-property entailment uses a higher-level property that is broader in semantic than the last. The ladders merge at each stage and are just one rung in length, so what we get is more like a climbing net to get to the higher-hanging fruit.

RDF graph of MARC21 Target audience ontology

Applications can now deal with just one attribute property for Target audience and avoid the messiness at level 0. And there is just one property to align and map to corresponding properties from other bibliographic metadata schemas …

By Gordon Dunsire, April 18, 2012, 9:34 am (UTC-5)