Why create “a separate property for every combination of tag, two indicators, and subfield”, as I mentioned in Low-hanging MARC fruit?

As Karen says in her comment on Getting to higher MARC branches, in the case of the Target audience data element there is no need for the RDF properties defined for each MARC21 “resource type” when the Target audience value vocabulary can be linked directly to the property defined at the higher level of the MARC21 resource.

The data element is one of several in the MARC 21 format for bibliographic data that are “defined the same in the specifications for more than one type of material” and share the same value vocabulary; this requires inspection of the vocabulary for each type, as there seems to be no explicit indication in the text. We have used the prefix “common” in the vocabulary URIs to reflect this, as in the value URI of the object of the ex:1 data triple (“commonaud#j”). Examples include commonaud (MARC21-008: Target audience) itself, as well as commonfor (MARC21-008: Form of item) and commonnat (MARC21-008: Nature of contents).

Despite this, I think it is worth including the 006 equivalents of the Target audience element in the RDF graph, as they represent “special aspects of the item being cataloged that cannot be coded in field 008”. This ensures that any semantic relationship between the 006 and 008 fixed-length data elements is preserved in the data. The semantic relationship between “type” and “form” of material in the 006/008 relationship table is not specified, so I assume that it will be incorporated, along with usage guidelines, etc., in one or more application profiles. Application profiles can also ensure that values from the common Vocabulary Encoding Scheme are preserved in data triples entailed during dumb-up to the level of the resource as a whole.

This illustrates one of the intended benefits of the level 0 properties:

  • Maximum flexibility for inclusion in application profiles and mappings to external namespaces, of which m21plus is just an example.

Other benefits include:

  • Support of very simple algorithms for the production of data triples from existing MARC21 records. There is only a single decision point, based on the data in Leader 06 of the record.
  • Support for round-tripping data triples back to MARC21 encoded records.

Disadvantages of the level 0 properties include:

  • The need to create super-properties to support data triples at the level of the whole resource and for aggregated statements.
  • The need to add mappings between level 0 and super-properties.

We have developed a full set of m21plus super-properties for the fixed-length data fields, with rdfs:subPropertyOf mappings from the level 0 properties, and hope to load them into the OMR real soon now for use by applications.

We also expect to find more benefits and disadvantages in using the level 0 properties as we continue to investigate the MARC21 variable data fields.

 

By Gordon Dunsire, April 30, 2012, 3:37 am (UTC-5)

Netting more MARC fruit discussed the use of the rdfs:subPropertyOf property to allow MARC21 data to interoperate with triples based on similar properties in element sets from other metadata schemas. We can show the procedure in action using the Target audience property described in Getting to higher MARC branches.

A quick examination of the namespaces for Dublin Core terms, the FRBR entity-relationship model, ISBD, and RDA reveals properties for intended or target audience in each. The FRSAD (FR for Subject Authority Data) namespace also has a property labelled “has audience”, but it refers to the class Nomen which contains labels and identifiers for subject topics; it is about subject headings and classifications, not the resource itself, so has a completely different context and is not suitable for interoperating with MARC21’s Target audience data.

The relevant triples for each property are:

Dublin Core terms:

@prefix dct: <http://purl.org/dc/terms/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
dct:audience rdfs:comment "A class of entity for whom the resource is intended or useful." .
dct:audience rdfs:label "audience" .
dct:audience rdfs:range dct:AgentClass .

FRBR:

@prefix frbrer: <http://iflastandards.info/ns/fr/frbr/frbrer/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
frbrer:P3006 skos:definition "Relates a work to the class of user for which the work is intended, as defined by age group, educational level, or other categorization." .
frbrer:P3006 rdfs:label "has intended audience" .
frbrer:P3006 rdfs:domain frbrer:C1001 .

ISBD:

@prefix isbd: <http://iflastandards.info/ns/isbd/elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
isbd:P1091 skos:definition "Relates a resource to a note providing non-evaluative information as to the potential or recommended use of the resource and/or the intended audience." .
isbd:P1091 rdfs:label "has note on use or audience" .
isbd:P1091 rdfs:domain isbd:Resource .

RDA:

@prefix rda: <http://rdvocab.info/Elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix unc: <http://.../>.
rda:intendedAudienceWork skos:definition "The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization." .
rda:intendedAudienceWork rdfs:label "Intended audience (Work)" .
rda:intendedAudienceWork rdfs:domain rdafrbr:Work .
rda:intendedAudienceWork rdfs:subPropertyOf unc:intendedAudience .

Unconstrained RDA:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix unc: <http://.../>.
unc:intendedAudience skos:definition "The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization." .
unc:intendedAudience rdfs:label "Intended audience" .

And for our high-level MARC21 property:

@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
m21plus:M00Aud skos:definition "The intellectual level of the target audience for which the material is intended." .
m21plus:M00Aud rdfs:label "Target audience" .

The MARC21 property has a narrower definition than any of the others. Taking into account the domain and range constraints leaves only the unconstrained RDA property as a candidate super-property:

m21plus:M00Aud rdfs:subPropertyOf unc:intendedAudience .

This allows triples from MARC21 records to have entailments which use the same unconstrained property as triples from RDA records; the FRBR-constrained RDA property is already a sub-property of the unconstrained property.

RDF graphs of MARC21 and RDA data triples and entailments.


Of course, the original MARC21 triple is itself an entailment from a level 0 triple.

We can use the same procedure to align and map the other properties we found for target or intended audience. The resulting RDF ontology is:

@prefix dct: <http://purl.org/dc/terms/>.
@prefix frbrer: <http://iflastandards.info/ns/fr/frbr/frbrer/>.
@prefix isbd: <http://iflastandards.info/ns/isbd/elements/>.
@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rda: <http://rdvocab.info/Elements/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix unc: <http://.../>.
dct:audience rdfs:subPropertyOf unc:intendedAudience .
frbrer:P3006 rdfs:subPropertyOf unc:intendedAudience .
m21plus:M00Aud rdfs:subPropertyOf unc:intendedAudience .
rda:intendedAudienceWork rdfs:subPropertyOf unc:intendedAudience .
unc:intendedAudience rdfs:subPropertyOf unc:P1091 .
isbd:P1091 rdfs:subPropertyOf unc:P1091 .

Note that the ISBD property is broader in definition than the unconstrained RDA property, but is itself constrained by its domain. So we need an unconstrained version of the ISBD property, which has the broadest semantic of all the related properties.

RDF graph of Target/intended audience ontology


This ontology allows all the data about the intended audience of a resource to be available as a single, common attribute; our MARC fruit is added to the basket automatically, once it has been plucked as a level 0 triple.

By Gordon Dunsire, April 23, 2012, 4:14 pm (UTC-5)

Getting to higher MARC branches showed how the RDFS subPropertyOf property can be used to entail, or infer, RDF data triples with a broader semantic or meaning than the original, to “dumb-up” or aggregate data based on level 0 properties. Only the property or predicate part of the triple is different; the subject URI and object URI or literal remain unchanged.

The Target audience example represents one of several categories we have identified for the level 0 properties for the fixed-length data fields in MARC. This one is characterized as two or more properties with the same definition using the same Vocabulary Encoding Scheme (VES).

A more complex category is characterized as two or more properties with the same definition using the same VES where the underlying attribute allows more than one value and the order of values is significant. An example is the 008 attribute Relief of maps:

RDF graph of MARC21 Relief of maps ontology (partial)


“sP” in the graph stands for rdfs:subPropertyOf. The graph does not show the level 0 properties for Relief (2), (3), and (4).

This ontology separates the semantics of the order of the data from the type of data while preserving the data values. It offers any application a property for the most important relief type specified on the item being described (m21plus:M00Rel1 “Relief (1)”), a property for the second most important relief type (m21plus:M00Rel2 “Relief (2)”), etc., a property for the relief type regardless of importance (m21plus:M008MP18-21 “Relief of Maps”), etc., and a property for the relief type regardless of importance or category of material (m21plus:M00Rel “Relief”).

Target audience and Relief are, of course, at the level of attribute or relationship found in many other schemas. The rdfs:subPropertyOf property can be used to establish mappings as semantic relationships between similar properties in multiple namespaces, as described in our paper A reconsideration of mapping in a semantic world.

While properties may be similar in different namespaces, care has to be taken to avoid semantic incoherency and the inadvertent addition of false information into entailed triples. This has been easy with the level 0 properties because the basic definitions are identical, and none of the properties is bound by a domain or range. The situation changes when properties from other namespaces are considered. Definitions may be the same in meaning, or broader or narrower. For example, the definition of m21plus:M00Aud, “The intellectual level of the target audience for which the material is intended”, is clearly narrower than the definition of the corresponding RDA property, “The class of user for which the content of a resource is intended, or for whom the content is considered suitable, as defined by age group (e.g., children, young adults, adults, etc.), educational level (e.g., primary, secondary, etc.), type of disability, or other categorization”. If the RDA property is declared a sub-property of the MARC21 property, resulting entailments will give values which are out of scope, for example type of disability, even if the constraint of the MARC21 VES for target audience is ignored. Conversely, declaring the MARC21 property as a sub-property of the RDA property does not lead to such problems, as RDA can accommodate the VES URIs within its scope and range.

However, it turns out we are on risky ground if we do this, because the RDA property has the FRBR class Work as its domain. This means that in any entailed triple, the URI of the MARC21 resource which is the subject of the original triple is inferred to be a work (a member of the class Work). But unless the MARC21 data has been pre-FRBRized, other triples with the same subject URI will entail triples stating the same resource is also a manifestation, etc., which breaks the FRBR model. Even though the RDA namespace is currently using its own “weak” version of the FRBR classes, there is no guarantee that it won’t link to the FRBR namespace in the future. The situation is therefore best avoided, and what is required is a version of the RDA property that does not have a domain. Fortunately, this was anticipated by the DCMI/RDA Task Group (now the DCMI Bibliographic Metadata Task Group), and “unconstrained” versions of all the RDA properties were added when the namespace was created, as described in our paper RDA vocabularies: process, outcome, use.
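The domain problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples of strings rather than a real RDF store, the mapping of m21plus:M00Aud to the constrained RDA property is the risky declaration discussed above, and ex:manifestationProperty is a hypothetical property invented here to stand in for any property whose domain is the Manifestation class.

```python
# Risky mapping discussed above: MARC21 property under the domain-constrained RDA property.
SUBPROP_OF = {"m21plus:M00Aud": "rda:intendedAudienceWork"}

# Domain declarations. The second entry is hypothetical, for illustration only.
DOMAIN_OF = {
    "rda:intendedAudienceWork": "rdafrbr:Work",
    "ex:manifestationProperty": "rdafrbr:Manifestation",
}

data = [
    ("ex:1", "m21plus:M00Aud", "m21terms:commonaud#j"),
    ("ex:1", "ex:manifestationProperty", "some value"),
]

def entail(data):
    """Apply the sub-property rule once, then the rdfs:domain rule."""
    triples = set(data)
    for s, p, o in data:
        if p in SUBPROP_OF:
            triples.add((s, SUBPROP_OF[p], o))
    for s, p, o in list(triples):
        if p in DOMAIN_OF:
            # If P has domain C and X P Y, then X is a member of C.
            triples.add((s, "rdf:type", DOMAIN_OF[p]))
    return triples

types = sorted(o for s, p, o in entail(data) if s == "ex:1" and p == "rdf:type")
print(types)  # ['rdafrbr:Manifestation', 'rdafrbr:Work']
```

The same subject URI ends up in both classes, which is the conflict with the FRBR model described above. Mapping to a property with no domain produces no rdf:type triples at all.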

As a general procedure, then, a MARC21 high-level property can be mapped out to another namespace by using sub-property relationships, when the target property completely encompasses the MARC21 property and has no domain or range, or is completely encompassed by the MARC21 property and has no range (to avoid the VES issue). The entailed triples produced by the mappings allow maximum interoperability of MARC21 data with that from other schemas without transforming the data values.
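As a rough sketch, the procedure can be written as a small decision function in Python. The judgement of whether one definition encompasses another remains a human one and is simply passed in as a value; the class and function names are hypothetical, and the domain and range flags reflect only the property triples quoted elsewhere on this page.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    relation: str      # human judgement: "broader" or "narrower" than the MARC21 property
    has_domain: bool
    has_range: bool

def mapping_direction(c):
    """Return the safe sub-property declaration for a candidate, or None."""
    if c.relation == "broader" and not c.has_domain and not c.has_range:
        return "marc21 rdfs:subPropertyOf " + c.name
    if c.relation == "narrower" and not c.has_range:
        return c.name + " rdfs:subPropertyOf marc21"
    return None

candidates = [
    Candidate("dct:audience", "broader", has_domain=False, has_range=True),
    Candidate("frbrer:P3006", "broader", has_domain=True, has_range=False),
    Candidate("isbd:P1091", "broader", has_domain=True, has_range=False),
    Candidate("rda:intendedAudienceWork", "broader", has_domain=True, has_range=False),
    Candidate("unc:intendedAudience", "broader", has_domain=False, has_range=False),
]

safe = [c.name for c in candidates if mapping_direction(c) is not None]
print(safe)  # ['unc:intendedAudience']
```

Under these assumptions only the unconstrained RDA property qualifies as a direct super-property of the MARC21 property.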

By Gordon Dunsire, April 20, 2012, 9:39 am (UTC-5)

In the previous blog I discussed Low-hanging MARC fruit in the MARC21 fixed-length data fields 006, 007, and 008. These fields also contain useful data that hangs slightly higher up, but can be reached with a short ladder. The ladder rungs are constructed using the RDF Schema subPropertyOf property. This is an ontological property which takes the RDF class Property as its domain and range; in other words, it links instances of two properties:

P1 rdfs:subPropertyOf P2 – where P1 and P2 are specific properties.

The subPropertyOf property contains the inference rule or entailment that:

If P1 rdfs:subPropertyOf P2, and X P1 Y, then X P2 Y

That is, if property P1 is a sub-property of property P2, then a machine can entail a triple using property P2 with the same subject and object as any triple using property P1.
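The rule is simple enough to sketch in Python. This is a minimal illustration, assuming triples are held as plain (subject, predicate, object) tuples of strings; a real application would use an RDF library and reasoner, and the function name is my own.

```python
SUBPROP = "rdfs:subPropertyOf"

def entail_subproperty(triples):
    """Return the new triples entailed by one pass of the rule:
    if P1 rdfs:subPropertyOf P2 and X P1 Y, then X P2 Y."""
    triples = set(triples)
    # Collect every declared (P1, P2) pair.
    pairs = {(s, o) for (s, p, o) in triples if p == SUBPROP}
    entailed = set()
    for (x, p, y) in triples:
        for (p1, p2) in pairs:
            if p == p1:
                # Same subject and object, broader predicate.
                entailed.add((x, p2, y))
    return entailed - triples

graph = {
    ("P1", SUBPROP, "P2"),
    ("X", "P1", "Y"),
}
new_triples = entail_subproperty(graph)
print(new_triples)  # {('X', 'P2', 'Y')}
```

Only the predicate changes; the subject and object are carried over untouched.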

There is a lot of semantic overlap in the MARC21 fields. For example, field 006 positions 01-17 relate to positions 18-34 in one of the field 008 configurations; they use the same values. 006 is used in cases when an item has multiple characteristics that cannot be coded in field 008. There is no semantic difference between the 006 and 008 data – a multi-component item may be catalogued as a whole using 008 for the main component and 006 for other components, or each component may be catalogued separately with its own 008 field.

We can aggregate this data by declaring sub-property relationships between corresponding 006 and 008 “level 0” properties and a new common super-property:

E.g. Create a new property M00Aud with label “Target audience”, and declare M006a05 (“Target audience of Language material”), M006t05 (“Target audience of Manuscript language material”) and M008BK22 (“Target audience of Books”) as sub-properties:

@prefix m2100x: <http://marc21rdf.info/elements/00X/>.
@prefix m21plus: <http://marc21rdf.info/elements/.../>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
m21plus:M00Aud rdfs:label "Target audience" .
m2100x:M006a05 rdfs:subPropertyOf m21plus:M00Aud .
m2100x:M006t05 rdfs:subPropertyOf m21plus:M00Aud .
m2100x:M008BK22 rdfs:subPropertyOf m21plus:M00Aud .

A machine can use this RDF graph to entail new triples from existing data:

ex:1 m2100x:M006a05 m21terms:commonaud#j .
=> ex:1 m21plus:M00Aud m21terms:commonaud#j .
ex:2 m2100x:M006t05 m21terms:commonaud#e .
=> ex:2 m21plus:M00Aud m21terms:commonaud#e .
ex:3 m2100x:M008BK22 m21terms:commonaud#g .
=> ex:3 m21plus:M00Aud m21terms:commonaud#g .

RDF graphs of data triples and entailments (dotted lines)


Here, three different resources (ex:1, ex:2, ex:3) have target audience data stored in three different MARC21 fixed-length fields. The entailed triples store the data using a common property that encompasses the semantic of the level 0 properties by discarding their differences, which are the material categories. Each entailed triple states “This resource has target audience …”, dropping the distinction of material category which is unnecessary for this metadata attribute.

Using the entailed triples, we only need to process the higher-level property to create, for example, a “Target audience” index for a set of MARC21 records, rather than having to gather the data from the level 0 properties every time.
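A sketch of such an index in Python, using the three example triples. It assumes triples are plain tuples of prefixed-name strings and that the index is a simple mapping from audience value to resources; both are simplifications made here for illustration.

```python
from collections import defaultdict

SUBPROP = "rdfs:subPropertyOf"

ontology = [
    ("m2100x:M006a05", SUBPROP, "m21plus:M00Aud"),
    ("m2100x:M006t05", SUBPROP, "m21plus:M00Aud"),
    ("m2100x:M008BK22", SUBPROP, "m21plus:M00Aud"),
]
data = [
    ("ex:1", "m2100x:M006a05", "m21terms:commonaud#j"),
    ("ex:2", "m2100x:M006t05", "m21terms:commonaud#e"),
    ("ex:3", "m2100x:M008BK22", "m21terms:commonaud#g"),
]

# Look-up table from each level 0 property to its super-properties.
super_of = defaultdict(set)
for sub, _, sup in ontology:
    super_of[sub].add(sup)

# Entail one triple per super-property, keeping subject and object unchanged.
entailed = [
    (s, sup, o)
    for (s, p, o) in data
    for sup in sorted(super_of.get(p, ()))
]

# Build the index from the single higher-level property only.
audience_index = defaultdict(list)
for s, p, o in entailed:
    if p == "m21plus:M00Aud":
        audience_index[o].append(s)

for value, resources in sorted(audience_index.items()):
    print(value, resources)
```

The indexing loop never mentions a level 0 property, so adding further sub-properties to the ontology needs no change to it.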

We can go further. The same value vocabulary for Target audience is used for other categories of material:

  • M006c05 (“Target audience of Notated music”)
  • M006d05 (“Target audience of Manuscript notated music”)
  • M006g05 (“Target audience of Projected medium”)
  • M006i05 (“Target audience of Nonmusical sound recording”)
  • M006j05 (“Target audience of Musical sound recording”)
  • M006k05 (“Target audience of Two-dimensional nonprojectable graphic”)
  • M006m05 (“Target audience of Computer file or Electronic resource”)
  • M006o05 (“Target audience of Kit”)
  • M006r05 (“Target audience of Three-dimensional artifact or naturally occurring object”)
  • M008CF22 (“Target audience of Computer Files”)
  • M008MU22 (“Target audience of Music”)
  • M008VM22 (“Target audience of Visual Materials”)

So we can declare sub-property relationships between each of these level 0 properties and the higher-level “Target audience” property, and generate the entailed triples.

Note that we could create an intermediary rung on our ladder, say M00BKAud “Target audience (Language material)”, to aggregate data at the material category level, and then declare a sub-property relationship with M00Aud to aggregate to the category-free level. There is no specific use-case for this at the moment. If the need arises, this can be done without affecting the existing sub-property relationships and entailments, because the subPropertyOf property is transitive: P1 rdfs:subPropertyOf P2 and P2 rdfs:subPropertyOf P3 entails P1 rdfs:subPropertyOf P3.
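Transitivity can be illustrated with a small closure computation in Python. This is a sketch only: the intermediary property M00BKAud is the hypothetical one suggested above, placing it under the m21plus prefix is my assumption, and sub-property declarations are held as plain (sub, super) pairs.

```python
def subproperty_closure(pairs):
    """Return all (sub, super) pairs implied by transitivity of rdfs:subPropertyOf."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # P1 sub P2 and P2 sub P3 entails P1 sub P3.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

declared = {
    ("m2100x:M008BK22", "m21plus:M00BKAud"),  # level 0 to intermediary rung
    ("m21plus:M00BKAud", "m21plus:M00Aud"),   # intermediary rung to category-free level
}
closure = subproperty_closure(declared)
print(("m2100x:M008BK22", "m21plus:M00Aud") in closure)  # True
```

The direct relationship from the level 0 property to M00Aud is still entailed, so data triples at the category-free level are unaffected by inserting the extra rung.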

Our ladder “dumbs-up” the level 0 data; each sub-property entailment uses a higher-level property that is broader in semantic than the last. The ladders merge at each stage and are just one rung in length, so what we get is more like a climbing net to get to the higher-hanging fruit.

RDF graph of MARC21 Target audience ontology


Applications can now deal with just one attribute property for Target audience and avoid the messiness at level 0. And there is just one property to align and map to corresponding properties from other bibliographic metadata schemas …

By Gordon Dunsire, April 18, 2012, 9:34 am (UTC-5)

It’s been six months since Diane Hillmann, Jon Phipps, and I published a “level 0” set of RDF elements (all properties) and value vocabularies based on the MARC21 format. That was largely a mechanical process, as we created a separate property for every combination of tag, two indicators, and subfield for most (but not yet all) tags in MARC Bibliographic. I hear this was well-received at the MARC Formats Interest Group session at ALA Midwinter in Dallas in January, with questions like “When are you going to do the same for MARC Holdings or MARC Authority?” Well, we’re thinking about it …

Meanwhile, I’ve been looking to see what low-hanging fruit can be plucked from the MARC tree of knowledge. The obvious place to start was the fixed-length data fields tagged as 006, 007, and 008. These have the least complicated mix of syntax and semantic, with no indicators or repeatable subfields to worry about. And there are value vocabularies associated with the coded content, so linked data triples with object URIs are a possibility.

The main complication is the dependency of the codes on the category of material, recorded in the MARC21 record Leader. This introduces a decision node in the process of recasting legacy metadata as RDF triples; it determines which code sets, and therefore value vocabularies, are being used. For all level 0 elements, the property URI is constructed mechanically from the MARC21 coding, so the decision only affects the choice of value vocabulary to be used as the object of the data triple.
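The decision node might be sketched in Python as follows. The table is my simplified reading of the Leader 06 codes and ignores Leader 07, which the full MARC 21 specification also consults to separate Books from Continuing resources; the function names are hypothetical, and the property names follow the pattern of those used in these posts (for example M008BK22 and M008MU22).

```python
# Simplified mapping from Leader 06 (type of record) to the 008 material category.
LEADER06_TO_CATEGORY = {
    "a": "BK", "t": "BK",                       # language material
    "c": "MU", "d": "MU", "i": "MU", "j": "MU",  # notated music and sound recordings
    "e": "MP", "f": "MP",                       # cartographic material
    "g": "VM", "k": "VM", "o": "VM", "r": "VM",  # visual materials
    "m": "CF",                                  # computer file
    "p": "MX",                                  # mixed materials
}

def category_for(leader):
    """Return the material category for a MARC21 leader string."""
    return LEADER06_TO_CATEGORY[leader[6]]

def level0_property(category, position):
    """Build a category-specific 008 property name mechanically."""
    return "m2100x:M008" + category + position

book_leader = "00000nam a2200000 a 4500"
music_leader = "00000njm a2200000 a 4500"

print(level0_property(category_for(book_leader), "22"))   # m2100x:M008BK22
print(level0_property(category_for(music_leader), "22"))  # m2100x:M008MU22
```

An unknown code raises a KeyError here, which seems preferable to silently guessing a category for legacy data.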

The results are encouraging. Here’s a simple example from the record for “Legacy” by Roderick Buchanan, taken from the main catalogue of the National Library of Scotland. It has just a single 008 tag:

@prefix m2100x: <http://marc21rdf.info/elements/00X/>.
@prefix m21terms: <http://marc21rdf.info/terms/>.
ex:1
m2100x:M00806 m21terms:alltyp#s ;
m2100x:M00807-10 "2011" ;
m2100x:M00815-17 <http://id.loc.gov/vocabulary/countries/enk> ;
m2100x:M008BK29 "0" ;
m2100x:M008BK30 "0" ;
m2100x:M008BK31 "0" ;
m2100x:M008BK33 m21terms:booklit#0 ;
m2100x:M00835-37 <http://id.loc.gov/vocabulary/languages/eng> .

Extending the object URIs to their labels gives the RDF graph:

A more complicated example is the CD version of Abbey Road by The Beatles, with one 006, two 007, and one 008 tags in a record from OCLC WorldCat:

@prefix m2100x: <http://marc21rdf.info/elements/00X/>.
@prefix m21terms: <http://marc21rdf.info/terms/>.
ex:5
m2100x:M00600 m21terms:formofmaterial#m ;
m2100x:M006m09 m21terms:computertyp#u ;
m2100x:M00700 m21terms:cat#s ;
m2100x:M00700 m21terms:cat#c ;
m2100x:M007c01 m21terms:electrosmd#o ;
m2100x:M007c03 m21terms:electrocol#c ;
m2100x:M007c04 m21terms:electrodim#g ;
m2100x:M007c05 m21terms:electrodsnd#a ;
m2100x:M007s01 m21terms:soundrecordingsmd#d ;
m2100x:M007s03 m21terms:soundrecordingspd#f ;
m2100x:M007s04 m21terms:soundrecordingcpc#s ;
m2100x:M007s05 m21terms:soundrecordinggro#n ;
m2100x:M007s06 m21terms:soundrecordingdim#g ;
m2100x:M007s07 m21terms:soundrecordingwid#n ;
m2100x:M007s08 m21terms:soundrecordingtap#n ;
m2100x:M007s09 m21terms:soundrecordingkin#m ;
m2100x:M007s10 m21terms:soundrecordingmat#m ;
m2100x:M007s11 m21terms:soundrecordingcut#n ;
m2100x:M007s12 m21terms:soundrecordingspc#e ;
m2100x:M007s13 m21terms:soundrecordingcap#e ;
m2100x:M00806 m21terms:alltyp#r ;
m2100x:M00807-10 "2009" ;
m2100x:M00811-14 "1969" ;
m2100x:M00815-17 <http://id.loc.gov/vocabulary/countries/cau> ;
m2100x:M008MU18-19 m21terms:musicfoc#rc ;
m2100x:M008MU20 m21terms:musicfom#n ;
m2100x:M008MU21 m21terms:musicpar#n ;
m2100x:M008MU33 m21terms:musictra#n ;
m2100x:M00835-37 <http://id.loc.gov/vocabulary/languages/eng> .

This yields the (partial) extended RDF graph:

The MARC Bibliographic manual says “Coded data elements are potentially useful for retrieval and data management purposes”. Any graph connecting to these examples can use them for retrieval, provided the resource URIs ex:1 and ex:5 are linked to the location of one or more copies. For open online resources this might be sufficient, because the resource is a “link” away. For physical resources, more information is required to get access (retrieve), including basic human identification attributes such as title, author, and edition. That additional data is usually present in the MARC21 variable data fields, and I’ll discuss it in a future blog post, but it doesn’t have to come from there. So what we need are stable URIs for the resources ex:1 and ex:5, and some triples containing location information for copies. OCLC has a lot of that data in one place. Next most helpful are standard identifiers such as ISBN and ISSN, because they will help to link to graphs from the publishing, bookselling, and reading communities. Then some title, author, and edition information would be nice …

It would be very useful if national, regional, or international cataloguing agencies could get it together to put this on their agendas, soon.

Finally, notice the “0” values in ex:1 and the “not applicable” value in ex:5. The MARC21 fixed-length data fields support the Open World Assumption, unlike the MARC/AACR record as a whole, which definitely uses the Closed World Assumption, for example by not recording a first-edition statement.

By Gordon Dunsire, March 26, 2012, 11:48 am (UTC-5)

Like most people who do blogging (whether regularly, or sporadically like I do), I keep a list of ideas for posts, which I often add to, but less often write up. I’ve been a very poor blogger recently, and it’s not because nothing is going on, LOTS is going on. Perhaps it’s more that I’m waiting for some point where I could nail something down, and that moment seems not to be arriving. But one of my notes caused me to re-read a post I did over a year ago, and look at some of the other parts of that interview with the three luminaries at that ALISE program. I ran across a comment Janet Swan Hill made when asked about lessons learned from the last transition to AACR2:

“So I … think … the loss of independence, the loss of autonomy is one of the largest themes that I have seen. Another huge theme that I have seen in that period of time is we are still undergoing a period of grieving, I think, for the fact that we are learning that we have to put up with good enough.”

I agree with Janet’s insight—I see that kind of grieving frequently (most often displayed as anger) coming up in the cataloging venues of our profession. I sympathize, actually, much more than I often articulate—I’m far more likely to display frustration instead. But I think the problem lies in our definition of quality—we’ve put ourselves in a box where according to our deeply held notions of what quality is, we can never again achieve anything we can be proud of, because the world won’t pay for that particular kind of quality control anymore. All of us as human beings want to be appreciated for what we do, to achieve mastery in the area of work we’ve chosen, and somehow, many think it’s not possible to do that anymore in the world we see coming.

This is not entirely an illusion. The reality is that the old world where we built and maintained by hand our catalogs for users who needed our work to find the resources they required is gone, never to return. In fact, studies suggest that many of the newer users don’t understand the catalog at all, and use it infrequently, if ever. Certainly, because most libraries still have catalogs and still create information for them, it may be possible to maintain the illusion that there will always be catalogs, and therefore, there must always be catalogers to maintain them. We do all this work on computers, isn’t that enough?

Well, no, unfortunately it’s not enough, because we’re still creating catalogs and catalog cards, despite the computer technology we use today to create catalog records. But though I can understand the dismay about that disruptive fact, it seems to me that there’s plenty to look forward to. Make no mistake, with that forward looking vision there are still humans—well-trained and competent humans—continuing to pay attention to quality in their data, although using different techniques and certainly fewer human resources. Far too often the changes we see coming are translated in our brains as the death of quality in our world, but I don’t think that’s the case. How we define, measure, and assure quality will change, no doubt about it, but first we need to think realistically about what it means.

If we’re lucky and we do a good job figuring all this out, it will be ‘good enough.’ I would contend that ‘good enough’ was always the best we had on offer—there was never perfection, not ever. I remember when I was still working at Cornell, having routines that were run after every data load, to catch the known typos and other problems (some of which we’d created ourselves). Given how our catalogs were structured, this was important work, and made a difference to our users.

I can remember, too, during the many moons I spent on MARBI, that there were many discussions about whether or not the definition or structure of a particular field or subfield could potentially be misinterpreted or misused. My colleague Paul Weiss was particularly likely to argue that we should prevent people from doing such things, and one year I got a baseball cap with ‘USMARC Police’ stitched across the front that I would throw across the table when he started up that argument yet again. My point was that there was no MARC Police, and we’d better give up any fantasy that anyone would be on the enforcement end of good practices. (Although I recently noted that there’s a musical group called ‘Marc Police’ out there). More globally, there are no data police, so instead of pretending and wearing our baseball caps to prove they exist, we need to figure out some useful strategies for this new world we’re venturing into.

Consider, if the changes we’re talking about come to pass (and I believe they will), we’ll have statements instead of records, much less text, more batch improvement strategies, and to go along with that, different ways to measure quality. I wrote some of this up with my colleague Tom Bruce a few years ago, and it’s available here. The big message is that we need to change the conversation about quality and talk about it in an entirely new way. Quality is not about eyeballs trained one-by-one on individual records, but about new methods, new tools, and new attitudes. We will certainly need to use both our computer resources and our human resources more intelligently and flexibly, to share what we learn (whether effective or not), and to work closely with other collaborators in our endeavor—particularly the developers and coders who know more about what computers can do (and how to do it), than we do.

But I do keep my ‘USMARC Police’ cap in my office, just in case I ever need to throw it again.

By Diane Hillmann, December 7, 2011, 9:44 am (UTC-5)

Recently I retweeted the following:
“nice quote “your data ages like fine wine, whereas your software applications age like fish” in @mattwall’s j.mp/o8zsQG (via @edsu)”

Since then I’ve been thinking about the important lesson encapsulated in those less-than-140 characters, and how we’ve not really internalized this lesson in LibraryLand, no matter how many times we’ve migrated data. I remember many years ago, when I was working in the Cornell Law Library in the catalog card era, we were told by a university official that in case of fire, everyone should grab a shelf list drawer or two and head out the door. We were pretty stunned by this instruction, but they’d worked it all out—that catalog was the biggest investment the library had, and the only way to re-create it after a fire (if for nothing else than to determine the insurance to be paid for all those lost books), was via that shelf list.

Although a lot has changed since then, and most of those catalog cards were long ago recycled as scrap paper, the data they contained is (are?) still around, and still powering the online catalogs at Cornell. The catalog card drawers themselves were part of an ancient (and esthetically pleasing) piece of furniture, rescued from Boardman Hall, which was torn down in the 1950s to make way for Olin Library, a move many believe was a terrible mistake (Olin is the only modern building on Cornell’s Arts Quad). But I digress.

Like most libraries Cornell used OCLC’s services to create catalog cards, not paying much attention to the data being created as part of that process until well down the road. Also like many, Cornell actually had a clutch of ‘holding libraries,’ physical spaces associated with particular schools and programs, each creating what was effectively its own database via OCLC. But unlike most, Cornell bit that multiple-records bullet early and when the data was loaded into NOTIS, there was only one iteration of a bibliographic record, with all the local ‘holding libraries’ attached to it. A mini-version of OCLC’s ‘master record’, is one way to look at it, I suppose. It was a sensible, if not particularly popular move, and we all had occasion later to thank our lucky stars we had crossed that bridge as a group, rather than as individuals, when we saw the headaches our comrades were coping with.

My last data migration for Cornell was the one that moved data from the old NOTIS system to Voyager, and it was a year-long project that, if nothing else, reaffirmed my biases towards standard data. Although, like everyone else, we had some standard data (MARC bibs and authorities) and a lot of non-standard data (acquisitions and circulation), the bibliographic portion was, we agreed, the most important part, because everything else ‘hung off’ that bib record. Clearly, the data remained where our investment lay—by the end we weren’t even installing new versions of NOTIS in all the modules we used (and the ones we did install turned out to be mistakes). NOTIS was very old fish indeed by the time we moved to Voyager, and Voyager now, like most of the so-called ‘new generation’ of integrated library systems based on relational databases, is fast becoming a pungent geriatric fish as well.

Enough of looking back (interesting as that can be). The questions now revolve around how different we think our future will look. Will we continue to use/reuse our considerable legacy of data to build the services we want moving forward? If so, what are the steps we need to take, to transform our legacy data to RDA or any other more modern packaging for our data? We have a large number of value vocabularies as well as the MARC 21 schema we still rely on, which we will need to consider part of that plan for re-use.

I’ve seen a lot of ‘new rules for data’, but these are mine:
–Data should be able to be encoded in a variety of ways, to suit a variety of functions, uses, and systems
–Data should be managed at a granular, statement level, but also be available in a variety of record ‘formats’ (with records being understood as primarily an on-the-fly method of aggregating data for a variety of downstream users)
–Although current data is expressed mostly as text strings, data improvement strategies will be designed to change most of them to URIs as soon as practicable.
–Data definitions and specifications will be easily available on the web, allowing mapping to be simpler and easier to tweak

And the most important rule:
–Never, never make data decisions to fit the system flavor of the month, and ‘out’ any system that degrades our data as the price of functionality
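
As a rough illustration of the second and third rules (statement-level management with records built on the fly, and text strings upgraded to URIs), a minimal Python sketch might look like the following. The statement tuples, the property names, and the example.org URIs are all invented for illustration; they are not drawn from any real vocabulary or system.

```python
from collections import defaultdict

# Hypothetical statement-level store: each statement is a
# (subject, property, value) tuple managed on its own.
statements = [
    ("ex:book1", "title", "Moby Dick"),
    ("ex:book1", "audience", "juvenile"),
    ("ex:book2", "title", "Walden"),
]

# Hypothetical lookup from legacy text strings to value URIs.
AUDIENCE_URIS = {"juvenile": "http://example.org/vocab/audience#j"}


def upgrade_strings(stmts, prop, lookup):
    """Return new statements with string values of `prop` replaced by URIs
    wherever the lookup has a mapping; other values are left untouched."""
    return [
        (s, p, lookup.get(v, v) if p == prop else v)
        for s, p, v in stmts
    ]


def build_record(stmts, subject):
    """Aggregate all statements about one subject into a record on the fly."""
    record = defaultdict(list)
    for s, p, v in stmts:
        if s == subject:
            record[p].append(v)
    return dict(record)


upgraded = upgrade_strings(statements, "audience", AUDIENCE_URIS)
record = build_record(upgraded, "ex:book1")
print(record)
```

The point of the sketch is that the record is only a view: the statements remain the managed unit, and a different downstream user could aggregate the same statements into a different record shape.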

This is not to say that the transition of our old data to what we need for a newer environment is going to be seamless, lossless or even easy. It will be none of those things. But I would contend that it’s not rocket science either, and we’d be well advised not to indulge in needless hand-wringing until we’ve explored the issues more fully. Stay tuned …

By Diane Hillmann, September 8, 2011, 9:30 am (UTC-5)

I’m supposed to be writing a paper (as part of a team and as designated herder) but like most people I have strategies for avoiding such tasks, not necessarily in ways that are entirely useless, just useless in the context of a particular deadline. In this instance, I’ve been listening to an interview of Janet Swan Hill done last summer at ALA Annual and now available on a website called “Gathering our Stories: Developing a National Oral History Program of Retiring/Retired Librarians”.

It’s definitely worth listening to the interview: Janet has been present for many of the important moments in the collective past of most catalogers in this country, and her viewpoint is always worth listening to. This is not to say that I always agree with her; I don’t, and in particular I don’t agree with her position on RDA. Some of that disagreement arises from the fact that she, like most catalogers (and far too many library administrators), thinks of RDA as the successor to AACR2, the cataloging ‘rules’. I, on the other hand, don’t care at all about the rules (there, I’ve said it, are you all happy now?). Instead I see the potential of RDA elsewhere: in the vocabularies specifically, and not incidentally in the revolution they represent in the way we envision our future in metadata. Put more succinctly, it’s not what we say, but how we say it, that makes RDA a big leap forward.

But Houston, we have a problem. Janet defines it very well (quotes from the transcript accompanying the video interview above):

“I suspect that we will go ahead and implement RDA, uh, after I retire. (Laughter) I suspect that many libraries will not implement it because one of the things that proponents of RDA are most eager to say is “Oh, it won’t make that much difference. Your old records will be compatible with the new ones.” So a lot of libraries are going to listen to that and say so why should I implement the thing.”

People, this is a huge problem for us. It’s a REVOLUTION we’re talking about with RDA, not just shifting the deck chairs on the Titanic, and it has little or nothing to do with the rules. And yes, it will cost us something to implement, but the ridiculous testing regime initiated in part because Janet (yes, this Janet) convinced the LC Working Group on the Future of Bibliographic Control to include a recommendation that work on RDA be suspended, will not help us determine whether or how we should implement RDA.

If I sound frustrated, it’s because I am. For most of the past few years, as the RDA Vocabularies have been developed, the marketing effort for RDA mounted by the JSC and ALA Publishing has been wholly focused on the guidance text and the RDA Toolkit. Only very recently have the vocabularies and their value been included in the educational efforts that have been mounted nationally and internationally around RDA. [See the ALA Webinars coming up for evidence of change.] The small, cranky group that developed the vocabularies has gotten even crankier as a result, but there are days when I worry that without better understanding of what RDA represents, our efforts will be too little, too late. As we all wait for the result of the Testing Theater effort (see this previous post for my opinion on that), it seems less and less likely that a clear message will emerge from that confused process, and we definitely need a clear message from those most librarians still consider the leaders of the US library community.

The most recent cause for concern has been the draft ‘PoCo Discussion Paper on RDA Implementation alternatives’. The beginning portion ended with literally the only sentence in the problem statement portion of the report that I could easily agree with: “In any scenario PCC must adapt to a hybrid environment.” But the question is, will that hybrid environment be facing backward or forward? And the question not asked in the report, but definitely assumed: in that inevitable hybrid environment, what would be the role of an organization such as PCC? The current value of the PCC is built almost entirely on the consensus-based environment of the past, where agreements on basic functionality of cataloging records emerged from a common necessity to provide a standard ‘floor’ below which efforts to rein in costs should not sink. But is that value the same in the future environment? Based on this report, it seems clear that the thinking of the writers of the discussion paper is still deeply embedded in the past, and they see the future as an entirely problematic extension with few opportunities for libraries or users in the change that RDA represents. “Perpetuating the hybrid environment long term will have a negative (and costly) impact on our catalogs and on all areas of bibliographic control.”

It seems very clear from the issues presented in the discussion paper that the negative view of the future stems from the lack of understanding of what will actually need to change to enable libraries to fully implement RDA, and what that change offers us at this critical time for libraries. A real RDA implementation, with the benefits already under extensive discussion in the library community, cannot, CANNOT, actually happen in a MARC environment with the inwardly focused assumptions in evidence in the discussion paper. This is not to say that documentation, training, protecting our legacy in terms of our MARC records and authority files are not rightfully topics that we ought to be discussing, but those discussions need to happen with fuller understanding of the environment we will be working in as we move our focus to the web, and away from our current catalogs. [See Karen Coyle’s TechSource reports here and here for a great start in understanding what we need to do.]

At the end of its paper, the Task Group proposes the following:

“Recognizing that there is a cost associated with choosing a direction that is different from the US national libraries, recognizing that PCC institutions will face a hybrid environment, and recognizing that there is a value to the PCC in member contributions from either rule set, the PCC should formally adopt RDA, regardless of the outcome of the US RDA Test, and the decision of the US national libraries, but it should set no time limit on implementation of RDA by PCC institutions.”

I heartily agree with this conclusion, and I say to the Task Group—tell us how we can help you consider some options that don’t stop with the unsustainable assumption of cramming RDA into MARC. Persuade us that moving to RDA is something we should embrace. Because the route you seem to outline can’t result in success, and libraries need successful paths, as well as correct decisions.

By Diane Hillmann, April 24, 2011, 3:55 pm (UTC-5)

At my keynote at Code4Lib a few weeks ago [recorded here about 90 minutes in], I got a good laugh when I equated the continuum that catalogers and programmers inhabit to that described by Kinsey in his famous discussion of sexuality. Since then, perhaps as a response to my presentation and Eric Hellman’s at the end of Code4Lib, there seems to have been a resurgence of the conversation that comes and goes, particularly on cataloging blogs and discussion lists, about whether catalogers should learn to code and thus, perhaps, shift their personal position on the continuum I was describing (though probably not on Kinsey’s).

Some examples of this discussion can be found here and here.

To be honest, I get a little frustrated by these conversations, mostly because I think they miss the point about what it is that both catalogers and programmers bring to the table. Far too often, the conversation devolves to: ‘Why can’t you be more like me?’ I frankly don’t think that point makes any more sense now than it did some decades ago when the same arguments were made in support of all librarians learning to catalog. It’s not that I’m trying to discourage librarians, particularly the cohort at the beginning of their careers who see technology as a big part of their futures, from delving more deeply into the mysteries of code. Those who see the value and have the opportunity to learn should take advantage of that, just as more programmers working in the library sector should be exploring the history and culture of knowledge organization in libraries [A good place to start: The Intellectual Foundation of Information Organization, by Elaine Svenonius. Cambridge, MA : MIT Press, 2000]. Note that I didn’t say ‘cataloging’, because it’s more than that, just as what programmers do and how they think is only partially about coding. The more we can do to move ourselves closer to the middle of that continuum, to understand more about how technology works under the hood, and more about how library data was organized and created over the last century or so, the better we’ll be able to work together to solve the problems we see limiting our forward progress. For me, it’s about respect and understanding, which may or may not include emulation.

I’m perfectly willing to admit that some of my irritation with the argument that learning to code is necessary for librarians is that I don’t know programming at all, and the likelihood that I’ll learn to program at this stage of my life is similar to the likelihood that I’ll grow a few more inches (in a vertical direction, mind you) before I shuffle off this mortal coil. I don’t think my lack of programming knowledge has impeded me in learning what I need to know about the technology that interests me and has been the focus of my career for the last 15 years or so. In fact, one of the compliments I received a few years ago from a programmer is one I particularly treasure: he told me that I thought like a programmer. It will surprise nobody that I have no idea what that really means, but I took it as a compliment, and it was certainly meant as one.

I’m far more interested in learning more about ontologies, knowledge and vocabulary management, and information architecture, and it seems to me that this is an area where the significant gaps in librarian knowledge affect our ability to envision our future and make it happen. For the most part, we have some basic understanding about vocabularies but it’s almost entirely built on MARC (mine certainly was a few years ago), and that’s not going to help us much moving forward. This area is not, in my experience, one where programmers have either interest or knowledge, but it’s a natural extension of the path librarians are already headed down.

According to Myers-Briggs, I’m an ENTJ, and aside from learning the interesting categorizations of people that are a big part of Myers-Briggs, my take-away from the workshops I attended was that there’s no good-better-best kind of personality or approach for any particular profession, task or team. Particularly for a team, what you want is diversity, not a group that thinks all one way. I’ve never forgotten this point, and still think it’s the key to any of our endeavors. I think the Code4Lib model is a terrific one for getting our heads together and figuring out how to move forward, and I hope to continue to look for ways to get more catalogers to attend and think about how they might contribute, as well as airing these issues in their own venues. (And many thanks to the programmers who show up regularly at ALA!)

Aside from my strong feeling that there are other, more significant gaps in our knowledge than coding, there are two additional aspects of this ‘librarians-should-be-coders’ discussion that really worry me. The first is that it will discourage those who don’t have the opportunities to learn coding from learning what they need to know to understand the technology that drives our world, well enough to participate in the change we need. My second big concern is that we’ll start focusing again on the ‘why can’t you be more like me’ instead of remembering that we need the skills and understanding of a broad range of librarians and technologists to get where we need to go, not just the ones who have been convinced that coding is the best way to prove their enthusiasm and commitment to moving ahead.

By Diane Hillmann, March 2, 2011, 5:12 pm (UTC-5)

Some of you have already seen the live feed or the recordings for last week’s Code4Lib conference. If you have, you might already know that I was the keynote speaker for that conference. (The archive page is here, my part is about 90 minutes into session 1; slides are available too). The whole story of how I got there is interesting, and beyond that I’d like to talk about what I took away from it. I attended all of Tuesday and Wednesday, and left Thursday morning (after my return from ALA Midwinter in January, I’ve developed a strong disincentive to book the last flights into Ithaca from anywhere), thus missing the Thursday morning events. I’ve since caught up with those recordings.

The invitation came from conference host Robert McDonald, and was totally out of the blue. Code4Lib has an admirable process for choosing keynoters–they have a wiki and backchannel list (that anyone can join), which keeps the voting off the main discussion list. I’ve never attended Code4Lib before, though I’ve been a lurker and an occasional participant in the discussions on the list for some years, and I know many of the regulars. As someone who hadn’t attended the conference before, it never dawned on me to participate in the voting. I didn’t get the most votes, but when the high vote getter turned them down, I was asked. At first I was pretty intimidated by the whole idea, but that passed fairly quickly, and I started to get excited by the challenge it represented, both for me personally, and as a representative of a whole host of librarians who never get a chance to talk to a room full of library programmers. It was clearly not an opportunity to be wasted.

I gave a lot of thought to what I wanted to talk about, and started and abandoned several topics before settling on one. It clicked for me when I participated in a discussion at ALA Midwinter amongst attendees at the organizing meeting for the Linked Library Data IG. The discussion was about the discouraging fact that programmers and librarians (particularly catalogers) don’t seem to be connecting on the important issues of our libraries; instead, we talk past one another. I think the general assumption is that this is a cultural divide, and it is on a superficial level, but a much more important reason is that we almost never gather together to discuss where we’re going. We all work for institutions that we believe are critically important in today’s society, but we’re not working together to solve the problems we can see in front of us.

So my talk for C4L covered a number of areas, including advice to programmers on how to find and connect with librarians/catalogers in their institutions who might be ready to work with them more closely, and what the priorities should be for that work. Despite a fairly rough start to the talk (the IU laptop I was using had a new version of PowerPoint that behaved quite oddly in presenter mode), it went fairly well and the response was wonderful. During the rest of that day and the following one, I had some great conversations with other attendees about the issues I brought up, and there will be some follow through on several of those. I was very pleased in particular that my plea for building demonstration projects that would show how the RDA Vocabularies can be used was taken very seriously, and I will be following up on that one.

One question I threw out to the audience was whether anyone had read our article in D-Lib, ‘RDA Vocabularies: Process, Outcome, Use’. About a half dozen had, but probably twice that many tweeted the URL, so perhaps some more have read it subsequently. I’m not sure why such a disappointing number have seen the article, but I hope that some who are interested in moving away from the frustrating parsing of MARC data will see the light.

I also talked a bit about how the library world had been ill-served by the narrow marketing of RDA as primarily the guidance text (it’s still happening, unfortunately), as well as the whole RDA testing regime. Because the tests crammed RDA data into MARC, it really doesn’t operate as a test of RDA itself, or of the usefulness of FRBR. What we’ve ended up with is a vast amount of misunderstanding: many traditionalists still believe that RDA is not that different from AACR2, while those who believe that RDA isn’t enough change (or the change we need, to coin a phrase) believe the same thing but come to a different conclusion. As I said to the C4L group: “I get why catalogers like MARC, but I don’t get why you guys aren’t all over the RDA Vocabs.”

After my own time in the spotlight, I became just another participant (the difference was that everybody knew who I was and I had to squint at their badges to see who they were). Thankfully nobody got freaked out that I was knitting socks while listening to other people’s presentations (and at least one pulled out her half-knitted sock to show me). With a laptop in front of me (not to mention IRC and Twitter), I wouldn’t have heard a thing. But, listening to the wide variety of presentations, I was very impressed by the amount of creativity, and the diversity of projects presented. I understood most of it, at least at a general level (though not perhaps on an operational one), and took some notes about a few insights I wanted to think about as I work on various projects. It was really a great conference, and the organizers did a fabulous job with everything. Do take a look at the video, and think about how you might make some connections with the catalogers or programmers in your life. We are all in this together, and we need to find better ways to converse and collaborate to make our ideas real.

Oh, and lest I forget, thanks to all the folks who shared their wonderful and special beer with me during the after hours social time in the hospitality suite. You just may have turned me from an always-wine to a sometimes-beer broad. (And don’t worry, Declan, the beer washed out of my jacket just fine!)

By Diane Hillmann, February 15, 2011, 8:41 am (UTC-5)