Metadata Matters

It’s not just me that’s getting old

Diane Hillmann — Tue, 09 Feb 2016 14:19:16 +0000

Having just celebrated (?) another birthday at the tail end of 2015, the topics of age and change have been even more on my mind than usual. And then two events converged. First I had a chat with Ted Fons in a hallway at Midwinter, and he asked about using an older article I’d published with Karen Coyle way back in early 2007 (“Resource Description and Access (RDA): Cataloging Rules for the 20th Century”). The second thing was a message from Research Gate that reported that the article in question was easily the most popular thing I’d ever published. My big worry in terms of having Ted use that article was that RDA had experienced several sea changes in the nine (!) years since the article was published (Jan./Feb. 2007), so I cautioned Ted about using it.

Then I decided I needed to reread the article and see whether I had spoken too soon.

The historic rationale holds up very well, but it’s important to note that at the time that article was written, the JSC (now the RSC) was foundering, reluctant to make the needed changes to cut ties to AACR2. The quotes from the CC:DA illustrate how deep the frustration was at that time. There was a real turning point looming for RDA, and I’d like to believe that the article pushed a lot of people to be less conservative and more emboldened to look beyond the cataloger tradition.

In April of 2007, a mere few months from when this article came out, ALA Publishing arranged for the famous “London Meeting” that changed the course of RDA. Gordon Dunsire and I were at that meeting–in fact it was the first time we met. I didn’t even know much about him aside from his article in the same DLIB issue. As it turns out, the RDA article was elevated to the top spot, thus stealing some of his thunder, so he wasn’t very happy with me. The decision made in London to allow DCMI to participate by building the vocabularies was a game changer, and Gordon and I were named co-chairs of a Task Group to manage that process.

So as I re-read the article, I realized that the most important bits at the time are probably mostly of historical interest at this point. I think the most important takeaway is that RDA has come a very long way since 2007, and in some significant ways is now leading the pack in terms of its model and vocabulary management policies (more about that to come).

And I still like the title! …even though it’s no longer a true description of the 21st Century RDA.

Denying the Non-English Speaking World

Diane Hillmann — Sun, 03 Jan 2016 22:05:00 +0000

Not long ago I encountered the analysis of BibFrame published by Rob Sanderson with contributions by a group of well-known librarians. It’s a pretty impressive document–well organized and clearly referenced. But in fact there’s also a significant amount of personal opinion in it, the nature of which is somewhat masked by the references to others holding the same opinion.

I have a real concern about some of those points where an assertion of ‘best practices’ are particularly arguable. The one that sticks in my craw particularly shows up in section 2.2.5:

2.2.5 Use Natural Keys in URIs
References: [manning], [ldbook], [gld-bp], [cooluris]

Although the client must treat URIs as opaque strings, it is good practice to construct URIs in a systematic and human readable fashion for both instances and ontology terms. A natural key is one that appears in the information about the resource, such as some unique identifier for the resource, or the label of the property for ontology terms. While the machine does not care about structure, memorability or readability of URIs, the developers that write the code do. Completely random URIs introduce difficult to detect semantic and algorithmic errors in both publication and consumption of the data.

Analysis:

The use of natural keys is a strength of BIBFRAME, compared to similarly scoped efforts in similar communities such as the RDA and CIDOC-CRM vocabularies which use completely opaque numbers such as P10001 (hasRespondent) or E33 (Linguistic Entity). RDA further misses the target in this area by going on to define multiple URIs for each term with language tagged labels in the URI, such as rda:hasRespondent.en mapping to P10001. This is a different predicate from the numerical version, and using owl:sameAs to connect the two just makes everyone’s lives more difficult unnecessarily. In general, labels for the predicates and classes should be provided in the ontology document, along with thorough and understandable descriptions in multiple languages, not in the URI structure.

This sounds fine so long as you accept the idea that ‘natural’ means English, because, of course, all developers, no matter their first language, must be fluent enough in English to work with English-only standards and applications. This mis-use of ‘natural’ reminds me of other problematic usages, such as the former practice in the adoption community (of which I have been a part for 40 years) where ‘natural’ was routinely used to refer to birth parents, thus relegating adoptive parents to the ‘un-natural’ realm. So in this case, if ‘natural’ means English, are all other languages inherently un-natural in the world of development? The library world has been dominated by the ‘Anglo-American’ notions of standard practice for a very long time, and happily, RDA is leading away from that, both in governance and in development of vocabularies and tools.

The Multilingual strategy adopted by RDA is based on the following points:

More than a decade of managing vocabularies has convinced us that opaque identifiers are extremely valuable for managing URIs, because they need not be changed as labels change (only as definitions change). The kinds of ‘churn’ we saw in the original version of RDA (2008-2013) convinced us that label-based URIs were a significant problem (and cost) that became worse as the vocabularies grew over time.
We get the argument that opaque URIs are often difficult for humans to use–but the tools we’re building (the RDA Registry as case in point) are intended to give human developers what they want for their tasks (human readable URIs, in a variety of languages) but ensure that the URIs for properties and values are set up based on what machines need. In this way, changes in the lexical URIs (human-readable) can be maintained properly without costly change in the canonical URIs that travel with the data content itself.
The multiple language translations (and distributed translation management by language communities) also enable humans to build discovery and display mechanisms for users that are speakers of a variety of languages. This has been a particularly important value for national libraries outside the US, but also potentially for libraries in the US meeting the needs of non-English language communities closer to home.

It’s too easy for the English-first library development community to insist that URIs be readable in English and to turn a blind eye to the degree that this imposes understanding of the English language and Anglo-American library culture on the rest of the world. This is not automatically the intellectual gift that the distributors of that culture assume it to be. It shouldn’t be necessary for non-Anglo-American catalogers to learn and understand Anglo-American language and culture in order to express metadata for a non-Anglo audience. This is the rough equivalent of the Philadelphia cheese steak vendor who put up a sign reading “This is America. When ordering speak in English”.

We understand that for English-speaking developers bibframe.org/vocab/title is initially easier to use than rdaregistry.info/Elements/w/P10088 or even (heaven forefend!) “130_0#$a” (in RDF: marc21rdf.info/elements/1XX/M1300_a). That’s why RDA provides rdaregistry.info/Elements/w/titleOfTheWork.en but also, eventually, rdaregistry.info/Elements/w/拥有该作品的标题.ch and rdaregistry.info/Elements/w/tieneTítuloDeLaObra.es, et al (you do understand Latin of course). These ‘unnatural’ Lexical Aliases will be provided by the ‘native’ language speakers of their respective national library communities.

As one of the many thousands of librarians who ‘speak’ MARC to one another–despite our language differences–I am loathe to give up that international language to an English-only world. That seems like a step backwards.

Review of: DRAFT Principles for Evaluating Metadata Standards

Diane Hillmann — Mon, 14 Dec 2015 21:59:15 +0000

Metadata standards is a huge topic and evaluation a difficult task, one I’ve been involved in for quite a while. So I was pretty excited when I saw the link for “DRAFT Principles for Evaluating Metadata Standards”, but after reading it? Not so much. If we’re talking about “principles” in the sense of ‘stating-the-obvious-as-a-first-step’, well, okay—but I’m still not very excited. I do note that the earlier version link uses the title ‘draft checklist’, and I certainly think that’s a bit more real than ‘draft principles’ for this effort. But even taken as a draft, the text manages to use lots of terms without defining them—not a good thing in an environment where semantics is so important. Let’s start with a review of the document itself, then maybe I can suggest some alternative paths forward.

First off, I have a problem with the preamble: “These principles are intended for use by libraries, archives and museum (LAM) communities for the development, maintenance, governance, selection, use and assessment of metadata standards. They apply to metadata structures (field lists, property definitions, etc.), but can also be used with content standards and value vocabularies”. Those tasks (“development, maintenance, governance, selection, use and assessment” are pretty all encompassing, but yet the connection between those tasks and the overall “evaluation” is unclear. And, of course, without definitions, it’s difficult to understand how ‘evaluation’ relates to ‘assessment’ in this context—are they they same thing?

Moving on to the second part about what kind of metadata standards that might be evaluated, we have a very general term, ‘metadata structures’, with what look to be examples of such structures (field lists, property definitions, etc.). Some would argue (including me) that a field list is not a structure without a notion of connections between the fields; and although property definitions may be part of a ‘structure’ (as I understand it, at least), they are not a structure, per se. And what is meant by the term ‘content standards’, and how is that different from ‘metadata structures’? The term ’value vocabularies’ goes by many names, and is not something that can go without a definition. I say this as an author/co-author of a lot of papers that use this term, and we always define it within the context of the paper for just that reason.

There are many more places in the text where fuzziness in terminology is a problem (maybe not a problem for a checklist, but certainly for principles). Some examples:

1. What is meant by ’network’? There are many different kinds, and if you mean to refer to the Internet, for goodness sakes say so. ‘Things’ rather than ‘strings’ is good, but it will take a while to make it happen in legacy data, which we’ll be dealing with for some time, most likely forever. Prospectively created data is a bit easier, but still not a cakewalk — if the ‘network’ is the global Internet, then “leveraging ‘by-reference’ models” present yet-to-be-solved problems of network latency, caching, provenance, security, persistence, and most importantly: stability. Metadata models for both properties and controlled values are an essential part of LAM systems and simply saying that metadata is “most efficient when connected with the broader network” doesn’t necessarily make it so.

2. ‘Open’ can mean many things. Are we talking specific kinds of licenses, or the lack of a license? What kind of re-use are you talking about? Extension? Wholesale adoption with namespace substitution? How does semantic mapping fit into this? (In lieu of a definition, see the paper at (1) below)

3. This principle seems to imply that “metadata creation” is the sole province of human practitioners and seriously muddies the meaning of the word creation by drawing a distinction between passive system-created metadata and human-created metadata. Metadata is metadata and standards apply regardless. What do you mean by ‘benefit user communities’? Whose communities? Please define what is meant by ‘value’ in this context? How would metadata practitioners ‘dictate the level of description provided based on the situation at hand’?

4. As an evaluative ‘principle’ this seems overly vague. How would you evaluate a metadata standard’s ability to ‘easily’ support ‘emerging’ research? What is meant by ‘exchange/access methods’ and what do they have to do with metadata standards for new kinds of research?

5. I agree totally with the sentence “Metadata standards are only as valuable and current as their communities of practice,” but the one following makes little sense to me. “ … metadata in LAM institutions have been very stable over the last 40 years …” Really? It could easily be argued that the reason for that perceived stability is the continual inability of implementers to “be a driving force for change” within a governance model that has at the same time been resistant to change. The existence of the DCMI usage board, MARBI, the various boards advising the RDA Steering Committee, all speak to the involvement of ‘implementers’. Yet there’s an implication in this ‘principle’ that stability is liable to no longer be the case and that implementers ‘driving’ will somehow make that inevitable lack of stability palatable. I would submit that stability of the standard should be the guiding principle rather than the democracy of its governance.

6. “Extensible, embeddable, and interoperable” sounds good, but each is more complex than this triumvirate seems. Interoperability in particular is something that we should all keep in mind, but although admirable, interoperability rarely succeeds in practice because of the practical incompatibility of different models. DC, MARC21, BibFrame, RDA, and Schema.org are examples of this — despite their ‘modularity’ they generally can’t simply be used as ‘modules’ because of differences in the thinking behind the model and their respective audiences.

I would also argue that ‘lite style implementations’ make sense only if ‘lite’ means a dumbed-down core that can be mapped to by more detailed metadata. But stressing the ‘lite implementations’ as a specified part of an overall standard gives too much power to the creator of the standard, rather than the creator of the data. Instead we should encourage the use of application profiles, so that the particular choices and usages of the creating entity are well documented, and others can use the data in full or in part according to their needs. I predict that lossy data transfer will be less acceptable in the reality than it is in the abstract, and would suggest that dumb data is more expensive over the longer term (and certainly doesn’t support ‘new research methods’ at all). “Incorporation into local systems” really can only be accomplished by building local systems that adhere to their own local metadata model and are able to map that model in/out to more global models. Extensible and embeddable are very different from interoperable in that context.

7. The last section, after the inarguable first sentence, describes what the DCMI ‘dumb-down’ principle defined nearly twenty years ago, and that strategy still makes sense in a lot of situations. But ‘graceful degradation’ and ‘supporting new and unexpected uses’ requires smart data to start with. ‘Lite’ implementation choices (as in #6 above) preclude either of those options, IMO, and ‘adding value’ of any kind (much less by using ‘ontological inferencing’) is in no way easily achievable.

I intend to be present at the session in Boston [9:00-10:00 Boston Conference and Exhibition Center, 107AB] and since I’ve asked most of my questions here I intend not to talk much. Let’s see how successful I can be at that!

It may well be that a document this short and generalized isn’t yet ready to be a useful tool for metadata practitioners (especially without definitions!). That doesn’t mean that the topics that it’s trying to address aren’t important, just that the comprehensive goals in the preamble are not yet being met in this document.

There are efforts going on in other arenas–the NISO Bibliography Roadmap work, for instance, that should have an important impact on many of these issues, which suggests that it might be wise for the Committee to pause and take another look around. Maybe a good glossary would be a important step?

Dunsire, Gordon, et al. “A Reconsideration of Mapping in a Semantic World”, paper presented at International Conference on Dublin Core and Metadata Applications, The Hague, 2011. Available at: dcpapers.dublincore.org/pubs/article/view/3622/1848

The Jane-athons continue!

Diane Hillmann — Mon, 07 Dec 2015 16:19:03 +0000

The Jane-athon series is alive, well, and expanding its original vision. I wrote about the first ‘official’ Jane-athon earlier this year, after the first event at Midwinter 2015.

Since then the excitement generated at the first one has spawned others:

the Ag-athon in the UK in May 2015, sponsored by CILIP

the Maurice Dance in New Zealand (October 16, 2015 at the National Library of New Zealand in Wellington, focused on Maurice Gee)

the Jane-in (at ALA San Francisco at Annual 2015)

the RLS-athon (November 9, 2015, Edinburgh, Scotland), following the JSC meeting there and focused on Robert Louis Stevenson

Like good librarians we have an archive of the Jane-athon materials, for use by anyone who wants to look at or use the presentations or the data created at the Jane-athons

We’re still at it: the next Jane-athon in the series will be the Boston Thing-athon at Harvard University on January 7, 2016. Looking at the list of topics gives a good idea about how the Jane-athons are morphing to a broader focus than that of a creator, while training folks to create data with RIMMF. The first three topics (which may change–watch this space) focus not on specific creators, but on moving forward on topics identified of interest to a broader community.

* Strings vs things. A focus on replacing strings in metadata with URIs for things.
* Institutional repositories, archives and scholarly communication. A focus on issues in relating and linking data in institutional repositories and archives with library catalogs.
* Rare materials and RDA. A continuing discussion on the development of RDA and DCRM2 begun at the JSC meeting and the international seminar on RDA and rare materials held in November 2015.

For beginners new to RDA and RIMMF but with an interest in creating data, we offer:
* Digitization. A focus on how RDA relates metadata for digitized resources to the metadata for original resources, and how RIMMF can be used to improve the quality of MARC 21 records during digitization projects.
* Undergraduate editions. A focus on issues of multiple editions that have little or no change in content vs. significant changes in content, and how RDA accommodates the different scenarios.

Further on the horizon is a recently approved Jane-athon for the AALL conference in July 2016, focusing on Hugo Grotius (inevitably, a Hugo-athon, but there’s no link yet).

NOTE: The Thing-a-thon coming up at ALA Midwinter is being held on Thursday rather than the traditional Friday to open the attendance to those who have other commitments on Friday. Another new wrinkle is the venue–an actual library away from the conference center! Whether you’re a cataloger or not-a-cataloger, there will be plenty of activities and discussions that should pique your interest. Do yourself a favor and register for a fun and informative day at the Thing-athon to begin your Midwinter experience!

Instructions for registering (whether or not you plan to register for MW) can be found on the Toolkit Blog.

Separating ideology, politics and utility

Diane Hillmann — Mon, 12 Oct 2015 15:08:03 +0000

Those of you who pay attention to politics (no matter where you are) are very likely to be shaking your head over candidates, results or policy. It’s a never ending source of frustration and/or entertainment here in the U.S., and I’ve noticed that the commentators seem to be focusing in on issues of ideology and faith, particularly where it bumps up against politics. The visit of Pope Francis seemed to be taking everyone’s attention while he was here, but though this event has added some ‘green’ to the discussion, it hasn’t pushed much off the political plate.

Politics and faith bump up against each other in the metadata world, too. What with traditionalists still thinking in MARC tags and AACR2, to the technical types rolling their eyes at any mention of MARC and trying to push the conversation towards RDA, RDF, BibFrame, schema.org, etc., there are plenty of metadata politics available to flavor the discussion.

The good news for us is that the conflicts and differences we confront in the metadata world are much more amenable to useful solution than the politics crowding our news feeds. I remember well the days when the choice of metadata schema was critical to projects and libraries. Unfortunately, we’re all still behaving as if the proliferation of ‘new’ schemas makes the whole business more complicated–that’s because we’re still thinking we need to choose one or another, ignoring the commonality at the core of the new metadata effort.

But times have changed, and we don’t all need to use the same schema to be interoperable (just like we don’t all need to speak English or Esperanto to communicate). But what we do need to think about is what the needs of our organization are at all stages of the workflow: from creating, publishing, consuming, through integrating our metadata to make it useful in the various efforts in which we engage.

One thing we do need to consider as we talk about creating new metadata is whether it will need to work with other data that already exists in our institution. If MARC is what we have, then one requirement may be to be able to maintain the level of richness we’ve built up in the past and still move that rich data forward with us. This suggests to me that RDA, which RIMMF has demonstrated can be losslessly mapped to and from MARC, might be the best choice for the creation of new metadata.

Back in the day, when Dublin Core was the shiny new thing, the notion of ‘dumb-down’ was hatched, and though not an elegantly named principle, it still works. It says that rich metadata can be mapped fairly easily into a less-rich schema (‘dumbed down’), but once transformed in a lossy way, it can’t easily be ‘smartened up’. But in a world of many publishers of linked data, and many consumers of that data, the notion of transforming rich metadata into any number of other schemas and letting the consumer chose what they want, is fairly straightforward, and does not require firm knowledge (or correct assumptions) of what the consumers actually need. This is a strategy well-tested with OAI-PMH which established a floor of Simple Dublin Core but encouraged the provision of any number of other formats as well, including MARC.

As consumers, libraries and other cultural institutions are also better served by choices. Depending on the services they’re trying to support, they can choose what flavor of data meets their needs best, instead of being offered only what the provider assumes they want. This strategy leaves open the possibility of serving MARC as one of the choices, allowing those institutions still nursing an aged ILS to continue to participate.

Of course, the consumers of data need to think about how they aggregate and integrate the data they consume, how to improve that data, and how to make their data services coherent. That’s the part of the new create, publish, consume, integrate cycle that scares many librarians, but it shouldn’t–really!

So, it’s not about choosing the ‘right’ metadata format, it’s about having a fuller and more expansive notion about sharing data and learning some new skills. Let’s kiss the politics goodbye, and get on with it.

Semantic Versioning and Vocabularies

Diane Hillmann — Sun, 20 Sep 2015 23:41:14 +0000

A decade ago, when the Open Metadata Registry (OMR) was just being developed as the NSDL Registry, the vocabulary world was a very different place than it is today. At that point we were tightly focussed on SKOS (not fully cooked at that point, but Jon was on the WG that was developing it, so we felt pretty secure diving in).

But we were thinking about versioning in the Open World of RDF even then. The NSDL Registry kept careful track of all changes to a vocabulary (who, what, when) and the only way to get data in was through the user interface. We ran an early experiment in making versions based on dynamic, timestamp-based snapshots (we called them ‘time slices’, Git calls them ‘commit snapshots’) available for value vocabularies, but this failed to gain any traction. This seemed to be partly because, well, it was a decade ago for one, and while it attempted to solve an Open World problem with versioned URIs, it created a new set of problems for Closed World experimenters. Ultimately, we left the versions issue to sit and stew for a bit (6 years!).

All that started to change in 2008 as we started working with RDA, and needed to move past value vocabularies into properties and classes, and beyond that into issues around uploading data into the OMR. Lately, Git and GitHub have started taking off and provide a way for us to make some important jumps in functionality that have culminated in the OMR/GitHub-based RDA Registry. Sounds easy and intuitive now, but it sure wasn’t at the time, and what most people don’t know is that the OMR is still where RDA/RDF data originates — it wasn’t supplanted by Git/Github, but is chugging along in the background. The OMR’s RDF CMS is still visible and usable by all, but folks managing larger vocabularies now have more options.

One important aspect of the use of Git and GitHub was the ability to rethink versioning.

Just about a year ago our paper on this topic (Versioning Vocabularies in a Linked Data World, by Diane Hillmann, Gordon Dunsire and Jon Phipps) was presented to the IFLA Satellite meeting in Paris. We used as our model the way software on our various devices and systems is updated–more and more these changes happen without much (if any) interaction with us.

In the world of vocabularies defining the properties and values in linked data, most updating is still very manual (if done at all), and the important information about what has changed and when is often hidden behind web pages or downloadable files that provide no machine-understandable connections identifying changes. And just solving the change management issue does little to solve the inevitable ‘vocabulary rot’ that can make published ‘linked data’ less and less meaningful, accurate, and useful over time.

Building stable change management practices is a very critical missing piece of the linked data publishing puzzle. The problem will grow exponentially as language versions and inter-vocabulary mappings start to show up as well — and it won’t be too long before that happens.

Please take a look at the paper and join in the conversation!

Five Star Vocabulary Use

Diane Hillmann — Fri, 07 Aug 2015 18:50:46 +0000

Most of us in the library and cultural heritage communities interested in metadata are well aware of Tim Berners-Lee’s five star ratings for linked open data (in fact, some of us actually have the mug).

The five star rating for LOD, intended to encourage us to follow five basic rules for linked data is useful, but, as we’ve discussed it over the years, a basic question rises up: What good is linked data without (property) vocabularies? Vocabulary manager types like me and my peeps are always thinking like this, and recently we came across solid evidence that we are not alone in the universe.

Check out: “Five Stars of Linked Data Vocabulary Use”, published last year as part of the Semantic Web Journal. The five authors posit that TBL’s five star linked data is just the precondition to what we really need: vocabularies. They point out that the original 5 star rating says nothing about vocabularies, but that Linked Data without vocabularies is not useful at all:

“Just converting a CSV file to a set of RDF triples and linking them to another set of triples does not necessarily make the data more (re)usable to humans or machines.”

Needless to say, we share this viewpoint!

I’m not going to steal their thunder and list here all five star categories–you really should read the article (it’s short), but only note that the lowest level is a zero star rating that covers LD with no vocabularies. The five star rating is reserved for vocabularies that are linked to other vocabularies, which is pretty cool, and not easy to accomplish by the original publisher as a soloist.

These five star ratings are a terrific start to good practices documentation for vocabularies used in LOD, which we’ve had in our minds for some time. Stay tuned.

What do we mean when we talk about ‘meaning’?

Diane Hillmann — Fri, 17 Jul 2015 02:35:26 +0000

Over the past weekend I participated in a Twitter conversation on the topic of meaning, data, transformation and packaging. The conversation is too long to repost here, but looking from July 11-12 for @metadata_maven should pick most of it up. Aside from my usual frustration at the message limitations in Twitter, there seemed to be a lot of confusion about what exactly we mean about ‘meaning’ and how it gets expressed in data. I had a skype conversation with @jonphipps about it, and thought I could reproduce that here, in a way that could add to the original conversation, perhaps clarifying a few things. [Probably good to read the Twitter conversation ahead of reading the rest of this.]

Jon Phipps: I think the problem that the people in that conversation are trying to address is that MARC has done triple duty as a local and global serialization (format) for storage, supporting indexing and display; a global data interchange format; and a focal point for creating agreement about the rules everyone is expected to follow to populate the data (AACR2, RDA). If you walk away from that, even if you don’t kill it, nothing else is going to be able to serve that particular set of functions. But that’s the way everyone chooses to discuss bibframe, or schema.org, or any other ‘marc replacement’.

Diane Hillmann: Yeah, but how does ‘meaning’ merely expressed on a wiki page help in any way? Isn’t the idea to have meaning expressed with the data itself?

Jon Phipps: It depends on whether you see RDF as a meaning transport mechanism or a data transport mechanism. That’s the difference between semantic data and linked data.

Diane Hillmann: It’s both, don’t you think?

Jon Phipps: Semantic data is the smart subset of linked data.

Diane Hillmann: Nice tagline

Jon Phipps: Zepheira, and now DC, seem to be increasingly looking at RDF as merely linked data. I should say a transport mechanism for ‘linked’ data.

Diane Hillmann: It’s easier that way.

Jon Phipps: Exactly. Basically what they’re saying is that meaning is up to the receiver’s system to determine. Dc:title of ‘Mr.’ is fine in that world–it even validates according to the ‘new’ AP thinking. It’s all easier for the data producers if they don’t have to care about vocabularies. But the value of RDF is that it’s brilliantly designed to transport knowledge, not just data. RDF data is intended to live in a world where any Thing can be described by any Thing, and all of those descriptions can be aggregated over time to form a more complete description of the Thing Being Described. Knowledge transfer really benefits from Semantic Web concepts like inferences and entailments and even truthiness (in addition to just validation). If you discount and even reject those concepts in a linked data world than you might as well ship your data around as CSV or even SQL files and be done with it.

One of the things about MARC is that it’s incredibly semantically rich (marc21rdf.info) and has also been brilliantly designed by a lot of people over a lot of years to convey an equally rich body of bibliographic knowledge. But throwing away even a small portion of that knowledge in pursuit of a far dumber linked data holy grail is a lot like saying that since most people only use a relatively limited number of words (especially when they’re texting) we have no need for a 50,000 word, or even a 5,000 word, dictionary.

MARC makes knowledge transfer look relatively easy because the knowledge is embedded in a vocabulary every cataloger learns and speaks fairly fluently. It looks like it’s just a (truly limiting) data format so it’s easy to think that replacing it is just a matter of coming up with a fresh new format, like RDF. But it’s going to be a lot harder than that, which is tacitly acknowledged by the many-faceted effort to permanently dumb-down bibliographic metadata, and it’s one of the reasons why I think bibframe.org, bibfra.me, and schema.org might end up being very destructive, given the way they’re being promoted (be sure to Park Your MARC somewhere).

[That’s why we’re so focused on the RDA data model (which can actually be semantically richer than MARC), why we helped create marc21rdf.info, and why we’re working at building out our RDF vocabulary management services.]

Diane Hillmann: This would be a great conversation to record for a podcast

Jon Phipps: I’m not saying proper vocabulary management is easy. Look at us for instance, we haven’t bothered to publish the OMR vocabs and only one person has noticed (so far). But they’re in active use in every OMR-generated vocab.

The point I was making was that we we’re no better, as publishers of theoretically semantic metadata, at making sure the data was ‘meaningful’ by making sure that the vocabs resolved, had definitions, etc.

[P.S. We’re now working on publishing our registry vocabularies.]

Fresh From ALA, What’s New?

Diane Hillmann — Fri, 10 Jul 2015 15:10:17 +0000

In the old days, when I was on MARBI as liaison for AALL, I used to write a fairly detailed report, and after that wrote it up for my Cornell colleagues. The gist of those reports was to describe what happened, and if there might be implications to consider from the decisions. I don’t propose to do that here, but it does feel as if I’m acting in a familiar ‘reporting’ mode.

In an early Saturday presentation sponsored by the Linked Library Data IG, we heard about BibFrame and VIVO. I was very interested to see how VIVO has grown (having seen it as an infant), but was puzzled by the suggestion that it or FOAF could substitute for the functionality embedded in authority records. For one thing, auth records are about disambiguating names, and not describing people–much as some believe that’s where authority control should be going. Even when we stop using text strings as identifiers, we’ll still need that function and should be thinking carefully whether adding other functions makes good sense.

Later on Saturday, at the Cataloging Norms IG meeting, Nancy Fallgren spoke on the NLM collaboration with Zepheira, GW, (and others) on BibFrame Lite. They’re now testing the Kuali OLE cataloging module for use with BF Lite, which will include a triple store. An important quote from Nancy: “Legacy data should not drive development.” So true, but neither should we be starting over, or discarding data, just to simplify data creation, thus losing the ability to respond to the more complex needs in cataloging, which aren’t going away, (a point demonstrated usefully in the recent Jane-athons).

I was the last speaker on that program, and spoke on the topic of “What Can We Do About Our Legacy Data?” I was primarily asking questions and discussing options, not providing answers. The one thing I am adamant about is that nobody should be throwing away their MARC records. I even came up with a simple rule: “Park the MARC”. After all, storage is cheap, and nobody really knows how the current situation will settle out. Data is easy to dumb down, but not so easy to smarten up, and there may be do-overs in store for some down the road, after the experimentation is done and the tradeoffs clearer.

I also attended the BibFrame Update, and noted that there’s still no open discussion about the ‘classic’ (as in ‘Classic Coke’) BibFrame version used by LC, and the ‘new’ (as in ‘New Coke’) BibFrame Lite version being developed by Zepheira, which is apparently the vocabulary they’re using in their projects and training. It seems like it could be a useful discussion, but somebody’s got to start it. It’s not gonna be me.

The most interesting part of that update from my point of view was hearing Sally McCallum talk about the testing of BibFrame by LC’s catalogers. The tool they’re planning on using (in development, I believe) will use RDA labels and include rule numbers from the RDA Toolkit. Now, there’s a test I really want to hear about at Midwinter! But of course all of that RDA ‘testing’ they insisted on several years ago to determine if the RDA rules could be applied to MARC21 doesn’t (can’t) apply to BibFrame Classic so … Will there be a new round of much publicized and eagerly anticipated shared institutional testing of this new tool and its assumptions? Just askin’.

What’s up with this Jane-athon stuff?

Diane Hillmann — Mon, 18 May 2015 15:13:19 +0000

The RDA Development Team started talking about developing training for the ‘new’ RDA, with a focus on the vocabularies, in the fall of 2014. We had some notion of what we didn’t want to do: we didn’t want yet another ‘sage on the stage’ event, we wanted to re-purpose the ‘hackathon’ model from a software focus to data creation (including a major hands-on aspect), and we wanted to demonstrate what RDA looked like (and could do) in a native RDA environment, without reference to MARC.

This was a tall order. Using RIMMF for the data creation was a no-brainer: the developers had been using the RDA Registry to feed new vocabulary elements into their their software (effectively becoming the RDA Registry’s first client), and were fully committed to FRBR. Deborah Fritz had been training librarians and other on RIMMF for years, gathering feedback and building enthusiasm. It was Deborah who came up with the Jane-athon idea, and the RDA Development group took it and ran with it. Using the Jane Austen theme was a brilliant part of Deborah’s idea. Everybody knows about JA, and the number of spin offs, rip-offs and re-tellings of the novels (in many media formats) made her work a natural for examining why RDA and FRBR make sense.

One goal stated everywhere in the marketing materials for our first Jane outing was that we wanted people to have fun. All of us have been part of the audience and on the dais for many information sessions, for RDA and other issues, and neither position has ever been much fun, useful as the sessions might have been. The same goes for webinars, which, as they’ve developed in library-land tend to be dry, boring, and completely bereft of human interaction. And there was a lot of fun at that first Jane-athon–I venture to say that 90% of the folks in the room left with smiles and thanks. We got an amazing response to our evaluation survey, and the preponderance of responses were expansive, positive, and clearly designed to help the organizers to do better the next time. The various folks from ALA Publishing who stood at the back and watched the fun were absolutely amazed at the noise, the laughter, and the collaboration in evidence.

No small part of the success of Jane-athon 1 rested with the team leaders at each table, and the coaches going from table to table helping out with puzzling issues, ensuring that participants were able to create data using RIMMF that could be aggregated for examination later in the day.

From the beginning we thought of Jane 1 as the first of many. In the first flush of success as participants signed up and enthusiasm built, we talked publicly about making it possible to do local Jane-athons, but we realized that our small group would have difficulty doing smaller events with less expertise on site to the same standard we set at Jane-athon 1. We had to do a better job in thinking through the local expansion and how to ensure that local participants get the same (or similar) value from the experience before responding to requests.

As a step in that direction CILIP in the UK is planning an Ag-athon on May 22, 2015 which will add much to the collective experience as well as to the data store that began with the first Jane-athon and will be an increasingly important factor as we work through the issues of sharing data.

The collection and storage of the Jane-athon data was envisioned prior to the first event, and the R-Balls site was designed as a place to store and share RIMMF-based information. Though a valuable step towards shareable RDA data, rballs have their limits. The data itself can be curated by human experts or available with warts, depending on the needs of the user of the data. For the longer term, RIMMF can output RDF statements based on the rball info, and a triple store is in development for experimentation and exploration. There are plans to improve the visualization of this data and demonstrate its use at Jane-athon 2 in San Francisco, which will include more about RDA and linked data, as well as what the created data can be used for, in particular, for new and improved services.

So, what are the implications of the first Jane-athon’s success for libraries interested in linked data? One of the biggest misunderstandings floating around libraryland in linked data conversations is that it’s necessary to make one and only one choice of format, and eschew all others (kind of like saying that everyone has to speak English to participate in LOD). This is not just incorrect, it’s also dangerous. In the MARC era, there was truly no choice for libraries–to participate in record sharing they had to use MARC. But the technology has changed, and rapidly evolving semantic mapping strategies [see: dcpapers.dublincore.org/pubs/article/view/3622] will enable libraries to use the most appropriate schemas and tools for creating data to be used in their local context, and others for distributing that data to partners, collaborators, or the larger world.

Another widely circulated meme is that RDA/FRBR is ‘too complicated’ for what libraries need; we’re encouraged to ‘simplify, simplify’ and assured that we’ll still be able to do what we need. Hmm, well, simplification is an attractive idea, until one remembers that the environment we work in, with evolving carriers, versions, and creative ideas for marketing materials to libraries is getting more complex than ever. Without the specificity to describe what we have (or have access to), we push the problem out to our users to figure out on their own. Libraries have always tried to be smarter than that, and that requires “smart” , not “dumb”, metadata.

Of course the corollary to the ‘too complicated’ argument lies the notion that a) we’re not smart enough to figure out how to do RDA and FRBR right, and b) complex means more expensive. I refuse to give space to a), but b) is an important consideration. I urge you to take a look at the Jane-athon data and consider the fact that Jane Austen wrote very few novels, but they’ve been re-published with various editions, versions and commentaries for almost two centuries. Once you add the ‘based on’, ‘inspired by’ and the enormous trail created by those trying to use Jane’s popularity to sell stuff (“Sense and Sensibility and Sea Monsters” is a favorite of mine), you can see the problem. Think of a pyramid with a very expansive base, and a very sharp point, and consider that the works that everything at the bottom wants to link to don’t require repeating the description of each novel every time in RDA. And we’re not adding notes to descriptions that are based on the outdated notion that the only use for information about the relationship between “Sense and Sensibility and Sea Monsters” and Jane’s “Sense and Sensibility” is a human being who looks far enough into the description to read the note.

One of the big revelations for most Jane-athon participants was to see how well RIMMF translated legacy MARC records into RDA, with links between the WEM levels and others to the named agents in the record. It’s very slick, and most importantly, not lossy. Consider that RIMMF also outputs in both MARC and RDF–and you see something of a missing link (if not the Golden Gate Bridge :-).

Not to say there aren’t issues to be considered with RDA as with other options. There are certainly those, and they’ll be discussed at the Jane-In in San Francisco as well as at the RDA Forum on the following day, which will focus on current RDA upgrades and the future of RDA and cataloging. (More detailed information on the Forum will be available shortly).

Don’t miss the fun, take a look at the details and then go ahead and register. And catalogers, try your best to entice your developers to come too. We’ll set up a table for them, and you’ll improve the conversation level at home considerably!