Saturday, January 25, 2014, 3:00-4:00 p.m., A Consideration of Holdings in the World Beyond MARC [PCC 203B]
Sunday, January 26, 2014, 8:30-10:00 a.m., The Other Side of Linked Data: Managing Metadata Aggregation [PCC 102A]
Sunday, January 26, 2014, 10:30-11:30 a.m., Mapmakers [PCC 102A]

Most ALA watchers have noticed a shift from ‘invited talks’ at Interest Group and Committee meetings to requests for proposals from the chairs, from which pool the speakers are chosen. This is, of course, in parallel with changes going on with other professional conferences, and it’s an interesting shift for a number of reasons.

There’s a democratization aspect to this change–the chairs are no longer limited in their choice to people they already know about, thereby potentially increasing the possibility that new and different ideas will get an airing. Maybe this Midwinter someone will come up with an absolutely wonderful and unexpected presentation that rockets the speaker from the unknown mob to the smaller roster of interesting known speakers. This is a good thing, I believe, even though the chances of witnessing such a rocket launch are dauntingly small.

For someone who has been around long enough (and noisily, it must be said), this shift means that I don’t need to wait for invitations to do presentations based on some chair’s idea of what might interest their group (but may no longer interest me); I can go ahead and respond to the calls that appeal to me. I’d like to think that the result is something fresh enough to be interesting for me to prepare and for an audience to listen to, without being totally divorced from prior talks that represent earlier phases. An odd result of this shift in process is that speakers who submit proposals to various committees don’t generally know who else will be speaking at a particular program until after their proposal has been approved, and maybe not even then. This particular aspect has already led to some very interesting lineups at meetings across the conference.

Because I take seriously the idea of not re-using previous talks to the point of becoming horribly boring, I tend to apply for things that allow me to explore something related to what I’ve done before, but that require me to rethink it or try a different approach to exposing what I (and the people I work with) are thinking about. I think that’s pretty much what most audiences are looking for, right?

So below are my talks for ALA Midwinter. I may be accompanied by one or another of my colleagues on a couple of these, and will surely have their help building the presentations.

A Consideration of Library Holdings in the World Beyond MARC

Of all the MARC 21 formats, Holdings was the one most clearly designed for machine manipulation. It is granular, flexible, and intended to be used at either a detailed or summary level. It has sometimes frightened potential users because it looks complex (even where it isn’t), and in its ‘native’ form is not particularly human friendly. Some of the complexity arises because there are both display and prediction aspects in the encoding, and not all library systems have developed predictive serial check-in systems supported by MARC Holdings.

Some of the bibliographic metadata efforts now going forward ignore the existing MARC Holdings, sometimes in favor of simpler solutions based on the perception of the waning need for predictive check-in for digital subscriptions. Not much effort has been expended to bring the MARC Holdings format forward into the discussions about changing requirements and re-use of existing standards.

For the ALCTS CRS Committee on Holdings Information, Saturday, January 25, 2014, 3:00-4:00 p.m., PCC 203B.

Holdings has been an interest of mine since I was a law librarian representing the American Association of Law Libraries on MARBI. In the early computer era in libraries, where digital publication was the exception, law publishers demonstrated a great deal of creativity in their publication of updating services, from loose-leaf services to the regular republication of standard tools, and law catalogers always had the best examples of holdings problems. These days, most of those materials have been subsumed by various digital tools, which have their own complexities, particularly in the context of versions, republication and compilation.

But the question remains–does what we learned from the pre-digital world of holdings functionality have relevance in the digital era?

The Other Side of Linked Data: Managing Metadata Aggregation

Most of the current activity in the library LOD world has been on publishing library data out of current silos. But part of the point of linked data for libraries is that it opens up data built by others for use within libraries, and has the potential for greater integration of library data within the larger data world. The sticking point for most librarians is that data building and distribution outside the familiar world of MARC seems like a black box, the key held by others.

Traditionally, libraries have relied on specialized system vendors to build the functionality they needed to manage their data. But the discussions I’ve heard too often result in librarians wanting vendors to tell them what they’re planning, and vendors asking librarians what they need and want. In the context of this stalemate, it behooves both library system vendors and librarians to explore the issues around management of more fine-grained metadata so that an informed dialogue around requirements can begin.

For the ALCTS Metadata Interest Group, Sunday, January 26, 2014, 8:30-10:00 a.m., PCC 102A

Transitioning from a rigidly record-based system to a more flexible environment where statement-level information can be aggregated and managed is difficult to envision from the vantage point of our current MARC-based world. This has led to a gap between what we know and the wider world of linked open data we’d like to participate in. One of the critical steps is to understand how such a world might look, and what it requires of us and our systems. The goal is to be able to move some of that improved understanding to the point of innovation and development.

Mapmakers

It’s very clear that there will be no single answer to moving bibliographic metadata into the world beyond MARC, no direct ‘replacement’ for the simple walled garden we all have lived in for 40+ years. While it’s certainly true that the emerging global universe of bibliographic description has continued to expand and seems more chaotic than ever, there are still commonalities of understanding with the world beyond our garden walls that we’re only beginning to identify. How then can we begin to expose our understanding to that universe and develop some consensus paths forward? Specifically, what are the possibilities for using semantic mapping to provide us with the flexibility and extensibility we need to build our common future?

For the ALCTS CaMMS Cataloging & Classification Research Interest Group, Sunday, Jan. 26, 10:30-11:30, PCC 102A.

Librarians too often see ‘mapping’ and think ‘crosswalking’, but the reality is that these are quite different strategies. Crosswalking was a natural fit for the MARC environment, where the ‘one, best’ crosswalk would logically be developed centrally and implemented as part of current application needs. But the limitations of crosswalking make much less sense as we transition into a world where the Semantic Web has begun to take hold (of our heads, if not our systems!).

In the Semantic Web world, maps can contain a variety of relationships (not just the crosswalk ‘same as’), and central development and control is neither necessary nor very useful. That doesn’t mean we’re each on our own, though; collaboration is still our best strategy.

By Diane Hillmann, December 9, 2013, 3:26 pm (UTC-5)

[Continuing from a post earlier today]

The second, and not unrelated, announcement had to do with the end of printed versions of the Red Books, which have traditionally represented LCSH in its most official form. In the LC report to CC:DA, the cessation of print publication was announced:

In 2012, LC conducted an extensive study on the impact and opportunities of changes in the bibliographic framework and the technological environment on the future distribution of its cataloging data and products. LC’s transition from print to online-only for cataloging documentation is a response to a steadily declining customer base for print and the availability of alternatives made possible by advances in technology. This shift will enable the Library to achieve a more sustainable financial model and better serve its mission in the years ahead.

Certainly there’s not much to argue with here–consumers have spoken, and LC, like every institution and service provider, needs to pay attention. But more troubling is what the online-only policy really means. The announcement includes some information on a planned PDF version of LCSH, and points to that PDF, plus the Cataloger’s Desktop and Classification Web products (both behind paywalls) as the remaining complete and up-to-date options.

Notable for its absence in that announcement is any comment on LCSH on id.loc.gov. Many of us are well aware of the gaps that make this version less than complete and up-to-date, and indeed the introduction to the service points out that:

LCSH in this service includes all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings* for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms. This content is expanded beyond the print issue of LCSH (the “red books”) with inclusion of validation strings.

*Validation strings: Some authority records are for headings that have been built by adding subdivisions. These records are the result of an ongoing project to programmatically create authority records for valid subject strings from subject heading strings found in bibliographic records. The authority records for these subject strings were created so the entire string could be machine-validated. The strings do not have broader, narrower, or related terms.

It’s not clear to me that the caveats in this introduction are either widely read or completely understood, but I think if you surveyed random catalogers about how useful the service is and what it does and doesn’t include, you’d get a wide variety of responses. And of course, the updating strategy for the subject headings operates under the same ‘versioning’ pattern as the relators: files are reloaded periodically, and there isn’t much in the way of versioning that could support notifications to users or the updating of linked data that uses LCSH outside of ILSs or traditional central services like OCLC.

What LC has done could serve as a case study in how not to handle versioning of semantics in a public vocabulary. If we accept the premise that vocabulary semantics will change, there are only a few ways to build stable systems on linked data. One option (preferred) is to use vocabularies from services that provide stable URIs for past, present, and future versions of the vocabulary; the other (not preferred) is to create a local, stable shadow vocabulary and map that local vocabulary to the public vocabulary over which you have little or no control. Mapping vocabularies in this way gives you the opportunity to maintain the semantic stability of your own system, your own ‘knowledge base’, while still maintaining semantic integration with the global pool of linked data. Clearly, this is an expensive proposition. And it’s not as if these issues of reuse vs. extension are not currently under heavy discussion in a number of contexts: on the public schema.org discussion lists, for instance.
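To make the shadow vocabulary option a bit more concrete, here is a minimal Turtle sketch of the pattern. The local namespace, the version identifiers, and the LCSH heading URI are all placeholders I’ve invented for illustration; the point is only the shape of the approach, not any existing implementation.

  @prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
  @prefix dct:    <http://purl.org/dc/terms/> .
  # Hypothetical shadow vocabulary under local control
  @prefix shadow: <http://example.org/vocab/subjects/> .

  # A locally minted, dated version of a term that local systems rely on
  shadow:cooking-2013-06
      a skos:Concept ;
      skos:prefLabel "Cooking"@en ;
      dct:issued "2013-06-01" ;
      dct:isVersionOf shadow:cooking ;
      # Mapping to the public vocabulary we do not control
      # (placeholder identifier, not a real LCSH URI)
      skos:exactMatch <http://id.loc.gov/authorities/subjects/sh0000000> .

  # When the public definition changes, a new local version is minted and
  # remapped; shadow:cooking-2013-06 stays stable for legacy data.

The expense mentioned above is precisely the work of minting and maintaining those local versions and their mappings.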

There are a number of related issues here that would also benefit from broader discussion. Large public vocabularies have tended to make an incomplete transition from print to online, getting stuck, like LC, attempting to use the file management processes of the print era to manage change behind a ‘service’ front end that isn’t really designed to do the job it’s being asked to do. What needs to be examined, soon and in public, is the relationship between these files and the legacy data which hangs over our heads like a boulder of Damocles. Clearly, we’re not just in need of access to files (whether one at a time or in batches) but require more of the kinds of services that support libraries in managing and improving their data. These needs are especially critical to those organizations engaged in the important work of integrating legacy and project data, and trying to figure out a workflow that allows them to make full use of the legacy public vocabularies.

Ignoring or denying these issues as important changes are made to the vocabularies that LC manages, on behalf of the cultural heritage communities across the globe, does a disservice to everyone. No one expects LC to come up with all the answers, just as they could not be expected (in the past, or now) to build the vocabularies themselves without the help of the community. NACO, SACO and PCC were, and are, models of collaboration. Why not build on that strength and push more of the discussion about needs and solutions into that same eager, and very competent, community?

By Diane Hillmann, July 23, 2013, 3:20 pm (UTC-5)

The Library of Congress recently made a couple of announcements, which I’ve been thinking about in the context of the provision of linked data services.

In May, LC announced that a ‘reconciliation’ had been done on the LC relators, in part to bring them into conformance with RDA role terms. This is not at all a bad thing, but the manner in which the revisions were accomplished and presented on id.loc.gov points up some serious issues with the strategy LC is currently using to manage these vocabularies.

As part of this ‘reconciliation’ LC made a variety of changes to the old list. Some definitions were changed, but in most cases the code and derived URI remained the same, creating a situation where the semantics become unreliable. It’s not easy to determine which ones have changed, because the old file was overwritten, the previous version can’t be accessed through the service, and as far as I can tell, there’s no definitive list of changes available. The only clues in the new file are ‘Change Notes’–textual notes with the dates of changes–though what changed is not specified. An example can be found under the term ‘Binder’ (code bnd), where the change note has two dates:

2013-05-15: modified
1970-01-01: new
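For what it’s worth, here is a simplified, SKOS-style sketch of roughly what that looks like as data (the actual id.loc.gov files are published in MADS/RDF as well as SKOS and differ in detail):

  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
  @prefix rel:  <http://id.loc.gov/vocabulary/relators/> .

  # Simplified illustration only; not the exact id.loc.gov serialization
  rel:bnd
      a skos:Concept ;
      skos:prefLabel "Binder"@en ;
      skos:changeNote "2013-05-15: modified" ,
                      "1970-01-01: new" .

  # Nothing here records what was modified, and the URI now identifies
  # only the revised definition.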

In another example, for the term ‘Film editor’, the definition now starts: “A person who, following the script and in creative cooperation with the Director, selects, arranges, and assembles the filmed material, …” whereas the old usage note referred to “… a person or organization who is an editor of a motion picture film …”. This is a clear and significant change of definition because the reference to the organization entity has been dropped. Curiously, the definition for the term ‘Scenarist’ continues to refer to “A person or organization who is the author of a motion picture screenplay …”, although the definition was changed at the same time. Perhaps the difference occurs because the change note for ‘Film editor’ refers to “FIAF”, which is probably the International Federation of Film Archives (the announcement refers to FIAT, a probable typo).

This M.O. may be perfectly satisfactory to support most human uses of the vocabulary, but it is clearly not all that useful for machines operating in a linked data environment. I was alerted to some of these issues by a colleague building a map based on the prior version, which now needs to be completely revised (and without a list of changes, this becomes a very laborious process). It’s also my understanding that the JSC recently updated some of the relationship definitions for the most recent release of the RDA Toolkit, and those are now out of sync with the ‘reconciled’ relator terms.

A number of questions arise as a result of this, perhaps chief among them the basic one of whether it makes sense to reconcile these vocabularies at all. Because this work was not discussed publicly before the reconciled vocabulary was unveiled (I might be wrong about this, but I’m sure someone will correct me if I missed something), the potential effect on legacy data is unknown, as are any other options for dealing with the issues created by lack of established process or opportunity for public comment. If you accept the premise that we will continue to live in an environment of multiple vocabularies for multiple uses, there are other strategies–mapping and extension, for instance–that might have a better chance to improve usefulness while avoiding the kinds of reliability and synchronization problems these changes bring to the fore.

In addition to the process issues, a strong case could be made that the current services presented under the id.loc.gov umbrella might benefit from some discussion about how the data is intended to be used and managed. Not everyone is tied to traditional ILSs now, and perhaps fewer will be in the future, if current interest in linked data continues. Are all users of these vocabularies going to be expected to flush their caches of data every time a new ‘version’ of the underlying file is loaded? How would they know of changes happening behind the scenes (unless, of course, they are careful readers of LC’s announcements)? If LC expects to provide services for linked data users, these issues must be discussed openly and use cases defined so that appropriate decisions can be made. At a minimum, these practices need to be examined in the context of linked data principles that call for careful change to definitions and URIs to minimize surprises and loss of backward compatibility.

[To be continued]

By Diane Hillmann, July 23, 2013, 2:35 pm (UTC-5)

Many of you have heard me say “Time flies, whether you’re having fun or not”–and that has certainly been the case since I got back from the NISO Roadmap meeting a few weeks ago. Somehow, with my head down, I missed part 1 of Roy Tennant’s post “The Post-MARC Era, Part 1: ‘If It’s Televised, It Can’t Be the Revolution’”. I’m old enough to remember the ’60s and the call to revolution that Gil Scott-Heron referred to, and in fact had a small part in it–but since it WAS live, I’ve no evidence to present about my participation; you’ll just have to believe me.

On the other hand, I’ve been very involved in the revolution under discussion in the remainder of his post, and there’s quite a bit of video to confirm that, including at the beginning of the NISO Roadmap meeting, where Gordon Dunsire and I tossed a few thought-bombs out before the conversation got going. I think it validates Roy’s point about participation to say that the points we made came up frequently in the subsequent small group sessions, which were not, I believe, on the video feed. What I observed as a participant was that more than a few folks left with some new information and (I hope) some expanded thinking about what the revolution was about, more than they came in with.

Despite the fact that I’ve acquired an undeserved reputation for being a MARC hater, I actually think that we should continue to use the semantics of MARC, and get rid of the ancient encoding standard. It’s in some ways a Dr. Jekyll and Mr. Hyde problem we have here, and we’re about to kill the ‘wrong MARC’ in our exasperated search for something simpler, because we can’t seem to get clear about what MARC is and isn’t. The reality is that the MARC semantics represent the accumulated experience in library description from the days of the 3 x 5 card with the hole in the bottom (see Gordon Dunsire’s presentation on that evolution). We’ll clearly need to map the semantics of our legacy data forward, but that doesn’t require that we carry along the ‘classic’ MARC encoding. Consider the old days of the telegraph, where messages were encoded using dots and dashes. Those messages were translated into written English for end users, who didn’t need to know Morse Code to read them. Now we use telephone messaging and email for those kinds of communications, and Morse Code doesn’t figure in there anywhere.
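To make that separation of semantics from encoding concrete, here is a deliberately simple sketch: the same descriptive facts, first as schematic MARC content designation and then as RDF triples in Turtle. I’ve used generic Dublin Core properties and an invented resource URI purely for illustration (not the dedicated MARC-element properties discussed elsewhere); the point is that the semantics survive the trip while the old encoding does not.

  # Schematic MARC content designation:
  #   245 10 $a An example title / $c Example Author.
  #   260    $a Example City : $b Example Press, $c 2012.
  #
  # The same facts as RDF triples, with the encoding left behind:
  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix ex:  <http://example.org/resource/> .   # invented namespace

  ex:record123
      dct:title     "An example title" ;
      dct:creator   "Example Author" ;
      dct:publisher "Example Press" ;
      dct:issued    "2012" .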

In addition, we need to look past all those rarely used MARC fields, and recognize that they are only irrelevant in an environment that looks very much like our current one, with artisanal catalog records and top-down standards development. That’s not really what we’re hoping for, as we wrap our minds around what an environment based on linked open data might free us to do differently. When systems were built to process MARC-encoded records, those systems needed to be updated at regular frequencies and all the sharing partners moved in lockstep. It was very expensive to manage the code that was the plumbing of those systems and the specialized fields didn’t add much value. But remember that each of the proposals for change was extensively discussed and formally accepted. I was there for many of those discussions, and recognize that not all of them were accepted, but a considerable number were, and then not always (or often) used after they were included in MARC. Before we label all that effort wasted, and attempt to re-litigate all those decisions, let’s take a closer look at the real costs of moving those forward, in the very different environment we’re envisioning, where the costs are differently distributed and everyone need not move in lockstep. It’s entirely possible that some new communities will find these specialized fields very relevant, even though libraries have not.

Roy quotes from the BibFrame announcement, which states:

“A major focus of the initiative will be to determine a transition path for the MARC 21 exchange format in order to reap the benefits of newer technology while preserving a robust data exchange that has supported resource sharing and cataloging cost savings in recent decades.”

It’s still unclear to me (and I’m not alone here) that we really needed a ‘transition path for the MARC 21 exchange format’. Why can’t we join the rest of the world, which is tootling along quite nicely, thank you, without a bespoke exchange format? We have several useful sets of semantics, built collaboratively over the past half century–why would we need to start over? I generally read the BibFrame discussions, but rarely participate, mostly because it all seems like a reinvention of something that doesn’t need reinventing, and I have no time for that. Whatever the BibFrame people come up with will be mappable to and from the other ongoing bibliographic standards, and whoever wants to use it for exchange can certainly do that, but it will never have the penetration in the library market that MARC has.

It’s also a bit mysterious what ‘preserving a robust data exchange’ actually means. Are we talking about maintaining the current exchange of records using OCLC as the centralized node through which everything passes? What part of that ‘preservation’ is about preserving the income streams inherent in the current distribution model? What is it about linked open data, without a central node, that isn’t robust enough?

Roy ends his post with something that I didn’t expect, but definitely applaud:

“Watching the NISO event over the last two days crystallized for me that I had fallen into the trap of thinking that the Library of Congress or NISO or OCLC (my employer) would come along and save us all. I forgot that for a revolution to occur it can’t come from the seats of the existing power structure. True change only happens when everyone is involved. Those organizations may implement and support what the changes that the revolution produces, but anything dictated from on high will not be a revolution. The revolution will not be piped into our cubicles, ready for easy consumption. The revolution will be live.”

We could start by no longer waiting for LC to deliver an RDF version of MARC 21, unencumbered by 50 year old encoding standards. We already have that, at marc21rdf.info. Yeah, it needs some work, but it’ll get done a lot faster if we can get some help from the 99% of the library world. Give us a holler if you’re interested.

Clearly the revolution is not happening on the BibFrame discussion list, it is happening elsewhere.

By Diane Hillmann, May 14, 2013, 4:34 pm (UTC-5)

I saw the announcement a few weeks ago about the demise of MARBI and the creation of the new ALCTS/LITA Metadata Standards Committee. My first reaction was ‘uh oh,’ and I flashed back to the beginnings of the DCMI Usage Board. The DCUB still exists, but in a sort of limbo, as DCMI reorganizes itself after the recent change of leadership.

I was a charter member, and, with Rebecca Guenther, wrote up the original proposal for the organization of the group. It was based to some extent on MARBI–not a surprise, since Rebecca and I were veterans of that group. But there were some ambiguities in the plan for the UB that came back to bite us over the next few years–primarily having to do with essential questions about what the group was supposed to be doing, and how to accomplish its goals. These difficulties had little to do with the organizational aspects–how many members, questions of voting (which changed over time), or issues of documentation and dissemination, all of which were settled fairly easily when the group was set up (and can be found here.)

It struck me, as I was reading the announcement, that it might be useful for me to revisit some of the issues that came up with the DCMI Usage Board while I was a member, and think about whether they are relevant to the new ALCTS/LITA Metadata Standards Committee. I hope this perspective may be useful for ALCTS and LITA as they get this committee going, because, frankly, I see dragons all over the place. [I should emphasize here that these are personal opinions, and don’t represent any position of the DCMI Executive group, of which I am a member.]

So, here’s a quote from the announcement describing the Committee’s responsibilities:

“The ALCTS/LITA Metadata Standards Committee will play a leadership role in the creation and development of metadata standards for bibliographic information. The Committee will review and evaluate proposed standards; recommend approval of standards in conformity with ALA policy; establish a mechanism for the continuing review of standards (including the monitoring of further development); provide commentary on the content of various implementations of standards to concerned agencies; and maintain liaison with concerned units within ALA and relevant outside agencies.”

I see a lot of big and important words in this paragraph, and would like to see some of them defined more carefully for this new context. For instance, what does ‘a leadership role in the creation and development of metadata standards’ really mean? The prospective committee members are folks who have day jobs and are likely to meet in person twice a year (perhaps in multiple meetings) at each ALA meeting, but they have been given an enormous brief, or so it seems.

First of all, what is a ‘standard’? Are ‘standards’ in this context only those which have been vetted by a standards body like NISO or ISO? Some ‘standards’ that are in relatively broad use in the bibliographic environment are in fact developed within the walls of just one institution (e.g., LC’s MODS, MADS, etc.) and though they may eventually acquire some mechanism for user participation, their definition as standards is largely self-declared by their managing institution. For that matter, how about metadata element sets developed by international bodies, like IFLA, or W3C, or Dublin Core? ALA is a voting member of NISO, which suggests to me that a clear definition of what a standard IS will be an essential step, even before an examination of the notion of what a ‘leadership role’ might be.

Then there’s the notion that standards (however defined) will be proposed to this committee for review and evaluation. Proposed by whom? Reviewed by what criteria, and evaluated by what mechanism?

For the DCUB, the brief of the group changed over time, as DCMI grew and shifted focus. At first, the UB’s brief was the review of proposals for new metadata terms. That turned out to be far more difficult than it seemed on the surface, because in order to evaluate those proposals, there first needed to be criteria for evaluation. Eventually it became clear that there were an infinite number of elements desired by an ever increasing number of communities, and whether any or all of these should be part of what was supposed to be a general set of properties became an issue. Finally, after much discussion, it was determined that the Dublin Core was not going to be the arbiter of all terms desired by all people, and the UB stopped reviewing proposals for new terms.

Another historical tidbit illustrates a possible pitfall. At one point (I’m afraid I can’t remember the timing on this), the UB was approached by a public broadcasting group that was developing a metadata schema based on Dublin Core, and they wanted us to review what they’d done and give them some feedback. So, the UB looked over what they’d done, and provided them with feedback–mostly about how they’d structured their schema, rather than the specific terms they used.

Some time later, it was pointed out to me that the Wikipedia entry on PBCore said that the UB had ‘reviewed’ their schema, in a manner implying that we’d given some stamp of approval, which we had certainly not done. Wikipedia being what it is, I went in and clarified the statement. You can probably see what I added by checking out the Wikipedia entry, and you might want to look at some of the PBCore vocabularies in the Open Metadata Registry Sandbox (this is a good example, but you’ll note that they didn’t get beyond “A”).

The RDA effort is a classic case of how much more difficult it is to develop standards than it seems at the start–and also how important process and timeliness are to the eventual determination of who will actually use the standard. The RDA development effort was started long enough ago that during the long process of development — originally begun as a classic closed-door-experts-only effort — the whole world changed.

In 2007, as part of that process, I got involved in the effort to build the vocabularies necessary for RDA to be used in a Semantic Web environment, in parallel with the continuing development of the guidance instructions and under the aegis of the DCMI/RDA Task Group (now the DCMI Bibliographic Metadata Task Group). The completion of that work (since 2009 in the hands of the JSC for review and publication) has stalled, as the JSC spends their limited time entertaining proposals for changing the guidelines that they just recently finished. Meanwhile, time continues to march ever onward, and many of those who were once waiting for the RDA vocabularies to be completed have concluded that they may never be, and have started looking elsewhere for metadata element sets.

In the meantime, LC itself began its BibFrame project roughly two years ago. That effort, as it’s been described so far, seems unlikely to consider RDA as a significant part of its ‘solution’. Various other large users and purveyors of bibliographic data have begun to use a variety of build-your-own schemas to expose their data as linked data, the (somewhat) New Big Thing. It’s illustrative to note that these don’t tend to use RDA properties.

There was a time that MARC ruled the library world, and there’s still a nostalgia in some quarters for that world of many certainties and fewer choices. That time isn’t coming back, no matter how many new committees we set up to try to control the new, chaotic world of bibliographic data. The fact is that our world is moving too fast, and in our anxiety to get things ‘right’ we continue to build and maintain cumbersome ‘standards’ using complex processes that no longer work for us. We’re still trying to insist that the ‘continuing review’, ‘evaluation’ and ‘recommendation’ processes have clear value, but a realistic look at the current environment suggests that they may no longer be of value, or even possible.

I have no inside knowledge of how all this will come out, but I’d be much happier if the new ALCTS/LITA Metadata Standards Committee either receives or builds for itself a much clearer and more achievable set of goals and tasks than the one they seem to have been given.

It’s a jungle out there.

By Diane Hillmann, October 26, 2012, 11:01 am (UTC-5)

A couple of weeks ago I made a short presentation at a linked data session at the American Association of Law Libraries (AALL). Many of the audience members were people I’ve known since I was a baby librarian (this is the group where I started my career as a presenter, and they invite me back every couple of years to talk about this kind of stuff.) One of the questions from the audience was one I hear fairly often: “Who’s paying for this?” I always assume, perhaps wrongly, that the questioner is responding to pressure from an administrator, but in fact anyone with administrative and budget responsibilities–particularly in Technical Services, which has been under budget siege for decades–does and should think about costs.

What I said to her was that we were all going to pay for it, and it seems to me that this isn’t just a platitude–given that the culture of collaboration we have developed assures us that many (though certainly not all) of the costs associated with the change we contemplate will be shared, as in the effort noted in my previous post. But the costs are difficult to assess at this stage, because we don’t know how long the transition will be nor exactly what the ‘end result’ will look like. If indeed we have three options for change—metamorphosis, evolution, and revolution—it seems we’ve not yet made the decision on what it’s going to be. If there are still some hoping for metamorphosis—where everything happens inside the pupa and the result is much more attractive (but the DNA hasn’t changed), well, it may be too late for that option.

Evolution–defined as creative destruction and adaptation leading to a stronger result, with similar (but not identical) DNA–is much more what I’d like to see; particularly if we look carefully at what we have–the MARC semantics, the SemWeb-friendly RDA vocabularies, and the strong community culture in particular–and build on those assets, we have a fighting chance for a good result. The trick is that we don’t have millennia to accomplish this; we have a couple of years at best, if we work really quickly and keep our wits about us. The interesting thing about evolution is that when the environment changes and the affected species either adapt or disappear, it’s never entirely clear what that adaptation will look like prior to the point of no return.

As for revolution–perhaps that’s the possible result where Google and its partners take over the things we used to do in libraries when we brought users and resources together. They’re doing metadata now (not as well as we do, I’m thinking) but if we keep trying to make our ‘catalogs’ work better instead of getting ourselves out there on the Web, I don’t think the result will be pretty.

By Diane Hillmann, August 7, 2012, 3:39 pm (UTC-5)

A few years ago, I wrote an article for a collection of writings in honor of Tom Turner, a talented metadata librarian at Cornell who sadly died too young. That article, “Looking back—looking forward: reflections of a transitional librarian” (Metadata and Digital Collections: Festschrift in honor of Tom Turner), although it meanders around a bit (she said after reading it over for the first time in a couple of years) is a pretty good view of where my head has been these past couple of decades. A lot has happened, and I’ve been lucky enough to have been in the thick of a lot of it.

One question that occasionally comes up is about what was in that kool-aid I drank that caused me to jump ship from many of the traditional library ways of thinking to something quite different. There’s not a simple answer to that. It was a combination of things, certainly, but perhaps best expressed towards the end of the article:

“In October of 2004, I attended a panel presentation where three experts were asked to inform library practitioners by providing “evaluation” information about a large Dublin Core-based metadata repository. It was, in a small way, yet another version of the blind men and the elephant. The first presenter provided large numbers of tables giving simple numeric tallies of elements used in the repository, with no more analysis than a relational database might reveal in ten minutes. The second provided results of a research project where users were carefully observed and questioned about what information they used when making decisions about what they were given in search results—i.e. useful data, and a good start on determining the usefulness of metadata, but with no attention paid to the metadata that was used behind the scenes, well before any user display was generated. The third presenter, a young computer scientist, relied almost entirely on tools developed for textual indexing, and, concluding that the diversity of the metadata was a problem, suggested that the leaders of the project should insist that all data providers follow stricter standards.

These presentations seemed sadly reflective of most attempts to approach the problems of creating and sharing metadata in the world beyond MARC. Traditional libraries built a strong culture of metadata sharing and an enormous shared investment in training and documentation around the MARC standard. The MARC development process codified the body of knowledge and practice that supported this culture of sharing and collaboration, building, in the process, a community of metadata experts who took their expertise into a number of specialized domains. We clearly are now at a critical juncture. Moving forward in both realms, traditional and “new” metadata requires that we understand clearly where we have been and what has been the basis for our past success. To do that we need much better research and evaluation of our legacy and current models, a clearer articulation of short term and long term goals, and a strategy for attaining those goals that is openly endorsed and supported by stakeholders in the library community.”

Looking back, I’m not sure I can exactly pinpoint the moment when we stopped thinking clearly about the road ahead, but until very recently, it sure looked like we had. It was probably somewhere around the time when it seemed like RDA would never be finished in our lifetimes, and that even if finished, would be too much like AACR2 to even consider implementing. Somewhere around that time, CC:DA drafted a ‘no-confidence’ memo to the JSC, and I showed Don Chatham (ALA Publishing) a late draft of it while he was at the DC-2006 conference in Mexico. At that meeting, a few of us suggested to Don that it might be a good idea to have the JSC get together with DCMI and see what could be done, which ended up being the ‘London Meeting’ of five years ago that changed the conversation significantly. This spring a five year anniversary celebration was held in London, where a big part of the discussion was about just how big that impact was, in terms of where the library community is today. As usual, nothing like a good crisis to get things moving.

It will be no surprise to readers of this blog, but I spend a fair bit of time thinking and talking about what comes next, and some of the focus of that question, particularly lately, has to do with MARC. A big part of the problem of answering that “what’s next?” question is that we tend to use ‘MARC’ not just to refer to the specification but as a stand-in for traditional practices and library standards as a whole, and this muddies the conversation considerably. What too often happens is that the ‘good stuff’ of MARC, the semantics that represent years of debate about use cases and descriptive issues, doesn’t get separately, or properly, discussed. Those semantics ought really to be seen as a distillation of all those things we know about bibliographic description, tested by time and generations of catalogers, and very much worth retaining. How we express those semantics needs to be updated, for sure (many are already expressed in the RDA vocabularies), but differentiating between baby and bathwater is clearly a necessary part of moving ahead.

One of the things about the library community that few outsiders understand is the extent to which “the library community” has developed a culture of collaboration–to the extent I’ve never seen anywhere else. Librarians collectively build and contribute bibliographic and name authority records, subject and classification proposals, participate (passionately) in endless debates about interpretation of rules and policies, and most remarkably–write this stuff up and share it extensively, asking for comments (and most often getting them). Participating in other communities after having been part of the one built by librarians is very often a frustrating thing–my expectations of participation are clearly too high.

A great example of how this works is NACO (the Name Authority Cooperative Program), for those who participate in building and maintaining good authority records. In my days as Authorities Librarian at Cornell, I helped increase Cornell’s participation in the program by a significant amount, an accomplishment of which I’m still quite proud. Recently I noted a new post to the RDA-L list about some PCC (Program for Cooperative Cataloging) committee work around authority records:

“The PCC Acceptable Headings Implementation Task Group (PCCAHITG) has successfully tested the programming code for Phase 1 in preparation for the LC/PCC Phased Implementation of RDA as described in the document entitled “The phased conversion of the LC/NACO Authority File to RDA” found at the Task Group’s page

The activities of this group display some of the important characteristics of successful community activity: the goals are clearly stated, and timelines spelled out; assumptions are tested and results analysed; conclusions and recommendations are written, explained and exposed for comment. This particular activity is noteworthy because it is a collaboration between the Library of Congress and a group of PCC member institutions, and is Phase 1 of a set of planned activities designed to move cooperatively built and maintained authority data into compliance with RDA.

As for the old broads? I was recently reminded of this wonderful quote from Bette Davis: “If you want a thing well done, get a couple of old broads to do it.” So true, so true …

By Diane Hillmann, August 5, 2012, 12:56 pm (UTC-5)

Metadata is produced and stored locally, published globally, consumed and aggregated locally, and finally integrated and stored locally. This is the create/publish/consume/integrate cycle.

Providing a framework for managing metadata throughout this cycle is the goal of the Dublin Core Abstract Model and the Dublin Core Application Profile (DCAM/DCAP).

The basic guidelines and requirements for this cycle are:

  • Metadata MUST be syntactically VALID and semantically COHERENT when it’s CREATED and PUBLISHED.
  • Globally PUBLISHED Metadata SHOULD be TRUE, based on the domain knowledge of the publisher.
  • PUBLISHERS of metadata MUST publish the semantics of the metadata, or reference publicly available semantics.
  • CONSUMERS of published metadata SHOULD assume that the global metadata is locally INVALID, INCOHERENT, and UNTRUE.
  • CONSUMED metadata MUST be checked for syntactic validity and semantic coherence before integration.
  • AGGREGATED metadata SHOULD indicate its PROVENANCE.
  • CONSUMED metadata MAY be considered TRUE based on its PROVENANCE.
  • Locally INTEGRATED global metadata MUST be syntactically VALID, semantically COHERENT, and SHOULD be TRUE based on local standards.
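As a rough illustration of the publish and consume ends of that cycle, here is a small Turtle sketch. The publisher namespace, resource, and subject identifier are all invented; the point is simply that the published description points at publicly available semantics (here, Dublin Core Terms) and that the aggregator records provenance for what it consumed before trusting or integrating it.

  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix ex:  <http://example.org/> .   # invented publisher/aggregator namespace

  # Published description: the property URIs resolve to publicly documented semantics.
  ex:resource42
      dct:title   "An example resource" ;
      dct:subject <http://id.loc.gov/authorities/subjects/sh0000000> ;   # placeholder identifier
      dct:issued  "2012-07-19" .

  # Aggregator-side record of where the consumed data came from, so that
  # local validation and trust decisions can be based on provenance.
  ex:batch07
      dct:source   <http://example.org/export/2012-07-19.rdf> ;
      dct:creator  "Example Publisher" ;
      dct:modified "2012-07-20" .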

The DCAM takes the RDF data model as its base because of the simplicity and flexibility of that model. The DCAM refines the RDF data model in order to support syntactic interoperability between the RDF data model and non-RDF data models.

A DCAP defines a distinct subset of the set of all things and describes the domain-specific knowledge of the properties of those things and the relationships those things have to other things. It expresses that knowledge through the DCAM and a set of related documentation (the Singapore Framework). A complete DCAP should provide the necessary domain-specific infrastructure to fully support the create/publish/consume/integrate cycle for any system, any domain, and any data model. A DCAP based on the DCAM should be able to be used by a machine to generate a valid data model in any modeling syntax, and any modeling syntax should be able to be expressed as a DCAP.

We strongly recommend that metadata be published and consumed using the RDF data model. The strength of the RDF data model for publishing and aggregating metadata lies in its extreme flexibility of expression, its self-describing semantics, and its assumption of validity. These strengths become weaknesses when creating and validating local metadata that must conform to a set of local restrictions. An effective DCAP will provide metadata creators with the ability to create locally valid metadata, publish it as RDF, validate consumed RDF against local standards, and ultimately integrate global metadata into the local knowledge base that contains local descriptions of the things being described.

There are many systems, many data models, many publish and subscribe models, many storage and validation models. There are many paths to integration. There are very few that provide a generic and neutral method for modeling and documenting inter-system metadata integration. The DCAM/DCAP specification has the potential to be one of those few.

By Jon, July 19, 2012, 11:54 am (UTC-5)

By-passing taggregations identifies those MARC21 variable data field tags whose level 0 RDF properties, representing individual subfields, can be used generally in data triples and mapping triples without losing information and the semantic coherency of the record. These tags have subfields which are independent of one another, with no need to keep them together if the tag is repeated in a record.

[Figure: RDF graph of Music format ontology]

This graph is another example of adding MARC fruit to the cornucopia by using the sub-property ladder. The MARC21 property used in the graph is a level 0 element. The MARC21 property’s description is equivalent to the ISBD property’s description, as hinted at in the ISBD scope note, but an unconstrained version is used to avoid the ISBD property’s domain. We could use the Web Ontology Language (OWL) ontological property owl:equivalentProperty to represent the relationship between the properties, but we can also use the rdfs:subPropertyOf property by applying it in both directions. That is, if two properties P1 and P2 are related as P1 rdfs:subPropertyOf P2 AND P2 rdfs:subPropertyOf P1, then P1 owl:equivalentProperty P2 and P2 owl:equivalentProperty P1.
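Expressed in Turtle, the pattern looks like this; the two property URIs below are placeholders standing in for the MARC21 level 0 element and the unconstrained ISBD element, not the actual published URIs:

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix ex:   <http://example.org/elements/> .   # placeholder namespace

  # Sub-property assertions in both directions ...
  ex:marcLevel0Prop rdfs:subPropertyOf ex:isbdUncProp .
  ex:isbdUncProp    rdfs:subPropertyOf ex:marcLevel0Prop .

  # ... entail the same equivalence that a single explicit assertion
  # with owl:equivalentProperty would state:
  # ex:marcLevel0Prop owl:equivalentProperty ex:isbdUncProp .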

Unfortunately, it is unsafe to use level 0 properties for dependent subfields in repeatable tags in this way, even if a specific record contains only one instance of such a tag. Triples from that record will cluster with triples from another record about the same bibliographical resource, either by sharing the same resource URI or an equivalence link. Taggregations are required to avoid semantic confusion; otherwise we wouldn’t know which “Materials specified” goes with which “Place of publication …” or “Name of publisher …” in the publication statement example given in Taggregations.
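A tiny, invented example makes the problem visible. Suppose a record repeats a publication field twice; flattening its dependent subfields into independent level 0 triples (the property URIs below are hypothetical) throws away the pairings:

  @prefix ex: <http://example.org/marc/> .       # hypothetical level 0 properties
  @prefix r:  <http://example.org/resources/> .  # hypothetical resource

  # Schematic source record:
  #   260 $a London :   $b Big Publisher
  #   260 $a New York : $b Small Press

  r:rec1 ex:placeOfPublication  "London" , "New York" ;
         ex:nameOfPublisher     "Big Publisher" , "Small Press" .

  # Nothing in these triples says which publisher belongs with which place;
  # a taggregation (a node per field instance, grouping its subfields)
  # is needed to keep the dependent subfields together.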


By Gordon Dunsire, July 18, 2012, 9:44 am (UTC-5)

If we were asked (as we sometimes are) what we’d like to see develop as a result of the BibFrame effort, the emphasis in our answer would have both technical and social aspects.

First, given the technologies developing in several different places and considering what we can do now to bring Linked Open Data into our somewhat closed world, some concrete suggestions: We have the ability to share metadata expressed not just in a single common ‘vocabulary’, but to share it using many different vocabularies, expressed and distributed using RDF, OWL/RDFS, RDFa, Microdata, and other tools; we have methods of specifying the use of these ‘semantic’ building blocks (DC Application Profiles and emerging provenance specifications from W3C and DCMI) that allow machines to use, process and distribute data in ways that do not require a central enabling node; and finally, we have technologies and strategies in place to map between existing and prospective metadata schemas that flatten the fences between communities quite thoroughly.

This is the post-MARC world of library-related metadata, where the format native to our metadata is no longer its most important characteristic; there’s no lossy crosswalking requirement to transform data to serve different needs; and there are fewer (eventually no) barriers to sharing. The tremendous value that MARC represents–the semantics built over many decades in response to an enormous corpus of use cases (all of which are fully documented on the MARBI site)–continues to be vitally important as we move into a different arena. But the mid-20th century requirements that dictated the MARC syntax, and the constricting consensus model that has been required to maintain it, no longer apply to the current and future requirements of the global library community.

The usual caveats apply here–some of these technologies aren’t entirely ready for prime time, but in the world we live in, ‘finished’ is more likely to be used for something defunct than as a goal for standards and tool development. The areas we’re personally most familiar with (no surprises here) are the complementary domains of vocabulary management and mapping. The first of those is up and running (although the ‘continuous improvement’ engine is going full bore) as the Open Metadata Registry; the second, mapping, is an important interest of ours and has generated papers, articles, and presentations (see below for a selection), not to mention the usual plethora of posts to blogs (including this one) and discussion lists. We believe that the OMR and mapping capabilities under development work together to enable legacy data to shift into the open linked data world efficiently, effectively, and with a minimum of loss, in the process enriching the useful data options available for everyone.*

Often overlooked in the glitter of technology is the possibility of pulling together the communities that were torn asunder as a result of past technical limitations (and other reasons). In our common past, the library community built their data sharing conventions on a platform based on the consensus use of AACR2 and MARC. Some library communities–law and music the most prominent–were willing to compromise in order to work within the larger library community rather than strike out on their own. Others, like the art library and museum community and the archivists, did make the break, and have developed data standards that meet their needs better than the library consensus was able to do. But those specialized standards have resulted in yet more silos, in addition to the huge MARC one.

If the intent is to ‘replace MARC’ (as it is at some level), it’s about re-placing MARC and its built-in limitations from the center of our world back into the crowd, where it becomes one among many. Another value in that shift is the ability to expand the data sharing environment MARC enabled a half century ago to a broader community of interest that includes museums, publishers, and archives. Meeting the goal of dissolving those silos and making our data easily understandable and reusable in a host of ways will help initiate that ‘Web of Data’ we’ve been anticipating for many years. As Eric Miller explained so well in his Anaheim BibFrame update: by moving towards a linked data world we actually look well beyond the Library/Archives/Museum worlds by definition–it’s a very big world out there. But by collaborating with the LAM community as a whole to get there, we reap a great many benefits, not the least of which are perspectives that are much broader in some significant ways than ours. Limiting our view merely to a ‘MARC Linked Data Model’ might be an important beginning step, but it falls short of where our vision needs to extend.

And the fact is that MARC will not be going away for a long time, if ever. There will be a lot of variation in how the transition is done by libraries, depending on institutional support, short-term and long-term needs, and existing partnerships. The process of moving MARC into the linked data world has already started. RDA and its RDF vocabularies were a start, as is the development of a complete RDF version of MARC, located at marc21rdf.info. Several years’ worth of pre-conferences, presentations and discussions, at ALA and beyond, have prepared the soil for these changes. But we need a plan, and some concrete steps to take–steps that include the groups who have been working in the trenches without a great deal of support, but making progress regardless. The BibFrame effort needs to be more than a playground for the technologists, because in most instances, the technology is not what’s holding us back–it’s the institutional inertia and the difficulties of finding ways forward that don’t pit us against one another. The plan we need balances the technical and social, the quick-win with the long-term momentum, and the need for speed with the public discussion that takes time and builds buy-in.

What we have in our sights is an opportunity to reverse the long-term trend towards balkanized metadata communities and to make the future one in which there are fewer fences between, and more data exchanged among, these three communities with obviously similar challenges and interests. We think the time has come to use the vastly changed technology environment to do that.

*It would be too easy for a response to this post to be in the form of “Oh, you’re just tooting your own horn here,” and indeed we are in some measure doing that. But we do this work because we believe it’s important. We don’t believe it’s important merely because it’s what we do. We believe in the value of the work we’ve done and will do, and we see a great deal of relevance for it as part of the BibFrame discussions.

Selection of papers and presentations:

Jon Phipps’ presentation on mapping at ALA Anaheim
A Reconsideration of Mapping in a Semantic World / Gordon Dunsire, Diane Ileana Hillmann, Jon Phipps, Karen Coyle
Gordon Dunsire’s presentation at the recent London meeting “RDA, 5 years on”

Post written by Diane Hillmann and Jon Phipps.

By Diane Hillmann, July 4, 2012, 2:35 pm (UTC-5)