As one who entered college in 1966 and experienced the sixties from the front lines, I sometimes wonder how I got to be so old and stodgy. I particularly think of this when I pass plate glass windows while walking on the street and glimpse my reflection in them. Who is that gray-haired old broad in the window? It’s always a shock, despite the fact that I’ve been gray-haired for a very long time, and demonstrably growing old at the same rate as everyone else.

I’m sure some people reading this are laughing at the self-characterization “stodgy,” given that in some quarters I know I’m thought of as a very radical library ex-cataloger with unfortunate tendencies to disparage the status quo and suggest outlandish changes to how things are done in libraries. I suppose I use the “stodgy” bit to suggest that it isn’t necessarily the case that I believe we have to throw out everything we’ve done in the past in pursuit of new approaches. I do a lot of evangelizing on this point, both because I think it makes change less frightening, and because I believe we know a lot more than we think we do about what needs to change; we just have trouble separating that knowledge from the familiar and comforting packages we know so well. We’ve been distributing our knowledge in those packages for a long time, and it’s human nature to stick with what we know when we’re challenged. After all, in our early days as a species, we learned to eat the familiar stuff because we knew it wouldn’t kill us. Those who preferred the new and unfamiliar invariably ran into trouble, and often failed to pass on their genes. Our genes come from the people who knew that the new stuff could kill you.

My own experience with change has been a mixed bag, certainly. When I was in college I worked for the campus radio station (I was a TV-Radio major). Because I had radio experience in high school, I got a “job” immediately in my freshman year, doing a program called “Dinner Date” which utilized only a certain narrow set of recordings (then on vinyl only, of course). I was supposed to play only instrumentals in the genre that we now refer to as “elevator music.” I tried, in vain, to shift the playlist a bit over to slightly jazzier stuff, bossa nova, and suchlike, but that didn’t go over very well with the then powers-that-were. (I should point out that in those days women weren’t allowed to read news because our voices didn’t have the proper authority—but that’s another story). The mission of the show was to provide quiet background music for people who might be eating their dinner, the music also interspersed (and this was not explicit, but we all got it) with a sexy female voice to keep people coming back. This was an interesting time to be doing educational radio–which was what we were supposed to be doing–before NPR and public radio, when there was a distinct boundary being maintained between commercial and non-commercial radio which didn’t have anything to do with the presence or lack of advertising. Rock and roll, which was what we all wanted to play instead of the classical music leavened with a bit of folk, jazz and elevator music on the side, was not allowed.

Our Music Director at that stage, in charge of our quite extensive library of recordings, was a fellow who rather blindly followed the practices of the past, and as part of his job he reviewed all new records that flowed in on the gravy train of freebies provided by record companies. As he reviewed, with the use of a sharp tool, he rendered unplayable all those tracks unsuitable for the mission of the station at that time. After his tenure, a real musician became the Music Director, and he put a stop to that practice, but in the meantime, the station and its music collection was permanently disabled. When the station began to change (it’s now an NPR station specializing in Jazz) their collection of recordings was almost unusable. (Now that I think of it, this strategy reminds me of the practices I railed against in my early days at Cornell, where catalogers were told to delete MARC fields that the library then had no use for, which subsequently had to be laboriously re-done when fashions changed.)

I’ve been reading Karen Coyle’s “Understanding the Semantic Web: Bibliographic Data and Metadata,” (Library Technology Reports, Jan. 2010) both because I want to comment on it more fully in this space and because I’m preparing for a class I’m teaching in the Spring quarter at the University of Washington (more about that later). Since I loaned my paper copy to Jon and won’t get it back until Friday, I’ve been limited so far to reading the first chapter, “Library Data in a Modern Context,” which is available online, and includes in its closing paragraph the following:

“The need to change does not mean that what you are doing is wrong. Instead, it often means that something in your environment has changed, something that you cannot control.”

This strikes me as a critically important point. Looking back I can see that much of our conversation about what is happening in our environment, and what needs to be part of our change strategy, is being heard as blame: “You’re doing this all wrong.” And that’s simply not true—if you don’t believe me that where we are makes perfect sense if you understand the ‘why’ of it, read that first chapter. Karen has a wonderful way of explaining things (one that I envy, I have to say), and she does a great job in taking the reader from the nineteenth century and Panizzi to the present day, and it all hangs together. It’s clear that we’ve been responding appropriately to the change in user needs and technology, and I for one think we can do it again, once we move beyond the blame game.

I get a lot of questions about how and when this change will happen. Nobody really wants to be on the bleeding edge of some of this, nor do they want to be left behind, but the effect of all the strategic planning seems to be that we are all teetering on the verge of some tipping point we can’t quite see clearly. “What’s it going to take?” they ask, and I have to be honest: I don’t know. Maybe the smartest thing would be to just declare that we’ve passed that tipping point and stop waiting for somebody else to take the first plunge. Maybe if we could get everyone to read Karen’s report (the printed version can be ordered from ALA directly), as well as to look for the February issue, which will also be Karen’s work, we’ll have the courage to make a move in the right direction. Hope springs eternal (and perhaps that’s the ‘drug’ referred to in the title!).

By Diane Hillmann, February 17, 2010, 4:41 pm (UTC-5)

One of the things I often do in the weeks following ALA conferences is check out the blog posts about sessions I missed attending. One such was the session on “Recent Trends in Catalog Architecture: ALCTS Catalog Form and Function Interest Group.” I don’t recall what we were doing instead of going to this session, but it looks like we missed a very interesting conversation. That happens far too often at the jam-packed Midwinter and Annual conferences—but thankfully these days the bloggers are keeping us all better informed. Laura Akerman has posted an interesting report of the IG’s session. Slides for this session are also available.

One of the inescapable conclusions to draw from reading the reports is that the separation of metadata from display considerations seems to be well underway, at least if you consider those folks in the innovation corner. This is very welcome news, particularly for one who has had too many frustrating conversations over the past few years with folks whose heads are so deep into MARC that the revolutionary idea that display decisions are not inextricably tied to metadata fields is completely foreign. Also, it seems clear from the report that the notion of strict MARC-ish boundaries around what kinds of metadata libraries might be interested in providing in a discovery interface is no longer quite so impervious.

I was struck by the blogger’s quote from the chair: “These presentations were varied but all concerned the architecture and functionality of multiple layers - “what happens (or needs to happen) in between” to transform, combine, and synchronize metadata.” It seems that everyone is traveling down this path these days, which is very good news indeed. In the search for an appropriate metaphor that we’ve all undertaken, Frances McNamara from the University of Chicago, the first presenter, called their aggregation of resources ‘stone soup’–a nice metaphor on the whole, though it does imply very hard lumps and not a lot of flavor. The second group of presenters, Joshua P. Barton & Lucas Wing Kau Mak from Michigan State University, entitled their presentation “To Fix A Leaky Sink: Envisioning The Potential of Discovery Layers,” although this seems an oddly juicy metaphor for what blogger Laura Akerman described as primarily a “think piece.”

The third presentation was by Jennifer Bowen, of the eXtensible Catalog Project (full disclosure: Jon and I worked with XC as consultants early on in their development and at various points since). XC’s view of metadata management borrows a great deal from the work we did (and published about) during the days when the National Science Digital Library (NSDL) was young and doing interesting things (before it became a development shop for the Fedora CMS). (Links to the original papers can be found at our website.) From the point of view of libraries looking at the management of metadata as something that happens outside of an ILS, their service architecture is by far their best stuff. In addition to Jennifer’s slides, anyone interested in what XC is doing should take a look at their recently released screencasts, particularly the first and the third, which includes a detailed description of what the MST (Metadata Services Toolkit) really does. Our metaphor for this functionality has always been “The Metadata Washing Machine,” so we’ve really no business complaining about anyone else’s choices!

The last presentation was by Aaron Wood, of the University of Calgary, and the placement of this particular presentation is important, because it clearly points out that once you’ve created your soup and/or washed your dirty metadata, you still need to figure out how to present results to users in ways that are far different than what we’re used to in the typical ILS interface. Calgary uses the new Summon product from Serials Solutions (which, by the way, also builds on the notion of aggregated metadata services we pioneered in NSDL). The question that turns up in the blog post “how to prevent the local institution’s collections (print and digital) from becoming marginalized in search results when combined with a much larger number of full text resources (licensed journal articles etc.)” is right to the point, because it assumes, as we all should, that the usual alphabetic display that most of our ILS systems produce, that must be scrolled or paged through, is no longer good enough. The presentation points out something we found in NSDL several years ago, that when metadata records and full text are trolled for keywords, the full text always tends to rise to the top, and that’s not necessarily always a good thing from a user point of view. Metadata in general is a poor performer when simply treated as text by full-text indexers, not just because there’s not enough text, but also because full-text indexing generally perceives no additional value in controlled vocabularies and their relationship to well-defined properties (and the richness of the relationships that lie behind the vocabularies as well).
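To make that last point concrete, here is a toy sketch in Python of the difference between treating a record as undifferentiated text and giving matches in controlled fields extra weight. Everything in it is an assumption made up for illustration: the field names, the weights, the sample records, and the scoring functions. It is not how Summon, NSDL, or any particular indexer actually ranks results.

```python
def term_count(text, term):
    """Count case-insensitive occurrences of a single-word term in text."""
    return text.lower().split().count(term.lower())


def flat_score(doc, term):
    """Treat every field as plain text and simply count the matches."""
    return sum(term_count(value, term) for value in doc.values())


# Hypothetical weights: a match in a title or a controlled subject term
# counts for more than a match buried in running text.
FIELD_WEIGHTS = {"title": 5.0, "subject": 8.0, "body": 1.0}


def weighted_score(doc, term):
    """Count matches, multiplying each by the weight of the field it occurs in."""
    return sum(
        FIELD_WEIGHTS.get(field, 1.0) * term_count(value, term)
        for field, value in doc.items()
    )


# Hypothetical sample data: a short metadata record and a full-text article.
catalog_record = {"title": "Amber", "subject": "amber fossils"}
article = {"body": "amber is mentioned here and amber appears again and amber once more"}

for name, doc in [("catalog record", catalog_record), ("article", article)]:
    print(name, flat_score(doc, "amber"), weighted_score(doc, "amber"))
```

With the flat count, the article outranks the catalog record simply because it has more words; once the fields carry weight, the record whose title and subject are actually about the term comes out ahead.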

But I’m sorry I missed this session, and thanks to Laura Akerman and the Metadata Blog for providing such a good summary of the session.

By Diane Hillmann, February 10, 2010, 5:21 pm (UTC-5)

One of my favorite aphorisms is “Time flies, whether you’re having fun or not.” I’m not sure where I heard it, but for sure I’m not creative enough to make it up on my own. The truth of it has been reinforced by the realization that here it is the end of January, post-Boston Midwinter, and I’ve done so little blogging for the past six months that it’s a stretch to call myself a blogger. Time to reclaim the turf. So this post is an attempt to summarize what I’ve been doing all that time, some of which has come to a sort of fruition, but some still ripening.

Last fall I participated as a speaker in a NISO Webinar “Bibliographic Control Alphabet Soup.” I decided for my topic to talk about some of the issues around building the RDA vocabularies from spreadsheets and ERDs (Entity Relationship Diagrams), which is what I had to work with on that task. (You can see the ERDs on the RDAOnline website.) Part of my reason for trying to tackle those issues in the webinar is that the vocabularies had become a major focus of my working life for quite a while (does the word ‘obsessive’ sound too dramatic?). At the time, we (Jon Phipps, Karen Coyle, Gordon Dunsire and I) were also trying to write an article about what we’d done with the RDA elements and vocabularies, and that article came out last week in D-Lib Magazine. Starting the article last fall prompted me to create some diagrams in an attempt to convey the structure of these vocabularies, and to provide examples for folks to look at while they puzzle through the ideas. I used a subset of the diagrams in the webinar as well, but most of them are used to better purpose in the article. I’m not at all sure that the several hundred folks listening in to the webinar got much of what I was trying to convey; it’s pretty new stuff for most people, and I was trying to fit too much into my limit of 20 minutes (not nearly enough time). I hope that for those who might have been overwhelmed or confused by the webinar, the article will help to make the work we’ve done a bit clearer.

The writers of the article had a number of purposes in mind, not least to document the decisions and rationale for the strategies taken by the DCMI/RDA Task Group charged with the work of building the vocabularies. Given that the library community generally has little experience with RDF and RDF vocabularies, it seemed particularly important to attempt to provide explanations that we hoped would be accessible to most librarians, and I hope we’ll have sufficient feedback to determine how close we came to that goal. Given that we expected the article to come out just after Midwinter, we did most of our presentations in Boston on the implications of the vocabularies rather than the mechanics. I talked about these at the Technical Services ‘Big Heads’ meeting (slides) and Jon, Karen and I included some of these expectations in our introduction to application profiles at CC:DA on Monday (slides).

Just before Midwinter we started a dialogue with the JSC about next steps, and I hope that some of the issues that have come up will be open to public discussion, just as the vocabulary building was done in public. We’re all in a fairly intense learning space at the moment (at least it seems so to me), and keeping the process open and visible for all seems in support of that learning. We’re also continuing to update the vocabularies as errors are brought to our attention, and to complete portions where we need to add information. One such point is the property/subproperty hierarchies in the element sets: at the time the work was done there was a limitation in the Registry software that prevented us from including the proper hierarchies in both the general and FRBR-bounded portions of the vocabularies. That limitation is now removed, and the missing relationships will be added.

Interestingly, we’re getting lots of help in finding errors from our friends in Germany, who are doing nifty things with the vocabularies. They’ve been particularly helpful with things like inadvertent spaces introduced in URIs and other things difficult for a human proofreader to find. Because we don’t have good ways (yet) to visualize the relationships, their help has been invaluable. We urge anyone else who spots an error or who has a question to use our feedback button in the Registry to communicate with us about their concern.
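A stray space inside a URI is exactly the kind of error that is hard for a human eye and easy for a machine. As a minimal sketch in Python, assuming the URIs have already been pulled out of the Registry into a plain list of strings (the function name and the sample URIs below are hypothetical, invented for illustration, and are not actual Registry URIs), a check might look like this:

```python
import re

# Matches any whitespace character: space, tab, newline, and so on.
WHITESPACE = re.compile(r"\s")


def find_suspect_uris(uris):
    """Return (index, uri) pairs for URIs that contain whitespace anywhere,
    including leading or trailing whitespace."""
    return [(i, uri) for i, uri in enumerate(uris) if WHITESPACE.search(uri)]


# Hypothetical sample data, not real vocabulary URIs.
sample = [
    "http://example.org/vocab/title",
    "http://example.org/vocab/ creator",  # stray space inside the URI
    "http://example.org/vocab/extent ",   # trailing space
]

for index, uri in find_suspect_uris(sample):
    # repr() makes leading and trailing spaces visible in the output.
    print(f"entry {index}: {uri!r}")
```

Printing with `repr()` is a deliberate choice here: it shows the quotes around each value, so a trailing space that would vanish on the screen is visible.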

By Diane Hillmann, January 27, 2010, 12:16 pm (UTC-5)

A few weeks ago I attended the opening of an amber exhibition at our wonderful Museum of the Earth which is only about 6 miles from my house. The exhibit had a little of everything: science, history, geography … and jewelry. I have to admit (and this will surprise no one who knows me) that the jewelry was a big draw, and I went laden (literally), with a varied selection of my own collection of amber. Hey, laugh if you will, but these days I work at home, and have very few opportunities to wear jewelry of any kind—so this opening was irresistible.

But enough about jewelry; I want to talk about bugs and bibs! As you might expect in a science museum, there was far more emphasis on amber as a carrier (so to speak) of bits and pieces of the past, particularly the biological past. As a preservation medium, amber is hard to beat, though, of course, there are limitations in terms of the size of the biological specimen. I didn’t realize it, but apparently fake amber is everywhere, and one way to recognize the bio fakes is that they include specimens too big to be slowed down by sticky tree sap. The exhibit had some nice fakes, including a small snake in plastic colored to look like amber.

The interest of the scientist in amber is that it stops the process of decay for those creatures lucky (or unlucky) enough to be captured in its grasp. The amber captures a moment in a bug’s short life in a way that allows us to examine it closely and in detail in our own time, millions of years later. In much the same way, the Study of the North American MARC Records Marketplace by R2 Consulting captures a moment in time, very likely too late to have much of an effect on the future, but just in time to capture the state of the cataloging world before the tsunami arrives. [R2]

But the R2 report is as fascinating to a metadata maven as a bug in amber is to a biologist. It describes in detail the current world of cataloging distribution, focusing on the “dysfunctional market” that has grown like Topsy around distribution of MARC records. It gets exactly right the disconnect between the librarian sense that “records want to be free” and the business approach that production costs must be recouped and profit margins maintained for there to be any point in participation at all, and comes down predictably in support for the latter view.

It’s a fascinating read, particularly if the fact that LC commissioned the report is kept in mind—because this is hardly a context-free analysis. I was particularly interested in the description of the businesses outside of libraries supplying MARC records either as contractors or as part of a materials supply chain. As a former denizen of one of the large academics that R2 identifies as part of the “green tier” (more about that later), I was aware of the fact of that portion of the MARC marketplace, but had little contact with it.

The gist of the report is that the MARC distribution network is a dysfunctional hybrid, partly librarianishly “free” and partly commercial marketplace. The authors feel that it should be possible to increase the supply of MARC records from “the community” without relying on poor beleaguered LC to supply them, and they give us a multitude of statistics to support that assertion. They believe that there’s enough time to accomplish this and save everybody money before the promised changes come to pass, and all must be re-thought.

My comments on this report, informed by my well-known biases, fall into a few convenient categories:

Dysfunctional? Probably …

Much of the first portion of the report is devoted to a description of the current “marketplace” and a discussion of the survey results that illuminate and inform the description. It’s here that R2 makes the case that LC is subsidizing the whole shebang, to the benefit of everyone else.

“Both libraries and vendors (at least the good ones) rely on “service” to their respective clienteles to distinguish themselves, but there are important distinctions in their respective definitions of the term. In the commercial world, service must exist within a context of profitability, in which all costs are covered and some additional increment is contributed to the company’s continued growth and as a return on the capital initially invested. The library service ethic is much more open‐ended and less directly constrained by costs.”

The report contains much interesting description of what the authors perceive as the bifurcated market, one which, in their view, inhibits the growth of useful marketplace incentives to increase output:

“This tension ‐‐ between community values and commercial values, between idealism and pragmatism, between social responsibility and private benefit – has deeply affected some aspects of the library market. Cataloging, regarded by many as the heart of librarianship, is one of those areas.”

It’s pretty clear where the authors come down in this conflict between “community” and “commercial” values:

“The impulse to share records for which the costs have not been fully recovered may make sense as a form of community good, but is not sustainable without some form of subsidy or exchange. From the commercial viewpoint, it’s simply bad business.”

And, perhaps more to the point:

“It should not go unnoticed that LC itself provides open access to its MARC records via multiple channels. The prevalence of open databases is a key factor in the economic confusion that plagues the MARC Record Market … “

The report goes on to a rather interesting and revealing categorization of the complex MARC marketplace into three tiers. The “Green Tier” includes the “ … oldest, most traditional segment of the market, in which nearly all MARC records originate.” This tier includes both libraries and businesses, as well as OCLC, and is, as such, a mix of the “community” and “commercial” as described earlier. The big thing is that they’re contributors to the marketplace, even if also consumers. According to R2’s statistics, this tier includes 97% of academic libraries, 63% of public libraries and a similar proportion of school libraries.

The next tier down (and it’s clearly down, in this categorization) is called the “Blue” or “opportunistic” tier, including by the authors’ definition “ … non-OCLC libraries and underfunded libraries without adequate cataloging capacity.” More interestingly, this tier “ … is also home to open database providers, and the pervasive (did they mean to say “pernicious”?) Z39.50 protocols used to locate and obtain MARC records free of charge.” But R2 makes note of the shifting borders between tiers: “Both in Canada and in the US, historically ‘green libraries’ are adopting ‘blue tier’ practices and expectation, as library budgets are cut and as Z39.50 targets proliferate. Nearly all libraries, regardless of size or type are strategically patient, periodically re-searching the ‘blue tier’ for certain records to become ‘available’; but for ‘blue tier’ libraries, this is the primary approach to cataloging … Open Access and Open Archives Initiatives reside in the blue tier, strongly supported by the basic philosophical stance that access to information should be free.”

The “bottom” tier is the non-library “purple” tier, and this description clearly defines the real threat to the current MARC world, not just the fuzzy-wuzzy library community notion of sharing: “The non-library (purple) tier operates to a large extent without appreciation for or experience with MARC records, and without much regard for the library market in general. It is important to remain aware of activity in this segment, of course, because developments here pose the most significant competitive threats to the traditional values and economic structures of the ‘traditional green tier,’ and even the ‘opportunistic blue tier.’ This is the place where newer technologies and non-MARC data formats are used and developed.”

Obviously, we have met the enemy of libraries, and according to R2 it happens to be us. But wait, there are some unexpected companions in the nasty “purple” tier. In addition to the usual suspects, like Google and Amazon, we find … “OCLC pro-actively operates within the “traditional green tier” and within the “purple non-library” tier. OCLC member libraries, however, are also very active in the “opportunistic blue tier,” sharing records in ways that may conflict with OCLC’s proprietary intent.”

The battle lines seem clearly drawn here, with the “information wants to be free” crowd clearly the enemy, whether in sheep’s clothing as traditional librarians or explicitly displaying wolfish teeth as a member of that unappreciative crowd that cares little about the current MARC marketplace and would like to see the library data silo dismantled brick by brick. No matter that we seek these changes for the benefit of libraries struggling to live within their budgets and to innovate to serve their users as well–shame, shame!

The R2 Solution

The report’s authors actually manage to ask THE most relevant question that should be (and often is) on our minds, but only to dismiss it as out of scope:

“The practice of cataloging has never before faced the level of scrutiny it now enjoys … or endures. Two types of question predominate. First, are traditional cataloging and the MARC record—even after modernization by RDA and FRBR—still necessary in an era of full‐text indexing, OpenURL linking, and other discovery options? While this is a worthy question, it is fortunately not within the purview of this report.”

Leaving aside the odd assumption that RDA and FRBR represent the “modernization” of the traditional MARC record, they couch the issue only in the context of a limited number of technologies, never mentioning the gorilla in the room, the data being built by others outside our comfy and bounded silo. Then they go on to pose the questions they would rather address:

“How do we as a profession understand and explain the costs and benefits of producing and distributing cataloging records? Where and by whom are most original records produced? What incentives exist to stimulate production? What are the barriers that discourage production? How does the library market assign value to the work of cataloging? What is the return on any organization’s investment in producing original catalog records? How does shared cataloging and free or low‐cost distribution of records affect the market? To what degree is market activity subsidized by LC and by the work of individual libraries?”

The problem is that, without an answer to question #1, the other questions seem hardly relevant.

“As noted there, the market is in need of adjustment, if it is to create an incentive for producers while retaining the community ethic of free sharing of data. The ethic of the cooperative can only be sustained if the full costs of production are borne by the community.”

It seems to me that the market will be adjusted, and the recognition of the full costs of traditional cataloging and the plunging ROI as we address Question #1 will hasten that readjustment, but probably not in the direction R2 predicts or that those seeking compensation for their MARC record production might want.

The authors provide some telling glimpses into their world view in their discussion about crosswalks:

“ONIX to MARC record translations and fully operable MARC to non‐MARC metadata crosswalks could dramatically alter this three‐tiered landscape. To date, major players in the blue and purple tiers have failed to buy into the concept of shared bibliographic and authority data. While some efforts to encourage cross‐market cooperation are underway (notably the OCLC/NISO forum), fierce competition flourishes within and between each tier of the market. Even more problematic, each tier has distinctly different needs and incentives, making it difficult to establish an adequate degree of shared urgency and/or investment in new solutions.” [RIN]

Clearly, in a world where the only relevant data one can see “out there” is ONIX, crosswalks seem a no-brainer, but to call this view “limited” seems far too kind.

Ultimately, R2 thinks we still have time to tweak the marketplace and flog out more MARC records by identifying and marshaling unused capacity (e.g., hidden catalogers) and providing economic incentives. In my view, this is a flawed argument, and takes away from the need to plan for the transition to a much different future. I agree that MARC will indeed be used by libraries for some time, but as a lossy exchange format, not the lynchpin of the library data world. R2’s strategy prolongs the old world, jeopardizing the possibilities of moving forward in a timely manner.

The Sacred Cow Effect

Sadly, the whole report, interesting though it is as a biological specimen, fails utterly to examine the data activity outside libraries except to demonize it and its proponents. In making the Library of Congress into Poor Nell, they also deny the innovations in creating and reusing data that LC itself has accomplished, for instance, the American Memory Project, the LC Flickr Project, and many other digital initiatives that have proactively (and openly) pushed the metadata envelope in ways that inspire and engage us. The report fails also to understand that the changes they fear, the ones that they rightly expect to undermine the current marketplace completely, are already nibbling ravenously around the edges of MARC and its traditional marketplace in ways that will hardly take the 5-10 years to make change become real that R2 predicts.

Last summer at ALA in Chicago, a small group of us pulled together a linked data program, hearteningly well attended, where Eric Miller persuasively predicted that the return on investment for integrating “free” metadata from “the cloud” will trump traditional concerns about quality. [Miller] Mainstream entities like the New York Times are moving aggressively into the linked data space, seeking to merge their data with the likes of DBpedia and FreeBase. [Sandhaus]

Consider this from MMA partner Jon Phipps: “The future cataloging marketplace will have to compete with ‘free and more than good enough’. Like the people who initially sneered at Google for being too simplistic and ignoring metadata when it came to searching, the professional cataloging community ignores (or tries to fend off) the enormous future output of Linked-Data-enabled systems at its peril. By opening up a clear relationship between the semantic web and library data sets, the RDA vocabularies represent a threat to the hegemony of catalogers. The RDA vocabularies are a disruptive, game-changing technology.” [Phipps]

The reality is that it’s not just the marketplace that’s changing, it’s also the profession. As part of the analysis of why the numbers of catalogers reported in their survey don’t lead to the expected output levels, R2 speculates: “These data lead us to ask what catalogers are doing. Bob Wolven and others suggest that catalogers are being called upon to apply their knowledge of cataloging principles to new initiatives; and specifically to creating metadata for digital and archival collections.” [Wolven] R2 seems to imply that this is a bad thing, taking away resources from the business of actually churning out MARC records, but certainly these newer roles are critical to the survival and renewal of libraries, far more than shoring up current MARC record production.

The solutions the R2 report poses, from paying more attention to recouping cataloging costs to re-centralizing creation of cataloging records, if taken up, would actively undermine a transition to participation in a more open, linked data world. They represent a step backward, in a community that has already internalized the values of sharing and decentralized data critical to seeing value in the world of openly accessible data lying on our doorstep.

Oddly enough, the report ends with a quote from my old friend Sherman Clarke (unattributed, so most likely as a comment to the survey):

“We collectively need to have a model that allows us to do some of the building of BIBCO records mechanically or through accretion of metadata from institutional records or other record loads. OCLC already does considerable building of the master record from incoming records; what we need is something more like the metadata that is becoming usual in NewGen environments. If someone adds a tag or review or picture, that becomes available in the master cluster. Not a BIBCO record, but a BIBCO cloud of metadata for a particular manifestation of a work/expression.”

Yup, you got it, Sherman. The change we need is not really about records, or catalogers; it’s a new way to think about information and added value.

[Miller] Miller, Eric. “Linked Data and Libraries: Grassroots Program: From Legacy Data to Linked Data, Preparing Libraries for Web 3.0.” Available at: zepheira.com/talks/ala-em-lod.pdf

[R2] Study (for the Library of Congress) of the North American MARC Records Marketplace, October 2009, R2 Consulting LLC, Ruth Fischer, Rick Lugg. Available at: www.loc.gov/bibliographic-future/news/MARC_Record_Marketplace_2009-10.pdf

[RIN] Research Information Network. (2009). Creating catalogues: bibliographic records in a networked world. Available at: www.rin.ac.uk/files/creating_catalogues_REPORT_June09.pdf

[Sandhaus] Sandhaus, Evan. “150 Years of Semantic Technology.” Presentation at the Cornell University Libraries Metadata Working Group Forum, Nov. 13, 2009. Slides will be available from: metadata-wg.mannlib.cornell.edu/forum/index.php?date=2009-11-13

[Wolven] Wolven, Robert. (2008). In search of a new model: Columbia University Libraries: Robert Wolven reflects on what’s next for cooperative cataloging. netConnect, 1/15/2008. Available at: www.libraryjournal.com/article/CA6514925.htm

By Diane Hillmann, November 24, 2009, 11:04 pm (UTC-5)

…in the RDA Ontologies. Do we? After all, they’re a big part of the ‘Access’ in Resource Description and Access (RDA). But they’re not particularly semantically meaningful, especially if you have the component parts available. An Access Point is just a structured string. For instance a ‘Publication Statement’ Access Point for “The Daytona daily news” might look like:

“Daytona Beach, Florida : Geo. F. Crouch, 1903-1926”

It has a formal syntactic structure, and semantics derived from adherence to that structure when the string is created:

“Place of Publication” : “Publisher’s Name”, “Date of Publication”

Note that the punctuation is part of the formal grammar that helps parse a grammatically correct statement into its semantically meaningful constituent parts.
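To make the point about punctuation concrete, here is a minimal Python sketch that uses the separators in the pattern above to split a statement back into its parts. The regular expression is my own reading of that pattern (place, " : ", publisher, ", ", date), not anything RDA specifies, and the function and group names are purely illustrative.

```python
import re

# Assumed grammar, read off the pattern above:
#   place " : " publisher ", " date
# The place may itself contain commas, so " : " is the first reliable separator.
PUBLICATION_STATEMENT = re.compile(
    r"^(?P<place>.+?) : (?P<publisher>.+), (?P<date>[^,]+)$"
)


def parse_publication_statement(statement):
    """Split a publication statement into place, publisher and date strings."""
    match = PUBLICATION_STATEMENT.match(statement)
    if match is None:
        raise ValueError(f"not a well-formed publication statement: {statement!r}")
    return match.groupdict()


parts = parse_publication_statement(
    "Daytona Beach, Florida : Geo. F. Crouch, 1903-1926"
)
print(parts)
```

The parse only works because whoever created the string followed the same grammar; a statement missing the " : " separator is rejected rather than guessed at.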

And this is the way we’ve been doing things forever (well it seems like forever) — semantics is derived from proper use of a syntax that everybody who is creating and using shared data has agreed upon in advance. And as long as everybody uses precisely the same syntax this works great. It works really, really well with structured syntaxes like MARC21:

260 $a Daytona Beach, Florida
260 $b Geo. F. Crouch
260 $c 1903-1926

…and hierarchical syntaxes like XML:

<publicationStatement>
  <placeOfPublication>Daytona Beach, Florida</placeOfPublication>
  <publisherName>Geo. F. Crouch</publisherName>
  <dateOfPublication>1903-1926</dateOfPublication>
</publicationStatement>

The RDA documents go so far as to call an Access Point an Element and its constituent parts Sub-Elements, again clearly thinking of this nice syntactically defined semantics.

But what if your data says, semantically, that “Place of Publication” isn’t the Name of the place or the Label for the place, but a URI that identifies the Place itself: a resource rather than a string? The Access Point rules don’t let you stick a URI in the Publication Statement where a string is supposed to be.

What about “Publisher’s Name”? That’s clearly going to be a string no matter what — names tend not to be resources. But there’s probably a Publisher resource out there, somewhere, with a URI that identifies the Publisher and probably has a Name or a Label property that provides a string that you can stick in the Publication Statement.

We’ll just ignore “Publication Date” for now, since that’s a very different can of worms: slimy, smelly worms.

At the moment, RDA doesn’t acknowledge the existence of the resources supplying the strings for an Access Point, and it lets substantial ambiguity sneak in with property names like “Place of Publication” rather than “Place of Publication Name” like they resolved with “Publisher’s Name”. But that ambiguity didn’t exist when all the data was strings — strings you used for indexing, and displayed to the user, and didn’t have to go fetch from somewhere because they were right there in that 260 field.

I listened to a radio program that referred to all of the money that everyone in the world has available to invest as “The Global Pool of Money” and I think that applies quite nicely to the Linked-Data notion of the Semantic Web — “The Global Pool of Data”.

The open world model of the Semantic Web assumes that you will never have all of the available data that describes a resource, and the RDF data model supports this. Resources often exist, outside of traditional library data, available from the Global Pool of Data, that can supply the necessary labels.

But of course we usually just have the labels. This is library data made for cards that need to be put in the correct order and read by a person. And the Global Pool of Data usually just has a bunch of resources. This is linked data, in no particular order at all, meant to be read by a machine.

So, specifying Access Points as pre-coordinated strings actually provides us with a major opportunity when defining the ontologies; several opportunities actually:

  • We can formalize each Access Point specification into what Dublin Core calls a Syntax Encoding Scheme (SES) and say that each Access Point has a datatype.
  • We can clarify the semantics of using a label rather than a resource for properties (sub-elements) like “Publisher’s Name”
  • We can clarify the semantics of using a resource for “Place of Publication” and say that the label used in an Access Point must be the Name of the Place and this is distinctly different.
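The distinction drawn in the bullets above, between a Place resource and the Name used for it in an Access Point, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Place` class and the `publication_statement` function are hypothetical names, and the only real identifier is the DBpedia URI for Daytona Beach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """A resource: identified by a URI, with a name available as a label."""
    uri: str
    name: str


def publication_statement(place, publisher_name, date):
    """Build the pre-coordinated string; only labels go in, never the URI."""
    return f"{place.name} : {publisher_name}, {date}"


daytona = Place(
    uri="http://dbpedia.org/resource/Daytona_Beach%2C_Florida",
    name="Daytona Beach, Florida",
)
statement = publication_statement(daytona, "Geo. F. Crouch", "1903-1926")
print(statement)
```

The Access Point stays a plain string, but the data behind it keeps hold of the resource that supplied the label.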

So, refined to use properties that are a bit more semantically clear, we have a slightly modified Publication Statement:

“Place of Publication Name” : “Publisher’s Name”, “Date of Publication”

…we tie these properties specifically to a FRBR Manifestation that RDA says must be what they describe, and in RDF the supporting ontology looks like:

rda:placeOfPublicationManifestation a owl:ObjectProperty
rda:PlaceOfPublicationName a owl:DatatypeProperty
rda:publisherManifestation a owl:ObjectProperty
rda:PublisherName a owl:DatatypeProperty

Here’s our sample instance again (by the way, this data is from a real linked data resource):

“Daytona Beach, Florida : Geo. F. Crouch, 1903-1926”

<http://chroniclingamerica.loc.gov/lccn/sn93063916>
  rda:publisherManifestation <http://???> (blank, but we know one must exist)
    rda:PublisherName "Geo. F. Crouch"
  rda:placeOfPublicationManifestation <http://dbpedia.org/resource/Daytona_Beach%2C_Florida>
    rda:PlaceOfPublicationName "Daytona Beach, Florida"
    rdfs:label "Daytona Beach, Florida"

This tiny chunk of data was gathered by hand and mapped (by me) from the existing resources and the labels supplied by those resources.

Someday there will be services that comb through linked data looking for missing data like that Publisher resource, and will perform a search on the Global Pool of Data looking for resources with labels matching that library data, expressed in RDA/RDF, fill in the missing pieces and present the lucky cataloger, and ultimately the user, with all that rich linked data.

The eXtensible Catalog project is working on services that do just that kind of thing, so someday may not be too far off.
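As a rough idea of what the label-matching step of such a service might involve, here is a toy Python sketch. It is not how the eXtensible Catalog services work; the in-memory `pool`, the normalization rule, and the publisher URI under example.org are all hypothetical stand-ins, since no real URI for the publisher is known.

```python
def normalize(label):
    """Fold case and collapse whitespace so near-identical labels compare equal."""
    return " ".join(label.casefold().split())


def find_resources_by_label(pool, label):
    """Return the URIs in the pool that carry a label matching the given string."""
    wanted = normalize(label)
    return [
        uri
        for uri, labels in pool.items()
        if wanted in {normalize(candidate) for candidate in labels}
    ]


# A stand-in for the Global Pool of Data: URI -> labels.
pool = {
    "http://dbpedia.org/resource/Daytona_Beach%2C_Florida": ["Daytona Beach, Florida"],
    # Hypothetical URI, invented for this illustration only.
    "http://example.org/publisher/geo-f-crouch": ["Geo. F. Crouch"],
}

matches = find_resources_by_label(pool, "geo. f.  crouch")
print(matches)
```

Exact matching on a normalized label is the crudest possible strategy; a real service would have to cope with variant forms and with several candidate resources sharing one label.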


By Jon, November 3, 2009, 10:42 am (UTC-5)

Last week I was in the UK, primarily to attend a DCMI Registry Community Workshop organized by UKOLN, scheduled for Friday, July 24th. Early the following week we found out that as we were gathered in York discussing distributed registries, Rachel Heery passed away after a long battle with breast cancer. Rachel was one of the founders of the Community (then called a working group), and was involved in building a number of registries, including the DCMI Registry and the IEMSR Registry at UKOLN.

There have been a lot of postings about Rachel this week from colleagues and friends, and I wanted to add my voice to that chorus of tributes to an exceptional person. I didn’t know Rachel as well as I would have liked; we worked on different continents and generally crossed paths primarily at DC conferences. But we were both members of two distinct minorities within DCMI: women, and implementers. Neither of us trained as technologists, and we both came to the sometimes dauntingly technical discussions at DCMI from the point of view of those trying to use DC for real projects, too often frustrated with the 50,000 foot viewpoints expressed by the more technically astute.

Stu Weibel, who knew Rachel, as I did, in the context of Dublin Core, brought her back for me the most strongly by reminding me of one of Rachel’s characteristic interjections:

“We emulate those we admire, and I have often found myself over the years using a phrase that signaled, from Rachel, an objection worthy of discussion… a sort of lilting “Hang on…!” Those who have worked with her will hear echoes of the tone and inflection that made the phrase hers, and commanded respectful attention, a flag that something was not quite right. I always think of her when I say it, and will always try to use it in the service of the honest brokerage of common goals that characterized Rachel’s efforts.”

Stu also reminds us that Rachel’s two most important contributions to the DC efforts (aside from her considerable intellect and personal presence) were in the areas of registries and application profiles–both have been a particular focus for me and Jon over the past four years or so. It gives me some solace to think that she would be pleased that implementers are still working hard in the areas she pioneered, though sad beyond measure that she will not see those efforts bear their promised fruit.

Others who comment on Rachel’s influence and career:
Lorna Campbell
Andy Powell
Lorcan Dempsey
Her UKOLN colleagues

By Diane Hillmann, August 4, 2009, 12:43 pm (UTC-5)

One of the most interesting programs at ALA Annual that I was involved with was the Linked Data grassroots program. Here’s the blurb:

From Legacy Data to Linked Data: Preparing Libraries for Web 3.0. “How can library cataloging data be transformed to function within ‘Web 3.0’ and be understood by non-library web applications? Speakers from both the library and Semantic Web communities will explore the situation in a non-technical manner and describe current work underway to transform legacy library data into linked data.”

The speakers were: Eric Miller (President, Zepheira, Inc.), me, Jennifer Bowen (Co-Principal Investigator, eXtensible Catalog Project, University of Rochester), Rebecca Guenther (Senior Networking and Standards Specialist, Network Development & MARC Standards Office, Library of Congress). Corey Harper of NYU introduced the speakers and fielded questions at the end. Because this was a Grassroots program, attempting to make a place for emerging trends in what is often a program consisting primarily of the hot issues of a year or two ago, all the approved programs got small rooms. The one we ended up in seated about 75, and we filled the floors, the aisles and much of the hallway outside the room. The room was in the Hilton, not easy to find, so it was gratifying how many people made the effort.

American Libraries reported on the program, and from the comments I’ve received it was a successful session and has generated interest in further programming on the subject for next year (and we are actually talking about doing that). In my presentation, I made the case for the readiness of libraries for the challenges of linked data, citing the work done with the RDA vocabularies as foundational to that claim. I admit that part of that claim was, if not actually wishful thinking, at least a rhetorical device, but clearly we are at some kind of tipping point (or approaching it pretty quickly). Every six months when I talk to people at ALA extensively about this stuff, or when I’m out “on the road” talking to colleagues, there is more excitement and more interest on the part of librarians, who are definitely “getting it.”

Presentations are available on the ALA Wiki.

By Diane Hillmann, August 3, 2009, 5:25 pm (UTC-5)

ALA Annual in Chicago has been a blur; I did three presentations (which I hope to talk about and link to slides as time permits). But one issue has been rolling over in my mind ever since I blurted something about it at my first presentation on Friday of Annual, when I was last up on a panel about “The Future of MARC.” Rebecca Guenther of LC spoke about the efforts to keep MARC relevant and Ted Fons of OCLC covered similar topics from the viewpoint of “The Big O” (thanks to Karen Schneider for that wonderful appellation!). A feature of both talks was the idea that reorganizing MARC records into “FRBR-ized” views was really all that was needed to take advantage of FRBR. I argued at that time that this was not the case, and the more I think about it, the more convinced I am: FRBR-ization is not the same as using FRBR in native RDA.

Part of my view is based on the differences between RDA and MARC semantics (not syntax, which is where the conversation usually goes). One of the most overlooked aspects of RDA in general is the rich vocabulary of relationships that it brings to the table for use in bibliographic description. Most people who’ve focused on RDA as a textual guidance or set of rules have overlooked this, because the relationship vocabulary appears in appendices, and most of us don’t consider appendices the most important part of anything. But consider this: in the RDA Vocabularies, each of these relationships has an identifier, is part of a hierarchy that allows expression of bibliographic relationships at several levels, and gives us the ability to use these relationships to navigate the bibliographic landscape without having to delve into records and interpret the text notes we’ve used for the same purpose in MARC. For instance, using RDA you can say that ‘Resource X’ is an abridgment of ‘Resource Y’ (and that ‘Resource Y’ has an abridged version in ‘Resource X’) in a way that a system can expose to the user with no muss or fuss. The relationship is specific, identified and explicitly defined if anybody needs that to apply or interpret it.
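To show why an identified relationship with a declared inverse is so much easier for a system to use than a text note, here is a small Python sketch. The relationship names `abridgmentOf` and `abridgedAs` are placeholders I made up for the example, not the registered RDA identifiers, and the triple store is just a set of tuples.

```python
# Placeholder relationship names (NOT the registered RDA identifiers),
# each paired with its inverse.
INVERSES = {"abridgmentOf": "abridgedAs"}
INVERSES.update({inverse: name for name, inverse in list(INVERSES.items())})


def assert_relationship(triples, subject, relationship, obj):
    """Record a relationship and, automatically, its inverse."""
    triples.add((subject, relationship, obj))
    triples.add((obj, INVERSES[relationship], subject))


def related(triples, subject, relationship):
    """Navigate from a resource along one identified relationship."""
    return sorted(o for s, r, o in triples if s == subject and r == relationship)


triples = set()
assert_relationship(triples, "Resource X", "abridgmentOf", "Resource Y")

print(related(triples, "Resource Y", "abridgedAs"))
```

One assertion about Resource X is enough for a user looking at Resource Y to be led to its abridged version, with nobody having to read and interpret a note.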

In contrast, FRBR-ization only exposes what we can assert based on a mapping from MARC to FRBR (or RDA), which is at best the relationships between the FRBR Group 1 entities: the Work, Expression, Manifestation and Item. With the RDA array of identified relationships, we have a whole lot more. I suppose one could say that these are not necessarily part of the FRBR panoply, but if you consider them the “horizontal” relationships that fill in between the “vertical” relationships that Work, Expression, Manifestation and Item provide, then it’s possible to see how these relationships are enabled by the way the FRBR model has allowed us to rethink our world.

This is one of the issues that makes my head hurt when I think about the RDA “testing” regime that we keep hearing about. Are we wedded to the notion that if it can’t be crammed into MARC we aren’t going to use it? Can’t we start to think about MARC as a fairly lossy output format and move on to something that expresses the relationships we know will help us maintain some important functionality and credibility in the broader data world? As Jennifer Bowen and the eXtensible Catalog folks have discovered as they build the services to transform MARC into RDA (see my post on Jennifer’s paper for more about that) transforming MARC to RDA represents a fundamentally different set of problems and trade-offs than going the other way. [By the way, the XC Project was everywhere at Annual–go Jennifer!]

And as more vendors step up to the RDA plate and begin to build applications that start with RDA rather than try and transform MARC into something that could be mistaken for RDA only in dim light, we’re going to have to accept the fact that like any other metadata mapping, there is no such thing as a free lunch or a round trip.

The Registry has some of these relationships registered already (see “RDA Roles” for the relationships between FRBR Group 1 and Group 2 and “RDA Relationships for Works, Expressions, Manifestations, Items” for the relationships between Group 1 entities), but be aware that these are not yet the final versions. I haven’t gotten the information yet about the final changes to allow me to make those updates, but when I do I’ll make an announcement to that effect.

By Diane Hillmann, July 18, 2009, 11:35 am (UTC-5)

Today I got a very disappointing note in my inbox, from the US National Libraries RDA Test Project. I guess I’d call it a “ding” letter, and I have to say it was more than a bit surprising. I had volunteered to help with the testing, not by creating records, mind you, but in analyzing the records other people create. Given the fact that I’ve been the co-chair of the DCMI/RDA Task Group, done the major part of the work in registering the RDA schemas and vocabularies, and have been involved in building the XML schemas that will be the basis of much of the data creation for many early RDA implementations, I figured my experience might come in handy. But apparently not …

Dear Diane: Thank you for your interest in the US National Libraries RDA Test project. The RDA Test Steering Committee regrets that you could not be selected as a formal test participant. Interest in the project was much greater than the Steering Committee originally anticipated, and it was necessary to select test partners from more than 90 applications. Every applicant had a great deal to offer to the project, and each was carefully considered. The Steering Committee based its final selections on the goal of ensuring that the RDA Test will reflect a cross-section of US cataloging agencies balanced by size, type of organization, OPAC and cataloging systems used, and areas of specialization in cataloging and collection development.

The Steering Committee will share the methodology for the test on its Website at URL . If you are interested in conducting your own test of RDA, we encourage you to produce records following this methodology and to share the results with the Steering Committee during the test period.

Thank you again for your interest in the RDA Test.

So, exactly what are they testing that makes my knowledge and experience useless? Darned if I know. But I can’t get beyond the notion that the testing regime I see described on the website is pretty limited, and it’s hard to imagine what the results can really tell us, aside from the obvious difficulties people will encounter in attempting to cram a FRBR-based structure into any one of our current flat MARC-based library systems.

Much more interesting, to me anyway, is the idea of what RDA records might look like in straight XML or RDF, without the necessity of the contortions involved in making it all “fit” into a MARC system. Without the layer of MARC contortion we might really be able to figure out whether catalogers could adjust to RDA and create FRBR-based records. It would be nice to think that some of the open source systems would find a way to play with these records and test some more forward-looking, rather than backward-looking implementation issues.

Any volunteers for an alternate testing regime?

By Diane Hillmann, May 29, 2009, 5:08 pm (UTC-5)

This week, Karen Coyle wrote a post about LCSH as linked data: beyond “dash-dash” which provoked a discussion on the id.loc.gov discussion list.

It seems to me that there are several memes at play in this conversation:

LCSH and SKOS

As Karen points out, LCSH is more than just a simple thesaurus. It’s also a set of instructions for building structured strings in a way that’s highly meaningful for ordering physical cards in a physical catalog. In addition, each string component has specific semantics related to its position in the string, so it’s possible, if everyone knows and agrees on the rules, to parse the string and derive the semantics of each individual component. The result is a pre-coordinated index string.

These stand-alone pre-coordinated strings are perhaps much less meaningful in the context of LOD, but this certainly doesn’t apply to the components. I think what Karen is pointing out is that, while it’s wonderful to have a subset of all of the components that can be used to construct LC Subject Headings published as LOD, there’s enough missing information to reduce the overall value. As I read it, she’s wishing for the missing semantics to be published as part of the LCSH linked data, and hoping that LC doesn’t rest on its well-earned laurels and call it a day.

Structured Strings

Dublin Core calls the rules that define a structured string a "Syntax Encoding Scheme" (SES) and basically, that’s what the rules defining the construction of LC Subject Headings seem to be. It’s structurally no different than saying that the string "05/10/09", if interpreted as a date using an encoding scheme/mask of "mm/dd/yy", ‘means’ day 10 in the month May in the year 2009 using the Gregorian calendar. Fascinatingly, that same ‘date’ can be expressed as a Julian date of "2454962", but I digress.
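The date example can be checked with a short Python sketch. The mask-to-directive table and both function names are my own illustrative choices; the offset 1721425 is the standard difference between Python's proleptic Gregorian ordinal and the Julian Day Number, which is what the seven-digit "Julian date" above appears to be.

```python
from datetime import datetime


def parse_with_mask(value, mask):
    """Interpret a string as a date according to a simple mask such as mm/dd/yy."""
    directives = {"mm": "%m", "dd": "%d", "yy": "%y"}
    fmt = mask
    for token, directive in directives.items():
        fmt = fmt.replace(token, directive)
    return datetime.strptime(value, fmt).date()


def julian_day_number(day):
    """Convert a Gregorian calendar date to its Julian Day Number."""
    return day.toordinal() + 1721425


parsed = parse_with_mask("05/10/09", "mm/dd/yy")
print(parsed.isoformat(), julian_day_number(parsed))
```

The same string read with a mask of "dd/mm/yy" would come out as 5 October 2009, which is exactly why the encoding scheme has to travel with the string.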

As far as I can tell, no one has figured out a universally accepted (or any) way to define the semantic structure of an SES in a way that can be used by common semantic inference engines, and I don’t think that anyone in this discussion is asking for that. What’s needed is a way to say "Here’s a pre-coordinated string expressed as a skos:prefLabel, it has an identity, and here are its semantic components."

Additional data

So…

"Italy--History--1492-1559--Fiction"

…is expressed in id.loc.gov/authorities/sh2008115565#concept as…

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix terms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://id.loc.gov/authorities/sh2008115565#concept>
    skos:prefLabel "Italy--History--1492-1559--Fiction"@en ;
    rdf:type skos:Concept ;
    terms:modified "2008-03-15T08:10:27-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    terms:created "2008-03-14T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    owl:sameAs <info:lc/authorities/sh2008115565> ;
    skos:inScheme
        <http://id.loc.gov/authorities#geographicNames> ,
        <http://id.loc.gov/authorities#conceptScheme> ;
    terms:source "Work cat.: The family, 2001"@en . 

…and has a 151 field expressed in the authority file as…

151 __ |a Italy |x History |y 1492-1559 |v Fiction

…which has the additional minimal semantics of…

<http://id.loc.gov/authorities/sh2008115565#concept>
    loc_id:type "Geographic Name" ; #note that this is also expressed as a skos:inScheme property
    loc_id:topicalDivision "History" ;
    loc_id:chronologicalSubdivision "1492-1559" ;
    loc_id:formSubdivision "Fiction" ;
    loc_id:geographicName "Italy" .
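Deriving those components from the 151 field is mechanical once the subfield codes are known, as this Python sketch suggests. The mapping from subfield codes to property names simply reuses the illustrative loc_id names from the example above; those names are hypothetical, and the parser assumes the "|code value" layout shown for the authority field.

```python
# Subfield codes mapped to the illustrative (hypothetical) property names above.
SUBFIELD_MEANINGS = {
    "a": "geographicName",
    "x": "topicalDivision",
    "y": "chronologicalSubdivision",
    "v": "formSubdivision",
}


def parse_subfields(field_body):
    """Turn '|a Italy |x History ...' into ordered (property, value) pairs."""
    components = []
    for chunk in field_body.split("|")[1:]:
        code, _, value = chunk.strip().partition(" ")
        components.append((SUBFIELD_MEANINGS.get(code, code), value.strip()))
    return components


components = parse_subfields("|a Italy |x History |y 1492-1559 |v Fiction")
heading = "--".join(value for _, value in components)

print(components)
print(heading)
```

Going from subfields to the "dash-dash" string is trivial; going the other way is not, because the pre-coordinated string alone no longer says which part is the place, the period or the form.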

…and this might also be expressed as…

<http://id.loc.gov/authorities/sh2008115565#concept>
   loc_id:type <http://id.loc.gov/authorities/sh2002011429> ;
   loc_id:topicalDivision <http://id.loc.gov/authorities/sh85061212> ;
   loc_id:formSubdivision <http://id.loc.gov/authorities/sh85048050> ;
   loc_id:geographicName <http://id.loc.gov/authorities/n79021783> ;
   dc:temporal "1492-1559" ;
   dc:spatial <http://sws.geonames.org/3175395/> ;
   dc:spatial <http://id.loc.gov/authorities/n79021783> .

Making sure that those strings in the first example are expressed as resource identifiers is also something that I think Karen is asking for. (BTW, the ability to look up a label by URL at id.loc.gov is really useful.)

I should point out that Ed, Antoine, Clay, and Dan’s DC2008 paper detailing the conversion of LCSH to SKOS goes into some detail (see section 2.7) about the LCSH to SKOS mapping, but doesn’t directly address the issue that Karen is raising about mapping the explicit semantics of the subfields.

By Jon, May 20, 2009, 3:45 pm (UTC-5)