I think I mentioned somewhere that one of my current projects is part of the NISO Bibliographic Roadmap effort to develop vocabulary best practices in the areas of use/reuse, documentation and preservation. The project builds on a full-day vocabulary workshop held at the DCMI conference in 2011.

I share co-chair duties on the Use/Reuse subgroup with Daniel Lovins of NYU, but as we’ve all moved along on this project, the overlaps between the three subgroups tasked with writing the best practices documentation (the other groups are tasked specifically with documentation and preservation) have become more and more apparent. We’ve determined that our final recommendations will be made together in one document, rather than spend our time trying to untie the Gordian knot that is vocabulary issues in toto.

At some point I fielded a question from one of the group members about the difference between vocabulary maintenance and vocabulary preservation. It was a great question, and one that I hadn’t really thought much about. To some extent each topic has a basic assumption: for a vocabulary being actively maintained, preservation isn’t much of an issue. But when, for whatever reason, the vocabulary ceases being actively maintained, preservation becomes the issue. One question the Roadmap project is trying to address is how to tell the difference, given the dearth of data generally provided about the who, how and when of ownership and management of vocabularies, much less what policies and practices are being followed. There’s a whole range of issues around preservation: how it should be done and by whom, and what kind of context to wrap around the supine resource.

In some respects, the problem of vocabulary preservation is not easily separated from the loss of funding for projects building vocabularies, as well as for vocabulary development and management tools, almost all of which were initially developed by funded projects. This suggests we have a preservation problem built on a sustainability problem. The report draft as it stands cites several projects that were initially funded, in whole or in part, to address vocabulary provision in particular research or practice communities, but that have not received funding to build out or maintain their tools or their resources. This is a significant issue: without the ability to maintain the infrastructure supporting the structural or conceptual vocabularies required to describe the resources being aggregated or distributed, there can be no proper distribution or maintenance of any data resources. Consider how many projects are now being funded to tackle the problems of ‘big data’, particularly scientific research data, all of whose solutions depend to some extent on metadata vocabularies.

This concern is not that different from what we hear constantly about the physical infrastructure supporting our transportation systems, which is crumbling in ways not dissimilar to our data distribution systems (although without the scary safety implications). Many of us on the academic side of these questions take it for granted that funding comes and goes, and don’t consider the longer-term implications of these ebbs and flows, or how funding agency priorities (generally focused on innovation rather than maintenance) affect the vocabulary environment. But it remains clear that a big reason we talk about vocabulary preservation is that there are long-term implications of depending on funded projects to build and maintain the infrastructure around vocabularies, as well as the vocabularies themselves.

I speak from experience on this issue. The Open Metadata Registry (OMR) was built on funding from the US National Science Foundation (NSF), and when that funding ended in 2007 it should have died the death of most such project-based tools. That it didn’t is due almost entirely to the fact that it was built by two very stubborn people who were pretty hooked on the usefulness of vocabularies. For several years we searched in vain for additional funding, but at that time funding was scarce, and what little there was went to much sexier undertakings. For much of that time the OMR survived on a ‘too cheap to fail’ strategy, where early grant funding had paid ahead for the long-term costs of server space and basic technical upkeep.

As far as I know, the OMR is one of the only free general-purpose vocabulary development and maintenance tools that has survived past its initial funding. At this point a large percentage of the library world’s vocabularies are maintained using the OMR, and although it is built to enable all of them to migrate at any time to another repository or tool, there is not currently much of anything available for them to migrate to. This suggests that sustainable funding is a far more critical issue than we’ve yet considered.

At present the OMR developers are putting together a proposal to fund a sustainability plan for the OMR: to spend the next 2-3 years bringing it up to date as a viable, community-driven Free and Open Source Software project with long-term institutional support for its infrastructure, so that it can be maintained after our retirement. If the plan works, the OMR will remain a place where vocabularies will be safe, secure, and maintainable, with room for maps and application profiles using those vocabularies. Other commercial and research-based options exist, but they are not necessarily friendly to limited budgets or to organizations with limited technical support.

During the later days of the National Science Digital Library (NSDL)–under whose aegis the OMR was developed–the NSF began to include sustainability requirements in its grant process. I don’t know how many other granting agencies mandate those kinds of requirements, but judging by the huge amount of abandoned detritus left by decades of research funding, certainly more should.

By Diane Hillmann, May 5, 2016, 3:04 pm (UTC-5)

Announcement: The vocabulary management mess needs your attention.

In 2013 I wrote a couple of posts critical of some updating practices LC was using in its id.loc.gov services. The first post focused on the relators vocabulary, and how change was handled within that vocabulary.

The follow-up post dealt with the unacknowledged gaps in the LCSH service on id.loc.gov.

So, here we are almost three years later, and not much has changed. I went back through the examples in part 1, and the issues pointed out there remain exactly as noted three years ago. Over those three years there have been many discussions about linked open data, but the vocabulary infrastructure needed to support LOD is still largely not ready for prime time.

And it’s not as if nobody’s been thinking about solutions. We wrote a paper about how version management ought to work in vocabulary services, but it seems to have been overlooked even by the large established services. It’s hard to avoid the conclusion that there just hasn’t been much recognition of the problem we were trying to solve. Trust me, it’s a real problem, and a very big one at that. It’s not just local, or attached to one institution–we’re talking about an international problem, one that could delay the uptake of linked data far longer than we’d like.

For users of vocabularies, the absence of vocabulary services (beyond simple lookup and basic file download) is a large impediment to the actual use of LOD. How does a creator of bibliographic data–whom we’ve been encouraging to use vocabularies–actually use those vocabularies to manage their overall data needs over time? By downloading files and examining diffs every week (month, quarter, year)? Remember when we had to cruise websites to find out when a new software update was available? We’re at that stage right now in vocabulary management, and we’re not making progress towards a service environment that will actually support the use of vocabularies in data.

To think about ‘how big’ the problem is, consider how many vocabularies occur regularly in instance data: ISBD, FRBRer, RDA, schema.org, DC Terms, etc. When terms in those vocabularies change, how do the managers of that instance data know what has changed? Should they be expected to just leave the old data as is, or send it to the cleaners every year or so? Proper vocabulary management practices can be a big part of the answer–the machine-assisted answer.
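As a rough idea of what even minimal machine assistance could look like, here is a small sketch in Python using rdflib; the snapshot filenames are placeholders, not real published downloads. It compares two downloaded copies of a vocabulary and reports, term by term, which statements have been added or removed since the earlier snapshot.

```python
# A sketch of a minimal, machine-assisted change check between two downloaded
# snapshots of a vocabulary. Filenames are placeholders.
from rdflib import Graph

old = Graph().parse("relators-2013.ttl", format="turtle")
new = Graph().parse("relators-2016.ttl", format="turtle")

added = set(new) - set(old)      # statements only in the newer snapshot
removed = set(old) - set(new)    # statements only in the older snapshot

# Group the differences by term so a data manager can see, per property or
# concept, what actually changed between the two downloads.
changes = {}
for s, p, o in added:
    changes.setdefault(s, []).append(("+", p, o))
for s, p, o in removed:
    changes.setdefault(s, []).append(("-", p, o))

for term in sorted(changes):
    print(term)
    for sign, p, o in changes[term]:
        print(f"  {sign} {p} {o}")
```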

So, what’s to be done?

Just about 10 years ago, LC initiated a Working Group on the Future of Bibliographic Control, complete with blue-ribbon membership and a broad remit to look around and suggest a path for re-imagining what they and other libraries were facing as they worked towards a different future. Full-day hearings were held in three locations across the US, and these events drew so much interest that the streaming capabilities set up for the sessions were overwhelmed. I testified at one of those hearings, and–I’m sure you’re surprised–spoke about the value of vocabularies in this brave new world.

But the amazing thing was the level of interest and engagement of the library community in the issues discussed by the WG. I’m not sure I’ve ever seen anything like it, before or since. For a while there, every time I was asked to present at a meeting, the WG report was the desired topic. Literally everyone was talking about it–the community clearly recognized the importance of the effort and wanted to be part of it.

We definitely need something like that again–a place to bring together the community and its experts, to state the problems and brainstorm the solutions. Let’s call it a ‘Library Vocabulary Summit’, for the moment, and roll the possibilities around in our heads. We’d need funding, leadership, and marketing to make it happen. Let’s ALL talk! (Preferably around a large table, face to face, with a relevant agenda).

By Diane Hillmann, April 24, 2016, 9:58 pm (UTC-5)

I ran across a really interesting article last week, and the points it makes have been rocketing around my head as I consider what’s broken (and why) in the small world many of us live in, not to mention how we can fix those broken things. I’d really recommend taking a look at it: “Hail the maintainers: Capitalism excels at innovation but is failing at maintenance, and for most lives it is maintenance that matters more”, by Lee Vinsel & Andrew Russell.

The gist of the authors’ point–that we overvalue innovation and undervalue the maintenance of our existing technologies–is one that ought to resonate particularly with librarians, especially those of us who have toiled in the world of catalogs and the maintenance of same.

“The most unappreciated and undervalued forms of technological labour are also the most ordinary: those who repair and maintain technologies that already exist, that were ‘innovated’ long ago. This shift in emphasis involves focusing on the constant processes of entropy and un-doing – which the media scholar Steven Jackson calls ‘broken world thinking’ – and the work we do to slow or halt them, rather than on the introduction of novel things.”

One of the things that struck me in this article was the connection the authors made between ‘innovation’, associated with high-tech activities primarily seen as the male domain, and ‘maintenance’, the ‘women’s work’ of the age of technology.

“We can think of labour that goes into maintenance and repair as the work of the maintainers, those individuals whose work keeps ordinary existence going rather than introducing novel things. Brief reflection demonstrates that the vast majority of human labour, from laundry and trash removal to janitorial work and food preparation, is of this type: upkeep. This realisation has significant implications for gender relations in and around technology. Feminist theorists have long argued that obsessions with technological novelty obscures all of the labour, including housework, that women, disproportionately, do to keep life on track.”

The authors make the point that these skewed values resonate throughout our society. Notice how boring most people consider the maintenance of our physical infrastructure–roads, bridges, pipelines, etc. Of course, most of us agree that there needs to be more of it, and more resources aimed in that direction, but ho hum, talk about it in any detail? Not on your life.

Those of us considering the data infrastructure to support the rebuilding and expansion of the data sharing that libraries have done for many decades see the same thing. The Open Metadata Registry (OMR to those of us who prefer to speak in acronyms), now too often dismissed as ‘old technology’, is a case in point. Back in 2004, when we received funding from the National Science Foundation, there weren’t many people thinking about data ‘infrastructure’–particularly for vocabulary data–but we were. Back in the day, we often talked about the OMR in the context of the plumbing beneath sidewalks and buildings. Few of us who are not plumbers think about plumbing — we all assume that it’s there, but it’s rarely a topic in non-plumber gatherings until something critical breaks, or a community is systematically poisoned.

And even plumbers innovate, in ways that few of us understand, much less think about. The water debacle in Flint, Mich. has brought up the problems of old pipes, but not many discussions go beyond that. In the world of vocabulary management, the OMR has been building and maintaining vocabularies for well over a decade now, and we know a lot about it. In particular we know where improvements should be made, and how OMR services can best support the activities needed by the folks building instance data. We’re also thinking about how to better sustain the effort over time.

But the OMR was built to work within an environment where there would be many such services, supporting other communities looking at the same changes ahead that we see in the library data environment. For lots of reasons, that array of services hasn’t come about; though we do see activity in the structural metadata world–new formats, old formats morphing, etc.–that effort is almost entirely devoted to semantic churning, not services. Much effort is being put into creating and publishing Linked Open Data, but very little consideration is being given to how that data will need to change over time, and how the entire plumbing infrastructure for publishing and consuming that data must support the ongoing maintenance of both the data and the public and private vocabularies that describe it.

[This is the first of two posts, the second proposing some ways to move forward]

By Diane Hillmann, April 21, 2016, 9:42 am (UTC-5)

Having just celebrated (?) another birthday at the tail end of 2015, I’ve had the topics of age and change even more on my mind than usual. And then two events converged. First I had a chat with Ted Fons in a hallway at Midwinter, and he asked about using an older article I’d published with Karen Coyle back in early 2007 (“Resource Description and Access (RDA): Cataloging Rules for the 20th Century”). The second was a message from ResearchGate reporting that the article in question was easily the most popular thing I’d ever published. My big worry about having Ted use that article was that RDA had been through several sea changes in the nine (!) years since the article was published (Jan./Feb. 2007), so I cautioned Ted about using it.

Then I decided I needed to reread the article and see whether I had spoken too soon.

The historic rationale holds up very well, but it’s important to note that at the time that article was written, the JSC (now the RSC) was foundering, reluctant to make the needed changes to cut ties to AACR2. The quotes from the CC:DA illustrate how deep the frustration was at that time. There was a real turning point looming for RDA, and I’d like to believe that the article pushed a lot of people to be less conservative and more emboldened to look beyond the cataloger tradition.

In April of 2007, a few months after this article came out, ALA Publishing arranged the famous “London Meeting” that changed the course of RDA. Gordon Dunsire and I were at that meeting–in fact it was the first time we met. I didn’t even know much about him aside from his article in the same D-Lib issue. As it turns out, the RDA article was elevated to the top spot, thus stealing some of his thunder, so he wasn’t very happy with me. The decision made in London to allow DCMI to participate by building the vocabularies was a game changer, and Gordon and I were named co-chairs of a Task Group to manage that process.

So as I re-read the article, I realized that the bits that were most important at the time are probably mostly of historical interest now. I think the most important takeaway is that RDA has come a very long way since 2007, and in some significant ways is now leading the pack in terms of its model and vocabulary management policies (more about that to come).

And I still like the title! …even though it’s no longer a true description of the 21st Century RDA.

By Diane Hillmann, February 9, 2016, 9:19 am (UTC-5)

Not long ago I encountered the analysis of BibFrame published by Rob Sanderson with contributions by a group of well-known librarians. It’s a pretty impressive document–well organized and clearly referenced. But in fact there’s also a significant amount of personal opinion in it, the nature of which is somewhat masked by the references to others holding the same opinion.

I have a real concern about some of those points, where assertions of ‘best practices’ are particularly arguable. The one that most sticks in my craw shows up in section 2.2.5:

2.2.5 Use Natural Keys in URIs
References: [manning], [ldbook], [gld-bp], [cooluris]

Although the client must treat URIs as opaque strings, it is good practice to construct URIs in a systematic and human readable fashion for both instances and ontology terms. A natural key is one that appears in the information about the resource, such as some unique identifier for the resource, or the label of the property for ontology terms. While the machine does not care about structure, memorability or readability of URIs, the developers that write the code do. Completely random URIs introduce difficult to detect semantic and algorithmic errors in both publication and consumption of the data.

Analysis:

The use of natural keys is a strength of BIBFRAME, compared to similarly scoped efforts in similar communities such as the RDA and CIDOC-CRM vocabularies which use completely opaque numbers such as P10001 (hasRespondent) or E33 (Linguistic Entity). RDA further misses the target in this area by going on to define multiple URIs for each term with language tagged labels in the URI, such as rda:hasRespondent.en mapping to P10001. This is a different predicate from the numerical version, and using owl:sameAs to connect the two just makes everyone’s lives more difficult unnecessarily. In general, labels for the predicates and classes should be provided in the ontology document, along with thorough and understandable descriptions in multiple languages, not in the URI structure.

This sounds fine so long as you accept the idea that ‘natural’ means English, because, of course, all developers, no matter their first language, must be fluent enough in English to work with English-only standards and applications. This misuse of ‘natural’ reminds me of other problematic usages, such as the former practice in the adoption community (of which I have been a part for 40 years) where ‘natural’ was routinely used to refer to birth parents, thus relegating adoptive parents to the ‘un-natural’ realm. So in this case, if ‘natural’ means English, are all other languages inherently un-natural in the world of development? The library world has been dominated by ‘Anglo-American’ notions of standard practice for a very long time, and happily, RDA is leading away from that, both in governance and in the development of vocabularies and tools.

The Multilingual strategy adopted by RDA is based on the following points:

  1. More than a decade of managing vocabularies has convinced us that opaque identifiers are extremely valuable for managing URIs, because they need not be changed as labels change (only as definitions change). The kinds of ‘churn’ we saw in the original version of RDA (2008-2013) convinced us that label-based URIs were a significant problem (and cost) that became worse as the vocabularies grew over time.
  2. We get the argument that opaque URIs are often difficult for humans to use, but the tools we’re building (the RDA Registry as a case in point) are intended to give human developers what they want for their tasks (human-readable URIs, in a variety of languages) while ensuring that the URIs for properties and values are set up based on what machines need. In this way, the lexical (human-readable) URIs can change and be maintained properly without costly changes to the canonical URIs that travel with the data content itself.
  3. The multiple language translations (and distributed translation management by language communities) also enable humans to build discovery and display mechanisms for users that are speakers of a variety of languages. This has been a particularly important value for national libraries outside the US, but also potentially for libraries in the US meeting the needs of non-English language communities closer to home.

It’s too easy for the English-first library development community to insist that URIs be readable in English and to turn a blind eye to the degree that this imposes understanding of the English language and Anglo-American library culture on the rest of the world. This is not automatically the intellectual gift that the distributors of that culture assume it to be. It shouldn’t be necessary for non-Anglo-American catalogers to learn and understand Anglo-American language and culture in order to express metadata for a non-Anglo audience. This is the rough equivalent of the Philadelphia cheese steak vendor who put up a sign reading “This is America. When ordering speak in English”.

We understand that for English-speaking developers http://bibframe.org/vocab/title is initially easier to use than http://rdaregistry.info/Elements/w/P10088 or even (heaven forefend!) “130_0#$a” (in RDF: http://marc21rdf.info/elements/1XX/M1300_a). That’s why RDA provides http://rdaregistry.info/Elements/w/titleOfTheWork.en but also, eventually, http://rdaregistry.info/Elements/w/拥有该作品的标题.ch and http://rdaregistry.info/Elements/w/tieneTítuloDeLaObra.es, et al (you do understand Latin of course). These ‘unnatural’ Lexical Aliases will be provided by the ‘native’ language speakers of their respective national library communities.
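To make the mechanics of that strategy concrete, here is a minimal sketch in Python using rdflib of the pattern described in points 1-3 above: instance data carries only the opaque canonical URI, and a display layer pulls a label in the reader’s language from the vocabulary. The label strings and the example work URI are illustrative, not quoted from the RDA Registry.

```python
# A sketch of opaque canonical URIs plus per-language labels. The labels and
# the example work URI are illustrative, not taken from the RDA Registry.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDFS

RDAW = Namespace("http://rdaregistry.info/Elements/w/")

# The vocabulary: one opaque canonical property with labels in several languages.
vocab = Graph()
vocab.add((RDAW.P10088, RDFS.label, Literal("has title of the work", lang="en")))
vocab.add((RDAW.P10088, RDFS.label, Literal("tiene título de la obra", lang="es")))

# Instance data stores only the canonical, language-neutral URI...
work = URIRef("http://example.org/works/1")
data = Graph()
data.add((work, RDAW.P10088, Literal("Pride and Prejudice")))

# ...and a display layer resolves that URI to a label in the user's language.
def label_for(term, lang):
    for lbl in vocab.objects(term, RDFS.label):
        if lbl.language == lang:
            return str(lbl)
    return str(term)  # fall back to the opaque URI if no label in that language

for s, p, o in data:
    print(label_for(p, "es"), "->", o)
```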

As one of the many thousands of librarians who ‘speak’ MARC to one another–despite our language differences–I am loath to give up that international language to an English-only world. That seems like a step backwards.

By Diane Hillmann, January 3, 2016, 5:05 pm (UTC-5)

Metadata standards are a huge topic, and evaluation is a difficult task, one I’ve been involved in for quite a while. So I was pretty excited when I saw the link for “DRAFT Principles for Evaluating Metadata Standards”, but after reading it? Not so much. If we’re talking about “principles” in the sense of ‘stating-the-obvious-as-a-first-step’, well, okay—but I’m still not very excited. I do note that the earlier version link uses the title ‘draft checklist’, and I certainly think that’s a bit more realistic than ‘draft principles’ for this effort. But even taken as a draft, the text manages to use lots of terms without defining them—not a good thing in an environment where semantics is so important. Let’s start with a review of the document itself; then maybe I can suggest some alternative paths forward.

First off, I have a problem with the preamble: “These principles are intended for use by libraries, archives and museum (LAM) communities for the development, maintenance, governance, selection, use and assessment of metadata standards. They apply to metadata structures (field lists, property definitions, etc.), but can also be used with content standards and value vocabularies”. Those tasks (“development, maintenance, governance, selection, use and assessment”) are pretty all-encompassing, and yet the connection between those tasks and the overall “evaluation” is unclear. And, of course, without definitions, it’s difficult to understand how ‘evaluation’ relates to ‘assessment’ in this context—are they the same thing?

Moving on to the second part, about what kinds of metadata standards might be evaluated, we have a very general term, ‘metadata structures’, with what look to be examples of such structures (field lists, property definitions, etc.). Some would argue (including me) that a field list is not a structure without some notion of the connections between the fields; and although property definitions may be part of a ‘structure’ (as I understand it, at least), they are not a structure per se. And what is meant by the term ‘content standards’, and how is that different from ‘metadata structures’? The term ‘value vocabularies’ goes by many names and is not something that can go without a definition. I say this as an author/co-author of a lot of papers that use this term, and we always define it within the context of the paper for just that reason.

There are many more places in the text where fuzziness in terminology is a problem (maybe not a problem for a checklist, but certainly for principles). Some examples:

1. What is meant by ‘network’? There are many different kinds, and if you mean to refer to the Internet, for goodness’ sake say so. ‘Things’ rather than ‘strings’ is good, but it will take a while to make that happen in legacy data, which we’ll be dealing with for some time, most likely forever. Prospectively created data is a bit easier, but still not a cakewalk — if the ‘network’ is the global Internet, then “leveraging ‘by-reference’ models” presents yet-to-be-solved problems of network latency, caching, provenance, security, persistence, and most importantly, stability. Metadata models for both properties and controlled values are an essential part of LAM systems, and simply saying that metadata is “most efficient when connected with the broader network” doesn’t necessarily make it so.

2. ‘Open’ can mean many things. Are we talking about specific kinds of licenses, or the lack of a license? What kind of re-use are you talking about? Extension? Wholesale adoption with namespace substitution? How does semantic mapping fit into this? (In lieu of a definition, see the paper cited at (1) below.)

3. This principle seems to imply that “metadata creation” is the sole province of human practitioners, and it seriously muddies the meaning of the word ‘creation’ by drawing a distinction between passive system-created metadata and human-created metadata. Metadata is metadata, and standards apply regardless. What do you mean by ‘benefit user communities’? Whose communities? What is meant by ‘value’ in this context? How would metadata practitioners ‘dictate the level of description provided based on the situation at hand’?

4. As an evaluative ‘principle’ this seems overly vague. How would you evaluate a metadata standard’s ability to ‘easily’ support ‘emerging’ research? What is meant by ‘exchange/access methods’ and what do they have to do with metadata standards for new kinds of research?

5. I agree totally with the sentence “Metadata standards are only as valuable and current as their communities of practice,” but the one following makes little sense to me. “ … metadata in LAM institutions have been very stable over the last 40 years …” Really? It could easily be argued that the reason for that perceived stability is the continual inability of implementers to “be a driving force for change” within a governance model that has at the same time been resistant to change. The existence of the DCMI usage board, MARBI, the various boards advising the RDA Steering Committee, all speak to the involvement of ‘implementers’. Yet there’s an implication in this ‘principle’ that stability is liable to no longer be the case and that implementers ‘driving’ will somehow make that inevitable lack of stability palatable. I would submit that stability of the standard should be the guiding principle rather than the democracy of its governance.

6. “Extensible, embeddable, and interoperable” sounds good, but each is more complex than this triumvirate suggests. Interoperability in particular is something we should all keep in mind, but, however admirable, it rarely succeeds in practice because of the incompatibility of different models. DC, MARC21, BibFrame, RDA, and Schema.org are examples of this — despite their ‘modularity’, they generally can’t simply be used as ‘modules’ because of differences in the thinking behind each model and in their respective audiences.

I would also argue that ‘lite style implementations’ make sense only if ‘lite’ means a dumbed-down core that can be mapped to by more detailed metadata. But stressing ‘lite implementations’ as a specified part of an overall standard gives too much power to the creator of the standard, rather than the creator of the data. Instead we should encourage the use of application profiles, so that the particular choices and usages of the creating entity are well documented, and others can use the data in full or in part according to their needs (a sketch of what a machine-checkable profile might look like follows this list). I predict that lossy data transfer will be less acceptable in reality than it is in the abstract, and would suggest that dumb data is more expensive over the longer term (and certainly doesn’t support ‘new research methods’ at all). “Incorporation into local systems” can really only be accomplished by building local systems that adhere to their own local metadata model and are able to map that model in and out to more global models. Extensible and embeddable are very different from interoperable in that context.

7. The last section, after the inarguable first sentence, describes what the DCMI ‘dumb-down’ principle defined nearly twenty years ago, and that strategy still makes sense in a lot of situations. But ‘graceful degradation’ and ‘supporting new and unexpected uses’ requires smart data to start with. ‘Lite’ implementation choices (as in #6 above) preclude either of those options, IMO, and ‘adding value’ of any kind (much less by using ‘ontological inferencing’) is in no way easily achievable.
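Picking up the application profile suggestion from point 6, here is a purely illustrative sketch in Python using rdflib; the property choices and the dict-based profile shape are hypothetical, not a published profile format. The point is simply that a creating entity’s documented choices can be checked by machine before a consumer decides how much of the data to take.

```python
# A purely illustrative profile check: the profile lists which properties the
# data creator has promised to supply and which are optional. The property
# choices and the dict-based profile shape are hypothetical.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS

profile = {
    "required": {DCTERMS.title, DCTERMS.creator},
    "optional": {DCTERMS.subject, DCTERMS.issued},
}

def check_against_profile(g, resource, profile):
    """Report required properties that are missing and properties outside the profile."""
    present = set(g.predicates(resource, None))
    missing = profile["required"] - present
    outside = present - profile["required"] - profile["optional"]
    return missing, outside

# Usage: a record missing dcterms:creator and carrying a local property the
# profile doesn't document.
g = Graph()
rec = URIRef("http://example.org/records/1")
g.add((rec, DCTERMS.title, Literal("Emma")))
g.add((rec, URIRef("http://example.org/local/shelfLocation"), Literal("PR4034")))

missing, outside = check_against_profile(g, rec, profile)
print("missing:", [str(p) for p in missing])
print("outside profile:", [str(p) for p in outside])
```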

I intend to be present at the session in Boston [9:00-10:00 Boston Conference and Exhibition Center, 107AB] and since I’ve asked most of my questions here I intend not to talk much. Let’s see how successful I can be at that!

It may well be that a document this short and generalized isn’t yet ready to be a useful tool for metadata practitioners (especially without definitions!). That doesn’t mean that the topics that it’s trying to address aren’t important, just that the comprehensive goals in the preamble are not yet being met in this document.

There are efforts going on in other arenas–the NISO Bibliographic Roadmap work, for instance–that should have an important impact on many of these issues, which suggests that it might be wise for the Committee to pause and take another look around. Maybe a good glossary would be an important step?

(1) Dunsire, Gordon, et al. “A Reconsideration of Mapping in a Semantic World”. Paper presented at the International Conference on Dublin Core and Metadata Applications, The Hague, 2011. Available at: http://dcpapers.dublincore.org/pubs/article/view/3622/1848

By Diane Hillmann, December 14, 2015, 4:59 pm (UTC-5)

The Jane-athon series is alive, well, and expanding its original vision. I wrote about the first ‘official’ Jane-athon earlier this year, after the first event at Midwinter 2015.

Since then the excitement generated at the first one has spawned others:

  • the Ag-athon in the UK in May 2015, sponsored by CILIP
  • the Maurice Dance in New Zealand (October 16, 2015 at the National Library of New Zealand in Wellington, focused on Maurice Gee)
  • the Jane-in (at ALA San Francisco at Annual 2015)
  • the RLS-athon (November 9, 2015, Edinburgh, Scotland), following the JSC meeting there and focused on Robert Louis Stevenson

Like good librarians, we have an archive of the Jane-athon materials, for use by anyone who wants to look at or use the presentations or the data created at the Jane-athons.

We’re still at it: the next Jane-athon in the series will be the Boston Thing-athon at Harvard University on January 7, 2016. The list of topics gives a good idea of how the Jane-athons are morphing to a broader focus than that of a single creator, while still training folks to create data with RIMMF. The first three topics (which may change–watch this space) focus not on specific creators, but on moving forward on topics identified as being of interest to a broader community.

* Strings vs things. A focus on replacing strings in metadata with URIs for things.
* Institutional repositories, archives and scholarly communication. A focus on issues in relating and linking data in institutional repositories and archives with library catalogs.
* Rare materials and RDA. A continuing discussion on the development of RDA and DCRM2, begun at the JSC meeting and the international seminar on RDA and rare materials held in November 2015.

For beginners new to RDA and RIMMF but with an interest in creating data, we offer:
* Digitization. A focus on how RDA relates metadata for digitized resources to the metadata for original resources, and how RIMMF can be used to improve the quality of MARC 21 records during digitization projects.
* Undergraduate editions. A focus on issues of multiple editions that have little or no change in content vs. significant changes in content, and how RDA accommodates the different scenarios.

Further on the horizon is a recently approved Jane-athon for the AALL conference in July 2016, focusing on Hugo Grotius (inevitably, a Hugo-athon, but there’s no link yet).

NOTE: The Thing-athon coming up at ALA Midwinter is being held on Thursday rather than the traditional Friday, to open attendance to those who have other commitments on Friday. Another new wrinkle is the venue–an actual library, away from the conference center! Whether you’re a cataloger or not-a-cataloger, there will be plenty of activities and discussions that should pique your interest. Do yourself a favor and register for a fun and informative day at the Thing-athon to begin your Midwinter experience!

Instructions for registering (whether or not you plan to register for MW) can be found on the Toolkit Blog.

By Diane Hillmann, December 7, 2015, 11:19 am (UTC-5)

Those of you who pay attention to politics (no matter where you are) are very likely to be shaking your heads over candidates, results or policy. It’s a never-ending source of frustration and/or entertainment here in the U.S., and I’ve noticed that the commentators seem to be focusing in on issues of ideology and faith, particularly where they bump up against politics. The visit of Pope Francis seemed to take up everyone’s attention while he was here, but though the event added some ‘green’ to the discussion, it hasn’t pushed much off the political plate.

Politics and faith bump up against each other in the metadata world, too. With traditionalists still thinking in MARC tags and AACR2 at one end, and the technical types rolling their eyes at any mention of MARC and pushing the conversation towards RDA, RDF, BibFrame, schema.org, etc. at the other, there are plenty of metadata politics available to flavor the discussion.

The good news for us is that the conflicts and differences we confront in the metadata world are much more amenable to useful solutions than the politics crowding our news feeds. I remember well the days when the choice of metadata schema was critical to projects and libraries. Unfortunately, we’re all still behaving as if the proliferation of ‘new’ schemas makes the whole business more complicated–that’s because we’re still thinking we need to choose one or another, ignoring the commonality at the core of the new metadata effort.

But times have changed, and we don’t all need to use the same schema to be interoperable (just as we don’t all need to speak English or Esperanto to communicate). What we do need to think about is what our organization needs at all stages of the workflow: from creating, publishing and consuming, through integrating our metadata to make it useful in the various efforts in which we engage.

One thing we do need to consider as we talk about creating new metadata is whether it will need to work with other data that already exists in our institution. If MARC is what we have, then one requirement may be the ability to maintain the level of richness we’ve built up in the past and still move that rich data forward with us. This suggests to me that RDA, which RIMMF has demonstrated can be losslessly mapped to and from MARC, might be the best choice for the creation of new metadata.

Back in the day, when Dublin Core was the shiny new thing, the notion of ‘dumb-down’ was hatched, and though not an elegantly named principle, it still works. It says that rich metadata can be mapped fairly easily into a less-rich schema (‘dumbed down’), but once transformed in a lossy way, it can’t easily be ‘smartened up’. In a world of many publishers and many consumers of linked data, the strategy of transforming rich metadata into any number of other schemas and letting the consumer choose what they want is fairly straightforward, and it does not require firm knowledge (or correct assumptions) of what consumers actually need. This is a strategy well tested with OAI-PMH, which established a floor of Simple Dublin Core but encouraged the provision of any number of other formats as well, including MARC.
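To see why the lossiness runs only one way, here is a small sketch in Python using rdflib; the ‘rich’ property URIs and the mapping table are stand-ins, not a real crosswalk. Distinct roles collapse into a single Simple Dublin Core element on the way down, and the output carries nothing that would let a consumer reconstruct them.

```python
# A sketch of 'dumb-down': the 'rich' property URIs and the mapping table are
# stand-ins, not a published crosswalk.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import DC

EX = Namespace("http://example.org/rich/")
DUMB_DOWN = {
    EX.hasAuthor: DC.creator,
    EX.hasIllustrator: DC.contributor,
    EX.hasTranslator: DC.contributor,   # distinct roles flatten into one element
}

rich = Graph()
book = URIRef("http://example.org/books/1")
rich.add((book, EX.hasAuthor, Literal("Austen, Jane")))
rich.add((book, EX.hasIllustrator, Literal("Thomson, Hugh")))

# Dumbing down is a simple, mechanical walk over the rich data...
simple = Graph()
for s, p, o in rich:
    simple.add((s, DUMB_DOWN.get(p, DC.description), o))  # unmapped -> catch-all

print(simple.serialize(format="turtle"))
# ...but nothing in the output records which role each name originally carried,
# which is why the result can't be 'smartened up' again.
```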

As consumers, libraries and other cultural institutions are also better served by choices. Depending on the services they’re trying to support, they can choose the flavor of data that meets their needs best, instead of being offered only what the provider assumes they want. This strategy leaves open the possibility of serving MARC as one of the choices, allowing those institutions still nursing an aged ILS to continue to participate.

Of course, the consumers of data need to think about how they aggregate and integrate the data they consume, how to improve that data, and how to make their data services coherent. That’s the part of the new create, publish, consume, integrate cycle that scares many librarians, but it shouldn’t–really!

So, it’s not about choosing the ‘right’ metadata format; it’s about having a fuller and more expansive notion of sharing data, and learning some new skills. Let’s kiss the politics goodbye and get on with it.

By Diane Hillmann, October 12, 2015, 10:08 am (UTC-5)

A decade ago, when the Open Metadata Registry (OMR) was just being developed as the NSDL Registry, the vocabulary world was a very different place than it is today. At that point we were tightly focused on SKOS (not fully cooked at that point, but Jon was on the WG that was developing it, so we felt pretty secure diving in).

But we were thinking about versioning in the Open World of RDF even then. The NSDL Registry kept careful track of all changes to a vocabulary (who, what, when), and the only way to get data in was through the user interface. We ran an early experiment in making versions based on dynamic, timestamp-based snapshots (we called them ‘time slices’; Git calls them ‘commit snapshots’) available for value vocabularies, but this failed to gain any traction. This seemed to be partly because, well, it was a decade ago, for one, and while the approach attempted to solve an Open World problem with versioned URIs, it created a new set of problems for Closed World experimenters. Ultimately, we left the versioning issue to sit and stew for a bit (6 years!).
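The ‘time slice’ idea can be sketched in a few lines of Python; this is the general shape of the approach rather than the Registry’s actual implementation, and the concept, labels and dates are invented. Every edit is recorded as a timestamped add or remove, so the vocabulary as it stood at any moment can be replayed on demand.

```python
# A sketch of the 'time slice' idea in the abstract (not the Registry's actual
# code): every edit is a timestamped add or remove, and any version of the
# vocabulary can be replayed. The concept, labels and dates are invented.
from datetime import datetime
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/vocab/")

history = [  # oldest first
    (datetime(2006, 3, 1), "add",    (EX.c1, SKOS.prefLabel, Literal("Earth science", lang="en"))),
    (datetime(2007, 9, 5), "remove", (EX.c1, SKOS.prefLabel, Literal("Earth science", lang="en"))),
    (datetime(2007, 9, 5), "add",    (EX.c1, SKOS.prefLabel, Literal("Geoscience", lang="en"))),
]

def time_slice(history, as_of):
    """Replay the change log up to `as_of` and return the vocabulary as it stood then."""
    g = Graph()
    for ts, action, triple in history:
        if ts > as_of:
            break
        if action == "add":
            g.add(triple)
        else:
            g.remove(triple)
    return g

# The vocabulary as it looked at the end of 2006, versus today.
print(time_slice(history, datetime(2006, 12, 31)).serialize(format="turtle"))
print(time_slice(history, datetime.now()).serialize(format="turtle"))
```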

All that began to change in 2008, when we started working with RDA and needed to move past value vocabularies into properties and classes, and beyond that into issues around uploading data into the OMR. More recently, Git and GitHub have taken off, providing a way for us to make some important jumps in functionality that have culminated in the OMR/GitHub-based RDA Registry. It sounds easy and intuitive now, but it sure wasn’t at the time, and what most people don’t know is that the OMR is still where RDA/RDF data originates — it wasn’t supplanted by Git/GitHub, but is chugging along in the background. The OMR’s RDF CMS is still visible and usable by all, but folks managing larger vocabularies now have more options.

One important aspect of the use of Git and GitHub was the ability to rethink versioning.

Just about a year ago our paper on this topic (Versioning Vocabularies in a Linked Data World, by Diane Hillmann, Gordon Dunsire and Jon Phipps) was presented at the IFLA satellite meeting in Paris. We used as our model the way software on our various devices and systems is updated–more and more, these changes happen without much (if any) interaction with us.

In the world of vocabularies defining the properties and values in linked data, most updating is still very manual (if done at all), and the important information about what has changed and when is often hidden behind web pages or downloadable files that provide no machine-understandable connections identifying changes. And just solving the change management issue does little to solve the inevitable ‘vocabulary rot’ that can make published ‘linked data’ less and less meaningful, accurate, and useful over time.

Building stable change management practices is a critical missing piece of the linked data publishing puzzle. The problem will grow exponentially as language versions and inter-vocabulary mappings start to show up as well — and it won’t be too long before that happens.

Please take a look at the paper and join in the conversation!

By Diane Hillmann, September 20, 2015, 6:41 pm (UTC-5)

Most of us in the library and cultural heritage communities interested in metadata are well aware of Tim Berners-Lee’s five-star ratings for linked open data (in fact, some of us actually have the mug).

The five-star rating for LOD, intended to encourage us to follow five basic rules for linked data, is useful, but as we’ve discussed it over the years, a basic question keeps rising up: what good is linked data without (property) vocabularies? Vocabulary manager types like me and my peeps are always thinking like this, and recently we came across solid evidence that we are not alone in the universe.

Check out “Five Stars of Linked Data Vocabulary Use”, published last year in the Semantic Web Journal. The five authors posit that TBL’s five-star linked data is just the precondition for what we really need: vocabularies. They point out that the original five-star rating says nothing about vocabularies, but that linked data without vocabularies is not useful at all:

“Just converting a CSV file to a set of RDF triples and linking them to another set of triples does not necessarily make the data more (re)usable to humans or machines.”

Needless to say, we share this viewpoint!

I’m not going to steal their thunder and list all five star categories here–you really should read the article (it’s short)–but I will note that the lowest level is a zero-star rating that covers LD with no vocabularies. The five-star rating is reserved for vocabularies that are linked to other vocabularies, which is pretty cool, and not easy for the original publisher to accomplish as a soloist.

These five-star ratings are a terrific start on the kind of good-practices documentation for vocabularies used in LOD that we’ve had in our minds for some time. Stay tuned.

By Diane Hillmann, August 7, 2015, 1:50 pm (UTC-5)