A New HUB Proposal Takes Form

Three weeks ago we announced that individuals from a number of institutions across the United States got together in Boulder, Colorado, to discuss how the biological and informatics communities might go about responding to the Advancing Digitization of Biological Collections (ADBC) solicitation from NSF. For many, this solicitation represents a once-in-a-lifetime opportunity to tackle, in a coordinated manner, a national leadership challenge. If we can work together effectively, we can make significant inroads towards digitizing all biological specimens and data, both recent and paleontological, in collections around the United States.

When the meeting in Boulder concluded, all participants agreed that high levels of transparency, communication, and involvement by the community were both needed and expected in the development of any and all HUB proposals. Since then, many of us have come to feel that these expectations have not been met in a meaningful way, and the resulting silence over the last three weeks has become a liability to the community.

We believe that it is vital for able members of the community to step up, move the process forward, and keep the broad community abreast of developments. To wait until all of the awards are announced before beginning the work of community organization and communication would be to waste the first six months to one year of the HUB’s limited time. Therefore, it is critical that the community begin to come together now to support the success of the HUB, regardless of who is selected to lead.

In this spirit, we’d like to announce that as the end result of many discussions over the last couple of weeks, a proposal for a HUB is going forward with CU Boulder as the lead institution and with one of us (Guralnick) as the lead Principal Investigator. As yet no one is bound to remain in this collaboration, nor are we certain that others will not join, but there was broad support from Yale, University of Kansas, Berkeley, the Field Museum, University of New Mexico, Tulane, and Harvard for this HUB arrangement.

No doubt there are others of you considering submitting HUB proposals, and if so we’d like to hear from you via this blog. Why is this? The simple answer is that we firmly believe that regardless of who finally obtains funding from NSF, more feedback from the community will lead to a better HUB. This blog was started to further that discussion.

Some of the things we want to know from members of the broad community:

  • What do you want a Home Uniting Biocollections (HUB) to do?
  • How might we start that process now?
  • What kinds of things will help you the most, whether you are planning to submit a Thematic Collections Network (TCN) proposal or not?

Of course, we have our ideas (and we’ll want to share them), but there is a chance for us to begin syncing up now as opposed to later.

As we move forward, we’ll be using the blog forum here, listservers, and the NSF wiki to engage you and ask questions. It is our belief that community participation now will pay big dividends when this process actually begins, no matter who ends up leading the HUB. We hope that you, individuals and groups from all biocollections, will take the time to comment and provide input.


Rob Guralnick, University of Colorado

Christopher Norris, Yale University/SPNHC

David Bloom, VertNet


4 Responses to A New HUB Proposal Takes Form

  1. Hilmar Lapp says:

    Hi – are you going to develop the proposal draft at an openly accessible location? For encouraging broad feedback along the way, that might be the most effective mechanism.

    • nsfadbc says:

      Thank you for your question, Hilmar.

      We agree that this would be an effective mechanism to encourage feedback. We are attempting to develop a process that will provide this open conversation about the proposal AND allow us to actually get some writing accomplished by the deadline.

      Stay tuned. We’ll be posting more about our process here, and at the NSF ADBC wiki, soon.

      -David Bloom

  2. anonymous says:

    Posted on behalf of the anonymous poster:

    “We want an “Easy” button. We want fewer choices and clearer instructions on how to digitize our specimens. We want consensus.

    I am tired of hearing about “best practices” when the fact is few people ever write them up. Those you can find may have been written a decade ago. It’s still a waste of hours or days of digging, time none of us has. Plus, the paralyzing suspicion we will choose to apply the wrong “best practice” holds many of us back.

    I believe one role of the HUB should be to truly locate and promote the BEST practice for each aspect of the task before us. I know this is a moving target. I think an annual competition in various categories for various organisms should be held, scrutinized by domain experts and the standards community, peer-reviewed/voted upon, and the winning ones should be widely promoted, with downloads/links/instructions/documentation available on the HUB website.

    Down with the laissez-faire! An organizing body with no power to refine our idiosyncratic practices won’t help us at all. We have had our long period of inventions and divergence, and now we have so many choices we don’t know what to choose. Pick the best, help us use it, and then at least the inventors can agree to focus on improving on THAT.

    –An anonymous collections manager”

  3. Joe Lapp says:

    I’ve read through the Program Solicitation. I commend the undertaking, but I fear that this effort fails to acknowledge the reality that everybody wants the autonomy to organize and process data any way they want. I fear that the net effect of this effort will be to create yet another database system and further fragment the space. There is a better way to proceed.

    My name is Joe Lapp. I am co-inventor of the technology that enables businesses to do transactions with other businesses over the Internet (remote procedure call via XML) and co-creator of numerous XML technologies, including XPath (seeded from XQL), which has become a necessary component of every web browser. I’ve taken a strong interest in spiders and have become familiar with some of the problems facing biological data.

    In short, the Program Solicitation appears to be a call to make U.S. biological databases universally accessible. As written, it requires that there be a universal schema for biological data, or at least for collections data. (“All data from the TCNs will be made available through the national HUB…”) Aside from the technical, political, and cultural hurdles that must be overcome to achieve a schema that everyone can and will use, this is a bad idea. The larger and more complex the platform, the higher the barrier to new application and data processing development, and the less development there will be — or rather, the more development there will be outside the platform. This approach stifles innovation, especially to the degree that compatibility with the platform is required, and promotes further fragmentation as people continue to create specialized, offline databases without platform burden.

    The solution NSF is proposing is top-down, where experts and authorities dictate a solution to others. This approach has a rich history, but the outcome is universally the same. The solution captures best practices of the day, wins the hearts of the well-funded and well-meaning, and immediately becomes obsolete as more innovative approaches emerge that can’t be backfit into the behemoth.

    A robust, lasting solution is only available bottom-up. A bottom-up solution is an enabling technology that wins hearts and minds because it provides an immediate benefit for little investment. It does not immediately integrate all databases but instead creates a path to usefully exposing all databases in the future. A bottom-up solution solves one important problem, perhaps the core problem, and postpones solutions to the other problems, allowing those solutions to emerge organically as innovative extensions to the enabling technology. A marketplace of solutions emerges, and the best of them naturally rise to the top and become dominant without intercession by experts and authorities.

    The web server is an example of an enabling technology. It was a simple thing that allowed for hypertext links between documents. Look at all that has organically emerged since.

    While the web is a database of unstructured, human-language data, biological data is structured. Both are already distributed databases. The web already has a means for searching its database — keyword searches via search engines. Biological data does not, and seemingly cannot because the schemas vary so dramatically. Yet biological databases *must* vary in structure because biological techniques and insight are constantly evolving, and must be encouraged to continue doing so.

    So we can see which problem a bottom-up solution must first solve. It must allow for random access across the sea of diverse biological databases while still allowing each database to do its own thing.

    The solution turns out to be pretty simple. Establish a minimal, schema-independent protocol for sharing data between databases, and tier the databases. Each tier specializes in a domain and understands its own schemas, none of which need conform to some universal standard. Tiers can be of any size, from aggregating data for just a few individuals to aggregating data from several large governmental organizations. Critically, each tier coordinates only with its data sources. Tiers can organize around any level of taxa or any favored approach to data or even on the basis of nothing but partnership. Each tier does what it wants, but it is answerable to the requirements of the next higher tiers with which it wishes to participate. This creates an interplay of top-down and bottom-up requirements that allows any group to find its data solution, whether it already exists or must be mustered up as a new tier.
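    The tiering idea can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not a specified design: the names Database, Tier, and fetch, the "path"/"data" envelope, and the sample records are all hypothetical. Each source keeps whatever fields it likes, and a tier talks only to its direct sources.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List

# A record is schema-free: each source chooses its own keys (assumption).
Record = Dict[str, object]
# Hypothetical envelope: {"path": [tier names..., database name], "data": record}
Envelope = Dict[str, object]


@dataclass
class Database:
    """A leaf data source with its own, unconstrained schema."""
    name: str
    rows: List[Record]

    def fetch(self, match: Callable[[Record], bool]) -> Iterator[Envelope]:
        # Return matching rows untouched, tagged only with where they came from.
        for row in self.rows:
            if match(row):
                yield {"path": [self.name], "data": row}


@dataclass
class Tier:
    """An aggregator that coordinates only with its direct sources."""
    name: str
    sources: list = field(default_factory=list)  # Database or Tier objects

    def fetch(self, match: Callable[[Record], bool]) -> Iterator[Envelope]:
        # Forward the query downward and prepend this tier to the path.
        for source in self.sources:
            for envelope in source.fetch(match):
                yield {"path": [self.name] + envelope["path"],
                       "data": envelope["data"]}


# Two databases with different schemas (made-up sample data).
spiders = Database("spider_lab", [{"taxon": "Latrodectus", "county": "Travis"}])
fossils = Database("paleo_museum", [{"genus": "Latrodectus", "epoch": "Miocene"}])

arachnids = Tier("arachnids", [spiders])
national = Tier("national", [arachnids, fossils])

# The top tier never needs to know either schema to search across both.
hits = list(national.fetch(lambda record: "Latrodectus" in record.values()))
for hit in hits:
    print("/".join(hit["path"]), hit["data"])
```

    The top tier here reaches both databases without either one changing its field names; interpreting the differing fields is left to whichever tier specializes in that domain.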

    This approach allows for random access across all biological databases that find a tier with which to participate. Each tier can provide its own access mechanisms specialized for its domain of interest. This is one level better than what we now have. The NSF “nationalized database” interest can be satisfied by the topmost tiers that integrate with large subtiers, which themselves might integrate both deeply across subtiers and horizontally across databases. These top tiers could provide “views” into the vast collections of data beneath them, at least into the data that these tiers have opted to expose. However, as necessitated by bottom-up solutions, this world would be somewhat distant. Given the impediments to establishing a universal schema and rolling out complex platforms, timing could still prove similar.

    I want to see this world happen. This architecture has the potential to reshape far more than biological data. I’ve been working on other technologies that support this and similar data decentralization ends. I work as an independent and gladly offer my services to assist with realizing a bottom-up vision.
