The following is an attempt to shed some light on an important and sometimes confusing aspect of the digitization process – georeferencing – in the context of the Advances in Digitization of Biological Collections. Georeferencing, in our discipline, refers to the assignment of an analytical representation of a place where an event, such as a collecting event, occurred on the earth. The primary purpose of this posting is to give a brief statement about the state of the art and give a few metrics and perspectives that may help those writing TCNs or other digitization proposals. This posting was motivated by several requests for information on the subject, of which the following is a good example.
Corinna Gries wrote:
Now questions regarding georeferencing.
John, are you anticipating the HUB developing/automating workflows for that? And should we anticipate those to be available somewhat later in the process? We have now put in for georeferencing during the later part of the project and are anticipating that we still need some workforce on the ground. But should we also anticipate to be involved with developing technology to automate workflows somewhat more than they currently are? I am not talking developing new approaches. But given the large number of herbaria we are working with and their various approaches to data management we need to have those workflows on-line rather than connected to a desktop application. Is that something you are anticipating as well?
John Wieczorek wrote:
A HUB will definitely have to better document known-good georeferencing workflows. A reasonable HUB would also consult with projects to optimize for their specific situation and provide the necessary support to make sure that any proposed georeferencing work plan is successful. As there is no way for a HUB to anticipate the georeferencing workload and budget for it, the TCNs and other digitization projects will have to budget to get the work done, whether through personnel within the projects, or from others willing to do it.
The three fundamental steps in the georeferencing process will be Prepare, Collaborate, and Repatriate. Anyone who will do georeferencing should definitely plan to participate in a week-long workshop to be trained as part of the Prepare phase. Anyone who will georeference and has never been to one and thinks s/he doesn’t need to is a prime candidate to attend. The training workshops are quite mature and proven effective. More than twenty of them have been given internationally and over 500 people have been trained. The latest, co-sponsored by GBIF and TanBIF, took place 26-29 Oct 2010 in Dar es Salaam, Tanzania.
The longer you can wait, the better documented the georeferencing workflows will be. The tools are all fully functional now, and further development is already funded from grants outside the HUB, so the tools WILL only get better. But there is another reason to wait as long as you can to georeference. The more you have digitized when you start to georeference, the greater the economies of scale, as you want to make your first pass georeferencing locations, not specimens.
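To illustrate what georeferencing locations rather than specimens can mean in practice, here is a minimal sketch that collapses specimen records into distinct localities, so each locality is handled once and the result can be applied to every specimen that shares it. The field names (catalog_number, locality) and the simple whitespace and case normalization are assumptions made for this example only; they are not taken from any particular tool or workflow.

```python
from collections import defaultdict


def group_by_locality(records):
    """Group specimen catalog numbers by a normalized locality string.

    `records` is a list of dicts with the hypothetical keys
    'catalog_number' and 'locality'.
    """
    groups = defaultdict(list)
    for rec in records:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        key = " ".join(rec["locality"].lower().split())
        groups[key].append(rec["catalog_number"])
    return dict(groups)


# Made-up sample records, for illustration only.
sample = [
    {"catalog_number": "A1", "locality": "5 km N of Berkeley"},
    {"catalog_number": "A2", "locality": "5 km  N of berkeley"},
    {"catalog_number": "A3", "locality": "Dar es Salaam"},
]

localities = group_by_locality(sample)
print(len(sample), "specimens,", len(localities), "distinct localities")
```

The more records are digitized before this step, the more specimens tend to fall under each distinct locality, which is where the economy of scale comes from.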
Ideally, you will also throw in your lot with others with the same georeferencing challenges and achieve further economies of scale by collaborating to georeference your combined holdings together. This sort of thinking will make the most of the Collaboration phase. The three biggest georeferencing collaborations to date have all used a similar workflow and the same best practices, with spectacular results. All of them are still in the lengthy repatriation process. The Prepare and Collaboration phases were easy by comparison.
As for metrics, the mean georeferencing rate for localities in the US and Canada is 30 localities per hour (following complete best practices and using the BioGeomancer Workbench batch processing). The most difficult localities (China, Russia) have been the worst-case scenario, with a rough rate of 6 localities per hour. So the numbers you need to make a reasonable estimate of the costs are the number of localities and the rate for the region of the earth where they occur. Those may be very difficult numbers to get for undigitized material, but it may be worth some research to estimate the number of localities based on the mean number of specimens per locality. For terrestrial vertebrates the ratio seems to hold fairly well at 6 specimens per locality, but this is likely quite different in other taxonomic groups.
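As a rough worked example of this arithmetic, the sketch below turns a specimen count into an estimate of georeferencing hours. The two rates (30 and 6 localities per hour) and the figure of 6 specimens per locality are the ones quoted above; the function name, the region labels, and the sample count of 90,000 specimens are made up for illustration and should be replaced with your own numbers.

```python
# Georeferencing rates quoted above, in localities per hour.
RATE_PER_HOUR = {
    "us_canada": 30.0,   # mean rate with full best practices and batch processing
    "worst_case": 6.0,   # rough rate for the most difficult localities
}


def estimate_hours(specimens, specimens_per_locality, region):
    """Estimate person-hours needed to georeference a collection.

    Hypothetical helper: localities = specimens / specimens_per_locality,
    hours = localities / rate for the region.
    """
    if specimens_per_locality <= 0:
        raise ValueError("specimens_per_locality must be positive")
    localities = specimens / specimens_per_locality
    return localities / RATE_PER_HOUR[region]


# Illustrative collection size; 6 specimens per locality is the
# terrestrial vertebrate figure and may not suit other groups.
for region in RATE_PER_HOUR:
    hours = estimate_hours(90_000, 6, region)
    print(f"{region}: about {hours:.0f} hours")
```

Multiplying the resulting hours by an hourly personnel cost gives a first-pass budget figure. The estimate is only as good as the specimens-per-locality figure, so it is worth sampling your own material before relying on it.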
You can see the documentation and get an idea of what’s possible at the project web sites listed below. I think you would be safe to write into your proposal that there are proven solutions with well-known metrics for georeferencing as demonstrated by the NSF-funded vertebrate networks participating in VertNet (http://vertnet.org).