Monday, March 31, 2014

What's the point of open data in archaeobotany?

This quote comes from the referee statement that accompanies an Internet Archaeology open data paper (Richards and Roskams 2013):

The importance of the dataset thus lies in its contribution to a broader programme of research whose cumulative results have the potential to generate something approaching a holistic view of.... (Thomas 2013).

The sentence could be continued with a statement about whatever research area is pertinent to the dataset that is being opened up. For the data that I have been making open, this is about the history (and prehistory) of plant use in Cork and evidence for arable agriculture at various different times in the past. I think. Maybe others could use the data for something different?

The problems

I have spent some time uploading results from archaeobotanical analyses to online repositories over the past two months. Some issues emerged/rose to the surface of my consciousness as I set about doing this, and I'll discuss a few of these below.

First of all, the errors! These are archaeobotanical reports written quite a few years ago, and some written in a great hurry because of time and budget constraints. They contain typographical and formatting errors. Although I am aware of how off-putting these can be for readers, I am also now operating under time constraints (this is completely un-funded and it takes up my leisure time). I consider it more important to make the material available and open, rather than fussing too much about embarrassing slips in copy-editing attention.

Secondly, these technical reports served a fairly restricted function. They stand as documents of their time and of the purpose for which they were written, and this limits them. They were written specifically as appendices to archaeological excavation reports, and that is all they are; they do not function well as stand-alone documents. For this reason I have spent some time providing links between the plant remains reports and other relevant material, such as the excavation report if it is online. Even so, anyone seeking to re-use the material faces a lot of additional work.

Thirdly, I have been concentrating on adding .pdfs of grey literature reports to the repositories, and the .pdf format does not really allow for easy re-use. It allows others to read what you have written about a certain assemblage, but it does not necessarily allow them to incorporate the results into their own work. As it stands, then, the collection of reports in a repository acts as an information source: it gives my archaeobotanical colleagues (around 6 others specialise in Irish material) sight of what has been found at different sites. But while the data sits in a .pdf file within the grey literature repository, anyone who wants to re-use my results currently needs to re-type them. In terms of actually encouraging their use for archaeobotanical research, these reports are only a first step in the process, and making the data available openly is the next one.

The solution?

I have decided to make the tables of identification from my grey literature reports available as easily importable .csv files. At the moment I am concentrating on assemblages that contain more than 25 cereal grains. Although this is quite a small number, it is based on the cut-off point used in a study of early medieval archaeobotanical remains, where archaeobotanical reports from multiple sources were re-used and compiled to produce a large-scale analysis of plant material from Ireland in this period (McCormick et al. 2011, 52). These datasets are slowly being added to the Zenodo repository, under a CC-BY licence.
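To give a sense of what "easily importable" could mean in practice, here is a minimal Python sketch of applying the 25-grain cut-off to a table of identifications. It is an illustration only, not a description of the files on Zenodo: the column names (site, sample, taxon, category, count), the site names and the counts are all hypothetical, and treating each site as one assemblage is my assumption for the example.

```python
import csv
import io

# Hypothetical table of identifications: one row per taxon per sample.
# Column names, site names and counts are invented for illustration.
SAMPLE_CSV = """site,sample,taxon,category,count
Site A,S1,Hordeum vulgare,cereal,18
Site A,S1,Avena sp.,cereal,12
Site A,S1,Corylus avellana,nut,4
Site B,S7,Triticum sp.,cereal,9
Site B,S7,Rumex sp.,weed,3
"""

# Cut-off described in the text: keep assemblages with MORE than 25 cereal grains.
CEREAL_CUTOFF = 25


def cereal_totals(rows):
    """Sum the cereal grain counts for each site (assumed here to be one assemblage)."""
    totals = {}
    for row in rows:
        if row["category"] == "cereal":
            totals[row["site"]] = totals.get(row["site"], 0) + int(row["count"])
    return totals


def sites_above_cutoff(rows, cutoff=CEREAL_CUTOFF):
    """Return the sites whose cereal total is strictly greater than the cut-off."""
    return sorted(site for site, total in cereal_totals(rows).items() if total > cutoff)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(cereal_totals(rows))
print(sites_above_cutoff(rows))
```

With a real file, the io.StringIO wrapper would be replaced by an opened .csv file; the point is simply that a few lines of code can do what would otherwise require re-typing tables from a .pdf.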

More problems?

As I gradually add datasets to the repository, I have noted more potential pitfalls in the openness of the datasets. Most notable of these is the fact that many of the technical reports that the .csv files are based on were written/assembled long before radiocarbon results were obtained and the phasing of multi-period sites was sorted out. This means that some contain data from more than one period of occupation at any given site. While some phasing has usually been incorporated into the discussion of the archaeobotanical results, the samples from each phase have not always been clearly separated within the datasets. In addition, as it was not possible to radiocarbon date each and every sample, there will always be material from some contexts where the date of origin is ambiguous. Nevertheless, to make the datasets more useful for archaeobotanists to re-use, it will be necessary to go back over them again. And who knows what additional problems will emerge as the iterative process continues?
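One way this second pass might look is sketched below in Python: a separate lookup of sample to phase is applied to the counts after the fact, and anything without a secure date is kept in its own group rather than dropped or guessed at. This is only a sketch under assumptions. The sample numbers, taxa, phase labels and the "unphased" category are all hypothetical and do not come from any of the actual reports.

```python
import csv
import io
from collections import defaultdict

# Hypothetical counts table, as it might appear in an un-phased dataset.
COUNTS_CSV = """sample,taxon,count
S1,Hordeum vulgare,18
S2,Avena sp.,12
S3,Triticum sp.,5
"""

# Hypothetical phasing lookup, compiled later once dating results are available.
# S3 is deliberately missing: not every sample can be dated.
PHASES = {"S1": "early medieval", "S2": "Bronze Age"}


def split_by_phase(rows, phases, unknown="unphased"):
    """Group rows by phase; samples without a phase go into the 'unknown' group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[phases.get(row["sample"], unknown)].append(row)
    return dict(grouped)


rows = list(csv.DictReader(io.StringIO(COUNTS_CSV)))
for phase, phase_rows in sorted(split_by_phase(rows, PHASES).items()):
    print(phase, [r["sample"] for r in phase_rows])
```

Keeping the phasing in a separate lookup means the original counts stay untouched, and the phasing can be revised as new dates come in without re-editing every dataset.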

See a video discussion of the difficulties of restructuring data that was originally created for a different specific set of purposes, in order to make it open and viable as linked data (specifically Hugh Corley's comments c. minute 33) at:
https://www.youtube.com/watch?v=bkBmstZmRdM 


References


McCormick, F., Kerr, T., McClatchie, M., & O’Sullivan, A. (2011). The Archaeology of Livestock and Cereal Production in Early Medieval Ireland, AD 400 - 1100. Retrieved from http://www.emap.ie/documents/EMAP_Report_5_Archaeology_of_Livestock_and_Cereal_Production_WEB.pdf

Richards, J., and Roskams, S. (2013). Burdale: An Anglian Settlement in the Yorkshire Wolds (Data Paper). Internet Archaeology, (35). doi:10.11141/ia.35.8

Thomas, G. 'Referee Statement' in Richards, J., and Roskams, S. (2013). Burdale: An Anglian Settlement in the Yorkshire Wolds (Data Paper). Internet Archaeology, (35). doi:10.11141/ia.35.8