Paleo data standardisation and my split feelings about LiPD

Khider and coauthors (among whom I was a minor one) made a valid statement in 2019 (for the references, go to the original, which should be freely available):

“Paleoclimatology is a highly integrative discipline, often requiring the comparison of multiple data sets and model simulations to reach fundamental insights about the climate system. Currently, such syntheses are hampered by the time and effort required to transform the data into a usable format for each application. This task, called data wrangling, is estimated to consume up to 80% of researcher time in some scientific fields (Dasu & Johnson, 2003)… this wrangling requires an understanding of each data set’s originating field and its unspoken practices and so cannot be easily automated or outsourced to unskilled labor or software. There is therefore an acute need for standardizing paleoclimate data sets.

Indeed, standardization accelerates scientific progress, particularly in the era of Big Data, where data should be Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al., 2016). Standardization is critical to many scientific endeavors: efficiently querying databases, analyzing the data and visualizing the results; removing participation barriers for early‐careers scientists or people outside the field; reducing unintended errors in data management; and ensuring appropriate credit of the original authors. While the paleoclimate community has made great strides in this direction (e.g., Williams et al., 2018), much work remains.”

The Linked Paleo Data (LiPD) data container, the PaCTS reporting standard (Khider and coauthors, see above), and PAST are all part of these “great strides”. Some have used LiPD extensively (PAGES 2k Temperature, Temperature 12k, iso2k, PalMod Marine data). In others, and here I rely on anecdotal evidence, it has provoked reactions anywhere between deep frustration and a simple “well, that is not helpful”.

I personally feel that LiPD is a good idea. As an R-user I can even appreciate its nested list structure when read into R. And here comes the “but” after a short excursion.

Khider and coauthors emphasize how climate modelling benefitted from the CF standards and the netCDF file format. One may add that initiatives to provide standardised data were also tied to the CMIP phases.

Standardised file formats and standardised metadata allowed the climate modelling community to develop standardised utilities, which in turn provide channels to easily access and, just as importantly, to easily manipulate and generate data files.

Which brings me back to the “but”. While the team behind LiPD does provide LiPD utilities, my experience with them (based only on R) is that I can read a LiPD file with them, but I do not want to use them for anything beyond that.

That is, the LiPD format is for me still a book with seven seals when it comes to manipulating data or adding data, and don’t even get me started on generating a new data file. No, using a website or an Excel template is not an option if I want to generate 500 or so files.

While even netCDF utilities can have a learning curve in R and other tool sets, standardised utilities can often remove it almost completely, as long as a file follows … certain standards.

Part of my problems with LiPD may be due to its nested structure, which has clear benefits but can also be counterintuitive and is not necessarily easy to navigate.
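To illustrate what “not easy to navigate” means in practice, here is a minimal sketch in Python (used only for illustration; the same idea works on a nested list in R). It does not use the LiPD utilities. The `record` below is a made-up nested structure that only loosely resembles a LiPD object, and `list_paths` is a hypothetical helper, not part of any existing package. It flattens the nesting into a list of paths so one can at least see where things live.

```python
def list_paths(node, prefix=""):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from list_paths(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from list_paths(child, f"{prefix}[{index}]")
    else:
        yield prefix, node


# Made-up example record; the field names are assumptions for illustration.
record = {
    "dataSetName": "example",
    "geo": {"latitude": 54.3, "longitude": 10.1},
    "paleoData": [{"columns": [{"variableName": "age", "units": "yr BP"}]}],
}

paths = dict(list_paths(record))
for path, value in paths.items():
    print(path, "=", value)
```

Even in this tiny example the variable name sits four levels down, which is the kind of depth that makes ad hoc navigation tedious.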

Primarily, however, I think that if we want LiPD to become widely accepted, or even to become the netCDF of paleo observational data, then we need utilities that are as easy to use as, for example, the R package ncdf4 or tools like the Climate Data Operators (CDO).

My biases and my workflow may show, but I would particularly love to see the functionality of the ncdf4 package mirrored in a lipdR package. This would allow writing simple scripts that write many LiPD files in a few moments, add a new metadata field with a single line of code, or add a newly inferred property just as easily.
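To make this wish concrete, here is a minimal sketch of the kind of workflow I have in mind, written in Python only for illustration. It is emphatically not the LiPD format or the API of any existing package: the record is a plain nested dict standing in for a LiPD-like structure, the files are written as plain JSON, and `new_record`, `add_metadata`, `add_variable`, `write_record` and `write_many` are all hypothetical helpers. The site names and data values are placeholders.

```python
import json
import tempfile
from pathlib import Path


def new_record(name, archive_type):
    """Create a minimal nested record (a stand-in, not the real LiPD schema)."""
    return {"dataSetName": name, "archiveType": archive_type, "paleoData": {}}


def add_metadata(record, key, value):
    """Add or overwrite a top-level metadata field."""
    record[key] = value
    return record


def add_variable(record, table, name, values, units=None):
    """Add a data column, e.g. a newly inferred property, to a named table."""
    columns = record["paleoData"].setdefault(table, {})
    columns[name] = {"units": units, "values": list(values)}
    return record


def write_record(record, directory):
    """Write one record as plain JSON and return the file path."""
    path = Path(directory) / f"{record['dataSetName']}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def write_many(n, directory):
    """Generate n placeholder records and write each to its own file."""
    paths = []
    for i in range(n):
        rec = new_record(f"site{i:03d}", "MarineSediment")
        add_metadata(rec, "notes", "placeholder metadata field")
        add_variable(rec, "measurements", "age", [0, 1000, 2000], units="yr BP")
        add_variable(rec, "measurements", "d18O", [1.2, 1.5, 1.9], units="permil")
        paths.append(write_record(rec, directory))
    return paths


with tempfile.TemporaryDirectory() as out_dir:
    written = write_many(500, out_dir)
    first = json.loads(written[0].read_text())
    print(len(written), "files written; first is", first["dataSetName"])
```

The point is not these particular functions but their shape: one call per action, so that generating 500 files is a loop rather than 500 visits to a website.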