TITLE: Data management during and after my PhD
DATE: 2021-09-20
AUTHOR: John L. Godlee
====================================================================


During my PhD I spent a lot of time collecting field data. Along 
with colleagues from Angola I set up 15 one-hectare permanent 
woodland survey plots in Bicuar National Park, Angola, where we 
conducted a census of all woody stems >5 cm diameter, and made 
additional measurements of grass biomass and tree canopy cover. 
These plots will hopefully have another census in 2023. I also 
collected terrestrial LiDAR data in 22 one-hectare woodland plots, 
the 15 in Bicuar National Park plus an additional seven in southern 
Tanzania, to quantify canopy complexity.

I think these two datasets form a key product of my PhD thesis. PhD 
students often generate a lot of data, but only a minority develop 
a long-term plan for data management and data dissemination. I 
chose to write an extra chapter in my thesis all about the multiple 
uses of the data I collected, and its contribution to the greater 
good of the field. That contribution depends on the data being 
properly archived, managed, and advertised, otherwise nobody else 
will want to use them. The investigative chapters of the thesis 
will, I hope, be converted into manuscripts for peer review, 
extending their lifespan and making them more impactful by reaching 
a larger audience. In the same way, I hope I can ensure the data I 
collected during my PhD has a legacy beyond just my PhD thesis.

  [generate a lot of data]: https://hal.univ-lille.fr/hal-01248979
  [data management]: https://www.researchgate.net/publication/305395587_Research_traditions_and_emerging_expectations_PhD_students_and_their_research_data_management

Lots of universities have a data management plan web page; these 
are some of the first results from a Google search for "phd data 
management":

-   University College London
-   University of York
-   University of Sheffield
-   University of Bath
-   University of Leeds
-   University of Liverpool
-   University of Birmingham
-   University of Bristol
-   University of Exeter
-   University of Southampton

  [University College London]: https://www.ucl.ac.uk/library/research-support/research-data-management/policies/writing-data-management-plan
  [University of York]: https://www.york.ac.uk/library/info-for/researchers/data/planning/
  [University of Sheffield]: https://www.sheffield.ac.uk/library/rdm/dmp
  [University of Bath]: https://library.bath.ac.uk/research-data/data-management-plans/university-dmp-templates
  [University of Leeds]: https://library.leeds.ac.uk/info/14062/research_data_management/62/data_management_planning
  [University of Liverpool]: https://libcal.liverpool.ac.uk/event/3671658
  [University of Birmingham]: https://intranet.birmingham.ac.uk/as/libraryservices/library/research/rdm/data-management-plans.aspx
  [University of Bristol]: http://www.bristol.ac.uk/staff/researchers/data/
  [University of Exeter]: https://www.exeter.ac.uk/research/researchdatamanagement/before/plans/
  [University of Southampton]: https://library.soton.ac.uk/researchdata/phd

The University of Edinburgh, where I did my PhD, also has one, but 
I didn't see it until writing this blog post, a week after handing 
in.

Writing a DMP | The University of Edinburgh

At the end of the first year of my PhD I wrote a "Confirmation 
Report". In other institutions I've heard them referred to as 
"Upgrades". It's sort of a friendly examination that makes sure you 
have a developed plan for what to do during the PhD, before it's 
too late. You write a report that's part literature review and part 
methodology proposal, then have a mini viva with some other 
academics. I always felt like my confirmation report should have 
required a data management plan, similar to how it required an 
ethics assessment and a timeframe, but it didn't. We did have a 
short presentation on data management during the "Research Planning 
and Management" course in the first year of the PhD, which 
consisted mostly of information on how to store data on the 
University network. I would have liked to see more guidance on how 
to manage and archive large volumes of data (TBs), both during and 
after the PhD, to ensure that data is usable by others, and by your 
future self.

For the plot census data, which is only a couple of GB, I have 
stored the data in three places:

-   University datastore - accessible by ssh, backed up regularly 
by the University
-   Hard drive stored at my parents' house
-   Hard drive stored at my house

This conforms to the 3-2-1 backup rule, which recommends keeping at 
least 3 copies of the data, on at least 2 different media types 
(hard drive, network share), with at least 1 copy off-site (I have 
two different off-site locations: the University and my parents' 
house). I also have "cleaned" versions of the plot census data 
hosted on the SEOSAW database, which makes the data accessible to 
other researchers under agreement. A few other projects have 
already requested to use the data, which is very nice to see.

  [3-2-1 backup rule]: 
https://en.wikipedia.org/wiki/Backup#3-2-1_rule
  [SEOSAW database]: https://seosaw.github.io/
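
Rule aside, backups only help if the copies actually match. One way 
to verify is to hash every file in each copy and compare the lists. 
A minimal sketch with standard Unix tools, using throwaway 
directories in place of my real drives (the paths and file names 
are made up for illustration):

```shell
# Stand-in directories for the primary copy and one backup
# (in practice these would be e.g. the datastore and a hard drive).
PRIMARY=$(mktemp -d)
BACKUP=$(mktemp -d)
echo "plot,stems" > "$PRIMARY/census.csv"
cp "$PRIMARY/census.csv" "$BACKUP/census.csv"

# Hash every file relative to each root, then compare the two lists;
# diff exits non-zero if any file differs or is missing.
SUMS_A=$(mktemp)
SUMS_B=$(mktemp)
(cd "$PRIMARY" && find . -type f -exec sha256sum {} + | sort) > "$SUMS_A"
(cd "$BACKUP"  && find . -type f -exec sha256sum {} + | sort) > "$SUMS_B"
diff "$SUMS_A" "$SUMS_B" && echo "backup matches primary"
```

Hashing relative paths (via `cd` before `find`) means the two lists 
are comparable even though the copies live under different roots.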

One thing I didn't keep good track of for a little while was 
treating only one copy as the 'primary' copy and using the others 
as backups only. At one point I was writing new data to both my 
personal hard drives and got them mixed up. Since then, I have kept 
one of the hard drives in a cupboard out of sight, to deter me from 
writing data to it unless I want to do a backup. As an aside, I use 
rsync to make backups. It's quick and efficient and very rarely 
fails. I have plans to buy a NAS (Network Attached Storage), and 
the Synology DS420+ looks nice, but for now having loose hard 
drives will have to suffice.

  [rsync]: https://www.google.com/search?hl=en&q=rsync
  [Synology DS420+]: https://www.synology.com/en-us/products/DS420+

The LiDAR data consist of raw .zfs files exported directly from the 
scanner, databases built by Cyclone (Leica's proprietary LiDAR 
processing software), PTX files output by Cyclone, and LAZ files 
created by me, which compress the huge PTX files into a more 
manageable binary format.

The key items to keep, in my opinion, are the raw .zfs files and 
the PTX files, as they constitute the raw untouched data in open 
formats, but the LAZ files are the ones I'll probably use most on a 
day-to-day basis, simply because they are small enough that drive 
I/O isn't a bottleneck for processing time.

I've got the LAZ files backed up in the same places as the plot 
census data, and also in a DataShare repository, which gives them a 
permanent DOI and makes them available for others to use. I don't 
think I will back up the scan databases, because every piece of 
information in them is represented in some other file. The only 
convenience of keeping them is that I would be able to quickly boot 
up Cyclone and use its very good 3D rendering, but Cloud Compare is 
enough for me most of the time. The PTX files I have backed up both 
on my personal hard drives and on tape at the University, a service 
which costs about £50 per pair of tapes, I think, which is very 
reasonable. This isn't perfect, as the tape backup isn't that 
accessible, but the PTX files are just so big that it's difficult 
to keep them anywhere else. As long as I have two sets of hard 
drives, each stored in different places, they should be safe.

  [DataShare repository]: 
https://datashare.ed.ac.uk/handle/10283/3997
  [Cloud Compare]: https://www.danielgm.net/cc/