TITLE: Questions about operational IT for research
DATE: 2020-06-25
AUTHOR: John L. Godlee
====================================================================


I have a couple of open questions, similar to previous questions 
posed in my lab group on how folk set up their R environments. I 
think that discussions like this are a good way of developing a 
sense of collegiality in academic groups. Often discussion of our 
specific research is stymied by the feeling that it has to be
perfect before we talk to colleagues about it, but discussing
operational topics like data management and data analysis is an
effective way of sharing experience and making all our lives
easier. Once the boring day-to-day tasks are handled as well as
possible, the hard work of research becomes slightly more
pleasurable.

First question: How does one manage storage of large files,
rasters and the like? I currently download large spatial data to
my local machine for analysis, but my laptop periodically runs
out of hard disk space and I have to delete various layers. Then,
inevitably, I need one of those files again and have to figure
out where I got it from, or I want to rerun an old analysis and
find that I carelessly deleted an important file of raw data.

I've tried keeping files on Google Drive but this is a pain because 
the large files choke up the syncing on my domestic internet 
connection. I've tried keeping files on my university datastore, 
but the upload/download speed when not on the University network is 
very frustrating. At the moment I keep large files on a networked
home server, but there are two major drawbacks to this approach:
firstly, if I ever decide to work away from home I will no longer
have access to those files, and secondly, I do not have enough
hard drive space for redundancy, so if my spinning-disk hard
drives fail, that's it.
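
Part of the problem is simply keeping track of where each large
file came from. A partial fix might be a plain-text manifest
recording the source URL and checksum of every large file,
alongside a small script that re-downloads anything missing, so
that deleting a local copy to free up disk space is no longer
fatal. A rough sketch of what I mean in Python; the manifest
format, file names and URLs are all made up for illustration:

    #!/usr/bin/env python3
    # Sketch: re-fetch large files listed in a plain-text manifest.
    # manifest.tsv has three tab-separated columns:
    #   local path, source URL, md5 checksum
    # Everything here (paths, URLs, checksums) is a placeholder.
    import csv
    import hashlib
    import urllib.request
    from pathlib import Path

    MANIFEST = Path("manifest.tsv")

    def md5(path, chunk=2**20):
        """MD5 of a file, read in chunks so big rasters don't fill memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    with open(MANIFEST, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue
            local, url, checksum = row
            dest = Path(local)
            if dest.exists() and md5(dest) == checksum:
                continue  # already present and intact
            dest.parent.mkdir(parents=True, exist_ok=True)
            print(f"Fetching {url} -> {dest}")
            urllib.request.urlretrieve(url, dest)
            if md5(dest) != checksum:
                raise RuntimeError(f"Checksum mismatch for {dest}")

That would at least solve the "where did I get this file from"
problem, though it does nothing for redundancy of files I created
myself.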

As a side note on the question above and on the R environment
question, I have become concerned about how much of the burden of
IT infrastructure my university pushes onto employees and
students. The general
consensus among my lab group on the R environment question was that 
the University managed R environment, as installed using the 
'Application Catalog' is unusable for real research, due to an 
issue with managing packages. One lab group member said that when
they talked to IT about it, they were advised simply not to use
the University R environment. Surely this is a service that
should be provided to everyone at the University without
question?! Another
story is from a lab group who decided that it was easier to buy 
their own high-spec image rendering desktop machine rather than 
deal with the University's poorly managed cluster computer setup. 
Finally, there are all the PhD students in my office who choose
to use their own laptops, keyboards and mice, presumably paid for
out of their own pockets, rather than the terrible network-managed
all-in-one desktop PCs and low-end chiclet keyboards. My own
desktop PC was pushed to the back of my desk after about two
weeks of work, in favour of my laptop and an external display.

Second question: How does one create a truly reproducible QGIS
workflow, one which keeps a record of the intermediary objects
created, the processes that created them and the inputs provided?

I was recently clipping, diffing and dissolving a bunch of 
different spatial vector files to create a data-informed study 
region which will define the spatial bounds of part of my research. 
Normally I do these things in R, but this time I needed to do a 
fair amount of manual clicking, so I opted for QGIS. In
hindsight, if I had considered each operation more carefully I
probably could have got away with no manual clicking at all, but
I was short on time and willpower. What I would really like is to
be able to export a script, probably in Python since QGIS already
interacts well with Python, which records exactly what I did,
right down to the manual clicking used to create free-hand
polygons, so that everything I produced by hand could be
recreated.
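
As far as I can tell QGIS won't export such a script for me, but
the Processing framework gets part of the way there: each
geoprocessing step can be written as a call to processing.run(),
with every intermediate layer saved to its own file, so at least
the chain of clip, difference and dissolve operations is on
record. A rough sketch of the kind of script I would want QGIS to
generate, written for the QGIS 3 Python console; all the layer
and file names are placeholders:

    # Sketch for the QGIS 3 Python console: chain of vector
    # operations with every intermediate written to its own file,
    # so inputs, processes and outputs are all recorded.
    # All file names below are placeholders.
    import processing

    # Clip the vegetation layer to the country boundary
    clipped = processing.run("native:clip", {
        "INPUT": "veg_map.gpkg",
        "OVERLAY": "country_boundary.gpkg",
        "OUTPUT": "01_veg_clipped.gpkg",
    })["OUTPUT"]

    # Remove protected areas from the clipped layer
    diffed = processing.run("native:difference", {
        "INPUT": clipped,
        "OVERLAY": "protected_areas.gpkg",
        "OUTPUT": "02_veg_minus_pa.gpkg",
    })["OUTPUT"]

    # Dissolve what remains into a single study region polygon
    study_region = processing.run("native:dissolve", {
        "INPUT": diffed,
        "OUTPUT": "03_study_region.gpkg",
    })["OUTPUT"]

The part that still wouldn't be captured is the free-hand
digitising itself, which is exactly the bit I would most like a
record of.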