TITLE: Processing data from the TRY traits database
DATE: 2021-12-20
AUTHOR: John L. Godlee
====================================================================


I've been working recently with data from the TRY global plant 
traits database, to assess which dominant dry tropical tree species 
have good trait data, and which are lacking decent trait data. One 
of the key inputs to the process-based carbon cycle model used in 
the SECO project is leaf mass per area (LMA), sometimes expressed 
as its inverse, leaf area per mass, aka specific leaf area (SLA), 
so I'm focussing on that. If we find gaps in the trait coverage of 
some species, maybe we can address those gaps with data collection 
during the project.

The data retrieved from a TRY data request come in a format that 
is quite difficult to parse in R. Instead of a 2D table with one 
row per observation, it's more like a 1D list, with metadata and 
trait data on different rows, linked by an observation ID. In this 
post I want to share the R code I use to create a neat dataframe 
from this data.
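
To make that layout concrete, here is a toy sketch of what a couple 
of observations might look like. The column names follow the TRY 
export used later in this post, but every value is invented, and 
the TraitID and trait DataID are placeholders rather than real TRY 
IDs.

```r
# Hypothetical toy data in the long TRY-style layout.
# All values are invented; TraitID 1 and DataID 100 are placeholders.
toy <- data.frame(
  ObservationID = c(1, 1, 1, 2, 2),
  AccSpeciesID = c(10, 10, 10, 20, 20),
  TraitID = c(1, NA, NA, 1, NA),
  DataID = c(100, 59, 60, 100, 59),
  DataName = c("SLA", "Latitude", "Longitude", "SLA", "Latitude"),
  StdValue = c(12.5, -15.2, 28.1, 9.8, -12.7)
)

# Trait rows carry a TraitID, metadata rows do not
trait_rows <- toy[!is.na(toy$TraitID), ]
meta_rows <- toy[is.na(toy$TraitID), ]

# Count trait and metadata rows per observation
table(obs = toy$ObservationID, is_meta = is.na(toy$TraitID))
```

Each observation has one trait row and a variable number of 
metadata rows, which is why the data can't be used as-is.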

I use data.table::fread() to read in the data, because the files 
can be quite large, 3.35 GB in my case:

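    library(data.table)  # fread()
    library(dplyr)       # %>%, select(), filter()
    library(parallel)    # mclapply()

    try_dat <- fread("dat/18017.txt", header = TRUE, sep = "\t",
      dec = ".", quote = "", data.table = FALSE, encoding = "UTF-8")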

Also note that the data are tab-separated, and that it's a good 
idea to enforce UTF-8 encoding to avoid encoding issues.

Then I rename some columns and keep the useful ones:

    try_clean <- try_dat %>%
      dplyr::select(
        obs_id = ObservationID,
        species_id = AccSpeciesID,
        species_name = AccSpeciesName,
        trait_id = TraitID,
        trait_name = TraitName,
        key_id = DataID,
        key_name = DataName,
        val_orig = OrigValueStr,
        val_std = StdValue,
        unit_std = UnitName,
        error_risk = ErrorRisk) 

I create lookup tables to match the species IDs and trait IDs later 
on:

    species_id_lookup <- try_clean %>% 
      dplyr::select(species_id, species_name) %>% 
      unique()

    trait_id_lookup <- try_clean %>% 
      dplyr::select(trait_id, trait_name) %>% 
      unique() %>%
      dplyr::filter(!is.na(trait_id)) 

Then I split the data by observation ID:

    try_split <- split(try_clean, try_clean$obs_id)
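
As a quick sketch of what split() returns here, using an invented 
toy dataframe rather than the real TRY data:

```r
# Toy data: two observations with 2 and 3 rows respectively
toy <- data.frame(
  obs_id = c(1, 1, 2, 2, 2),
  val_std = c(12.5, -15.2, 9.8, -12.7, 30.4)
)

# One list element per observation ID, named by that ID
toy_split <- split(toy, toy$obs_id)

names(toy_split)
sapply(toy_split, nrow)
```

The result is a named list of dataframes, one per observation, 
which is convenient to iterate over.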

Then I loop through each of those observations, extracting the 
trait data and some useful metadata that is commonly attached to 
each observation. But note that there are lots of metadata in TRY, 
and not all observations share all metadata. A lot don't even have 
latitude and longitude coordinates, limiting their usefulness.

    total <- length(try_split)
    try_df <- as.data.frame(do.call(rbind,
      mclapply(seq_along(try_split), function(i) {
        message(i, "/", total)
        x <- try_split[[i]]

        # Keep trait rows only, with a subset of columns
        traits <- x[!is.na(x$trait_id),
          c("species_id", "trait_id", "val_orig", "val_std",
            "unit_std", "error_risk")]

        # Return the first value stored under a metadata key, 
        # or NA if the observation lacks that key
        meta_ext <- function(y, key_val) {
          ext <- y[y$key_id == key_val, "val_std"]
          ifelse(length(ext) == 0, NA, ext)
        }

        # Extract some common metadata
        traits$elev <- meta_ext(x, 61)
        traits$longitude <- meta_ext(x, 60)
        traits$latitude <- meta_ext(x, 59)
        traits$map <- meta_ext(x, 80)
        traits$mat <- meta_ext(x, 62)
        traits$biome <- meta_ext(x, 193)
        traits$country <- meta_ext(x, 1412)

        return(traits)
      }, mc.cores = 3)))
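
To show how the metadata helper behaves, here is a minimal sketch 
that runs it on a single invented observation. The column names 
match those used above, but the values are made up, and trait ID 1 
and key ID 100 are placeholders.

```r
# Same helper as in the loop above
meta_ext <- function(y, key_val) {
  ext <- y[y$key_id == key_val, "val_std"]
  ifelse(length(ext) == 0, NA, ext)
}

# One invented observation: a trait row plus latitude (59)
# and longitude (60) metadata rows, but no elevation (61)
obs <- data.frame(
  trait_id = c(1, NA, NA),
  key_id = c(100, 59, 60),
  val_std = c(12.5, -15.2, 28.1)
)

lat <- meta_ext(obs, 59)   # key present: returns its value
elev <- meta_ext(obs, 61)  # key absent: returns NA

# Attach the metadata to the trait row, as the loop does
traits <- obs[!is.na(obs$trait_id), c("trait_id", "val_std")]
traits$latitude <- lat
traits$elev <- elev
traits
```

Note that if a key appears more than once within an observation, 
ifelse() keeps only the first value.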

Finally, I can add the trait and species names back in using the 
lookup tables:

    # Add trait and species names to dataframe
    try_df$trait_name <- trait_id_lookup$trait_name[
      match(try_df$trait_id, trait_id_lookup$trait_id)]

    try_df$species_name <- species_id_lookup$species_name[
      match(try_df$species_id, species_id_lookup$species_id)]
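
The match() idiom used here can be sketched on toy data. The 
species names and IDs below are invented for illustration:

```r
# Invented lookup table and data
lookup <- data.frame(
  species_id = c(10, 20),
  species_name = c("Species A", "Species B")
)
df <- data.frame(species_id = c(20, 10, 20, 30))

# match() returns, for each ID in df, its row position in lookup,
# or NA when the ID is not in the lookup table
idx <- match(df$species_id, lookup$species_id)
df$species_name <- lookup$species_name[idx]

df
```

IDs missing from the lookup table, like 30 here, get NA rather 
than causing an error, so it's worth checking for NA values in the 
new columns afterwards.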