TITLE: An R function to split species names
DATE: 2020-06-05
AUTHOR: John L. Godlee
====================================================================


For my research assistant position I have been cleaning lots of 
taxonomic data for tree species in southern Africa. On the surface 
this seems simple, Brachystegia spiciformis gets split into 
c("Brachystegia", "spiciformis"). However, what about when the 
species is written as Brachystegia spiciformis var. kwangensis? 
Here is a list of possible species name forms I found in my dataset:

-   Brachystegia spiciformis
-   Brachystegia cf. spiciformis
-   Acacia abyssinica subsp. calophylla
-   Acacia sieberiana var. woodii

And that isn't counting the species with multiple below-species 
taxonomic ranks, like: Vachellia gerrardii subsp. gerrardii var. 
latisiliqua.

Separating these out by hand would take a very long time, so I 
wrote a function which does it for me.

First the function splits strings by spaces or optionally dots with 
no spaces, then it searches to see if a species is cf., meaning 
that the absolute species isn't known but a guess has been made, in 
which case species is replaces with indet (indeterminate) and the 
species is stored in the confer column. Then a similar process to 
search for both varieties and subspecies. If below-species ranks 
are to be returned then the dataframe is returned as is, otherwise 
the confer column replaces the indet in species if below-species 
ranks are not returned.

This function doesn't catch Brachystegia sp.2, but I have a 
separate function which replaces these with Brachystegia indet 
based on a lookup table supplied by the user.

    #' Split full species name into genus, species, and optionally 
below-species taxonomic ranks
    #'
    #' @param x vector of genus and species names
    #' @param subsp logical, should lower taxonomic ranks be 
returned?
    #'
    #' @return dataframe of character vectors with one column per 
rank
    #'
    #' @export
    #'
    splitSpecies <- function(x, subsp = TRUE) {
      x <- strsplit(x, " |[a-z]\\.[a-z]")

      x <- lapply(x, function(y) {
        # genus
        genus <- y[1]

        # cf and species 
        if (grepl("cf(\\.)?", y[2])) {
          species <- "indet"
          cf <- y[3]
          plus <- 1
        } else {
          species <- y[2]
          cf <- NA_character_
          plus <- 0
        }

        if (!is.na(y[3+plus])) {
          sub_string <- paste(y[(3+plus):length(y)], collapse = " ")

          # variety if present
          if (grepl("var(\\.)?", sub_string)) {
            string <- strsplit(sub_string, " ")
            variety <- string[[1]][which(grepl("var(\\.)", 
string[[1]])) + 1]
          } else {
            variety <- NA_character_
          }

          # subspecies if present
          if (grepl("subs(p)?(\\.)?", sub_string)) {
            string <- strsplit(sub_string, " ")
            subspecies <- string[[1]][which(grepl("subs(p)?(\\.)?", 
string[[1]])) + 1]
          } else {
            subspecies <- NA_character_
          }
          c(genus, species, cf, subspecies, variety)
        } else {
          c(genus, species, cf, NA_character_, NA_character_)
        }
      })

      out <- as.data.frame(do.call(rbind, x))
      names(out) <- c("genus", "species", "confer", "subspecies", 
"variety")[1:length(out)]

      # Replace cf. as species is subsp. == FALSE
      if (subsp) {
        out <- out
      } else {
        out$species[!is.na(out$confer)] <- 
out$confer[!is.na(out$confer)]
        out <- out[,c("genus", "species")]
      }
      return(out)
    }