TITLE: An email about ordination and environmental fits
DATE: 2020-01-15
AUTHOR: John L. Godlee
====================================================================


I got an email from a colleague asking for my opinion on analysing 
the environmental determinants of tree species diversity measured 
in multiple plots across a province in their country. I did a load 
of reading on this sort of thing about 6 months ago so I relied 
heavily on my notes to write a response and also provide an R 
script with my recommendations.

The original email, paraphrased slightly, said:

  Dear John

  I need your help with diversity analysis in R for my masters 
student. Our goal is to detect relations between environmental 
variables and diversity.

  All variables are numeric, we would like to know what kind of 
analysis you suggest?

  Ordinations? Regressions? PCA? NMDS?

  Can you guide Us?

They also sent a document with their overall research goal:

  Evaluate and relate by statistical analysis the effect of soil, 
climate and topography on the spatial distribution of diversity 
(Richness index Shannon index and Simpson index ) of vegetation in 
province.

y email response was:

  My preference for an ordination technique to analyse variation in 
species composition is NMDS. This is for two main reasons. The 
first is that NMDS deals well with instances where there are no 
shared species between two sites, which I imagine may happen a lot 
in your very large dataset, across a wide geographical area. 
Because Principle Co-ordinate Analysis (PCoA) is an eigenvalue 
analysis and must find a unique solution, when two or more sites 
share no common data, this unique solution cannot be found. The 
second reason is that in contrast to Principle Component Analysis 
(PCA), NMDS can properly represent variation in species abundances 
because it is not limited to using a euclidean distance method. I 
would recommend using the Bray-Curtis distance method.

  I don't think it is wise to use all the environmental variables 
in the NMDS, especially because as you have shown, some of the 
variables are highly correlated. I think you should choose 
variables which you think have a valid ecological reason for 
influencing the species composition.

  In addition to investigating how species composition (species 
identity and relative abundance) varies with environmental 
variables, you may wish to investigate how species diversity 
(species richness and abundance evenness) varies with environmental 
variables. To do this, I would use a simple general linear model 
with multiple environmental variables as the predictor variables 
and the Shannon index as the response variable.

  I don't have your dataset so I have provided in an attached R 
script the way I would perform this sort of analysis in R, using an 
example dataset provided in the {vegan} R package which I hope is 
similar to your data.

The R script I sent looked like this:

    # Environmental determinants of diversity 
    # John Godlee (johngodlee@gmail.com)
    # 2020-01-12

    # Preamble ----

    # Set working directory

    # Packages
    library(vegan)  # diversity(), metaMDS(), data(dune)
    library(ggplot2)  # ggplot()
    library(dplyr)  # %>%, select()
    library(tidyr)  # gather()

    # Import data 
    data(dune)
    ##' Site (rows) by species (columns) abundance matrix
    ##' 20 sites
    ##' 30 species 

    env <- read.csv("env.csv")
    ##' Dataframe with environmental variables for each of the 20 
sites
    ##' This is fake data I created for this example

    # Define a function to estimate the optimal number of 
dimensions for an NMDS
    NMDS.scree <- function(x) {
      plot(rep(1, 10), replicate(10, metaMDS(x, autotransform = F, 
k = 1)$stress), 
        xlim = c(1, 10),ylim = c(0, 0.30), 
        xlab = "# of Dimensions", ylab = "Stress", main = "NMDS 
stress plot")
      for (i in 1:10) {
        points(rep(i + 1,10), 
          replicate(10, metaMDS(x, autotransform = F, k = i + 
1)$stress)
        )
      }
    }

    # Run NMDS.scree 
    NMDS.scree(dune)
    ##' The scree plot shows that 3 dimensions produces a stress 
value below 0.1,
    ##' which is indicates that with this many dimensions, the NMDS 
provides a good
    ##' representation of the difference in species composition.

    # Run NMDS with 3 dimensions
    dune_nmds <- metaMDS(dune, distance = "bray", try = 500, trymax 
= 500, k = 3, 
      autotransform = FALSE)

    # Assess the fit of the NMDS with a stressplot (Shepard plot)
    stressplot(dune_nmds)
    ##' The stressplot shows a strong positive correlation
    ##' It plots the distances among objects in the ordination plot 
against the 
    ##' original Bray-Curtis distances. A tight positive 
correlation between 
    ##' original distance and transformed distance shows that the 
dimensionality 
    ##' reduction was successful.

    # Extract final stress value of the NMDS
    nmdsstress <- dune_nmds$stress
    ##' Should be quoted in results, to show validity of results of 
NMDS

    # Extract site (plot) scores from NMDS analysis
    plot_scores <- as.data.frame(scores(dune_nmds))  

    # Extract species scores from NMDS analysis
    species_scores <- as.data.frame(scores(dune_nmds, "species")) 
    species_scores$species_binomial <- rownames(species_scores)

    # Fit environmental variables to NMDS
    dune_envfit <- envfit(dune_nmds, env[,
      c("BIO1", "BIO12", "BIO15", "BIO17", "elevation", 
"sand_coarse")], 
    permutations = 999)
    ##' Note that these aren't my recommendations for the variables 
you should 
    ##' use on your data, they are just commonly used variables.

    # Which environmental variables significantly affect species 
composition?
    dune_envfit

    # Create ggplot of NMDS output

    # Get arrow vectors
    dune_envfit_arrows <- data.frame(dune_envfit$vectors$arrows)
    dune_envfit_arrows$var <- rownames(dune_envfit_arrows)
    dune_envfit_arrows$r2 <- dune_envfit$vectors$r
    dune_envfit_arrows$p <- dune_envfit$vectors$pvals

    dune_nmds_plot <- ggplot() + 
      geom_hline(aes(yintercept = 0), linetype = 2) + 
      geom_vline(aes(xintercept = 0), linetype = 2) + 
      geom_point(data = species_scores,
        aes(x = NMDS1, y = NMDS2), shape = 1, colour = "black") + 
      geom_point(data = plot_scores,
        aes(x = NMDS1, y = NMDS2), shape = 2, colour = "red") +
      geom_segment(data = dune_envfit_arrows, 
        aes(xend = NMDS1*r2*2, yend = NMDS2*r2*2, x = 0, y = 0), 
        arrow = arrow(length = unit(0.05, "npc")),
        colour = "blue") + 
      geom_text(data = dune_envfit_arrows,
        aes(x = NMDS1*r2*2, y = NMDS2*r2*2, label = var),
        colour = "blue") + 
        coord_equal()

    # General linear model of Shannon index vs. environment

    # Calculate Shannon index from abundance matrix
    env$shannon <- diversity(dune)

    # Run linear model with multiple predictors
    dune_lm <- lm(shannon ~ BIO1 + BIO12 + BIO15 + BIO17 + 
elevation + sand_coarse, 
      data = env)
    ##' You may find that the model should be more complex than 
this, possibly with 
    ##' interaction terms, depending on the variables you choose. 
Also, you may 
    ##' want to introduce a spatial auto-correlation term. Finally, 
you should check
    ##' that the predictor variables conform to the assumptions of 
a general 
    ##' linear model.

    # Which environmental variables significantly (p<0.05) affect 
species diversity?
    summary(dune_lm)

    # Plot of linear relationship between shannon and chosen 
environmental variables
    env_gather <- env %>% 
      dplyr::select(plotcode, shannon, BIO1, BIO12, BIO15, BIO17, 
elevation, sand_coarse) %>%
      gather(key = "var", value = "value", -plotcode, -shannon)

    dune_lm_plot <- ggplot(env_gather, aes(x = value, y = shannon)) 
+
      geom_point() + 
      geom_smooth(method = "lm") + 
      facet_wrap(~var, scales = "free_x")