TITLE: Comparing which simulated distribution is closest to the 
truth
DATE: 2020-10-05
AUTHOR: John L. Godlee
====================================================================


I had an email from a colleague who had a load of univariate 
distributions generated from different landscape evolution 
simulations. They wanted to know which simulated distribution was 
closest to the observed distribution in the real landscape.

To begin with, the colleague suggested using t-tests to see if two 
distributions were similar, but I think there is a problem with 
using t-tests in that way. As I interpret it, t-tests assess 
whether two distribution means are appreciably different, not 
whether they are similar. Anyway, this was my discussion:

  So, you have many simulated elevation distributions, and you want 
to know which distribution is closest to the true observed 
distribution? In that case, I definitely take issue with using K-S 
(Kolmogorov-Smirnov) tests or many one-sample t-tests, as they just 
don’t fit the framing of your question. Both test whether a 
simulated distribution is appreciably different from the observed 
distribution (the t-test via the sample means, the K-S test via the 
empirical cumulative distributions), using a p-value. Firstly, it’s 
not as simple as reversing the logic to say: “if you have a 
non-significant p-value then they are appreciably similar”. 
Secondly, you don’t want to know whether each simulation is 
suitable (i.e. similar) or not, you want to know which is most 
suitable. With the above tests all simulations might pass as 
suitable, and given the large sample size I would wager that they 
all would, but you wouldn’t know which was the most suitable.
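
  As a minimal sketch of this point (the numbers are made up): with 
large samples even a tiny difference in means comes out 
“significant”, so these tests mostly measure sample size, not 
similarity:

    # Two near-identical distributions, large sample size
    obs <- rnorm(3000, 144, 5)
    sim <- rnorm(3000, 144.5, 5)  # mean only 0.5 units off

    # Both tests ask whether the distributions *differ*
    t.test(obs, sim)$p.value  # almost certainly < 0.05 despite the overlap
    ks.test(obs, sim)         # likewise tests for difference, not similarity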

  As an aside, before resigning yourself to non-parametric tests, 
it might be worth seeing if a simple transformation of the data can 
achieve normality, or at least normal-ish. Try log-transforming or 
sqrt-ing. With a sample of 3000 points I don’t think having truly 
normal data is such a big deal anyway, and there is a lot of 
literature to support linear models being robust to violations of 
normality.
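
  For example, a quick sketch like this (with fake skewed data) 
lets you eyeball whether a transformation gets close enough to 
normal:

    # Compare raw, log- and sqrt-transformed data with Q-Q plots
    x <- rlnorm(3000, meanlog = 5, sdlog = 0.3)  # fake skewed elevations

    par(mfrow = c(1, 3))
    qqnorm(x, main = "Raw"); qqline(x)
    qqnorm(log(x), main = "Log"); qqline(log(x))
    qqnorm(sqrt(x), main = "Square root"); qqline(sqrt(x))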

  Equivalence testing may be what you are looking for. It instead 
tests whether two distributions are similar within some 
pre-specified bounds, but it won’t be able to tell you which one is 
most similar to the truth.
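
  For reference, a basic equivalence (TOST) test can be built from 
two one-sided t-tests in base R. The equivalence bound below (±1 
elevation unit) is entirely made up and would need justifying; 
packages like TOSTER wrap this procedure up more formally:

    # TOST: reject both one-sided tests to conclude equivalence
    obs <- rnorm(3000, 144, 5)
    sim <- rnorm(3000, 144.2, 5)
    delta <- 1  # arbitrary equivalence bound

    p_lower <- t.test(sim, obs, mu = -delta, alternative = "greater")$p.value
    p_upper <- t.test(sim, obs, mu = delta, alternative = "less")$p.value
    max(p_lower, p_upper)  # equivalent within ±delta if below e.g. 0.05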

  My favourite approach however, having mulled this over a good 
deal, is this:

  -   Transform distributions to achieve a normal-ish distribution.
  -   Do a linear regression of observed (original landscape, 
dependent, y) against predicted (simulated, independent, x) values.
  -   Compare those regression models using AIC or some other 
information criterion to find the regression which minimises 
variance between predicted and observed values. The simulated 
values used in that regression are from the “best” simulation. 
Because the observed values are the response in every model, the 
AIC values are directly comparable. You could also report R^2 
values, which should be highest in the model with the lowest AIC. 
Note it is generally accepted that if two models are within 2 AIC 
points of each other, you cannot say that one is better than the 
other, so you may end up with multiple equally suitable 
simulations.

  I’ve included an R script which does something similar with a 
bunch of fake normal-ish data.

  For data visualisation, I would:

  -   Transform distributions to achieve a normal-ish distribution.
  -   Make an interval plot or a boxplot showing the mean and 
standard deviation of each simulated distribution compared to the 
true mean.

  My attempt is also in the R script.

  Thanks for the interesting problem. Let me know what you end up 
using. John

Here is the R script I referenced:

    # Which elevation distribution is closest to the truth?
    # John Godlee (johngodlee@gmail.com)
    # 2020-10-05

    set.seed(20201005)

    # Packages
    library(dplyr)
    library(tidyr)
    library(ggplot2)

    # Create data
    distrib_list <- list(
      obs = rnorm(1000, 144, 5), 
      el1 = rnorm(1000, 110, 5), 
      el2 = rnorm(1000, 120, 5), 
      el3 = rnorm(1000, 130, 5),
      el4 = rnorm(1000, 140, 5), 
      el5 = rnorm(1000, 150, 5), 
      el6 = rnorm(1000, 160, 5)
    )

    ##' 6 simulated distributions and 1 observed distribution. 
    ##' Each has:
    ##'   1000 points,
    ##'   a standard deviation of 5,
    ##'   a varying mean

    # Create density plot 
    do.call(cbind, distrib_list) %>%
      as.data.frame(.) %>%
      gather(., key, value) %>%
      ggplot(.) + 
        geom_line(stat = "density",
          aes(x = value, colour = key))

    ##' Expect that `el4` is most similar to `obs`

    # Run linear regressions: observed (y) against each simulated (x),
    # so every model shares the same response and AICs are comparable
    mod_list <- lapply(distrib_list[-1], function(x) { 
      lm(distrib_list$obs ~ x)
    })

    # Extract AIC, BIC, R^2 as numeric columns
    mod_df <- data.frame(name = names(distrib_list[-1]))
    mod_df$aic <- sapply(mod_list, AIC)
    mod_df$bic <- sapply(mod_list, BIC)
    mod_df$rsq <- sapply(mod_list, function(x) summary(x)$r.squared)

    ##' There are loads of other metrics to choose from

    # Look at the extracted metrics
    mod_df

    # Visualise data
    do.call(cbind, distrib_list) %>%
      as.data.frame(.) %>%
      gather(., key, value) %>% 
      mutate(sim = grepl("el", key)) %>%  # flag simulated distributions
      ggplot(., aes(x = key, y = value)) + 
        geom_boxplot(aes(colour = sim), fill = NA) +
        stat_summary(aes(fill = sim), 
          fun = mean, geom = "point", shape = 21, size = 5, 
          colour = "black") + 
        theme(legend.position = "none")