TITLE: `BIOMASS::getWoodDensity()` description
DATE: 2022-11-23
AUTHOR: John L. Godlee
====================================================================


I looked through the code for the getWoodDensity() function from 
the {BIOMASS} R package, to get a better idea of how it estimates 
tree wood density.

  [R package]: 
https://cran.r-project.org/web/packages/BIOMASS/index.html

The function uses the wood density database from the Zanne et al. 
(2009) global wood density database. The database contains 16467 
records from 8388 taxa.

  [Zanne et al. (2009) global wood density database]: 
https://datadryad.org/stash/dataset/doi:10.5061/dryad.234

You feed the function a set of species names, and optionally family 
names, stand IDs (plots), extra wood density data supplied by the 
user, and a geographic region with which to subset the wood density 
database.

First, getWoodDensity() filters to wood density database to only 
include the families, genera, or species present in the wood 
density database. This means that an entry in the input data where 
family = "Fabaceae", genus = "Brachystegia", and species = 
"spiciformis", the filtered wood density data will include records 
from the species Acacia acuminata, because this species is in the 
Fabaceae family.

Then, getWoodDensity() calculates the mean wood density for each of 
those species. These species-level estimates are used when there is 
a match in the input data.

The function then calculates mean wood density for each genus by 
taking the mean of the species-level means calculated in the 
previous step. This means that in the case where species within a 
genus have more than one record, the genus-level mean calculated by 
the function will be slightly different to the grand mean 
calculated from the raw wood density records. This is a good 
example of Simpson's paradox. Additionally, the values of nInd 
generated for these genus-level estimates are the number of 
species-level means, not the number of raw records. These 
genus-level means are used when there is a match in the input data 
that hasn't yet been filled by a species-level estimate.

  [Simpson's paradox]: 
https://en.wikipedia.org/wiki/Simpson%27s_paradox

For genus-level indets (e.g. Fabaceae indet) the function 
calculates mean wood density for the family, again by taking the 
mean of genus-level estimates calculated in the previous step. 
Again, these family-level means are used where there is a match in 
the input data that doesn't yet have an estimate of wood density.

For family-level indets (i.e. no taxonomic information provided) it 
calculates mean wood density for the plot, using the best wood 
density estimates generated for each individual in previous steps. 
This means that if only a single individual in the plot has a wood 
density estimate, this value will be used as the wood density 
estimate for all other individuals on the plot.

In the case where no wood density estimates have been generated for 
that plot, i.e. no stems on the plot have taxonomic information, 
the mean of all other best wood density estimates generated for 
other plots is used.

In the case where no wood density estimates have been generated for 
any plots, because none of the submitted taxa have records in the 
database, the function fails.

The way the function is written it makes sense that if you are 
generating wood density estimates for multiple sites, which might 
differ in their species pool, you should run the function site by 
site in a loop or a lapply(), so that the wood density estimates 
generated at genus-level and above are calculated using species 
which are likely to occur in that site.

In the event that the function fails due to lack of data, it might 
still be possible to generate an estimate of wood density for the 
site, either the regional mean, or even the global mean from the 
wood density database.

I think the function might be doing the wrong thing by calculating 
means of means, rather than the grand mean. For example, imagine a 
genus with two species. Species "A" has 1000 wood density 
measurements with a mean of 2.56. Species "B" has 5 measurements 
with a mean of 1.5. The function is estimating the wood density of 
an average species within the genus (2.56 + 1.5)/2. If we instead 
take the grand mean ( (1000*2.56) + (5*1.5) ) / 1005 we are 
estimating the wood density of an average individual. As the 
function eventually applies these estimates to individuals from the 
input data, it makes more sense to me to calculate the grand mean.