TITLE: `BIOMASS::getWoodDensity()` description DATE: 2022-11-23 AUTHOR: John L. Godlee ==================================================================== I looked through the code for the getWoodDensity() function from the {BIOMASS} R package, to get a better idea of how it estimates tree wood density. [R package]: https://cran.r-project.org/web/packages/BIOMASS/index.html The function uses the wood density database from the Zanne et al. (2009) global wood density database. The database contains 16467 records from 8388 taxa. [Zanne et al. (2009) global wood density database]: https://datadryad.org/stash/dataset/doi:10.5061/dryad.234 You feed the function a set of species names, and optionally family names, stand IDs (plots), extra wood density data supplied by the user, and a geographic region with which to subset the wood density database. First, getWoodDensity() filters to wood density database to only include the families, genera, or species present in the wood density database. This means that an entry in the input data where family = "Fabaceae", genus = "Brachystegia", and species = "spiciformis", the filtered wood density data will include records from the species Acacia acuminata, because this species is in the Fabaceae family. Then, getWoodDensity() calculates the mean wood density for each of those species. These species-level estimates are used when there is a match in the input data. The function then calculates mean wood density for each genus by taking the mean of the species-level means calculated in the previous step. This means that in the case where species within a genus have more than one record, the genus-level mean calculated by the function will be slightly different to the grand mean calculated from the raw wood density records. This is a good example of Simpson's paradox. Additionally, the values of nInd generated for these genus-level estimates are the number of species-level means, not the number of raw records. These genus-level means are used when there is a match in the input data that hasn't yet been filled by a species-level estimate. [Simpson's paradox]: https://en.wikipedia.org/wiki/Simpson%27s_paradox For genus-level indets (e.g. Fabaceae indet) the function calculates mean wood density for the family, again by taking the mean of genus-level estimates calculated in the previous step. Again, these family-level means are used where there is a match in the input data that doesn't yet have an estimate of wood density. For family-level indets (i.e. no taxonomic information provided) it calculates mean wood density for the plot, using the best wood density estimates generated for each individual in previous steps. This means that if only a single individual in the plot has a wood density estimate, this value will be used as the wood density estimate for all other individuals on the plot. In the case where no wood density estimates have been generated for that plot, i.e. no stems on the plot have taxonomic information, the mean of all other best wood density estimates generated for other plots is used. In the case where no wood density estimates have been generated for any plots, because none of the submitted taxa have records in the database, the function fails. The way the function is written it makes sense that if you are generating wood density estimates for multiple sites, which might differ in their species pool, you should run the function site by site in a loop or a lapply(), so that the wood density estimates generated at genus-level and above are calculated using species which are likely to occur in that site. In the event that the function fails due to lack of data, it might still be possible to generate an estimate of wood density for the site, either the regional mean, or even the global mean from the wood density database. I think the function might be doing the wrong thing by calculating means of means, rather than the grand mean. For example, imagine a genus with two species. Species "A" has 1000 wood density measurements with a mean of 2.56. Species "B" has 5 measurements with a mean of 1.5. The function is estimating the wood density of an average species within the genus (2.56 + 1.5)/2. If we instead take the grand mean ( (1000*2.56) + (5*1.5) ) / 1005 we are estimating the wood density of an average individual. As the function eventually applies these estimates to individuals from the input data, it makes more sense to me to calculate the grand mean.