TITLE: Learning Julia
DATE: 2021-04-10
AUTHOR: John L. Godlee
====================================================================


Julia is a programming language I've been wanting to learn for a 
while. I've been encountering larger and larger datasets during my 
PhD, both with terrestrial LiDAR data and huge species-by-site 
diversity matrices. Running everything in R is becoming restrictive 
when I want to crunch lots of data within a reasonable amount of 
time.

Julia promises that it looks like Python, feels like Lisp, and runs 
like C, supposedly straddling that trade-off between speed, 
expressiveness, and generalisability.

I watched this Youtube tutorial to get me started on the basic 
syntax, which is quite pleasant to read, and seems familiar to me 
having a background in R with some Python and shell-scripting. I 
also found that the official Julia documentation is pretty useful 
for understanding some of the finer points in more detail.

  [Youtube tutorial]: https://www.youtube.com/watch?v=8h8rQyEpiZA
  [official Julia documentation]: https://docs.julialang.org/en/v1/

As my first practice I aimed to calculate the Shannon and Simpson 
diversity indices for a huge species by site matrix.

I created the matrix in R and wrote it to a .csv file:

    mat <- matrix(sample(0:1000, 10^7, replace = TRUE), nrow = 
100000) 

    # Write matrix for use in Julia
    write.csv(mat, "mat.csv", row.names = FALSE)

Then in R I used the {vegan} package to calculate the Shannon and 
Simpson diversity indices, and the {microbenchmark} package to time 
how long it took:

    library(vegan)

    div <- function(x) {
      shannon <- diversity(x, "shannon")
      simpson <- diversity(x, "simpson")
      return(list(shannon, simpson))
    }

    microbenchmark(
      div(mat),
      times = 100)

The mean time to complete was 2167 milliseconds.

In Julia the overhead is a bit larger, just because I'm writing my 
own functions for the diversity indices:

    # Packages
    using CSV 
    using BenchmarkTools
    using Tables

    # Import matrix from .csv
    mat = CSV.File("mat.csv") |> Tables.matrix

    # Define Shannon function
    function shannon(x)
        xno = [i for i=x if i != 0]
        N = sum(xno)
        p = xno / N
        return -sum(p .* log.(p))
    end

    # Define Simpson function
    function simpson(x) 
        N = sum(x)
        p = x / N
        return sum(p.^2)
    end

    # Iterate over columns
    function testfunc()
        shanout = []
        simpout = []
        for col in eachcol(mat)
            push!(shanout, shannon(col))
            push!(simpout, simpson(col))
        end
    end

    # Benchmark
    @benchmark testfunc()

The median time to complete was only 521 ms.

The main issue I'm having at the moment which is tripping me up a 
lot is the way Julia handles reassigning variables. While in R I 
could do something like this:

    x <- c(1,2,3)
    y <- x
    y[1] <- 10
    y != x

and have the last line evaluate as true, in Julia unless I use 
copy() when reassigning the variable, y will continue to equal x:

    x = [1,2,3]
    y = x
    y[1] = 10
    y != x

    y = copy(x)
    y[1] = 100
    y != x