TITLE: Network graph of R package usage
DATE: 2021-06-25
AUTHOR: John L. Godlee
====================================================================


I wanted to know which R packages I use the most in my work, just 
as a little toy exercise in data wrangling and network 
visualisation.

I searched through all the R scripts on my laptop using ripgrep, to 
find which packages I use:

  [ripgrep]: https://github.com/BurntSushi/ripgrep

    rg "^library(.*)$" -g *.R -g '!Library/*' > packages.txt

The -g glob excludes any files in ~/Library, because these are 
email attachments which sometimes aren't mine, and when they are 
mine they're often duplicates of a script already stored on my 
computer somewhere else.

Then in R, I can import the results and start analysing them:

    # Packages
    library(dplyr)
    library(ggplot2)
    library(GGally)
    library(network)

    # Import data
    dat_raw <- readLines("packages.txt")

The first thing is to separate the filepaths from the package names:

    # Extract file paths from lines
    paths <- gsub(":.*", "", dat_raw)

    # Check all paths are valid
    stopifnot(all(grepl(".R$", paths)))

    # Extract packages from lines
    packages <- gsub(".*library\\s?\\(\"?([A-z0-9.]+)\"?\\).*", 
"\\1", dat_raw)

    # Check number of paths = number of packages
    stopifnot(length(paths) == length(packages))

    # Create dataframe
    dat <- unique(data.frame(paths, packages)) 

The unique() removes some packages which were mistakenly called 
multiple times in the same script.

To find my most used packages, I created a bar graph:

    pack_freq_summ <- dat %>%
      group_by(packages) %>%
      tally() %>% 
      mutate(packages = factor(packages, levels = 
rev(packages[order(n)]))) %>%
      arrange(desc(n)) %>%
      slice_head(n = 10)

    ggplot() + 
      geom_bar(data = pack_freq_summ, 
        aes(x = packages, y = n), 
        colour = "black", fill = "darkgrey", stat = "identity") + 
      theme_bw() +
      labs(x = "Package", y = "Frequency")

  ![Bar plot of 10 most used R 
packages](https://johngodlee.xyz/img_full/r_packages/package_freq_ba
r.png)

Next, I wanted to create a network graph. I wanted to visualise 
which packages were used the most, which packages were used in 
conjunction with each other, and which packages are most commonly 
used in conjunction.

First, split the dataframe by R script, and remove scripts which 
only called one package.

    # Split by file
    dat_split <- split(dat, dat$paths)

    # Remove files with only one package
    dat_split_fil <- dat_split[unlist(lapply(dat_split, nrow)) > 1]

Then for each R script, use expand.grid() to create pairwise 
combinations of packages and count their frequency, then use some 
{dplyr} to clean up the results, so I'm left with a dataframe with 
three columns, from, to, and weight, where from and to are pairs of 
packages, and weight counts the number of times they are called in 
the same script:

    # Create matrix of packages by co-occurrence in files
    edge_mat <- do.call(rbind, lapply(dat_split_fil, function(x) {
      expand.grid(x$packages, x$packages)
        })) %>%
      filter(Var1 != Var2) %>%
      group_by(Var1, Var2) %>%
      tally() %>%
      rename(from = Var1, to = Var2, weight = n) %>%
      mutate(
        from = as.character(from), 
        to = as.character(to)) %>%
      group_by(grp = paste(pmax(from, to), pmin(from, to), sep = 
"_")) %>%
      slice(1) %>%
      ungroup() %>%
      select(-grp) 

Then, I create a network object and add attributes so that the 
nodes are weighted by frequency, and the edges are weighted by 
co-occurrence frequency:

    # Create network object
    net <- as.network(edge_mat, directed = FALSE)

    # Add vertex attribute, number of times package is used
    vertex_weight <- dat %>%
      group_by(packages) %>%
      tally() %>%
      as.data.frame()

    net %v% "vweight" <- vertex_weight[
      match(net %v% "vertex.names", vertex_weight$packages),"n"]

    # Add edge attribute, colors by number of times packages used 
in conjunction
    colfunc <- colorRampPalette(c("lightgray", "blue"))

    net %e% "edgecol" <- as.character(cut(
        log(net %e% "weight"), breaks = 5, labels = colfunc(5)))

And finally, create a circular network graph, where nodes are sized 
according to frequency and edges are coloured according to 
co-occurrence frequency:

    # Create plot
    ggnet2(net, mode = "circle", 
      color = "#ffc780", size = "vweight",
      label = TRUE, label.size = 2,
      edge.col = "edgecol")

  ![Network graph of package 
co-occurrence](https://johngodlee.xyz/img_full/r_packages/packages_n
et_plot.png)

I'm not totally happy with the edge colouring, but it's difficult 
because many packages only occur together once, while a few, e.g. 
{dplyr} occur in almost every script, so there's a very wide range 
of values.