Plotting an skbio ordination using R

EDIT (10/12/2020): Updated code to properly label y-axis as PC2.

Recently a friend of mine asked if I had any experience importing an skbio.stats.ordination object in R. I hadn’t, unfortunately, as I usually work with QIIME2 & Python in tandem. I do expect that I’ll be using R more often as my PhD progresses so I decided to take a stab at a very simple way to extract and plot the data from this format.

When you export an skbio ordination object, the text file contains entries for the eigenvalues, proportion explained, and sites (analogous to samples in what I do). Additionally, if the ordination is a biplot, it will contain entries for species as well (analogous to microbes in what I do). There are also spaces for “Biplot” and “Site constraints” which I’m not familiar with. These entries are basically stored in sub-TSVs so this snippet just takes each of these set of values and extracts them to their own data frame. Then I wrote some rudimentary code to plot the samples as points and features as arrows. Of course, this can be snazzed up with fancy ggplot2 accoutrements but for this demonstration I kept it relatively simple.

biplot_R

library(ggplot2)

get_ord_dataframe <- function(lines, linestart){
  header_line <- strsplit(lines[linestart], split="\t")[[1]]
  num_rows = as.numeric(header_line[2])
  num_cols = as.numeric(header_line[3])
  if (num_rows == 0){return(data.frame())}

  data <- strsplit(
    lines[(linestart+1):(linestart+num_rows)],
    split="\t"
  )
  names <- unlist(lapply(data, function(x) x[1]))
  coords <- lapply(
    data,
    function(x) strsplit(x[2:(2+num_cols-1)], split="\t")
  )
  coords <- data.frame(
    matrix(
      unlist(coords),
      nrow=length(coords),
      byrow=T
    ),
    stringsAsFactors=FALSE
  )
  coords[] <- lapply(coords, as.numeric) #make numeric
  rownames(coords) <- names
  return(coords)
}

lines <- readLines("results/tutorial/data/ordination.txt")
# format is as follows Eigvals, Proportion explained,
# Species, Site, Biplot, Site constraints

eigvals <- as.numeric(strsplit(lines[2], split="\t")[[1]])
num_axes <- length(eigvals)
prop_explained <- as.numeric(strsplit(lines[5], split="\t")[[1]])

species_linestart = min(grep("^Species", lines))
species_coords <- get_ord_dataframe(lines, species_linestart)
norm_vec <- function(x) sqrt(sum(as.numeric(x)^2))
species_coords$magnitude <- apply(species_coords, 1, norm_vec)
species_coords <- species_coords[order(-species_coords$magnitude),]

sites_linestart = min(grep("^Site", lines))
sites_coords <- get_ord_dataframe(lines, sites_linestart)

biplot_linestart = min(grep("^Biplot", lines))
biplot_coords <- get_ord_dataframe(lines, biplot_linestart)

constraints_linestart = min(grep("^Site constraints", lines))
constraints_coords <- get_ord_dataframe(lines, constraints_linestart)

ggplot(sites_coords, aes(x = X1, y = X2)) +
  geom_point() +
  geom_segment(
    data=species_coords[1:8,],
    aes(x=0, y=0, xend=X1, yend=X2),
    arrow = arrow(length=unit(0.3, "cm"))
  ) +
  xlab("PC1") +
  ylab("PC2") +
  ggtitle("DEICODE Moving Pictures Biplot")

ggsave("deicode_mp_biplot.png")