©2017 JD Watrous, M Henglin, B Claggett, S Cheng, M Jain, Brigham and Women’s Hospital and UCSD, all rights reserved.

citation: TBA

# Data manipulation
library(magrittr)
library(purrr)
library(dplyr)
library(tidyr)

# Plotting
library(ggplot2)

# Shape-based clustering
library(dtwclust)

We create some sample data here. The rest of this documentation uses real data (loaded in the background); the data generated below only illustrates the structure of that data.

plate <-
  1:10

well <-
  1:10

metaboliteID <-
  runif(10, 100, 900)

while (length(unique(metaboliteID)) != length(metaboliteID)) {
  metaboliteID <-
    runif(10, 100, 900)
}

dat <-
  expand.grid(plate, well, metaboliteID) %>%
  setNames(c('plate', 'well', 'metaboliteID'))


dat %<>%
  mutate(plateWellID = paste0(plate, '_', well),
         value = runif(n(), 1000, 1000000)) %>%
  # as_tibble() replaces tbl_df(), which is deprecated in current dplyr
  as_tibble()

dat
## # A tibble: 1,000 × 5
##    plate  well metaboliteID plateWellID      value
##    <int> <int>        <dbl>       <chr>      <dbl>
## 1      1     1      460.749         1_1   5725.919
## 2      2     1      460.749         2_1   7731.122
## 3      3     1      460.749         3_1 634977.748
## 4      4     1      460.749         4_1 889502.918
## 5      5     1      460.749         5_1  66150.159
## 6      6     1      460.749         6_1  18650.618
## 7      7     1      460.749         7_1 561326.297
## 8      8     1      460.749         8_1 562686.695
## 9      9     1      460.749         9_1 852415.995
## 10    10     1      460.749        10_1 931310.006
## # ... with 990 more rows

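For our sample data, the columns are plate (the plate number), well (the well number within a plate), metaboliteID (a numeric identifier for each metabolite), plateWellID (the plate and well numbers joined by an underscore), and value (the measured value for that metabolite in that sample).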

Once again, this is data generated only to give an idea of what our data looks like. It is not used beyond this point.

We measure samples on 96-well plates. Within a plate, wells are run one at a time before moving on to the next plate. If a metabolite is unmeasured in every well of a plate, we label that plate as misaligned for that metabolite. We then average these plate-level flags to obtain a mean misalignment score for each metabolite across the entire dataset, that is, the fraction of plates on which the metabolite was misaligned.

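misalign <- 
  dat %>%
  # Flag each metabolite-plate pair in which every value is missing
  group_by(metaboliteID, plate) %>% 
  summarise(misaligned = all(is.na(value))) %>% 
  ungroup() %>% 
  # Replace the flag with the fraction of flagged plates per metabolite
  group_by(metaboliteID) %>%
  mutate(misaligned = mean(misaligned)) %>%
  ungroup()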


dat %<>% left_join(misalign, by = c('metaboliteID', 'plate'))
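
The generated sample data contains no missing values, so every metabolite in it would score 0. To make the calculation easier to follow, here is a minimal sketch on a hypothetical toy table (the names toy, toyMisalign and plateMisaligned are illustrative only and are not part of our pipeline). It assumes dplyr 1.0.0 or later for the .groups argument, and it collapses to one row per metabolite rather than keeping one row per plate.

```r
library(dplyr)

# Hypothetical toy data: 2 metabolites x 2 plates x 2 wells.
toy <- tibble(
  metaboliteID = rep(c("A", "B"), each = 4),
  plate        = rep(rep(1:2, each = 2), times = 2),
  well         = rep(1:2, times = 4),
  value        = c(10, 12, NA, NA,  # A: plate 1 measured, plate 2 entirely NA
                   20, NA, 22, 21)  # B: one missing well, but no plate is all NA
)

toyMisalign <-
  toy %>%
  group_by(metaboliteID, plate) %>%
  # TRUE only when every well on the plate is NA for this metabolite
  summarise(plateMisaligned = all(is.na(value)), .groups = "drop") %>%
  group_by(metaboliteID) %>%
  # Fraction of plates flagged as misaligned for each metabolite
  summarise(misaligned = mean(plateMisaligned), .groups = "drop")

toyMisalign
```

Metabolite A is missing on one of its two plates and scores 0.5. Metabolite B has a single missing well but no fully missing plate, so it scores 0.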

Here we show two plots, one a smaller subset of the other to show greater detail. Text has been removed to mask data identifiers. We evaluate misalignment by plotting the measured values against the chronological order in which individual samples were run (in this case, the plateWellID is structured so that it plots in chronological order). We then look for signs of excess or unusual signal variation over time.

Note: an expected warning will appear, indicating that points with NA values, which represent a metabolite that was not measured in a particular sample, are omitted from the plot.

metabsToPlot <-
  dat %>%
  distinct(metaboliteID) %>%
  sample_n(100)

blankThm <-
  theme(
    text = element_blank(),
    strip.background = element_blank(),
    axis.ticks = element_blank()
  )

dat %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  blankThm
## Warning: Removed 359384 rows containing missing values (geom_point).
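
The plot above relies on plateWellID sorting into run order. Identifiers built as plain text with paste0, as in the generated sample data, do not sort that way on their own: character sorting places "10_1" before "2_1". The following is a minimal sketch, not part of our pipeline, of one way to impose run order. It assumes the run order is plate first, then well within plate; the names ids, plateNum, wellNum, runOrder and idsOrdered are hypothetical.

```r
# Hypothetical identifiers in the same 'plate_well' form as the sample data
ids <- c("1_1", "2_1", "10_1", "1_2", "10_2", "2_2")

# Split each identifier into its numeric plate and well parts
plateNum <- as.integer(sub("_.*$", "", ids))
wellNum  <- as.integer(sub("^.*_", "", ids))

# Assumed run order: by plate, then by well within plate
runOrder <- ids[order(plateNum, wellNum)]

# A factor whose levels follow run order; ggplot2 draws discrete
# x values in level order
idsOrdered <- factor(ids, levels = runOrder)

levels(idsOrdered)
```

Using a factor like idsOrdered in place of the character column as the x aesthetic would keep the points in run order.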