Combining Multiple Single cell RNA-seq Datasets: Part I

Since single cell RNA-seq becomes an affordable and accessible technology to any lab around the world, there have been a lot of datasets generated (for a list of existing datasets, visit the Resource page). The number of cells in these datasets ranges from hundreds to thousnads, to even tens of thousands. To use all of these datasets more efficiently, it is a good idea to combine them.

Bigger number of cells will lead to more accurate representation of cell types and time points, which are difficult to achieve in a single experiment or by a single lab - the benefits of combining datasets are obvious. But what are the criteria that need to be implemented in order to successfully combine the datasets? What kind of analyses can we do after combining the datasets? I will explore these and related topics in the blog posts.

To start, I chose to work on the BrainSpan single cell dataset and a dataset published by Nowakowski et al. in 2017. I chose these datasets because both of them were generated using the Fluidigm C1 Single Cell Auto Prep System (C1), which was the first commercialized single cell cDNA library preparation platform. Therefore, combining these datasets should be relatively straightforward. But we still need to normalize the datasets and bring with to the same scale, otherwise the combined datasets will not make much sense. Let's start with data preparation. I will use the Seurat package to manage data.

## Prepare Nowakowski datasets
df0.nowa@meta.data$major_cell_type2 <- df0.nowa@meta.data$major_cell_type
df0.nowa@meta.data$major_cell_type2[df0.nowa@meta.data$major_cell_type2 %in%
                                      c("OPC", "Microglia", "Glyc", "Choroid", "Endothelial",
                                        "Astrocyte", "Mural", "Microglia")] <- "other"
df0.nowa <- SetAllIdent(df0.nowa, id = "major_cell_type2")
## Function to plot individual gene distribution in the two datasets
plotGeneDis <- function(gene, data.type = "normalized"){
  library(reshape2)
  if(data.type=="normalized"){
    res1 <- data.frame(expr = df0.BS@data[gene,], source = "BrainSpan")
    res2 <- data.frame(expr = df0.nowa@data[gene,], source = "nowa")
  } else if(data.type=="scaled"){
    res1 <- data.frame(expr = df0.BS@scale.data[gene,], source = "BrainSpan")
    res2 <- data.frame(expr = df0.nowa@scale.data[gene,], source = "nowa")
  }
  res <- rbind(res1, res2)
  res <- suppressMessages(melt(res, value.name = "expr"))

  p1 <- ggplot(res, aes(x = expr, color = source, fill = source)) +
    geom_histogram(bins = 100,position = "identity", alpha = 0.2) +
    geom_density(alpha=.2) +
    ylim(0, 200) +
    theme(legend.position = c(0.8, 0.8))
  # pdf(file = paste(Sys.Date(), gene, "hist.pdf", sep = "_"), width = 11,
  #     height = 10, useDingbats = F);
  return(p1)
  # dev.off()
}
## Plot gene expression distribution of the two datasets
p1 <- plotGeneDis("PAX6")
p2 <- plotGeneDis("BCL11B")
p3 <- plotGeneDis("GAD1")
p4 <- plotGeneDis("OLIG1")
plot_grid(p1, p2, p3, p4, nrow = 2)

As I can see, the distribution of the normalized expression levels in Nowakowski and BrainSpan datasets are more or less similar. Are the scaled data more similar?

## Plot gene expression distribution of the two datasets
p1 <- plotGeneDis("PAX6", data.type =  "scaled")
p2 <- plotGeneDis("BCL11B", data.type = "scaled")
p3 <- plotGeneDis("GAD1", data.type = "scaled")
p4 <- plotGeneDis("OLIG1", data.type = "scaled")
plot_grid(p1, p2, p3, p4, nrow = 2)

It looks like even the scaled data doesn’t show comparible distribution. It needs to be scaled differently than the default way in Seurat.

Note: Pay attention to the normalization factor. Maybe use normalized data for the combination instead of scaled data.

LS0tCnRpdGxlOiAiMjAxOC0xMC0yMSIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKVG9kYXksIEkgd2FudCB0byBzdGFydCB3b3JraW5nIG9uIHByZWRpY3QgY2VsbCB0eXBlcyBvZiBzaW5nbGUgY2VsbHMgd2l0aCBleGlzdGluZyBkYXRhc2V0cyAoTm93YWtvd3NraSBldCBhbC4sIDIwMTcgYW5kIEJyYWluU3BhbikuCgpgYGB7ciBlY2hvPUZ9CiMjIyMgMS4gUmVhZCBCcmFpblNwYW4gYW5kIE5vd2Frb3dza2kgc2luZ2xlIGNlbGwgZGF0YSAjIyMjCmRmMC5CUyA8LSByZWFkUkRTKCJ+L3poZW4vZGF0YS9pUFNDX2JhY2t1cC9kZjAuQlMucmRzIikKZGYwLm5vd2EgPC0gcmVhZFJEUygifi96aGVuL2RhdGEvaE9yZy9kZjAubm93YS5yZHMiKQpgYGAKCmBgYHtyIHJlc3VsdHMgPSAiaGlkZSJ9CiMjIE5vcm1hbGl6ZSBhbmQgc2NhbGUgZGF0YXNldHMKZGYwLm5vd2EgPC0gTm9ybWFsaXplRGF0YShvYmplY3QgPSBkZjAubm93YSwgCiAgICAgICAgICAgICAgICAgICAgIG5vcm1hbGl6YXRpb24ubWV0aG9kID0gIkxvZ05vcm1hbGl6ZSIsIAogICAgICAgICAgICAgICAgICAgICBzY2FsZS5mYWN0b3IgPSAxMDAwMDApCmRmMC5ub3dhIDwtIFNjYWxlRGF0YShkZjAubm93YSwgdmFycy50by5yZWdyZXNzID0gIm5VTUkiKQoKcm93bmFtZXMoZGYwLkJTQHJhdy5kYXRhKSA8LSBtYWtlLm5hbWVzKHN1YnN0cihyb3duYW1lcyhkZjAuQlNAcmF3LmRhdGEpLCAxNywgMTAwKSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICB1bmlxdWUgPSBUKQpkZjAuQlMgPC0gTm9ybWFsaXplRGF0YShvYmplY3QgPSBkZjAuQlMsIAogICAgICAgICAgICAgICAgICAgICBub3JtYWxpemF0aW9uLm1ldGhvZCA9ICJMb2dOb3JtYWxpemUiLCAKICAgICAgICAgICAgICAgICAgICAgc2NhbGUuZmFjdG9yID0gMTAwMDAwKQpkZjAuQlMgPC0gU2NhbGVEYXRhKGRmMC5CUywgdmFycy50by5yZWdyZXNzID0gIm5VTUkiKQpgYGAKCmBgYHtyfQojIyBQcmVwYXJlIE5vd2Frb3dza2kgZGF0YXNldHMKZGYwLm5vd2FAbWV0YS5kYXRhJG1ham9yX2NlbGxfdHlwZTIgPC0gZGYwLm5vd2FAbWV0YS5kYXRhJG1ham9yX2NlbGxfdHlwZSAKZGYwLm5vd2FAbWV0YS5kYXRhJG1ham9yX2NlbGxfdHlwZTJbZGYwLm5vd2FAbWV0YS5kYXRhJG1ham9yX2NlbGxfdHlwZTIgJWluJSAKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBjKCJPUEMiLCAiTWljcm9nbGlhIiwgIkdseWMiLCAiQ2hvcm9pZCIsICJFbmRvdGhlbGlhbCIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAiQXN0cm9jeXRlIiwgIk11cmFsIiwgIk1pY3JvZ2xpYSIpXSA8LSAib3RoZXIiCmRmMC5ub3dhIDwtIFNldEFsbElkZW50KGRmMC5ub3dhLCBpZCA9ICJtYWpvcl9jZWxsX3R5cGUyIikKYGBgCgpgYGB7cn0KIyMgRnVuY3Rpb24gdG8gcGxvdCBpbmRpdmlkdWFsIGdlbmUgZGlzdHJpYnV0aW9uIGluIHRoZSB0d28gZGF0YXNldHMKcGxvdEdlbmVEaXMgPC0gZnVuY3Rpb24oZ2VuZSwgZGF0YS50eXBlID0gIm5vcm1hbGl6ZWQiKXsKICBsaWJyYXJ5KHJlc2hhcGUyKQogIGlmKGRhdGEudHlwZT09Im5vcm1hbGl6ZWQiKXsKICAgIHJlczEgPC0gZGF0YS5mcmFtZShleHByID0gZGYwLkJTQGRhdGFbZ2VuZSxdLCBzb3VyY2UgPSAiQnJhaW5TcGFuIikKICAgIHJlczIgPC0gZGF0YS5mcmFtZShleHByID0gZGYwLm5vd2FAZGF0YVtnZW5lLF0sIHNvdXJjZSA9ICJub3dhIikKICB9IGVsc2UgaWYoZGF0YS50eXBlPT0ic2NhbGVkIil7CiAgICByZXMxIDwtIGRhdGEuZnJhbWUoZXhwciA9IGRmMC5CU0BzY2FsZS5kYXRhW2dlbmUsXSwgc291cmNlID0gIkJyYWluU3BhbiIpCiAgICByZXMyIDwtIGRhdGEuZnJhbWUoZXhwciA9IGRmMC5ub3dhQHNjYWxlLmRhdGFbZ2VuZSxdLCBzb3VyY2UgPSAibm93YSIpCiAgfQogIHJlcyA8LSByYmluZChyZXMxLCByZXMyKQogIHJlcyA8LSBzdXBwcmVzc01lc3NhZ2VzKG1lbHQocmVzLCB2YWx1ZS5uYW1lID0gImV4cHIiKSkKICAKICBwMSA8LSBnZ3Bsb3QocmVzLCBhZXMoeCA9IGV4cHIsIGNvbG9yID0gc291cmNlLCBmaWxsID0gc291cmNlKSkgKwogICAgZ2VvbV9oaXN0b2dyYW0oYmlucyA9IDEwMCxwb3NpdGlvbiA9ICJpZGVudGl0eSIsIGFscGhhID0gMC4yKSArCiAgICBnZW9tX2RlbnNpdHkoYWxwaGE9LjIpICsKICAgIHlsaW0oMCwgMjAwKSArCiAgICB0aGVtZShsZWdlbmQucG9zaXRpb24gPSBjKDAuOCwgMC44KSkKICAjIHBkZihmaWxlID0gcGFzdGUoU3lzLkRhdGUoKSwgZ2VuZSwgImhpc3QucGRmIiwgc2VwID0gIl8iKSwgd2lkdGggPSAxMSwKICAjICAgICBoZWlnaHQgPSAxMCwgdXNlRGluZ2JhdHMgPSBGKTsKICByZXR1cm4ocDEpCiAgIyBkZXYub2ZmKCkKfQpgYGAKCmBgYHtyIHdhcm5pbmc9RkFMU0UsIG1lc3NhZ2U9RkFMU0UsIGZpZy53aWR0aD0xNiwgZmlnLmhlaWdodD0xNn0KIyMgUGxvdCBnZW5lIGV4cHJlc3Npb24gZGlzdHJpYnV0aW9uIG9mIHRoZSB0d28gZGF0YXNldHMKcDEgPC0gcGxvdEdlbmVEaXMoIlBBWDYiKQpwMiA8LSBwbG90R2VuZURpcygiQkNMMTFCIikKcDMgPC0gcGxvdEdlbmVEaXMoIkdBRDEiKQpwNCA8LSBwbG90R2VuZURpcygiT0xJRzEiKQpwbG90X2dyaWQocDEsIHAyLCBwMywgcDQsIG5yb3cgPSAyKQpgYGAKCkFzIEkgY2FuIHNlZSwgdGhlIGRpc3RyaWJ1dGlvbiBvZiB0aGUgbm9ybWFsaXplZCBleHByZXNzaW9uIGxldmVscyBpbiBOb3dha293c2tpIGFuZCBCcmFpblNwYW4gZGF0YXNldHMgYXJlIG1vcmUgb3IgbGVzcyBzaW1pbGFyLiBBcmUgdGhlIHNjYWxlZCBkYXRhIG1vcmUgc2ltaWxhcj8KCmBgYHtyIHdhcm5pbmc9RkFMU0UsIG1lc3NhZ2U9RkFMU0UsIGZpZy53aWR0aD0xNiwgZmlnLmhlaWdodD0xNn0KIyMgUGxvdCBnZW5lIGV4cHJlc3Npb24gZGlzdHJpYnV0aW9uIG9mIHRoZSB0d28gZGF0YXNldHMKcDEgPC0gcGxvdEdlbmVEaXMoIlBBWDYiLCBkYXRhLnR5cGUgPSAgInNjYWxlZCIpCnAyIDwtIHBsb3RHZW5lRGlzKCJCQ0wxMUIiLCBkYXRhLnR5cGUgPSAic2NhbGVkIikKcDMgPC0gcGxvdEdlbmVEaXMoIkdBRDEiLCBkYXRhLnR5cGUgPSAic2NhbGVkIikKcDQgPC0gcGxvdEdlbmVEaXMoIk9MSUcxIiwgZGF0YS50eXBlID0gInNjYWxlZCIpCnBsb3RfZ3JpZChwMSwgcDIsIHAzLCBwNCwgbnJvdyA9IDIpCmBgYAoKSXQgbG9va3MgbGlrZSBldmVuIHRoZSBzY2FsZWQgZGF0YSBkb2Vzbid0IHNob3cgY29tcGFyaWJsZSBkaXN0cmlidXRpb24uIEl0IG5lZWRzIHRvIGJlIHNjYWxlZCBkaWZmZXJlbnRseSB0aGFuIHRoZSBkZWZhdWx0IHdheSBpbiBTZXVyYXQuCgo+IENvbmNsdXNpb246IG1heWJlIHVzZSBub3JtYWxpemVkIGRhdGEgZm9yIHRoZSBjbGFzc2lmaWNhdGlvbi4gUGF5IGF0dGVudGlvbiB0byB0aGUgbm9ybWFsaXphdGlvbiBmYWN0b3Iu
comments powered by Disqus