Download and align single cell RNA-seq data: Part I

Although post-alignment transcriptomic datasets are readily available, from time to time, one may consider realign the raw reads. The reason for realignment can be 1) to make mapping parameters (especially genome annotations) consistent; 2) adding new analysis that was not part of the original procedure. In any case, it starts from downloading raw data from an online data repository (e.g. GEO) and then align the raw data to a desired genome with appropriate annotation. I will go through the basic processes in this post.

Many people use high performance computing (HPC) clusters nowadays. As single genomic dataset becomes larger and larger, HPC, which in the old days is regarded as a luxury, now becomes almost a necessity. In my professional work, I frequently use supercomputers maintained by the San Diego Supercomputer Center (SDSC). The clusters at SDSC use TORQUE job scheduler, so the job parameters (bits of code following #PBS tag) is specific to TORQUE; other than that, the scripts are applicable to other job schedulers as well.

For this exercise, I will take the single cell dataset from my own study published in 2016 as an example. Here is the code I use to download data:

#!/bin/bash
#PBS -q hotel
#PBS -N genGen
#PBS -l nodes=1:ppn=10
#PBS -l walltime=168:00:00
#PBS -V
#PBS -m abe

cd ~/zhen/data/raw_data/GSE81475/
names=( $(awk '{print $6}' ./SraRunTable.txt) )

for name in ${names[@]}
do
    ( [ -e "./sra/${name}.sra" ] && : || ( echo "Start downloding $name.sra"
      wget -nv -nc -P ./sra/ "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/${name:0:3}/${name:0:6}/${name}/${name}.sra"
      echo "Finished downloading $name.sra" )

      [ -e "./fastq/${name}.fastq" ] && break || echo "Start converting $name.sra to $name.fastq"
      fastq-dump --outdir ./fastq "./sra/${name}.sra"
      echo "Finished converting $name.sra to $name.fastq" ) &
done
wait
echo All processes completed!

Let’s take a look at the details of this script:

First, we can just ignore the following block of codes as they are the parameters for the HPC job scheduler.

#!/bin/bash
#PBS -q hotel
#PBS -N genGen
#PBS -l nodes=1:ppn=10
#PBS -l walltime=168:00:00
#PBS -V
#PBS -m abe

Then, I go into an directory that I created to save the files (~/zhen/data/raw_data/GSE81475/).

You may notice in the following line names=( $(awk '{print $6}' ./SraRunTable.txt) ) there is a file called SraRunTable.txt. It lists all sorts of information about all the sequencing runs in the particular experiment, including all the filenames (the sixth column in particular) for all the runs that we need to download.

My data were deposited in Gene Expression Omnibus (GEO), a functional genomics data repository, which was in turn saved in Sequence Read Archive (SRA). The specific series number associated with this dataset is GSE81475.

Near the bottom of the page, there is a link to SRA in the Relations section. Once the link is clicked, it directs to the SRA page listing all the runs. At the top of the page, click Send results to Run selector. Once the page is loaded, in the middle section, there is a button that says RunInfo Table, click on it and you will get the SraRunTable.txt(hoo-ray!).

Then, you can read all the names of the runs that we need to download with names=( $(awk '{print $6}' ./SraRunTable.txt) ). And to download the data, I use the following script with wget function.

  [ -e "./sra/${name}.sra" ] && pass || ( echo "Start downloding $name.sra"
    wget -nv -nc -P ./sra/ "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/${name:0:3}/${name:0:6}/${name}/${name}.sra"
    echo "Finished downloading $name.sra" )

Notice the special directory structure of the SRA database ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/ followed by the first three characters of the run name ${name:0:3}, followed by the first six characters of the run name ${name:0:6}, followed by the full name of the run ${name} and lastly the actually file that is going to be downloaded ${name}.sra.

In the next blog, I will write about how to conduct alignment with STAR aligner.

comments powered by Disqus