Link back to main course page

Introduction

In this workshop, you will get your hands on some Illumina sequence data and begin to analyse it. As in the first workshop, we are using sequencing data for evolutionary analysis, but there are many other uses for next generation sequencing (NGS) data. The different uses of NGS (and 3rd generation) data often start with different ‘wet lab’ processes; for example, RNAseq starts with RNA, not DNA.

But many of these use common bioinformatic steps at the early stages, including:
- the sequencing itself
- quality checking
- data cleaning
- mapping (aligning reads to a reference genome).

They diverge into different analyses after that.


In this workshop, you will:

  • Gain more familiarity with using command line Linux
  • Check the sequence quality using FastQC
  • Clean the sequences using cutadapt
  • Align sequences using bwa to a reference genome
  • Use samtools to manipulate alignment files
  • Understand the fastq and sam file formats

Documenting your work

Keeping track of bioinformatic analysis is as important as taking notes in the lab.

The simplest way to document bioinformatic analyses is to keep a plain text file that acts as a lab book. You should record all the commands you use, comments about why you are running them, and what files they produced (or any errors they return if they fail). Record this with a plain text editor such as Notepad (NOT MS Word or Google Docs), and keep this backed up somewhere.

Keeping records of everything you do helps you to run commands again, because you can copy text from your notes file and paste them into the terminal window. This will also make it easier for you (and us) to work out what has gone wrong when things do not work out. Copying from ms word or google docs will lead to the insertion of invisible characters in your commands.

Start keeping records now! Keep records of everything you do in this workshop (including logging in etc.). Submitting a well-documented lab book file is an essential requirement of your project.

Modern high-throughput sequencers produce huge amounts of sequences

A single lane of Illumina sequencing produces many billions of bases of sequence, and traditional methods of analysing sequences (as we used in Workshop 1) are not feasible for such vast quantities of data. In this practical and the following workshop, you will be analysing Illumina paired-end whole genome re-sequencing data from clinical isolates of the parasite Leishmania infantum. The DNA libraries that have been sequenced comprise random fragments of ~300-500bp. Each of these fragments has been sequenced from each end, hence the “paired-end”, like so:


The workshop

Let’s run through the brief linux tricks presentation first. It’s on the VLE, but we will run through it with the entire class.

If you are not already familiar with the basic linux commands mkdir, rm and cd, please complete the linux tutorial before starting this workshop.

If you get stuck and you don’t understand the syntax of a program, you can always use -h or --help. This will bring up the help page for the program you are trying to use.


Logging into the server

The analyses will be carried out on the linux operating system.

Follow the steps to log in to the biologin.york.ac.uk server as detailed in the Introduction to linux and servers.

As you are working on this within the University, you can use PuTTY (from a PC) or a terminal window (from a Mac).

If you are working on this from outside the University, you will have to use the York VPN through the Pulse Secure application.

First thing: Using Notepad (or another text editor) make a plain text file on your computer called something sensible such as MBiol_Sequence_Analysis_workshop2_notes.txt

You will use this file to keep your notes.

Make sure you have logged into the server before you continue.

If your userid is in the alphabetic range ag1314 to ja1082, you can run the analysis for the workshops (and your projects) on the server called biologin.york.ac.uk

If your username is in the alphabetic range jdh562 to xc1051, you will have to run your analysis for the workshops (and your projects) on a different server called biolnode0.york.ac.uk. To do this type:

ssh biolnode0

It is useful to be able to keep track of whether your jobs are completed or still running. To display the processes you are running on your server, type:

top -u username

You can have several terminal windows (connections to your server) running at the same time. So, open a second connection to your designated server (using putty).

In this second window, go to your scratch space and make a working directory with a sensible name (avoid using spaces in file and directory names), and move into this directory:

cd /scratch/username 
mkdir sensible_directory_name
cd sensible_directory_name 

You are now ‘in’, or working from, this directory. Files you make will appear here.


Loading the required programmes

The linux servers use a modular system: each program is loaded into the current session with the command module load

To load all the software you will need, type the following:

source /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/biolprograms

You can use the cat command to see what programs were loaded:

cat /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/biolprograms

Other key commands you may need to use are as follows:

module list 

This gives you the list of currently loaded modules.

module avail 

This gives a list of all available modules on the server that you can load.

Getting the data files

Copy over the relevant data to this new directory using the cp command using the following format

cp /path/to/files /destination/path

The files you need for this workshop are bundled in a gzipped ‘tar’ file, called workshop2_files.tar.gz

This is stored in the directory /opt/yarcc/data/biology/mbiol_sequence_analysis_2019.

To copy this file to your scratch directory, do this:

cp /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/workshop2_files.tar.gz .

NB: Note the space between workshop2_files.tar.gz and the dot. The dot means the current directory, and can be used in place of the full path of the directory we want to copy our files to.
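If the dot shorthand is new to you, here is a safe little demonstration you can try in a scratch directory. The names demo_src and file.txt are just made-up throwaway names:

```shell
# A tiny demonstration that '.' means "the current directory":
mkdir -p demo_src
echo hello > demo_src/file.txt
cp demo_src/file.txt .      # copy file.txt into the current directory
cat file.txt                # the copy is now here
```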

Decompress the file

tar zxvf workshop2_files.tar.gz

This will create a directory called workshop2_files. You will need to cd into this directory to see its contents. Like so:

cd workshop2_files

Then, list the contents of this directory

ls

You can list the directory contents sorted by date (-t, newest first), together with details such as file size (-l, long format):

ls -lt

In here there is:
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta
A fasta format file containing the Leishmania infantum reference genome

Also, there are six fastq.gz files containing the paired-end Illumina sequence reads from three parasite isolates from Ethiopia:
Sample1_R1.fastq.gz
Sample1_R2.fastq.gz
Sample2_R1.fastq.gz
Sample2_R2.fastq.gz
Sample3_R1.fastq.gz
Sample3_R2.fastq.gz

Note, that these files only contain 10% of the original amount of sequence data. This is to allow you to complete the workshop in a reasonable time, and to avoid you having to sit around waiting for up to 30 minutes for some of the computational steps to complete.

We have called these Samples 1-3 for simplicity.
Question: Why are there two files for each sample?
What might R1 and R2 mean?

*.gz files are compressed files. Look at the structure of a fastq file:

View the first 20 lines, using zcat to unzip

zcat Sample1_R1.fastq.gz | head -n 20 

zcat uncompresses the file; | ‘pipes’ the uncompressed output to head; and head prints the first 20 lines

Use the wc (word count) program in linux to count how many sequence reads are in one of the files, by ‘piping’ zcat output to wc. Specifying -l means it counts only the number of lines in the file

Count lines of file

zcat Sample1_R1.fastq.gz | wc -l   
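The step from a line count to a read count can be sketched with a tiny made-up fastq.gz (demo.fastq.gz, two records); with the real data you would use Sample1_R1.fastq.gz in its place:

```shell
# Build a toy fastq.gz containing two 4-line records
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > demo.fastq.gz

lines=$(zcat demo.fastq.gz | wc -l)  # total number of lines
reads=$(( lines / 4 ))               # each fastq record is exactly 4 lines
echo "$reads reads"                  # → 2 reads
```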

Understanding the format of fastq sequence data
Look up fastq format on the internet; the Wikipedia entry is good
What information does the first line of each sequence contain?
What information does the fourth line of each sequence contain, and how is it encoded?
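For reference, each record in a fastq file occupies exactly four lines. A made-up example record (not taken from these files):

```
@SEQ_ID_1 optional description
ACGTACGTACGTACGTACGT
+
IIIIHHHHGGGGFFFFEEEE
```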

How many sequences are there in each file?
How many megabases of sequence are there for each sample?
The Leishmania genome is 30Mb in size. Calculate the average genome coverage, i.e. on average, how many times has each position in the genome been sequenced?
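The coverage arithmetic can be sketched in the shell. All the numbers below are assumptions for illustration: substitute your own line count (from zcat ... | wc -l) and the read length reported by FastQC:

```shell
# Back-of-envelope coverage calculation (all inputs are assumed values)
lines=4000000                       # assumed: lines in one fastq file
reads=$(( lines / 4 ))              # 4 lines per fastq record
read_len=100                        # assumed Illumina read length
bases=$(( reads * read_len * 2 ))   # x2 because there are R1 and R2 files
genome=30000000                     # Leishmania genome, ~30 Mb
echo "approx coverage: $(( bases / genome ))x"   # → approx coverage: 6x
```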


Data quality check with fastqc

FastQC gives you a quick overview of the quality of your sequences. Use this to analyse each of your fastq files.

Now run FastQC to analyse the quality of one of the fastq files

fastqc something.fastq.gz

You will see that the FastQC run produces an html file.

Repeat the FastQC analysis for the remaining five fastq files. (Hint: Can you use a wildcard here?)


Copying files to and from the server

To view the html files produced by FastQC you will need to transfer the files from the server to your local computer, and then you can view them using a web browser.

You will need to use file-transfer software such as WinSCP (or FileZilla) to connect to the server from your computer and transfer files in a secure manner.

Start WinSCP, and connect to the server by entering the relevant information as shown below:

Click Yes if you see this window popping up:

You will then see a window like this:

The right hand panel shows the files and folders on the linux server.
The left hand panel shows the files and folders on your PC or Mac.

Double click in the right hand panel in the red circle shown above, then enter the full directory path of where your fastqc html files are located, and click OK:

You should now be able to see the fastqc html files you have created. You can copy these files from the server to your PC by dragging them from the right hand panel to an appropriate folder in the left hand panel.

NB: You can also copy files from your PC/Mac to the server by dragging files from the left hand panel to the right hand panel.

Use a web browser to have a look at the fastqc reports.

What do the various sections in the FastQC report mean? Look up the FastQC webpage, in particular the “Example Reports” section.

The sequences are generally of good quality, though some have a Kmer Content problem. But you can trim/eliminate the few poorer quality sequences.


Quality filtering fastq files

You will use a piece of software called cutadapt to do the quality filtering. Cutadapt can do many types of filtering. One important type of filtering it can do is to remove adapter contamination (which we will not be doing) and you can find out more here: http://cutadapt.readthedocs.io/en/stable/guide.html

Use cutadapt to remove bases from the 3’ end of each sequence that have a quality < 20 (how stringent is this?)

Some linux shorthand we use here:
a) The backslash symbol (\) allows one command to be written over multiple lines. It tells the linux machine “don’t run this command yet, it’s not complete yet”. This makes long commands more human-readable; please do this in your lab book.
b) The ampersand (&) tells linux to run the process in the background.
c) The greater-than symbol (>) redirects the output of a command to a file. Any output that would usually come up on the screen can be sent to a file instead.
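All three pieces of shorthand can be tried together in one harmless toy command (shorthand_demo.txt is just a throwaway file name):

```shell
# Line continuation (\), redirection (>) and backgrounding (&) combined
echo This command spans \
  two lines > shorthand_demo.txt &
wait                                # wait for the background job to finish
cat shorthand_demo.txt              # → This command spans two lines
```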

The basic command to filter 3’ (tail) ends by phred score 20 is:

cutadapt -q 20 -o output.fastq input.fastq 

This trims reads using a phred score threshold of 20, keeping only read pairs in which both reads pass the filter:

 cutadapt -q 20 --pair-filter=any \
-o Sample1_R1.q20.fastq.gz \
-p Sample1_R2.q20.fastq.gz \
Sample1_R1.fastq.gz \
Sample1_R2.fastq.gz \
> Sample1_cutadaptq20.log & 

Check the window running top to keep an eye on the running processes and to see when your command finishes. A log file has been created. You can open this file in read-only mode using less; see whether you can understand the content:

less Sample1_cutadaptq20.log

Examine what cutadapt did (why run it, if you don’t examine the results?). You can do this in three ways:
- Looking at the log file
- Listing the files that have been created
- Running FastQC again on the filtered output (Sample1_R1.q20.fastq.gz) to see what difference the filtering made

Now carry out the same quality filtering for the Sample2 and Sample3 fastq files.

NB: Both cutadapt and fastqc take a while to run, which makes it challenging to test which parameters are best to use. We can improve this by making a subset of the data and testing on that. Since each fastq read occupies 4 lines, we can use the linux commands below to extract some lines to use for testing. How many reads will this extract?

zcat Sample1_R1.fastq.gz | head -n 40000 > Sample1_R1.subset.fastq
zcat Sample1_R2.fastq.gz | head -n 40000 > Sample1_R2.subset.fastq

Aligning sequences with BWA

Now that you have clean sequences, you will align the fastq files to the reference genome using the aligner called bwa (see http://bio-bwa.sourceforge.net/ and http://bio-bwa.sourceforge.net/bwa.shtml for details).

Before you align to a reference, you must index that reference:

bwa index TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta & 

NB: By default bwa index uses the “is” algorithm; there are two other index algorithm options, bwtsw and rb2. Leishmania has a smallish genome, so the default algorithm is the best option here. However, for larger or more divergent genomes, you may need to use a different algorithm.

While the reference is indexing you can display the various options available in bwa by typing in:

bwa

To get more detail about one of the displayed options, type in, for example:

bwa mem

The bwa websites will have further details about the available options.

Align the sequence to the reference genome

bwa mem -t 2 \
-R '@RG\tID:S1\tSM:S1\tPL:Illumina' \
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
Sample1_R1.q20.fastq.gz \
Sample1_R2.q20.fastq.gz \
2> Sample1.bwamem.log \
> Sample1.q20.sam & 

NB: the “2>” redirects the progress and error messages (called stderr in linux) to one file (the .log file), while the normal redirection “>” sends the main output (stdout) to the .sam file
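You can see the two streams being separated with a small demonstration (listing.txt and errors.txt are throwaway names; no_such_dir is deliberately missing so that ls writes a message to stderr, and || true simply stops the failed ls from aborting the demo):

```shell
# stdout goes to listing.txt, stderr goes to errors.txt
ls . no_such_dir > listing.txt 2> errors.txt || true
cat errors.txt    # the error message ends up here, not in listing.txt
```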

In the other terminal window you will be able to see your bwa aligner using up the CPU, and also when bwa is finished. This will take ~10 minutes to run; pretty fast for aligning so many sequences. While you are waiting for it to finish, go to http://bio-bwa.sourceforge.net/bwa.shtml to understand what the -t and -R options are doing (we will come back to -R in the next workshop).

If you are still waiting, listing the files in the directory will show you that bwa has started making a sam file. This contains the alignment information. View the last 20 lines:

tail -n 20 Sample1.q20.sam

Try to understand the format of the data. https://samtools.github.io/hts-specs/SAMv1.pdf contains a detailed description of the format; see section 1.4 in particular. The format is similar to the fastq file you looked at previously, but now includes information about where each sequence aligned in the genome.
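One structural feature worth knowing: header lines in a sam file start with '@', while alignment records do not. A minimal made-up sam file (toy.sam, with tab-separated fields, hence printf with \t) shows how to count each:

```shell
# Build a toy sam file: two '@' header lines plus one alignment record
printf '@HD\tVN:1.6\tSO:unsorted\n'  > toy.sam
printf '@SQ\tSN:chr1\tLN:1000\n'    >> toy.sam
printf 'read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> toy.sam

grep -c  '^@' toy.sam    # counts header lines    → 2
grep -vc '^@' toy.sam    # counts alignment records → 1
```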

Once your bwa alignment has finished, view the log file to check that everything ran okay

less Sample1.bwamem.log 

Always look at the log files if something goes wrong. Sometimes your processes may fail because you have used the wrong syntax, or your input files are incorrect in some way. The log files will give you information as to what went wrong so that you can sort it out.

Now carry out the above steps to align Samples 2 and 3 to the reference genome. Note that in addition to changing the file names, in the bwa mem command you will have to change the read group information (given by the -R option). For example, for Sample2, this should be changed to ‘@RG\tID:S2\tSM:S2\tPL:Illumina’


Compressing and sorting bam files

samtools (http://www.htslib.org/doc/samtools.html) is a piece of software for manipulating alignment files. To display the various options available in samtools type in:

samtools

Create bam files from each of the three sam files. A bam file is merely a compressed version of the sam file. This is how you do it for Sample1.q20.sam:

samtools view -bt TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta \
-o Sample1.q20.bam \
Sample1.q20.sam & 

How much space is saved by the compression?

Once the bam file has been created you can delete the sam file, using the rm command. (once a file is deleted you CANNOT get it back; there is no recycle bin)

The sequences in your sam/bam file are in the same order as they were in the fastq files, i.e. randomly ordered with respect to where they map on the reference genome. It makes sense for the subsequent steps, such as SNP genotyping, to have the sequences ordered by where they aligned in the genome.

Sort each of your bam files, like so:

samtools sort -o Sample1.q20.sort.bam \
Sample1.q20.bam & 

NB: you may put the ampersand “&” at the end of any command. This allows the command to run in the background while you get your command prompt back.

List the files in your directory with file details. You will see that further space saving is made by sorting the bam files.

Carry out the sam to bam conversion, and bam file sorting on the remaining two sam files.


A brief summary of your alignment

You can get a crude summary of the quality of your aligned sequence by doing:

samtools flagstat Sample1.q20.sort.bam  

Repeat for each of the sorted bam files.
What do the different statistics mean?
How do the statistics differ between the samples? Is this surprising?

If you wish, more detailed statistics can be made by doing:

samtools stats Sample1.q20.sort.bam \
> Sample1.q20.sort.bam.stats  

and then viewing the stats file with more/less.


Summary

Here’s what you learned, and some questions to discuss with others

  • How to log into a server. Why use a server? Why not just do this on a laptop?
  • How to extract tar files. Why use tar files and gzip?
  • How to check that your fastq files are healthy, and remove bad quality data.
  • How to align reads to a reference. Why did we do this?
  • How to do some things with sam files. What’s a sam file?

More practice

In the interests of running through all these steps in the course of the workshop, you were only provided with fastq files that contain a tenth of the actual amount of sequence.

If you wish to carry out these bioinformatic steps on the full fastq files you can get hold of these files from: /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/workshop2_full_files.tar.gz


The End

That is all for today. More analysis of Illumina sequence data will come in Workshop 3.


Link back to main course page