Link back to main course page


In the previous workshop you:

  • Checked sequence quality using FastQC
  • Cleaned the sequences using cutadapt
  • Aligned sequences using bwa to a reference genome to create BAM alignment files
  • Sorted the BAM alignment files

In this workshop, you will continue with the next steps in the analysis of Illumina re-sequence data as it is used in population genomics. Note that there are many other uses of next generation sequencing data (NGS), and we cannot cover all of these here.

In this session you will:

  • Mark PCR duplicates in BAM alignment files using Picard
  • Use FreeBayes on the BAM files to identify variants (SNPs and indels)
  • Visually inpect read mapping and variant calls with the Integrated Genomics Viewer (IGV)

Background

The data you will be analysing is from clinical isolates of Leishmania donovani from Ethiopia. To save time, you will be processing three samples. In the next workshop you will get access to data from 40 samples for population genomics analyses.


Documenting your work is important.
Keep recording your commands in a plain text editor such as Notepad, and keep backing this up.
As part of your project, you will be required to hand in your notes. So get into the habit now


Logging into the server

Log on to your designated server as you did in Workshop 2 using putty (from a PC) or a terminal window (from a Mac).

If your userid is in the alphabetic range ag1314 to ja1082, you can run the analysis for the workshops (and your projects) on the server called biollogin.york.ac.uk:

ssh userid@biollogin.york.ac.uk

If your username is in the alphabetic range jdh562 to xc1051, you will have you run your analysis for the workshops (and your projects) on a different server called biolnode0.york.ac.uk. To do this, first login in to biollogin, and then connect to biolnode0:

ssh userid@biollogin.york.ac.uk
ssh biolnode0

Once you are logged into your designated server, use the cd command to navigate to the analysis directory you made last time.

cd /scratch/userid/something

Load the modules (i.e. the programs that you will use)

source /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/biolprograms

As in the previous workshop, open a second connection to your designated server in which you can run top to keep an eye on tasks you are running:

top -u userid

How to check if your job have successfully completed

It is important that you check whether any job you have started has completed before trying to run it again (otherwise you may end up running multiple instances of the same command, each of which is trying to write to the same output files). To do this you should:
1. Check your window running top to see whether or not your job is still running
2. See if any output files have been created and whether they contain anything; ls -lt will list files in chronological order with the newest at the top and show the file size.
3. Check the contents of any log files that may be generated.


Getting the data files

The files you will need to complete today’s workshop are here: /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/workshop3_files.tar.gz

Copy this to your current directory (/scratch/userid/something), untar and unzip the files.
If you have forgotten how to do this, have a look the the Getting the data files section of Workshop 2

The files are relatively large, and will take a couple of minutes to copy and unzip.

You should have copied over the quality-filtered, aligned and sorted bam files for three Leishmania samples:
Sample1.30x.q20.sort.bam
Sample2.30x.q20.sort.bam
Sample3.30x.q20.sort.bam

As well as the Leishmania genome fasta file.

These bam files are from the same samples that you used in workshop 2, but contain about three times as much sequence. The average genome coverage of each of these samples is about 30x, which is sufficient for genotyping a diploid individual.


Removing PCR duplicates

It is important to remove/mark PCR duplicate sequences as they are non-independent. You will use the MarkDuplicates tool from Picard Tools, to mark PCR duplicates so that they are not used in the subsequent steps.

Picard can do a lot of other useful things too, see here.

Make a symbolic link to the java executable picard.jar file.

ln -sf /opt/yarcc/data/biology/mbiol_sequence_analysis_2019/picard.jar .

If you list the files in your directory, you will see that you now have a link file called picard.jar that points to the location of the executable picard.jar file. This avoids having to copy the picard program into your directory.

For this workshop you will only call variants on chromosome 1, which is 280kb long (the whole genome of Leishmania is about 30Mb long). This will allow you to complete the variant calling during the workshop.

First index the three BAM files that you have just copied over. This will create an index file for each BAM file ending in .bai

samtools index Sample1.30x.q20.sort.bam

Extract only sequence reads that have aligned to chromosome 1:

samtools view -b Sample1.30x.q20.sort.bam chr1 > Sample1.30x.q20.sort.chr1.bam &

Check that you have successfully created the chromosome 1 file, and find out the size of the file.

The next step is to remove PCR duplicates. You can do this using Picard Tools MarkDuplicates like so:

java -XX:ParallelGCThreads=2 -Xmx2g -jar picard.jar MarkDuplicates \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=SILENT \
INPUT=Sample1.30x.q20.sort.chr1.bam \
OUTPUT=Sample1.30x.q20.sort.chr1.rmdup.bam \
METRICS_FILE=Sample1.30x.q20.sort.chr1.rmdup.bam.metrics \
>& Sample1.30x.q20.sort.chr1.rmdup.bam.MarkDuplicates.log &

Use less/more to have a look at the metrics file. The 8th line of this file contains information about the percentage of duplication (second number from the end).

Remember, to come out of less and return to the command line, press the q key.

Now index, pull out chromosome 1 reads, and mark PCR duplicates for samples 2 and 3.


Identifying genetic variations within the genomes

In population genomics studies, the purpose of aligning reads to a reference genome is to be able to identify differences between individuals.

The most abundant genetic variants are single nucleotide polymorphisms (SNPs) and short insertion/deletion polymorphisms (indels). You will use FreeBayes to identify these. The FreeBayes manual is here: https://github.com/ekg/FreeBayes/blob/master/README.md

NB: You will be using FreeBayes because it is fairly fast (though it still takes some time). But there are also other variant calling packages. GATK (The Genome Analysis Toolkit) is a popular choice, used widely in human genome studies, and can do lots of other things too: https://software.broadinstitute.org/gatk/documentation/


SNP and indel calling

First ‘index’ the fasta format file of the genome with samtools

samtools faidx TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta

This creates a fai file. Have a look at this file. It contains information about the number and size of the chromosomes (strictly speaking the scaffolds making up the reference genome). See this description to understand the fai file format.

So you can tell FreeBayes which samples to use for variant calling, make a file containing a list of the relevant bam files:

ls *.rmdup.bam > bams.list

Here you are redirecting the output of the ls (list) command to a file called bams.list

Freebayes will identify variants (SNPs and indels) that are different from the reference genome across your three samples. At these variant positions or sites, Freebayes will also genotype each of the three samples, i.e. at each of these positions, it will determine whether each sample is homozygous for the reference allele, heterozygous or homozygous for the alternate allele):

freebayes -f TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta --ploidy 2 --bam-list bams.list > three-samples.chr1.vcf &

In your projects you will be dealing with fission yeast which is haploid. Here you will have to change ploidy to 1

NB: Variant calling takes about 5-10 minutes! Be patient.


Once variant calling is completed, this will create a variant call format (VCF) file. This is a complicated file. VCF format is described in detail here: https://samtools.github.io/hts-specs/VCFv4.2.pdf
And more simply here: https://en.wikipedia.org/wiki/Variant_Call_Format
VCF files (or their binary form bcf files) are the substrate for many further population genetics analyses and genome-wide studies


Take a look at the first few lines with

less three-samples.chr1.vcf

You can scroll up/down using the arrow keys or the page up/down keys

You will see that the first few lines start with ##. These are header lines providing information about the genome that has been used, and about the format of the vcf file.

Then there is a line starting with #CHROM. This is the final header line before the actual SNP/indel information. Look carefully at this header line which describes the format of the SNP/indel information.

The first few columns of the vcf file describe the SNP or indel, as follows:

CHROM = the chromosome where the genetic variant is found

POS = the position on the chromosome on which the genetic variant is found

ID = the SNP/indel id (blank for us, but SNPs in the human genome have ids).

REF = the reference allele

ALT = the alternative allele(s)

If both the REF and ALT are single bases, the variant is a SNP.
If either REF or ALT are longer than a single base, then the variant is an indel.

QUAL = the SNP/indel call quality (phred scaled). How confident the software is that there is a variant at this position.

INFO = an information-rich column with the format KEY=value. The meaning of all the keys is described in the VCF header.

The last four columns are FORMAT, Sample1, Sample3, Sample2.

The Sample columns describe information about the three individuals; S1, S2, S3. This naming was applied during the mapping stage with bwa using the RG field (see the Aligning sequences with BWA section in workshop 2).

The FORMAT column describes what is in the individual data. This can be different, depending on what the SNP caller outputs. In our case it is: GT:DP:AD:RO:QR:AO:QA:GL. The first of these are:

GT = genotype; 00 homozygous reference; 01 heterozygous; 11 homozygous alternative

DP = read depth

The remainder are described in the header of the VCF file. You can see these with:

grep "##FORMAT" three-samples.chr1.vcf

All this information in the VCF file can be used to filter the variants.


Filtering SNP and indel calls

Initial variant calling is generally very approximate, and will identify many sites as SNPs or indels that are merely errors. The INFO field of the vcf file contains lots of information about each site in the genome, and the reads aligned there, and the quality of the variant calls.

We can use data from these INFO fields to filter the initial vcf file using vcffilter https://github.com/vcflib/vcflib/blob/master/README.md#vcffilter

Count the number of variants we have, using grep (to remind yourself what grep does have a look at the Introduction to Linux):

grep -vc '#' three-samples.chr1.vcf

Filter by variant quality 20, and pipe directly to grep to count again

vcffilter -f  "QUAL > 20" three-samples.chr1.vcf  | grep -vc '#'

Note: this quality is phred-scaled (like fastq base qualities).

Remind yourself what phred score 20 means (if you have forgotten), or read the wikipedia page on phred quality scores.


First try filtering variants with QUAL > 10, and see how many SNPs you get;

Use grep ID=DP three-samples.chr1.vcf to find out what DP is.

Then, filter by variant quality 20 and DP > 20, and count again:

vcffilter -f "QUAL > 20 & DP > 20" three-samples.chr1.vcf  | grep -vc '#'

You could of course create a new filtered vcf file by redirecting the output of vcffilter to a new file like this:

vcffilter -f "QUAL > 20 & DP > 20" three-samples.chr1.vcf  > three-samples.chr1.Q20.DP20.vcf

Use grep INFO= three-samples.chr1.vcf to look at other INFO fields

Are there any other info fields that might be good filters?


Using the Integrated Genome Viewer to scrutinize variant calls and bam files manually

As automated variant calls will include errors, we sometimes need to inspect the read alignments to check if the calls are correct. For example, it is often useful to visualise alignments and variant calls to check whether there are any issues with the alignments such as localised excessive coverage, larger than expected distance between paired-end sequences, unusual read pair orientations.

With so many reads (all may have errors in base calling), and so many differences between genomes, it is very challenging to describe all the changes with one package. The Integrated Genomics Viewer (IGV) is a good tool for this.


IGV can be run directly on your local computer, but also can be run through an internet browser (with some loss of functionality).

If your files are large, you may run into issues with slow internet speeds and not having enough RAM.

You will be using IGV to view the *rmdup.bam alignment files, and the vcf file that you have created.

First index each of the three *rmdup.bam files like this:

samtools index Sample1.30x.q20.sort.chr1.rmdup.bam

You will then need to transfer the following nine files over from the Linux server to your local computer using WinSCP. To remember how to do this, look at workshop 2

Sample1.30x.q20.sort.chr1.rmdup.bam  
Sample2.30x.q20.sort.chr1.rmdup.bam  
Sample3.30x.q20.sort.chr1.rmdup.bam  
Sample1.30x.q20.sort.chr1.rmdup.bam.bai  
Sample2.30x.q20.sort.chr1.rmdup.bam.bai  
Sample3.30x.q20.sort.chr1.rmdup.bam.bai  
three-samples.chr1.vcf  
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta  
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta.fai  

Using a web browser such as Chrome, go to the IGV web application

By default you will see that the human reference genome is loaded (22 autosomes + sex chromosomes).

To change to the Leishmania donovani genome, use the menu options at the top: Genome > Local File, and then select BOTH the genome fasta and its index file:

TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta
TriTrypDB-39_LdonovaniBPK282A1_Genome.fasta.fai
and click Open.

You will now see that the 36 Leishmania chromosomes are displayed.

Then, load your vcf file into the IGV:
Tracks > Local File, and then select three-samples.chr1.vcf
(it will look like nothing much has happened)

Then, load the bam and bai index files into the IGV:
Tracks > Local File, and then select the three bam files and the three bai files

Again it will look like nothing much has happened, but you will see your vcf file (green circle) and sample bam files (blue circle) appear as below.

To see what your alignments look like, you will first need to go to chromosome 1 by selecting chr1 from the drop down menu (red circle):

You can now kind-of see the variants contained in the vcf file, but there are too many squeezed in to see them properly. You will need to zoom in further to be able to see the aligned reads.

You can either zoom in using the + button in the top right hand corner, but this can be slow.

Alternatively, you can go to a particular region that you are interested in having a closer look at (perhaps the location of a gene, or a potential inversion etc). For example, to have a look at the stretch of chromosome 1 between 4000 and 5000bp type chr1:4,000-5,000 into the search box (red circle below):

Note the annotation in red in the figure above.

The undulating grey shows how coverage varies along the chromosome.
Each of the long grey rectangles with a pointy end represents where a particular 100bp sequence read has mapped.
If these rectangles are coloured, it indicates that the reads pair orientation is not as expected. Read 1 and Read 2 of a sequence should usually pointing towards each other. If the do not, it could indicate an that one of the reads has been mapped to a different chromosome, or that there is an inversion.

Using less, open the three-samples.chr1.vcf file on the linux server. Then using the arrow keys, move to the lines of the vcf corresponding to the same region, i.e. chromosome 1 between 4000 and 5000bp. Now you can compare the vcf file to what you see in IGV. This will help you answer some of the questions below.

  • Why is coverage not even across a genome?
  • What is the difference between the blue and grey variants indicated by arrow 1?
  • What does the gap highlighted in box 2 represent?
  • The coloured lines within each sequence read (long grey rectangles with a pointy end) represent SNP differences from the reference. Why are there so many SNPs in the sequence read highlighted in oval 3?
  • What are being shown in ovals 4 and 5.

Optional extra

Try playing around with some of the display setting in IGV (cog wheels along the right-hand side). For example, try Colour by read strand and View as pairs.


There are several packages for filtering and manipulating vcf files, including:
vcffilter
vcftools
bcftools
picard

See if you can use vcftools or vcftools to separate out SNP and indel variants from your vcf file.

There are many sequence analysis forums on the internet, and usually a google search for how to solve a particular bioinformatics problem will yield useful results.


Summary

Here is what we learned, and some questions to ask yourselves (or ask the data).

  • How to use Picard to clean up a bam by removing duplicates.
    • What kind of duplicates are we removing?
    • How are they created?
  • How to use samtools to extract parts of a bam
    • Note that samtools has many other useful functions.
  • How to call and filter genetic variants from a bam (freebayes and vcffilter)

  • The basic anatomy of a vcf file.
    • What types of genetic variants did we not look at?
  • How to use IGV to inspect bam files.


Link back to main course page