Assessment: Population genomics of fission yeast

Background

Project expectations

Where to get the data from

Other possible analyses

Some tips

Report format

Data sharing

Background

Schizosaccharomyces pombe is a yeast species used in traditional brewing and found in high-sugar environments worldwide. It is a model organism, particularly for research on the cell cycle, DNA replication and DNA damage. Most of this research is done using a single S. pombe strain. As S. pombe is widely used in research, excellent genomic resources are available through PomBase, including a high-quality reference genome and a genome browser.

The genome of S. pombe is about 12.8 Mb long (contained mainly in three chromosomes plus the mitochondrial genome), and contains about 5000 genes, 70% of which have human orthologs.

In addition to the research strain, 161 natural isolates of S. pombe have been obtained from a variety of fruits and fermented drinks across a number of countries. Fifty-seven genetic varieties are found among these 161 isolates (several of the isolates are essentially genetic clones).

For your projects you will be analysing Illumina whole-genome resequencing data from these 57 genetic varieties of S. pombe.

Project expectations

You need to download and analyse Illumina resequencing data from at least 10 S. pombe strains yourself (you can do more, and can, for example, share BAM files with others; see the Data sharing section later on).
For each of your 10 strains you will need to quality check, filter and align the Illumina sequence (covered in Lecture 2 and Workshop 2).
For each of your 10 strains you will need to mark PCR duplicates and then carry out variant calling (covered in Lecture 3 and Workshop 3)
Using the genetic variants, investigate population structure among your chosen strains (to be covered in Lecture 4 and Workshop 4)
Using the genetic variants, investigate population genomic parameters such as nucleotide diversity within strains (pi), genetic divergence between populations (Fst), and Tajima’s D (to be covered in Lecture 4 and Workshop 4)
Any other analysis not covered above. See Other possible analyses and the marking criteria for high marks.
You will need to keep careful notes of your work which you will need to submit together with your project. See Report format section.

Where to get the data from

This Excel spreadsheet, pombe_strains.xlsx, contains details about the 57 S. pombe genetic varieties or strains. The spreadsheet has three worksheets:

The strain data worksheet contains information about where, when and from what substrate each strain was collected.
The ENA accessions worksheet details the accession numbers of the Illumina sequences for each of the strains. Please pay attention to the notes at the top of this worksheet.
The trait.data worksheet provides phenotypic data about each of the strains. The notes at the top provide a description of each of the measured phenotypes.

Use the information in this spreadsheet to decide which 10 strains you will analyse. Think about what hypotheses you wish to test, as this will inform which strains you select.

It is also important that you consider what analyses you will need to carry out to test your ideas; there is no point coming up with an amazing idea if you (or even I) do not have the bioinformatic skills to implement the necessary analyses!

Just as all published Sanger sequence data is publicly available through GenBank, all published high-throughput sequence data is made available through the European Nucleotide Archive (ENA) and the NCBI Sequence Read Archive. To obtain the sequences for a strain, you will need to know the accession number, which you can get from the above spreadsheet (ENA accessions worksheet). You can then use this accession number to search the ENA.

For example, the accession for strain JB1110 is ERR200230. Searching the ENA using this accession number (blue circle) brings up the following page:

Click on the Experiment (red circle above) to bring up this page:

You can now get the links to the read 1 and read 2 fastq files (red circle). Right-click on File 1, and copy the link address. You can now use the Linux wget command to copy the read 1 fastq file over to your directory (somewhere sensible on /scratch/userid) on the Linux servers:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR200/ERR200230/ERR200230_1.fastq.gz

Similarly, you can then copy over the read 2 fastq file.
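
If you are downloading several strains, the two wget calls can be wrapped in a small loop. The following is a minimal sketch, not part of the course material: the function names ena_fastq_url and download_pairs are made up for illustration, and the URL pattern is simply inferred from the ERR200230 example above, so it is only assumed to hold for 9-character accessions. Always check the link shown on the ENA page.

```shell
# Build the ENA FTP URL for one read file of a paired-end run.
# Assumption: a 9-character run accession (e.g. ERR200230), where the
# directory is named after the first six characters of the accession.
ena_fastq_url() {
    prefix=$(printf '%.6s' "$1")
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s/%s_%s.fastq.gz\n' \
        "$prefix" "$1" "$1" "$2"
}

# Download read 1 and read 2 for every accession given as an argument.
# wget -c resumes a partially downloaded file instead of starting again.
download_pairs() {
    for acc in "$@"; do
        for read in 1 2; do
            wget -c "$(ena_fastq_url "$acc" "$read")"
        done
    done
}
```

For example, running download_pairs ERR200230 from your /scratch directory would fetch both fastq files for strain JB1110.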

wget is a useful Linux command for downloading files from web and FTP servers.

From PomBase, you can download:

the S. pombe reference genome

wget ftp://ftp.pombase.org/pombe/genome_sequence_and_features/genome_sequence/Schizosaccharomyces_pombe_all_chromosomes.fa.gz

the S. pombe genome annotation file

wget ftp://ftp.pombase.org/pombe/genome_sequence_and_features/gff3/Schizosaccharomyces_pombe_all_chromosomes.gff3.gz

Other possible analyses

Some possible other questions to examine are:

How do read coverage and mapping quality vary across the genome?
Software: samtools
How does the quality of SNP and/or indel calls vary across the genome?
Which SNPs and/or indel calls can we trust?
Software: vcftools, IGV (Integrative Genomics Viewer)
Is there variation in phylogenetic relationships across the genome?
Software: RAxML
What structural variants (copy number variation, duplications, deletions, translocations, inversions) exist in these genomes?
Software: delly and lumpy, amongst others
Are particular genes or gene families interesting to examine in detail?
Or analyse some other aspect of the data.

New software can be installed on the servers if you need it.
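
As a starting point for the read coverage question, the per-position output of samtools depth can be summarised with awk. This is a minimal sketch: the helper name mean_depth and the file name sample.bam are hypothetical, and it reports a single genome-wide mean rather than variation along the chromosomes.

```shell
# Mean depth from `samtools depth` output read on standard input.
# Input columns: chromosome, position, depth (tab-separated).
mean_depth() {
    awk '{ sum += $3; n++ }
         END { if (n > 0) printf "%.2f\n", sum / n; else print "NA" }'
}

# Usage on the servers (sample.bam is a hypothetical sorted BAM file;
# -a makes samtools report positions with zero depth as well):
#   samtools depth -a sample.bam | mean_depth
```

Grouping by the first column, or by windows of the second, would extend the same idea to per-chromosome or windowed coverage.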

Some tips

Some of the strains have been sequenced to very high coverage, which will make some of the analyses run slowly. You may want to downsample the fastq files to 30x coverage (this is sufficient for a haploid) to speed up run times. You can do this using something like this:

zcat sampleX_R1.fastq.gz | head -n XXXX > sampleX_R1_downsampled.fastq
gzip sampleX_R1_downsampled.fastq

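where XXXX is the number of lines to keep. Because head counts lines and each fastq record spans four lines, XXXX must be four times the number of reads you need to achieve 30x coverage. Use the same value for the read 2 file so that the read pairs stay in step.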
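
The value to pass to head -n can be estimated with shell arithmetic. This is a rough sketch under stated assumptions: the function name fastq_lines_for_coverage is made up, the 12.8 Mb genome size comes from the Background section, and the 100 bp read length is only a placeholder, so check the actual read length of your strain. Because the data are paired-end, each pair contributes two reads, and each fastq record takes four lines.

```shell
# Lines to keep from EACH fastq file of a pair to reach a target coverage.
# Arguments: genome size (bp), target coverage, read length (bp).
fastq_lines_for_coverage() {
    pairs=$(( $1 * $2 / (2 * $3) ))   # read pairs needed
    echo $(( pairs * 4 ))             # four lines per fastq record
}

# 12.8 Mb genome, 30x coverage, assumed 100 bp reads
fastq_lines_for_coverage 12800000 30 100
```

With these numbers the function prints 7680000, i.e. 1,920,000 read pairs per file.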

Remember to change the readgroup information (@RG:XX:XX:Illumina) when you run BWA so that each sample is given a different identifier (i.e. replace XX with a sample name). For more information on specifying readgroup see this.
S. pombe has a haploid genome. You will need to take this into account when running FreeBayes.
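
The two tips above might be put into practice along these lines. This is a sketch rather than the workshop commands: the function names are made up, the file naming is an assumption, the reference is assumed to be indexed with bwa index already, and the duplicate-marking step that belongs between alignment and variant calling is left out. The -R option of bwa mem and the --ploidy option of freebayes are the relevant settings.

```shell
# Print a read-group header line for bwa mem -R. The \t sequences are
# kept literal because bwa converts them to tabs itself.
read_group() {
    printf '@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA\n' "$1" "$1"
}

# Align one sample and write a coordinate-sorted BAM named <sample>.bam.
# Arguments: sample name, reference fasta, read 1 fastq, read 2 fastq.
align_sample() {
    bwa mem -R "$(read_group "$1")" "$2" "$3" "$4" |
        samtools sort -o "$1.bam" -
}

# Call variants treating the genome as haploid.
# Arguments: reference fasta, BAM file; writes <name>.vcf next to the BAM.
call_haploid() {
    freebayes -f "$1" --ploidy 1 "$2" > "${2%.bam}.vcf"
}
```

For example, align_sample JB1110 ref.fa JB1110_1.fastq.gz JB1110_2.fastq.gz tags every read with the sample name JB1110, so that samples can still be told apart after BAM files are merged or called jointly.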

Report format

The final submitted report should be prepared independently by each student.

The report should be no longer than 2000 words (excluding the supplementary methods, figure legends, tables, references), and should demonstrate:
* Competent analysis of high-throughput sequence data
* Interpretation of the data in a biological context

The report should contain these sections in the following order:

Abstract: 250 words limit.
Introduction: Explains the background of the report, including references.
Methods: Briefly summarises the data used and main bioinformatic steps employed (this methods section is included in the 2000-word limit).
Results and Discussion: With embedded discussion of each result, as for letters (e.g. Nature Genetics).
Conclusion: Briefly reiterates the main finding(s), with any caveats, limitations and implications.
References:
Supplementary methods: Should detail data sources, genome versions used, software used (including version), parameters used and/or commands, and code written for the analysis (including R). Should be a concise, technical narrative.

The following are not included in the 2000 word limit:

Figure and Table legends
References
Supplementary methods

Your reports will need to be submitted via the VLE.

UPDATE: The report submission deadline is now Monday 09 December 2019, 12:00.

Please have a look at the marking criteria to understand how to achieve high marks.

Here are two example reports from previous years. Project A received a grade between 60 and 70, and Project B between 80 and 90. Please don’t try to find out who wrote these, except if you want to congratulate them. :)

(Note: These projects are from previous years where the format and word counts were different.)