Gene 760 – Problem Set 4
The purpose of this problem set is to familiarize students with exome sequence analysis in the context of
human disorders. By the end of this problem set we hope students will have learned the following skills:
 Familiarity with the methods of exome sequencing
 Mechanisms by which sporadic disorders can occur in humans
 Familiarity with FASTQ, BAM and VCF formats
 Writing Python scripts to process large datasets and using external modules (pysam)
 Using test files to design a sequencing analysis pipeline using GATK/BWA/Picard
To teach these skills we will be using high-throughput sequencing data (Illumina HiSeq 2000) generated
from a targeted exome capture (Nimblegen EZ2). The data is generated from a single family from the
Simons Simplex Collection in which the proband (11114.p1) is affected with autism, while the sibling
(11114.s1), mother, and father (11114.fa) are unaffected.
Students are to submit a gzipped tarball named [YourNetID]_PS4.tar.gz containing:
 [YourNetID]_PS4_answers.txt : Text file with the responses to the questions below
 A commented Python script used to answer the final problem
 SeattleSeq annotated variant files for the proband and sibling
to ‘DROPBOX/PS4’ by 9:00PM on Sunday, April 12th.
1) Generate symbolic links from your Problem_Set_4 directory to the eight FASTQ files in: DATA/PS4. In
addition to read sequences, the files contain quality data for each basepair of sequence, represented
by ASCII characters. Consult a description of the FASTQ format; by looking at the first few lines of each
file, work out whether these data are encoded in Illumina 1.5+ or Illumina 1.8+ format.
ANS: State the command you used to generate the symbolic links
ANS: State the format of these quality data and how you can know this. If you wanted to be
absolutely sure how could you check?
ANS: What is the read length (nt) of these data?
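The two encodings occupy different ASCII ranges: Phred+33 (Illumina 1.8+) uses roughly ASCII 33–74, while Phred+64 (Illumina 1.5+) uses roughly ASCII 64–104. A minimal sketch of how the quality lines could be checked programmatically (the function name is illustrative, not part of the assignment):

```python
# Sketch: infer the FASTQ quality encoding from the ASCII range of the
# quality characters. A character below ASCII 59 can only occur in a
# Phred+33 (Illumina 1.8+) file; one above ASCII 74 ('J') can only
# occur in a Phred+64 (Illumina 1.5+) file.

def guess_encoding(quality_lines):
    """Guess the encoding from an iterable of FASTQ quality strings."""
    lo = min(min(ord(c) for c in q) for q in quality_lines)
    hi = max(max(ord(c) for c in q) for q in quality_lines)
    if lo < 59:                 # e.g. '#' (ASCII 35): impossible in Phred+64
        return "Illumina 1.8+"
    if hi > 74:                 # e.g. 'h' (ASCII 104): impossible in Phred+33
        return "Illumina 1.5+"
    return "ambiguous"          # overlap region: inspect more reads
```

This is also why inspecting only a few reads is not "absolutely sure": if every observed character falls in the overlapping range, more reads must be scanned.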
2) Detecting variants in genomic data requires multiple steps; to design a pipeline for doing this it is
best to start with a smaller, but representative, data set. We will generate a test file using the
paternal dataset which prints out 1 in every 1000 FASTQ reads from ‘Samp_11114.fa_Pair_1.fq’ and
‘Samp_11114.fa_Pair_2.fq’ (this is called subsampling).
Use the following awk command for Pair_1 and repeat it for Pair_2
awk 'NR%4000 == 1{c=3;print;next}c-->0' Samp_11114.fa_Pair_1.fq > Test_Pair_1.fq
ANS: State the number of reads in the full FASTQ files for 11114.fa, and the test files
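The awk one-liner keeps the four lines of every 1000th FASTQ record (one record per 4000 lines). The same logic as a Python sketch (the function name is illustrative; it assumes the standard 4-lines-per-read FASTQ layout):

```python
# Sketch: keep 1 in every `every` 4-line FASTQ records, mirroring the
# awk subsampling command above.

def subsample_fastq(lines, every=1000):
    """Return every `every`-th 4-line record from an iterable of FASTQ lines."""
    keep = []
    for i, line in enumerate(lines):
        record_index = i // 4            # which read this line belongs to
        if record_index % every == 0:    # keep records 0, 1000, 2000, ...
            keep.append(line)
    return keep
```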
3) To build this exome sequencing pipeline, you will use BWA (v0.7.3a), Picard Tools, a custom python
module, and GATK (Genome Analysis Toolkit). These are installed for you courtesy of the .bashrc file
you copied in PS0.
Note that you cannot add a directory of .jar files (Java executables, used by Picard and GATK) to your
PATH as you would with ordinary binaries. Jar files are executables that require the command structure
“java -jar file.jar [arguments]” in order to be run on the cluster. You will need to run the following utilities
directly from the paths below, or copy/link the executable .jar files into your working directory.
IMPORTANT: To ensure smooth operation, start each java command with:
‘java -Xmx2g -Djava.io.tmpdir=`pwd`/tmp -jar command.jar’
ANS: What do the above java arguments mean?
ANS: BWA contains three alignment algorithms. State which one is most appropriate to this dataset,
and why.
4) Using the two test files, align the data with BWA and output to a .sam file, ensuring that the output is
compatible with Picard and GATK. You can do this by:
- Adding a read group header that is distinct for each sample (e.g. a unique ID and sample name for the father)
- Using the reference: ‘DATA/PS4/human_g1k_v37.fasta’.
- Marking shorter split hits as secondary
- Your command should run with 4 threads
ANS: Write the BWA command and a description of what each argument specifies/does.
ANS: In the read group header, what do ID and SM stand for? What sort of information should be stored in each?
5) Using the output from BWA, use Picard to convert the SAM file into a BAM file.
ANS: Write the command.
ANS: What is the advantage of using a BAM file over a SAM file? Are there any disadvantages?
6) Some commands can be joined together using the ‘|’ (pipe) symbol. The output of the first command
needs to be written to STDOUT (this often happens automatically; the ‘>’ operator redirects STDOUT into
a file), and the input of the second command needs to be captured from STDIN. In Picard, STDIN can be
read by setting the input file to ‘/dev/stdin’.
Combine the BWA and Picard commands into a single command using the pipe.
ANS: Write the command and a description of how it works
ANS: What are the advantages of using a pipe rather than simply doing each stage separately?
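The mechanics of a pipe can be illustrated with Python's subprocess module; here two small stand-in processes (not the real BWA/Picard commands) are connected STDOUT-to-STDIN, so no intermediate file is ever written to disk:

```python
# Sketch: what the shell's '|' does, reproduced with subprocess. The
# producer's STDOUT is wired directly into the consumer's STDIN.
import subprocess
import sys

# Producer: a tiny Python process standing in for the aligner.
producer = subprocess.Popen(
    [sys.executable, "-c", "print('b'); print('a')"],
    stdout=subprocess.PIPE)

# Consumer: 'sort' standing in for a Picard tool reading /dev/stdin.
consumer = subprocess.Popen(
    ["sort"], stdin=producer.stdout, stdout=subprocess.PIPE)

producer.stdout.close()   # let the producer see SIGPIPE if the consumer exits early
out, _ = consumer.communicate()
# out now holds the sorted bytes: the consumer read the producer's
# STDOUT directly, with both processes running concurrently.
```

Running both stages concurrently, with no intermediate file, is exactly the advantage question 6 asks about.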
7) Use Picard to sort the resulting BAM file
(use SORT_ORDER=coordinate)
Use Picard to mark duplicates in the resulting file and create an index at the same time.
(use CREATE_INDEX=true METRICS_FILE=file_metrics ; where file_metrics names the duplication
metrics file that the command will create)
Note that these commands cannot be combined with pipes, since they need access to multiple lines at
the same time (e.g. sorting requires the whole file before any output can be written).
ANS: Write the commands and explain the command arguments used
ANS: Why do we mark duplicates?
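Conceptually, Picard flags reads as duplicates when they share the same aligned start coordinate and strand, treating them as PCR copies of a single original fragment. A toy sketch of that idea (the function and tuple layout are illustrative, not Picard's actual algorithm, which also considers mate positions and base qualities):

```python
# Sketch: flag all but one read at each (chrom, pos, strand) as a
# duplicate, so the variant caller does not count the same original
# DNA molecule more than once.

def mark_duplicates(reads):
    """reads: iterable of (chrom, pos, strand, name); returns names flagged as duplicates."""
    seen = {}
    dups = set()
    for chrom, pos, strand, name in reads:
        key = (chrom, pos, strand)
        if key in seen:
            dups.add(name)      # same start site already seen: flag this copy
        else:
            seen[key] = name    # first read at this site is kept
    return dups
```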
8) Use GATK (TOOLS/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar) to call variants with
the HaplotypeCaller method, output the results to a .vcf file
(add the option ‘-rf BadCigar’ to filter out any reads with incorrect Cigar strings)
ANS: Write the command you used to do this and explain the command arguments used
9) By running the commands from parts 6, 7, and 8 you have developed a simple exome sequencing
pipeline. Once this is functional without errors, you could apply it to the full FASTQ files BUT this
would take approximately 24 hours per sample!! (rough guide per sample: BWA to BAM 3.5hrs; Sort
BAM 1.5hrs; MarkDups 1.5hrs; Variant Calling 17hrs). Thus, to save you computational resources
and time, we will provide sorted bam files with marked duplicates to you so that you only have to
execute the final variant calling step (8). These files are in REFERENCE/PS4_REFERENCE/**
ANS: Write the commands used
ANS: State the number of variants in each sample (maternal, paternal, proband, sibling)
10) Write a Python script to find de novo variants. Specifically, filter the VCF file (preserving its format) to
identify substitution variants that are: heterozygous (35-65% variant read frequency) with at least 8
variant reads in the child; absent in both parents; covered by at least 10 reads in both parents (use
the pysam module to count reads in specific regions). Annotate the resulting VCF files using
SeattleSeq with the default settings.
IMPORTANT: To run the Python script, use ‘py27’ rather than ‘python’. py27 is an alias in
your .bashrc that points to a version of Python on the cluster with the pysam module installed.
Consult the pysam documentation as needed.
ANS: How many variants were present after filtering by the Python script?
ANS: Save your Python script and the SeattleSeq-annotated variant files for the proband and
sibling in your tarball.
ANS: Do any de novo variants stand out as potential disease candidates? Why?
ANS: Outline a strategy to identify other variants in this family with a potentially high (Odds ratio
> 2) contribution to autism risk.
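As a starting point for the child-side filters, here is a sketch of how a single VCF record could be tested for a heterozygous substitution with at least 8 variant reads, assuming a single-sample VCF whose FORMAT includes an AD (allelic depths) field as HaplotypeCaller emits; the parental absence and ≥10x coverage checks would still need pysam:

```python
# Sketch: child-side de novo filters on one VCF data line.
# Assumes a single sample column and an AD field of "ref,alt" depths.

def passes_child_filters(vcf_line):
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    if len(ref) != 1 or len(alt) != 1:      # substitutions only, skip indels
        return False
    fmt = fields[8].split(":")              # FORMAT keys, e.g. GT:AD:DP
    sample = fields[9].split(":")           # matching values for the child
    ad = sample[fmt.index("AD")]            # e.g. "12,10"
    ref_reads, alt_reads = (int(x) for x in ad.split(",")[:2])
    total = ref_reads + alt_reads
    if alt_reads < 8 or total == 0:
        return False
    frac = alt_reads / float(total)
    return 0.35 <= frac <= 0.65             # heterozygous variant read frequency
```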
Optional programming challenges
For those of you with some programming experience looking to practice your python skills, here are a couple
optional challenges. These will NOT be worth more than a point of extra credit (negligible to your grade), and
you should attempt them only after finishing the main problem set. If you decide to work on these,
please hand them in as a separate file from your main script.
Medium: Create a script that searches for possible recessive LOF alleles that may be involved in the disease state.
In this case, you should look for heterozygous SNPs in both parents that are homozygous in the affected proband
but not in the unaffected sibling.
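A sketch of the genotype test at the core of this challenge, assuming the four genotypes have already been extracted from a merged VCF (the sample names and GT strings here are illustrative):

```python
# Sketch: recessive candidate test. Both parents heterozygous, the
# affected proband homozygous for the alternate allele, and the
# unaffected sibling anything but homozygous alternate.

def is_recessive_candidate(genotypes):
    """genotypes: dict mapping sample name -> GT string like '0/1' or '1|1'."""
    def alleles(sample):
        return set(genotypes[sample].replace("|", "/").split("/"))
    het = {"0", "1"}
    return (alleles("mother") == het
            and alleles("father") == het
            and alleles("proband") == {"1"}     # homozygous alt in affected child
            and alleles("sibling") != {"1"})    # but not in unaffected sibling
```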