Gene 760 – Problem Set 4 The purpose of this problem set is to

Gene 760 – Problem Set 4 The purpose of this problem set is to familiarize students with exome sequence analysis in the context of human disorders. By the end of this problem set we hope students will have learned the following skills: • Familiarity with the methods of exome sequencing • Mechanisms by which sporadic disorders can occur in humans • Familiarity with FASTQ, BAM and VCF formats • Writing Python scripts to process large datasets and using external modules (pysam) • Using test files to design a sequencing analysis pipeline using GATK/BWA/Picard To teach these skills we will be using high-­‐throughput sequencing data (Illumina HiSeq 2000) generated from a targeted exome capture (Nimblegen EZ2). The data is generated from a single family from the Simons Simplex Collection in which the proband (11114.p1) is affected with autism, while the sibling (11114.s1), mother (, and father (11114.fa) are unaffected. Students are to submit a gzipped tarball named [YourNetID]_PS0.tar.gz containing: • [YourNetID]_PS0_answers.txt : Text file with the responses to the questions below • [YourNetID] Commented Python script used to answer the final problem • SeattleSeq annotated variant files for the proband and sibling to ‘DROPBOX/PS4’ by 9:00PM on Sunday, April 12th. 1) Generate symbolic links from your Problem_Set_4 directory to the eight FASTQ files in: DATA/PS4. In addition to read sequences, the files contain quality data for each basepair of sequence, represented by ASCII characters. A description of the FASTQ format is given here:; by looking at the first few lines of each file work out if these data are encoded in Illumina 1.5+ or Illumina 1.8+ format. ANS: State the command you used to generate the symbolic links #Make symbolic links to all files in gene760 PS4 directory in your own PS4 directory
$ ln -s /home2/tgj2/gene760/DATA/PS4/* .
ANS: State the format of these quality data and how you can know this. If you wanted to be absolutely sure how could you check? Looking at some sample lines from one file, the quality score lines contain characters in the
beginning of the ASCII alphabet including >, =, and !, which is indicative of Illumina 1.8+
format. By contrast, Illumina 1.5+ quality score format is primarily comprised of the middle of
the ASCII alphabet, containing capital letters. Only Sanger uses scores below 59, only Illumina
uses scores above 73, etc. To double check, you could utilize the FASTQC tool which has an
internal guesser component in order to definitively check the format of the FASTQ file.
THOUGH some formats may be indistinguishable using this method. In the real world, you can
also gain information from the header and accompanying files, the read sizes (some machines
can’t sequence certain lengths), the date the files were generated (some score schemes did
not exist early on) and by checking the metadata and/or with the person/machine which
generated the data in the first place.
ANS: What is the read length (nt) of these data? The reads are 74nt in length.
2) Detecting variants in genomic data requires multiple steps; to design a pipeline for doing this it is best to start with a smaller, but representative, data set. We will generate a test file which prints out 1 in every 1000 FASTQ reads from ‘Samp_11114.fa_Pair_1.fq’ and ‘Samp_11114.fa_Pair_2.fq’ (this is called subsampling). Use the following awk command for Pair_1 and repeat it for Pair_2 $ awk 'NR%4000 == 13{c=3;{print}next}c-->0' Samp_11114.fa_Pair_1.fq > Test_Pair_1.fq
$ awk 'NR%4000 == 13{c=3;{print}next}c-->0' Samp_11114.fa_Pair_2.fq > Test_Pair_2.fq
ANS: State the number of reads in the full FASTQ files for 11114.fa, and the test files Since the FASTQ files have 4 lines of data for every read, the total number of reads can be
calculated by dividing the total number of lines in the file by 4.
| 40,667,776 reads
| 40,667,776 reads
| 40,668 reads
| 40,668 reads 3) To build this exome sequencing pipeline, you will use BWA (v0.7.3a), Picard Tools, a custom python module, and GATK (Genome Analysis Toolkit). These are installed for you courtesy of the .bashrc file you copied in PS0. Note that you cannot add a directory of .jar files (java executables, used by Picard and GATK) to your PATH without using special commands. Jar files are executables that require the command structure “java file.jar [arguments]” in order to be run on the cluster. You will need to run the following utilities directly from the paths below, or copy/link the executable .jar files into your working directory. /home2/tgj2/gene760/TOOLS/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar
/home2/tgj2/gene760/TOOLS/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar .
/home2/tgj2/gene760/TOOLS/picard-tools-1.88/SortSam.jar .
/home2/tgj2/gene760/TOOLS/picard-tools-1.88/SamFormatConverter.jar .
/home2/tgj2/gene760/TOOLS/picard-tools-1.88/MarkDuplicates.jar .
IMPORTANT: To ensure smooth operation start each java command with: ‘java -­‐Xmx2g -­‐`pwd`/tmp file.jar –jar command.jar’ ANS: What do the above java arguments mean? Java: initiates a java command
-Xmx2g: increases the maximum heap size (where all the java objects are created and stored
while running) to 2GB (see more information about heap here’pwd`/tmp: This command denotes the temporary directory that will be used
for java to create and store temporary files (more information about this here
-jar: tells java you are using a jar archive rather than raw source code
command.jar: finally this is the java executable you are running
ANS: BWA contains three alignment algorithms. State which one is most appropriate to this dataset, and why. According to the BWA manual ( BWA has three
algorithms: BWA-backtrack, BWA-SW and BWA-MEM. For this project BWA-MEM is the most
appropriate to use since it is suitable for reads 70bp-1Mbp, designed for high quality reads, and
is faster and more accurate than the other two algorithms. It is a quicker, seed-based method that
is used for local alignment, which helps because our queries are MUCH smaller than our
4) Using the two test files, align the data with BWA and output to a .sam file; ensuring that the output is compatible with Picard and GATK. You can do this by: -­‐ Adding a read group header: '@RG\tID:HiSeq.EZ2\tSM:11114.fa'. -­‐ Using the reference: ‘DATA/PS4/human_g1k_v37.fasta’. -­‐ Marking shorter split hits as secondary -­‐ Your command should run with 4 threads ANS: Write the BWA command and a description of what each argument specifies/does. $ bwa mem –t 4 -M -R '@RG\tID:HiSeq.EZ2\tSM:11114.fa' human_g1k_v37.fasta
Test_Pair_1.fq Test_Pair_2.fq > Alignment.sam
bwa mem: using the BWA-MEM alignment algorithm
-t 4: tells the cluster to run the command across 4 threads
-M: marking the shorter split hits as secondary, which ensures compatibility with Picard for the
next step in the pipeline
-R: assigns a read group header line to every read in the output. In our case it allowed for a
"11114.fa" label for the patient number and father sample, to be attached to each read
human_g1k_v37.fasta: reference genome for alignment mapping
Test_Pair_1.fq: mate pair sequence 1 (left)
Test_Pair_2.fq: mate pair sequence 2 (right)
> Test.sam: write all output from the alignment to a file called Alignment.sam
ANS: In the read group header what do ID and SM stand for? What sort of information should be included? Read group headers are used to indicate that a certain read came from the specified instrument
and run. This allows all reads with the same header to be processed similarly during downstream
analysis and be considered under the same error model.
In the read group header, ID stands for the ID name for the read group (read group identifier),
ideally it includes the sequencing system used (HiSeq) in addition to where the data came from
e.g. targeted exome capture (Nimblegen "EZ2")
SM stands for the sample name. This allows for identification of the sample itself, family ID,
relation designation, or any other relevant information
5) Using the output from BWA, use Picard to convert the SAM file into a BAM file. ANS: Write the command. $ java -Xmx2g`pwd`/tmp -jar SamFormatConverter.jar
I=Alignment.sam O=Alignment.bam
ANS: What is the advantage of using a BAM file over a SAM file? Are there any disadvantages? A BAM file is just a SAM file in binary form. The advantage of using a BAM file over a SAM is that it is smaller than a SAM and can be indexed, and so BAM files are faster to work with and require less computational time. A few potential disadvantages of the BAM format is that the SAM format is designed to be more easily read by human eyes, as well as easier to process with command line tools like cut and awk. Although BAM contains the same information, it is in binary and so it is not readable by humans. 6) Some commands can be joined together using the ‘|’ symbol. The output of the first command needs to be written to STDOUT (this often happens automatically, the ‘>’ directs STDOUT into a file) and the input of the second command needs to capture STDIN. In Picard STDIN can be read by setting the input file as ‘/dev/stdin’. Combine the BWA and Picard commands into a single command using the pipe. ANS: Write the command and a description of how it works $ bwa mem -M -R '@RG\tID:RG\tSM:11114.fa' human_g1k_v37.fasta Test_Pair_1.fq
Test_Pair_2.fq | java -Xmx2g`pwd`/tmp -jar
SamFormatConverter.jar I=/dev/stdin O=Alignment.bam
Using a pipe in between the bwa command the Picard command means that the output of the bwa
alignment, previously a SAM file, is immediately fed into the Picard SamFormatConverter command, so
that no intermediate files are generated, and the output is only a BAM file. This also allows for more
efficient serialization of the two commands as the sam file doesn’t have to finish generating before it
can be processed.
ANS: What are the advantages of using a pipe rather than simply doing each stage separately? One reason to use a pipe is because it prevents the need for the generation of intermediate files, (so
that instead of first generating a SAM file from the bwa alignment, your output is generated as a BAM)
Pipes allow us to more efficiently serialize some commands by feeding in the output stream of one to
the input stream of another, as it is being generated. This will not speed up every serial operation, only
ones where the first operation outputs data as it’s being generated, and the second can start
processing data without requiring the entire input set.
7) Use Picard to sort the resulting BAM file (use SORT_ORDER=coordinate) Use Picard to mark duplicates in the resulting file and create an index at the same time. (use CREATE_INDEX=true METRICS_FILE=file_metrics ; where file_metrics is the metrics file created by the output of the previous commands) ^ Note that these commands cannot be used with pipes since they need to read multiple lines at the same time. ANS: Write the commands and explain the command arguments used #Sorting the BAM file
$ java -Xmx2g`pwd`/tmp -jar SortSam.jar I=Alignment.bam
O=Alignment_sort.bam SORT_ORDER=coordinate
The first half of the command is the general structure for a java command as described before.
I=Alignment.bam: the input file to be sorted
O=Alignment_sort.bam: the name for the output file once sorted
SORT_ORDER=coordinate: denotes the sort order for the out sam/bam file where the default
is queryname, but could also input unsorted and null
#Marking Duplicates and indexing
$ java -Xmx2g`pwd`/tmp -jar MarkDuplicates.jar
I=Alignment_sort.bam O=Alignment_sort_MD.bam METRICS_FILE=file_metrics
I=Alignment_sort.bam: the input file to the marked and indexed
O=Alignment_sort_MD.bam: the name for the output file with the marked duplicates
METRICS_FILE=Duplicates.txt: the name of the file to write the duplication metrics to
CREATE_INDEX=true: creates an index of the BAM file when marking the duplicate
Note: 9 records were marked as duplicates.
ANS: Why do we mark duplicates? Marking duplicates helps to determine if there are multiple reads that are aligning to a given
sequence. Because this is exome sequencing, we don't necessarily care about read
abundance but only want to ensure that at least one read is mapping in a given location for
alignment. If the reads especially have the same start and end coordinates, then it is likely
they are PCR duplicates which are unnecessary and uninformative about the data being
analyzed. Marking these reads can also significantly speed up downstream tasks which will
then be able to skip these reads.
8) Use GATK (TOOLS/GenomeAnalysisTK-­‐2.4-­‐9-­‐g532efad/GenomeAnalysisTK.jar) to call variants with the HaplotypeCaller method, output the results to a .vcf file (add the option ‘-­‐rf BadCigar’ to filter out any reads with incorrect Cigar strings) ANS: Write the command you used to do this and explain the command arguments used $ java -Xmx2g`pwd`/tmp -jar GenomeAnalysisTK.jar -T
HaplotypeCaller -R human_g1k_v37.fasta -rf BadCigar -I Alignment_sort_MD.bam -o
Alignment.vcf -log HC_Alignment_Log.txt
-T: name of tool to be run, here, Haplotype Caller
-R: reference sequence file
-rf: filters out any reads with incorrect Cigar strings
-I: input BAM file
-o: output vcf file
-log: logs all output to a text file
9) By running the commands from parts 6, 7, and 8 you have developed a simple exome sequencing pipeline. Once this is functional without errors, you could apply it to the full FASTQ files BUT this would take approximately 24 hours per sample!! (rough guide per sample: BWA to BAM 3.5hrs; Sort BAM 1.5hrs; MarkDups 1.5hrs; Variant Calling 17hrs). Thus, to save you computational resources and time, we will provide sorted bam files with marked duplicates to you so that you only have to execute the final variant calling step (8). These files are in REFERENCE/PS4_REFERENCE/** ANS: Write the commands used $ java -Xmx2g`pwd`/tmp -jar GenomeAnalysisTK.jar -T
HaplotypeCaller -R human_g1k_v37.fasta -rf BadCigar -I Samp_11114.fa_sort_MD.bam
-o Fa_MD.vcf -log Fa_Alignment_Log.txt
$ java -Xmx2g`pwd`/tmp -jar GenomeAnalysisTK.jar -T
HaplotypeCaller -R human_g1k_v37.fasta -rf BadCigar -I Samp_11114.mo_sort_MD.bam
-o Mo_MD.vcf -log Mo_Alignment_Log.txt
$ java -Xmx2g`pwd`/tmp -jar GenomeAnalysisTK.jar -T
HaplotypeCaller -R human_g1k_v37.fasta -rf BadCigar -I Samp_11114.p1_sort_MD.bam
-o P1_MD.vcf -log P1_Alignment_Log.txt
$ java -Xmx2g`pwd`/tmp -jar GenomeAnalysisTK.jar -T
HaplotypeCaller -R human_g1k_v37.fasta -rf BadCigar -I Samp_11114.s1_sort_MD.bam
-o S1_MD.vcf -log S1_Alignment_Log.txt
ANS: State the number of variants in each sample (maternal, paternal, proband, sibling) This was determined by counting the number of lines in the file and subtracting the 111 lines
for the header from the line count.
Father: 101,705
Mother: 94,503
P1: 99,476
S1: 106,221
10) Write a python script to find de novo variants. Specifically filter the VCF file (preserving its format) to identify substitution variants that are: heterozygous (35-­‐65% variant read frequency) with at least 8 variant reads in the child; absent in both parents; covered by at least 10 reads in both parents (use the pysam module to count reads in specific regions). Annotate the VCF file using SeattleSeq ( on the default settings. IMPORTANT: To run the python file, use ‘py27’ rather than ‘python’. py27 is an alias in your .bashrc that links to a version of python on the cluster with the pySAM module installed. Pysam documentation can be found here: ANS: How many variants were present after filtering by the Python script? After filtering, found 356 candidates.
ANS: Save your python script ( and SeattleSeq-­‐annotated variant files in the proband and sibling ANS: Do any de novo variants stand out as potential disease candidates? Why? The following variant is a gain of a stop codon (nonsense mutation), which could have major
effects on the gene function
166210819 .
There are also 2 variants which are missense near splice sites, which could have effects
on splicing
138440554 rs55695858 G
rs200535612 T
We expect a variety of answers as many other SNPs could be considered interesting for
various reasons!
ANS: Outline a strategy to identify other variants in this family with a potentially high (Odds ratio > 2) contribution to autism risk. We have so far not used the sibling data. The clear next step in the analysis would be to eliminate variants present in the sibling to narrow down true de novo variants. It would then be worthwhile to examine existing genomic/exomic databases to identify other individuals with these SNPs (autism or not) in order to determine a true risk factor. We also are only looking at substitutions, but indels could also have high contributions to autism risk. We could focus our analysis by prioritizing SNPs in genes or regions that may interact with other genes implicated in autism risk. If we have exon sequencing data from many families with autism and from healthy populations (adjusted for ethnicity and background), we can compare the frequency of candidate SNPs in probands versus healthy family members or healthy controls. Optional programming challenges For those of you with some programming experience looking to practice your python skills, here are a couple optional challenges. These will NOT be worth more than a point of extra credit (negligible to your grade), and you should not complete them if you haven’t finished the problem set already. If you decide to work on these, please use a separate file from your main script when handing in. Medium: Create a script that searches for possible recessive LOF alleles that may be involved in the disease state. In this case, you should look for heterozygous SNPs in both parents that are homozygous in the affected proband but not in the unaffected sibling.