GENE 760 -‐ Problem Set #3

GENE 760 -­‐ Problem Set #3 This problem set will help you to learn how to analyze mRNA-­‐seq data. You will be working with data from Ayoub et al., Proc. Natl. Acad. Sci. USA 108:14950 (2011). This study generated a high-­‐resolution transcriptome map of the mouse embryonic day 14.5 (E14.5) cortex (Fig.1). At this stage, the mouse cortex consists of several transient embryonic zones: the ventricular zone (VZ), the subventricular zone (SVZ), the intermediate zone (IZ), and the cortical plate (CP). Stem cells in the VZ and SVZ give rise to neurons, which migrate through the SVZ and IZ into the cortical plate, where they differentiate and eventually form the six-­‐layered cerebral cortex. Ayoub et al. used laser microdissection to isolate RNA from the VZ, SVZ, IZ, and the CP. Single-­‐end 75 bp reads were generated for each sample, and two biological replicates were sequenced for each zone. We will focus just on VZ vs. CP. Fig 1. Isolation of RNA from cellular zones of the mouse E14.5 cortex using laser microdissection, followed by mRNA-­‐seq analysis. BEFORE STARTING!: We are starting to work with really large data sets now. You all have a 100gb quota on Louise and once you hit this programs will stop working. Delete or zip (gzip) all files you no longer need, and move gzipped files/directories onto scratch. Also, make sure to delete any intermediate files you make during this problem set. Students are to submit a gzipped tarball called [Your_NetID]_PS2.tar.gz containing the following files: • [Your_NetID]_PS3_answers.txt: Answers to the questions below • genes.fpkm_tracking (for each replicate) • CP_iso_up • CP_gene_up • VZ_iso_up • VZ_gene_up • CP_DAVID • VZ_DAVID • gene_FPKM_dist.rplot • netid_rplots.R • • dendrogram.rplot These are due to ‘DROPBOX/PS3/’ by 9AM on Thursday, Apr 2nd. The datasets for this problem set are located in your PS3 data directory. In this directory, you will find 4 FASTQ formatted sequence files: CP.fastq CP-­‐1.fastq VZ.fastq VZ-­‐1.fastq You will also find a transcript annotation file in GTF format: UCSC_knownGene_mm9_nh.gtf 1. Align each mRNA-­‐seq dataset using TopHat. TopHat uses bowtie to align RNA-­‐seq reads. The TopHat manual can be found here: TopHat can be run with or without a reference annotation. Here you will use the transcript annotation file (UCSC_knownGene_mm9_nh.gtf) provided to do your TopHat alignments. ANS: Record the TopHat commands, and the meaning of any arguments $ tophat -­‐o ./CP1_THout -­‐G UCSC_knownGene_mm9_nh.gtf -­‐p 4 -­‐-­‐transcriptome-­‐
index=transcriptome_data/known mm9 CP-­‐1.fastq $tophat -­‐o ./CP2_THout -­‐p 4 -­‐-­‐transcriptome-­‐index=transcriptome_data/known mm9 CP-­‐
2.fastq NOTE: the –G flag only needs to be provided for the first command, and is not necessary for further commands being run with the same transcriptome index. First, Tophat will first extract transcript sequences from this file, then use Bowtie to align reads to this virtual transcriptome. Whatever isn't aligned to the transcriptome will be mapped to the genome. -­‐p number of processors/threads to use -­‐ default is 1 -­‐o output directory name -­‐G providing gene model annotations/known transcripts -­‐-­‐transcriptome-­‐index <dir/prefix> when a GTF file is provided then a transcription sequence file is build and a bowtie index has to be created in order to align the reads to the known transcripts. This option lets us create a single transcriptome index that can be used for multiple runs/samples. ANS: What is the purpose of the reference annotation? What does TopHat use it for? TopHat uses the reference annotation to improve the accuracy of its mapping: it will map preferentially to existing annotated transcripts, and incorporate splice junction information in order to properly gap reads before aligning to the reference genome. 2. Use Cuffdiff (included with Cufflinks) to quantify gene and isoform expression in each sample and identify differentially expressed genes. Cufflinks uses alignments generated by TopHat to calculate transcript abundances and assemble transcripts. Cufflinks calculates transcript abundances in Fragments Per Kilobase of exon per Million mapped fragments, which is analgous to RPKM. Cuffdiff is an extension of Cufflinks that will calculate FPKMs and identify significant changes in transcript and gene expression among samples. The Cuffdiff manual is here: A tutorial on how to use Cuffdiff is here: ANS: What is the difference between RPKM and FPKM and why is FPKM preferred in paired end sequencing? RPKM: Reads Per Kilobase of transcript per Million mapped reads FPKM: Fragments Per Kilobase of transcript per Million mapped fragments In a paired-­‐end sequencing experiment, the sequencer will output two reads per fragment, but this does not necessarily mean that both reads are high quality and/or mappable. Due to this inconsistency, RPKM will end up counting both reads for some fragments and only one for others, creating an artificial enrichment for some fragments. Additionally, fragments more accurately reflect the properties of the original input library, as fragmentation can differ between library preps. Therefore FPKM is preferred for us in paired end sequence experiments. Using the TopHat alignments you generated above and the transcript annotation file provided, use Cuffdiff to calculate FPKMs for genes and isoforms in each sample, Cuffdiff takes replicates into account, so you do not have to compare single replicates independently. Cuffdiff generates several output files. Gene and isoform level FPKMs for each sample will be reported in genes.fpkm_tracking and isoforms.fpkm_tracking. Keep track of these files, since you will be copying them to the DROPBOX later The format of these files is described here: Differential expression results for each pair of comparisons (e.g., CP vs. VZ) will be reported in isoform.exp.diff and gene.exp.diff. ANS: Record your Cuffdiff commands and the meaning of any arguments you used # Indexing command structure
$ cd CP-1/
$ samtools index accepted_hits.bam
# Cuffdiff command structure
$ cuffdiff UCSC_knownGene_mm9_nh.gtf -L CP,VZ CP-1/accepted_hits.bam,
CP-2/accepted_hits.bam VZ-1/accepted_hits.bam,VZ-2/accepted_hits.bam
-o CPvsVZ
-L specifying a label for each sample, which will be included in the output files
-o directory where all of the output will be written
ANS: Identify genes significantly differentially expressed in the following comparisons: CP vs. VZ There are 6,521 genes significantly differentially expressed in the CP vs. VZ comparison.
This can be performed using grep on Louise.
$ cd CPvsVZ
$ grep –w “yes” gene_exp.diff > sigDiff_gene.txt
$ wc -l sigDiff_gene.txt
6521 sigDiff_gene.txt
3. Identify genes exclusively upregulated in the CP or VZ. Using the Cuffdiff output you generated above, determine the number and identity of genes specifically upregulated in each zone compared to the other zone. Cuffdiff will tell you which genes (and isoforms) it considers to be significantly differentially expressed after multiple testing correction (flagged as ‘yes’ in the significant column of the gene.exp.diff output files). It will also report the base 2 log of the fold change. Please see the Cuffdiff manual for details on the file formats. ANS: For each zone, report the genes that are significantly upregulated at a log2FC > 1. Report your results in the following files: CP_gene_up VZ_gene_up $ cd CPvsVZ
#extract significantly differentially expressed genes for each condition
$ grep -w "yes" gene_exp.diff > sigDiff_gene.txt
#extract upregulated genes for each condition
$ awk '{if($5=="CP" && $10 <= -1) print $0}' sigDiff_gene.txt >
$ wc -l CP_gene_up
2334 CP_gene_up
There are 2334 genes significantly upregulated in the cortical plate.
$ awk '{if($6=="VZ" && $10 >= 1) print $0}' sigDiff_gene.txt >
$ wc -l VZ_gene_up
1980 VZ_gene_up
There are 1980 genes significantly upregulated in the ventricular zone.
ANS: What is the difference between gene level and isoform level expression values? When could gene levels expression values be misleading? Gene-­‐level expression values report the expression levels for the whole gene (the sum of all isoforms of a given transcript) whereas isoform-­‐level expression values report expression of specific splicing isoforms of that gene. The gene-­‐level expression values may be misleading if there is a pronounced differential expression of different isoforms of a given gene. For example, one gene may express 90% isoform A at one time point and 90% B at another, but when summed at each time point, the overall total expression may be the same even though the processed product is significantly different. 4. Determine the top 50% exclusively upregulated genes in the CP. Using R (and your CP_gene_up file), determine the expression level of each of these genes in both tissue types then use R to generate box plots of these data. You should have a single chart with 2 box-­‐whisker plots; 1 boxplot for expression of the genes in each tissue type ANS: Save the plot as gene_FPKM_dist.rplot.png ANS: Save your R scripts as netid_rplots.R #import table into R > CP_gene_up = read.table("~/Desktop/CP_gene_up") #sort the data from low to high based on the log2 Fold Change (note we’re doing decreasing=TRUE so the largest fold change values are at the top of the list, but all fold change values are negative) > CP_gene_up_sort <-­‐ CP_gene_up[order(CP_gene_up[,10], decreasing=TRUE), ] #how many rows are there? > nrow(CP_gene_up_sort) #there are 2334 rows #take top 50% of them -­‐ so 1167 rows > CP_top_50_percent <-­‐ CP_gene_up_sort[1:nrow(CP_gene_up_sort)/2,] #how many rows and columns are there now? > nrow(CP_top_50_percent) > ncol(CP_top_50_percent) #create boxplots -­‐ limit the y-­‐axis so you can actually see plot > boxplot(CP_top_50_percent[[8]],CP_top_50_percent[[9]], main="Top 50% CP Upregulated Genes in Each Condition", ylab="FPKM", xlab="Group", names = c('CP','VZ'), ylim=c(0,100)) ANS: Does this match your expectations for what this graph should look like? Why or why not? The graph does look as expected, where the median value and the 25th-­‐75th percentile of FPKM values are higher in cortical plate as compared to ventricular zone. Since this graph is focusing on genes that are upregulated in the cortical plate condition, it would be anticipated that the cortical plate expression values would be higher than that in ventricular zone. 5. Use DAVID to identify GO Biological Process enrichments for genes upregulated in each zone. You will need to map the Gene ID/names to transcript IDs by parsing the .gtf file. Write a Python script to do this. Your script should take the following inputs from the command line: [Input gtf file] [Input genes file (CP_up)] [Output file name] and output a list of transcript IDs to the output file. ANS: Save your script as ANS: Report DAVID enrichments from the GOTERM_BP_FAT output table as flat text files with the indicated names: CP_DAVID VZ_DAVID