Skip to content

Latest commit

 

History

History
728 lines (554 loc) · 33.3 KB

Finkbeiner_StemCellReports2015.org

File metadata and controls

728 lines (554 loc) · 33.3 KB

Bioinformatics analysis in “Transcriptome-wide analysis reveals hallmarks of human intestine development and maturation in vitro and in vivo.”

Methods

RNAseq data preprocessing

RNA sequencing data obtained from of three human intestinal organoid samples was appended to the normalized RNAseq data sets of 6 unique fetal intestinal tissue samples and 6 unique adult small intestinal tissue samples obtained from publicly available sources. The UM Bioinformatics Core downloaded the reads files from the Sequencing Core storage, and concatenated those into a single .fastq file for each sample. The UM Bioinformatics Core also downloaded reads files from EBI-AE database (Adult Samples) and NCBI-GEO (SRA) database (Fetal samples). We evaluated the quality of the raw reads data for each sample using FastQC (version 0.10.1) to identify features of the data that may indicate quality problems (e.g. low quality scores, over-represented sequences, inappropriate GC content, etc.). Initial QC report indicated over-representation of Illumina adapter sequences in samples from EBI-AE data set and NCBI-GEO data set. We trimmed off the adapter sequences from the reads using Cutadapt (version 0.9.5). We used the software package Tuxedo Suite for alignment, differential expression analysis, and post-analysis diagnostics. Briefly, we aligned reads to the reference transcriptome (UCSC hg19) using TopHat (version 2.0.9) and Bowtie (version 2.1.0.0). We used default parameter settings for alignment, with the exception of: “–b2-very-sensitive” telling the software to spend extra time searching for valid alignments, as well as “–no-coverage-search” and “–no-novel-juncs” to limit the read mapping to known transcripts. In addition, we used FastQC for a second round of quality control (post-alignment), to ensure that only high quality data would be input to expression quantitation and differential expression analysis. Sequence retrieval, concatenation, and quality control analysis were conducted using 64-bit Red Hat Enterprise Linux version 6.3 at the UM Advanced Research Computing center. Sequenced RNA datasets were compiled as data frames containing normalized fragments per kilobase of exon per million fragments mapped (Fpkm) for 23,615 unique genes in each of the 15 samples by the Bioinformatics Core in the Department of Computation Medicine and Bioinformatics at the University of Michigan. We used Cufflinks/CuffNorm (version 2.2.1) for expression quantitation and differential expression analysis, using UCSC hg19.fa as the reference genome sequence and UCSC hg19.gtf as the reference transcriptome annotation. For this analysis, we used parameter settings: “–multi-read-correct” to adjust expression calculations for reads that map in more than one locus, as well as “–compatible-hits-norm” and “–upper-quartile –norm” for normalization of expression values. Normalized FPKM tables were generated using the CuffNorm function found in Cufflinks. One-way ANOVA was applied to evaluate differences in expression among the HIO, adult intestine, and fetal intestine samples for each gene using the matrixStats package for the R (version 3.1.2) statistical programming language. A P-value of less than 0.05 was considered statistically significant. Sequence retrieval, concatenation, and quality control analysis were conducted using 64-bit Red Hat Enterprise Linux version 6.3 at the UM Advanced Research Computing center.

Principle component analysis (PCA)

The complete FPKM matrix, containing frequency counts for all 23,615 genes contained in the reference genome for all RNAseq samples, was evaluated using unscaled principle component analysis (PCA) to visualize and quantify multi-dimensional variation between samples. Of the 23,615 genes annotated in the reference genome, 1,281 (5.4%) were not detected in the RNAseq analysis of any of the samples. Principle components were calculated using the function ‘prcomp’ found in the R (version 3.1.2) statistical programming language and plotted using the R package ‘ggplot2’.

Hierarchical clustering with statistical bootsrapping

Hierarchical cluster analysis based on the Canberra distance between FPKM vectors was used to classify discrete RNAseq samples according to the degree of total transcriptional dissimilarity indicated by the normalized FPKM values. Bootstrap analysis was used to assesses the uncertainty in the assigned hierarchical clustering relationships. 10,000 bootstraping iterations were generated by repeatedly randomly sampling the FPKM dataset. The bootstrap probability (BP) of a cluster is defined as the frequency of a given relationship among the bootstrap replicates. Multiscale bootstrap resampling was used to calculate an approximately unbiased (AU) p-value for a given relationship, with AU > 95 indicating a high degree of statistical significance. Analyses were conducted using R package ‘pvclust’.

Spearman correlation

Spearman correlation was applied as an additional assessment of the cumulative degree of correlation among RNAseq datasets. We computed Spearman rank correlation coefficients (ρ) in a pairwise manner among all RNAseq samples using the complete normalized FPKM data. The Spearman coefficients were plotted as a heatmap using the function ‘heatmap.2’ in the R package ‘gplots’.

Gene ontology

Biological process gene ontology categories with putative roles in intestinal development were identified using the Gene Ontology Consortium database. Genbank gene symbols classified within each gene ontology (GO) term were retrieved the BioMart package for Biocunductor in R. Fpkm data corresponding to gene symbols within a GO term were retrieved from the normalized dataframe if the ANOVA P-value was < 0.05 for the comparison among the three sample groups and mean Fpkm was greater than 1 in at least one of the three sample groups. Thus, downstream gene ontology analysis included only genes which differed significantly in expression among the HIO, fetal, and adult samples and were reproducibly expressed in at least one sample group. Normalized Z-scores for each gene within a GO term that met the inclusion criteria for were calculated as follows:

z = (x - μ)/σ

Where x is an Fpkm value, μ is the mean Fpkm among all samples in all three groups for a given gene, and σ is the standard deviation among all samples in all three groups for a given gene. Heatmaps and associated dendrograms were plotted in R using the function heatmap.2 in the package gplots.

Figures

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
## Clear temporary memory
rm(list=ls())

library(ggplot2)

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Principle component analysis of RAW FPKM matrix

## FMPK matrix input
data <- read.table("./DATA/HIO_v_HuSI.Duo.A_v_HuSI.Dist.A_v_HuSI.F_v_HuColon.A_v_HuStomach.F_v_HuStomach.A_v_MD_v_ME_repFpkms.txt",header=TRUE,sep="")

# subset of dataframe including only the numeric columns (excludes gene id column)
num.data <- data[,sapply(data,is.numeric)]
rownames(num.data) <- data$gene_id

pca.data <- num.data[apply(num.data, 1, var, na.rm=TRUE) != 0,]
pca <- prcomp(t(pca.data),scale=TRUE,center=FALSE)

HIO <- grep("HIO",colnames(num.data),ignore.case=T)
SIdu <- grep("SI_Du",colnames(num.data),ignore.case=T)
SIdi <- grep("SI_Di",colnames(num.data),ignore.case=T)
SIf <- grep("SI_F",colnames(num.data),ignore.case=T)
CA <- grep("lon_A",colnames(num.data),ignore.case=T)
STA <- grep("ach_A",colnames(num.data),ignore.case=T)
STf <- grep("ach_F",colnames(num.data),ignore.case=T)
MD <- grep("MD",colnames(num.data),ignore.case=T)
ME <- grep("ME",colnames(num.data),ignore.case=T)
#shape <- c(rep(2,times=length(CDI)),rep(1,times=length(NI)))

group <- c(rep("HIO",times=length(HIO)),rep("prox S.I.",times=length(SIdu)),rep("dist S.I.",times=length(SIdi)),rep("Fetal S.I.",times=length(SIf)),rep("colon",times=length(CA)),rep("Fetal stomach",times=length(STf)),rep("Stomach",times=length(STA)),rep("Def. end.",times=length(MD)),rep("ESC",times=length(ME)))
scores <- data.frame(colnames(num.data), pca$x[,1:ncol(pca$x)],group)
shape <- c(rep(22,times=length(HIO)),rep(21,times=30))

library(ggplot2)
theme <- theme(legend.position="right",legend.title=element_blank(),legend.background = element_rect(fill="white", size=.5, linetype="dotted"),panel.grid.minor=element_blank(), panel.grid.major=element_blank())

pc1.2 <- qplot(x=PC1, y=PC2, data=scores) + theme + 
  scale_fill_brewer(palette="Set3")+
  geom_point(shape=21,aes(fill=factor(group)), size=8)  

pc1.3 <- qplot(x=PC1, y=PC3, data=scores) + theme +
  scale_fill_brewer(palette="Set3")+
  geom_point(shape=21,aes(fill=factor(group)), size=8) 

pc2.3 <- qplot(x=PC2, y=PC3, data=scores) + theme +
  scale_fill_brewer(palette="Set3")+
  geom_point(shape=21,aes(fill=factor(group)), size=8) 
#print(pc2.3)

#print(pc1.2)
multiplot(pc1.2,pc1.3,pc2.3,cols=3)

./DATA/pcaplot.jpg

PCA in small intestinal, endoderm, and ES samples

smallI <- c(HIO,SIdu,SIdi,SIf,MD,ME)
num.data.i <- num.data[,smallI]
pca.data2 <- num.data.i[apply(num.data.i, 1, var, na.rm=TRUE) != 0,]
pca2 <- prcomp(t(pca.data2),scale=TRUE,center=FALSE)

groupI <- c(rep("HIO",times=length(HIO)),rep("prox S.I.",times=length(SIdu)),rep("dist S.I.",times=length(SIdi)),rep("Fetal S.I.",times=length(SIf)),rep("Def. end.",times=length(MD)),rep("ESC",times=length(ME)))
scores2 <- data.frame(colnames(num.data.i), pca2$x[,1:ncol(pca2$x)],groupI)

library(ggplot2)
theme <- theme(legend.position="right",legend.title=element_blank(),legend.background = element_rect(fill="white", size=.5, linetype="dotted"),panel.grid.minor=element_blank(), panel.grid.major=element_blank())

pc1.2 <- qplot(x=PC1, y=PC2, data=scores2) + theme + 
  scale_fill_brewer(palette="Set2")+
  geom_point(shape=21,aes(fill=factor(groupI)), size=8)  

pc1.3 <- qplot(x=PC1, y=PC3, data=scores2) + theme +
  scale_fill_brewer(palette="Set2")+
  geom_point(shape=21,aes(fill=factor(groupI)), size=8) 

pc2.3 <- qplot(x=PC2, y=PC3, data=scores2) + theme +
  scale_fill_brewer(palette="Set2")+
  geom_point(shape=21,aes(fill=factor(groupI)), size=8) 

multiplot(pc1.2,pc1.3,pc2.3,cols=3)

./DATA/pcaplot2.jpg

PCA in small intestinal and HIO samples only

smallIo <- c(HIO,SIdu,SIdi,SIf)
num.data.io <- num.data[,smallIo]
pca.data3 <- num.data.io[apply(num.data.io, 1, var, na.rm=TRUE) != 0,]
pca3 <- prcomp(t(pca.data3),scale=TRUE,center=FALSE)

groupIo <- c(rep("HIO",times=length(HIO)),rep("prox S.I.",times=length(SIdu)),rep("dist S.I.",times=length(SIdi)),rep("Fetal S.I.",times=length(SIf)))
scores3 <- data.frame(colnames(num.data.io), pca3$x[,1:ncol(pca3$x)],groupIo)

library(ggplot2)
theme <- theme(legend.position="right",legend.title=element_blank(),legend.background = element_rect(fill="white", size=.5, linetype="dotted"),panel.grid.minor=element_blank(), panel.grid.major=element_blank())

pc1.2 <- qplot(x=PC1, y=PC2, data=scores3) + theme + 
  scale_fill_brewer(palette="Set1")+
  geom_point(shape=21,aes(fill=factor(groupIo)), size=8)  

pc1.3 <- qplot(x=PC1, y=PC3, data=scores3) + theme +
  scale_fill_brewer(palette="Set1")+
  geom_point(shape=21,aes(fill=factor(groupIo)), size=8) 

pc2.3 <- qplot(x=PC2, y=PC3, data=scores3) + theme +
  scale_fill_brewer(palette="Set1")+
  geom_point(shape=21,aes(fill=factor(groupIo)), size=8) 

multiplot(pc1.2,pc1.3,pc2.3,cols=3)

./DATA/pcaplot3.jpg

Hierarchical cluster analysis of FPKM matrix in all samples

library(pvclust)
result <- pvclust(pca.data,method.dist="canberra",method.hclust="complete",nboot=10000)
save(result,file="./DATA/result")
load(file="./DATA/pvclust/DATA/result")
print(plot(result,hang=-1,float=0.008,cex.pv=0.3,font.pv=4,print.num=FALSE,print.pv=TRUE,col.pv=c(2,3,8),lwd=1,cex=0.6,main="",col="grey30",ylab="Distance"))

./DATA/pvclust1.jpg

Hierarchical cluster analysis of small intestinal tissues, endoderm, and ES cells

smallI <- c(HIO,SIdu,SIdi,SIf,MD,ME)
num.data.i <- num.data[,smallI]
pca.data2 <- num.data.i[apply(num.data.i, 1, var, na.rm=TRUE) != 0,]

result2 <- pvclust(pca.data2,method.dist="canberra",method.hclust="complete",nboot=10000)
save(result2,file="./DATA/result2")
load(file="./DATA/pvclust/DATA/result2")
print(plot(result2,hang=-1,float=0.008,cex.pv=0.3,font.pv=4,print.num=FALSE,print.pv=TRUE,col.pv=c(2,3,8),lwd=1,cex=0.6,main="",col="grey30",ylab="Distance"))

./DATA/pvclust2.jpg

Hierarchical cluster analysis of small intestinal tissues only

smallIo <- c(HIO,SIdu,SIdi,SIf)
num.data.io <- num.data[,smallIo]
pca.data3 <- num.data.io[apply(num.data.io, 1, var, na.rm=TRUE) != 0,]

result3 <- pvclust(pca.data3,method.dist="canberra",method.hclust="complete",nboot=10000)
save(result3,file="./DATA/result3")
load(file="./DATA/pvclust/DATA/result3")
print(plot(result3,hang=-1,float=0.008,cex.pv=0.3,font.pv=4,print.num=FALSE,print.pv=TRUE,col.pv=c(2,3,8),lwd=1,cex=0.6,main="",col="grey30",ylab="Distance"))

./DATA/pvclust3.jpg

Gene expression correlation matrix

cor1 <- cor(pca.data,method="spearman")

## Load libraries
library(matrixStats)
library(gplots)
library(RColorBrewer)

# creates a color palette from blue to white to red in "n" increments
my_palette <- colorRampPalette(c("blue","white","red"))(n = 299)

# defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(0,0.6,length=100),  #blue
               seq(0.6,0.85,length=100),
               seq(0.85,1,length=100))   #red    

cor.mat <- as.matrix(cor1)

result <- heatmap.2(cor.mat,
                    notecol="black",     
                    #scale="row",na.rm=T,
                    density.info="none",
                    key.xlab="Correlation",
                    key.title="",   # turns off density plot inside color legend
                    trace="none",         # turns off trace lines inside the heat map
                    margins =c(8,8),     # widens margins around plot
                    col=my_palette,       # use on color palette defined earlier 
                    breaks=col_breaks,    # enable color transition at specified limits
                    dendrogram="both",    # do not draw dendrogram
                    Colv=T,
                    Rowv=T,
                    srtCol=45,
                    cexRow= 0.7,
                    cexCol = 0.7,
                    keysize=0.25,
                    labRow = rownames(cor.mat),
                    labCol = colnames(cor.mat),
                    #rowsep=(order(GOterm$V3)[!duplicated(sort(GOterm$V3))]-1),
                    lwid = c(1.2,5),
                    lhei = c(1.2,5)
                    )          

print(result)

./DATA/corrmatrix1.jpg

Gene expression correlation matrix of small intestinal tissues, endoderm, and ES cells

cor2 <- cor(pca.data2,method="spearman")

## Load libraries
library(matrixStats)
library(gplots)
library(RColorBrewer)

# creates a color palette from blue to white to red in "n" increments
my_palette <- colorRampPalette(c("blue","white","red"))(n = 299)

# defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(0,0.6,length=100),  #blue
               seq(0.6,0.85,length=100),
               seq(0.85,1,length=100))   #red    

cor.mat <- as.matrix(cor2)

result <- heatmap.2(cor.mat,
                    notecol="black",     
                    #scale="row",na.rm=T,
                    density.info="none",
                    key.xlab="Correlation",
                    key.title="",   # turns off density plot inside color legend
                    trace="none",         # turns off trace lines inside the heat map
                    margins =c(8,8),     # widens margins around plot
                    col=my_palette,       # use on color palette defined earlier 
                    breaks=col_breaks,    # enable color transition at specified limits
                    dendrogram="both",    # do not draw dendrogram
                    Colv=T,
                    Rowv=T,
                    srtCol=45,
                    cexRow= 0.7,
                    cexCol = 0.7,
                    keysize=0.25,
                    labRow = rownames(cor.mat),
                    labCol = colnames(cor.mat),
                    #rowsep=(order(GOterm$V3)[!duplicated(sort(GOterm$V3))]-1),
                    lwid = c(1.2,5),
                    lhei = c(1.2,5)
                    )          

print(result)

./DATA/corrmatrix2.jpg

Gene expression correlation matrix of small intestinal tissues only

cor3 <- cor(pca.data3,method="spearman")

## Load libraries
library(matrixStats)
library(gplots)
library(RColorBrewer)

# creates a color palette from blue to white to red in "n" increments
my_palette <- colorRampPalette(c("blue","white","red"))(n = 299)

# defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(0,0.6,length=100),  #blue
               seq(0.6,0.85,length=100),
               seq(0.85,1,length=100))   #red    

cor.mat <- as.matrix(cor3)

result <- heatmap.2(cor.mat,
                    notecol="black",     
                    #scale="row",na.rm=T,
                    density.info="none",
                    key.xlab="Correlation",
                    key.title="",   # turns off density plot inside color legend
                    trace="none",         # turns off trace lines inside the heat map
                    margins =c(8,8),     # widens margins around plot
                    col=my_palette,       # use on color palette defined earlier 
                    breaks=col_breaks,    # enable color transition at specified limits
                    dendrogram="both",    # do not draw dendrogram
                    Colv=T,
                    Rowv=T,
                    srtCol=45,
                    cexRow= 0.7,
                    cexCol = 0.7,
                    keysize=0.25,
                    labRow = rownames(cor.mat),
                    labCol = colnames(cor.mat),
                    #rowsep=(order(GOterm$V3)[!duplicated(sort(GOterm$V3))]-1),
                    lwid = c(1.2,5),
                    lhei = c(1.2,5)
                    )          

print(result)

./DATA/corrmatrix3.jpg

Expression of host defense, digestive, mesenchymal, and stem cells markers

## David Hill
## September 29, 2014
## Updated October 21, 2014
## R script to analyze RNAseq data from Jason Spence and Stacy Finkbeiner
## Comparison of HIO transcript profile (RNAseq) with adult and fetal human intestine

## Clear temporary memory
rm(list=ls())

##input
data <- read.table("./DATA/HIO_v_HuSI.Duo.A_v_HuSI.Dist.A_v_HuSI.F_repFpkms.csv",header=TRUE,sep=",")

# -----------------------------------------------------------------------------
# SET CRITERIA FOR INCLUSION IN DOWNSTREAM ANALYSIS
# Fpmk cut-off 
fco <- 0.4
# P-value cut-off
pco <- 1
# -----------------------------------------------------------------------------


## Load libraries
library(matrixStats)

## ANOVA
# 3 groups 1) HIO 2) Adult 3) Fetal
# reference for anova 
# http://web.mst.edu/~psyworld/anovaexample.htm
# F- distribution
# http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Fdist.html
## function to compare within row, returns f distribution
## The function is currently defined as
row.anova <- function(mat_all,mat1,mat2,mat3){
  mat1 <- as.matrix(mat1)
  mat2 <- as.matrix(mat2)
  mat3 <- as.matrix(mat3)
  mat_all <- as.matrix(mat_all)
  
  s1<-rowSums(mat1,na.rm=TRUE) #
  s2<-rowSums(mat2,na.rm=TRUE) #
  s3<-rowSums(mat3,na.rm=TRUE) #
  
  s1s<-rowSums((mat1^2),na.rm=TRUE) #
  s2s<-rowSums((mat2^2),na.rm=TRUE) #
  s3s<-rowSums((mat3^2),na.rm=TRUE) #
  
  sst <- (s1s+s2s+s3s)-(((s1+s2+s3)^2)/ncol(mat_all)) #
  
  ssa <- (((s1^2)/ncol(mat1))+((s2^2)/ncol(mat2))+((s3^2)/ncol(mat3)))-((s1+s2+s3)^2/ncol(mat_all)) #
  
  ssw <- sst - ssa #
  
  f.stat <- (ssa/2)/(ssw/(ncol(mat_all)-3))
  return(f.stat)
}

# calculate f-statistic
data$fstat <- row.anova(data[,3:17],data[,3:5],data[,6:11],data[12:17])
# express f-statistic as p-value
data$p <- pf(data$fstat, 2, 15, lower.tail=F)
# calculate Bonferroni correction
data$Bonf_p <- p.adjust(data$p, method = 'bonferroni', n = length(data$p))

# create three lists that will specify the treatment groups
cn <- colnames(data)
fetal <- grep("HuSI_F",cn,ignore.case=T)
adult <- grep("HuSI_D", cn, ignore.case = T)
hio <- grep("HIO", cn, ignore.case = T)
all <- grep("HIO|HuSI",cn, ignore.case =T)

#Count the number of reads with Fpmk > "fco" in each group 
data$hio_count <- rowSums(data[,hio] >= fco, na.rm=T)
data$adult_count <- rowSums(data[,adult] >= fco, na.rm=T)
data$fetal_count <- rowSums(data[,fetal] >= fco, na.rm=T)

# generate "present" call (present = 1, not present = 0) for each group. Greater than 1/3 of replicates must exceed FPMK cutoff
data$hio_present <- ifelse((data$hio_count/ncol(data[,hio])) > (1/3), 1, 0)
data$adult_present <- ifelse((data$adult_count/ncol(data[,adult])) > (1/3), 1, 0)
data$fetal_present <- ifelse((data$fetal_count/ncol(data[,fetal])) > (1/3), 1, 0)

# Group means
data$hio_mean <- rowMeans(data[,hio], na.rm=T)
data$adult_mean <- rowMeans(data[,adult], na.rm=T)
data$fetal_mean <- rowMeans(data[,fetal], na.rm=T)

# calculate log2 change
data$hio_adult_log2 <- log2(data$hio_mean/data$adult_mean)
data$hio_fetal_log2 <- log2(data$hio_mean/data$fetal_mean)
data$fetal_adult_log2 <- log2(data$fetal_mean/data$adult_mean)

#calculate z-scores
# mean expression
data$mean <- rowMeans(data[,all], na.rm=T)
# calculate row sd ignoring NA
data$sd <- apply(data[,all],1,sd, na.rm=T)
data$HIO_0.z <- (data$HIO_0 - data$mean)/data$sd
data$HIO_1.z <- (data$HIO_1 - data$mean)/data$sd
data$HIO_2.z <- (data$HIO_2 - data$mean)/data$sd

data$HuSI_Duo_A_0.z <- (data$HuSI_Duo_A_0 - data$mean)/data$sd
data$HuSI_Duo_A_1.z <- (data$HuSI_Duo_A_1 - data$mean)/data$sd

data$HuSI_Dist_A_0.z <- (data$HuSI_Dist_A_0 - data$mean)/data$sd
data$HuSI_Dist_A_1.z <- (data$HuSI_Dist_A_1 - data$mean)/data$sd
data$HuSI_Dist_A_2.z <- (data$HuSI_Dist_A_2 - data$mean)/data$sd
data$HuSI_Dist_A_3.z <- (data$HuSI_Dist_A_3 - data$mean)/data$sd

data$HuSI_F_0.z <- (data$HuSI_F_0 - data$mean)/data$sd
data$HuSI_F_1.z <- (data$HuSI_F_1 - data$mean)/data$sd
data$HuSI_F_2.z <- (data$HuSI_F_2 - data$mean)/data$sd
data$HuSI_F_3.z <- (data$HuSI_F_3 - data$mean)/data$sd
data$HuSI_F_4.z <- (data$HuSI_F_4 - data$mean)/data$sd
data$HuSI_F_5.z <- (data$HuSI_F_5 - data$mean)/data$sd

# create three lists that will specify the treatment groups (z-scores)
cn.1 <- colnames(data)
fetal.z <- grep("HuSI_F_0.z|HuSI_F_1.z|HuSI_F_2.z|HuSI_F_3.z|HuSI_F_4.z|HuSI_F_5.z",cn.1,ignore.case=T)
adult.z <- grep("HuSI_Duo_A_0.z|HuSI_Duo_A_1.z|HuSI_Dist_A_0.z|HuSI_Dist_A_1.z|HuSI_Dist_A_2.z|HuSI_Dist_A_3.z", cn.1, ignore.case = T)
hio.z <- grep("HIO_0.z|HIO_1.z|HIO_2.z", cn.1, ignore.case = T)
all.z <- grep(".z",cn.1, ignore.case =T)

#calculate z-scores
# mean z-scores expression
data$hio_zmean <- rowMeans(data[,hio.z], na.rm=T)
data$adult_zmean <- rowMeans(data[,adult.z], na.rm=T)
data$fetal_zmean <- rowMeans(data[,fetal.z], na.rm=T)


olfm4 <- subset(data, data$gene_id == "OLFM4"|data$gene_id =="GC1"| data$gene_id =="OLM4"| data$gene_id =="OlfD"| data$gene_id =="GW112"| data$gene_id =="hGC-1"| data$gene_id =="hOLfD"| data$gene_id =="UNQ362")


### SUBSETTING

#pco <- olfm4$p

# ANOVA P-value cut-off
data.a <- subset(data, data$p <= pco)
# FPMK reads exceeding "fco" in more than one replicate/group in at least one group
data.a <- subset(data.a, data.a$hio_present ==1 | data.a$adult_present == 1 | data.a$fetal_present)

data.b <- as.matrix(data.a[,all])
rownames(data.b) <-  data.a$gene_id

# create three lists that will specify the treatment groups
cn <- colnames(data.b)
fetal <- grep("HuSI_F",cn,ignore.case=T)
adult <- grep("HuSI_D", cn, ignore.case = T)
hio <- grep("HIO", cn, ignore.case = T)
all <- grep("HIO|HuSI",cn, ignore.case =T)

## Load libraries
library(matrixStats)
library(gplots)
library(RColorBrewer)


list <- read.csv("./DATA/Focused_Gene_List.csv",header=FALSE)

#list <- as.character(list$V1)

GOcad <- list[,2]

genes <- which(rownames(data.b) %in% GOcad)


GOterm <- as.data.frame(data.b[genes,])

# mean expression
GOterm$mean <- rowMeans(GOterm[,all], na.rm=T)
# calculate row sd ignoring NA
GOterm$sd <- apply(GOterm[,all],1,sd, na.rm=T)
GOterm$HIO_0.z <- (GOterm$HIO_0 - GOterm$mean)/GOterm$sd
GOterm$HIO_1.z <- (GOterm$HIO_1 - GOterm$mean)/GOterm$sd
GOterm$HIO_2.z <- (GOterm$HIO_2 - GOterm$mean)/GOterm$sd

GOterm$HuSI_Duo_A_0.z <- (GOterm$HuSI_Duo_A_0 - GOterm$mean)/GOterm$sd
GOterm$HuSI_Duo_A_1.z <- (GOterm$HuSI_Duo_A_1 - GOterm$mean)/GOterm$sd

GOterm$HuSI_Dist_A_0.z <- (GOterm$HuSI_Dist_A_0 - GOterm$mean)/GOterm$sd
GOterm$HuSI_Dist_A_1.z <- (GOterm$HuSI_Dist_A_1 - GOterm$mean)/GOterm$sd
GOterm$HuSI_Dist_A_2.z <- (GOterm$HuSI_Dist_A_2 - GOterm$mean)/GOterm$sd
GOterm$HuSI_Dist_A_3.z <- (GOterm$HuSI_Dist_A_3 - GOterm$mean)/GOterm$sd

GOterm$HuSI_F_0.z <- (GOterm$HuSI_F_0 - GOterm$mean)/GOterm$sd
GOterm$HuSI_F_1.z <- (GOterm$HuSI_F_1 - GOterm$mean)/GOterm$sd
GOterm$HuSI_F_2.z <- (GOterm$HuSI_F_2 - GOterm$mean)/GOterm$sd
GOterm$HuSI_F_3.z <- (GOterm$HuSI_F_3 - GOterm$mean)/GOterm$sd
GOterm$HuSI_F_4.z <- (GOterm$HuSI_F_4 - GOterm$mean)/GOterm$sd
GOterm$HuSI_F_5.z <- (GOterm$HuSI_F_5 - GOterm$mean)/GOterm$sd



# generate evenly spaced color palettes, specifying the number of increments
cc <- colorRampPalette(c("forestgreen","dodgerblue","darkorange","firebrick3"))(n=4)


rownames(list) <- list[,2]

GOterm <- merge(GOterm, list, by="row.names",all=TRUE)
GOterm <- GOterm[order(GOterm$V1),]
 # replace cluster numbers with color palette

GOterm$V3[GOterm$V1 == levels(GOterm$V1)[1]] <- cc[1]
GOterm$V3[GOterm$V1 == levels(GOterm$V1)[2]] <- cc[2]
GOterm$V3[GOterm$V1 == levels(GOterm$V1)[3]] <- cc[3]
GOterm$V3[GOterm$V1 == levels(GOterm$V1)[4]] <- cc[4]

all.z <- grep(".z",colnames(GOterm), ignore.case =T)
GOmap.z <- as.matrix(GOterm[,all.z])
rownames(GOmap.z) <- GOterm$V2

# creates a color palette from blue to white to red in "n" increments
my_palette <- colorRampPalette(c("blue","white","red"))(n = 299)

# defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(min(GOmap.z),-1,length=100),  #blue
               seq(-1,1,length=100),
               seq(1,max(GOmap.z),length=100))   #red      



result <- heatmap.2(GOmap.z,
                    notecol="black",     
                    #scale="row",na.rm=T,
                    density.info="none",
                    key.xlab="Z-score",   # turns off density plot inside color legend
                    trace="none",         # turns off trace lines inside the heat map
                    margins =c(max(nchar(colnames(GOmap.z))),max(nchar(rownames(GOmap.z)))),     # widens margins around plot
                    col=my_palette,       # use on color palette defined earlier 
                    breaks=col_breaks,    # enable color transition at specified limits
                    dendrogram="column",    # do not draw dendrogram
                    Colv=T,
                    Rowv=F,
                    cexCol=1.5,
                    srtCol=45,
                    cexRow=1.5,
                    keysize=0.25,
                    rowsep=(order(GOterm$V3)[!duplicated(sort(GOterm$V3))]-1),
                   #rowsep=c(1:nrow(GOmap.z)),
                   #colsep=c(1:ncol(GOmap.z)),
                   RowSideColors = GOterm$V3, #clustercolor$cluster
                    lwid = c(1,5),
                    lhei = c(1,5)
                    )          


./DATA/CustomHeatmapRevised.jpg

tubb3 <- grep("TUBB3",rownames(GOmap.z))

GOmap.z.a <- GOmap.z[-tubb3,]
row.colors <- GOterm$V3[-tubb3]

result <- heatmap.2(GOmap.z.a,
                    notecol="black",     
                    #scale="row",na.rm=T,
                    density.info="none",
                    key.xlab="Z-score",   # turns off density plot inside color legend
                    trace="none",         # turns off trace lines inside the heat map
                    margins =c(max(nchar(colnames(GOmap.z.a))),max(nchar(rownames(GOmap.z.a)))),     # widens margins around plot
                    col=my_palette,       # use on color palette defined earlier 
                    breaks=col_breaks,    # enable color transition at specified limits
                    dendrogram="column",    # do not draw dendrogram
                    Colv=T,
                    Rowv=F,
                    cexCol=1.5,
                    srtCol=45,
                    cexRow=1.5,
                    keysize=0.25,
                    rowsep=(order(row.colors)[!duplicated(sort(row.colors))]-1),
                   RowSideColors = row.colors, #clustercolor$cluster
                    lwid = c(1,5),
                    lhei = c(1,5)
                    )          


./DATA/CustomHeatmapRevised_noTUBB3.jpg