List of processes

DA_split__splitting_differential_analysis_results_in_subsets


Sub-workflow showing the creation of DASs.
Dotted arrows indicate optional additional filters. Abbreviations: FDR - False Discovery Rate, DAR - differentially accessible region, prom - promoter, distNC - distal non-coding region.


Diagrams showing how differential analysis results are split by experiment type and fold change filters.
Panel (a) shows the color code used in all panels, with blue circles representing genomic regions (either DARs or promoters of DEGs) and black circles representing gene sets (either the closest genes of DARs or DEGs). Enrichment of internal GRs and GSs indicates enrichment of GRs and GSs (i.e., DASs) in other GRs and GSs generated by the pipeline. Panels (b-g) show all possible splits of differential analysis results by experiment type (ATAC-Seq – turquoise, mRNA-Seq – orange, or both ATAC-Seq and mRNA-Seq – purple) and by fold change type (up – yellow, or down – green), with either an increase (b) or a decrease (c) in chromatin accessibility, an increase (d) or a decrease (e) in gene expression, and an increase (f) or a decrease (g) in both chromatin accessibility and gene expression. The HA-HE and HA-LE terminology has been previously described in (Nair et al., 2021). Black lines and blue circles represent DNA and nucleosomes, respectively. Orange lines represent mRNA molecules.

Description

This process splits differential analysis results into subsets (DASs, Differential Analysis Subsets) so that enrichment analysis can be run on each subset separately and as much information as possible can be extracted from the data.
Four filters are used for the split:

  • ET: Experiment Type. Can be either 'ATAC', 'mRNA', 'both', 'both_ATAC', or 'both_mRNA'.
  • PA: DAR Peak Annotation. Can be any combination of 'all', 'gene', 'interG', 'prom', '5pUTR', '3pUTR', 'exon', 'intron', 'downst', 'distIn', 'UTR', 'TSS', 'genPro', 'distNC', 'mt10kb', 'mt100kb', 'mtYkb', 'lt10kb', 'lt100kb', 'ltXkb'. See DA_ATAC__saving_detailed_results_tables for details. 'all' disables this filter (all peaks are included).
  • FC: Fold Change type. Splits results into up- and down-regulated entries.
  • TV: Threshold Value(s). Splits results by significance threshold.

NOTE: The 'both*' entries indicate that the results pass the filters in both ATAC-Seq and mRNA-Seq. 'both' is used for gene lists (i.e., to find enriched ontologies), while 'both_ATAC' and 'both_mRNA' are used for genomic regions (i.e., to find enriched motifs/ChIP). 'both_ATAC' entries are ATAC-Seq peaks assigned to genes that also pass the filters in the mRNA-Seq data. 'both_mRNA' entries are promoters of genes that pass the filters in mRNA-Seq and that have nearby ATAC-Seq peaks assigned to the same gene that also pass the filters.

NOTE: The process merges mRNA-Seq and ATAC-Seq results if experiment_types = 'both'; otherwise, it works on either one of the two.

Finally, a key is made, of the form ${ET}__${PA}__${FC}__${TV}__${COMP}, with COMP indicating the comparison. This key is used to make:

  • bed files that contain genomic regions (i.e., to find enriched motifs/ChIP)
  • R files that contain gene sets (i.e., to find enriched ontologies and to plot Venn diagrams).
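
As an illustration of the key format described above, here is a minimal Python sketch that builds and parses a key of the form ${ET}__${PA}__${FC}__${TV}__${COMP}. The helper names (`make_key`, `parse_key`, `DasKey`) and the example values ('1.3', 'ctl_vs_kd') are hypothetical and are not part of the pipeline; the sketch only assumes that the five components are joined with a double underscore.

```python
from typing import NamedTuple

SEP = "__"  # separator between key components, as in ${ET}__${PA}__${FC}__${TV}__${COMP}


class DasKey(NamedTuple):
    """The five components of a key (hypothetical container, for illustration only)."""
    et: str    # experiment type, e.g. 'ATAC', 'mRNA', 'both_ATAC'
    pa: str    # peak annotation filter, e.g. 'all', 'prom'
    fc: str    # fold change type, e.g. 'up', 'down'
    tv: str    # threshold value, kept as text
    comp: str  # comparison name


def make_key(et, pa, fc, tv, comp):
    """Join the five components into a single key string."""
    return SEP.join([et, pa, fc, str(tv), comp])


def parse_key(key):
    """Split a key back into its components.

    maxsplit=4 keeps any later '__' inside the comparison name intact.
    """
    parts = key.split(SEP, 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}: {key!r}")
    return DasKey(*parts)


key = make_key("both_ATAC", "prom", "up", 1.3, "ctl_vs_kd")
print(key)                # both_ATAC__prom__up__1.3__ctl_vs_kd
print(parse_key(key).et)  # both_ATAC
```

Because 'both_ATAC' and 'both_mRNA' contain a single underscore, splitting on the double underscore leaves them in one piece.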

In addition, two types of tables are produced: res_simple and res_filter. Both tables contain the same columns: the 5 key components (ET, PA, FC, TV and COMP), a peak_id column (Null for mRNA-Seq results), chromosome, gene name and id, p-value, adjusted p-value and log2 fold change. The two tables differ in their format:

  • res_simple: each result is reported once, with the filters that it passes combined with "|" (e.g., PA: 'all|prom'). This makes it quick to browse all results. Please note that ET = 'both' entries are not shown in this file.
  • res_filter: only results passing filters are reported, and each passed filter is on a separate line (so 'all' and 'prom' would be on two different lines in the previous example). This file should be smaller, as it excludes all non-significant results. It includes the entries that are significant in both ATAC-Seq and mRNA-Seq, with ET = 'both_ATAC' showing the ATAC-Seq results and ET = 'both_mRNA' showing the mRNA-Seq results.
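
The difference between the two layouts can be sketched in Python as follows. This is only an illustration of the "one line per passed filter" idea, not the pipeline's code: the function name `expand_row` is hypothetical, the example row is made up, and expanding every combination across all four filter columns is an assumption (the text above only gives the PA example). The sketch also does not reproduce the removal of non-significant results or the addition of the 'both_ATAC' and 'both_mRNA' entries.

```python
from itertools import product

FILTER_COLS = ("ET", "PA", "FC", "TV")  # columns that may hold '|'-combined values


def expand_row(row, filter_cols=FILTER_COLS):
    """Yield one copy of `row` per combination of '|'-separated filter values."""
    options = [row[col].split("|") for col in filter_cols]
    for combo in product(*options):
        expanded = dict(row)                      # copy the non-filter columns as they are
        expanded.update(zip(filter_cols, combo))  # overwrite with a single value per filter
        yield expanded


# Made-up res_simple-style row: one peak passing both the 'all' and 'prom' filters.
simple_row = {"ET": "ATAC", "PA": "all|prom", "FC": "up", "TV": "1.3",
              "COMP": "ctl_vs_kd", "gene_name": "geneA"}

for r in expand_row(simple_row):
    print(r["PA"], r["gene_name"])
# all geneA
# prom geneA
```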

Parameters

  • params.split__threshold_type: Defines whether the threshold cutoff is based on FDR (adjusted p-value) or rank. Options: 'FDR', 'rank'. Default: 'FDR'.
  • params.split__threshold_values: Groovy list defining the threshold cutoff value(s). If params.split__threshold_type = 'rank', entries are ranked from lowest (rank = 1) to highest adjusted p-value and all entries ranked below this value are kept. If params.split__threshold_type = 'FDR', all entries with a -log10(adjusted p-value) above this threshold are kept. E.g., params.split__threshold_values = [ 1.3 ] keeps all entries with an adjusted p-value below 0.05 (i.e., -log10(0.05) = 1.30103). Multiple thresholds can be given, but all must be of the same type (FDR or rank). Default: [ 1.3 ].
  • params.split__peak_assignment: Defines the peak assignment filters to use. See DA_ATAC__saving_detailed_results_tables for options. Default: [ 'all' ].
  • params.split__keep_unique_genes: Whether only unique DA and NDA genes should be kept for downstream analysis. Default: 'TRUE'.
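
To make the two threshold types concrete, here is a small Python sketch of how a single threshold value could be applied to a list of adjusted p-values. It is not taken from the pipeline: the function names are hypothetical, and two details are assumptions. First, the FDR comparison is written as -log10(adjusted p-value) > threshold, which is what the example of 1.3 keeping adjusted p-values below 0.05 implies. Second, the rank cutoff is treated as inclusive (rank <= threshold), with ties broken by input order.

```python
import math


def neg_log10(padj):
    """Return -log10(padj); an adjusted p-value of exactly 0 maps to +infinity."""
    return math.inf if padj == 0 else -math.log10(padj)


def keep_indices(padj_values, threshold_type, threshold):
    """Return the sorted indices of the entries that pass one threshold value."""
    if threshold_type == "FDR":
        # Assumption: keep entries whose -log10(adjusted p-value) exceeds the threshold.
        return [i for i, p in enumerate(padj_values) if neg_log10(p) > threshold]
    if threshold_type == "rank":
        # Rank 1 is the lowest adjusted p-value; the cutoff is assumed to be inclusive.
        order = sorted(range(len(padj_values)), key=lambda i: padj_values[i])
        return sorted(order[:int(threshold)])
    raise ValueError(f"unknown threshold type: {threshold_type!r}")


padj = [0.01, 0.2, 0.049, 0.06]
print(keep_indices(padj, "FDR", 1.3))  # [0, 2]
print(keep_indices(padj, "rank", 3))   # [0, 2, 3]
```

Note that 10^-1.3 is about 0.0501, so a threshold of 1.3 is very slightly more permissive than an adjusted p-value cutoff of exactly 0.05.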

Outputs

  • Gene lists: Processed_Data/2_Differential_Analysis/DA_split__genes_rds/${key}__genes.rds

  • Bed files: Processed_Data/2_Differential_Analysis/DA_split__bed_regions/${key}__regions.bed

  • Res simple:

    • Tables_Individual/2_Differential_Analysis/res_simple/${comparison}__res_simple.{csv,xlsx}
    • Tables_Merged/2_Differential_Analysis/res_simple.{csv,xlsx}
  • Res filter:

    • Tables_Individual/2_Differential_Analysis/res_filter/${comparison}__res_filter.{csv,xlsx}
    • Tables_Merged/2_Differential_Analysis/res_filter.{csv,xlsx}

DA_split__plotting_venn_diagrams

Description

This process takes as input all gene lists made by the previous process for a given comparison and generates Venn diagrams for gene lists that share these keys: PA (DAR Peak Annotation), FC (Fold Change type) and TV (Threshold Value).
Two types of plots are made:

  • proportional two-way Venn diagrams: ATAC-Seq vs mRNA-Seq with FC either up or down
  • fixed-size four-way Venn diagrams: ATAC-Seq vs mRNA-Seq with FC up and down. In these plots, mRNA-Seq data has an orange fill, ATAC-Seq data has a blue fill, up-regulated genes have a purple outline and down-regulated genes have a green outline.
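
The grouping behind the two-way diagrams can be sketched in Python as below. This is a simplified illustration rather than the pipeline's plotting code: the function names and example gene lists are hypothetical, and it assumes that keys follow the ${ET}__${PA}__${FC}__${TV}__${COMP} form and that the ATAC-Seq and mRNA-Seq gene lists to compare share identical PA, FC, TV and COMP components. It only computes the region counts that a two-way Venn diagram would display.

```python
from collections import defaultdict


def group_for_two_way(gene_lists):
    """Group gene sets by (PA, FC, TV, COMP), keeping the ATAC and mRNA lists side by side.

    `gene_lists` maps a key string to an iterable of gene ids.
    """
    groups = defaultdict(dict)
    for key, genes in gene_lists.items():
        et, pa, fc, tv, comp = key.split("__", 4)
        if et in ("ATAC", "mRNA"):  # the 'both*' lists are not compared here
            groups[(pa, fc, tv, comp)][et] = set(genes)
    return dict(groups)


def venn_counts(atac, mrna):
    """Sizes of the three regions of a two-way Venn diagram."""
    return {"ATAC_only": len(atac - mrna),
            "shared": len(atac & mrna),
            "mRNA_only": len(mrna - atac)}


# Made-up gene lists for one comparison.
lists = {
    "ATAC__all__up__1.3__ctl_vs_kd": ["g1", "g2", "g3"],
    "mRNA__all__up__1.3__ctl_vs_kd": ["g2", "g3", "g4", "g5"],
}
for group, by_et in group_for_two_way(lists).items():
    if {"ATAC", "mRNA"} <= by_et.keys():
        print(group, venn_counts(by_et["ATAC"], by_et["mRNA"]))
# ('all', 'up', '1.3', 'ctl_vs_kd') {'ATAC_only': 1, 'shared': 2, 'mRNA_only': 2}
```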

Outputs

  • Two-way Venn diagrams:
    • Figures_Individual/2_Differential_Analysis/Venn_diagrams__two_ways/${key}__venn_up_or_down.pdf
    • Figures_Merged/2_Differential_Analysis/Venn_diagrams__two_ways.pdf

  • Four-way Venn diagrams:
    • Figures_Individual/2_Differential_Analysis/Venn_diagrams__four_ways/${key}__venn_up_and_down.pdf
    • Figures_Merged/2_Differential_Analysis/Venn_diagrams__four_ways.pdf