The MetArea tool uses recognition performance to detect functionally related motifs of transcription factor (TF) binding sites in genomic DNA derived from ChIP-seq experiments. MetArea stands for the Meta Area under curve approach for predicting structural variability and cooperative binding of TFs in ChIP-seq data. A functional relation between two motifs may mean that they are two structurally different motifs of the same TF. If the two motifs belong to different TFs, then not only the motif but the TF together with its motif is substituted in the multiprotein complex (e.g., a heterodimer).
For a motif representing TF binding sites, the recognition accuracy is calculated as the partial area under the PR curve (pAUPRC; Levitsky et al., 2024; Davis, Goadrich, 2006), where PR stands for Precision-Recall. Generalizing the PR curves and pAUPRC values of two separate motifs to a single joint motif reveals how the joint performance of the two motifs compares with that of each participating motif. Testing multiple possible pairs of motifs allows detecting the pairs of motifs that most strongly reinforce each other. The tested motifs can represent either structurally different binding sites for the same TF, or binding sites of different TFs acting as part of a single multi-protein complex. Since the default mode of TF action is mutual cooperative interaction, a single protein complex of multiple TFs is reflected in different portions of peaks as different enriched motifs of various TFs, including the target TF of the ChIP-seq experiment and several partner TFs (Biswas, Narlikar, 2021). Hence, the functional relationship of these distinct motifs is recognized through an increase in recognition performance for pairs of enriched motifs compared to single motifs. Thus, MetArea predicts motifs with related functions in the regulation of gene transcription.
The MetArea algorithm considers a single set of ChIP-seq peaks and starts by computing two PR curves, one for each of the two single motifs to be combined. These motifs can correspond to different structural types of motifs for the same TF, or to motifs of two distinct TFs. The PR curve represents the relationship between the rates of true positive (TP) and false positive (FP) predictions. By default, the background set consists of genomic sequences with G/C content matching that of the sequences from the foreground set, see AntiNoise for details (Raditsa et al., 2024), although other options can be applied for specific data. For each recognition threshold, the Recall (TP rate) is the fraction of sequences from the foreground set (ChIP-seq peaks) containing predicted sites of a motif. The common definition of precision is Precision = TP/(TP+FP) (Davis, Goadrich, 2006), i.e. the ratio of the number of predicted sequences from the foreground set (TP) to the sum of the numbers of predicted sequences in the foreground (TP) and background (FP) sets. We aim to compute a universal PR curve that does not depend on the numbers of sequences in the foreground and background sets (NF and NB). The probabilities of recognizing a sequence in the foreground and background sets are TPR = TP/NF and FPR = FP/NB, respectively. Hence, we correct the definition of precision as follows: the precision is the ratio of the probability of sequence prediction in the foreground set to the sum of the probabilities of sequence prediction in the foreground and background sets, Precision = TPR/(TPR + FPR) = TP/(TP+FP*NF/NB) (Levitsky et al., 2024). The calculation of the partial area under the PR curve (pAUPRC) implies restrictions on both axes, X (Recall) and Y (Precision).
- Recall is restricted by the criterion on the expected recognition rate (ERR) of the motif in the whole-genome set of promoters of protein-coding genes (Tsukanov et al., 2022). As a preliminary step, for each single motif, the promoter sequences of protein-coding genes from the respective whole genome are used to compute -Log10(ERR) values. These ERR values measure the recognition scores of various motifs on a uniform scale (Levitsky et al., 2019; Tsukanov et al., 2022). Hence, we do not take the entire range of the Recall measure from 1 to 0; instead, we consider the portion of the curve respecting the criterion ERR < ERRmax. This criterion means that all peaks that correspond to expected motif frequencies larger than the ERRmax threshold are excluded from the area measure. The recommended range of ERRmax values is from 0.001 to 0.01, the default value is 0.002.
- Precision is restricted by the boundary value of precision for the 'no skill' motif. The expected behaviour of the PR curve for the 'no skill' motif is y(Precision) = 0.5 for any x(Recall), see (Saito, Rehmsmeier, 2015). The 'no skill' model is a random model that is equally likely to predict sequences in the foreground and background sets. This boundary value is constant, Precision0 = 0.5, for any threshold of Recall, since FP and TP values are normalized to the same set size. This restriction is required because the criterion of the functionally related motif pair relies on the ratio of two pAUPRC values, so we remove the boundary value from each pAUPRC value. Hence, instead of Precision values we use their deviation from the boundary value Precision0, {Precision - Precision0}.
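The corrected precision and the two restrictions above can be illustrated with a minimal Python sketch. It is not taken from the MetArea source code: the function names `normalized_precision` and `partial_auprc`, the layout of the curve points as `(ERR, Recall, Precision)` tuples, the handling of the case without any predictions, and the trapezoidal integration between the retained points are all assumptions made for illustration.

```python
def normalized_precision(tp, fp, nf, nb):
    """Precision = TPR / (TPR + FPR), with TPR = TP/NF and FPR = FP/NB.

    Algebraically equal to TP / (TP + FP * NF / NB), so the value does not
    depend on the relative sizes of the foreground and background sets.
    """
    tpr = tp / nf
    fpr = fp / nb
    if tpr + fpr == 0:
        # Assumption of this sketch: with no predictions at all,
        # fall back to the 'no skill' value.
        return 0.5
    return tpr / (tpr + fpr)


def partial_auprc(points, err_max=0.002, precision0=0.5):
    """Partial area under the PR curve, measured above the 'no skill' line.

    points: iterable of (err, recall, precision), one per recognition threshold.
    Only points with err < err_max are kept (restriction on Recall), and
    Precision is replaced by Precision - precision0 (restriction on Precision).
    The trapezoidal rule is an assumption, not the confirmed MetArea method.
    """
    kept = sorted((recall, precision - precision0)
                  for err, recall, precision in points
                  if err < err_max)
    area = 0.0
    for (r0, d0), (r1, d1) in zip(kept, kept[1:]):
        area += (r1 - r0) * (d0 + d1) / 2.0
    return area
```

For instance, with hypothetical counts TP = 50 out of NF = 1000 peaks and FP = 10 out of NB = 5000 background sequences, `normalized_precision(50, 10, 1000, 5000)` evaluates 0.05 / (0.05 + 0.002), the same value as 50 / (50 + 10 * 1000 / 5000).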
Finally, the MetArea algorithm computes the joint PR curve for the combination of the two motifs. The joint Precision and Recall are determined in the same way as those for single motifs; the respective Recall is the fraction of sequences containing predicted sites of at least one of the two motifs. To distinguish significantly similar pairs among the analyzed pairs of PWM motifs, the significance of similarity is estimated for all pairs of PWM motifs by the motif similarity approach from MCOT (Levitsky et al., 2019). The criterion of the common molecular function for the pair of motifs 1 and 2 requires that the performance value pAUPRC_1&2 of the joint motif 1&2 is higher than the performance values of both participating separate motifs, pAUPRC_1 and pAUPRC_2. In other words, this criterion of the functionally related motif pair requires a growth of the common performance: the Ratio of Areas Under Curves (RAUC) must exceed one, RAUC = pAUPRC_1&2 / Max(pAUPRC_1, pAUPRC_2) > 1. MetArea works with ready PR curves of TF binding site motif recognition models, so it can be applied to motifs of either the traditional Position Weight Matrix (PWM) model or an alternative model, e.g. SiteGA.
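The joint Recall and the RAUC criterion can be sketched in Python as follows. This is an illustration under stated assumptions, not the MetArea implementation: the helper names `joint_recall`, `rauc` and `is_functionally_related` are hypothetical, predicted sequences are represented as sets of sequence identifiers, and the error raised for non-positive single-motif areas is a choice of this sketch.

```python
def joint_recall(hits_motif1, hits_motif2, nf):
    """Fraction of the NF foreground sequences with a predicted site of
    at least one of the two motifs (union of the two sets of hit sequences)."""
    return len(set(hits_motif1) | set(hits_motif2)) / nf


def rauc(pauprc_1, pauprc_2, pauprc_joint):
    """RAUC = pAUPRC_1&2 / Max(pAUPRC_1, pAUPRC_2)."""
    best_single = max(pauprc_1, pauprc_2)
    if best_single <= 0:
        # Assumption of this sketch: the ratio is only meaningful when at
        # least one single motif performs above the 'no skill' level.
        raise ValueError("both single-motif pAUPRC values are non-positive")
    return pauprc_joint / best_single


def is_functionally_related(pauprc_1, pauprc_2, pauprc_joint):
    """Criterion of the functionally related motif pair: RAUC > 1."""
    return rauc(pauprc_1, pauprc_2, pauprc_joint) > 1.0
```

With hypothetical values pAUPRC_1 = 0.03, pAUPRC_2 = 0.02 and pAUPRC_1&2 = 0.036, the joint motif outperforms the better single motif, so the pair would be marked as functionally related.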
- FASTA file for the set of foreground sequences (peaks). This set represents the top-scored peaks derived from a ChIP-seq experiment (for example, the 1000 peaks of the best quality are the default recommended option).
- FASTA file for the set of background sequences. This set consists of sequences extracted from the whole genome of the respective species. Use AntiNoise to generate background sequences matching the sequences from the foreground dataset in the content of A/T nucleotides. Use either of the default options, genomic or promoters. These options mean extraction of background sequences from the whole genome or only from the promoter regions of all protein-coding genes, respectively. Do not use shuffled sequences or any other sequences generated by Markov models of various orders as the background set, since this drastically reduces the quality of the output results.
- Two distinct motifs, e.g. Position Frequency Matrices (PFM) representing PWM motifs.
- Two lists of recognition thresholds for the two motifs and the respective ERR values. These lists are computed from the recognition score distributions for the set of whole-genome promoter sequences of the respective species, as described earlier for MCOT (Levitsky et al., 2019) and SiteGA (Tsukanov et al., 2022).
- list of the performance estimates pAUPRC for all single motifs
- list of all pairs of single motifs with the performance estimates pAUPRC of the joint motifs and the marks of functional relation, RAUC > 1.
- estimation of motif similarities for all pairs of PWM motifs
- One PWM motif vs. another PWM motif.
- Several PWM motifs at once (list of N PWM motifs from de novo motif search), all possible pairwise combinations {N * (N - 1)/2} are tested.
- One traditional motif PWM vs. one alternative motif SiteGA. The SiteGA model was chosen as the most methodologically different from the traditional PWM.
- One PWM motif vs. all motifs for known TFs from a given collection from Hocomoco v12 or JASPAR2024.
- Best-performing N motifs from the collection of PWM motifs for known TFs of a certain taxon or species. The performances of all motifs from a given collection are estimated by pAUPRC, then all possible {N * (N - 1)/2} pairwise combinations of the top-scored N motifs are tested. Available options include the collections of 1420/1142 motifs for 942/713 human/mouse TFs from Hocomoco v12 (Vorontsov et al., 2024), and of 556/151 motifs for 555/148 plant/insect TFs from JASPAR2024 (Rauluseviciute et al., 2024).
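The number of tested pairs, {N * (N - 1)/2}, in the modes with several motifs can be checked with a short Python sketch. It is only an illustration: the helper `motif_pairs` is hypothetical, the motif labels are placeholders, and the actual programs may traverse the pairs in a different order.

```python
from itertools import combinations


def motif_pairs(motif_names):
    """All unordered pairs of distinct motifs, N * (N - 1) / 2 in total."""
    return list(combinations(motif_names, 2))


# Placeholder labels for seven motifs, e.g. from a de novo motif search.
motifs = [f"motif{i}" for i in range(1, 8)]
pairs = motif_pairs(motifs)

# The count agrees with N * (N - 1) / 2 for N = 7.
n = len(motifs)
assert len(pairs) == n * (n - 1) // 2
```

Each pair would then be evaluated by the joint pAUPRC and the RAUC criterion described above.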
- One PWM motif vs. another PWM motif, pauc_forback_2motifs_only.cpp
- input FASTA file, foreground set of sequences, peaks, example - top 1000 peaks for mouse BHLHA15 TF from GTRD
- input FASTA file, background set of sequences, A/T-matched random pieces of genomic DNA of the same length, extracted from promoters by AntiNoise
- input binary file for the first motif (PWM1), example. This binary file stores the PFM, the PWM, the motif length, the total number of recognition thresholds, and the full list of thresholds and ERR values. The binary files for PWM models are generated by the PWM thresholds selection program from MCOT.
- input binary file for the second motif (PWM2), example. This binary file stores the PFM, the PWM, the motif length, the total number of recognition thresholds, and the full list of thresholds and ERR values. The binary files for PWM models are generated by the PWM thresholds selection program from MCOT.
- double ERRmax threshold, values in the range from 0.001 to 0.01 are recommended, default value 0.002. The ERRmax (maximal Expected Recognition Rate) value is the mildest threshold used to compute the partial area under the PR curve. This ERRmax value defines the range of recognition thresholds taken from the table 'Threshold vs. ERR'.
- output file, pAUPRC values for the motifs PWM1 and PWM2 and for the joint motif PWM1&PWM2, the value of the Ratio of Areas Under Curves (RAUC), and the significance of similarity between the two motifs, example (pAUPRC values for PWM1, PWM2 and joint PWM1&PWM2 models)
- output file, PR curve for the motif PWM1, example
- output file, PR curve for the motif PWM2, example
- output file, PR curve for the joint motif PWM1&PWM2, example
- Several PWM motifs at once (list of PWM motifs from de novo motif search), pauc_forback_2motifs0.cpp
- input FASTA file, foreground set of sequences, peaks, example - top 1000 peaks for mouse BHLHA15 TF from GTRD
- input FASTA file, background set of sequences, A/T-matched random pieces of genomic DNA of the same length, extracted from promoters by AntiNoise
- input binary file for the collection of PWM motifs. This binary file is generated beforehand by the PWM thresholds selection program from MCOT
- int value, the number of motifs, N
- double ERRmax threshold, values in the range from 0.001 to 0.01 are recommended, default value 0.002. The ERRmax (maximal Expected Recognition Rate) value is the mildest threshold used to compute the partial area under the PR curve. This ERRmax value defines the range of recognition thresholds taken from the table 'Threshold vs. ERR'.
- output file, the matrix of all (N*(N-1)) joint pAUPRC values, all (N*(N-1)) pairwise values of the Ratio of Areas Under Curves (RAUC), and all (N*(N-1)) significances of pairwise similarity for all tested pairs formed by the N motifs, example_of output_matrices
- output file pAUPRC, list of joint pAUPRC values for all pairs of motifs formed by top-scored N motifs, example list of joint pAUPRC values for all pairs of motifs including the corresponding list of the significance of similarity of motifs in pairs
- output file log1, list of pAUPRC values for all single motifs, example list of pAUPRC values for single motifs
- output file log2, list of joint pAUPRC values for all pairs of motifs formed by N motifs, example list of joint pAUPRC values for all pairs of motifs including the corresponding list of the significance of similarity of motifs in pairs. This output file is concordant with the out file of p-value out_pval of the respective program mcot_denovo.cpp from MCOT
- output file name, PR curves for the single and joint motifs, example, PR curve for pair of models
- One traditional motif PWM vs. one alternative motif SiteGA, pauc_forback_pwm_sga_only.cpp
- input FASTA file, foreground set of sequences, peaks, example - top 1000 peaks for mouse BHLHA15 TF from GTRD
- input FASTA file, background set of sequences, A/T-matched random pieces of genomic DNA of the same length, extracted from promoters by AntiNoise
- input binary file for the PWM motif, example. This binary file stores the PFM, the PWM, the motif length, the total number of recognition thresholds, and the full list of thresholds and ERR values. The binary files for PWM models are generated by the PWM thresholds selection program from MCOT.
- input binary file for the SiteGA motif, example. This binary file stores the matrix of the SiteGA model, the motif length, the total number of recognition thresholds, and the full list of thresholds and ERR values. The binary files for SiteGA models are generated by the SiteGA thresholds selection program from SiteGA.
- double ERRmax threshold, values in the range from 0.001 to 0.01 are recommended, default value 0.002. The ERRmax (maximal Expected Recognition Rate) value is the mildest threshold used to compute the partial area under the PR curve. This ERRmax value defines the range of recognition thresholds taken from the table 'Threshold vs. ERR'.
- output file, pAUPRC values for the PWM and SiteGA motifs and for the joint motif PWM&SiteGA, example, pAUPRC values for the single PWM, SiteGA and joint PWM&SiteGA motif
- output file, PR curve for the motif PWM, example, PR curve PWM model
- output file, PR curve for the motif SiteGA, example, PR curve SiteGA model
- output file, PR curve for the joint PWM&SiteGA, example, PR curve PWM&SiteGA model
- One PWM motif vs. all motifs for known TFs from a given collection, pauc_forback_anc_lib.cpp
- input FASTA file, foreground set of sequences, peaks, example - top 1000 peaks for mouse BHLHA15 TF from GTRD
- input FASTA file, background set of sequences, A/T-matched random pieces of genomic DNA of the same length, extracted from promoters by AntiNoise
- input binary file for the PWM motif, example. This binary file stores the PFM, the PWM, the motif length, the total number of recognition thresholds, and the full list of thresholds and ERR values. The binary files for PWM models are generated by the PWM thresholds selection program from MCOT.
- input binary file for the collection of PWM motifs, available options: two collections of human and murine motifs from Hocomoco v12, see archive files in partners folder, and two collections for plant and insect TFs from JASPAR2024, JASPAR2024_insects and JASPAR2024_plants. For all motifs in each collection, these binary files store the PFM, the PWM, the motif length, the total number of recognition thresholds, and the full list of thresholds and ERR values. The binary files for PWM models are generated by the PWM thresholds selection program from MCOT.
- double ERRmax threshold, values in the range from 0.001 to 0.01 are recommended, default value 0.002. The ERRmax (maximal Expected Recognition Rate) value is the mildest threshold used to compute the partial area under the PR curve. This ERRmax value defines the range of recognition thresholds taken from the table 'Threshold vs. ERR'.
- output file, list of joint pAUPRC values satisfying the criterion of the Ratio of Areas Under Curves (RAUC) > 1, example list of joint pAUPRC values for the functionally related pairs of motifs, and the corresponding list of the significance of similarity of motifs in pairs
- output file, list of joint pAUPRC values for all pairs of motifs, example list of joint pAUPRC values for all pairs of motifs including the corresponding list of the significance of similarity of motifs in pairs
- Best performed motifs from the collection of PWM motifs for known TFs for certain taxon or species, pauc_forback_2motifs.cpp
- input FASTA file, foreground set of sequences, peaks, example - top 1000 peaks for mouse BHLHA15 TF from GTRD
- input FASTA file, background set of sequences, A/T-matched random pieces of genomic DNA of the same length, extracted from promoters by AntiNoise
- input binary file for the collection of PWM motifs, available options: two collections of human and murine motifs from Hocomoco v12, see archive files in partners folder, and two collections for plant and insect TFs from JASPAR2024, JASPAR2024_insects and JASPAR2024_plants. For all motifs in each collection, these binary files store the PFM, the PWM, the motif length, the total number of recognition thresholds, and the full list of thresholds and ERR values. The binary files for PWM models are generated by the PWM thresholds selection program from MCOT.
- input binary file for the pairwise similarity between the motifs from the collection of PWM motifs, examples for the human/mouse collections from Hocomoco v12, and for the plants/insects collections from JASPAR2024 are provided in partners folder, e.g. for data for insects motifs
- int value, the number of M top-scored motifs to check pairwise functional relation between motifs
- double ERRmax threshold, values in the range from 0.001 to 0.01 are recommended, default value 0.002. The ERRmax (maximal Expected Recognition Rate) value is the mildest threshold used to compute the partial area under the PR curve. This ERRmax value defines the range of recognition thresholds taken from the table 'Threshold vs. ERR'.
- output file, the matrix of all (M*(M-1)) joint pAUPRC values, all (M*(M-1)) pairwise values of the Ratio of Areas Under Curves (RAUC), and all (M*(M-1)) significances of pairwise similarity for all tested pairs formed by the selected M top-scored motifs, example_of output_matrices
- output file pAUPRC, list of joint pAUPRC values for all pairs of motifs formed by top-scored M motifs, these pairs satisfy the relation criterion RAUC > 1, example list of joint pAUPRC values for the functionally related motif pairs of motifs including the corresponding list of the significance of similarity of motifs in pairs
- output file log1, list of pAUPRC values for all single motifs from the collection, example list of pAUPRC values for single motifs
- output file log2, list of joint pAUPRC values for all pairs of motifs formed by top-scored M motifs, example list of joint pAUPRC values for all pairs of motifs including the corresponding list of the significance of similarity of motifs in pairs
- One PWM motif vs. another PWM motif, pauc_forback_2motifs_only.cpp
com_line_pwm_pwm command line for two PWM models BHA15.H12CORE.0.P.B.pcm and BHA15.H12CORE.1.SM.B.pcm representing two structurally distinct motif types for murine BHLHA15 TF from Hocomoco v12, and the ChIP-seq peaks dataset PEAKS039234 for this TF
- Several PWM motifs at once (list of PWM motifs from de novo motif search), pauc_forback_2motifs0.cpp
com_line_de_novo command line example for seven PWM models {motif1.pfm ... motif7.pfm} representing the list of motifs derived from de novo motif search for the ChIP-seq peaks dataset PEAKS039234 for murine BHLHA15 TF. This command line applies the perl script den.pl
- One traditional motif PWM vs. one alternative motif SiteGA, pauc_forback_pwm_sga_only.cpp
com_line_pwm_sga command line for PWM and SiteGA motifs for the ChIP-seq peaks dataset PEAKS039234 for murine BHLHA15 TF, GSE86289, pancreas of adult mice. Both motifs were derived from this dataset by de novo motif search.
- One PWM motif vs. all motifs for known TFs from a given collection, pauc_forback_anc_lib.cpp
com_line_anc_lib command line for PWM motif derived by de novo motif search for ChIP-seq peaks dataset PEAKS039234 for murine BHLHA15 TF and the collection of motifs for murine TFs from Hocomoco v12
- Best performed motifs from the collection of PWM motifs for known TFs for certain taxon or species, pauc_forback_2motifs.cpp
com_line_lib_lib command line for the ChIP-seq peaks dataset PEAKS039234 for murine BHLHA15 TF to test top-scoring pairwise combinations of motifs from the Hocomoco v12 collection for known murine TFs.
- Generation of the binary file for the PWM model from the text files of the model matrices PFM/PWM and the table (Thresholds of model vs. ERRs), pwm_pwm_txt_bin
- Generation of the binary files for the SiteGA model from the text files of the model matrix and the table (Thresholds of model vs. ERRs), pwm_sga_txt_bin