Automated machine learning for rare variant analysis of response to antiretroviral therapy in persons living with HIV
Submitted 2020-07-XX
Motivation: Rare variants pose several challenges for genome-wide association studies (GWAS), including a lack of statistical power and the inability to model epistasis. In this paper we propose and implement a new approach that uses automated machine learning (AutoML) to generate regression pipelines for binned rare variant data. This approach is applied to data from the AIDS Clinical Trials Group (ACTG).
Results: Statistically significant pipelines were generated for all three tested phenotypes. Permutation importance analysis highlighted several genes that were important for accurate prediction, many of which had prior associations with the phenotype.
Availability: TPOT (Tree-based Pipeline Optimization Tool) was used for this analysis and is freely available: https://epistasislab.github.io/tpot/
Contact: jhmoore@upenn.edu
A simple command line script to select and format data from the original files
A script to convert gene set information from GMT format into a format compatible with the TPOT Feature Set Selector. It also plots the number of bins and the number of genes in each bin (used for figure 1 in the paper).
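A minimal sketch of this conversion, assuming the standard GMT layout (set name, description, then member genes, all tab-separated); the output columns shown here ('Subset', 'Size', ';'-separated 'Features') are an assumption and should be checked against the TPOT Feature Set Selector documentation:

```python
# Sketch: convert a GMT gene set file into a feature set CSV for TPOT.
# File names and the output column layout are placeholders, not the exact
# files used in this repository.
import csv

with open("gene_sets.gmt") as gmt, open("feature_sets.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Subset", "Size", "Features"])
    for line in gmt:
        fields = line.rstrip("\n").split("\t")
        name, _description, genes = fields[0], fields[1], fields[2:]
        writer.writerow([name, len(genes), ";".join(genes)])
```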
A command line script to regress the phenotypes against the covariates, saving and plotting the residuals for use in further analysis
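A minimal sketch of this step, assuming phenotype and covariates sit in a single CSV; the file name and covariate columns are hypothetical placeholders:

```python
# Sketch: adjust the phenotype for covariates and keep the residuals.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("phenotypes_covariates.csv")            # hypothetical input file
covariates = sm.add_constant(data[["age", "sex", "pc1"]])   # hypothetical covariate columns
model = sm.OLS(data["phenotype"], covariates, missing="drop").fit()
residuals = model.resid                                      # adjusted phenotype used downstream
residuals.to_csv("phenotype_residuals.csv", header=["residual"])
```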
A command line script that does the following:
- Load the rare variant data
- Load the residuals from regression with covariates
- Optimize and save a TPOT pipeline based on the 'FeatureSetSelector-Transformer-Regressor' template (a minimal sketch follows this list)
- Score the optimized pipeline and plot the regression results.
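A minimal sketch of the optimization step, assuming TPOT's default regression configuration extended with a FeatureSetSelector entry; the file names, the number of candidate feature sets, and the scoring metric are placeholders rather than the exact settings used in the paper:

```python
# Sketch: optimize a pipeline with the FeatureSetSelector-Transformer-Regressor template.
import pandas as pd
from tpot import TPOTRegressor
from tpot.config import regressor_config_dict

X = pd.read_csv("binned_rare_variants.csv")              # columns are variant bins (genes), named to match feature_sets.csv
y = pd.read_csv("phenotype_residuals.csv")["residual"]   # residuals from the covariate regression

config = dict(regressor_config_dict)
config["tpot.builtins.FeatureSetSelector"] = {
    "subset_list": ["feature_sets.csv"],   # output of the GMT conversion step
    "sel_subset": list(range(100)),        # indices of candidate feature sets (placeholder count)
}

tpot = TPOTRegressor(template="FeatureSetSelector-Transformer-Regressor",
                     config_dict=config, generations=100, population_size=100,
                     scoring="r2", random_state=42, verbosity=2)
tpot.fit(X, y)
tpot.export("optimized_pipeline.py")       # save the best pipeline as a Python script
print(tpot.score(X, y))
```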
The same as the script above, with an additional line that permutes the phenotype data before running the analysis
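A minimal sketch of that extra step, assuming the residual phenotype is shuffled before the TPOT run (the seed shown is a placeholder; each permuted run would use a different one):

```python
# Sketch: permute the residual phenotype to break any real bin-phenotype association.
import numpy as np
import pandas as pd

y = pd.read_csv("phenotype_residuals.csv")["residual"]   # residuals from the covariate regression
rng = np.random.default_rng(42)                          # placeholder seed; varies per permuted run
y_permuted = pd.Series(rng.permutation(y.values), index=y.index)
```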
- Load and score the optimized pipelines (100 replicate runs and 100 permuted runs). Scoring works the same as in run_tpot_exome_residuals.py.
- Save pipeline information (structure, selected feature set) and scores (used in figure 3).
- Generate feature importances 100 times for the top ten (non-permuted) pipelines and save the results (used in figure 4; see the sketch below).
Save a copy of the feature importances that excludes rows (bins) with all-zero or all-missing feature importances.
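A minimal sketch of the feature importance step, continuing from the TPOT sketch above and using scikit-learn's permutation_importance; the number of repeats and the scoring metric are assumptions:

```python
# Sketch: permutation importances for a fitted pipeline, one value per variant bin (gene).
from sklearn.inspection import permutation_importance

fitted = tpot.fitted_pipeline_             # the sklearn Pipeline produced by a TPOT run
result = permutation_importance(fitted, X, y, n_repeats=100,
                                scoring="r2", random_state=0)
importances = result.importances_mean      # mean importance per bin across repeats
```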
Command line script to generate figure 2 from the paper:
- Plot the distribution of scores for replications and permutations.
- Perform a t-test of the null hypothesis that the mean score is the same for the original and permuted data (see the sketch below).
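A minimal sketch of the test, assuming the replicate and permuted scores have been saved as plain text files (the file names are placeholders); Welch's unequal-variance t-test is shown, which may differ from the exact variant used in the paper:

```python
# Sketch: compare replicate scores against permuted-run scores.
import numpy as np
from scipy import stats

replicate_scores = np.loadtxt("replicate_scores.txt")   # hypothetical file of 100 replicate scores
permuted_scores = np.loadtxt("permuted_scores.txt")     # hypothetical file of 100 permuted-run scores
t_stat, p_value = stats.ttest_ind(replicate_scores, permuted_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```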
Command line script to plot pipeline diagrams of the steps used in the top 10 pipelines (figure 3 in the paper).
Command line script to plot feature importances (figure 4 in the paper):
- Take the top 10 scoring regression pipelines
- Rank the variant bins (which correspond to genes) by the maximum feature importance of that bin in any one of the top 10 pipelines (see the sketch after this list)
- Plot a histogram of feature importances among the 100 replicates
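A minimal sketch of the ranking and plotting logic, assuming the saved importances are in long format with one row per pipeline, bin, and repeat; the file and column names are placeholders:

```python
# Sketch: rank bins by maximum importance across the top 10 pipelines,
# then plot a histogram of importances for the top-ranked bin.
import pandas as pd
import matplotlib.pyplot as plt

imp = pd.read_csv("feature_importances.csv")   # assumed columns: pipeline, bin, importance

ranking = imp.groupby("bin")["importance"].max().sort_values(ascending=False)
top_bin = ranking.index[0]

plt.hist(imp.loc[imp["bin"] == top_bin, "importance"], bins=20)
plt.xlabel("Permutation importance")
plt.title(top_bin)
plt.savefig("top_bin_importance_hist.png")
```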
These are PBS files for submitting jobs to Penn State's ACI-B computing infrastructure
A PBS job script for creating a conda environment with all of the required dependencies.
A PBS job script to run the 'run_tpot_exome_residuals.py' script multiple times in parallel.
Same as run_replicates, but calling the permutation script instead.
Supplemental data referred to in the paper
- A figure showing results of testing with different TPOT settings
- Figures showing the phenotype and residual distributions
- Tables of feature importance for all bins from the top 10 pipelines
This folder contains two files for each phenotype (one for the original data and one for the permuted phenotype). Each file has 100 rows, one per TPOT run, each containing information on the structure and score of the optimized pipeline from that run. These files are used to generate the pipeline structure diagram (figure 3).