Skip to content

DEIB-GECO/prior_tree_models_repo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Prior knowledge-guided tree-based models

Integration of prior domain knowledge into tree-based models.

Description

We developed a robust framework that improves tree-based models for high-dimensional, noisy data by integrating feature selection, tree construction, and weighting with prior knowledge, combining data-driven insights and established domain understanding.

Application Use Case

We compared the performance of the standard tree-based models and our proposed approaches on an application use case concerning the cancer-related subtype prediction of patients based on gene expression data. The use case concerns the classification of Breast Invasive Carcinoma (BRCA) patients in their corresponding cancer subtypes. We also performed two distinct sensitivity analyses to evaluate the impact of incorporating prior knowledge into tree-based models. We used a controlled dataset with limited correlation among the features for these analyses, considering publicly available RNA-seq profiles of Kidney Renal Clear Cell Carcinoma patients from The Cancer Genome Atlas (TCGA) project. The preprocessed dataset is available here, along with the list of features considered in the controlled dataset. For all datasets analysed, the data are preprocessed as described in the notebooks found here.

Implementation

To implement such tree-based models, we developed PkTree, a Python package that implements the proposed modifications. More information on the usage of the PkTree package is available here.

First, build a dedicated conda environment:

conda create -n env_pktree python=3.9 
conda activate env_pktree

Install the PkTree package:

pip install pktree

Lastly, install the required packages from requirements.txt

Prior domain knowledge

In these experiments, we used the score of biological knowledge described here and available here.

Additional Information

All the code is available here. To replicate the experiments run the following scripts:

  • BRCA_dt.py for experiments with the Decision Tree model
  • BRCA_rf_parallel.py for experiments with the Random Forest model

The code to replicate the two sensitivity analyses is available here. All the results from the experiments we performed can be found here.