Augmenting molecular structure representation learning using semantic biomedical knowledge
- Python ≥ 3.8.0;
requirements.txt
contains the Python packages requirements.
The data used are made available through the following box folder, where you can find:
data/
contains pretraining dataset, the Knowledge Graphs created with the relative dictionary of entities and their ids, the classification datasets (datasets_valid_and_splits/
contais for each assay the tabular dataset with Smiles string, MACCS key, chemicals name and labels, and the training, validataion and test index for the 5 random runs), andtsne_2d_embeddings_all_chemicals_37tox21_emb.xlsx
, that is a dataframe containing chemical names, MACCS keys, physical properties and the 2D t-SNE projections for all the n = 8541 chemicals that belong to the set of the 37 Tox21 assays considered;ckpt/
contains the pretrained GNN molecule encoder.
- Machine learning: baseline ML models can be trained by running
ML.py
. The results will be written inresults/ML
with a directory for each random runs (seed).
python ML.py
- Finetune MolCLR: MolCLR can be finetuned by running
MolCLR/finetune.py
. The results will be written inresults/graph_structure_comptoxAI
with a directory for each random runs (seed).
python MolCLR/finetune.py
- Semantic GNN: Semantic GNN model can be trained by running
semantic.py
. The results will be written inresults/semantic_gat
with a directory for each random runs (seed).
python semantic.py
- MolCLR+Sem: MolCLR+Sem model can be trained by running
semantic_and_MolCLR.py
. The results will be written inresults/semantic_and_graph
with a directory for each random runs (seed).
python semantic_and_MolCLR.py
Explainability with GNNExplainer can be obtained for positive chemicals by running the explain.py
script. The results will be written in results/gnn_xai
with a directory for each random runs.
python explain.py
The evaluation.py
script contains code for:
- compute pretrained embeddings for all the chemcials involved in the Tox21 assays considered, project them in 2D with t-SNE and colour them according to chemical and physical properties of the molecules (extracted from ComptoxAI or through puchem API);
- process the classification results by computing the mean classification metrics for each model and for each assay, to create a dataframe than can be used to compute the violin plot with the mean results computedall the assay and the heatmap with the single assay results.
- process the xai results, by thresholding the number of edges to keep, and create the images with the molecule graph and most important subgraph identified a specific compound in input.
python evaluate.py