This repository is the official implementation of the ECIR 2022 paper Automation of Citation Screening for Systematic Literature Reviews Using Neural Networks: A Replicability Study.
If you find our code useful, please cite our paper:
@inproceedings{kusa2022automation,
title={Automation of Citation Screening for Systematic Literature Reviews Using Neural Networks: A Replicability Study},
author={Kusa, Wojciech and Hanbury, Allan and Knoth, Petr},
booktitle={European Conference on Information Retrieval},
pages={584--598},
year={2022},
organization={Springer}
}
Tested with Python 3.8.
Install requirements with pip:
$ pip install -r requirements.txt
Original Clinical review datasets can be downloaded from here. Use the src/data/prepare_clinical_data.py script to prepare the datasets. Make sure that the variable repository_path is set to the root of the bwallace/citation-screening/ repository.
Original Drug review datasets can be downloaded from here.
This dataset does not contain abstract and title information, so that data needs to be downloaded from PubMed using each article's PubMed ID. Place the epc-ir.clean.tsv input file in the data/external/drug/ folder and run the src/data/prepare_drug_data.py script.
Original SWIFT review datasets can be downloaded from here.
- OHAT datasets (PFOA/PFOS, Bisphenol A (BPA), Transgenerational and Fluoride and neurotoxicity) are stored as four sheets in one Excel file.
- CAMRADES dataset (Neuropathic pain) is stored as a separate Excel file.
Fluoride and neurotoxicity and Neuropathic pain already contain title and abstract data, so the only preparation step needed is converting the Label column into a common format.
The other datasets consist only of PubMed IDs and assigned labels, so it is necessary to download abstract and title data using biopython.
The src/data/prepare_swift_data.py script accepts .tsv files, so you need to convert each dataset into a separate .tsv file and place it in the data/external/SWIFT/ folder.
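The Excel-to-TSV conversion described above could be done along the following lines. This is only a sketch: the helper names safe_name and excel_to_tsv are hypothetical, pandas (plus an Excel engine such as openpyxl) is assumed to be installed, and the file names that src/data/prepare_swift_data.py actually expects are not stated here, so check the script before relying on the generated names.

```python
import re
from pathlib import Path


def safe_name(sheet_name: str) -> str:
    """Turn a sheet name such as 'PFOA/PFOS' into a file-system-safe stem."""
    # Replace every run of non-alphanumeric characters with one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", sheet_name).strip("_")


def excel_to_tsv(xlsx_path: str, out_dir: str = "data/external/SWIFT") -> list:
    """Write every sheet of xlsx_path to its own .tsv file in out_dir."""
    import pandas as pd  # imported lazily; reading .xlsx also needs openpyxl

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # sheet_name=None makes read_excel return a dict {sheet name: DataFrame}.
    for name, frame in pd.read_excel(xlsx_path, sheet_name=None).items():
        target = out / f"{safe_name(name)}.tsv"
        frame.to_csv(target, sep="\t", index=False)
        written.append(target)
    return written
```

Sheet names are sanitized because a name like PFOA/PFOS contains a path separator and cannot be used as a file name directly.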
For the Drug and SWIFT datasets, in order to download documents from PubMed, you need to set the Entrez.email variable to your email address.
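For orientation, fetching titles and abstracts by PubMed ID with biopython might look roughly like the sketch below. It is not necessarily how the preparation scripts in this repository do it: the function names chunks and fetch_titles_abstracts and the batch size of 200 are assumptions made for illustration.

```python
def chunks(items, size):
    """Yield consecutive slices of items with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_titles_abstracts(pubmed_ids, email, batch_size=200):
    """Return {PMID: (title, abstract)} for the given PubMed IDs."""
    from Bio import Entrez, Medline  # imported lazily; requires biopython

    Entrez.email = email  # contact address sent to NCBI with each request
    results = {}
    for batch in chunks([str(pmid) for pmid in pubmed_ids], batch_size):
        handle = Entrez.efetch(db="pubmed", id=",".join(batch),
                               rettype="medline", retmode="text")
        for record in Medline.parse(handle):
            # "TI" holds the title and "AB" the abstract; either may be absent.
            results[record.get("PMID", "")] = (record.get("TI", ""),
                                               record.get("AB", ""))
        handle.close()
    return results
```

Requesting IDs in batches keeps the number of calls to NCBI low; records without an abstract come back with an empty string rather than raising an error.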
Detailed results are stored in the reports/ directory:
- results-document_features.csv contains detailed results of input document feature influence for all models and datasets.
- results-precision_at_95recall.csv contains detailed precision@95% recall results for all models and datasets.
- results-time.csv contains training time measurement results for all models and datasets.
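To make the precision@95% recall metric concrete, here is a minimal sketch of one common way to compute it from binary labels and model scores. The function name precision_at_recall is hypothetical, ties between scores are broken arbitrarily, and the exact procedure used to produce the reported results may differ.

```python
def precision_at_recall(labels, scores, target_recall=0.95):
    """Precision at the shortest ranked prefix reaching target_recall.

    labels: iterable of 0/1 relevance labels (1 = included document)
    scores: iterable of model scores, higher means more likely relevant
    """
    labels = list(labels)
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank documents from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        found += label
        if found / total_pos >= target_recall:
            return found / rank  # relevant documents among the top `rank`
    return found / len(ranked)  # only reached when target_recall > 1
```

For example, with labels [1, 0, 1, 0] and scores [0.9, 0.8, 0.7, 0.1], both relevant documents are only covered after the third ranked document, so the precision at 95% recall is 2/3.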
In order to recreate the figures, run the Jupyter notebook notebooks/plotting.ipynb.
In order to calculate dataset statistics, run the src/data/dataset_statistics.py script.