readbiomed-ncbi-pathogen-dataset-generation

This package has been used to generate datasets with information about pathogens and the MEDLINE citations associated with them, using resources from NCBI (National Center for Biotechnology Information).

This package has code to generate pathogen information in XML format and to build corpora from those XML files for taxonomic pathogens (e.g. bacteria and viruses), PrPSc prions and toxins.

An example of a generated data set is available here.

Installation

The package has been tested with Java 11 and Maven 3.6.3.

To install it, clone this GitHub repository, move into the local cloned directory and run mvn install.
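The installation steps can be sketched as a small shell script. This is only a sketch under stated assumptions: the clone URL is inferred from the repository name (READ-BioMed/readbiomed-ncbi-pathogen-dataset-generation), and the helper functions require_tool and install_package are hypothetical names introduced here, not part of the package.

```shell
#!/bin/sh
set -eu

# Assumed clone URL, inferred from the repository name on GitHub.
REPO_URL="https://github.com/READ-BioMed/readbiomed-ncbi-pathogen-dataset-generation.git"

# require_tool NAME: fail with a readable message if NAME is not on PATH.
# (Hypothetical helper, not part of the package.)
require_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing required tool: $1" >&2
        return 1
    }
}

# install_package: clone the repository and install it with Maven.
# (Hypothetical helper; it only wraps the steps described above.)
install_package() {
    require_tool git
    require_tool java
    require_tool mvn
    git clone "$REPO_URL"
    cd readbiomed-ncbi-pathogen-dataset-generation
    mvn install
}

# Usage (needs network access):
#   install_package
```

The script only defines the functions, so nothing is cloned until install_package is called explicitly.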

Data sets generation

In the following sections, the generation of data sets for the different pathogens is explained.

Taxonomic pathogens

Creating the files for the taxonomic pathogens from NCBI resources takes two steps, both run from the cloned folder. In the first step, a set of XML files is generated from a list of pathogens in a text file, one pathogen per line. An output folder needs to be specified as well.

mvn exec:java -Dexec.mainClass="readbiomed.pathogens.dataset.NCBITaxonomy.EntrezTaxonomyDocuments" -Dexec.args="[File_with_list_of_pathogens] [Output_folder]"

In the second step, using the files generated in the previous step as input, citations are collected from MEDLINE and placed in the output folder.

mvn exec:java -Dexec.mainClass="readbiomed.pathogens.dataset.NCBITaxonomy.BuildDataset" -Dexec.args="[Input_folder] [Output_folder]"
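The two steps above can be chained in a short script. The sketch below makes several assumptions that are not stated in this README: the folder names under ./pathogen-data are hypothetical, the two pathogen names are placeholders (whether the tool expects scientific names or NCBI Taxonomy identifiers is not specified here), and creating the output folders in advance is a precaution rather than a documented requirement. By default it only prints the Maven commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
set -eu

# Hypothetical locations; adjust to your setup. Avoid spaces in these paths,
# because exec.args is parsed as a space-separated argument list.
WORK_DIR="${WORK_DIR:-./pathogen-data}"
LIST_FILE="$WORK_DIR/pathogens.txt"
XML_DIR="$WORK_DIR/taxonomy-xml"
CORPUS_DIR="$WORK_DIR/taxonomy-corpus"

mkdir -p "$XML_DIR" "$CORPUS_DIR"

# One pathogen per line. These entries are placeholders (assumption).
printf '%s\n' "Escherichia coli" "Zika virus" > "$LIST_FILE"

# run CMD...: print the command when DRY_RUN=1 (default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: generate the XML files from the pathogen list.
run mvn exec:java \
    -Dexec.mainClass="readbiomed.pathogens.dataset.NCBITaxonomy.EntrezTaxonomyDocuments" \
    -Dexec.args="$LIST_FILE $XML_DIR"

# Step 2: collect MEDLINE citations, using the XML files from step 1 as input.
run mvn exec:java \
    -Dexec.mainClass="readbiomed.pathogens.dataset.NCBITaxonomy.BuildDataset" \
    -Dexec.args="$XML_DIR $CORPUS_DIR"
```

Reviewing the printed commands before running them with DRY_RUN=0 helps confirm that the output folder of the first step is the input folder of the second.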

PrPSc prions

Creating the files for the PrPSc prions from NCBI resources takes two steps, both run from the cloned folder. In the first step, a set of XML files is generated in the specified output folder, based on a predefined list of prions.

mvn exec:java -Dexec.mainClass="readbiomed.pathogens.dataset.PrPSc.PrPScDocuments" -Dexec.args="[Output_folder]"

In the second step, the XML files are used to recover MEDLINE citations. The XML files are in the input folder, while the MEDLINE citations will be stored in the specified output folder.

mvn exec:java -Dexec.mainClass="readbiomed.pathogens.dataset.PrPSc.PrPScBuildDataset" -Dexec.args="[Input_folder] [Output_folder]"

Toxins

Creating the files for the toxins from NCBI resources takes two steps, both run from the cloned folder. In the first step, a set of XML files is generated in the specified output folder, based on a predefined list of toxins.

mvn exec:java -Dexec.mainClass="readbiomed.pathogens.dataset.toxins.ToxinDocuments" -Dexec.args="[Output_folder]"

In the second step, the XML files are used to recover MEDLINE citations. The XML files are in the input folder, while the MEDLINE citations will be stored in the specified output folder.

mvn exec:java -Dexec.mainClass="readbiomed.pathogens.dataset.toxins.ToxinBuildDataset" -Dexec.args="[Input_folder] [Output_folder]"
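Since the PrPSc and toxin pipelines follow the same two-step pattern, they could be wrapped in one helper that refuses to start the second step when the first produced nothing. This is a minimal sketch, not part of the package: the function names count_xml and build_dataset, the folder layout, and the assumption that the generated files carry an .xml extension are all hypothetical.

```shell
#!/bin/sh
set -eu

# count_xml DIR: print the number of *.xml files found under DIR.
# Assumes the generated files use the .xml extension (not confirmed here).
count_xml() {
    find "$1" -type f -name '*.xml' | wc -l | tr -d ' '
}

# build_dataset KIND ROOT: run both steps for KIND ("PrPSc" or "toxins"),
# writing to ROOT/KIND-xml and ROOT/KIND-corpus (hypothetical layout).
build_dataset() {
    kind="$1"
    root="$2"
    case "$kind" in
        PrPSc)
            docs_class="readbiomed.pathogens.dataset.PrPSc.PrPScDocuments"
            build_class="readbiomed.pathogens.dataset.PrPSc.PrPScBuildDataset"
            ;;
        toxins)
            docs_class="readbiomed.pathogens.dataset.toxins.ToxinDocuments"
            build_class="readbiomed.pathogens.dataset.toxins.ToxinBuildDataset"
            ;;
        *)
            echo "unknown kind: $kind" >&2
            return 1
            ;;
    esac

    xml_dir="$root/$kind-xml"
    corpus_dir="$root/$kind-corpus"
    mkdir -p "$xml_dir" "$corpus_dir"

    # Step 1: generate the XML files from the predefined list.
    mvn exec:java -Dexec.mainClass="$docs_class" -Dexec.args="$xml_dir"

    # Stop early if step 1 produced no XML files.
    if [ "$(count_xml "$xml_dir")" -eq 0 ]; then
        echo "no XML files found in $xml_dir" >&2
        return 1
    fi

    # Step 2: recover MEDLINE citations for the generated XML files.
    mvn exec:java -Dexec.mainClass="$build_class" -Dexec.args="$xml_dir $corpus_dir"
}

# Usage, from the cloned folder:
#   build_dataset PrPSc ./pathogen-data
#   build_dataset toxins ./pathogen-data
```

The check between the steps is a convenience for catching an empty first step early; the package itself may already report this condition.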

References

If you use this work in your research, remember to cite it. More information about our work is available in the following paper.

@article{jimeno2023classifying,
  title={Classifying literature mentions of biological pathogens as experimentally studied using natural language processing},
  author={Jimeno Yepes, Antonio and Verspoor, Karin},
  journal={Journal of Biomedical Semantics},
  year={2023},
  volume={14},
  number={1},
  doi={10.1186/s13326-023-00282-y},
  url={https://doi.org/10.1186/s13326-023-00282-y}
}
