Requires: docker, node, python
1. Parsing, storing in db
in pubmed-to-db:
1.1 Install dependencies: npm install
1.2 Start database: docker-compose up
1.3 Create database table structure: node db-setup.js
1.4 Download xml-results from the source:
1.5 Parse file: node parse-from-xml.js <path-to-xml-file>
(the xml file is an export from PubMed)
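For orientation, here is a minimal Python sketch of the kind of data step 1.5 reads from a PubMed export. It is not the repo's parser (that one is parse-from-xml.js); the function name extract_articles, the inline sample record, and the choice of fields (PMID, title, abstract) are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the PubMed export layout (hypothetical record, illustration only).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_articles(xml_text):
    """Return one dict per PubmedArticle with pmid, title and abstract."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        # An abstract may be split into several labelled AbstractText elements.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iterfind(
                "MedlineCitation/Article/Abstract/AbstractText"
            )
        ]
        articles.append({"pmid": pmid, "title": title, "abstract": " ".join(parts)})
    return articles


records = extract_articles(SAMPLE_XML)
```

Each dict in records corresponds to one article; how the actual script maps such fields to database columns is not shown here.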
2. Preprocessing
in topic-modelling:
2.1 Install modules: pip install nltk spacy gensim
2.2 Download nltk stopwords: python
2.3 Download spacy en module: python -m spacy download en
2.4 Preprocess the texts: python
2.5 Transform corpus to dictionary and bag-of-words structure: python -c "from transform_corpus import *; save_corpus_and_dictionary_to_file()"
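To make the "dictionary and bag-of-words structure" from step 2.5 concrete, here is a small pure-Python sketch of the idea. It does not reproduce transform_corpus; gensim offers corpora.Dictionary and its doc2bow method for this, and the ids gensim assigns may differ from the ones below. The helper names build_dictionary and doc_to_bow and the sample documents are hypothetical.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc, token2id):
    """Convert one tokenized document to a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Hypothetical preprocessed texts (already tokenized, stopwords removed).
docs = [["gene", "expression", "gene"], ["protein", "expression"]]
token2id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]
```

The dictionary maps tokens to ids, and each document becomes a list of (id, count) pairs; word order is discarded, which is the input form topic models of this kind work on.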
3. Topic modelling
in topic-modelling:
3.1 Download Mallet, set MALLET_PATH environment variable
3.2 Generate a topic model: python <'gensim' | 'mallet'> ...topic_number_configurations
3.3 Calculate coherence scores: python ...model_names
3.4 Extract topics to a csv file: python <model-name> <number of topics>
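As an illustration of step 3.4, the sketch below writes topics to CSV, one row per topic with its highest-weighted words. It assumes topics arrive as (topic_id, [(word, weight), ...]) pairs, which is the shape gensim's show_topics(formatted=False) returns; the function name write_topics_csv, the column layout, and the sample topics are hypothetical and may not match what the repo's script produces.

```python
import csv
import io


def write_topics_csv(topics, out_file, top_n=10):
    """Write one row per topic: the topic id followed by its top_n words."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["topic"] + [f"word_{i + 1}" for i in range(top_n)])
    for topic_id, word_weights in topics:
        # Rank words by weight, highest first, and keep only the words.
        ranked = sorted(word_weights, key=lambda pair: pair[1], reverse=True)
        writer.writerow([topic_id] + [word for word, _ in ranked[:top_n]])


# Hypothetical topics in (topic_id, [(word, weight), ...]) form.
topics = [
    (0, [("gene", 0.30), ("expression", 0.50)]),
    (1, [("protein", 0.40), ("binding", 0.10)]),
]
buffer = io.StringIO()
write_topics_csv(topics, buffer, top_n=2)
csv_text = buffer.getvalue()
```

To write to disk instead of a buffer, pass a file opened with open(path, "w", newline="") as out_file.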