This project crawls paper metadata from computer science conference websites and converts it into `.xml` and `.bib` files for Zotero import.
Note
Currently, only IEEE conferences and journals are supported. More will be added in the future.
Warning
Because the exact RSS format that Zotero requires is not yet known, the current `.xml` files do not import properly. Use the `.bib` files instead. We are working on a fix for this issue.
- Web scraping of conference websites for paper metadata
- Conversion of scraped data into Zotero-compatible formats (`.xml` and `.bib`)
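As a rough illustration of the conversion step, the sketch below renders one scraped record as a BibTeX entry. The record layout (`title`, `authors`, `booktitle`, `year`) and the function name `to_bibtex` are assumptions made for this example, not the project's actual data model or API.

```python
import re


def to_bibtex(paper: dict) -> str:
    """Render one paper record as a BibTeX @inproceedings entry.

    Assumed (hypothetical) record layout: title, authors (list of
    full names), booktitle, year.
    """
    # Citation key: first author's surname plus year, e.g. "lovelace2023".
    surname = paper["authors"][0].split()[-1].lower()
    key = re.sub(r"[^a-z0-9]", "", surname) + str(paper["year"])

    fields = {
        "title": paper["title"],
        "author": " and ".join(paper["authors"]),  # BibTeX joins authors with "and"
        "booktitle": paper["booktitle"],
        "year": str(paper["year"]),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@inproceedings{{{key},\n{body}\n}}"


example = {
    "title": "An Example Paper",
    "authors": ["Ada Lovelace", "Alan Turing"],
    "booktitle": "ICCV",
    "year": 2023,
}
print(to_bibtex(example))
```

A real converter would also need to escape special characters such as `{`, `}`, and `&` in titles, which this sketch leaves out.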
Conference Name | RSS Link (.xml) [often fails to import; you may need to fix the file format yourself] | BibTeX Link (.bib) |
---|---|---|
CVPR 2024 | https://bili-sakura.github.io/rss/CVPR2024.xml | https://bili-sakura.github.io/bib/CVPR2024.bib |
WACV 2024 | https://bili-sakura.github.io/rss/WACV2024.xml | https://bili-sakura.github.io/bib/WACV2024.bib |
ICCV 2023 | https://bili-sakura.github.io/rss/ICCV2023.xml | https://bili-sakura.github.io/bib/ICCV2023.bib |
CVPR 2023 | https://bili-sakura.github.io/rss/CVPR2023.xml | https://bili-sakura.github.io/bib/CVPR2023.bib |
WACV 2023 | https://bili-sakura.github.io/rss/WACV2023.xml | https://bili-sakura.github.io/bib/WACV2023.bib |
For journals, the publisher provides official RSS subscription links, which you can find on each journal's main page. Here are links for several popular journals.
Journal Name | Official RSS Link |
---|---|
TPAMI | https://ieeexplore.ieee.org/rss/TOC34.XML |
TGRS | https://ieeexplore.ieee.org/rss/TOC36.XML |
To get started with this project, follow the steps below to set up your development environment.
git clone https://github.com/bili-sakura/Zotero-Subscription-Helper.git
cd Zotero-Subscription-Helper
It's recommended to use a virtual environment to manage dependencies. Run the following commands:
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
Install the required Python packages using `pip`:
pip install -r requirements.txt
You can now run the main script to start scraping:
python main.py
Once you're done working, you can deactivate the virtual environment:
deactivate
You can provide a configuration file in YAML format to specify the conference name, year, and other settings.
- Create a YAML configuration file (e.g., `config/config.yaml`):

      conference: ICCV
      year: 2023
      max_papers: 5000  # Optional, defaults to 5000 if not provided

- Run the script with the configuration file:

      python main.py --config config/config.yaml
Alternatively, you can provide the necessary information directly through command-line arguments.
- `--conference`: The name of the conference (e.g., ICCV, CVPR).
- `--year`: The year of the conference.
- `--max-papers`: The maximum number of papers to scrape. Default is 5000.
- `--config`: Path to a YAML configuration file. If provided, this overrides command-line arguments.
python main.py --conference ICCV --year 2023 --max-papers 5000
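For readers who want to see how such flags are typically declared, here is a minimal `argparse` sketch. The flag names and the default of 5000 come from the list above; the function name `build_parser` and the help strings are assumptions, and the actual `main.py` may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Scrape conference paper metadata.")
    parser.add_argument("--conference", help="Conference name, e.g. ICCV or CVPR")
    parser.add_argument("--year", type=int, help="Year of the conference")
    parser.add_argument("--max-papers", type=int, default=5000,
                        help="Maximum number of papers to scrape")
    parser.add_argument("--config", help="Path to a YAML configuration file")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["--conference", "ICCV", "--year", "2023"])
print(args.conference, args.year, args.max_papers)
```

Note that `argparse` exposes `--max-papers` as the attribute `args.max_papers`, which matches the `max_papers` key used in the YAML file.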
- If both a configuration file and command-line arguments are provided, the configuration file will take precedence.
- Ensure that the output directory (`out/`) is writable and has enough space to store the `.bib` and `.xml` files.
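The two notes above can be sketched as follows, assuming settings are held in plain dictionaries. The helper names `resolve_settings` and `ensure_writable_dir` are hypothetical and only illustrate the described behavior: values from the configuration file override command-line values, and the output directory is checked before scraping starts.

```python
import os
import tempfile
from typing import Optional

DEFAULTS = {"max_papers": 5000}


def resolve_settings(cli_values: dict, file_values: Optional[dict]) -> dict:
    """Merge settings so that the configuration file takes precedence."""
    settings = dict(DEFAULTS)
    # Apply command-line values first, ignoring flags that were not given.
    settings.update({k: v for k, v in cli_values.items() if v is not None})
    # Apply file values last so they win over the command line.
    if file_values:
        settings.update(file_values)
    return settings


def ensure_writable_dir(path: str) -> None:
    """Create the directory if needed and confirm a file can be written to it."""
    os.makedirs(path, exist_ok=True)
    # Creating a throwaway file raises OSError if the directory is not writable.
    with tempfile.TemporaryFile(dir=path):
        pass


cli = {"conference": "CVPR", "year": 2024, "max_papers": None}
from_file = {"conference": "ICCV", "year": 2023}
print(resolve_settings(cli, from_file))
```

This sketch does not check free disk space; `shutil.disk_usage` from the standard library could be used for that.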
This project is strongly inspired by the CPR-RSS repository by @XgDuan.
- Resume from an existing output file, skipping items that are already present
- Fix the `.xml` format so that RSS feeds import into Zotero
- Use multiprocessing to speed up scraping