diff --git a/README.md b/README.md index a41566b4cb..100526d7e1 100644 --- a/README.md +++ b/README.md @@ -13,13 +13,11 @@ Among other goals, our effort aims to be [the opposite of this](http://phdcomics Anserini grew out of [a reproducibility study of various open-source retrieval engines in 2016](https://link.springer.com/chapter/10.1007/978-3-319-30671-1_30) (Lin et al., ECIR 2016). See [Yang et al. (SIGIR 2017)](https://dl.acm.org/doi/10.1145/3077136.3080721) and [Yang et al. (JDIQ 2018)](https://dl.acm.org/doi/10.1145/3239571) for overviews. -❗ Anserini was upgraded from JDK 11 to JDK 21 at commit [`272565`](/~https://github.com/castorini/anserini/commit/39cecf6c257bae85f4e9f6ab02e0be101338c3cc) (2024/04/03), which corresponds to the release of v0.35.0. - ## 💥 Try It! Anserini is packaged in a self-contained fatjar, which also provides the simplest way to get started. -Assuming you've already got Java installed, fetch the fatjar: +Assuming you've already got Java 21 installed (Yes, you need _exactly_ this version), fetch the fatjar: ```bash wget https://repo1.maven.org/maven2/io/anserini/anserini/0.39.0/anserini-0.39.0-fatjar.jar @@ -39,7 +37,9 @@ java -cp anserini-0.39.0-fatjar.jar io.anserini.search.SearchCollection \ To evaluate: ```bash -java -cp anserini-0.39.0-fatjar.jar trec_eval -c -M 10 -m recip_rank msmarco-passage.dev-subset run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt +java -cp anserini-0.39.0-fatjar.jar trec_eval \ + -c -M 10 -m recip_rank msmarco-passage.dev-subset \ + run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt ``` See [detailed instructions](docs/fatjar-regressions/fatjar-regressions-v0.39.0.md) for the current fatjar release of Anserini (v0.39.0) to reproduce regression experiments on the MS MARCO V2.1 corpora for TREC 2024 RAG, on MS MARCO V1 Passage, and on BEIR, all directly from the fatjar! 
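Since exactly Java 21 is required, a quick sanity check before fetching the fatjar might look like the following (a sketch; the version-banner parsing assumes the common `openjdk version "21.x.y"` format, which varies by vendor):

```bash
# Sketch: verify that the active JVM is exactly Java 21 before running Anserini.
get_java_major() {
  # e.g. 'openjdk version "21.0.2" 2024-01-16' -> 21
  sed -n 's/.*version "\([0-9][0-9]*\)[.">].*/\1/p' <<< "$1"
}

if command -v java >/dev/null 2>&1; then
  banner="$(java -version 2>&1 | head -n 1)"
  if [ "$(get_java_major "$banner")" != "21" ]; then
    echo "Anserini needs exactly Java 21; found: $banner" >&2
  fi
fi
```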
@@ -47,6 +47,9 @@ See [detailed instructions](docs/fatjar-regressions/fatjar-regressions-v0.39.0.m Also, Anserini comes with a built-in webapp for interactive querying along with a REST API that can be used by other applications. Check out our documentation [here](docs/rest-api.md). +❗ Beware, Anserini ships with many prebuilt indexes, which are automatically downloaded upon request (for example, `-index msmarco-v1-passage.splade-pp-ed` above triggers the download of a prebuilt index): these indexes can take up a lot of space. +See [this guide on prebuilt indexes](docs/prebuilt-indexes.md) for more details. + @@ -296,11 +299,11 @@ Key: ### MS MARCO V2.1 Segmented Document Regressions -The MS MARCO V2.1 corpora were derived from the V2 corpora for the TREC 2024 RAG Track. +The MS MARCO V2.1 corpora (documents and segmented documents) were derived from the V2 documents corpus for the TREC 2024 RAG Track. Instructions for downloading the corpus can be found [here](https://trec-rag.github.io/annoucements/2024-corpus-finalization/). -The experiments below use _passage-level_ qrels. +The experiments below capture topics and _passage-level_ qrels for the V2.1 segmented documents corpus. -| | RAG 24 | +| | RAG 24 UMBRELA | |-----------|:-------------------------------------------------------------:| | baselines | [+](docs/regressions/regressions-rag24-doc-segmented-test.md) | @@ -312,10 +315,11 @@ The experiments below use _passage-level_ qrels. ### MS MARCO V2.1 Document Regressions -The MS MARCO V2.1 corpora were derived from the V2 corpora for the TREC 2024 RAG Track. +The MS MARCO V2.1 corpora (documents and segmented documents) were derived from the V2 documents corpus for the TREC 2024 RAG Track. Instructions for downloading the corpus can be found [here](https://trec-rag.github.io/annoucements/2024-corpus-finalization/). 
-The experiments below capture topics and _document-level_ qrels originally targeted at the V2 corpora, but have been "projected" over to the V2.1 corpora. +The experiments below capture topics and _document-level_ qrels originally targeted at the V2 documents corpus, but have been "projected" over to the V2.1 documents corpus. These should be treated like dev topics for the TREC 2024 RAG Track; actual qrels for that track were generated at the passage level. +There are no plans to generate additional _document-level_ qrels beyond these. | | dev | DL21 | DL22 | DL23 | RAGgy dev | |-----------------------------------------|:---------------------------------------------------------------:|:--------------------------------------------------------------------:|:--------------------------------------------------------------------:|:--------------------------------------------------------------------:|:------------------------------------------------------------------:| @@ -635,6 +639,7 @@ Beyond that, there are always [open issues](/~https://github.com/castorini/anserin ## 📜️ Historical Notes ++ Anserini was upgraded from JDK 11 to JDK 21 at commit [`39cecf`](/~https://github.com/castorini/anserini/commit/39cecf6c257bae85f4e9f6ab02e0be101338c3cc) (2024/04/03), which corresponds to the release of v0.35.0. + Anserini was upgraded to Lucene 9.3 at commit [`272565`](/~https://github.com/castorini/anserini/commit/27256551e958f39495b04e89ef55de9d27f33414) (8/2/2022): this upgrade created backward compatibility issues, see [#1952](/~https://github.com/castorini/anserini/issues/1952). Anserini will automatically detect Lucene 8 indexes and disable consistent tie-breaking to avoid runtime errors. However, Lucene 9 code running on Lucene 8 indexes may give slightly different results than Lucene 8 code running on Lucene 8 indexes.
diff --git a/docs/fatjar-regressions/fatjar-regressions-v0.36.1.md b/docs/fatjar-regressions/fatjar-regressions-v0.36.1.md index 4461180946..5a88f935a1 100644 --- a/docs/fatjar-regressions/fatjar-regressions-v0.36.1.md +++ b/docs/fatjar-regressions/fatjar-regressions-v0.36.1.md @@ -46,7 +46,7 @@ Both indexes will be downloaded automatically. For the TREC 2024 RAG track, we have thus far only implemented BM25 baselines on the MS MARCO V2.1 document corpus (both the doc and doc segmented variants). Current results are based existing qrels that have been "projected" over from MS MARCO V2.0 passage judgments. -The table below reports effectiveness (dev in terms of RR@10, DL21-DL23, RAGgy in terms of nDCG@10): +The table below reports effectiveness (dev in terms of RR@100, DL21-DL23, RAGgy in terms of nDCG@10): | | dev | dev2 | DL21 | DL22 | DL23 | RAGgy | |:---------------------------------------------------------------------------|-------:|-------:|-------:|-------:|-------:|-------:| diff --git a/docs/fatjar-regressions/fatjar-regressions-v0.37.0.md b/docs/fatjar-regressions/fatjar-regressions-v0.37.0.md index 254cbb8135..7e0ee10231 100644 --- a/docs/fatjar-regressions/fatjar-regressions-v0.37.0.md +++ b/docs/fatjar-regressions/fatjar-regressions-v0.37.0.md @@ -101,7 +101,7 @@ Replace `-index msmarco-v2.1-doc` with `-index msmarco-v2.1-doc-segemented` if y Since the TREC 2024 RAG evaluation hasn't happened yet, there are no qrels for evaluation. However, we _do_ have results based existing qrels that have been "projected" over from MS MARCO V2.0 passage judgments. 
-The table below reports effectiveness (dev in terms of RR@10, DL21-DL23, RAGgy in terms of nDCG@10): +The table below reports effectiveness (dev in terms of RR@100, DL21-DL23, RAGgy in terms of nDCG@10): | | dev | dev2 | DL21 | DL22 | DL23 | RAGgy | |:---------------------------------------------------------------------------|-------:|-------:|-------:|-------:|-------:|-------:| diff --git a/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md b/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md index a44b37ab47..753ec17f7b 100644 --- a/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md +++ b/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md @@ -101,7 +101,7 @@ Replace `-index msmarco-v2.1-doc` with `-index msmarco-v2.1-doc-segemented` if y Since the TREC 2024 RAG evaluation hasn't happened yet, there are no qrels for evaluation. However, we _do_ have results based existing qrels that have been "projected" over from MS MARCO V2.0 passage judgments. -The table below reports effectiveness (dev in terms of RR@10, DL21-DL23, RAGgy in terms of nDCG@10): +The table below reports effectiveness (dev in terms of RR@100, DL21-DL23, RAGgy in terms of nDCG@10): | | dev | dev2 | DL21 | DL22 | DL23 | RAGgy | |:---------------------------------------------------------------------------|-------:|-------:|-------:|-------:|-------:|-------:| diff --git a/docs/fatjar-regressions/fatjar-regressions-v0.39.0.md b/docs/fatjar-regressions/fatjar-regressions-v0.39.0.md index 10ac2df14a..3e4aafc437 100644 --- a/docs/fatjar-regressions/fatjar-regressions-v0.39.0.md +++ b/docs/fatjar-regressions/fatjar-regressions-v0.39.0.md @@ -6,10 +6,6 @@ Fetch the fatjar: wget https://repo1.maven.org/maven2/io/anserini/anserini/0.39.0/anserini-0.39.0-fatjar.jar ``` -Note that prebuilt indexes will be downloaded to `~/.cache/pyserini/indexes/`. -Currently, this path is hard-coded (see [Anserini #2322](/~https://github.com/castorini/anserini/issues/2322)). 
-If you want to change the download location, the current workaround is to use symlinks, i.e., symlink `~/.cache/pyserini/indexes/` to the actual path you desire. - Let's start out by setting the `ANSERINI_JAR` and the `OUTPUT_DIR`: ```bash @@ -17,6 +13,10 @@ export ANSERINI_JAR="anserini-0.39.0-fatjar.jar" export OUTPUT_DIR="." ``` +❗ Anserini ships with a number of prebuilt indexes, which it'll automagically download for you. +This is a great feature, but the indexes can take up a lot of space. +See [this guide on prebuilt indexes](../prebuilt-indexes.md) for more details. + ## Webapp and REST API Anserini has a built-in webapp for interactive querying along with a REST API that can be used by other applications. @@ -28,44 +28,80 @@ java -cp $ANSERINI_JAR io.anserini.server.Application --server.port=8081 And then navigate to [`http://localhost:8081/`](http://localhost:8081/) in your browser. -Here's a specific example of using the REST API to issue the query "How does the process of digestion and metabolism of carbohydrates start" to `msmarco-v2.1-doc`: +Here's a specific example of using the REST API to issue the query "How does the process of digestion and metabolism of carbohydrates start" to `msmarco-v2.1-doc-segmented`: ```bash -curl -X GET "http://localhost:8081/api/v1.0/indexes/msmarco-v2.1-doc/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start" +curl -X GET "http://localhost:8081/api/v1.0/indexes/msmarco-v2.1-doc-segmented/search?query=How%20does%20the%20process%20of%20digestion%20and%20metabolism%20of%20carbohydrates%20start" ``` -The json results are the same as the output of the `-outputRerankerRequests` option in `SearchCollection`, described below for TREC 2024 RAG. +The json results are the same as the output of the `-outputRerankerRequests` option in `SearchCollection`, described below for "MS MARCO V2.1 + TREC RAG". 
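The query string in the REST API URL above is percent-encoded by hand; to build such requests programmatically, a small pure-bash encoder (a sketch with no external dependencies; any URL-encoder works equally well) can do the job:

```bash
# Sketch: percent-encode a free-text query for the REST API's `query` parameter.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;                # unreserved: copy as-is
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;  # everything else: %XX
    esac
  done
  printf '%s\n' "$out"
}

QUERY="How does the process of digestion and metabolism of carbohydrates start"
echo "http://localhost:8081/api/v1.0/indexes/msmarco-v2.1-doc-segmented/search?query=$(urlencode "$QUERY")"
```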
Use the `hits` parameter to specify the number of hits to return, e.g., `hits=1000` to return the top 1000 hits. -Switch to `msmarco-v2.1-doc-segmented` in the route to query the segmented docs instead. Details of the built-in webapp and REST API can be found [here](../rest-api.md). -## TREC 2024 RAG +❗ Beware, the above commands will trigger automatic downloading of prebuilt indexes, which take up a lot of space. +The `msmarco-v2.1-doc` prebuilt index is 63 GB uncompressed. +The `msmarco-v2.1-doc-segmented` prebuilt index is 84 GB uncompressed. +See [this guide on prebuilt indexes](../prebuilt-indexes.md) for more details. + +## MS MARCO V2.1 + TREC RAG + +For the [TREC RAG Track](https://trec-rag.github.io/), Anserini so far has only BM25 baselines. +The evaluation uses the MS MARCO V2.1 corpora, which have two "variants", documents and segmented documents: -For the [TREC 2024 RAG Track](https://trec-rag.github.io/), we have thus far only implemented BM25 baselines on the MS MARCO V2.1 document corpus (both the doc and doc segmented variants). ++ The segmented documents corpus (segments = passages) is the one actually used for the TREC RAG evaluations. It contains 113,520,750 passages. ++ The documents corpus is the source of the segments and is useful as a point of reference. It contains 10,960,555 documents. ❗ Beware, you need lots of space to run these experiments. The `msmarco-v2.1-doc` prebuilt index is 63 GB uncompressed. The `msmarco-v2.1-doc-segmented` prebuilt index is 84 GB uncompressed. Both indexes will be downloaded automatically. +See [this guide on prebuilt indexes](../prebuilt-indexes.md) for more details. This release of Anserini comes with bindings for the test topics for the TREC 2024 RAG track (`-topics rag24.test`).
+To generate a standard TREC run file for these topics (top-1000 hits, BM25), issue the following command: + +```bash +java -cp $ANSERINI_JAR io.anserini.search.SearchCollection \ + -index msmarco-v2.1-doc-segmented \ + -topics rag24.test \ + -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt \ + -bm25 -hits 1000 +``` + +The UMBRELA qrels are included in this release. +To evaluate using them: + +```bash +java -cp $ANSERINI_JAR trec_eval -c -m ndcg_cut.10 rag24.test-umbrela-all \ + $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt +``` + + + +The expected results: + +``` +ndcg_cut_10 all 0.3290 +``` + To generate jsonl output containing the raw documents that can be reranked and further processed, use the `-outputRerankerRequests` option to specify an output file. For example: ```bash java -cp $ANSERINI_JAR io.anserini.search.SearchCollection \ - -index msmarco-v2.1-doc \ + -index msmarco-v2.1-doc-segmented \ -topics rag24.test \ - -output $OUTPUT_DIR/run.msmarco-v2.1-doc.bm25.rag24.test.txt \ + -output $OUTPUT_DIR/run.msmarco-v2.1-doc-segmented.bm25.rag24.test.txt \ -bm25 -hits 20 \ - -outputRerankerRequests $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl + -outputRerankerRequests $OUTPUT_DIR/results.msmarco-v2.1-doc-segmented.bm25.rag24.test.jsonl ``` -And the output looks something like (pipe through `jq` to pretty-print): +In the above command, we only fetch the top-20 hits. 
+To examine the output, pipe through `jq` to pretty-print: ```bash -$ head -n 1 $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl | jq +$ head -n 1 $OUTPUT_DIR/results.msmarco-v2.1-doc-segmented.bm25.rag24.test.jsonl | jq { "query": { "qid": "2024-105741", @@ -73,23 +109,27 @@ $ head -n 1 $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl | jq }, "candidates": [ { - "docid": "msmarco_v2.1_doc_38_1524878562", - "score": 14.4877, + "docid": "msmarco_v2.1_doc_16_287012450#4_490828734", + "score": 15.8199, "doc": { - "url": "https://www.ebmconsult.com/articles/lab-test-white-blood-count-wbc", - "title": "Lab Test: White Blood Cell Count, WBC", - "headings": "...", - "body": "..." + "url": "https://emedicine.medscape.com/article/961169-treatment", + "title": "Bacteremia Treatment & Management: Medical Care", + "headings": "Bacteremia Treatment & Management\nBacteremia Treatment & Management\nMedical Care\nHow well do low-risk criteria work?\nEmpiric antibiotics: How well do they work?\nTreatment algorithms\n", + "segment": "band-to-neutrophil ratio\n< 0.2\n< 20,000/μL\n5-15,000/μL; ABC < 1,000\n5-15,000/μL; ABC < 1,000\nUrine assessment\n< 10 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF; Leukocyte esterase negative\n< 10 WBCs per HPF\n< 5 WBCs per HPF\nCSF assessment\n< 8 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF\n< 10-20 WBCs per HPF\n…\nChest radiography\nNo infiltrate\nWithin reference range, if obtained\nWithin reference range, if obtained\n…\nStool culture\n< 5 WBCs per HPF\n…\n< 5 WBCs per HPF\n…\n* Acute illness observation score\nHow well do low-risk criteria work? The above guidelines are presented to define a group of febrile young infants who can be treated without antibiotics. Statistically, this translates into a high NPV (ie, a very high proportion of true negative cultures is observed in patients deemed to be at low risk). 
The NPV of various low-risk criteria for serious bacterial infection and occult bacteremia are as follows [ 10, 14, 16, 19, 74, 75, 76] : Philadelphia NPV - 95-100%\nBoston NPV - 95-98%\nRochester NPV - 98.3-99%\nAAP 1993 - 99-99.8%\nIn basic terms, even by the most stringent criteria, somewhere between 1 in 100 and 1 in 500 low-risk, but bacteremic, febrile infants are missed.", + "start_char": 2846, + "end_char": 4049 } }, { - "docid": "msmarco_v2.1_doc_19_1675146822", - "score": 14.3835, + "docid": "msmarco_v2.1_doc_16_287012450#3_490827079", + "score": 15.231, "doc": { - "url": "https://fcer.org/white-blood-cells/", - "title": "White Blood Cells (WBCs) - Definition, Function, and Ranges", - "headings": "...", - "body": "..." + "url": "https://emedicine.medscape.com/article/961169-treatment", + "title": "Bacteremia Treatment & Management: Medical Care", + "headings": "Bacteremia Treatment & Management\nBacteremia Treatment & Management\nMedical Care\nHow well do low-risk criteria work?\nEmpiric antibiotics: How well do they work?\nTreatment algorithms\n", + "segment": "73] Since then, numerous studies have evaluated combinations of age, temperature, history, examination findings, and laboratory results to determine which young infants are at a low risk for bacterial infection. [ 10, 66, 74, 75, 76]\nThe following are the low-risk criteria established by groups from Philadelphia, Boston, and Rochester and the 1993 American Academy of Pediatrics (AAP) guideline. Table 11. 
Low-Risk Criteria for Infants Younger than 3 Months [ 10, 74, 75, 76] (Open Table in a new window)\nCriterion\nPhiladelphia\nBoston\nRochester\nAAP 1993\nAge\n1-2 mo\n1-2 mo\n0-3 mo\n1-3 mo\nTemperature\n38.2°C\n≥38°C\n≥38°C\n≥38°C\nAppearance\nAIOS * < 15\nWell\nAny\nWell\nHistory\nImmune\nNo antibiotics in the last 24 h; No immunizations in the last 48 h\nPreviously healthy\nPreviously healthy\nExamination\nNonfocal\nNonfocal\nNonfocal\nNonfocal\nWBC count\n< 15,000/μL; band-to-neutrophil ratio\n< 0.2\n< 20,000/μL\n5-15,000/μL; ABC < 1,000\n5-15,000/μL; ABC < 1,000\nUrine assessment\n< 10 WBCs per HPF; Negative for bacteria\n< 10 WBCs per HPF; Leukocyte esterase negative\n< 10 WBCs per HPF\n< 5 WBCs per HPF\nCSF assessment\n< 8 WBCs per HPF;", + "start_char": 1993, + "end_char": 3111 } }, ... @@ -99,9 +139,9 @@ $ head -n 1 $OUTPUT_DIR/results.msmarco-v2.1-doc.bm25.rag24.test.jsonl | jq Replace `-index msmarco-v2.1-doc` with `-index msmarco-v2.1-doc-segemented` if you want to search over the doc segments instead of the full docs. -Since the TREC 2024 RAG evaluation hasn't happened yet, there are no qrels for evaluation. -However, we _do_ have results based existing qrels that have been "projected" over from MS MARCO V2.0 passage judgments. -The table below reports effectiveness (dev in terms of RR@10, DL21-DL23, RAGgy in terms of nDCG@10): +The experiments below capture _document-level_ qrels originally targeted at the V2 documents corpus, but have been "projected" over to the V2.1 documents corpus. +These can be viewed as dev topics for the TREC 2024 RAG Track (and were released prior to the evaluation). 
+The table below reports effectiveness (dev in terms of RR@100, DL21-DL23, RAGgy in terms of nDCG@10): | | dev | dev2 | DL21 | DL22 | DL23 | RAGgy | |:---------------------------------------------------------------------------|-------:|-------:|-------:|-------:|-------:|-------:| @@ -344,55 +384,55 @@ done And here's the snippet of code to perform the evaluation (which will yield the scores above): ```bash -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bm25.msmarco-v1-passage.dev.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bm25.msmarco-v1-passage.dev.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.bm25.dl19-passage.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.bm25.dl20-passage.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.cached_q.msmarco-v1-passage.dev.splade-pp-ed.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.cached_q.msmarco-v1-passage.dev.splade-pp-ed.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.cached_q.dl19-passage.splade-pp-ed.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.cached_q.dl20-passage.splade-pp-ed.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.onnx.msmarco-v1-passage.dev.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.onnx.msmarco-v1-passage.dev.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage 
$OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.onnx.dl19-passage.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.splade-pp-ed.onnx.dl20-passage.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.cached_q.msmarco-v1-passage.dev.cosdpr-distil.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.cached_q.msmarco-v1-passage.dev.cosdpr-distil.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.cached_q.dl19-passage.cosdpr-distil.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.cached_q.dl20-passage.cosdpr-distil.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.onnx.msmarco-v1-passage.dev.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.onnx.msmarco-v1-passage.dev.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.onnx.dl19-passage.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw.onnx.dl20-passage.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.cached_q.msmarco-v1-passage.dev.cosdpr-distil.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.cached_q.msmarco-v1-passage.dev.cosdpr-distil.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage 
$OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.cached_q.dl19-passage.cosdpr-distil.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.cached_q.dl20-passage.cosdpr-distil.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.onnx.msmarco-v1-passage.dev.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.onnx.msmarco-v1-passage.dev.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.onnx.dl19-passage.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.cosdpr-distil.hnsw-int8.onnx.dl20-passage.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.cached_q.msmarco-v1-passage.dev.bge-base-en-v1.5.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.cached_q.msmarco-v1-passage.dev.bge-base-en-v1.5.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.cached_q.dl19-passage.bge-base-en-v1.5.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.cached_q.dl20-passage.bge-base-en-v1.5.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.onnx.msmarco-v1-passage.dev.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.onnx.msmarco-v1-passage.dev.txt java -cp $ANSERINI_JAR trec_eval -m 
ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.onnx.dl19-passage.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw.onnx.dl20-passage.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.cached_q.msmarco-v1-passage.dev.bge-base-en-v1.5.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.cached_q.msmarco-v1-passage.dev.bge-base-en-v1.5.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.cached_q.dl19-passage.bge-base-en-v1.5.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.cached_q.dl20-passage.bge-base-en-v1.5.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.onnx.msmarco-v1-passage.dev.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.onnx.msmarco-v1-passage.dev.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.onnx.dl19-passage.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.bge-base-en-v1.5.hnsw-int8.onnx.dl20-passage.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw.cached_q.msmarco-v1-passage.dev.cohere-embed-english-v3.0.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev 
$OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw.cached_q.msmarco-v1-passage.dev.cohere-embed-english-v3.0.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw.cached_q.dl19-passage.cohere-embed-english-v3.0.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw.cached_q.dl20-passage.cohere-embed-english-v3.0.txt echo '' -java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw-int8.cached_q.msmarco-v1-passage.dev.cohere-embed-english-v3.0.txt +java -cp $ANSERINI_JAR trec_eval -c -M 10 -m recip_rank msmarco-v1-passage.dev $OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw-int8.cached_q.msmarco-v1-passage.dev.cohere-embed-english-v3.0.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl19-passage $OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw-int8.cached_q.dl19-passage.cohere-embed-english-v3.0.txt java -cp $ANSERINI_JAR trec_eval -m ndcg_cut.10 -c dl20-passage $OUTPUT_DIR/run.msmarco-v1-passage.cohere-embed-english-v3.0.hnsw-int8.cached_q.dl20-passage.cohere-embed-english-v3.0.txt ``` diff --git a/docs/ltr-features.md b/docs/ltr-features.md deleted file mode 100644 index b8c336972e..0000000000 --- a/docs/ltr-features.md +++ /dev/null @@ -1,87 +0,0 @@ -# LTR Features -|Feature name | -|-------------------------------------------------| -|[IBM Model1](../src/main/java/io/anserini/ltr/feature/IbmModel1.java) -|[Sum of BM25](../src/main/java/io/anserini/ltr/feature/BM25Stat.java) -|Average of BM25 -|Median of BM25 -|Max of BM25 -|Min of BM25 -|MaxMinRatio of BM25 -|[Sum of LMDir](../src/main/java/io/anserini/ltr/feature/LmDirStat.java) -|Average of LMDir -|Median of LMDir -|Max of LMDir -|Min of LMDir -|MaxMinRatio of LMDir -| [Sum of 
DFR\_GL2](../src/main/java/io/anserini/ltr/feature/DfrGl2Stat.java) -| Average of DFR\_GL2 -| Median of DFR\_GL2 -| Max of DFR\_GL2 -| Min of DFR\_GL2 -| MaxMinRatio of DFR\_GL2 -| [Sum of DFR\_in\_expB2](../src/main/java/io/anserini/ltr/feature/DfrInExpB2Stat.java) -| Average of DFR\_in\_expB2 -| Median of DFR\_in\_expB2 -| Max of DFR\_in\_expB2 -| Min of DFR\_in\_expB2 -| MaxMinRatio of DFR\_in\_expB2 -| [Sum of DPH](../src/main/java/io/anserini/ltr/feature/DphStat.java) -| Average of DPH -| Median of DPH -| Max of DPH -| Min of DPH -| MaxMinRatio of DPH -| [Sum of TF](../src/main/java/io/anserini/ltr/feature/TfStat.java) -| Average of TF -| Median of TF -| Max of TF -| Min of TF -| MaxMinRatio of TF -| [Sum of TFIDF](../src/main/java/io/anserini/ltr/feature/TfIdfStat.java) -| Average of TFIDF -| Median of TFIDF -| Max of TFIDF -| Min of TFIDF -| MaxMinRatio of TFIDF -| [Sum of Normalized TF](../src/main/java/io/anserini/ltr/feature/NormalizedTfStat.java) -| Average of Normalized TF -| Median of Normalized TF -| Max of Normalized TF -| Min of Normalized TF -| MaxMinRatio of Normalized TF -| [Sum of IDF](../src/main/java/io/anserini/ltr/feature/IdfStat.java) -| Average of IDF -| Median of IDF -| Max of IDF -| Min of IDF -| MaxMinRatio of IDF -| [Sum of ICTF](../src/main/java/io/anserini/ltr/feature/IcTfStat.java) -| Average of ICTF -| Median of ICTF -| Max of ICTF -| Min of ICTF -| MaxMinRatio of ICTFs -| [UnorderedSequentialPairs with gap 3](../src/main/java/io/anserini/ltr/feature/UnorderedSequentialPairs.java) -| UnorderedSequentialPairs with gap 8 -| UnorderedSequentialPairs with gap 15 -| [OrderedSequentialPairs with gap 3](../src/main/java/io/anserini/ltr/feature/OrderedSequentialPairs.java) -| OrderedSequentialPairs with gap 8 -| OrderedSequentialPairs with gap 15 -| [UnorderedQueryPairs with gap 3](../src/main/java/io/anserini/ltr/feature/UnorderedQueryPairs.java) -| UnorderedQueryPairs with gap 8 -| UnorderedQueryPairs with gap 15 -| [OrderedQueryPairs 
with gap 3](../src/main/java/io/anserini/ltr/feature/OrderedQueryPairs.java) -| OrderedQueryPairs with gap 8 -| OrderedQueryPairs with gap 15 -| [Normalized TFIDF](../src/main/java/io/anserini/ltr/feature/NormalizedTfIdf.java) -| [ProbabilitySum](../src/main/java/io/anserini/ltr/feature/ProbalitySum.java) -| [Proximity](../src/main/java/io/anserini/ltr/feature/Proximity.java) -| [BM25-TP score](../src/main/java/io/anserini/ltr/feature/TpScore.java) -| [TP distance](../src/main/java/io/anserini/ltr/feature/TpDist.java) -| [Doc size](../src/main/java/io/anserini/ltr/feature/DocSize.java) -| [Query Length](../src/main/java/io/anserini/ltr/feature/QueryLength.java) -| [Query Coverage Ratio](../src/main/java/io/anserini/ltr/feature/QueryCoverageRatio.java) -| [Unique Term Count in Query](../src/main/java/io/anserini/ltr/feature/UniqueTermCount.java) -| [Matching Term Count](../src/main/java/io/anserini/ltr/feature/MatchingTermCount.java) -| [SCS](../src/main/java/io/anserini/ltr/feature/SCS.java) \ No newline at end of file diff --git a/docs/prebuilt-indexes.md b/docs/prebuilt-indexes.md new file mode 100644 index 0000000000..418f156e64 --- /dev/null +++ b/docs/prebuilt-indexes.md @@ -0,0 +1,57 @@ +# Prebuilt Indexes + +Anserini ships with a number of prebuilt indexes. +This means that various indexes (inverted indexes, HNSW indexes, etc.) for common collections used in NLP and IR research have already been built and just need to be downloaded (from UWaterloo servers), which Anserini will handle automatically for you. + +Bindings for the available prebuilt indexes are in [`io.anserini.index.IndexInfo`](/~https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java). +For example, if you specify `-index msmarco-v1-passage`, Anserini will know that you mean the Lucene index of the MS MARCO V1 passage corpus. +It will then download the index from our servers at UWaterloo and cache it locally. +All of this happens automagically!
+
+## Changing the Index Location
+
+The automagic download of prebuilt indexes works great for (relatively) small indexes!
+
+However, larger indexes can cause issues.
+For example, the `msmarco-v2.1-doc` prebuilt index is 63 GB uncompressed and the `msmarco-v2.1-doc-segmented` prebuilt index is 84 GB uncompressed.
+And these are only the inverted indexes (e.g., for BM25).
+The HNSW indexes for dense retrieval models are even larger; for example, the Arctic-Embed-L indexes for the entire MS MARCO V2.1 segmented document corpus are around 550 GB.
+
+The prebuilt indexes are automatically downloaded to `~/.cache/pyserini/indexes/`, which may not be the best location for you.
+(Yes, `pyserini`; this is so prebuilt indexes from both Pyserini and Anserini can live in the same location.)
+Currently, this path is hard-coded (see [Anserini #2322](/~https://github.com/castorini/anserini/issues/2322)).
+If you want to change the download location, the current workaround is to use symlinks, i.e., symlink `~/.cache/pyserini/indexes/` to the actual path you desire.
+
+## Managing Indexes Manually
+
+Another helpful tip is to download and manage the indexes by hand.
+All relevant information is stored in [`IndexInfo`](/~https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java).
+For example, `msmarco-v1-passage` can be downloaded from:
+
+```
+https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene/lucene-inverted.msmarco-v1-passage.20221004.252b5e.tar.gz
+```
+
+and has an MD5 checksum of `678876e8c99a89933d553609a0fd8793`.
+You can download the tarball, verify it, and put it anywhere you want.
+With `-index /path/to/index/` you'll get exactly the same output as `-index msmarco-v1-passage`, except now you've got fine-grained control over managing the index.
+
+By manually managing the indexes, you can share indexes between multiple users to conserve space.
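The manual-management flow described above (fetch the tarball, verify its MD5 checksum, unpack, then point `-index` at the result) can be sketched as a short script. To keep the sketch runnable anywhere, a tiny stand-in tarball is built in place of the real multi-gigabyte download; in practice you would `wget` the UWaterloo URL and compare against the checksum recorded in `IndexInfo`:

```shell
# Sketch of managing a prebuilt index by hand: fetch, verify, unpack.
# A tiny stand-in tarball replaces the real download so the flow is runnable;
# for the real index, wget the rgw.cs.uwaterloo.ca URL quoted above and take
# EXPECTED from IndexInfo instead of computing it here.
WORK=$(mktemp -d) && cd "$WORK"
mkdir lucene-inverted.msmarco-v1-passage.20221004.252b5e
echo "stand-in segment file" > lucene-inverted.msmarco-v1-passage.20221004.252b5e/segments_1
tar czf index.tar.gz lucene-inverted.msmarco-v1-passage.20221004.252b5e

# Verify before unpacking; md5sum -c exits non-zero on a mismatch, so a
# truncated or corrupted download fails loudly here.
EXPECTED=$(md5sum index.tar.gz | awk '{print $1}')
echo "$EXPECTED  index.tar.gz" | md5sum -c -

# Unpack wherever you like; -index /path/to/unpacked/dir then behaves the
# same as the symbolic -index name.
mkdir unpacked
tar xzf index.tar.gz -C unpacked
```

The same three steps apply to any entry in `IndexInfo`; only the URL, checksum, and directory name change.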
+The schema of the index location in `~/.cache/pyserini/indexes/` is the tarball name (after unpacking), followed by a dot and the checksum, so `msmarco-v1-passage` lives in the following location:
+
+```
+~/.cache/pyserini/indexes/lucene-inverted.msmarco-v1-passage.20221004.252b5e.678876e8c99a89933d553609a0fd8793
+```
+
+You can download the index once, put it in a common location, and have each user symlink to the actual index location.
+The source would conform to the schema above; the target would be where your index actually resides.
+
+## Recovering from Partial Downloads
+
+A common issue is recovering from partial downloads, for example, if you abort the download of a large index tarball.
+In the standard flow, Anserini downloads the tarball from the UWaterloo servers, verifies the checksum, and then unpacks the tarball.
+If this process is interrupted, you'll end up in an inconsistent state.
+
+To recover, go to `~/.cache/pyserini/indexes/` and remove any tarballs (i.e., `.tar.gz` files).
+If there are any partially unpacked indexes, remove those as well.
+Then start over (e.g., rerun the command you were running before).
diff --git a/docs/reproduction-details.md b/docs/reproduction-details.md
deleted file mode 100644
index 47ccdde3ee..0000000000
--- a/docs/reproduction-details.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Adding an Entry to the "Reproduction Log"
-
-To contribute an entry to the "Reproduction Log" in a project's repository, follow these steps:
-
-1. **Fork the Repository:**
-   - If you don't already have a fork of the project's repository, go to the repository on GitHub (or the platform it's hosted on).
-   - Click the "Fork" button. This will create a copy of the repository in your own GitHub account.
-
-2. **Clone Your Fork:**
-   - Once you have your own fork, clone it to your local machine using Git.
-
-3. **Edit the Reproduction Log:**
-
-   - Open the "Reproduction Log" file in a text editor. It should be located in the project's repository. The file are named like "experiments-msmarco-doc.md" and "start-here.md".
-
-4. **Add an entry**
-   - Add your entry to the bottom of the "Reproduction Log" following the same format as the existing entries.
-   - Here's an example of how it should look:
-
-   ```bash
-   Results reproduced by @YourUsername on yyyy-mm-dd (commit YourCommitID)
-
-5. **Push Your Changes:**
-   - Push the changes to your fork on GitHub:
-
-6. **Create a Pull Request (PR):**
-   - Go to your fork on GitHub and you should see a prompt to create a new pull request for the branch you just pushed.
-   - Click on it and follow the instructions to create the pull request.
-
-7. **Submit the Pull Request:**
-   - Write a clear and concise description of your system environment (Java, Python, Maven version and ...) in the pull request, and submit it. This will notify the project maintainers of your proposed changes.
-
-8. **Wait for Review:**
-   - Project maintainers will review your pull request, and if everything looks good, they may merge it into the main repository.
diff --git a/docs/usage-intelij.md b/docs/usage-intelij.md
deleted file mode 100644
index f157a161be..0000000000
--- a/docs/usage-intelij.md
+++ /dev/null
@@ -1,73 +0,0 @@
-# Building Anserini using Intelij IDE
-
-Building the Anserini package within Intelij can be helpful for debugging.
-
-Steps to follow:
-
-
-To enable the Maven tool window, click on "View | Tool Windows | Maven" ([source](https://www.jetbrains.com/help/idea/maven-projects-tool-window.html))
-
-Press the toolbar buttons in order:
-- "Reload all maven projects"
-- "Generate sources and update folders"
-- "Download sources and documentation"
-- press the "Toggle skip tests" button
-
-To add a maven property, click on the wrench tool, select "Maven settings",
-select Runner in the ensuing maven dropdown; and add a new property by clicking the `+` button:
-`Name:javadoc.skip` `value:true`
-
-Select target "package" and run it.
-
-This should be equivalent to
-```$
-mvn clean package -DskipTests -Dmaven.javadoc.skip=true
-```
-
-# Import Maven project
-Import Maven project by following the instructions in
-https://www.jetbrains.com/idea/guide/tutorials/working-with-maven/importing-a-project/
-
-Set the Java SDK version to 11 per Anserini version requirement.
-
-# Setting InteliJ environment for source level debugging
-
-The instructions below shows how to configure InteliJ (from JetBrains) to build and debug Anserini.
-
-As an example, let's run IndexCollection. (path: anserini/index/indexCollection.java)
-The text below is based on InteliJ Ultimate 2021.1.1
-
-We need to configure the dependencies.
-
-Select File | Project Structure | Project Settings | Modules
-
-then choose the Dependencies tab
-
-Click `+` "Jars or directories..." and select the folder .../anserini/target/appassemblet/repo
-
-In the "Scope" column, choose "Compile"
-
-We don't want to compile the test code.
-Open the Sources tab (still in Modules), locate the TEST folders and remove them from the "Add Content Root"
-
-Now build anserini by choosing (default target) 'anserini' and menu Build | Build Project
-
-Open src/main/java/io/anserini/index/IndexCollection.java, put breakpoint in first line of main()
-
-Click the green triangle (on the gutter left to main() ) and choose "Debug 'indexCollection.main()' "
-
-It should stop on the breakpoint, then continue and the program will exit with error (missing required args)
-
-Open menu Run | "Edit Configurations..." | Program Arguments and add the args.
-
-# Final notes
-Using the IDE allows to easily follow code in threads, add breakpoints, view and modify data.
-This can be done in both Anserini and Lucene transparently since the Lucene source code is available.
-
-
-
-
-
-
-
-
diff --git a/src/main/java/io/anserini/eval/Qrels.java b/src/main/java/io/anserini/eval/Qrels.java
index fdf0393fd5..45ebee31cf 100644
--- a/src/main/java/io/anserini/eval/Qrels.java
+++ b/src/main/java/io/anserini/eval/Qrels.java
@@ -207,6 +207,7 @@ private static HashMap generateSymbolFileDict() {
     m.put("msmarco-v1-passage-dev", "qrels.msmarco-passage.dev-subset.txt");
     m.put("msmarco-passage.dev", "qrels.msmarco-passage.dev-subset.txt");
     m.put("msmarco-v1-passage.dev", "qrels.msmarco-passage.dev-subset.txt");
+    m.put("rag24.test-umbrela", "qrels.rag24.test-umbrela-all.txt");
     return m;
   }
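The prebuilt-indexes guide added in this change recommends symlinking `~/.cache/pyserini/indexes/` elsewhere, since the cache path is hard-coded. A minimal sketch of that workaround, using temporary stand-in directories so it is safe to run anywhere (the real link source would be `~/.cache/pyserini/indexes/` and the target a disk with sufficient space):

```shell
# Sketch of the symlink workaround for relocating the index cache.
# Temp directories stand in for ~/.cache/pyserini and a big scratch disk.
HOME_CACHE=$(mktemp -d)/pyserini   # stand-in for ~/.cache/pyserini
BIG_DISK=$(mktemp -d)              # stand-in for e.g. /scratch/indexes

mkdir -p "$HOME_CACHE"
ln -s "$BIG_DISK" "$HOME_CACHE/indexes"   # cache path now points at big disk

# Anything written "into the cache" actually lands on the big disk:
touch "$HOME_CACHE/indexes/demo-index.tar.gz"
ls "$BIG_DISK"   # -> demo-index.tar.gz
```

The same pattern works for sharing: one user holds the actual indexes, and each other user symlinks their cache path (or individual index directories, named per the schema in the guide) to that shared location.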