Refactor Solr/ES scripts + docs (#1403)
lintool authored Nov 10, 2020
1 parent a7996d1 commit e19755b
Showing 4 changed files with 241 additions and 192 deletions.
71 changes: 37 additions & 34 deletions docs/elastirini.md
@@ -43,7 +43,8 @@ sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator
-es -es.index robust04 -threads 16 -input /path/to/disk45 -storePositions -storeDocvectors -storeRaw
```

We can then run the following command to replicate Anserini BM25 retrieval:
We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
Run the following command to replicate Anserini BM25 retrieval:

```bash
sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index robust04 \
@@ -54,78 +55,78 @@ sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index robust04 \
To evaluate effectiveness:

```bash
$ eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.es.robust04.bm25.topics.robust04.txt
$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.es.robust04.bm25.topics.robust04.txt
map all 0.2531
P_30 all 0.3102
```

## Indexing and Retrieval: MS MARCO Passage
## Indexing and Retrieval: Core18

We can replicate the [BM25 Baselines on MS MARCO (Passage)](experiments-msmarco-passage.md) results in a similar way.
First, set up the proper schema using [this config](../src/main/resources/elasticsearch/index-config.msmarco-passage.json):
We can replicate the [TREC Washington Post Corpus](regressions-core18.md) results in a similar way.
First, set up the proper schema using [this config](../src/main/resources/elasticsearch/index-config.core18.json):

```bash
cat src/main/resources/elasticsearch/index-config.msmarco-passage.json \
| curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/msmarco-passage' -d @-
cat src/main/resources/elasticsearch/index-config.core18.json \
| curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/core18' -d @-
```

Indexing:

```bash
sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-es -es.index msmarco-passage -threads 9 -input /path/to/msmarco-passage -storePositions -storeDocvectors -storeRaw
sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WashingtonPostGenerator \
-es -es.index core18 -threads 8 -input /path/to/WashingtonPost -storePositions -storeDocvectors -storeContents
```

We may need to wait a few minutes after indexing for the index to catch up before performing retrieval, otherwise wrong evaluation metrics are returned.
We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.
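
One way to make the wait less of a guess is to ask Elasticsearch to refresh the index and then poll the document count until it stops changing. The sketch below is only an illustration and is not part of this commit: it assumes the Elasticsearch `_refresh` and `_count` REST endpoints, reuses the `elastic:changeme` credentials and `localhost:9200` address from the commands in this document, and the helper names (`parse_count`, `es_doc_count`, `wait_for_index`) and the 30-second polling interval are hypothetical. Whether a stable count is enough for the metrics to match has not been verified here.

```bash
# Hypothetical helper: pull the numeric "count" field out of a _count JSON response on stdin.
parse_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: refresh the index, then print its current document count.
# Assumes the same host and credentials as the curl commands in this document.
es_doc_count() {
  index="$1"
  curl -s --user elastic:changeme -XPOST "localhost:9200/${index}/_refresh" > /dev/null
  curl -s --user elastic:changeme "localhost:9200/${index}/_count" | parse_count
}

# Hypothetical helper: poll until two consecutive counts agree, then print that count.
# The 30-second interval is an arbitrary choice for illustration.
wait_for_index() {
  index="$1"
  prev=-1
  while :; do
    cur="$(es_doc_count "$index")"
    if [ -n "$cur" ] && [ "$cur" = "$prev" ]; then
      break
    fi
    prev="$cur"
    sleep 30
  done
  echo "$cur"
}
```

For example, `wait_for_index core18` would block until two consecutive polls report the same document count, then print it.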

Retrieval:

```bash
sh target/appassembler/bin/SearchElastic -topicreader TsvString -es.index msmarco-passage \
-topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output runs/run.es.msmacro-passage.txt
sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index core18 \
-topics src/main/resources/topics-and-qrels/topics.core18.txt \
-output runs/run.es.core18.bm25.topics.core18.txt
```

Evaluation:

```bash
$ ./eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.es.msmacro-passage.txt
map all 0.1956
recall_1000 all 0.8573
$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.core18.txt runs/run.es.core18.bm25.topics.core18.txt
map all 0.2495
P_30 all 0.3567
```
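
To check a run against the expected numbers without reading the output by eye, the `trec_eval` output can be piped through a small filter. This is a minimal sketch, assuming the whitespace-separated `metric all value` layout shown above; `check_metric` is a hypothetical helper name, not something shipped with Anserini or `trec_eval`.

```bash
# Hypothetical helper: read trec_eval output on stdin and succeed only if the
# summary ("all") line for the given metric carries the expected value.
check_metric() {
  metric="$1"
  expected="$2"
  awk -v m="$metric" -v e="$expected" '
    # Summary lines look like: <metric> all <value>
    $1 == m && $2 == "all" { found = 1; ok = ($3 == e) }
    END {
      if (found && ok) exit 0
      exit 1
    }
  '
}
```

Usage might look like `tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.core18.txt runs/run.es.core18.bm25.topics.core18.txt | check_metric map 0.2495`, which exits non-zero if the `map` line is missing or differs from the expected value.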

## Indexing and Retrieval: Core18
## Indexing and Retrieval: MS MARCO Passage

We can replicate the [TREC Washington Post Corpus](regressions-core18.md) results in a similar way.
First, set up the proper schema using [this config](../src/main/resources/elasticsearch/index-config.core18.json):
We can replicate the [BM25 Baselines on MS MARCO (Passage)](experiments-msmarco-passage.md) results in a similar way.
First, set up the proper schema using [this config](../src/main/resources/elasticsearch/index-config.msmarco-passage.json):

```bash
cat src/main/resources/elasticsearch/index-config.core18.json \
| curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/core18' -d @-
cat src/main/resources/elasticsearch/index-config.msmarco-passage.json \
| curl --user elastic:changeme -XPUT -H 'Content-Type: application/json' 'localhost:9200/msmarco-passage' -d @-
```

Indexing:

```bash
sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WashingtonPostGenerator \
-es -es.index core18 -threads 8 -input /path/to/WashingtonPost -storePositions -storeDocvectors -storeContents
sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-es -es.index msmarco-passage -threads 9 -input /path/to/msmarco-passage -storePositions -storeDocvectors -storeRaw
```

We may need to wait a few minutes after indexing for the index to catch up before performing retrieval, otherwise wrong evaluation metrics are returned.
We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.

Retrieval:

```bash
sh target/appassembler/bin/SearchElastic -topicreader Trec -es.index core18 \
-topics src/main/resources/topics-and-qrels/topics.core18.txt \
-output runs/run.es.core18.bm25.topics.core18.txt
sh target/appassembler/bin/SearchElastic -topicreader TsvString -es.index msmarco-passage \
-topics src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt -output runs/run.es.msmacro-passage.txt
```

Evaluation:

```bash
$ eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.core18.txt runs/run.es.core18.bm25.topics.core18.txt
map all 0.2495
P_30 all 0.3567
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.es.msmacro-passage.txt
map all 0.1956
recall_1000 all 0.8573
```

## Indexing and Retrieval: MS MARCO Document
@@ -145,7 +146,7 @@ sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection -gene
-es -es.index msmarco-doc -threads 1 -input /path/to/msmarco-doc -storePositions -storeDocvectors -storeRaw
```

We may need to wait a few minutes after indexing for the index to catch up before performing retrieval, otherwise wrong evaluation metrics are returned.
We may need to wait a few minutes after indexing for the index to "catch up" before performing retrieval, otherwise the evaluation metrics may be off.

Retrieval:

@@ -159,14 +160,14 @@ This can take potentially longer than `SearchCollection` with Lucene indexes.
Evaluation:

```bash
$ ./eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.es.msmacro-doc.txt
$ tools/eval/trec_eval.9.0.4/trec_eval -c -m recall.1000 -m map src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.es.msmacro-doc.txt
map all 0.2308
recall_1000 all 0.8856
```

## Elasticsearch Integration Test

We have an end-to-end integration testing script `run_es_regression.py` for [Core18](regressions-core18.md), [Robust04](regressions-robust04.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md). Its functionalities are described below.
We have an end-to-end integration testing script `run_es_regression.py` for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), [MS MARCO passage](regressions-msmarco-passage.md) and [MS MARCO document](regressions-msmarco-doc.md):

```
# Check if Elasticsearch server is on
@@ -186,12 +187,14 @@ python src/main/python/run_es_regression.py --evaluate [collection]
python src/main/python/run_es_regression.py --regression [collection] --input [directory]
```

For the `collection` meta-parameter, use `robust04`, `core18`, `msmarco-passage`, or `msmarco-doc`, for each of the collections above, respectively.

## Replication Log

+ Results replicated by [@nikhilro](https://github.com/nikhilro) on 2020-01-26 (commit [`d5ee069`](https://github.com/castorini/anserini/commit/d5ee069399e6a306d7685bda756c1f19db721156)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md)
+ Results replicated by [@edwinzhng](https://github.com/edwinzhng) on 2020-01-26 (commit [`7b76dfb`](https://github.com/castorini/anserini/commit/7b76dfbea7e0c01a3a5dc13e74f54852c780ec9b)) for both [MS MARCO Passage](experiments-msmarco-passage.md) and [Robust04](regressions-robust04.md)
+ Results replicated by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-robust04.md) and [core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
+ Results replicated by [@HangCui0510](https://github.com/HangCui0510) on 2020-04-29 (commit [`07a9b05`](https://github.com/castorini/anserini/commit/07a9b053173637e15be79de4e7fce4d5a93d04fe)) for [MS Marco Passage](regressions-msmarco-passage.md), [Robust04](regressions-robust04.md) and [Core18](regressions-core18.md) using end-to-end [`run_es_regression`](../src/main/python/run_es_regression.py)
+ Results replicated by [@shaneding](https://github.com/shaneding) on 2020-05-25 (commit [`1de3274`](https://github.com/castorini/anserini/commit/1de3274b057a63382534c5277ffcd772c3fc0d43)) for [MS Marco Passage](regressions-msmarco-passage.md)
+ Results replicated by [@adamyy](https://github.com/adamyy) on 2020-05-29 (commit [`94893f1`](https://github.com/castorini/anserini/commit/94893f170e047d77c3ef5b8b995d7fbdd13f4298)) for [MS MARCO Passage](regressions-msmarco-passage.md), [MS MARCO Document](experiments-msmarco-doc.md)
+ Results replicated by [@YimingDou](https://github.com/YimingDou) on 2020-05-29 (commit [`2947a16`](https://github.com/castorini/anserini/commit/2947a1622efae35637b83e321aba8e6fccd43489)) for [MS MARCO Passage](regressions-msmarco-passage.md)
+ Results replicated by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04](regressions-robust04.md), [core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md)
+ Results replicated by [@yxzhu16](https://github.com/yxzhu16) on 2020-07-17 (commit [`fad12be`](https://github.com/castorini/anserini/commit/fad12be2e37a075100707c3a674eb67bc0aa57ef)) for [Robust04](regressions-robust04.md), [Core18](regressions-core18.md), and [MS MARCO Passage](regressions-msmarco-passage.md)