
Add PineconeDocumentStore #2254

Merged · Mar 21, 2022 · 80 commits
827d6f8
added core install and functionality of pinecone doc store (init, ups…
jamescalam Feb 5, 2022
d9702f8
Merge branch 'master' of github.com:deepset-ai/haystack into pinecone…
jamescalam Feb 13, 2022
c4d21a9
implemented core functionality of Pinecone doc store
jamescalam Feb 13, 2022
2699808
Merge branch 'deepset-ai:master' into pinecone-doc-store
jamescalam Feb 14, 2022
925630f
merge with upstream master
jamescalam Feb 24, 2022
4eff62b
Update Documentation & Code Style
github-actions[bot] Feb 24, 2022
15debe7
updated filtering to use Haystack filtering and reduced default batch…
jamescalam Feb 26, 2022
42c9510
added haystack filter conversion and reduced default batch_size
jamescalam Feb 26, 2022
9657c48
Update Documentation & Code Style
github-actions[bot] Feb 26, 2022
50202bb
merge pinecone-doc-store with upstream changes
jamescalam Feb 27, 2022
92e4a0c
Merge branch 'pinecone-doc-store' of github.com:pinecone-io/haystack …
jamescalam Feb 27, 2022
5e06721
removed debugging code
jamescalam Feb 27, 2022
a7ed9ba
updated Pinecone filtering to use filter_utils
jamescalam Mar 2, 2022
dc2b80f
removed uneeded methods and minor tweaks to current methods
jamescalam Mar 2, 2022
227da6d
fixed typing issues
jamescalam Mar 2, 2022
384391b
Update Documentation & Code Style
github-actions[bot] Mar 2, 2022
bd9c355
Allow filters in al methods except get_embedding_count
bogdankostic Mar 2, 2022
c9823f2
Fix skipping document store tests
bogdankostic Mar 2, 2022
b124925
Update Documentation & Code Style
github-actions[bot] Mar 2, 2022
8526342
Fix handling of Milvus1 and Milvus2 in tests
bogdankostic Mar 2, 2022
632832f
Merge remote-tracking branch 'origin/fix_docstore_tests' into fix_doc…
bogdankostic Mar 2, 2022
ebf5344
Update Documentation & Code Style
github-actions[bot] Mar 2, 2022
422f82e
Fix handling of Milvus1 and Milvus2 in tests
bogdankostic Mar 2, 2022
b63f1b1
Merge remote-tracking branch 'origin/fix_docstore_tests' into fix_doc…
bogdankostic Mar 2, 2022
66f7dd0
Update Documentation & Code Style
github-actions[bot] Mar 2, 2022
1bcf759
Remove SQL from tests requiring embeddings
bogdankostic Mar 3, 2022
2da29ae
Merge remote-tracking branch 'origin/fix_docstore_tests' into fix_doc…
bogdankostic Mar 3, 2022
34c29ff
Update Documentation & Code Style
github-actions[bot] Mar 3, 2022
4cc08c4
Fix get_embedding_count of Milvus2
bogdankostic Mar 3, 2022
2b91993
Make sure to start Milvus2 tests with a new collection
bogdankostic Mar 3, 2022
e8eafa0
Merge branch 'fix_docstore_tests' into pr/2254
bogdankostic Mar 3, 2022
e8aa414
Add pinecone to test suite
bogdankostic Mar 4, 2022
581f298
Merge remote-tracking branch 'origin/master' into pr/2254
bogdankostic Mar 4, 2022
4f28bb3
Update Documentation & Code Style
github-actions[bot] Mar 6, 2022
b44dc33
Fix typing
bogdankostic Mar 6, 2022
9948b30
Merge remote-tracking branch 'github-desktop-pinecone-io/pinecone-doc…
bogdankostic Mar 6, 2022
c80246d
Update Documentation & Code Style
github-actions[bot] Mar 6, 2022
e0995ad
Add pinecone to docstores dependendcy
bogdankostic Mar 6, 2022
8966922
Merge remote-tracking branch 'origin/master' into pr/2254
bogdankostic Mar 7, 2022
8efa2b4
Merge remote-tracking branch 'origin/master' into pr/2254
bogdankostic Mar 10, 2022
71ba098
Add PineconeDocStore to API Documentation
bogdankostic Mar 10, 2022
2cc268d
Add missing comma
bogdankostic Mar 10, 2022
d635baa
Update Documentation & Code Style
github-actions[bot] Mar 10, 2022
5cc6df6
Adapt format of doc strings
bogdankostic Mar 10, 2022
476dad2
Merge remote-tracking branch 'github-desktop-pinecone-io/pinecone-doc…
bogdankostic Mar 10, 2022
7bb62b9
Update Documentation & Code Style
github-actions[bot] Mar 10, 2022
9df1436
Set API key as environment variable
bogdankostic Mar 10, 2022
e0ec8b6
Merge remote-tracking branch 'github-desktop-pinecone-io/pinecone-doc…
bogdankostic Mar 10, 2022
e2e5da4
Skip Pinecone tests in forks
bogdankostic Mar 14, 2022
97ce7e6
Add sleep after deleting index
bogdankostic Mar 14, 2022
9c967f6
Add sleep after deleting index
bogdankostic Mar 14, 2022
3bb7751
Add sleep after creating index
bogdankostic Mar 14, 2022
58550e9
Add check if index ready
bogdankostic Mar 14, 2022
b0ae486
Remove printing of index stats
bogdankostic Mar 14, 2022
4ade615
Create new index for each pinecone test
bogdankostic Mar 14, 2022
971abc5
Use RestAPI instead of Python API for describe_index_stats
bogdankostic Mar 14, 2022
05e8cfd
Fix accessing describe_index_stats
bogdankostic Mar 14, 2022
c4b956e
Remove usages of describe_index_stats
bogdankostic Mar 14, 2022
1795471
Run pinecone tests separately
bogdankostic Mar 14, 2022
2c7c3c7
Update Documentation & Code Style
github-actions[bot] Mar 14, 2022
50773e4
Add pdftotext to pinecone tests
bogdankostic Mar 14, 2022
fce96e2
Merge remote-tracking branch 'origin/pinecone_test' into pinecone_test
bogdankostic Mar 14, 2022
220f913
Remove sleep from doc store fixture
bogdankostic Mar 14, 2022
54f672f
Add describe_index_stats
bogdankostic Mar 14, 2022
fe5035a
Remove unused imports
bogdankostic Mar 15, 2022
2f0b383
Use pull_request_target trigger
bogdankostic Mar 15, 2022
452cc16
Revert use pull_request_target trigger
bogdankostic Mar 15, 2022
d6afb43
Merge remote-tracking branch 'origin/master' into pr/2254
bogdankostic Mar 15, 2022
a904ff2
Merge remote-tracking branch 'origin/master' into pr/2254
bogdankostic Mar 21, 2022
8822e9e
Remove set_config
bogdankostic Mar 21, 2022
7e9af75
Add os to conftest
bogdankostic Mar 21, 2022
c88a305
Integrate review comments
bogdankostic Mar 21, 2022
77fb60a
Set include_values to False
bogdankostic Mar 21, 2022
a2eb6cd
Remove quotation marks from pinecone.Index type
bogdankostic Mar 21, 2022
756f298
Update Documentation & Code Style
github-actions[bot] Mar 21, 2022
2c6ff82
Merge branch 'master' into pinecone-doc-store
bogdankostic Mar 21, 2022
e95c556
Merge remote-tracking branch 'origin/master' into pr/2254
bogdankostic Mar 21, 2022
8bdb974
Merge remote-tracking branch 'github-desktop-pinecone-io/pinecone-doc…
bogdankostic Mar 21, 2022
d9ef404
Update Documentation & Code Style
github-actions[bot] Mar 21, 2022
cbd8c16
Fix number of args in error messages
bogdankostic Mar 21, 2022
35 changes: 35 additions & 0 deletions .github/workflows/linux_ci.yml
@@ -298,6 +298,8 @@ jobs:
pip install ui/

- name: Run tests
env:
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
run: pytest -s ${{ matrix.test-path }}


@@ -346,3 +348,36 @@ jobs:
run: |
export MILVUS1_ENABLED=1
pytest -s test/test_document_store.py test/test_eval.py test/test_faiss_and_milvus.py test/test_pipeline.py test/test_retriever.py test/test_standard_pipelines.py --document_store_type="milvus1"

test-pinecone:
needs: build-cache
runs-on: ubuntu-20.04

steps:
- uses: actions/checkout@v2
- run: echo "date=$(date +'%Y-%m-%d')" >> $GITHUB_ENV

- name: Set up Python 3.7
uses: actions/setup-python@v2
with:
python-version: 3.7

- name: Cache Python
uses: actions/cache@v2
with:
path: ${{ env.pythonLocation }}
key: linux-${{ env.date }}-${{ hashFiles('**/setup.py') }}-${{ hashFiles('**/setup.cfg') }}-${{ hashFiles('**/pyproject.toml') }}

- name: Install pdftotext
run: wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz && tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

# Haystack needs to be reinstalled at this stage to make sure the current commit's version is the one getting tested.
# The cache can last way longer than a specific action's run, so older Haystack version could be carried over.
- name: Reinstall Haystack
run: |
pip install .[test]

- name: Run tests
env:
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
run: pytest -s test/test_document_store.py test/test_pipeline.py test/test_standard_pipelines.py test/test_pipeline_extractive_qa.py --document_store_type="pinecone"
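The job injects the API key into the test run via the `PINECONE_API_KEY` environment variable. Inside the test suite it would be read along these lines (an illustrative sketch only — the helper name is hypothetical, and the real lookup lives in `conftest.py`):

```python
import os


def get_pinecone_api_key(environ=os.environ) -> str:
    """Read the key the CI job injects; fail loudly if it is missing."""
    key = environ.get("PINECONE_API_KEY")
    if key is None:
        raise RuntimeError("Set PINECONE_API_KEY to run the Pinecone tests")
    return key


print(get_pinecone_api_key({"PINECONE_API_KEY": "dummy-key"}))  # dummy-key
```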
331 changes: 331 additions & 0 deletions docs/_src/api/api/document_store.md
@@ -4120,6 +4120,337 @@ exists.

None

<a id="pinecone"></a>

# Module pinecone

<a id="pinecone.PineconeDocumentStore"></a>

## PineconeDocumentStore

```python
class PineconeDocumentStore(SQLDocumentStore)
```

Document store for very large-scale, embedding-based dense retrievers such as DPR. This is a hosted document
store: your vectors are not stored locally but in the cloud, and similarity search runs in the cloud as well.

It uses the Pinecone vector database ([https://www.pinecone.io](https://www.pinecone.io))
to perform similarity search on vectors. To use this document store, you need an API key, which you can
obtain by creating an account on the [Pinecone website](https://www.pinecone.io).

The document text is stored using the SQLDocumentStore, while
the vector embeddings and metadata (for filtering) are indexed in a Pinecone Index.

<a id="pinecone.PineconeDocumentStore.__init__"></a>

#### \_\_init\_\_

```python
def __init__(api_key: str, environment: str = "us-west1-gcp", sql_url: str = "sqlite:///pinecone_document_store.db", pinecone_index: Optional[pinecone.Index] = None, embedding_dim: int = 768, return_embedding: bool = False, index: str = "document", similarity: str = "cosine", replicas: int = 1, shards: int = 1, embedding_field: str = "embedding", progress_bar: bool = True, duplicate_documents: str = "overwrite")
```

**Arguments**:

- `api_key`: Pinecone vector database API key ([https://app.pinecone.io](https://app.pinecone.io)).
- `environment`: Pinecone cloud environment; defaults to `"us-west1-gcp"`. Other GCP and AWS regions are
supported; contact Pinecone [here](https://www.pinecone.io/contact/) if required.
- `sql_url`: SQL connection URL for the database. Defaults to a local file-based SQLite database. For
large-scale deployments, Postgres is recommended.
- `pinecone_index`: pinecone-client `Index` object; if not specified, an index is initialized or loaded.
- `embedding_dim`: The embedding vector size.
- `return_embedding`: Whether to return document embeddings.
- `index`: Name of index in document store to use.
- `similarity`: The similarity function used to compare document vectors. `"cosine"` is the default and
is recommended if you are using a Sentence-Transformer model. `"dot_product"` is more performant
with DPR embeddings.
In both cases, the returned values in Document.score are normalized to be in range [0,1]:
- For `"dot_product"`: `expit(np.asarray(raw_score / 100))`
- For `"cosine"`: `(raw_score + 1) / 2`
- `replicas`: The number of replicas. Replicas duplicate the index. They provide higher availability and
throughput.
- `shards`: The number of shards to be used in the index. We recommend using one shard per 1 GB of data.
- `embedding_field`: Name of field containing an embedding vector.
- `progress_bar`: Whether to show a tqdm progress bar or not.
Can be helpful to disable in production deployments to keep the logs clean.
- `duplicate_documents`: How to handle duplicate documents.
Parameter options:
- `"skip"`: Ignore the duplicate documents.
- `"overwrite"`: Update any existing documents with the same ID when adding documents.
- `"fail"`: An error is raised if the document ID of the document being added already exists.
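The two normalization formulas for `Document.score` quoted above can be checked in isolation (a standalone sketch; `expit` here reimplements the logistic sigmoid from `scipy.special`):

```python
import numpy as np


def expit(x):
    """Logistic sigmoid, equivalent to scipy.special.expit."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


def normalize_score(raw_score: float, similarity: str) -> float:
    """Map a raw Pinecone score into [0, 1] using the formulas above."""
    if similarity == "dot_product":
        return float(expit(np.asarray(raw_score / 100)))
    if similarity == "cosine":
        return (raw_score + 1) / 2
    raise ValueError(f"Unsupported similarity: {similarity}")


print(normalize_score(1.0, "cosine"))       # 1.0
print(normalize_score(-1.0, "cosine"))      # 0.0
print(normalize_score(0.0, "dot_product"))  # 0.5
```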

<a id="pinecone.PineconeDocumentStore.write_documents"></a>

#### write\_documents

```python
def write_documents(documents: Union[List[dict], List[Document]], index: Optional[str] = None, batch_size: int = 32, duplicate_documents: Optional[str] = None, headers: Optional[Dict[str, str]] = None)
```

Add new documents to the DocumentStore.

**Arguments**:

- `documents`: List of `Dicts` or list of `Documents`. If they already contain embeddings, we'll index them
right away in Pinecone. If not, you can later call `update_embeddings()` to create & index them.
- `index`: Index name for storing the docs and metadata.
- `batch_size`: Number of documents to process at a time. When working with a large number of documents,
batching can help reduce the memory footprint.
- `duplicate_documents`: How to handle duplicate documents.
Parameter options:
- `"skip"`: Ignore the duplicate documents.
- `"overwrite"`: Update any existing documents with the same ID when adding documents.
- `"fail"`: An error is raised if the document ID of the document being added already exists.
- `headers`: PineconeDocumentStore does not support headers.

**Raises**:

- `DuplicateDocumentError`: Exception raised when a duplicate document is added.
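The three `duplicate_documents` modes can be illustrated with a plain-Python sketch of the semantics (an illustration only, not the actual implementation):

```python
def apply_duplicate_policy(store: dict, documents: list, policy: str) -> dict:
    """Apply 'skip' / 'overwrite' / 'fail' semantics to docs keyed by 'id'."""
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in store:
            if policy == "skip":
                continue  # ignore the duplicate
            if policy == "fail":
                raise ValueError(f"Duplicate document id: {doc_id}")
            # 'overwrite': fall through and replace the stored document
        store[doc_id] = doc
    return store


store = {"a": {"id": "a", "content": "old"}}
apply_duplicate_policy(store, [{"id": "a", "content": "new"}], "skip")
print(store["a"]["content"])  # old
apply_duplicate_policy(store, [{"id": "a", "content": "new"}], "overwrite")
print(store["a"]["content"])  # new
```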

<a id="pinecone.PineconeDocumentStore.update_embeddings"></a>

#### update\_embeddings

```python
def update_embeddings(retriever: "BaseRetriever", index: Optional[str] = None, update_existing_embeddings: bool = True, filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, batch_size: int = 32)
```

Updates the embeddings in the document store using the encoding model specified in the retriever.

This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the
retriever config).

**Arguments**:

- `retriever`: Retriever to use to get embeddings for text.
- `index`: Index name for which embeddings are to be updated. If set to `None`, the default `self.index` is
used.
- `update_existing_embeddings`: Whether to update the existing embeddings of the documents. If set to `False`,
only documents without embeddings are processed. This mode can be used for incremental updating of
embeddings, wherein only newly indexed documents are processed.
- `filters`: Optional filters to narrow down the documents for which embeddings are to be updated.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.
__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
```
- `batch_size`: Number of documents to process at a time. When working with a large number of documents,
batching can help reduce the memory footprint.
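The comparison-operator defaulting described above (`"$eq"` for single values, `"$in"` for lists) can be sketched as a small normalizer. This illustrates the semantics only; it is not Haystack's actual `filter_utils` code:

```python
LOGICAL_OPS = {"$and", "$or", "$not"}


def normalize_filters(filters: dict) -> dict:
    """Make the implicit default comparison operators explicit."""
    normalized = {}
    for key, value in filters.items():
        if key in LOGICAL_OPS:
            # Logical operators take a dict (or a list of dicts) of clauses
            if isinstance(value, list):
                normalized[key] = [normalize_filters(v) for v in value]
            else:
                normalized[key] = normalize_filters(value)
        elif isinstance(value, dict):
            # Field already carries explicit operators, e.g. {"$gte": 3}
            normalized[key] = value
        elif isinstance(value, list):
            normalized[key] = {"$in": value}  # default for list values
        else:
            normalized[key] = {"$eq": value}  # default for single values
    return normalized


print(normalize_filters({"type": "article", "genre": ["economy", "politics"]}))
```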

<a id="pinecone.PineconeDocumentStore.get_all_documents_generator"></a>

#### get\_all\_documents\_generator

```python
def get_all_documents_generator(index: Optional[str] = None, filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, return_embedding: Optional[bool] = None, batch_size: int = 32, headers: Optional[Dict[str, str]] = None) -> Generator[Document, None, None]
```

Get all documents from the document store. Under the hood, documents are fetched in batches from the
document store and yielded as individual documents. This method can be used to iteratively process
a large number of documents without having to load all documents in memory.

**Arguments**:

- `index`: Name of the index to get the documents from. If None, the
DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the documents for which embeddings are to be updated.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.
__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
```
- `return_embedding`: Whether to return the document embeddings.
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.
- `headers`: PineconeDocumentStore does not support headers.
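The fetch-in-batches, yield-one-at-a-time pattern this method uses can be sketched generically (`fetch_batch` is a simplified stand-in for the real document store call):

```python
from typing import Generator, List


def fetch_batch(all_docs: List[dict], offset: int, batch_size: int) -> List[dict]:
    """Stand-in for one batched fetch from the document store."""
    return all_docs[offset : offset + batch_size]


def get_all_documents_generator(all_docs, batch_size: int = 32) -> Generator[dict, None, None]:
    offset = 0
    while True:
        batch = fetch_batch(all_docs, offset, batch_size)
        if not batch:
            return
        for doc in batch:  # yield individually: only one batch held in memory
            yield doc
        offset += batch_size


docs = [{"id": i} for i in range(100)]
print(sum(1 for _ in get_all_documents_generator(docs, batch_size=32)))  # 100
```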

<a id="pinecone.PineconeDocumentStore.get_embedding_count"></a>

#### get\_embedding\_count

```python
def get_embedding_count(index: Optional[str] = None, filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None) -> int
```

Return the count of embeddings in the document store.

<a id="pinecone.PineconeDocumentStore.update_document_meta"></a>

#### update\_document\_meta

```python
def update_document_meta(id: str, meta: Dict[str, str], index: str = None)
```

Update the metadata dictionary of a document by specifying its string ID.

<a id="pinecone.PineconeDocumentStore.delete_documents"></a>

#### delete\_documents

```python
def delete_documents(index: Optional[str] = None, ids: Optional[List[str]] = None, filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, headers: Optional[Dict[str, str]] = None)
```

Delete documents from the document store.

**Arguments**:

- `index`: Index name to delete the documents from. If `None`, the DocumentStore's default index
(`self.index`) will be used.
- `ids`: Optional list of IDs to narrow down the documents to be deleted.
- `filters`: Optional filters to narrow down the documents for which embeddings are to be updated.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.
__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
```
- `headers`: PineconeDocumentStore does not support headers.
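A sketch of how `ids` and `filters` narrow the deletion set, assuming they combine as an intersection (the filter check below is a simplified equality-only version for illustration):

```python
def select_for_deletion(docs, ids=None, filters=None):
    """Return the ids of documents to delete; ids and filters narrow jointly."""
    selected = []
    for doc in docs:
        if ids is not None and doc["id"] not in ids:
            continue
        if filters is not None:
            meta = doc.get("meta", {})
            # Simplified: plain equality on each metadata field
            if not all(meta.get(k) == v for k, v in filters.items()):
                continue
        selected.append(doc["id"])
    return selected


docs = [
    {"id": "1", "meta": {"type": "article"}},
    {"id": "2", "meta": {"type": "blog"}},
    {"id": "3", "meta": {"type": "article"}},
]
print(select_for_deletion(docs, ids=["1", "2"], filters={"type": "article"}))  # ['1']
```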

<a id="pinecone.PineconeDocumentStore.query_by_embedding"></a>

#### query\_by\_embedding

```python
def query_by_embedding(query_emb: np.ndarray, filters: Optional[Dict[str, Union[Dict, List, str, int, float, bool]]] = None, top_k: int = 10, index: Optional[str] = None, return_embedding: Optional[bool] = None, headers: Optional[Dict[str, str]] = None) -> List[Document]
```

Find the documents that are most similar to the provided `query_emb` by using a vector similarity metric.

**Arguments**:

- `query_emb`: Embedding of the query (e.g. gathered from DPR).
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain
conditions.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical
operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`,
`"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as
value. Metadata field names take a dictionary of comparison operators as value. Comparison
operator keys take a single value or (in case of `"$in"`) a list of values as value.
If no logical operator is provided, `"$and"` is used as default operation. If no comparison
operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default
operation.
__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
# or simpler using default operators
filters = {
"type": "article",
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": ["economy", "politics"],
"publisher": "nytimes"
}
}
```
To use the same logical operator multiple times on the same level, logical operators take
optionally a list of dictionaries as value.
__Example__:
```python
filters = {
"$or": [
{
"$and": {
"Type": "News Paper",
"Date": {
"$lt": "2019-01-01"
}
}
},
{
"$and": {
"Type": "Blog Post",
"Date": {
"$gte": "2019-01-01"
}
}
}
]
}
```
- `top_k`: How many documents to return.
- `index`: The name of the index from which to retrieve documents.
- `return_embedding`: Whether to return the document embeddings.
- `headers`: PineconeDocumentStore does not support headers.
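Conceptually, the vector search behind `query_by_embedding` scores every stored vector against the query and keeps the top `top_k`. A numpy sketch of that idea (Pinecone does this server-side with an approximate index, not a brute-force scan):

```python
import numpy as np


def cosine_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 10):
    """Return (indices, normalized scores) of the top_k most similar docs."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    raw = d @ q                       # cosine similarity in [-1, 1]
    idx = np.argsort(-raw)[:top_k]
    return idx, (raw[idx] + 1) / 2    # normalized to [0, 1] as described above


rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 8))
query = docs[42] + 0.01 * rng.normal(size=8)  # a query very close to doc 42
idx, scores = cosine_top_k(query, docs, top_k=3)
print(int(idx[0]))  # 42
```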

<a id="pinecone.PineconeDocumentStore.load"></a>

#### load

```python
@classmethod
def load(cls)
```

Default class method used for loading indexes. Not applicable to the PineconeDocumentStore.

<a id="utils"></a>

# Module utils