Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DC SDK - load pipeline from deepset cloud #2013

Merged
merged 28 commits into from
Jan 28, 2022
Merged

Conversation

ArzelaAscoIi
Copy link
Member

@ArzelaAscoIi ArzelaAscoIi commented Jan 17, 2022

Proposed changes:
Deepset cloud SDK - load, run and evaluate pipelines from deepset cloud. We already have the possibility to load an existing pipeline from a yaml configuration. As another origin we can now load pipelines from deepset cloud.

Usage
Set environment variables:

DEEPSET_CLOUD_API_ENDPOINT=<future-deepset-cloud-url>
DEEPSET_CLOUD_API_KEY=<my-dc-api-key>

and run the following lines:

test_query = Pipeline.load_from_dc(pipeline_config_name="document_retrieval_1",pipeline_name="query")

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

closes #2002

@ArzelaAscoIi ArzelaAscoIi marked this pull request as draft January 17, 2022 17:13
@tstadel
Copy link
Member

tstadel commented Jan 18, 2022

Hey @ArzelaAscoIi,
sorry for spoiling your PR. I should have made a pull request into your branch.
I introduced the concept of a pipeline_definition which is part of the pipeline_config (formerly known as yaml). pipeline_definition addresses exactly one pipeline.

I would suggest to keep the separation between pipeline_config and pipeline_definition in the load_from_dc method by specifying something like a pipeline_config_name or pipeline_config_id and the pipeline_name Then it feels much the same as load_from_yaml.
To emphasize the convention of having always one query and one indexing pipeline, we should mention this in the docstrings or even set the default of pipeline_name to "query". However this is just some API sugar. Under the hood Pipeline.load_from_config already takes the first pipeline if no pipeline_name has been specified and this should always be the query pipeline in DC.

@ArzelaAscoIi ArzelaAscoIi marked this pull request as ready for review January 19, 2022 14:41
@ArzelaAscoIi ArzelaAscoIi requested a review from tholor January 19, 2022 15:16
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Left a few minor comments - mainly around documentation.

One bigger conceptual question is probably whether to stick in the future with the "two pipelines in one YAML" design. I see more and more cases where it would make more sense to define a Pipeline as the combination of 1x index and 1x query "pipeline/flow/graph".
In that case, you would always load both here, and maybe we split pipeline.run() into pipeline.run() and pipeline.index()?
Probably out of scope for this issue, but we should discuss and decide on this in the next weeks.

@CLAassistant
Copy link

CLAassistant commented Jan 20, 2022

CLA assistant check
All committers have signed the CLA.

tstadel and others added 9 commits January 26, 2022 18:19
* minimal DCDocumentStore

* support filters

* implement get_documents_by_id

* handle not existing documents

* add docstrings

* auth added

* add tests

* generate docs

* Add latest docstring and tutorial changes

* add responses to dev dependencies

* fix tests

* support query() and quey_by_embedding()

* Add latest docstring and tutorial changes

* query tests added

* read api_key and api_endpoint from env

* Add latest docstring and tutorial changes

* support query() and quey_by_embedding()

* query tests added

* Add latest docstring and tutorial changes

* Add latest docstring and tutorial changes

* support dynamic similarity and return_embedding values

* Add latest docstring and tutorial changes

* adjust KeywordDocumentStore description

* refactoring

* Add latest docstring and tutorial changes

* implement get_document_count and raise on all not implemented methods

* Add latest docstring and tutorial changes

* don't use abbreviation DC in comments and errors

* Add latest docstring and tutorial changes

* docstring added to KeywordDocumentStore

* Add latest docstring and tutorial changes

* enhanced api key set

* split tests into two parts

* change setup.py in order to work around build cache

* added link

* Add latest docstring and tutorial changes

* rename DCDocumentStore to DeepsetCloudDocumentStore

* Add latest docstring and tutorial changes

* remove dc.py

* reinsert link to docs

* fix imports

* Add latest docstring and tutorial changes

* better test structure

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: ArzelaAscoIi <kristof.herrmann@rwth-aachen.de>
Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go!

@tstadel tstadel merged commit 7764b69 into master Jan 28, 2022
@tstadel tstadel deleted the pipelines-from-to-dc branch January 28, 2022 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pipeline.load_from_dc()
4 participants