Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Pipeline.components #2215

Merged
merged 7 commits into from
Feb 22, 2022
Merged

Fix Pipeline.components #2215

merged 7 commits into from
Feb 22, 2022

Conversation

tstadel
Copy link
Member

@tstadel tstadel commented Feb 17, 2022

Proposed changes:

  • Add components property to Pipeline
  • Improve Pipeline.get_document_store() by looking into BaseRetriever components' document_store field

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Tests added

fixes #1823

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice improvement, thanks! 🙂

However, testing looks a bit shallow. I'd add a couple of dedicated unit tests for self.components and self.get_document_store(). Being separate, they highlight where exactly the problem comes from, if/when it comes. Like this, if the test fails it's harder to understand what's wrong.

I think there are other corners of Pipeline that can benefit of this new method. Grepping base.py for ["component"] returns 13 places where it's used, probably many of them could be simplified. There's also an occurrence in standard_pipelines.py:

self.pipeline.add_node(name=node, component=graph.nodes[node]["component"], inputs=previous_node_name)

matches = list(
set([retriever.document_store for retriever in self.get_nodes_by_class(class_type=BaseRetriever)])
)

if len(matches) > 1:
raise Exception(f"Multiple Document Stores found in Pipeline: {matches}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not part of your changes, but why having more than a docstore is a problem? I can imagine forked pipelines with two retrievers and a JoinDocuments node before the Reader could make sense.

In addition: if the exception should stay, then it's a PipelineError 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the exception still makes sense in case there are really two different docstores in the game. In such a case the user has to dig out the needed docstore herself as we cannot give any general infos to identify them easily. Most of the time this should not be an issue in multi-retriever settings anyway: the case that you depicted should use only one instance of the docstore. That should be safely handled by the implicit unique operation executed by set(). But I'm gonna write a test for that case to ensure :-)

Regarding the PipelineError, yes I can understand it, but let's introduce this in a separate PR. I think this will be a bigger one to get it clean.

self.components: dict = {}

@property
def components(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor naming question: why components and not nodes? Users aren't aware of networkx's naming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use both notations in pipeline config: components define the "parts" of the pipeline as such. Nodes define the interplay of these components. Semantically components is the better name here.

# TODO make this generic for other pipelines with different naming
RETRIEVER = PIPELINE.get_node(name="Retriever")
DOCUMENT_STORE = RETRIEVER.document_store if RETRIEVER else None
DOCUMENT_STORE = PIPELINE.get_document_store()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Beautiful, thanks! 😄

@tstadel tstadel requested a review from ZanSara February 21, 2022 18:47
@tstadel
Copy link
Member Author

tstadel commented Feb 21, 2022

I've added some tests. Regarding pipeline.component I decided to make it a convenience property for clients only: pipeline internal code should not make use of it as it is only a facade on top of the graph. Anything that deals with the graph and its nodes internally should better use the graph.

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are just beautiful 🤩 I wish all of our unit tests would look like that!

You may want to make a local fixture out of stuff like DummyDocumentStore, DummyRetriever and Node before merge, but in either case it's good to go for me. 👍

@tstadel tstadel merged commit fe03ca7 into master Feb 22, 2022
@tstadel tstadel deleted the fix_pipeline_components branch February 22, 2022 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:pipeline type:bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pipeline.components is always empty
3 participants