Add Ray integration for Pipelines #1255

Merged
merged 31 commits into from
Aug 2, 2021

@oryx1729 oryx1729 commented Jul 6, 2021

Ray Integration

This PR is an initial prototype of how Haystack Pipelines can be scaled horizontally and distributed across a cluster of machines using the Ray framework.

Design

The design goal is to provide a standardized interface for Pipelines similar to the Haystack Components. A new BasePipeline class is added that is extended by Pipeline and RayPipeline. This allows users to switch between "execution engines" for the Pipelines while retaining the configuration for the Components.
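As a rough illustration of the "execution engine" idea, here is a minimal, self-contained sketch. It is not the code from this PR: the class names (BasePipelineSketch, LocalPipelineSketch, RemotePipelineSketch), the toy components, and the _call_remote helper are all hypothetical, and nothing here touches Ray. It only shows how two engines can share one interface and one set of components.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List

# Hypothetical stand-in: a "component" is any callable from dict to dict.
Component = Callable[[dict], dict]


class BasePipelineSketch(ABC):
    """Shared interface; subclasses differ only in how nodes are executed."""

    def __init__(self, components: Dict[str, Component], order: List[str]):
        self.components = components
        self.order = order

    @abstractmethod
    def run(self, query: str) -> dict:
        """Run the query through all nodes and return the final output."""


class LocalPipelineSketch(BasePipelineSketch):
    """Calls each component in-process, one after another."""

    def run(self, query: str) -> dict:
        data = {"query": query}
        for name in self.order:
            data = self.components[name](data)
        return data


class RemotePipelineSketch(BasePipelineSketch):
    """Placeholder for a distributed engine (assumption: no real cluster here)."""

    def _call_remote(self, name: str, data: dict) -> dict:
        # A real engine would send `data` to a remote worker; this sketch
        # simply calls the component directly.
        return self.components[name](data)

    def run(self, query: str) -> dict:
        data = {"query": query}
        for name in self.order:
            data = self._call_remote(name, data)
        return data


def retriever(data: dict) -> dict:
    return {**data, "documents": ["doc about " + data["query"]]}


def reader(data: dict) -> dict:
    return {**data, "answer": data["documents"][0].upper()}


# The same component configuration is reused by both engines.
components = {"Retriever": retriever, "Reader": reader}
order = ["Retriever", "Reader"]
local_result = LocalPipelineSketch(components, order).run("ray")
remote_result = RemotePipelineSketch(components, order).run("ray")
```

Because both classes accept the same components and expose the same run() method, swapping the engine is a one-line change for the caller.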

The new RayPipeline has methods similar to those of the Pipeline class. With the current implementation, a RayPipeline can be instantiated locally or on a remote cluster from a Pipeline YAML configuration. Ray provides a handle that allows easy integration of deployed Pipelines with Python code or REST APIs.

Ray allows independently scaling the components by adding more replicas. For instance, in a typical extractive QA Pipeline, multiple readers can be added in front of a single Elasticsearch instance. The number of replicas can be configured by the new replicas parameter for each component in the Pipeline.
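To make the effect of a per-component replicas setting concrete, the following toy sketch spreads calls over several copies of one component. Everything in it is an assumption for illustration: ReplicatedNodeSketch and make_reader_factory are hypothetical names, and the round-robin dispatch is a simple stand-in, since in the real integration Ray itself decides which replica handles a request.

```python
from itertools import cycle
from typing import Callable

Reader = Callable[[str], str]


class ReplicatedNodeSketch:
    """Round-robins calls over `replicas` independent copies of a component."""

    def __init__(self, factory: Callable[[], Reader], replicas: int = 1):
        if replicas < 1:
            raise ValueError("replicas must be at least 1")
        # Each replica is a separate instance built by the same factory.
        self._instances = [factory() for _ in range(replicas)]
        self._next_instance = cycle(self._instances)

    def __call__(self, query: str) -> str:
        return next(self._next_instance)(query)


def make_reader_factory() -> Callable[[], Reader]:
    """Returns a factory whose readers are numbered in creation order."""
    counter = {"n": 0}

    def factory() -> Reader:
        counter["n"] += 1
        replica_id = counter["n"]

        def reader(query: str) -> str:
            return f"reader-{replica_id} answered {query!r}"

        return reader

    return factory


# Hypothetical per-node settings: one retriever, three readers.
node_settings = {"Retriever": {"replicas": 1}, "Reader": {"replicas": 3}}

reader_node = ReplicatedNodeSketch(
    make_reader_factory(), replicas=node_settings["Reader"]["replicas"]
)
answers = [reader_node("q") for _ in range(4)]
```

With three replicas, the fourth call wraps around to the first reader again, which is why only the expensive component needs to be scaled while the retriever stays at a single instance.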

Implementation

Each node in a Pipeline is wrapped in its own Ray Deployment. Pipeline.run() calls the "graph" of deployments to get the results for an input query (or, for indexing Pipelines, to index an uploaded file).

To create a deployment, the component type is passed to Ray together with the YAML configuration. The deployment then creates an instance of the component using parameters from the pipeline config.
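The pattern described above, where the deployment builds its own component instead of receiving a ready-made object, can be sketched without Ray as follows. This is an illustrative assumption rather than the PR's code: DeploymentSketch, DummyRetriever, COMPONENT_REGISTRY, and the config layout (a "components" list with "name", "type", and "params") are hypothetical, and the sketch identifies the component by name and reads its type from the config.

```python
from typing import Any, Dict


class DummyRetriever:
    """Toy component; stands in for a real retriever."""

    def __init__(self, top_k: int = 10):
        self.top_k = top_k

    def run(self, query: str) -> Dict[str, Any]:
        return {"query": query, "top_k": self.top_k}


# Hypothetical lookup from the `type` string in the config to a class.
COMPONENT_REGISTRY = {"DummyRetriever": DummyRetriever}


class DeploymentSketch:
    """Builds its component from the config when the deployment starts."""

    def __init__(self, pipeline_config: dict, component_name: str):
        definition = next(
            c for c in pipeline_config["components"] if c["name"] == component_name
        )
        component_class = COMPONENT_REGISTRY[definition["type"]]
        # Parameters come from the pipeline config, not from the caller.
        self.node = component_class(**definition.get("params", {}))

    def __call__(self, query: str) -> Dict[str, Any]:
        return self.node.run(query)


# Assumed shape of the parsed configuration.
config = {
    "components": [
        {"name": "Retriever", "type": "DummyRetriever", "params": {"top_k": 5}}
    ]
}

deployment = DeploymentSketch(config, "Retriever")
result = deployment("what is ray?")
```

One plausible benefit of this design is that only plain configuration data has to travel to the worker, so heavyweight objects such as loaded models are created where they will run.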

Breaking Changes

The current YAML configuration has a type attribute for pipelines that specifies whether a pipeline is a Query or an Indexing pipeline. This conflicts with the type parameter for components, where it specifies the component class.

With this PR, the type now determines the class to use – Pipeline or RayPipeline.

The pipeline_type parameter is now replaced by root_node.
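For configurations written before this change, a loader could detect the old meaning of type and fail with a clear message. The sketch below is a hypothetical helper, not part of this PR: resolve_engine is an invented name, and the legacy strings "Query" and "Indexing" are assumed spellings of the old values.

```python
# Values accepted after this PR, per the description above.
ENGINE_TYPES = {"Pipeline", "RayPipeline"}
# Assumed spellings of the old values that described the pipeline's purpose.
LEGACY_TYPES = {"Query", "Indexing"}


def resolve_engine(pipeline_definition: dict) -> str:
    """Return the pipeline class name, rejecting pre-change `type` values."""
    pipeline_type = pipeline_definition.get("type")
    if pipeline_type in ENGINE_TYPES:
        return pipeline_type
    if pipeline_type in LEGACY_TYPES:
        raise ValueError(
            f"type={pipeline_type!r} uses the old meaning of 'type'; "
            "set it to 'Pipeline' or 'RayPipeline' instead."
        )
    raise ValueError(f"Unknown pipeline type: {pipeline_type!r}")


new_style = {"name": "query", "type": "RayPipeline"}
old_style = {"name": "query", "type": "Query"}

engine = resolve_engine(new_style)

try:
    resolve_engine(old_style)
    legacy_error = ""
except ValueError as err:
    legacy_error = str(err)
```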

@oryx1729 oryx1729 marked this pull request as draft July 12, 2021 13:59
@oryx1729 oryx1729 requested a review from tholor July 12, 2021 13:59

@tholor tholor left a comment

Looks already pretty good to me! Great job!
Left a few smaller comments.
Also, I have one high-level question: Did you test how much overhead we create with the communication between ray nodes? So comparing the time of a run with the regular Pipeline vs. RayPipeline on a single, local machine.

@oryx1729

Also, I have one high-level question: Did you test how much overhead we create with the communication between ray nodes? So comparing the time of a run with the regular Pipeline vs. RayPipeline on a single, local machine.

From Ray's documentation, the overhead should be in single-digit milliseconds. I haven't benchmarked with the Pipelines yet.

@oryx1729 oryx1729 changed the title WIP: Add Ray integration for Pipelines Add Ray integration for Pipelines Jul 30, 2021
@oryx1729 oryx1729 marked this pull request as ready for review July 30, 2021 08:30

@tholor tholor left a comment

Looking good. Only more documentation is needed to ensure a lower entry barrier for users + collaborators.

@oryx1729 oryx1729 requested a review from tholor August 2, 2021 12:22
@oryx1729 oryx1729 merged commit bafa1b4 into master Aug 2, 2021
@oryx1729 oryx1729 deleted the ray branch August 2, 2021 12:51