Add Ray integration for Pipelines #1255

oryx1729 · 2021-07-06T13:05:31Z

Ray Integration

This PR is an initial prototype of how Haystack Pipelines can be scaled horizontally and distributed across a cluster of machines using the Ray framework.

Design

The design goal is to provide a standardized interface for Pipelines similar to the Haystack Components. A new BasePipeline class is added that is extended by Pipeline and RayPipeline. This allows users to switch between "execution engines" for the Pipelines while retaining the configuration for the Components.
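As a rough illustration of this design, the sketch below shows one shared interface with two interchangeable execution engines. Only the class names BasePipeline, Pipeline, and RayPipeline come from this PR. The method bodies, the `nodes` argument, the `retriever` and `reader` functions, and the thread pool that stands in for a Ray cluster are assumptions made so the example runs without Ray installed.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Node = Callable[[Dict], Dict]  # hypothetical node signature: dict in, dict out


class BasePipeline(ABC):
    """Shared interface; illustrative only, not Haystack's actual class."""

    def __init__(self, nodes: List[Node]):
        self.nodes = nodes

    @abstractmethod
    def run(self, query: str) -> Dict:
        ...


class Pipeline(BasePipeline):
    """Runs every node inline in the calling process."""

    def run(self, query: str) -> Dict:
        data = {"query": query}
        for node in self.nodes:
            data = node(data)
        return data


class RayPipeline(BasePipeline):
    """Stand-in for the distributed engine: a thread pool plays the role of Ray."""

    def run(self, query: str) -> Dict:
        data = {"query": query}
        with ThreadPoolExecutor(max_workers=1) as pool:
            for node in self.nodes:
                # Each node call is submitted to the pool instead of run inline.
                data = pool.submit(node, data).result()
        return data


def retriever(data: Dict) -> Dict:
    return {**data, "documents": ["doc about " + data["query"]]}


def reader(data: Dict) -> Dict:
    return {**data, "answer": data["documents"][0].upper()}


# The same node list works with either engine.
local_result = Pipeline([retriever, reader]).run("ray")
distributed_result = RayPipeline([retriever, reader]).run("ray")
```

Because both classes accept the same node list and expose the same `run` signature, swapping the engine is a one-word change at the call site.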

The new RayPipeline has methods similar to those of the Pipeline class. With the current implementation, a RayPipeline can be instantiated locally or on a remote cluster from a Pipeline YAML configuration. Ray provides a handle that allows easy integration of deployed Pipelines with Python code or REST APIs.

Ray allows independently scaling the components by adding more replicas. For instance, in a typical extractive QA Pipeline, multiple readers can be added in front of a single Elasticsearch instance. The number of replicas can be configured by the new replicas parameter for each component in the Pipeline.
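A minimal sketch of how per-component replica counts might be read from a parsed pipeline configuration. Only the `replicas` parameter name comes from this PR. The dictionary layout (a `nodes` list with `name`, `inputs`, and `replicas` keys), the default of one replica, and the helper `replica_counts` are assumptions for illustration.

```python
from typing import Dict

# Hypothetical result of parsing a pipeline YAML file; the layout is assumed.
pipeline_config = {
    "name": "query_pipeline",
    "type": "RayPipeline",
    "nodes": [
        {"name": "Retriever", "inputs": ["Query"]},  # no replicas key: assume one
        {"name": "Reader", "inputs": ["Retriever"], "replicas": 3},
    ],
}


def replica_counts(config: Dict) -> Dict[str, int]:
    """Map each node name to its replica count, defaulting to 1 (assumed default)."""
    counts = {}
    for node in config["nodes"]:
        replicas = node.get("replicas", 1)
        if not isinstance(replicas, int) or replicas < 1:
            raise ValueError(f"Invalid replicas value for node {node['name']!r}: {replicas!r}")
        counts[node["name"]] = replicas
    return counts


counts = replica_counts(pipeline_config)
```

In this example the reader, typically the most expensive node in an extractive QA Pipeline, gets three replicas while the retriever keeps one.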

Implementation

The nodes in a Pipeline are wrapped in individual Ray Deployments. Pipeline.run() calls the "graph" of deployments to get the results for an input query (or, for indexing Pipelines, to index an uploaded file).

To create a deployment, the component type is passed to Ray together with the YAML configuration. The deployment then creates an instance of the component using parameters from the pipeline config.
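The two steps above (wrapping each node, then building the component inside the deployment from the pipeline config) can be sketched without Ray as follows. `COMPONENT_REGISTRY`, `NodeDeployment`, `DummyRetriever`, and the config layout are hypothetical. In the PR itself the wrapper would be a Ray Deployment and the component a Haystack class.

```python
from typing import Any, Dict


class DummyRetriever:
    """Hypothetical component used only for this example."""

    def __init__(self, top_k: int = 10):
        self.top_k = top_k

    def run(self, query: str) -> Dict[str, Any]:
        return {"query": query, "documents": [f"doc-{i}" for i in range(self.top_k)]}


# Hypothetical lookup from the `type` string in the config to a Python class.
COMPONENT_REGISTRY = {"DummyRetriever": DummyRetriever}


class NodeDeployment:
    """Plain-Python stand-in for a per-node deployment wrapper."""

    def __init__(self, pipeline_config: Dict, component_name: str):
        # Find this component's definition in the pipeline config.
        definition = next(
            c for c in pipeline_config["components"] if c["name"] == component_name
        )
        component_class = COMPONENT_REGISTRY[definition["type"]]
        # The component is created inside the wrapper, using params from the config.
        self.node = component_class(**definition.get("params", {}))

    def __call__(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        return self.node.run(*args, **kwargs)


config = {
    "components": [
        {"name": "Retriever", "type": "DummyRetriever", "params": {"top_k": 2}},
    ]
}
deployment = NodeDeployment(config, "Retriever")
result = deployment(query="What is Ray?")
```

Constructing the component inside the wrapper means only the configuration, not a live object, has to be shipped to the worker that hosts the deployment.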

Breaking Changes

The current YAML configuration has a type attribute for pipelines that specifies whether a pipeline is a Query or an Indexing pipeline. This conflicts with the type parameter for components, where it specifies the component class.

With this PR, the type now determines the class to use – Pipeline or RayPipeline.

The pipeline_type parameter is now replaced by root_node.
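To illustrate the new meaning of `type`, the sketch below picks a pipeline class from a parsed configuration. The placeholder classes, the `PIPELINE_CLASSES` mapping, and `load_pipeline` are hypothetical and only mirror the behavior described above; they are not Haystack's loader.

```python
from typing import Dict


class Pipeline:
    """Placeholder for the local pipeline class."""

    def __init__(self, name: str):
        self.name = name


class RayPipeline:
    """Placeholder for the Ray-backed pipeline class."""

    def __init__(self, name: str):
        self.name = name


# Hypothetical mapping from the `type` value in the config to a class.
PIPELINE_CLASSES = {"Pipeline": Pipeline, "RayPipeline": RayPipeline}


def load_pipeline(config: Dict):
    """Instantiate the class named by config['type'] (illustrative only)."""
    pipeline_type = config["type"]
    try:
        pipeline_class = PIPELINE_CLASSES[pipeline_type]
    except KeyError:
        raise ValueError(
            f"Unknown pipeline type {pipeline_type!r}; "
            f"expected one of {sorted(PIPELINE_CLASSES)}"
        ) from None
    return pipeline_class(config["name"])


ray_pipeline = load_pipeline({"name": "query", "type": "RayPipeline"})

# In this sketch, a config that still uses the old meaning of `type` is rejected.
try:
    load_pipeline({"name": "query", "type": "Query"})
    old_style_error = ""
except ValueError as err:
    old_style_error = str(err)
```

Existing configurations that set `type` to Query or Indexing would need to be updated to name a pipeline class instead.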

haystack/pipeline.py

tholor

Looks already pretty good to me! Great job!
Left a few smaller comments.
Also, I have one high-level question: Did you test how much overhead we create with the communication between ray nodes? So comparing the time of a run with the regular Pipeline vs. RayPipeline on a single, local machine.

haystack/pipeline.py

haystack/schema.py

haystack/pipeline.py

oryx1729 · 2021-07-30T08:29:59Z

> Also, I have one high-level question: Did you test how much overhead we create with the communication between ray nodes? So comparing the time of a run with the regular Pipeline vs. RayPipeline on a single, local machine.

According to Ray's documentation, the overhead should be in the single-digit milliseconds. I haven't benchmarked it with the Pipelines yet.
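One way such a comparison could be set up, sketched under assumptions: time the same call through both engines and compare the medians. `median_latency_ms` is a hypothetical helper, and `local_run` and `remote_run` are placeholders for calls to Pipeline.run() and RayPipeline.run(); the 2 ms sleep merely simulates communication cost and is not a measurement from this PR.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], repeats: int = 20) -> float:
    """Call fn repeatedly and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def local_run() -> int:
    # Placeholder for a Pipeline.run() call.
    return sum(range(1000))


def remote_run() -> int:
    # Placeholder for a RayPipeline.run() call; the sleep simulates transport cost.
    time.sleep(0.002)
    return sum(range(1000))


local_ms = median_latency_ms(local_run)
remote_ms = median_latency_ms(remote_run)
overhead_ms = remote_ms - local_ms
```

Using the median rather than the mean keeps a single slow call (for example, a cold start) from dominating the result.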

tholor

Looking good. Only more documentation is needed to ensure a lower entry barrier for users + collaborators.

haystack/pipeline.py

lalitpagaria reviewed Jul 6, 2021

haystack/pipeline.py (outdated)

lalitpagaria reviewed Jul 6, 2021

haystack/pipeline.py (outdated)

oryx1729 marked this pull request as draft July 12, 2021 13:59

oryx1729 requested a review from tholor July 12, 2021 13:59

tholor reviewed Jul 16, 2021

oryx1729 added 11 commits July 28, 2021 12:40

Add RayPipeline

5e8c13e

Add test

765dc74

Add requirement for ray

88acd69

Shutdown ray after test

1fd2569

Introduce pipeline type

929ffae

Fix root_name name

048566b

Fix test for saving Pipeline

d7221f9

Fix Ray test

84b72f4

Add ray to CI

9adf4a7

Add serve shutdown in test

dd66450

Update Ray version

d7415f4

oryx1729 force-pushed the ray branch from c739ccd to d7415f4 Compare July 28, 2021 10:41

oryx1729 added 13 commits July 28, 2021 13:42

Test CI

0806a97

Test CI

752bf06

Test CI

abcf32b

Test CI

bc8a6e2

Test CI

3590b0d

Test CI

5286da5

Test CI

87ba842

Test CI

4b342e3

Make Ray import optional

ff4ea84

Remove redundant code for reading YAML

1f48a7e

Add docstring for BaseComponent.load_from_pipeline_config

b135d4d

Fix typing

7b05db5

Add doctring for Ray Handle

8d25a0f

oryx1729 added 5 commits July 29, 2021 19:39

Test CI

9143b60

Test CI

5cf2976

Test CI

b04bd00

Test CI

db76878

Test CI

4df32c2

oryx1729 changed the title ~~WIP: Add Ray integration for Pipelines~~ Add Ray integration for Pipelines Jul 30, 2021

oryx1729 marked this pull request as ready for review July 30, 2021 08:30

tholor reviewed Aug 2, 2021

oryx1729 added 2 commits August 2, 2021 12:19

Add docstring

1251d9f

Add docstring

ed845c1

oryx1729 requested a review from tholor August 2, 2021 12:22

tholor approved these changes Aug 2, 2021

oryx1729 merged commit bafa1b4 into master Aug 2, 2021

oryx1729 deleted the ray branch August 2, 2021 12:51

This was referenced Aug 26, 2021

Parallelize pipeline execution #688

Closed

Update documentation for new Pipeline design #1386

Closed
