
Run a pipeline in a container #751

Closed · 28 tasks done
donkirkby opened this issue Dec 17, 2018 · 14 comments
Comments

donkirkby commented Dec 17, 2018

Now that issue #739 has added a simpler pipeline configuration based on Singularity, we'd still like to support the old style, where a developer wires a set of scripts together into a pipeline.

  • Upload a zip/tar.gz file with scripts in it.
  • Use the pipeline UI to wire inputs and outputs between scripts.
  • Save pipeline as a pipeline.json file in the tar.gz file with the scripts.
  • Launch the pipeline with a default Singularity image, plus the tar.gz file mounted on /mnt/bin, plus a driver script that reads the pipeline.json file and executes the steps.
  • The advantage is that this all runs as a single Slurm job, and the accounting is much simpler.
  • You can also migrate a pipeline from development to test to production by downloading the tar.gz file that includes the pipeline.json, and uploading it to another Kive server. A Kive container can be either a Singularity image or a tar.gz file. The tar.gz file is a layer on top of the default Singularity image.
  • When you upload a new tar.gz container without a pipeline.json, it can copy the pipeline.json from the previous revision in the container family.

Implementation steps:

  • Add parent container to Container model. That means that the child's container file will be extracted and mounted under the parent's /mnt/bin folder. Then the parent container runs.
  • Add tests for validating container file contents.
  • Use the existing pipeline builder to submit a pipeline.json file to the server.
  • Remove unused fields from pipeline.json.
  • Add an edit option for input nodes, and remove structure field.
  • Write the updated pipeline.json file into a zip container.
  • Read and write tar containers.
  • Update method magnets while typing inputs and outputs.
  • Hint how to enter multiple inputs and outputs.
  • Add default config menu to pipeline, and write it to pipeline.json.
  • Fix broken upload for container files.
  • Investigate warning about missing temporary file in container.tests.ContainerApiTests.test_create_singularity().
  • Filter parent container list to only include singularity containers.
  • Launch each step with singularity exec, based on the steps in the container's pipeline.json file.
  • Create an app when writing archive content.
  • Test running pipelines.
  • Load the pipeline editor, even when the pipeline.json contents are invalid.
  • Update MD5 whenever pipeline.json is updated.
  • Check container and parent MD5's before launching a run (see the MD5 sketch after this list).
  • Check input MD5's before launching a run.
  • Let user upload or revise containers with invalid pipelines, but display a warning, and don't create an app. That will stop them from launching.
  • Submit pipeline button splits into "Save" and "Save as new revision" (which prompts for tag and description).
  • Convert old pipelines to child containers.
  • Update developer documentation on how to create a pipeline.
  • Stop editing a pipeline after it's been published, or use copy on write. Something to make all published runs reproducible.
  • Create separate folders for each step, to avoid output name collisions.
  • Add child containers to removal plan.
  • Log to stderr if a run fails due to a bad MD5.
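
As a rough illustration of the MD5 checks in the list above (the helper names here are made up, not Kive's actual code):

import hashlib
import sys

def file_md5(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so large container files aren't read into memory at once.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def check_md5(path, expected_md5):
    # Refuse to launch the run, and log to stderr, if the file has changed since upload.
    actual = file_md5(path)
    if actual != expected_md5:
        print('MD5 mismatch for {}: expected {}, got {}.'.format(
            path, expected_md5, actual), file=sys.stderr)
        return False
    return True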

Example JSON:
The pipeline.json file wouldn't hold the list of files. It would just have everything in the pipeline entry. The full JSON would appear in the GET response to /api/containers/123/contents.

{
    "files": [
        "filter_quality.sh",
        "helper.py",
        "lib/antigravity.py"],
    "pipeline": {
        "kive_version": "0.14",
        "default_config": {
            "parent_family": "kive-default",
            "parent_tag": "v1.1",
            "parent_md5": "225a63213afdfd2e2659e9f9c1a3b695",
            "memory": 100,
            "threads": 1
        },
        "inputs": [
            {
                "dataset_name": "quality_csv",
                "x": 0.426540479529696,
                "y": 0.345062429057889
            }
        ],
        "steps": [
            {
                "driver": "filter_quality.sh",
                "inputs": [
                    {
                        "dataset_name": "quality_csv",
                        "source_step": 0,
                        "source_dataset_name": "quality_csv"
                    }
                ],
                "outputs": ["bad_cycles_csv"],
                "x": 0.501879443635952,
                "y": 0.497715260532689,
                "fill_colour": ""
            }
        ],
        "outputs": [
            {
                "dataset_name": "bad_cycles_csv",
                "source_step": 1,
                "source_dataset_name": "bad_cycles_csv",
                "x": 0.588014776534994,
                "y": 0.640181611804767
            }
        ]
    }
}
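
For illustration, the server-side code that digs pipeline.json out of an uploaded archive container might look roughly like this (the function name and error handling are only a sketch, not the final implementation):

import json
import tarfile
import zipfile

def read_pipeline_json(archive_path):
    # Archive containers can be either zip or tar.gz files.
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            raw = archive.read('pipeline.json')
    else:
        with tarfile.open(archive_path) as archive:
            raw = archive.extractfile('pipeline.json').read()
    pipeline = json.loads(raw)
    if 'kive_version' not in pipeline:
        # Sanity check: reject archives whose pipeline.json wasn't written by Kive.
        raise ValueError('pipeline.json has no kive_version field.')
    return pipeline
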
donkirkby added this to the Near future milestone Dec 17, 2018
donkirkby modified the milestones: Near future, 0.14 Jan 15, 2019

rhliang commented Jan 19, 2019

Been thinking about this today, sifting through Container code to get a better idea of how a Pipeline would be represented in JSON. Can anyone see anything missing from this tentative representation?

Pipeline JSON format is a dictionary (or "object" I guess) with fields "inputs", "outputs", and "steps", each of which is a list

inputs: a list of input names

steps: each step is represented as
 - script
 - invocation (maybe we can assume this is standardized like old Kive scripts)
 - inputs (tuples: step num and name)
 - outputs (names)

outputs: each output represented as
 - name
 - source (tuples: step num and name)

donkirkby (Member, Author) commented:

The pipeline.json format should also have support for the automated configuration described in #749, as well as a link to the name and MD5 of a Singularity container that it will run in. That will let the archive container file be copied from one Kive server to another without requiring a lot of setup steps.
Launching an archive container should probably use singularity exec so the Singularity container doesn't need explicit support for running pipelines. The run_pipeline.py script should be part of the archive container.

rhliang commented Jan 23, 2019

Having looked a little more, I am wondering if we should store some of the UI representation information, like (x, y) coordinates, in case we ever want to edit a pipeline or make a new version of it. If so, in addition to the above info:

metadata: an object with fields
 - default container name
 - default container MD5
 - default number of threads
 - default memory allocated

steps: in addition to the above, each step has
 - x
 - y
 - fill colour

inputs and outputs: in addition to the above,
 - x
 - y

I don't think I'm qualified to dig much into the JavaScript/TypeScript code that defines the UI, but I'm hoping that much of it could be repurposed. If our API serves it a list of files in the uploaded tar file, that could hopefully be handled by modifying the code that builds the menu for selecting/producing Method nodes in the existing Pipeline UI. I'm guessing we would also like a bit of code that lets you define methods from files within the Pipeline UI.

If no one objects I may just start digging into producing a run_pipeline.py for the default container to use.

donkirkby (Member, Author) commented:

Including the display stuff seems like a good idea.

It might be a helpful exercise to create an example pipeline.json for a simple pipeline like the MiCall filter quality pipeline, and post it here. Taking the output from dump_pipeline.py would be a good starting point. Do you want to do that, or shall I?

@jamesnakagawa, will you have time to work on the Pipeline UI in the next few weeks, or should I take a stab at it?

rhliang commented Jan 23, 2019

That's a good idea, I'll have a go at it.

rhliang commented Jan 23, 2019

I think this covers everything that we would need to actually carry out the pipeline. I made something up for the invocation field for running the filter_quality.sh step, but I think we would want this to be optional.

The driver path would be relative to /mnt/bin, and the invocation assumes a working directory of /mnt/bin. Input paths would be relative to /mnt/input, and intermediate files would be put into /mnt/step[x] I guess?

Example JSON moved to top comment.

donkirkby (Member, Author) commented:

I suggest we avoid a custom invocation field until we need it. Can't we just use driver input1 input2 output1 output2 as the standard invocation?
How about writing all the intermediate output files to /mnt/output as step1_output1, and then renaming any run outputs to output1 after the last step finishes?
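
To make that concrete, building each step's command line could look something like this (a sketch only, using the field names from the example JSON in the top comment; the helper names are made up):

import os

def step_command(step, step_num):
    # Standard invocation: driver, then input paths, then output paths, in order.
    input_paths = []
    for source in step['inputs']:
        if source['source_step'] == 0:
            # Source step 0 means a run input, mounted under /mnt/input.
            input_paths.append('/mnt/input/' + source['source_dataset_name'])
        else:
            # Intermediate outputs all live in /mnt/output as step<N>_<name>.
            input_paths.append('/mnt/output/step{}_{}'.format(
                source['source_step'], source['source_dataset_name']))
    output_paths = ['/mnt/output/step{}_{}'.format(step_num, name)
                    for name in step['outputs']]
    return ['/mnt/bin/' + step['driver']] + input_paths + output_paths

def rename_run_outputs(pipeline):
    # After the last step, give the run outputs their final names.
    for run_output in pipeline['outputs']:
        source = '/mnt/output/step{}_{}'.format(
            run_output['source_step'], run_output['source_dataset_name'])
        os.rename(source, '/mnt/output/' + run_output['dataset_name'])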

rhliang commented Jan 23, 2019

Yeah, sounds good, let's go with that.

donkirkby (Member, Author) commented:

I have a cunning idea! 😈
run_pipeline.py doesn't have to run inside the container. The runcontainer command could have logic to handle archive containers, and it could call singularity exec separately for each step in the pipeline. The benefit is that the Singularity container wouldn't require Python, and the logic for running a pipeline stays in Kive.
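
Something like this, very roughly (argument names are invented; the bind paths follow the /mnt layout discussed above):

import subprocess

def run_step(parent_image, bin_dir, input_dir, output_dir, command):
    # command is the driver plus its input and output paths, built from pipeline.json.
    # Each step gets its own singularity exec call, so the parent image needs no
    # pipeline logic (or even Python) of its own.
    subprocess.check_call(
        ['singularity', 'exec',
         '-B', bin_dir + ':/mnt/bin',
         '-B', input_dir + ':/mnt/input',
         '-B', output_dir + ':/mnt/output',
         parent_image] + command)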

Also, I suggest we write the Kive version into pipeline.json. I think this has two uses:

  1. It's a reasonable sanity check on the pipeline.json file. It's remotely possible that a developer could create an archive that included some other file named pipeline.json. Kive could just reject that archive if it doesn't find a kive_version field in pipeline.json.
  2. We might need it to support backward compatibility if we change the pipeline interface in the future.

rhliang commented Jan 23, 2019

Another small edit: inputs don't need a dataset_idx field, they will be listed in their correct order.

donkirkby self-assigned this Jan 24, 2019
donkirkby (Member, Author) commented:

I will work on the UI changes, and @rhliang will work on the server changes. We'll both make updates to the pipeline.json example above if we need to change the contract between browser and server.

donkirkby added a commit that referenced this issue Jan 26, 2019
Part of #751.
I moved it and copied it back, so the annotations would get carried into the container app. The pipeline app will get deleted soon.
donkirkby pinned this issue Jan 28, 2019
donkirkby (Member, Author) commented:

I suggest we tweak the parent configuration in pipeline.json. Change this:

"default_config": {
    "container_name": "kive-default",
    "container_md5": "225a63213afdfd2e2659e9f9c1a3b695",
    "memory": 100,
    "threads": 1
},

To this:

"default_config": {
    "parent_family": "sample",
    "parent_tag": "basic",
    "parent_md5": "8dab0b3c7b7d812f0ba4819664be8acb",
    "memory": 100,
    "threads": 1
},

This clarifies that it's for the parent, not the archive container. It also matches the fields on the container - there is no name, only a tag and a family name.

donkirkby added a commit that referenced this issue Feb 8, 2019
Also configure the vagrant user to be able to run tests.
donkirkby added a commit that referenced this issue Feb 8, 2019
Only write headers when there is actual output.
donkirkby added a commit that referenced this issue Feb 9, 2019
Also filter parent containers.
Make driver scripts executable.
donkirkby added a commit that referenced this issue Feb 12, 2019
donkirkby added a commit that referenced this issue Feb 15, 2019
Also stop Slurm breaking after Vagrant reboot.
rhliang pushed a commit that referenced this issue Feb 21, 2019
donkirkby added a commit that referenced this issue Mar 1, 2019
Don't write archive container copies to a subfolder.
rhliang pushed a commit that referenced this issue Mar 1, 2019
…rly configured when running archive containers.

This is part of #751.
rhliang pushed a commit that referenced this issue Mar 1, 2019
donkirkby (Member, Author) commented:

WOOT!

donkirkby unpinned this issue Mar 1, 2019
donkirkby added a commit that referenced this issue Mar 6, 2019
Update the release documentation.