
Run a pipeline in a container #751

Closed · 28 tasks done
donkirkby opened this issue Dec 17, 2018 · 14 comments
Comments

donkirkby commented Dec 17, 2018

Now that issue #739 has added a simpler pipeline configuration based on Singularity, we'd still like to support the old style, where a developer wires a set of scripts together into a pipeline.

  • Upload a zip/tar.gz file with scripts in it.
  • Use the pipeline UI to wire inputs and outputs between scripts.
  • Save pipeline as a pipeline.json file in the tar.gz file with the scripts.
  • Launch the pipeline with a default Singularity image, plus the tar.gz file mounted on /mnt/bin, plus a driver script that reads the pipeline.json file and executes the steps.
  • The advantage is that this all runs as a single Slurm job, and the accounting is much simpler.
  • You can also migrate a pipeline from development to test to production by downloading the tar.gz file that includes the pipeline.json, and uploading it to another Kive server. A Kive container can be either a Singularity image or a tar.gz file. The tar.gz file is a layer on top of the default Singularity image.
  • When you upload a new tar.gz container without a pipeline.json, it can copy the pipeline.json from the previous revision in the container family.

Implementation steps:

  • Add parent container to Container model. That means that the child's container file will be extracted and mounted under the parent's /mnt/bin folder. Then the parent container runs.
  • Add tests for validating container file contents.
  • Use the existing pipeline builder to submit a pipeline.json file to the server.
  • Remove unused fields from pipeline.json.
  • Add an edit option for input nodes, and remove structure field.
  • Write the updated pipeline.json file into a zip container.
  • Read and write tar containers.
  • Update method magnets while typing inputs and outputs.
  • Hint how to enter multiple inputs and outputs.
  • Add default config menu to pipeline, and write it to pipeline.json.
  • Fix broken upload for container files.
  • Investigate warning about missing temporary file in container.tests.ContainerApiTests.test_create_singularity().
  • Filter parent container list to only include singularity containers.
  • Launch each step with singularity exec, based on the steps in the container's pipeline.json file.
  • Create an app when writing archive content.
  • Test running pipelines.
  • Load the pipeline editor, even when the pipeline.json contents are invalid.
  • Update MD5 whenever pipeline.json is updated.
  • Check container and parent MD5's before launching a run (see the MD5 sketch after this list).
  • Check input MD5's before launching a run.
  • Let user upload or revise containers with invalid pipelines, but display a warning, and don't create an app. That will stop them from launching.
  • Submit pipeline button splits into "Save" and "Save as new revision" (which prompts for tag and description).
  • Convert old pipelines to child containers.
  • Update developer documentation on how to create a pipeline.
  • Stop editing a pipeline after it's been published, or use copy on write. Something to make all published runs reproducible.
  • Create separate folders for each step, to avoid output name collisions.
  • Add child containers to removal plan.
  • Log to stderr if a run fails due to a bad MD5.
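
As a rough illustration of the MD5 checks in the list above (the helper names here are made up, not Kive's actual code):

import hashlib
import sys

def file_md5(path, chunk_size=1024 * 1024):
    # Hash the file in chunks so large container files aren't read into memory at once.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def check_md5(path, expected_md5):
    # Refuse to launch the run, and log to stderr, if the file has changed since upload.
    actual = file_md5(path)
    if actual != expected_md5:
        print('MD5 mismatch for {}: expected {}, got {}.'.format(
            path, expected_md5, actual), file=sys.stderr)
        return False
    return True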

Example JSON:
The pipeline.json file wouldn't hold the list of files. It would just have everything in the pipeline entry. The full JSON would appear in the GET response to /api/containers/123/contents.

{
    "files": [
        "filter_quality.sh",
        "helper.py",
        "lib/antigravity.py"],
    "pipeline": {
        "kive_version": "0.14",
        "default_config": {
            "parent_family": "kive-default",
            "parent_tag": "v1.1",
            "parent_md5": "225a63213afdfd2e2659e9f9c1a3b695",
            "memory": 100,
            "threads": 1
        },
        "inputs": [
            {
                "dataset_name": "quality_csv",
                "x": 0.426540479529696,
                "y": 0.345062429057889
            }
        ],
        "steps": [
            {
                "driver": "filter_quality.sh",
                "inputs": [
                    {
                        "dataset_name": "quality_csv",
                        "source_step": 0,
                        "source_dataset_name": "quality_csv"
                    }
                ],
                "outputs": ["bad_cycles_csv"],
                "x": 0.501879443635952,
                "y": 0.497715260532689,
                "fill_colour": ""
            }
        ],
        "outputs": [
            {
                "dataset_name": "bad_cycles_csv",
                "source_step": 1,
                "source_dataset_name": "bad_cycles_csv",
                "x": 0.588014776534994,
                "y": 0.640181611804767
            }
        ]
    }
}
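
For illustration, the server-side code that digs pipeline.json out of an uploaded archive container might look roughly like this (the function name and error handling are only a sketch, not the final implementation):

import json
import tarfile
import zipfile

def read_pipeline_json(archive_path):
    # Archive containers can be either zip or tar.gz files.
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            raw = archive.read('pipeline.json')
    else:
        with tarfile.open(archive_path) as archive:
            raw = archive.extractfile('pipeline.json').read()
    pipeline = json.loads(raw)
    if 'kive_version' not in pipeline:
        # Sanity check: reject archives whose pipeline.json wasn't written by Kive.
        raise ValueError('pipeline.json has no kive_version field.')
    return pipeline
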
donkirkby added this to the Near future milestone Dec 17, 2018
donkirkby modified the milestones: Near future, 0.14 Jan 15, 2019

rhliang commented Jan 19, 2019

Been thinking about this today, sifting through Container code to get a better idea of how a Pipeline would be represented in JSON. Can anyone see anything missing from this tentative representation?

Pipeline JSON format is a dictionary (or "object" I guess) with fields "inputs", "outputs", and "steps", each of which is a list

inputs: a list of input names

steps: each step is represented as
 - script
 - invocation (maybe we can assume this is standardized like old Kive scripts)
 - inputs (tuples: step num and name)
 - outputs (names)

outputs: each output represented as
 - name
 - source (tuples: step num and name)

donkirkby (Member, Author) commented:

The pipeline.json format should also have support for the automated configuration described in #749, as well as a link to the name and MD5 of a Singularity container that it will run in. That will let the archive container file be copied from one Kive server to another without requiring a lot of setup steps.
Launching an archive container should probably use singularity exec so the Singularity container doesn't need explicit support for running pipelines. The run_pipeline.py script should be part of the archive container.

rhliang commented Jan 23, 2019

Having looked a little more, I am wondering if we should store some of the UI representation information, like (x, y) coordinates, in case we ever want to edit a pipeline or make a new version of it. If so, in addition to the above info:

metadata: an object with fields
 - default container name
 - default container MD5
 - default number of threads
 - default memory allocated

steps: in addition to the above, each step has
 - x
 - y
 - fill colour

inputs and outputs: in addition to the above,
 - x
 - y

I don't think I'm qualified to dig much into the JavaScript/TypeScript code that defines the UI, but I'm hoping that much of it could be repurposed. If our API serves it a list of files in the uploaded tar file, that could hopefully be handled by modifying the code that builds the menu for selecting/producing Method nodes in the existing Pipeline UI. I'm guessing we would also like a bit of code that lets you define methods from files within the Pipeline UI.

If no one objects I may just start digging into producing a run_pipeline.py for the default container to use.

donkirkby (Member, Author) commented:

Including the display stuff seems like a good idea.

It might be a helpful exercise to create an example pipeline.json for a simple pipeline like the MiCall filter quality pipeline, and post it here. Taking the output from dump_pipeline.py would be a good starting point. Do you want to do that, or shall I?

@jamesnakagawa, will you have time to work on the Pipeline UI in the next few weeks, or should I take a stab at it?

rhliang commented Jan 23, 2019

That's a good idea, I'll have a go at it.

rhliang commented Jan 23, 2019

I think this covers everything that we would need to actually carry out the pipeline. I made something up for the invocation field for running the filter_quality.sh step, but I think we would want this to be optional.

The driver path would be relative to /mnt/bin, and the invocation assumes a working directory of /mnt/bin. Input paths would be relative to /mnt/input, and intermediate files would be put into /mnt/step[x] I guess?

Example JSON moved to top comment.

donkirkby (Member, Author) commented:

I suggest we avoid a custom invocation field until we need it. Can't we just use driver input1 input2 output1 output2 as the standard invocation?
How about writing all the intermediate output files to /mnt/output as step1_output1, and then renaming any run outputs to output1 after the last step finishes?
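
To make that concrete, building each step's command line could look something like this (a sketch only, using the field names from the example JSON in the top comment; the helper names are made up):

import os

def step_command(step, step_num):
    # Standard invocation: driver, then input paths, then output paths, in order.
    input_paths = []
    for source in step['inputs']:
        if source['source_step'] == 0:
            # Source step 0 means a run input, mounted under /mnt/input.
            input_paths.append('/mnt/input/' + source['source_dataset_name'])
        else:
            # Intermediate outputs all live in /mnt/output as step<N>_<name>.
            input_paths.append('/mnt/output/step{}_{}'.format(
                source['source_step'], source['source_dataset_name']))
    output_paths = ['/mnt/output/step{}_{}'.format(step_num, name)
                    for name in step['outputs']]
    return ['/mnt/bin/' + step['driver']] + input_paths + output_paths

def rename_run_outputs(pipeline):
    # After the last step, give the run outputs their final names.
    for run_output in pipeline['outputs']:
        source = '/mnt/output/step{}_{}'.format(
            run_output['source_step'], run_output['source_dataset_name'])
        os.rename(source, '/mnt/output/' + run_output['dataset_name'])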

rhliang commented Jan 23, 2019

Yeah, sounds good, let's go with that.

donkirkby (Member, Author) commented:

I have a cunning idea! 😈
run_pipeline.py doesn't have to run inside the container. The runcontainer command could have logic to handle archive containers, and it could call singularity exec separately for each step in the pipeline. The benefit is that the Singularity container wouldn't require Python, and the logic for running a pipeline stays in Kive.
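
Something like this, very roughly (argument names are invented; the bind paths follow the /mnt layout discussed above):

import subprocess

def run_step(parent_image, bin_dir, input_dir, output_dir, command):
    # command is the driver plus its input and output paths, built from pipeline.json.
    # Each step gets its own singularity exec call, so the parent image needs no
    # pipeline logic (or even Python) of its own.
    subprocess.check_call(
        ['singularity', 'exec',
         '-B', bin_dir + ':/mnt/bin',
         '-B', input_dir + ':/mnt/input',
         '-B', output_dir + ':/mnt/output',
         parent_image] + command)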

Also, I suggest we write the Kive version into pipeline.json. I think this has two uses:

  1. It's a reasonable sanity check on the pipeline.json file. It's remotely possible that a developer could create an archive that included some other file named pipeline.json. Kive could just reject that archive if it doesn't find a kive_version field in pipeline.json.
  2. We might need it to support backward compatibility if we change the pipeline interface in the future.

rhliang commented Jan 23, 2019

Another small edit: inputs don't need a dataset_idx field, they will be listed in their correct order.

donkirkby self-assigned this Jan 24, 2019
donkirkby (Member, Author) commented:

I will work on the UI changes, and @rhliang will work on the server changes. We'll both make updates to the pipeline.json example above if we need to change the contract between browser and server.

donkirkby added a commit that referenced this issue Jan 26, 2019
Part of #751.
I moved it and copied it back, so the annotations would get carried into the container app. The pipeline app will get deleted soon.
donkirkby pinned this issue Jan 28, 2019
donkirkby (Member, Author) commented:

I suggest we tweak the parent configuration in pipeline.json. Change this:

"default_config": {
    "container_name": "kive-default",
    "container_md5": "225a63213afdfd2e2659e9f9c1a3b695",
    "memory": 100,
    "threads": 1
},

To this:

"default_config": {
    "parent_family": "sample",
    "parent_tag": "basic",
    "parent_md5": "8dab0b3c7b7d812f0ba4819664be8acb",
    "memory": 100,
    "threads": 1
},

This clarifies that it's for the parent, not the archive container. It also matches the fields on the container - there is no name, only a tag and a family name.

donkirkby added a commit that referenced this issue Feb 8, 2019
Also configure the vagrant user to be able to run tests.
donkirkby added a commit that referenced this issue Feb 8, 2019
Only write headers when there is actual output.
donkirkby added a commit that referenced this issue Feb 9, 2019
Also filter parent containers.
Make driver scripts executable.
donkirkby added a commit that referenced this issue Feb 12, 2019
donkirkby added a commit that referenced this issue Feb 15, 2019
Also stop Slurm breaking after Vagrant reboot.
rhliang pushed a commit that referenced this issue Feb 21, 2019
donkirkby added a commit that referenced this issue Mar 1, 2019
Don't write archive container copies to a subfolder.
rhliang pushed a commit that referenced this issue Mar 1, 2019
…rly configured when running archive containers.

This is part of #751.
rhliang pushed a commit that referenced this issue Mar 1, 2019
donkirkby (Member, Author) commented:

WOOT!

donkirkby unpinned this issue Mar 1, 2019
donkirkby added a commit that referenced this issue Mar 6, 2019
Update the release documentation.