Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1093] Add python3 Docker images for each MXNet release #12791

Merged
merged 21 commits into from
Mar 12, 2019

Conversation

mbaijal
Copy link
Contributor

@mbaijal mbaijal commented Oct 11, 2018

Description

@KellenSunderland @marcoabreu @gautamkmr
This PR updates the docker/docker-python folder to add 8 new dockerfiles for python3. It also updates the automated docker release script to build and update the python 3 docker images similar to python 2.
Python version used is 3.5 which is the default version on ubuntu 16.04.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Added 8 new dockerfiles for python 3 - cpu, cu80, cu90, cu92 with and without mkl.
  • Updated the build script to also build, test and upload these new dockerfiles
  • Updated the readme.

RUN apt-get update
RUN apt-get install -y wget python3 gcc
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
Copy link
Contributor

@KellenSunderland KellenSunderland Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Super nit: I'd combine smaller commands like this into a single layer.

RUN apt-get update && \
    apt-get install -y wget python3 gcc && \
        ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Kellen's ask: I think it is NOT a nit-pick but an important requirement because each RUN corresponds to a filesystem layer for Docker. See this: https://stackoverflow.com/questions/31222377/what-are-docker-image-layers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KellenSunderland @bhavinthaker Thank you, I agree. I will update the PR.

RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py

RUN pip3 install mxnet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version has to be specifyable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean parameterize the MXNet version in the dockerfiles?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We have to be able to create images for the different versions in a reproducible fashion.

@@ -12,13 +12,21 @@ It uses the appropriate pip binaries to build different docker images as -
* gpu_cu80_mkl
* gpu_cu92
* gpu_cu92_mkl
* cpu_py3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if python3 was the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not want to change the docker images already uploaded for MXNet 1.3 since this might cause something to break for a user. This change to make python3 the default docker version cannot be immediate and should be discussed on the dev list. If agreed upon, this should take effect from the next release onwards. Wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good points. Could you start the discussion?

Copy link
Contributor Author

@mbaijal mbaijal Oct 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been discussed on dev-list and the decision is to switch to python3 images as default. This will be done in the next release to avoid disruption to the current users.

docker/docker-python/README.md Show resolved Hide resolved
@@ -117,10 +131,40 @@ docker_build_image "${mxnet_version}_gpu_cu92_mkl" "Dockerfile.mxnet.python.gpu.
docker_test_image_gpu "${mxnet_version}_gpu_cu92_mkl"



# Build and Test Python3 dockerfiles - CPU
docker_build_image "${mxnet_version}_cpu_python3" "Dockerfile.mxnet.python3.cpu"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if the builds and CPU tests ran parallelized to reduce the critical path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be a little concerned about memory usage if we were to parallelize the builds. I could see that this might make building the images difficult on a laptop or even a middle-tier cloud instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I built 15 images for our docker cache in parallel on my laptop and didn't encounter any problems. I don't think it will be an issue. Also, this script is targeted for CD execution - individual compilation could be an additional mode, but we should optimise for maximum resource utilization here IMO

@mbaijal mbaijal changed the title [MXNET-1093] Add python3 Docker images for each MXNet release [WIP][MXNET-1093] Add python3 Docker images for each MXNet release Oct 12, 2018
@mbaijal
Copy link
Contributor Author

mbaijal commented Oct 12, 2018

Yea.

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small things, but overall looks pretty good to me

docker_generate_image_gpu "${mxnet_version}_gpu_cu92_mkl" "Dockerfile.mxnet.python.gpu.cu92.mkl" "python"

echo "Waiting for MXNet Python2 Docker Images to Build"
wait
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly does this do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It waits for all the processes spawned as background processes (build, test) to complete before moving on to the next command.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Does it also properly process error codes of the underlying processes?


#Build and Test dockerfiles - GPU
docker_generate_image_gpu "${mxnet_version}_gpu_cu90" "Dockerfile.mxnet.python.gpu.cu90" "python" &
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually run all the commands in parallel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I verified using ps -ef that builds/tests are running in parallel. However, I do also want to record the total runtime with and without parallelization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Did you also test error propagation in that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following are runtimes on a p2.16x instance to build and test all docker images (python2 and python3)
Runtime with parallelization: 13mins
Runtime without parallelization: 29mins

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! What happens if there is an error in one of the builds (you can simulate that by just returning a non zero exit code in one of the RUN clauses)? Does it get propagated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked that by hardcoding an incorrect mxnet version in one of the dockerfiles. While a non-zero error message is printed the logs, the script does not exit and completes with "success".
I will fix this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the last commit by writing each build log to a file and then parsing these logs for a possible error message. If error found, script fails. The error message makes it easy to find out which tag failed.

@mbaijal mbaijal changed the title [WIP][MXNET-1093] Add python3 Docker images for each MXNet release [MXNET-1093] Add python3 Docker images for each MXNet release Oct 23, 2018
@ankkhedia
Copy link
Contributor

@marcoabreu It seems that your review comments has been addressed. Could you please take a look?

marcoabreu
marcoabreu previously approved these changes Oct 30, 2018
@mbaijal
Copy link
Contributor Author

mbaijal commented Oct 30, 2018

@marcoabreu @sandeep-krishnamurthy Could either of you please merge this PR. Thanks!

@marcoabreu
Copy link
Contributor

Please wait with merging. I'm still discussing some things internally

Copy link
Contributor

@larroy larroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't merge this code as it is.


# Functions
docker_test_image_cpu(){
Copy link
Contributor

@larroy larroy Nov 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't have docker running like this because it leaves containers hanging around even when the parent process terminates. We introduce logic to avoid this kinds of problems in CI #12381

My suggestion is to use one of this two options:

If on doubt, we can choose option 1 and go from there.

@larroy
Copy link
Contributor

larroy commented Nov 1, 2018

Thanks for this effort. I have some reservations about the implementation, see my previous comment about zombie containers and resource disposal in the code.

Also, can we use Python instead of shell to automate these steps? It's more robust, debuggable, flexible and easier to mantain, we don't want to introduce complex logic in shell scripts.

All the automation on the project is done in Python. The code that is forking process to parallelize the build should be done as it's done in /~https://github.com/apache/incubator-mxnet/blob/master/ci/docker_cache.py or with a similar Python construct of your choice. Even if it's not critical to parallelize building of the images, if the code is well factored and building the container it's on its own function is very easy to parallelize.

@marcoabreu marcoabreu dismissed their stale review November 1, 2018 14:28

reconsidered internally

@ankkhedia
Copy link
Contributor

ankkhedia commented Nov 1, 2018

@mbaijal Could you please look into comments provided by @larroy and @marcoabreu

@kalyc
Copy link
Contributor

kalyc commented Nov 12, 2018

Thanks for your contributions @mbaijal
@mxnet-label-bot update [pr-awaiting-review]

@larroy , @marcoabreu could you please help with reviewing this PR?

@marcoabreu
Copy link
Contributor

@mxnet-label-bot update [pr-awaiting-response]

@marcoabreu marcoabreu added pr-awaiting-response PR is reviewed and waiting for contributor to respond and removed CI Docker pr-work-in-progress PR is still work in progress labels Nov 12, 2018
@marcoabreu
Copy link
Contributor

@mxnet-label-bot add [ci,docker]

@marcoabreu
Copy link
Contributor

We are waiting for a response.

@vandanavk
Copy link
Contributor

@mbaijal could you check the CI build failure and add a comment explaining the changes you made/are making in response to Pedro and Marco's feedback?

@mxnet-label-bot update [CI, Docker, pr-awaiting-testing]

@larroy
Copy link
Contributor

larroy commented Dec 7, 2018

Given that we are taking over this script, shall we close this?

@larroy
Copy link
Contributor

larroy commented Dec 7, 2018

@perdasilva

@nswamy nswamy added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Dec 8, 2018
@perdasilva
Copy link
Contributor

@larroy - let's close it. I'll open a new PR after my RFC goes through

@larroy
Copy link
Contributor

larroy commented Dec 16, 2018

@perdasilva it's meghnas PR, @mbaijal can you close? Thanks.

@mbaijal mbaijal closed this Dec 20, 2018
@mbaijal
Copy link
Contributor Author

mbaijal commented Dec 20, 2018

I don't mind closing this but half of the code for this change (python2 and initial automation script) has already been merged in a previous PR and this PR contains some updates and fixes to it. This updated script was also used for the last two docker releases. In my opinion it might be better to merge this PR to complete those changes and then deprecate it when a new python script is available.

@larroy Wdyt?

@mbaijal mbaijal reopened this Dec 20, 2018
@perdasilva
Copy link
Contributor

@mbaijal imo, in this case, we should probably go fwd and merge. @larroy?

@Roshrini
Copy link
Member

Roshrini commented Jan 2, 2019

@larroy Is it okay if we merge this PR?
@nswamy @marcoabreu

@anirudhacharya
Copy link
Member

@mxnet-label-bot update [pr-awaiting-review]

@marcoabreu marcoabreu added pr-awaiting-review PR is waiting for code review and removed CI Docker pr-awaiting-merge Review and CI is complete. Ready to Merge labels Jan 14, 2019
@sandeep-krishnamurthy
Copy link
Contributor

@larroy / @marcoabreu - I think we should take this PR forward. Can you please help take a look again?

@marcoabreu
Copy link
Contributor

Please hold on with this PR since we are currently refactoring that pipeline.

@perdasilva for further communication regarding this topic

@marcoabreu marcoabreu added pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review labels Feb 5, 2019
@larroy
Copy link
Contributor

larroy commented Feb 6, 2019

up to @perdasilva and @mbaijal please decide best path forward.

@perdasilva
Copy link
Contributor

@marcoabreu lets talk about this issue today and come up with the solution. If the PR does what someone needs right now, I don't see any reason to hold this back. We can always refactor it later if need be...

@perdasilva
Copy link
Contributor

@mbaijal @sandeep-krishnamurthy @srochel please merge this - thank you

I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
@mbaijal
Copy link
Contributor Author

mbaijal commented Feb 26, 2019

Thanks @perdasilva !

@marcoabreu marcoabreu merged commit ab0ca86 into apache:master Mar 12, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
…#12791)

* Add dockerfiles for python3

* Fix

* Fix the dockerfile names

* Update the README with python3 images

* Parameterize the mxnet version

* Fix typo

* Reduce the number of docker layers

* Combine build and Test, run builds in parallel. Use variable names instead of numbers

* minor fix

* Update README for build command

* Bug Fix: Script should fail if any background process returns non-zero code

* Changes needed for Benchmarking, 4 new tags, change python to python-dev

* Minor typos

* Minor Rearrangement - sometimes tagging fails

* Move error checking to a method

* Push BAI images to dockerhub

* Create ~/temp dir

* Bug Fix

* Fix merge issues

I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
nswamy pushed a commit that referenced this pull request Apr 5, 2019
* Add dockerfiles for python3

* Fix

* Fix the dockerfile names

* Update the README with python3 images

* Parameterize the mxnet version

* Fix typo

* Reduce the number of docker layers

* Combine build and Test, run builds in parallel. Use variable names instead of numbers

* minor fix

* Update README for build command

* Bug Fix: Script should fail if any background process returns non-zero code

* Changes needed for Benchmarking, 4 new tags, change python to python-dev

* Minor typos

* Minor Rearrangement - sometimes tagging fails

* Move error checking to a method

* Push BAI images to dockerhub

* Create ~/temp dir

* Bug Fix

* Fix merge issues

I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…#12791)

* Add dockerfiles for python3

* Fix

* Fix the dockerfile names

* Update the README with python3 images

* Parameterize the mxnet version

* Fix typo

* Reduce the number of docker layers

* Combine build and Test, run builds in parallel. Use variable names instead of numbers

* minor fix

* Update README for build command

* Bug Fix: Script should fail if any background process returns non-zero code

* Changes needed for Benchmarking, 4 new tags, change python to python-dev

* Minor typos

* Minor Rearrangement - sometimes tagging fails

* Move error checking to a method

* Push BAI images to dockerhub

* Create ~/temp dir

* Bug Fix

* Fix merge issues

I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-work-in-progress PR is still work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.