-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[MXNET-1093] Add python3 Docker images for each MXNet release #12791
Conversation
RUN apt-get update | ||
RUN apt-get install -y wget python3 gcc | ||
RUN wget https://bootstrap.pypa.io/get-pip.py | ||
RUN python3 get-pip.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Super nit: I'd combine smaller commands like this into a single layer.
RUN apt-get update && \
apt-get install -y wget python3 gcc && \
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Kellen's ask: I think it is NOT a nit-pick but an important requirement because each RUN corresponds to a filesystem layer for Docker. See this: https://stackoverflow.com/questions/31222377/what-are-docker-image-layers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KellenSunderland @bhavinthaker Thank you, I agree. I will update the PR.
RUN wget https://bootstrap.pypa.io/get-pip.py | ||
RUN python3 get-pip.py | ||
|
||
RUN pip3 install mxnet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Version has to be specifyable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean parameterize the MXNet version in the dockerfiles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We have to be able to create images for the different versions in a reproducible fashion.
docker/docker-python/README.md
Outdated
@@ -12,13 +12,21 @@ It uses the appropriate pip binaries to build different docker images as - | |||
* gpu_cu80_mkl | |||
* gpu_cu92 | |||
* gpu_cu92_mkl | |||
* cpu_py3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if python3 was the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not want to change the docker images already uploaded for MXNet 1.3 since this might cause something to break for a user. This change to make python3 the default docker version cannot be immediate and should be discussed on the dev list. If agreed upon, this should take effect from the next release onwards. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good points. Could you start the discussion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been discussed on dev-list and the decision is to switch to python3 images as default. This will be done in the next release to avoid disruption to the current users.
@@ -117,10 +131,40 @@ docker_build_image "${mxnet_version}_gpu_cu92_mkl" "Dockerfile.mxnet.python.gpu. | |||
docker_test_image_gpu "${mxnet_version}_gpu_cu92_mkl" | |||
|
|||
|
|||
|
|||
# Build and Test Python3 dockerfiles - CPU | |||
docker_build_image "${mxnet_version}_cpu_python3" "Dockerfile.mxnet.python3.cpu" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if the builds and CPU tests ran parallelized to reduce the critical path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be a little concerned about memory usage if we were to parallelize the builds. I could see that this might make building the images difficult on a laptop or even a middle-tier cloud instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I built 15 images for our docker cache in parallel on my laptop and didn't encounter any problems. I don't think it will be an issue. Also, this script is targeted for CD execution - individual compilation could be an additional mode, but we should optimise for maximum resource utilization here IMO
Yea. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small things, but overall looks pretty good to me
docker_generate_image_gpu "${mxnet_version}_gpu_cu92_mkl" "Dockerfile.mxnet.python.gpu.cu92.mkl" "python" | ||
|
||
echo "Waiting for MXNet Python2 Docker Images to Build" | ||
wait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It waits for all the processes spawned as background processes (build, test) to complete before moving on to the next command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Does it also properly process error codes of the underlying processes?
|
||
#Build and Test dockerfiles - GPU | ||
docker_generate_image_gpu "${mxnet_version}_gpu_cu90" "Dockerfile.mxnet.python.gpu.cu90" "python" & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this actually run all the commands in parallel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I verified using ps -ef
that builds/tests are running in parallel. However, I do also want to record the total runtime with and without parallelization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! Did you also test error propagation in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following are runtimes on a p2.16x instance to build and test all docker images (python2 and python3)
Runtime with parallelization: 13mins
Runtime without parallelization: 29mins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! What happens if there is an error in one of the builds (you can simulate that by just returning a non zero exit code in one of the RUN clauses)? Does it get propagated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked that by hardcoding an incorrect mxnet version in one of the dockerfiles. While a non-zero error message is printed the logs, the script does not exit and completes with "success".
I will fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in the last commit by writing each build log to a file and then parsing these logs for a possible error message. If error found, script fails. The error message makes it easy to find out which tag failed.
@marcoabreu It seems that your review comments has been addressed. Could you please take a look? |
@marcoabreu @sandeep-krishnamurthy Could either of you please merge this PR. Thanks! |
Please wait with merging. I'm still discussing some things internally |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't merge this code as it is.
|
||
# Functions | ||
docker_test_image_cpu(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't have docker running like this because it leaves containers hanging around even when the parent process terminates. We introduce logic to avoid this kinds of problems in CI #12381
My suggestion is to use one of this two options:
- Use /~https://github.com/apache/incubator-mxnet/blob/master/ci/build.py as it's done in /~https://github.com/apache/incubator-mxnet/blob/master/ci/docker_cache.py to run the tests safely, build the images in parallel. This is probably the fastest route that is safe but introduces some coupling between CI and CD.
- Abstract the cleanup logic in build.py in a module which is then used from CI and CD with the specific commands.
- Duplicate the logic from build.py and docker_cache.py for this task (copy & paste) and adjust for the specific case. A downside is that introduces similar logic to be maintained , the upside is that the logic wouldn't be coupled and can evolve in parallel.
If on doubt, we can choose option 1 and go from there.
Thanks for this effort. I have some reservations about the implementation, see my previous comment about zombie containers and resource disposal in the code. Also, can we use Python instead of shell to automate these steps? It's more robust, debuggable, flexible and easier to mantain, we don't want to introduce complex logic in shell scripts. All the automation on the project is done in Python. The code that is forking process to parallelize the build should be done as it's done in /~https://github.com/apache/incubator-mxnet/blob/master/ci/docker_cache.py or with a similar Python construct of your choice. Even if it's not critical to parallelize building of the images, if the code is well factored and building the container it's on its own function is very easy to parallelize. |
@mbaijal Could you please look into comments provided by @larroy and @marcoabreu |
Thanks for your contributions @mbaijal @larroy , @marcoabreu could you please help with reviewing this PR? |
@mxnet-label-bot update [pr-awaiting-response] |
@mxnet-label-bot add [ci,docker] |
We are waiting for a response. |
@mbaijal could you check the CI build failure and add a comment explaining the changes you made/are making in response to Pedro and Marco's feedback? @mxnet-label-bot update [CI, Docker, pr-awaiting-testing] |
Given that we are taking over this script, shall we close this? |
@larroy - let's close it. I'll open a new PR after my RFC goes through |
@perdasilva it's meghnas PR, @mbaijal can you close? Thanks. |
I don't mind closing this but half of the code for this change (python2 and initial automation script) has already been merged in a previous PR and this PR contains some updates and fixes to it. This updated script was also used for the last two docker releases. In my opinion it might be better to merge this PR to complete those changes and then deprecate it when a new python script is available. @larroy Wdyt? |
@larroy Is it okay if we merge this PR? |
@mxnet-label-bot update [pr-awaiting-review] |
@larroy / @marcoabreu - I think we should take this PR forward. Can you please help take a look again? |
Please hold on with this PR since we are currently refactoring that pipeline. @perdasilva for further communication regarding this topic |
up to @perdasilva and @mbaijal please decide best path forward. |
@marcoabreu lets talk about this issue today and come up with the solution. If the PR does what someone needs right now, I don't see any reason to hold this back. We can always refactor it later if need be... |
@mbaijal @sandeep-krishnamurthy @srochel please merge this - thank you |
I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
Thanks @perdasilva ! |
…#12791) * Add dockerfiles for python3 * Fix * Fix the dockerfile names * Update the README with python3 images * Parameterize the mxnet version * Fix typo * Reduce the number of docker layers * Combine build and Test, run builds in parallel. Use variable names instead of numbers * minor fix * Update README for build command * Bug Fix: Script should fail if any background process returns non-zero code * Changes needed for Benchmarking, 4 new tags, change python to python-dev * Minor typos * Minor Rearrangement - sometimes tagging fails * Move error checking to a method * Push BAI images to dockerhub * Create ~/temp dir * Bug Fix * Fix merge issues I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
* Add dockerfiles for python3 * Fix * Fix the dockerfile names * Update the README with python3 images * Parameterize the mxnet version * Fix typo * Reduce the number of docker layers * Combine build and Test, run builds in parallel. Use variable names instead of numbers * minor fix * Update README for build command * Bug Fix: Script should fail if any background process returns non-zero code * Changes needed for Benchmarking, 4 new tags, change python to python-dev * Minor typos * Minor Rearrangement - sometimes tagging fails * Move error checking to a method * Push BAI images to dockerhub * Create ~/temp dir * Bug Fix * Fix merge issues I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
…#12791) * Add dockerfiles for python3 * Fix * Fix the dockerfile names * Update the README with python3 images * Parameterize the mxnet version * Fix typo * Reduce the number of docker layers * Combine build and Test, run builds in parallel. Use variable names instead of numbers * minor fix * Update README for build command * Bug Fix: Script should fail if any background process returns non-zero code * Changes needed for Benchmarking, 4 new tags, change python to python-dev * Minor typos * Minor Rearrangement - sometimes tagging fails * Move error checking to a method * Push BAI images to dockerhub * Create ~/temp dir * Bug Fix * Fix merge issues I don't know if this is tha main merge issue the PR is complaining about on mxnet, but maybe it will help
Description
@KellenSunderland @marcoabreu @gautamkmr
This PR updates the docker/docker-python folder to add 8 new dockerfiles for python3. It also updates the automated docker release script to build and update the python 3 docker images similar to python 2.
Python version used is 3.5 which is the default version on ubuntu 16.04.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes