-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SDK]Support Docker image as objective in the tune API #2338
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: akhilsaivenkata <akhilammu1@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @andreyvelich , I made the changes to katib_client.py. Do we have any test cases that need to be updated or implemented for this change? I also wanted to test these new changes on my local machine so I ran "make check" & "make test" commands on my local machine and there were no failures. I wonder if you could see and suggest any action items that needs to taken up here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this @akhilsaivenkata!
I left a few comments.
parameters: Dict[str, Any], | ||
base_image: str = constants.BASE_IMAGE_TENSORFLOW, | ||
#base_image: str = constants.BASE_IMAGE_TENSORFLOW, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should keep the base_image
, since we use it when user set objective
as train function.
@@ -400,12 +407,12 @@ def tune( | |||
trial_template = models.V1beta1TrialTemplate( | |||
primary_container_name=constants.DEFAULT_PRIMARY_CONTAINER_NAME, | |||
retain=retain_trials, | |||
trial_parameters=trial_params, | |||
trial_parameters=trial_params if callable(objective) else [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trial_parameters
still be required even when user sets Docker image.
You can check example here: /~https://github.com/kubeflow/katib/blob/master/examples/v1beta1/hp-tuning/random.yaml#L31-L36
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure @andreyvelich , I will revert this change and keep the trial_parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I would like to know if we need to write new unit test cases or change existing ones ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akhilsaivenkata Yes, since we merge this PR: #2325, please add unit test for tune
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @tariq-hasan
input_params = {} | ||
experiment_params = [] | ||
trial_params = [] | ||
base_image = constants.BASE_IMAGE_TENSORFLOW, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @andreyvelich , i did keep the base_image here at this line, I have added it in the if block so this code change got mixed up with all the other lines
@akhilsaivenkata Please rebase your PR. |
trial_spec=trial_spec, | ||
) | ||
|
||
# Add parameters to the Katib Experiment. | ||
experiment.spec.parameters = experiment_params | ||
experiment.spec.parameters = experiment_params if callable(objective) else [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think parameters
field is also needed since trial_paramaters
is still required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right @Electronic-Waste , I have just reverted these two code changes and pushed it now.
Signed-off-by: akhilsaivenkata <akhilammu1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks great. Thank you @akhilsaivenkata ! I left a question for you and @andreyvelich .
command=["bash", "-c"] if callable(objective) else None, | ||
args=[exec_script] if callable(objective) else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I'm not sure if we can assign None
to command
and args
here when we use Docker image as objective.
As @andreyvelich shows an example for us, we sometimes need to pass command
and args
to the training container to execute python scripts with some parameters.
Could you explain your idea in details so that I can understand more? WDYT👀 @akhilsaivenkata @andreyvelich
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, initially we can just allow user to set image
as objective
without command
and args
.
Similar to how we allow create training job using base_image
parameter: /~https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L327C35-L327C45.
Hi @akhilsaivenkata, did you get a chance to finish this PR ? |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
/remove-lifecycle stale |
@andreyvelich is it fine if i start working on this issue? |
Hi @mahdikhashan , thanks for offering to help with this PR! I've been out for a bit, but now that I'm back, I'd like to revisit this PR myself. A lot has changed since I originally submitted it, and I want to make sure it still fits with our current project goals. |
Hi @andreyvelich , As there are many changes to this tune API in meantime, Is it okay if we introduce a new parameter, say objective_docker_image, rather than overloading objective to handle both a callable and a string? I think this flow would be good:
We would add simple validation:
This approach keeps things explicit and avoids confusion about whether objective is a function or a Docker image. Please let me know your thoughts on whether this design aligns with this API. |
@akhilsaivenkata Maybe we can use the |
@andreyvelich , Yes we can just reuse the base_image parameter for consistency with Kubeflow Training V1. Previously, when objective_func was provided, trial parameters like ${trialParameters.} were dynamically injected into the function at runtime, either as environment variables or command-line arguments, ensuring seamless hyperparameter substitution. If we use only a Docker image as input, I believe we should plan to inject these trial parameters as environment variables/cmd line arguments into the container. This would require users to ensure their training scripts inside their custom docker image retrieve these variables using methods like os.getenv. We also need to highlight this in the documentation to ensure user awareness. What are your thoughts on this approach? This might be a bit overhead on the users end though.. |
What this PR does / why we need it: Supporting Docker image as an objective in the tune API.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #2326
Checklist: