[Auto para] Relaunch with auto mapping function #37326
Conversation
Thanks for your contribution!
original_args = sys.argv[1:]
os.environ["PADDLE_ORIGINAL_CMD_ARGS"] = " ".join(original_args)
this part looks fragile
This has been dealt with by shlex.split on line 154 of the above parallelizer.py.
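For context, here is a minimal sketch (not the PR's code) of why a plain " ".join of argv is fragile and how quoting on the writing side plus shlex.split on the reading side round-trips arguments that contain spaces. Only the PADDLE_ORIGINAL_CMD_ARGS name comes from the diff above; everything else is illustrative.

import os
import shlex

# Stand-in for sys.argv[1:]; one argument contains a space on purpose.
original_args = ["train.py", "--comment", "two words"]

# Naive join: the spaced argument is split into two tokens when re-parsed.
os.environ["PADDLE_ORIGINAL_CMD_ARGS"] = " ".join(original_args)
print(os.environ["PADDLE_ORIGINAL_CMD_ARGS"].split())
# -> ['train.py', '--comment', 'two', 'words']

# Quoting before the join lets shlex.split recover the original tokens.
os.environ["PADDLE_ORIGINAL_CMD_ARGS"] = " ".join(
    shlex.quote(a) for a in original_args)
print(shlex.split(os.environ["PADDLE_ORIGINAL_CMD_ARGS"]))
# -> ['train.py', '--comment', 'two words']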
"--cluster_topo_path", | ||
type=str, | ||
default=None, | ||
help="A json format file will be stored in this path which is used" | ||
"to represent the cluster topology information for auto parallel.") | ||
collective_group.add_argument( | ||
"--rank_mapping_path", | ||
type=str, | ||
default=None, | ||
help="A json format file will be stored in this path which is used" | ||
"to map processes to machines for auto parallel.") |
Adding one more config file is expensive. Is it possible to use a single xxx_config to hold everything? Maybe a paddle_config in which you own some sections?
The rank_mapping file will be automatically generated by our framework in the pre-launch analysis pass and must not be exposed to users.
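For illustration only, a sketch of how a launcher could write the auto-generated rank mapping file and read it back before relaunching. The JSON layout shown here is hypothetical and is not the format defined by this PR.

import json
import tempfile

# Hypothetical rank-to-machine mapping; the real format is produced by the
# framework's pre-launch analysis pass and is not part of this sketch.
rank_mapping = {
    "machine_0": {"addr": "127.0.0.1", "ranks": [0, 1]},
    "machine_1": {"addr": "127.0.0.2", "ranks": [2, 3]},
}

# The launcher writes the file to the path given by --rank_mapping_path ...
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(rank_mapping, f)
    rank_mapping_path = f.name

# ... and the relaunch step reads it back to place processes on machines.
with open(rank_mapping_path) as f:
    loaded = json.load(f)
assert loaded["machine_1"]["ranks"] == [2, 3]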
LGTM
LGTM for set_tests_properties(test_auto_parallel_relaunch PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 120)
LGTM
* [Auto Parallel] Add the unified cluster representation
* [Auto Parallel] Add the graph class for physical mapping
* [Auto Parallel] Add the simple physical mapper
* Set the timeout of the mapper
* Merge the upstream develop unittests cmake files
* Fix a bug of the process group
* Remove mapper unittest from platforms which is not GPU
* Move the instantiation of process group after resharding
* Add the local id for devices
* Update the rank mapping format
* [Auto Parallel] Relaunch with the rank mapping file
* Remove the unnecessary json file
* Avoid entering get_device_proc_info for auto mapping
* Correct the mapper unit test
* Add some comments
* Remove the related files about mapping
* Update the unittest for auto mapping
* Remove unused rank_mapping unittest
* Improve the unittest coverage
* Improve the unittest coverage
* Improve the unittest of relaunch
* Fix the unittest problem in CI
* Improve the unittest of relaunch
* Remove unnecessary statements
* Update the unittest cmakefile
* Correct the cmakefile of auto parallel unittests
* Modify codes based on the new elastic change
* Use the GPUs exclusively in the unittest
* Correct the cmakefile
* Set the timeout of the unittest
PR types
New features
PR changes
Others
Describe
This PR relaunches the distributed training based on the rank mapping file produced by the auto mapping function, so the job is launched twice: the first launch runs the pre-launch analysis that generates the rank mapping file, and the second launch starts the training processes according to that mapping.
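As a rough illustration of that two-stage pattern (not the launcher implementation added by this PR), the sketch below relaunches itself once after a stand-in analysis step; analyze_cluster and the RELAUNCHED flag are hypothetical names, while PADDLE_ORIGINAL_CMD_ARGS comes from the diff discussed above.

import os
import shlex
import subprocess
import sys


def analyze_cluster(rank_mapping_path):
    # Stand-in for the pre-launch analysis pass that writes the
    # rank mapping JSON file to rank_mapping_path.
    with open(rank_mapping_path, "w") as f:
        f.write("{}")


def main():
    if os.environ.get("RELAUNCHED") != "1":
        # First launch: run the analysis, then relaunch with the
        # original command-line arguments preserved in an env var.
        analyze_cluster("rank_mapping.json")
        original_args = shlex.split(
            os.environ.get("PADDLE_ORIGINAL_CMD_ARGS", ""))
        env = dict(os.environ, RELAUNCHED="1")
        subprocess.check_call(
            [sys.executable, sys.argv[0]] + original_args, env=env)
    else:
        # Second launch: the training processes start here, placed
        # according to the rank mapping produced above.
        print("training with rank mapping")


if __name__ == "__main__":
    main()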