[Auto para] Relaunch with auto mapping function #37326
Merged
Changes from all commits (55 commits)
All 55 commits were authored by aoyulong:

68b2c10  [Auto Parallel] Add the unified cluster representation
70e188a  [Auto Parallel] Add the graph class for physical mapping
1a44c06  [Auto Parallel] Add the simple physical mapper
b00f5fb  Set the timeout of the mapper
76498be  Merge the upstream develop unittests cmake files
b8a7be4  Merge branch 'develop' into auto_para_mapping
9127177  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
2315472  Fix a bug of the process group
ab7bea4  Merge branch 'auto_para_mapping' of github.com:aoyulong/Paddle into a…
8f3b236  Remove mapper unittest from platforms which is not GPU
72086ae  Merge branch 'develop' into auto_para_mapping
95d6d3a  Move the instantiation of process group after resharding
6402759  Merge branch 'develop' into auto_para_mapping
6f1559d  Merge branch 'auto_para_mapping' of github.com:aoyulong/Paddle into a…
e50494f  Add the local id for devices
14be54b  Merge branch 'auto_para_cluster' into auto_para_mapping
0ccb242  Update the rank mapping format
4060856  [Auto Parallel] Relaunch with the rank mapping file
c287b5a  Merge branch 'develop' of github.com:aoyulong/Paddle into auto_para_l…
a0127f1  Remove the unnecessary json file
48936b8  Avoid entering get_device_proc_info for auto mapping
9cd37a6  Correct the mapper unit test
7349999  Add some comments
d8647be  Merge branch 'auto_para_cluster' into auto_para_mapping
11d41b4  Remove the related files about mapping
cb8de4c  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
f56cacf  Update the unittest for auto mapping
f36849e  Merge branch 'auto_para_mapping' into auto_para_launch
9cb742d  Merge branch 'develop' into auto_para_graph
cb9041a  Merge branch 'develop' into auto_para_graph
7b831ae  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
677d3e3  Remove unused rank_mapping unittest
dc2ba12  Improve the unittest coverage
5494547  Merge branch 'auto_para_graph' into auto_para_mapping
6c268b5  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
d56ebf8  Improve the unittest coverage
55870d2  Merge branch 'auto_para_mapping' into auto_para_launch
9e8cc18  Improve the unittest of relaunch
7b24059  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
e71ce76  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
fd8ff31  Fix the unittest problem in CI
df19fa2  Merge branch 'develop' into auto_para_launch
a65acab  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
8002a63  Merge branch 'auto_para_launch' of github.com:aoyulong/Paddle into au…
35828dd  Improve the unittest of relaunch
8d4199c  Remove unnecessary statements
d2e3737  Update the unittest cmakefile
3aef5c5  Correct the cmakefile of auto parallel unittests
6040fea  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
e746224  Modify codes based on the new elastic change
ce25444  Use the GPUs exclusively in the unittest
8706b24  Correct the cmakefile
9a23b7f  Set the timeout of the unittest
31ef42c  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
6db885e  Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
Changed hunks (argument parsing and cluster setup):

```diff
@@ -175,25 +175,17 @@ def _parse_args():
         default="127.0.0.1",
         help="Paddle cluster nodes ips, such as 192.168.0.16,192.168.0.17..")
     collective_group.add_argument(
-        "--rank_mapping_file",
-        type=argparse.FileType('r'),
-        default=sys.stdin,
-        help="This rank mapping information in json format is used specifically "
-        "for lazy launch for auto parallel. Some of the ranks in each node "
-        "may not be used, and the indices of rank should be kept the same "
-        "as the indices of sub-task splited by auto parallel. "
-        " { "
-        " \"ip_ranks\": [ "
-        " { "
-        " \"ip\": \"127.0.0.1\", "
-        " \"ranks\": [0,1] "
-        " }, "
-        " { "
-        " \"ip\": \"127.0.0.2\", "
-        " \"ranks\": [2,3,4] "
-        " } "
-        " ] "
-        " } ")
+        "--cluster_topo_path",
+        type=str,
+        default=None,
+        help="A json format file will be stored in this path which is used"
+        "to represent the cluster topology information for auto parallel.")
+    collective_group.add_argument(
+        "--rank_mapping_path",
+        type=str,
+        default=None,
+        help="A json format file will be stored in this path which is used"
+        "to map processes to machines for auto parallel.")
     collective_group.add_argument(
         "--enable_auto_mapping",
         type=bool,
```
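For readers skimming the hunk above, the parser change replaces the old --rank_mapping_file option with two path options, --cluster_topo_path and --rank_mapping_path, next to the existing --enable_auto_mapping switch. A minimal sketch of how these flags parse, using only standard-library argparse (the group title and sample values are illustrative, not taken from launch.py):

```python
import argparse

# Only the flag names below come from the diff; everything else is illustrative.
parser = argparse.ArgumentParser()
collective_group = parser.add_argument_group("Collective")
collective_group.add_argument("--cluster_topo_path", type=str, default=None)
collective_group.add_argument("--rank_mapping_path", type=str, default=None)
collective_group.add_argument("--enable_auto_mapping", type=bool, default=False)

args = parser.parse_args([
    "--cluster_topo_path", "./cluster_topo.json",
    "--enable_auto_mapping", "True",
])
print(args.cluster_topo_path)    # ./cluster_topo.json
print(args.rank_mapping_path)    # None (generated later unless given explicitly)
print(args.enable_auto_mapping)  # True; type=bool turns any non-empty string into True
```

Note the usual argparse quirk with type=bool: any non-empty value, including the string "False", parses as True.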
```diff
@@ -297,20 +289,56 @@ def cpuonly_check(args):
 def get_cluster_info(args):
     # parse arguments, used for cloud-single-machine and local
     if args.backend == 'gloo': cpuonly_check(args)
-    (device_mode, devices_per_proc) = launch_utils.get_device_proc_info(args)
+    if args.enable_auto_mapping:
+        (device_mode, devices_per_proc) = (DeviceMode.GPU, [])
+    else:
+        (device_mode,
+         devices_per_proc) = launch_utils.get_device_proc_info(args)
     trainers_num = cloud_utils.get_trainers_num()
     logger.debug("parsed from args trainerss_num:{} mode:{} devices:{}".format(
         trainers_num, device_mode, devices_per_proc))

     cuda_visible_devices = os.getenv("CUDA_VISIBLE_DEVICES")

     cluster = None
     pod = None

     start_port = 6170
     if os.environ.get('FLAGS_START_PORT') is not None:
         start_port = os.environ.get('FLAGS_START_PORT')
-    # lazy launch for auto-parallel
+    # auto mapping between processes and devices for auto-parallel
     if args.enable_auto_mapping == True:
-        cluster, pod = get_mapped_cluster_from_args(args, device_mode)
+        assert args.cluster_topo_path is not None, \
+            "The cluster topology must be provied when enabling auto mapping."
+        rank_mapping_path = args.rank_mapping_path or os.getenv(
+            "PADDLE_RANK_MAPPING_PATH")
+        if not rank_mapping_path:
+            os.environ["PADDLE_NEED_RANK_MAPPING"] = str(True)
+            os.environ["PADDLE_ENABLE_ELASTIC"] = str(
+                enable_elastic(args, device_mode))
+            cwd = pathlib.Path().resolve()
+            rank_mapping_path = os.path.join(cwd,
+                                             "auto_parallel_rank_mapping.json")
+            os.environ["PADDLE_RANK_MAPPING_PATH"] = str(rank_mapping_path)
+
+            original_args = sys.argv[1:]
+            os.environ["PADDLE_ORIGINAL_CMD_ARGS"] = " ".join(original_args)
```
Comment on lines +324 to +325

Review comment: this part looks fragile

Reply: This has been dealt with by shlex.split on line 154 of the above parallelizer.py.
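To make the exchange above concrete: storing the original argv as a single space-joined string and later recovering it with shlex.split works for simple flag/value pairs, but arguments containing spaces need quoting. A small standard-library sketch (the sample arguments are hypothetical, and this is not a quote from parallelizer.py):

```python
import shlex

# What the launcher stores, per the diff: the original argv joined with single spaces.
original_args = ["--cluster_topo_path", "./cluster_topo.json", "--enable_auto_mapping", "True"]
env_value = " ".join(original_args)

# What a consumer can recover with shlex.split.
assert shlex.split(env_value) == original_args

# The caveat behind the "fragile" remark: an argument containing spaces only
# survives the round trip if it is quoted before joining.
tricky = ["--log_dir", "my logs"]
assert shlex.split(" ".join(tricky)) == ["--log_dir", "my", "logs"]       # broken
assert shlex.split(" ".join(shlex.quote(a) for a in tricky)) == tricky    # preserved
```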
The same hunk continues below the comment:

```diff
+            os.environ["PADDLE_CLUSTER_TOPO_PATH"] = str(args.cluster_topo_path)
+            os.environ["PADDLE_ENABLE_AUTO_MAPPING"] = str(
+                args.enable_auto_mapping)
+            cluster, pod = launch_utils.get_mapped_cluster_from_args_without_rank_mapping(
+                args, device_mode)
+        else:
+            os.environ["PADDLE_NEED_RANK_MAPPING"] = str(False)
+            os.environ["PADDLE_ENABLE_ELASTIC"] = str(
+                enable_elastic(args, device_mode))
+
+            os.environ["PADDLE_CLUSTER_TOPO_PATH"] = str(args.cluster_topo_path)
+            os.environ["PADDLE_RANK_MAPPING_PATH"] = str(rank_mapping_path)
+            os.environ["PADDLE_ENABLE_AUTO_MAPPING"] = str(
+                args.enable_auto_mapping)
+            cluster, pod = launch_utils.get_mapped_cluster_from_args_with_rank_mapping(
+                args, device_mode)
     elif cloud_utils.use_paddlecloud() and trainers_num != 1:
         cluster, pod = cloud_utils.get_cloud_cluster(
             args.ips, device_mode, devices_per_proc, start_port)
```
```diff
@@ -328,6 +356,7 @@ def get_cluster_info(args):
     logger.debug("get cluster from args:{}".format(cluster))
     return cluster, pod

+
 def get_global_envs(args, tmp_dir):
     global_envs = copy.copy(os.environ.copy())
     # add gloo env
```
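Taken together, the hunks implement a two-pass relaunch: the first pass runs without a rank mapping file, records everything needed to relaunch in environment variables, and lets the pre-launch auto-parallel analysis write auto_parallel_rank_mapping.json; the second pass finds that file (or an explicit --rank_mapping_path) and launches processes according to it. The sketch below restates that control flow in isolation; only the PADDLE_* variable names and the default file name come from the diff, while the function and its return values are placeholders, not Paddle APIs:

```python
import os
import pathlib
import sys

def plan_auto_mapping_launch(args):
    """Condensed, illustrative restatement of the auto-mapping branch in get_cluster_info."""
    assert args.cluster_topo_path is not None, "cluster topology is required for auto mapping"

    rank_mapping_path = args.rank_mapping_path or os.getenv("PADDLE_RANK_MAPPING_PATH")
    os.environ["PADDLE_CLUSTER_TOPO_PATH"] = str(args.cluster_topo_path)
    os.environ["PADDLE_ENABLE_AUTO_MAPPING"] = str(args.enable_auto_mapping)

    if not rank_mapping_path:
        # First pass: no rank mapping yet. Remember where it should be written and
        # keep the original command line so the job can relaunch itself later.
        rank_mapping_path = os.path.join(pathlib.Path().resolve(),
                                         "auto_parallel_rank_mapping.json")
        os.environ["PADDLE_NEED_RANK_MAPPING"] = str(True)
        os.environ["PADDLE_RANK_MAPPING_PATH"] = str(rank_mapping_path)
        os.environ["PADDLE_ORIGINAL_CMD_ARGS"] = " ".join(sys.argv[1:])
        return "build_cluster_without_rank_mapping"  # stands in for the *_without_* helper

    # Second pass: a rank mapping file exists, so launch processes according to it.
    os.environ["PADDLE_NEED_RANK_MAPPING"] = str(False)
    os.environ["PADDLE_RANK_MAPPING_PATH"] = str(rank_mapping_path)
    return "build_cluster_with_rank_mapping"  # stands in for the *_with_* helper
```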
Review comment: Adding one more config file is expensive; is it possible to use one xxx_config to hold everything, maybe a paddle_config where you keep some sections?

Reply: The rank_mapping file will be automatically generated by our framework in the pre-launch analysis pass and must not be exposed to users.