mpirun can mangle tagged output lines, so use heuristics to fix that. #510

hategan · 2025-02-28T00:46:03Z

Of course, the saga is not over.

When specifying --tag-output, mpirun is supposed to "tag each line" with [jobid, rank]<stdxxx>:. It mostly does. However,
it occasionally does something else. Assume that a.txt and b.txt contain ABCD and EFGH, respectively. Running
mpirun --tag-output -n 1 cat a.txt b.txt mostly produces

[1,0]<stdout>:ABCDEFGH

Occasionally, the following shows up instead:

[1,0]<stdout>:ABCD[1,0]<stdout>:EFGH

That is indistinguishable from b.txt having contained [1,0]<stdout>:EFGH. My guess is that this is caused by the brief delay that cat introduces between files. This can be verified by passing more files to cat and watching all kinds of tag combinations pop up in the middle of a line.

One solution is to use heuristics: consider an output line to begin with the tag, and assume that it is very unlikely for the application to produce the tag in the middle of a line. Hence, we can filter on lines that start with the tag and then remove any other tags that appear in the middle. This should significantly reduce the likelihood of random mishaps, but it transforms them into less likely but deterministic mishaps (e.g., running echo "[1, 0]<stdout>:bla" through mpirun).
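To make the heuristic concrete, here is a minimal sketch in Python. It is not the code in this PR (which lives in the shell launcher); the `TAG` pattern and the `unmangle` function are hypothetical names, and the tag shape is assumed from the examples above.

```python
import re

# Assumed tag shape, based on the examples above: [jobid,rank]<stdout>: or <stderr>:
TAG = re.compile(r"\[\d+,\d+\]<std(?:out|err)>:")


def unmangle(lines):
    """Keep only lines that start with a tag, then drop every tag in them."""
    result = []
    for line in lines:
        if TAG.match(line) is None:
            # No tag at the start: treat the line as mpirun's own output and skip it.
            continue
        # Remove the leading tag and any tag that shows up mid-line.
        result.append(TAG.sub("", line))
    return result


if __name__ == "__main__":
    for cleaned in unmangle([
        "some mpirun diagnostic message",
        "[1,0]<stdout>:ABCD[1,0]<stdout>:EFGH",
    ]):
        print(cleaned)
```

Under these assumptions, the mangled line from above comes out as ABCDEFGH. The deterministic mishap is also visible: an application that really prints a tag-like string in its output will have it silently stripped.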

Another choice is --xml. Unfortunately, parsing XML with POSIX tools only is difficult and many simplifying assumptions are made. Nonetheless, that branch appears to work fine with OpenMPI 4, so, perhaps, the loss in clarity might not outweigh the benefits.


andre-merzky · 2025-02-28T10:37:25Z

A general remark: launching tasks on HPC systems poses a really large and complex problem space, and it does require complex solutions. I am not sure, though, that adding solution complexity to a single shell script is a viable route in the long run. Don't get me wrong - I love shell scripting for its directness, performance and conciseness - but maintainability and readability are not features of shell, and slowly growing the launcher into a single, non-modular and / or large shell script is not a route I would recommend, really.

Obviously I am biased, as we have been there and done that also in RP ;-) Our approach at the moment is to shove the complexities into modular python code and to generate small, readable and self-contained shell scripts on the fly. I have been wondering for quite some time whether we should try to extract that code from RP, remove all dependencies, and make it usable for psi/j. I'd love to have a discussion about that at some point...

On to the problem at hand: Yes, I agree, finding the tag in the middle of a line is unlikely. But even so, it remains messy. Is it worth the effort? The original motivation was that mpirun produces various diagnostic output along with the application stdout, and the aim was to filter that out. Well, any user running natively on that machine would also see that output - so psi/j is trying to improve over the system's native behavior (https://xkcd.com/1172/). I understand (and actually share) the sentiment, but the complexity tradeoff might not be worth it.

Having said all that: the code seems to be correct and seems to address the stated problem, so I'd probably approve the PR ;-)

PS.: XML? I would happily avoid that bottomless pit...

mpirun can mangle tagged output lines, so use heuristics to unmangle them. (0a62a3c)
