mpirun can mangle tagged output lines, so use heuristics to fix that. #510

Open
wants to merge 1 commit into
base: main

Collaborator

@hategan hategan commented Feb 28, 2025

Of course, the saga is not over.

When --tag-output is specified, mpirun is supposed to "tag each line" with [jobid, rank]<stdxxx>:. It mostly does. However, it occasionally does something else. Assume that a.txt and b.txt contain ABCD and EFGH, respectively. Running mpirun --tag-output -n 1 cat a.txt b.txt mostly produces

[1,0]<stdout>:ABCDEFGH

Occasionally, the following shows up instead:

[1,0]<stdout>:ABCD[1,0]<stdout>:EFGH

That is indistinguishable from b.txt having contained [1,0]<stdout>:EFGH. My guess is that this is caused by a brief delay that cat introduces between the files. This can be verified by passing more files to cat and watching all kinds of combinations of tags pop up in the middle of a line.

One solution is to use heuristics: assume that an output line begins with the tag and that the application is very unlikely to produce the tag in the middle of a line. Hence, we can filter on lines that start with the tag and then remove any other tags that appear in the middle. This should significantly reduce the likelihood of random mishaps, but it trades them for less likely, deterministic mishaps (e.g., running echo "[1, 0]<stdout>:bla" through mpirun).
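
To make the heuristic concrete, here is a minimal sketch in Python. It is not the code in this PR (the launcher itself is a shell script); the name `untag` and the exact tag pattern are assumptions made for illustration, based only on the tag shape shown above.

```python
import re

# Assumed tag shape, based on the examples above: "[jobid,rank]<stdout>:"
# or "[jobid,rank]<stderr>:", optionally with a space after the comma.
TAG = re.compile(r"\[\d+,\s?\d+\]<std(?:out|err)>:")


def untag(lines):
    """Keep only lines that start with a tag, then strip every tag from them.

    Hypothetical helper: lines without a leading tag are assumed to be
    mpirun diagnostics and are dropped.
    """
    result = []
    for line in lines:
        if TAG.match(line) is None:
            # No tag at the start of the line: not application output.
            continue
        # Remove the leading tag and any tags that appear mid-line.
        result.append(TAG.sub("", line))
    return result
```

With this sketch, the mangled line `[1,0]<stdout>:ABCD[1,0]<stdout>:EFGH` becomes `ABCDEFGH`, while an application that really prints `[1, 0]<stdout>:bla` would deterministically get `bla` back, which is the mishap mentioned above.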

Another choice is --xml. Unfortunately, parsing XML with POSIX tools only is difficult, and many simplifying assumptions have to be made. Nonetheless, that branch appears to work fine with OpenMPI 4, so perhaps the loss in clarity does not outweigh the benefits.

@andre-merzky
Collaborator

A general remark: launching tasks on HPC systems poses a really large and complex problem space, and it does require complex solutions. I am not sure, though, that adding solution complexity to a single shell script is a viable route in the long run. Don't get me wrong - I love shell scripting for its directness, performance and conciseness - but maintainability and readability are not features of shell, and slowly growing the launcher into a single, non-modular and / or large shell script is not a route I would recommend, really.

Obviously I am biased, as we have been there and done that in RP, too ;-) Our approach at the moment is to shove the complexities into modular Python code and to generate small, readable and self-contained shell scripts on the fly. I have been wondering for quite some time whether we should try to extract that code from RP, remove all dependencies, and make it usable for psi/j. I'd love to have a discussion about that at some point...

On to the problem at hand: Yes, I agree, finding the tag in the middle of a line is unlikely. But even so, it remains messy. Is it worth the effort? The original motivation was that mpirun produces various diagnostic output along with the application stdout, and that we want to filter that out. Well, any user running natively on that machine would also see that output - so psi/j is trying to improve on the system's native behavior (https://xkcd.com/1172/). I understand (and actually share) the sentiment, but the complexity tradeoff might not be worth it.

Having said all that: the code seems to be correct and seems to address the stated problem, so I'd probably approve the PR ;-)

PS: XML? I would happily avoid that bottomless pit...

2 participants