Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: restarting #504

Merged
merged 47 commits into from
Feb 4, 2025
Merged

fix: restarting #504

merged 47 commits into from
Feb 4, 2025

Conversation

jmbuhr
Copy link
Collaborator

@jmbuhr jmbuhr commented Jan 7, 2025

New/Fixed

Restarting or continuing from a previous kimmdy run instead of incrementing the number of the output directory can be triggered by either passing the --restart or -r flag on the cli or setting restart: true in the yml.

valid restart points are (the latest is used):

  • after setup
  • during md task that has already produced a cpt file
  • after completed md task

kimmdy --generate-jobscript is update accordingly.

Trajectories are not truncated during a kimmdy run. Instead, a <name>_reaction.gro file is written next to the trajectory with the state at the chosen reaction time (which is also written als a .kimmdy_reaction_time marker file).
The analysis tools handle these files and do truncation if required.

misc changes

  • improved ergonomics of analysis tools
    • dir is automatically taken from input config file if not explicitly set, so e.g. kimmdy-analysis energy is enough
    • added short options for opening -o analysis output
  • new dumy_first kmc option that always chooses the first recipe regardless of the rates for debugging
  • new config.slurm.runcmd (default sbatch) option to be put into the jobscript re-submission section. Can be set to an empty string "" for local debugging.
  • logging is de-cluttered by shortening full file paths to the name of the task directory when required. The full cwd and output dir are only printed at the start and end of a run.
  • removed deprecated options in examples
  • removed leftover print statements
  • fixed gh ci (publish workflow)

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jan 17, 2025

We don't allow max_hours to be a floating point number, but just for reference I briefly changed that and tested the restarting with a timeout of 3 seconds using this jobscript:

#!/bin/env bash
#SBATCH --job-name=alanine_hat_000
#SBATCH --output=alanine_hat_000-job.log
#SBATCH --error=alanine_hat_000-job.log
#SBATCH --time=0.001:00:00
#SBATCH -N=1
#SBATCH --ntasks-per-node=20
#SBATCH --mincpus=20
#SBATCH --exclusive
#SBATCH --cpus-per-task=1
#SBATCH --gpus=1
#SBATCH --mail-type=ALL
# #SBATCH -p <your-partition>.p
# #SBATCH --mail-user=<your-email>


# Setup up your environment here
# modules.sh might load lmod modules, set environment variables, etc.
if [ -f ./_modules.sh ]; then
    source ./_modules.sh
fi

CYCLE=0.001
CYCLE_SECONDS=$(echo "$CYCLE * 3600/1" | bc)
CYCLE_buffered=$(echo "scale=2; $CYCLE_SECONDS - $CYCLE_SECONDS*0.1/1" | bc)


START=$(date +"%s") # in s

timeout ${CYCLE_buffered}s kimmdy -i kimmdy.yml --restart

END=$(date +"%s") # in s

LEN=$((END-START))
HOURS=$((LEN/3600))

echo "$LEN seconds ran"
echo "$HOURS full hours ran"

if [ $LEN -lt $CYCLE_SECONDS ]; then
  echo "last cycle was just $HOURS h long, KIMMDY is done."
  exit 3
else
  echo "jobscript resubmitting"
   ./jobscript.sh
  exit 2
fi

and it works. Expectedly you can create an infinite loop by making the timeout shorter than kimmdy takes for the steps until the next valid restart point.

@jmbuhr jmbuhr marked this pull request as ready for review January 17, 2025 16:05
@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jan 30, 2025

moved --input to each submodule. Also, get_task_directories now ignores directories that are not clearly a kimmdy task (start with a digit and have a _), so your previous case where you had a run within a run would no longer crash.

@jmbuhr jmbuhr requested a review from ehhartmann January 30, 2025 12:04
@ehhartmann
Copy link
Collaborator

If you think there is a real risk the current gro file does not contain all system atoms, we could also use the tpr for gmx trjconv -s and the index file.
Also it would be cleaner if gro_reaction and trr_reaction would get the filestem from the xtc or trr file that was used for converting.

Eric Hartmann and others added 2 commits January 30, 2025 18:50
and improve logging of the filehistory, which is now used in a unit test
for said feature
@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jan 31, 2025

lt would be cleaner if gro_reaction and trr_reaction would get the filestem from the xtc or trr file that was used for converting.

You mean as their filename? They do that. e.g. a reaction happening on top of equilibrium.xtc will be equilibrium_reaction.gro.

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Jan 31, 2025

The issue with using the wrong coordinates files after apply_recipe (because the runmanager would re-discover the output files of apply_recipe, which overwrites output files created by those ephemeral tasks like Relax and Place) as being the latest should be fixed now. I also included an integration test that validates that based on the kimmdy history file that documents after each task what files where available (and created) by the task.

Can you think through that as well to make sure it's sound?

@ehhartmann
Copy link
Collaborator

lt would be cleaner if gro_reaction and trr_reaction would get the filestem from the xtc or trr file that was used for converting.

You mean as their filename? They do that. e.g. a reaction happening on top of equilibrium.xtc will be equilibrium_reaction.gro.

in write_coordinate_files_at_reaction_time we run run_gmx( f"echo '0' | gmx trjconv -f {files.input['xtc']} -s {gro} -b {time} -dump {time} -o {gro_reaction}" ) with gro_reaction = gro.with_name(gro.stem + f"_reaction.gro") and gro = files.input["gro"], so the file stem is inherited from the gro file, not xtc or trr

@jmbuhr
Copy link
Collaborator Author

jmbuhr commented Feb 4, 2025

lt would be cleaner if gro_reaction and trr_reaction would get the filestem from the xtc or trr file that was used for converting.

You mean as their filename? They do that. e.g. a reaction happening on top of equilibrium.xtc will be equilibrium_reaction.gro.

in write_coordinate_files_at_reaction_time we run run_gmx( f"echo '0' | gmx trjconv -f {files.input['xtc']} -s {gro} -b {time} -dump {time} -o {gro_reaction}" ) with gro_reaction = gro.with_name(gro.stem + f"_reaction.gro") and gro = files.input["gro"], so the file stem is inherited from the gro file, not xtc or trr

But the name of the gro file is the same as the xtc and trr, because both come from the name of the MD instance.

@ehhartmann
Copy link
Collaborator

I think using the trajectory file stem is more robust than using the gro stem which should be the same if everything goes well.

@ehhartmann
Copy link
Collaborator

The issue with using the wrong coordinates files after apply_recipe (because the runmanager would re-discover the output files of apply_recipe, which overwrites output files created by those ephemeral tasks like Relax and Place) as being the latest should be fixed now. I also included an integration test that validates that based on the kimmdy history file that documents after each task what files where available (and created) by the task.

Can you think through that as well to make sure it's sound?

the current treatment looks good! I created an issue for cleaning up the subtask handling

@ehhartmann
Copy link
Collaborator

I found another bug with restarting pulling simulations. Generating the rotref files again during grompp fails (some gromacs bug), so I made kimmdy skip grompp for continued md which should have no downside. All tests are passing with the current commit:

======= 113 passed, 3 warnings in 342.43s (0:05:42) ==========

@jmbuhr jmbuhr merged commit b711668 into main Feb 4, 2025
1 check passed
@jmbuhr jmbuhr deleted the fix/restarting branch February 4, 2025 18:15
jmbuhr added a commit that referenced this pull request Feb 4, 2025
  test if this triggers the other conventional
  commit messages in this commit

feat: pass restart flag via config for cli

fix: autofilldict (files.input) should return None on missing

refactor: simplify get_task_directories

feat: exit early if run already finished

chore: add grappa to dev requirements

feat: add config.slurm.runcmd to replace sbatch in jobscript for local testing

feat(analysis): get dir from kimmdy.yml input file if not specified in analysis

feat: dummy_first kmc method that ignores rates
  and always chooses the first recipe for debugging

chore: fix gh ci (publish workflow is not allowed to be re-usable workflow)

fix: fully specify --nodes for slurm

fix: silently ignore directories in the kimmdy output folder that don't match
  the kimmdy task naming scheme

fix: clean up cli arguments

fix: handle latest files in intermediate tasks like relax and place
  and improve logging of the filehistory, which is now used in a unit test
  for said feature

fix: do not run grompp for checkpoint restarts

---------

Co-authored-by: Eric Hartmann <hartmaec@rh05659.villa-bosch.de>
jmbuhr added a commit that referenced this pull request Feb 4, 2025
  test if this triggers the other conventional
  commit messages in this commit

feat: pass restart flag via config for cli

fix: autofilldict (files.input) should return None on missing

refactor: simplify get_task_directories

feat: exit early if run already finished

chore: add grappa to dev requirements

feat: add config.slurm.runcmd to replace sbatch in jobscript for local testing

feat(analysis): get dir from kimmdy.yml input file if not specified in analysis

feat: dummy_first kmc method that ignores rates
  and always chooses the first recipe for debugging

chore: fix gh ci (publish workflow is not allowed to be re-usable workflow)

fix: fully specify --nodes for slurm

fix: silently ignore directories in the kimmdy output folder that don't match
  the kimmdy task naming scheme

fix: clean up cli arguments

fix: handle latest files in intermediate tasks like relax and place
  and improve logging of the filehistory, which is now used in a unit test
  for said feature

fix: do not run grompp for checkpoint restarts

---------

Co-authored-by: Eric Hartmann <hartmaec@rh05659.villa-bosch.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants