Conversation

@mandresm (Contributor) commented Jun 11, 2025

Refactors the hetjob logic and options, and fixes the incorrect .run script generated when SLURM hetjobs are used (see #1340).

It deprecates the computer.taskset option, which is replaced by computer.hetjob_strategy. The hetjob_strategy values can be taskset, hetjob, or srunsteps. The different options result in these different .run scripts:
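For reference, selecting a strategy would then look roughly like this in the user's yaml (a sketch; the key name comes from this PR, the surrounding layout is illustrative):

```yaml
# Hypothetical runscript excerpt: replaces the deprecated computer.taskset
computer:
    hetjob_strategy: hetjob   # or: taskset, srunsteps
```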

hetjob (default, allows for heterogeneous compute resources)

...
#SBATCH --nodes=10
#SBATCH --partition=cpu-clx:test
#SBATCH hetjob
#SBATCH --nodes=2
#SBATCH --partition=cpu-clx:test
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --partition=cpu-clx:test
#SBATCH hetjob
#SBATCH --nodes=2
#SBATCH --partition=cpu-clx:test
...

time srun --mpi=pmix -l --kill-on-bad-exit=1 --cpu_bind=none \
--nodes=10 --ntasks=960 --ntasks-per-node=96 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./fesom \
: --nodes=2 --ntasks=192 --ntasks-per-node=96 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./oifs -v ecmwf -e awi3 \
: --nodes=1 --ntasks=1 --ntasks-per-node=1 --cpus-per-task=96 --export=ALL,OMP_NUM_THREADS=96 ./rnfma \
: --nodes=2 --ntasks=4 --ntasks-per-node=2 --cpus-per-task=32 --export=ALL,OMP_NUM_THREADS=32 ./xios.x  2>&1 &

taskset (#SBATCH --nodes still needs fixing, see #1148 (comment))

...
#SBATCH --nodes=15
...

#Creating hostlist for MPI + MPI&OMP heterogeneous parallel job
rm -f ./hostlist
export SLURM_HOSTFILE=/Users/mandresm/Work///hetjob_rework/run_20000101-20000131/work//hostlist
IFS=$'\n'; set -f
listnodes=($(< <( scontrol show hostnames $SLURM_JOB_NODELIST )))
unset IFS; set +f
rank=0
current_core=0
current_core_mpi=0
mpi_tasks_fesom=960
omp_threads_fesom=1
mpi_tasks_oifs=192
omp_threads_oifs=1
mpi_tasks_rnfmap=1
omp_threads_rnfmap=96
mpi_tasks_xios=4
omp_threads_xios=32
for model in fesom oifs rnfmap oasis3mct xios ;do
    eval nb_of_cores=\${mpi_tasks_${model}}
    eval nb_of_cores=$((${nb_of_cores}-1))
    for nb_proc_mpi in `seq 0 ${nb_of_cores}`; do
        (( index_host = current_core / 96 ))
        host_value=${listnodes[${index_host}]}
        (( slot =  current_core % 96 ))
        echo $host_value >> hostlist
        (( current_core = current_core + omp_threads_${model} ))
    done
done

time srun --mpi=pmix -l --kill-on-bad-exit=1 --cpu_bind=none --multi-prog hostfile_srun 2>&1 &
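The hostlist construction above boils down to mapping a running core counter onto a node index (current_core / 96). A self-contained sketch of that mapping, with a hypothetical 2-node machine and 4 cores per node instead of 96 (node names and model sizes are made up):

```python
# Simplified sketch of the hostlist loop in the generated script above,
# with a hypothetical 2-node, 4-cores-per-node machine instead of 96 cores.
CORES_PER_NODE = 4
nodes = ["node01", "node02"]          # stand-in for `scontrol show hostnames`
models = [                            # (name, mpi_tasks, omp_threads)
    ("modelA", 4, 1),                 # 4 pure-MPI tasks
    ("modelB", 2, 2),                 # 2 hybrid MPI+OpenMP tasks
]

hostlist = []
current_core = 0
for _name, mpi_tasks, omp_threads in models:
    for _ in range(mpi_tasks):
        # Same arithmetic as the shell loop: index_host = current_core / cores
        hostlist.append(nodes[current_core // CORES_PER_NODE])
        current_core += omp_threads   # hybrid tasks occupy several cores

print(hostlist)
# → ['node01', 'node01', 'node01', 'node01', 'node02', 'node02']
```

Each line of the resulting hostlist pins one MPI rank to a node via SLURM_HOSTFILE; the hostfile_srun file passed to --multi-prog (content not shown in the excerpt) is a separate file mapping rank ranges to executables.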

srunsteps (equivalent to how it's done with PBS+aprun on Aleph; does not allow for heterogeneous compute resources)

...
#SBATCH --nodes=15
...

time srun --mpi=pmix -l --kill-on-bad-exit=1 --cpu_bind=none \
--nodes=10 --ntasks=960 --ntasks-per-node=96 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./fesom \
: --nodes=2 --ntasks=192 --ntasks-per-node=96 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./oifs -v ecmwf -e awi3 \
: --nodes=1 --ntasks=1 --ntasks-per-node=1 --cpus-per-task=96 --export=ALL,OMP_NUM_THREADS=96 ./rnfma \
: --nodes=2 --ntasks=4 --ntasks-per-node=2 --cpus-per-task=32 --export=ALL,OMP_NUM_THREADS=32 ./xios.x  2>&1 &

TODO

  • Test in Lise (blogin)
  • Test in Levante
  • Clean modified functions
  • Write docstrings
  • Write documentation
  • Detect the deprecated taskset option and exit with an error clearly stating what the user needs to change in their yaml
  • Remove unset SLURM_* from any option other than taskset
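The deprecation check from the TODO list could look roughly like this (a sketch only; the function name and config layout are hypothetical, not the actual ESM-Tools code):

```python
# Hypothetical sketch of the planned deprecation check; not the actual
# ESM-Tools implementation.
import sys


def check_deprecated_taskset(computer_config: dict) -> None:
    """Abort with a message telling the user how to update their yaml."""
    if "taskset" in computer_config:
        sys.exit(
            "ERROR: `computer.taskset` is deprecated. Remove it from your "
            "yaml and set `computer.hetjob_strategy` to one of: "
            "taskset, hetjob, srunsteps."
        )


check_deprecated_taskset({"hetjob_strategy": "hetjob"})  # passes silently
```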

Closes #1340

Thanks to @ufukozkan for finding this problem and further investigating and to @christgau for investigating and providing the solution.

@mandresm (Contributor, Author)

Correcting myself here: I doubt that the hetjob option allows for a shared MPI_COMM_WORLD across different srun binaries, so I'm looking into srunmix as the default option.

@mandresm (Contributor, Author)

@JanStreffing, regarding my post above, I am assuming the errors I am seeing in the hetjob approach occur because these models share the same MPI_COMM_WORLD. But is that really the case, or does each of them have its own MPI_COMM_WORLD and use MPI_Comm_connect?

@JanStreffing (Contributor) commented Jun 12, 2025

I'm not sure how hetjob works. But I believe for taskset we let XIOS init MPI_COMM_WORLD for all models; XIOS then knows that OASIS is running and needs to split MPI_COMM_WORLD into the local communicators.

Each model then uses its local communicator instead of MPI_COMM_WORLD.

@mandresm (Contributor, Author)

Thanks @JanStreffing, that answers my question. I need to do a bit more exploration of the hetjob and srunmix options. Once I achieve some clarity I'll let you know. The new srunmix and taskset options work on Levante in this branch.

@mandresm (Contributor, Author)

This is a nice description of how hetjob can be used together with a single srun command, where each part separated by : in the srun command is allocated to a different het group. So in principle, the old ESM-Tools logic for hetjob should still work.

https://apps.fz-juelich.de/jsc/hps/juwels/modular-jobs.html
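That page also shows how to address individual components of a heterogeneous allocation from separate srun steps; a sketch (the --het-group flag comes from Slurm's heterogeneous-job support, where older releases called it --pack-group; the numbers echo the hetjob example above):

```
# Run a job step only on het group 1 of the allocation
# (the oifs component in the example above)
srun --het-group=1 --ntasks=192 ./oifs -v ecmwf -e awi3
```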

@mandresm (Contributor, Author)

> This is a nice description of how hetjob can be used together with a single srun command, where each part separated by : in the srun command is allocated to a different het group. So in principle, the old ESM-Tools logic for hetjob should still work.

And in fact, it still works on Levante.

@mandresm (Contributor, Author)

@ufukozkan, I am managing to get simulations started, but they are failing because I don't have read permissions on the inputs. Can you grant read access to them?

@mandresm (Contributor, Author)

Related to #1148

@ufukozkan (Collaborator)

Hi all,

> @ufukozkan, I am managing to get simulations started, but they are failing because I don't have read permissions on the inputs. Can you grant read access to them?

I gave the permissions on NHR. However, I only have the Arc01-TL255 coupling. I can upload CORE2 if you need it.

@christgau (Contributor)

> I gave the permissions on NHR.

Nitpicking, but maybe helpful for clarification: NHR is the national HPC alliance. There is no system named NHR, nor are the systems of the alliance unified. There are nine NHR centers offering individual HPC services under the NHR umbrella. What used to be HLRN is now NHR@ZIB and NHR@GWDG, and they operate independently, including their software stacks, such as Slurm and its configuration.

Apologies for possibly repeating already known stuff.


Development

Successfully merging this pull request may close these issues.

Slurm issue with different taskset options
