feat/refactor_hetjobs #1364
base: release
…s taskset, srunmix and hetjob with all their functionality
|
Correcting myself here, I doubt that the |
|
@JanStreffing, regarding my post above, I am assuming the errors I am seeing in the |
|
I'm not sure how hetjob works, but I believe that for taskset we let XIOS initialize MPI_COMM_WORLD for all models. XIOS then knows that OASIS is running and that it needs to split MPI_COMM_WORLD into the local communicators. Each model then uses its local communicator instead of MPI_COMM_WORLD. |
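To make the splitting idea concrete, here is a minimal pure-Python model of what a color-based communicator split does. It is only an illustration under stated assumptions: it does not call MPI, XIOS or OASIS, the function `split_world` and the component names `model_a`, `model_b` and `xios_server` are hypothetical, and the ordering key is assumed to be the world rank.

```python
from collections import defaultdict


def split_world(colors):
    """Model a color-based split of a world communicator.

    colors[i] is the color chosen by world rank i. Ranks that share a color
    end up in the same local communicator and are renumbered from 0 in
    world-rank order (assumption: the split key is the world rank).
    Returns {world_rank: (color, local_rank)}.
    """
    groups = defaultdict(list)
    for world_rank, color in enumerate(colors):
        groups[color].append(world_rank)

    return {
        world_rank: (color, local_rank)
        for color, members in groups.items()
        for local_rank, world_rank in enumerate(members)
    }


# Hypothetical layout: 4 ranks for one model, 2 for another, 1 I/O server.
layout = ["model_a"] * 4 + ["model_b"] * 2 + ["xios_server"]
local = split_world(layout)

for world_rank in sorted(local):
    color, local_rank = local[world_rank]
    print(f"world rank {world_rank} -> {color} local rank {local_rank}")
```

Each component only ever sees its own local rank numbering, which is why a model can run on the local communicator as if it had been started on its own.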
|
Thanks @JanStreffing, that answers my question. I need to do a bit more of exploration on the |
|
This is a nice description on how hetjob can be used together with a single srun command, where each part separated by |
And in fact, still works in |
|
@ufukozkan, I am managing to get simulations started, but they are failing because I don't have read permissions on the inputs. Can you grant read permissions on them? |
|
Related to #1148 |
|
Hi all,
I gave the permissions on NHR. However, I only have the Arc01-TL255 coupling. I can upload the CORE2 inputs if you need them. |
Nitpicking, but maybe helpful for clarification: NHR is the national HPC alliance. There is no system named NHR, nor are the systems of the alliance unified. There are nine NHR centers offering individual HPC services under the NHR umbrella. What used to be HLRN is now NHR@ZIB and NHR@GWDG, and they operate independently, including their software stacks, such as Slurm and its configuration. Apologies for possibly repeating already known stuff. |
Refactors the hetjob logic and options, and fixes the incorrect .run script when SLURM's hetjobs were used (see #1340).
It deprecates the computer.taskset option, which is now replaced by computer.hetjob_strategy. The hetjob_strategy values can be taskset, hetjob or srunsteps. The different options result in these different .run scripts:
hetjob (default, allows for heterogeneous compute resources)
taskset (#SBATCH --nodes still needs fixing, see #1148 (comment))
srunsteps (equivalent to how it's done in PBS+aprun in Aleph, does not allow for heterogeneous compute resources)
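The option handling described above could be sketched roughly as follows. This is not the code from this PR: the function `resolve_hetjob_strategy`, the exception `HetjobConfigError` and the plain-dict representation of the `computer` section are assumptions made for illustration. Only the option names and the three allowed values come from the description.

```python
# Values and default taken from the PR description.
VALID_STRATEGIES = ("hetjob", "taskset", "srunsteps")
DEFAULT_STRATEGY = "hetjob"


class HetjobConfigError(ValueError):
    """Hypothetical error type for invalid hetjob settings."""


def resolve_hetjob_strategy(computer):
    """Return the hetjob strategy for a (hypothetical) `computer` config dict."""
    # The old boolean-style option is deprecated: fail early and say what to do.
    if "taskset" in computer:
        raise HetjobConfigError(
            "computer.taskset is deprecated. Remove it from your yaml and set "
            "computer.hetjob_strategy to one of: " + ", ".join(VALID_STRATEGIES)
        )

    strategy = computer.get("hetjob_strategy", DEFAULT_STRATEGY)
    if strategy not in VALID_STRATEGIES:
        raise HetjobConfigError(
            f"Unknown computer.hetjob_strategy '{strategy}'. "
            "Valid values: " + ", ".join(VALID_STRATEGIES)
        )
    return strategy


print(resolve_hetjob_strategy({}))
print(resolve_hetjob_strategy({"hetjob_strategy": "srunsteps"}))
```

Raising on the deprecated key, instead of silently mapping it to the new option, keeps old yaml files from producing a different .run script than the user expects.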
TODO
Detect use of the deprecated taskset option and exit with an error clearly stating what the user needs to do in their yaml
Closes #1340
Thanks to @ufukozkan for finding this problem and further investigating and to @christgau for investigating and providing the solution.