@dgarcia18
Contributor

This PR:

  • Adds the appliance logic and Packer script for the Slurm Controller and Slurm Worker appliances
  • Modifies the Makefile to add the new Slurm appliances

Both appliances are based on Ubuntu 24.04.

Appliance logic description:

  • Controller
    • Install
      • Installs the following packages: munge libmunge-dev slurmctld
      • Injects the /etc/slurm/slurm.conf config file. This file is prepared for configless Slurm usage and includes a nodeset called "one" and a partition that groups all nodes.
      • Creates the directories used to save the Slurm state and enables the slurmctld service
    • Configure
      • Sets the hostname to slurm-one-controller
      • Adds the hostname to /etc/hosts so that it can be resolved to the main IP address
      • Generates a new Munge key if needed (based on the /etc/munge/one_key_generated flag file) and exposes it through OneGate under the MUNGE_KEY_BASE64 attribute
      • Restarts slurmctld if needed and checks that it started correctly
  • Worker
    • Install
      • Installs the following packages: munge libmunge-dev slurmd
    • Configure
      • Checks that the context variables ONEAPP_SLURM_CONTROLLER_IP and ONEAPP_MUNGE_KEY_BASE64 are set and not empty
      • Checks that the controller can be reached on port 6817 (the slurmctld port). If not, retries several times
      • Obtains the VM ID via OneGate (with retry logic)
      • Configures the VM hostname to slurm-one-worker-#{vm_id}
      • Adds the controller to /etc/hosts so that the controller hostname is resolvable
      • Writes the new Munge key (received via ONEAPP_MUNGE_KEY_BASE64) to /etc/munge/munge.key and starts the Munge service
      • Checks that Munge is fully operational
      • Adds the new worker node using the "-Z" option: https://slurm.schedmd.com/dynamic_nodes.html
        • I.e. slurmd -Z --conf "CPUs=#{cpus} RealMemory=#{mem_mb} Feature=one" --conf-server #{controller_ip}
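
The Worker configure steps above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the helper names (`missing_context_vars`, `with_retries`, `port_open?`, `worker_hostname`, `decode_munge_key`, `slurmd_command`) and the timeout value are assumptions made for the example. Only the context variable names, the port, the hostname pattern and the `slurmd -Z` arguments come from the description.

```ruby
require 'socket'

# Context variables the worker expects (names taken from the description).
REQUIRED_VARS = %w[ONEAPP_SLURM_CONTROLLER_IP ONEAPP_MUNGE_KEY_BASE64].freeze

# Hypothetical helper: returns the names of required variables that are
# missing or empty in the given hash-like environment.
def missing_context_vars(env)
  REQUIRED_VARS.select { |name| env[name].to_s.strip.empty? }
end

# Hypothetical helper: runs the block up to `attempts` times, sleeping
# `delay` seconds between tries, and returns its first truthy result
# (nil if every attempt fails).
def with_retries(attempts:, delay:)
  attempts.times do |i|
    result = yield
    return result if result
    sleep(delay) if i < attempts - 1
  end
  nil
end

# Hypothetical helper: true if a TCP connection to host:port can be opened
# within `timeout` seconds (assumed value), false on any error.
def port_open?(host, port, timeout: 3)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue StandardError
  false
end

# Hostname pattern from the description.
def worker_hostname(vm_id)
  "slurm-one-worker-#{vm_id}"
end

# Strict Base64 decode ('m0'); raises ArgumentError on malformed input.
def decode_munge_key(base64_key)
  base64_key.unpack1('m0')
end

# Builds the dynamic-node registration command as an argument array,
# so it can be passed to system(*cmd) without shell quoting issues.
def slurmd_command(cpus:, mem_mb:, controller_ip:)
  ['slurmd', '-Z',
   '--conf', "CPUs=#{cpus} RealMemory=#{mem_mb} Feature=one",
   '--conf-server', controller_ip]
end
```

A caller could, for example, abort when `missing_context_vars(ENV)` is not empty, wait for the controller with `with_retries(attempts: 10, delay: 5) { port_open?(controller_ip, 6817) }` (the retry counts here are made up), and then run `system(*slurmd_command(...))`. Returning the command as an array rather than a single string avoids having to quote the `--conf` value for a shell.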
