When a Slurm node is added to a partition, you specify the amount of physical memory the machine has. Running slurmd -C on the node prints a configuration line based on the hardware it detects:

NodeName=foo CPUs=16 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=1048576
However, even with cgroups enabled, users running with --mem=0 run unchecked, which can lead to the out-of-memory killer reaping processes. Even limiting RealMemory to 99% or 66% of the actual RAM does not protect the machine.
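For reference, cgroup-based memory enforcement is typically configured along these lines (a sketch; the exact options available depend on your Slurm version):

```
# /etc/slurm/cgroup.conf (sketch)
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

# /etc/slurm/slurm.conf (sketch)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
```

Even with this in place, a job submitted with --mem=0 is granted all of the node's memory, so the constraint does not save you.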
Using a job_submit plugin, you can block --mem=0. Slurm provides a Lua interface that lets you intercept and modify or reject job submissions. Here's how to create a custom job_submit plugin that blocks --mem=0:
1. Create /etc/slurm/job_submit.lua:
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Reject jobs that request all of a node's memory (--mem=0)
    if job_desc.pn_min_memory == 0 then
        slurm.log_user("--mem=0 is not allowed. Please specify an explicit memory limit (e.g., --mem=100G)")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

return slurm.SUCCESS
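Before deploying, you can sanity-check the script's logic outside of Slurm by stubbing the slurm table in a plain Lua interpreter (a local test harness of my own devising, not part of Slurm; the field and constant names mirror those used by the plugin above):

```lua
-- test_job_submit.lua: stub the parts of the slurm API the plugin touches,
-- then exercise slurm_job_submit directly.
slurm = {
    SUCCESS  = 0,
    ERROR    = -1,
    log_user = function(msg) print("log_user: " .. msg) end,
    user_msg = function(msg) print("user_msg: " .. msg) end,
}

dofile("/etc/slurm/job_submit.lua")

-- A job requesting --mem=0 should be rejected
assert(slurm_job_submit({pn_min_memory = 0}, nil, 1000) == slurm.ERROR)
-- A job with an explicit memory limit should pass
assert(slurm_job_submit({pn_min_memory = 102400}, nil, 1000) == slurm.SUCCESS)
print("ok")
```

Run it with `lua test_job_submit.lua`; both assertions should hold without a running slurmctld.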
2. Configure slurm.conf:
# Enable the Lua job submit plugin
JobSubmitPlugins=lua
3. Restart slurmctld:
sudo systemctl restart slurmctld
Users trying srun --mem=0 will now get an error message:

--mem=0 is not allowed. Please specify an explicit memory limit (e.g., --mem=100G)
This enforces explicit memory requests while still allowing anything up to the RealMemory limit. The plugin intercepts job submissions before they are processed, letting you reject --mem=0 requests and force users to specify explicit memory amounts within your configured limits.
I'd still recommend additionally setting RealMemory to less than the physical RAM installed, to leave some room for the OS.
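For example, on the 1 TiB node above you might advertise a few GiB less than what slurmd -C reports (a sketch; RealMemory and MemSpecLimit are both in MiB, and the exact amount to reserve depends on what else runs on the node):

```
# slurm.conf (sketch): advertise 8 GiB less than the 1048576 MiB installed
NodeName=foo CPUs=16 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=1040384

# Alternatively, MemSpecLimit reserves memory for system use explicitly;
# it requires cgroup-based memory containment to be enforced:
# NodeName=foo CPUs=16 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=1048576 MemSpecLimit=8192
```

Combined with the job_submit plugin, this keeps jobs inside an explicit, enforceable budget while the OS retains headroom.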