LBNL Node Health Check[1] (NHC) is an exceptionally reliable tool that integrates seamlessly with cluster management systems. We use it with Slurm on four of our computing clusters.
On our newest cluster we moved to PBS Pro, and we needed a way to integrate NHC there as well, since it had worked so well on our Slurm clusters. We decided to incorporate it as a step in our job execution and cleanup procedures, using what PBS Pro calls a 'hook'.
A "hook" in PBS Pro is a feature that allows administrators to customize and control the behavior of the PBS Pro scheduler and server. Fundamentally, a PBS Pro hook is a script or set of scripts that is triggered automatically by certain events in the PBS system. These hooks are used to intervene at various points in the job scheduling and management process to perform custom actions or checks.
It's important to understand that hooks have event-driven execution. Indeed, specific events in a job's lifecycle are what trigger hooks; e.g. job submissions, job starts, or job completions. This allows for real-time intervention and control.
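For context, a hook like the one in this post is registered with qmgr on the PBS Pro server. Here is a minimal sketch; the hook name (nhc_hook) and script filename are placeholders, not necessarily our production names:

```sh
# Create the hook, subscribe it to the job begin/end events, and load the script body.
qmgr -c "create hook nhc_hook"
qmgr -c 'set hook nhc_hook event = "execjob_begin,execjob_end"'
qmgr -c "import hook nhc_hook application/x-python default nhc_hook.py"
```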
For the purpose of this post, I'm going to focus on two hook events:

- EXECJOB_BEGIN: This event is triggered when the primary execution host receives a job and has prepared any necessary files or directories. It initially runs on the primary host and, if successful, on the other allocated MoM hosts.
- EXECJOB_END: This event runs on all hosts involved in a job after the job completes. It's ideal for clean-up activities or gathering post-job data.

Here is the hook script we use:

```python
import pbs
import os
import subprocess

REJECT = False
REJECT_MSG = ""
e = None

# Server-level resource names forwarded to the prologue/epilogue scripts.
# NOTE: this list is site-specific and not shown in the original snippet.
SERVER_FS_RESOURCES = []


def check_nhc(local_node, vnode, op, vnl):
    """Run NHC on this host; offline the vnode and flag rejection on failure."""
    global REJECT
    pattern = 'ERROR: nhc: Health check failed: '
    error_msg = ""
    try:
        nhc_result = subprocess.run('/usr/sbin/nhc', shell=True,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE, check=True)
        pbs.logmsg(pbs.LOG_DEBUG, "NHC check successful.")
        nhc_exit_code = nhc_result.returncode
    except subprocess.CalledProcessError as nhc_error:
        # NHC prints "ERROR: nhc: Health check failed: <reason>"; keep only the reason.
        error_msg = nhc_error.stdout.decode("utf-8").strip().replace(pattern, '')
        pbs.logmsg(pbs.LOG_ERROR, f"NHC check failed. Error: {error_msg}")
        nhc_exit_code = nhc_error.returncode

    if nhc_exit_code != 0:
        pbs.logmsg(pbs.LOG_DEBUG, f"NHC reported an issue for {local_node}")
        current_state = vnode.state
        if current_state != pbs.ND_OFFLINE:
            # Offline the node and record what NHC complained about.
            vnl[local_node].state = pbs.ND_OFFLINE
            vnl[local_node].comment = f"{op}: NHC {error_msg}"
            REJECT = True
            pbs.logmsg(pbs.LOG_DEBUG, f"Setting {local_node} to offline due to NHC report.")
        else:
            pbs.logmsg(pbs.LOG_DEBUG, f"{local_node} is already offline.")
    else:
        pbs.logmsg(pbs.LOG_DEBUG, f"NHC reported that {local_node} is healthy.")


def execute_tool_script(local_node, op, vnl, jobid, u, resource_list, server_resource_list, tool):
    """Run the prologue/epilogue script, offlining the vnode if it fails."""
    global REJECT
    # Build a semicolon-separated summary of the job's resources for the script.
    resources = ';'.join([f"{x}={resource_list[x]}" for x in resource_list.keys()])
    server_resources = ';'.join([f"server_{x}={server_resource_list[x]}"
                                 for x in server_resource_list.keys()
                                 if x in SERVER_FS_RESOURCES])
    if server_resources:
        resources += ';' + server_resources
    message = f"PBS_MOM_Hook:{op}:Job:{jobid}:User:{u}:Resources:{resources}"
    pbs.logjobmsg(jobid, f"{tool} {message}")
    completed_proc = subprocess.run([tool, message], timeout=120, shell=False)
    pbs.logjobmsg(jobid, f"{tool} script completed with return code {completed_proc.returncode}")
    if completed_proc.returncode != 0:
        current_state = server.vnode(local_node).state
        if current_state != pbs.ND_OFFLINE:
            vnl[local_node].state = pbs.ND_OFFLINE
            if os.path.isfile('/var/tmp/logue_firstfail'):
                firstfail_msg = subprocess.check_output(["cat", "/var/tmp/logue_firstfail"]).strip().decode()
                vnl[local_node].comment = f"{op}: {firstfail_msg}"
            else:
                vnl[local_node].comment = f"{op}: offlining node..."
            REJECT = True
    elif (op == "EXECJOB_BEGIN") and os.path.isfile('/var/tmp/pbsjob_remaining_procs'):
        REJECT = True


try:
    e = pbs.event()
    server = pbs.server()
    server_resource_list = server.resources_available
    if e.type in [pbs.EXECJOB_BEGIN, pbs.EXECJOB_END]:
        j = e.job
        jobid = j.id
        u = j.euser
        resource_list = server.job(jobid).Resource_List
        local_node = pbs.get_local_nodename()
        vnode = e.vnode_list.get(local_node)
        if not vnode:
            pbs.logmsg(pbs.LOG_DEBUG, f"No vnode found for local node: {local_node}")
            raise Exception(f"No vnode found for local node: {local_node}")
        vnl = e.vnode_list
        if e.type == pbs.EXECJOB_BEGIN:
            # Health-check first; only run the prologue if the node passed.
            check_nhc(local_node, vnode, "EXECJOB_BEGIN", vnl)
            if not REJECT:
                tool = "/opt/pbs/scripts/prologue.sh"
                execute_tool_script(local_node, "EXECJOB_BEGIN", vnl, jobid, u, resource_list, server_resource_list, tool)
        elif e.type == pbs.EXECJOB_END:
            # Clean up first, then health-check the node on the way out.
            tool = "/opt/pbs/scripts/epilogue.sh"
            execute_tool_script(local_node, "EXECJOB_END", vnl, jobid, u, resource_list, server_resource_list, tool)
            check_nhc(local_node, vnode, "EXECJOB_END", vnl)
except Exception as ex:
    pbs.logmsg(pbs.LOG_DEBUG, f"Exception occurred: {ex}")
finally:
    if e:
        if REJECT:
            if REJECT_MSG:
                msg = f"(pro|epi)logue failure: {REJECT_MSG}"
            else:
                msg = "(pro|epi)logue failure"
            e.reject(msg)
        else:
            e.accept()
```
This primary job hook script integrates NHC to evaluate node health and executes specific scripts at crucial job stages:
- check_nhc(): Executes NHC and logs the outcome. If NHC reports a problem, it sets the node offline and adds a comment. The comment is prefixed with NHC:, and the message that follows is taken directly from NHC's output, so we can see exactly what NHC didn't like about the node.
- execute_tool_script(): Runs specific scripts (prologue.sh and epilogue.sh), which set and remove various settings on the host, respectively.
- The hook handles both the EXECJOB_BEGIN and EXECJOB_END events. On EXECJOB_BEGIN, it first runs the NHC check, followed by prologue.sh. On EXECJOB_END, it executes epilogue.sh and then performs the NHC check.

After we successfully integrated NHC into our PBS Pro job workflow, I tried to deploy a hook that does periodic NHC checks on the MoM hosts using the EXECHOST_PERIODIC event. If the host fails its NHC check, the node is taken offline. While this worked, it crushed the performance of our cluster: simply having the hook enabled, even when it wasn't actively executing anything on the node, would cause job performance to plummet. I decided to take a less intrusive approach.
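For reference, the periodic variant we abandoned would have been configured along these lines (the hook name, script name, and 300-second interval here are illustrative, not our exact settings):

```sh
# EXECHOST_PERIODIC hooks run on every MoM at the interval given by freq (in seconds).
qmgr -c "create hook nhc_periodic"
qmgr -c "set hook nhc_periodic event = exechost_periodic"
qmgr -c "set hook nhc_periodic freq = 300"
qmgr -c "import hook nhc_periodic application/x-python default nhc_periodic.py"
```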
My team and I decided to scrap periodic checks entirely. Instead, we rely on the job hook to offline nodes with NHC, and on a shell script that runs on the PBS Pro server to bring the nodes back online. Below is the nhc_node_checks script that accomplishes this:
```sh
#!/bin/sh
set -e
trap 'echo "Error on line $LINENO" >> /var/log/nhc_error.log' ERR

process_node() {
    nodename="$1"
    error_file="$2"
    # Run NHC and check its exit status
    if ! /usr/bin/ssh -o ConnectTimeout=15 "$nodename" /usr/sbin/nhc > /dev/null 2>&1; then
        echo "SSH to $nodename for NHC failed." >> /var/log/nhc_error.log
        return 0
    else
        # Only run pbsnodes commands if NHC was successful
        if ! /opt/pbs/bin/pbsnodes -r "$nodename" || ! /opt/pbs/bin/pbsnodes -C "" "$nodename"; then
            echo "Failed to run pbsnodes commands on $nodename." >> /var/log/nhc_error.log
        fi
    fi
    echo "$? $nodename" >> "$error_file"
}

capture_and_process_nodes() {
    pbsnodes_output=$(/opt/pbs/bin/pbsnodes -l)
    node_names=$(echo "$pbsnodes_output" | /usr/bin/awk '$2 ~ /(down|offline)/ && $0 ~ /NHC/ {print $1}')
    if [ -z "$node_names" ]; then
        echo "No nodes match the given conditions. Exiting."
        return 0
    fi
    error_file=$(mktemp)
    pids=""
    for nodename in $node_names; do
        process_node "$nodename" "$error_file" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid"
    done
    if grep -q '^[^0]' "$error_file"; then
        echo "Some nodes failed to process. See $error_file for details." >> /var/log/nhc_error.log
    fi
    rm "$error_file"
}

capture_and_process_nodes
```
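A script like this needs to run on a regular schedule from the PBS Pro server; a cron entry is one straightforward way to do that. A sketch, where the 10-minute interval, install path, and log file are assumptions rather than our exact setup:

```sh
# /etc/cron.d/nhc_node_checks -- run the re-onlining script every 10 minutes
*/10 * * * *  root  /usr/local/sbin/nhc_node_checks >> /var/log/nhc_node_checks.log 2>&1
```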
[1] LBNL Node Health Check: https://github.com/mej/nhc