Integrating NHC with PBS Pro


LBNL Node Health Check[1] (NHC) is a reliable tool that integrates well with cluster resource managers. We use it alongside Slurm on four of our computing clusters.

On our newest cluster, we switched to PBS Pro and needed a way to integrate NHC, since it had worked well on our Slurm clusters. We decided to run it as a step in our job startup and cleanup procedures, using what PBS Pro calls a 'hook'.

Understanding PBS Pro Hooks


A "hook" in PBS Pro is a feature that allows administrators to customize and control the behavior of the PBS Pro scheduler and server. Fundamentally, a PBS Pro hook is a script or set of scripts that is triggered automatically by certain events in the PBS system. These hooks are used to intervene at various points in the job scheduling and management process to perform custom actions or checks.

Hooks are event-driven: specific events in a job's lifecycle, such as job submission, job start, or job completion, trigger them. This allows for real-time intervention and control.

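For the purpose of this post, I'm going to focus on two hook events: execjob_begin, which fires on the execution host when a job is about to start, and execjob_end, which fires when the job finishes. The complete hook script is:

import pbs
import os
import subprocess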

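REJECT = False
REJECT_MSG = ""

# Names of server-level resources to forward to the prologue/epilogue scripts.
# Placeholder: this definition is not shown in the original listing; set it for your site.
SERVER_FS_RESOURCES = ()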

e = None

def check_nhc(local_node, vnode, op, vnl):
    pattern = 'ERROR:  nhc:  Health check failed:  '
    try:
        nhc_result = subprocess.run('/usr/sbin/nhc', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
        pbs.logmsg(pbs.LOG_DEBUG, "NHC check successful.")
        nhc_exit_code = nhc_result.returncode
    except subprocess.CalledProcessError as nhc_error:
        error_msg = nhc_error.stdout.decode("utf-8").strip().replace(pattern, '')
        pbs.logmsg(pbs.LOG_ERROR, f"NHC check failed. Error: {error_msg}")
        nhc_exit_code = nhc_error.returncode

    if nhc_exit_code != 0:
        pbs.logmsg(pbs.LOG_DEBUG, f"NHC reported an issue for {local_node}")
        current_state = vnode.state
        if current_state != pbs.ND_OFFLINE:
            vnl[local_node].state = pbs.ND_OFFLINE
            vnl[local_node].comment = f"{op}: NHC {error_msg}"
            global REJECT
            REJECT = True
            pbs.logmsg(pbs.LOG_DEBUG, f"Setting {local_node} to offline due to NHC report.")
        else:
            pbs.logmsg(pbs.LOG_DEBUG, f"{local_node} is already offline.")
    else:
        pbs.logmsg(pbs.LOG_DEBUG, f"NHC reported that {local_node} is healthy.")

def execute_tool_script(local_node, op, vnl, jobid, u, resource_list, server_resource_list, tool):
    resources = ';'.join([f"{x}={resource_list[x]}" for x in resource_list.keys()])
    server_resources = ';'.join([f"server_{x}={server_resource_list[x]}" for x in server_resource_list.keys() if x in SERVER_FS_RESOURCES])
    if server_resources:
        resources += ';' + server_resources
    message = f"PBS_MOM_Hook:{op}:Job:{jobid}:User:{u}:Resources:{resources}"
    pbs.logjobmsg(jobid, f"{tool} {message}")

    completed_proc = subprocess.run([tool, message], timeout=120, shell=False)
    pbs.logjobmsg(jobid, f"{tool} script completed with return code {completed_proc.returncode}")

    if completed_proc.returncode != 0:
        current_state = server.vnode(local_node).state
        if current_state != pbs.ND_OFFLINE:
            vnl[local_node].state = pbs.ND_OFFLINE
            if os.path.isfile('/var/tmp/logue_firstfail'):
                # Read the first-failure message directly instead of spawning cat
                with open('/var/tmp/logue_firstfail') as firstfail:
                    firstfail_msg = firstfail.read().strip()
                vnl[local_node].comment = f"{op}: {firstfail_msg}"
            else:
                vnl[local_node].comment = f"{op}: offlining node..."
            global REJECT
            REJECT = True
    elif (op == "EXECJOB_BEGIN") and os.path.isfile('/var/tmp/pbsjob_remaining_procs'):
        REJECT = True

try:
    e = pbs.event()
    server = pbs.server()
    server_resource_list = server.resources_available

    if e.type in [pbs.EXECJOB_BEGIN, pbs.EXECJOB_END]:
        j = e.job
        jobid = j.id
        u = j.euser
        resource_list = server.job(jobid).Resource_List

    local_node = pbs.get_local_nodename()
    vnode = e.vnode_list.get(local_node)

    if not vnode:
        pbs.logmsg(pbs.LOG_DEBUG, f"No vnode found for local node: {local_node}")
        raise Exception(f"No vnode found for local node: {local_node}")

    vnl = e.vnode_list

    if e.type == pbs.EXECJOB_BEGIN:
        check_nhc(local_node, vnode, "EXECJOB_BEGIN", vnl)
        if not REJECT:
            tool = "/opt/pbs/scripts/prologue.sh"
            execute_tool_script(local_node, "EXECJOB_BEGIN", vnl, jobid, u, resource_list, server_resource_list, tool)

    elif e.type == pbs.EXECJOB_END:
        tool = "/opt/pbs/scripts/epilogue.sh"
        execute_tool_script(local_node, "EXECJOB_END", vnl, jobid, u, resource_list, server_resource_list, tool)
        check_nhc(local_node, vnode, "EXECJOB_END", vnl)

except Exception as ex:
    pbs.logmsg(pbs.LOG_DEBUG, f"Exception occurred: {ex}")

finally:
    if e:
        if REJECT:
            if REJECT_MSG:
                msg = f"(pro|epi)logue failure:  {REJECT_MSG}"
            else:
                msg = f"(pro|epi)logue failure"
            e.reject(msg)
        else:
            e.accept()

Overview of the Python Job Hook Script


We have a primary job hook script that integrates NHC to evaluate node health and executes specific scripts at crucial job stages.

Health Check Function

check_nhc runs /usr/sbin/nhc and logs the result. If NHC exits non-zero and the node is not already offline, the function marks the node offline, stores the NHC error message in the node comment, and sets REJECT so that the hook rejects the event.

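The core of the health check can be tried outside PBS. The following is a minimal sketch, not the hook itself: it assumes NHC prints its failure reason on stdout with the same prefix the hook strips, and the names parse_nhc_failure and run_health_check are made up for illustration.

```python
import subprocess

# Prefix that the hook script strips from NHC's failure output.
NHC_PREFIX = "ERROR:  nhc:  Health check failed:  "


def parse_nhc_failure(raw: bytes) -> str:
    """Return the failure reason with the fixed NHC prefix removed."""
    return raw.decode("utf-8", errors="replace").strip().replace(NHC_PREFIX, "")


def run_health_check(cmd=("/usr/sbin/nhc",)):
    """Run the health check command and return (exit_code, message).

    The message is empty when the command exits with status 0.
    """
    proc = subprocess.run(list(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode == 0:
        return 0, ""
    return proc.returncode, parse_nhc_failure(proc.stdout)
```

Returning the exit code and message together avoids relying on a variable that is only set in the failure path, which makes the calling code easier to follow.
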
Tool Script Execution Function

execute_tool_script builds a message from the hook event, job ID, user, and requested resources, then runs the prologue or epilogue script with that message as its argument, with a 120-second timeout. A non-zero return code offlines the node and sets REJECT; the node comment is taken from /var/tmp/logue_firstfail when that file exists.

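To make the message format concrete, here is a small sketch of how the argument passed to the prologue and epilogue scripts is assembled. It uses plain dictionaries in place of the PBS resource objects, and the function name build_hook_message and the sample values are hypothetical.

```python
def build_hook_message(op, jobid, user, resource_list, server_resources, forwarded=()):
    """Assemble the single string argument handed to the prologue/epilogue script.

    resource_list and server_resources are plain dicts here (an assumption);
    forwarded names the server resources that should be included.
    """
    parts = [f"{name}={value}" for name, value in resource_list.items()]
    parts += [
        f"server_{name}={value}"
        for name, value in server_resources.items()
        if name in forwarded
    ]
    return f"PBS_MOM_Hook:{op}:Job:{jobid}:User:{user}:Resources:{';'.join(parts)}"


# Sample values are invented for illustration only.
message = build_hook_message(
    "EXECJOB_BEGIN",
    "123.pbs",
    "alice",
    {"ncpus": 4, "walltime": "01:00:00"},
    {"scratch": "10tb", "licenses": 2},
    forwarded=("scratch",),
)
print(message)
```

With these sample inputs the script would receive `PBS_MOM_Hook:EXECJOB_BEGIN:Job:123.pbs:User:alice:Resources:ncpus=4;walltime=01:00:00;server_scratch=10tb`.
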
Main Event Handling

The main block fetches the event and the local vnode, then dispatches on the event type. On EXECJOB_BEGIN it runs the NHC check first and runs the prologue only if the node is healthy. On EXECJOB_END it runs the epilogue and then the NHC check. The finally clause rejects or accepts the event according to REJECT.

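The ordering of the steps is easy to get wrong, so it may help to see it in isolation. This is a simplified model rather than hook code: the function run_hook is hypothetical, and the two callables stand in for the NHC check and the prologue or epilogue script.

```python
def run_hook(event_type, nhc_ok, run_script):
    """Return "accept" or "reject" for a simulated hook event.

    nhc_ok and run_script are callables that return True on success.
    """
    if event_type == "EXECJOB_BEGIN":
        # Check node health first and skip the prologue on an unhealthy node.
        if not nhc_ok():
            return "reject"
        return "accept" if run_script() else "reject"
    if event_type == "EXECJOB_END":
        # Always run the epilogue, then check the node for the next job.
        script_ok = run_script()
        healthy = nhc_ok()
        return "accept" if (script_ok and healthy) else "reject"
    # Any other event is accepted untouched.
    return "accept"
```

Running the check before the prologue keeps a job off a bad node, while running it after the epilogue catches problems the finished job may have left behind.
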
Using NHC to Online Healthy Nodes


After we successfully integrated NHC into our PBS Pro job workflow, I tried to deploy a hook that runs periodic NHC checks on the MoM hosts using the EXECHOST_PERIODIC event. If a host fails its NHC check, the node is taken offline. This worked, but it crushed the performance of our cluster: simply having the hook enabled, even when it wasn't actively executing anything on the node, caused job performance to plummet. I decided to take a less intrusive approach.

My team and I decided to scrap periodic checks entirely. Instead, we rely on the job hook to offline nodes with NHC, and on a shell script running on the PBS Pro server to bring healthy nodes back online. Below is the nhc_node_checks script that accomplishes this:

#!/bin/bash
set -e
trap 'echo "Error on line $LINENO" >> /var/log/nhc_error.log' ERR

process_node() {
    nodename="$1"
    error_file="$2"

    status=0

    # Run NHC over SSH; a non-zero status means either SSH or NHC itself failed
    if ! /usr/bin/ssh -o ConnectTimeout=15 "$nodename" /usr/sbin/nhc > /dev/null 2>&1; then
        echo "NHC on $nodename failed (or SSH could not connect); leaving node offline." >> /var/log/nhc_error.log
        return 0
    fi

    # Only run pbsnodes commands if NHC was successful
    if ! /opt/pbs/bin/pbsnodes -r "$nodename" || ! /opt/pbs/bin/pbsnodes -C "" "$nodename"; then
        echo "Failed to run pbsnodes commands on $nodename." >> /var/log/nhc_error.log
        status=1
    fi

    # Record the result explicitly; $? after the if block would always be 0
    echo "$status $nodename" >> "$error_file"
}

capture_and_process_nodes() {
    pbsnodes_output=$(/opt/pbs/bin/pbsnodes -l)

    node_names=$(echo "$pbsnodes_output" | /usr/bin/awk '$2 ~ /(down|offline)/ && $0 ~ /NHC/ {print $1}')

    if [ -z "$node_names" ]; then
        echo "No nodes match the given conditions. Exiting."
        return 0
    fi

    error_file=$(mktemp)
    pids=""

    for nodename in $node_names; do
        process_node "$nodename" "$error_file" &
        pids="$pids $!"
    done

    for pid in $pids; do
        wait "$pid"
    done

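    if grep -q '^[^0]' "$error_file"; then
        # Copy the failed entries into the log, since the temp file is removed below
        echo "Some nodes failed to process:" >> /var/log/nhc_error.log
        grep '^[^0]' "$error_file" >> /var/log/nhc_error.log
    fi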

    rm "$error_file"
}

capture_and_process_nodes
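
For readers less familiar with awk, the node filter in capture_and_process_nodes can be expressed in Python. This is only a sketch of the selection logic: the function name select_nhc_nodes is hypothetical, and it assumes each line of `pbsnodes -l` output has the node name, then the state, then the comment.

```python
import re

# Matches a state field that contains "down" or "offline", like the awk pattern.
STATE_RE = re.compile(r"down|offline")


def select_nhc_nodes(pbsnodes_output):
    """Return names of down/offline nodes whose line mentions NHC."""
    nodes = []
    for line in pbsnodes_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and STATE_RE.search(fields[1]) and "NHC" in line:
            nodes.append(fields[0])
    return nodes
```

Filtering on the NHC marker in the comment means nodes that an administrator offlined for other reasons are never brought back online by the script.
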

References

[1] LBNL Node Health Check