
Spreading out executors on RDHPCS head nodes

TerrenceMcGuinness-NOAA edited this page Aug 7, 2024 · 1 revision

To maintain a single agent connection to one head node on a supercomputer and farm out work to the other head nodes without running multiple Jenkins agents, you can use SSH to execute commands on the remote nodes. This approach lets you control and run tasks on the other head nodes from a single Jenkins agent.

Here's a step-by-step approach to achieve this:

  1. Setup SSH Keys: Ensure that the Jenkins agent has SSH access to the other head nodes. You can set up SSH keys for passwordless authentication.

  2. Use SSH in Jenkins Pipeline: Use the sh step in the Jenkins pipeline to execute SSH commands on the remote nodes.

  3. Define Remote Nodes: Define the remote nodes and their corresponding SSH details in your Jenkinsfile.

  4. Execute Commands on Remote Nodes: Use the sh step to execute commands on the remote nodes via SSH.
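For step 1, a minimal sketch of the key setup might look like the following. The key path, the `jenkins` user, and the `hera.example.com` hostname are placeholders; substitute your actual agent account and head-node hostnames:

```shell
# Create a dedicated, passphrase-less key pair for the Jenkins agent
# (a passphrase would block non-interactive authentication).
mkdir -p "$HOME/.ssh"
ssh-keygen -t ed25519 -N "" -f "$HOME/.ssh/jenkins_ci" -q

# Install the public key on each remote head node (hostname is a placeholder):
#   ssh-copy-id -i "$HOME/.ssh/jenkins_ci.pub" jenkins@hera.example.com

# Verify passwordless login works before wiring it into the pipeline:
#   ssh -i "$HOME/.ssh/jenkins_ci" jenkins@hera.example.com hostname
```

Using a dedicated key (rather than the agent account's default key) makes it easy to revoke CI access later without disturbing other logins.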

Example Jenkinsfile

Here's an example Jenkinsfile that demonstrates this approach:

// Variables mirrored from the production Jenkinsfile; not all are used in this trimmed example.
def Machine = 'none'
def machine = 'none'
def CUSTOM_WORKSPACE = 'none'
def caseList = ''
def GH = 'none'
// Location of the custom workspaces for each machine in the CI system. They are persistent for each iteration of the PR.
def NodeName = [hera: 'Hera-EMC', orion: 'Orion-EMC', hercules: 'Hercules-EMC', gaea: 'Gaea']
def custom_workspace = [hera: '/scratch1/NCEPDEV/global/CI', orion: '/work2/noaa/stmp/CI/ORION', hercules: '/work2/noaa/stmp/CI/HERCULES', gaea: '/gpfs/f5/epic/proj-shared/global/CI']
def repo_url = 'git@github.com:NOAA-EMC/global-workflow.git'
def STATUS = 'Passed'

// Define SSH details for remote nodes
def remoteNodes = [
    hera: [host: 'hera.example.com', user: 'jenkins'],
    orion: [host: 'orion.example.com', user: 'jenkins'],
    hercules: [host: 'hercules.example.com', user: 'jenkins'],
    gaea: [host: 'gaea.example.com', user: 'jenkins']
]

pipeline {
    agent { label 'built-in' }

    options {
        skipDefaultCheckout()
        parallelsAlwaysFailFast()
    }

    stages {
        stage('1. Get Machine') {
            agent { label 'built-in' }
            steps {
                script {
                    // Note: currentBuild.rawBuild requires administrator script approval
                    // when the Groovy sandbox is enabled.
                    def causes = currentBuild.rawBuild.getCauses()
                    def isSpawnedFromAnotherJob = causes.any { cause ->
                        cause instanceof hudson.model.Cause.UpstreamCause
                    }

                    def run_nodes = []
                    if (isSpawnedFromAnotherJob) {
                        echo "machine being set to value passed to this spawned job"
                        echo "passed machine: ${params.machine}"
                        machine = params.machine
                    } else {
                        echo "This is the parent job, so getting the list of nodes matching labels:"
                        for (label in pullRequest.labels) {
                            if (label.matches("CI-(.*?)-Ready")) {
                                def machine_name = label.split('-')[1].toString().toLowerCase()
                                jenkins.model.Jenkins.get().computers.each { c ->
                                    if (c.node.selfLabel.name == NodeName[machine_name]) {
                                        run_nodes.add(c.node.selfLabel.name)
                                    }
                                }
                            }
                        }
                        // Spawn child jobs on every matching node except the last;
                        // the parent job itself runs on the last node.
                        if (run_nodes.size() > 1) {
                            run_nodes.init().each { node ->  // init() = all but the last element
                                def machine_name = node.split('-')[0].toLowerCase()
                                echo "Spawning job on node: ${node} with machine name: ${machine_name}"
                                build job: "/global-workflow/EMC-Global-Pipeline/PR-${env.CHANGE_ID}", parameters: [
                                    string(name: 'machine', value: machine_name),
                                    string(name: 'Node', value: node) ],
                                    wait: false
                            }
                            machine = run_nodes.last().split('-')[0].toLowerCase()
                            echo "Running parent job: ${machine}"
                        } else if (run_nodes.size() == 1) {
                            machine = run_nodes[0].split('-')[0].toLowerCase()
                            echo "Running only the parent job: ${machine}"
                        } else {
                            error "No nodes matched the CI-*-Ready labels on this PR"
                        }
                    }
                }
            }
        }

        stage('2. Execute on Remote Nodes') {
            steps {
                script {
                    def remoteNode = remoteNodes[machine]
                    if (remoteNode) {
                        echo "Executing on remote node: ${remoteNode.host}"
                        sh """
                            ssh ${remoteNode.user}@${remoteNode.host} 'bash -s' <<'ENDSSH'
# Commands to execute on the remote node.
# The heredoc body and its ENDSSH terminator must start at column 0;
# an indented terminator is not recognized and the heredoc never ends.
# Groovy interpolates ${remoteNode.host} before the shell ever sees it.
echo "Running on remote node: ${remoteNode.host}"
# Add your commands here
ENDSSH
                        """
                    } else {
                        error "No remote node configuration found for machine: ${machine}"
                    }
                }
            }
        }

        stage('3. FINALIZE') {
            steps {
                script {
                    echo "Finalizing the pipeline"
                    // Add your finalization steps here
                }
            }
        }
    }
}

Explanation

  1. Define Remote Nodes: The remoteNodes map contains the SSH details for each remote node.

  2. Get Machine Stage: This stage determines the machine to use based on the GitHub labels on the PR.

  3. Execute on Remote Nodes Stage: This stage uses the sh step to run an SSH command against the selected node; the commands to execute remotely are passed on stdin inside the heredoc delimited by ENDSSH. Quoting the delimiter ('ENDSSH') keeps the local shell from expanding variables, so only Groovy interpolation is applied before the script reaches the remote shell.

  4. Finalize Stage: This stage contains any finalization steps needed for the pipeline.
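The 'bash -s' heredoc pattern from stage 2 can be exercised locally without SSH, since the remote shell reads the script from stdin in exactly the same way:

```shell
# Same mechanics as `ssh user@host 'bash -s' <<'ENDSSH' ... ENDSSH`,
# minus the SSH hop: bash -s executes the script it receives on stdin.
bash -s <<'ENDSSH'
set -euo pipefail          # fail the script on any error
echo "Running on host: $(hostname)"
ENDSSH
```

Because the delimiter is quoted, $(hostname) is evaluated by the shell that receives the script; over SSH, that means the remote head node.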

By using SSH to execute commands on remote nodes, you can maintain a single node connection to a head node on a supercomputer and farm out executors to other head nodes without running multiple agents.
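Since the goal is a single persistent connection per head node, an SSH client configuration sketch like the following can multiplex every pipeline-issued SSH command over one shared connection per host. The hostnames, user, and key path are placeholders, not values from the actual CI setup:

```text
# ~/.ssh/config on the Jenkins agent (hostnames and key path are placeholders)
Host hera.example.com orion.example.com hercules.example.com gaea.example.com
    User jenkins
    IdentityFile ~/.ssh/jenkins_ci
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```

With ControlMaster enabled, repeated sh steps that SSH to the same head node reuse the established connection instead of renegotiating a login each time.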