Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add code infrastructure for pipeline configuration regression tests #12

Merged
merged 7 commits into from
Feb 27, 2024

Conversation

nwiltsie
Copy link
Member

Okay, this is a lot. None of this is in Action format yet, but I wanted to push up this minimal working version and start getting some feedback on it.

The problem this is solving

We have so much metaprogramming craziness inside the pipeline configurations. The configuration changes based on available CPUs and memory, there are 7 ways to modify parameters, and we have fancy process parameter merging and retry specification.

Given that, it's very difficult to review most pull requests changing configuration logic, as I have exceptionally little understanding of what's going on under the hood.

I want an Action that can run regression tests on our Nextflow pipeline configurations. That is, given a specific commit of a pipeline and a defined configuration (split across config files, parameter files, environment variables, and command line arguments), I want to be able to compute the final configuration in the cloud, without requiring any files outside of the repository. Then that Action can automatically run on pull requests, documenting how the code changes impacted the final rendered configuration.

This PR is the bones of that Action (and a record of my slow descent into madness).

Interface

Test File Format

Tests are defined as specially-formatted JSON files tracked within each pipeline repository, one test per file. For demonstration I've included one test for the recalibrate BAM pipeline with this PR. The configuration appears at the start:

{
"config": [
"test/nftest.config"
],
"params_file": "test/single.yaml",
"cpus": 16,
"memory_gb": 31,
"empty_files": [
"/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta",
"/hot/ref/tool-specific-input/GATK/GRCh38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
"/hot/ref/tool-specific-input/GATK/GRCh38/Homo_sapiens_assembly38.known_indels.vcf.gz",
"/hot/ref/tool-specific-input/GATK/GRCh38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz",
"/hot/ref/tool-specific-input/GATK/GRCh38/Biallelic/hapmap_3.3.hg38.BIALLELIC.PASS.2021-09-01.vcf.gz",
"/hot/resource/SMC-HET/tumours/A-mini/bams/n1/output/S2.T-n1.bam"
],
"mapped_files": [],
"nf_params": {
"output_dir": "/tmp/outputs"
},
"envvars": {
"SLURM_JOB_ID": "851543"
},
"mocks": {
"parse_bam_header": {
"read_group": [
{
"SM": "4915723"
}
]
}
},
"dated_fields": [
"params.log_output_dir",
"report.file",
"timeline.file",
"trace.file"
],

The following keys are required (unless stated otherwise all paths are relative to the pipeline root):

Key Description
config A list of configuration files to be included (nextflow -c <file1> -c <file2>)
params_file A single parameter file or an empty string (nextflow -params-file <file>)
cpus The integer CPU count to be returned by SysHelper::getAvailCpus()
memory_gb The memory value to be returned by SysHelper::getAvailMemory (float, GB)
empty_files A list of absolute paths to touch within the docker image
mapped_files A map of relative repository files to absolute container paths that should be mapped into the docker image
nf_params A map of command-line parameters to pass to Nextflow (nextflow --<key>=<value>)
envvars A map of environment variables to set in the docker image (KEY=VALUE nextflow ...)
mocks Method names to be mocked, mapped to the objects they should return
dated_field A list of JSONPath-like keys indicating values in the rendered configuration that contain datestamps
expected_results The expected output of the test (lightly filtered, see later sections for details)

That expected_results field looks something like this:

"expected_result": {
"docker": {
"all_group_ids": "$(for i in `id --real --groups`; do echo -n \"--group-add=$i \"; done)",
"enabled": true,
"runOptions": "-u $(id -u):$(id -g) $(for i in `id --real --groups`; do echo -n \"--group-add=$i \"; done)",
"uid_and_gid": "-u $(id -u):$(id -g)"
},
"manifest": {
"author": "Yash Patel",
"description": "Nextflow pipeline to perform Indel Realignment and Base Quality Score Recalibration",
"name": "recalibrate-BAM",
"version": "1.0.0-rc.4"
},
"params": {
"aligner": "BWA-MEM2-2.2.1",
"blcds_registered_dataset": false,
"bundle_contest_hapmap_3p3_vcf_gz": "/hot/ref/tool-specific-input/GATK/GRCh38/Biallelic/hapmap_3.3.hg38.BIALLELIC.PASS.2021-09-01.vcf.gz",
"bundle_known_indels_vcf_gz": "/hot/ref/tool-specific-input/GATK/GRCh38/Homo_sapiens_assembly38.known_indels.vcf.gz",
"bundle_mills_and_1000g_gold_standard_indels_vcf_gz": "/hot/ref/tool-specific-input/GATK/GRCh38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
"bundle_v0_dbsnp138_vcf_gz": "/hot/ref/tool-specific-input/GATK/GRCh38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz",
"cache_intermediate_pipeline_steps": false,
"dataset_id": "A-mini",
"docker_container_registry": "ghcr.io/uclahs-cds",
"docker_image_gatk": "broadinstitute/gatk:4.2.4.1",
"docker_image_gatk3": "ghcr.io/uclahs-cds/call-gsnp:GATK-3.7.0",
"docker_image_picard": "ghcr.io/uclahs-cds/picard:2.26.10",
"docker_image_pipeval": "ghcr.io/uclahs-cds/pipeval:4.0.0-rc.2",
"docker_image_samtools": "ghcr.io/uclahs-cds/samtools:1.17",
"gatk3_version": "GATK-3.7.0",
"gatk_command_mem_diff": "0",
"gatk_ir_compression": "1",
"gatk_version": "4.2.4.1",
"input": {
"BAM": {
"tumor": [
"/hot/resource/SMC-HET/tumours/A-mini/bams/n1/output/S2.T-n1.bam"
]
},
"recalibration_table": [
"/scratch/851543/NO_FILE.grp"
]
},
"intervals": "",
"is_DOC_run": false,
"is_emit_original_quals": true,
"is_targeted": false,
"log_output_dir": "/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/log-recalibrate-BAM-1.0.0-rc.4-20240214T213139Z",
"max_cpus": "16",
"max_memory": "31 GB",
"metapipeline_delete_input_bams": false,
"metapipeline_states_to_delete": [
"normal",
"tumor"
],
"min_cpus": "1",
"min_memory": "1 MB",
"output_dir": "/tmp/outputs",
"output_dir_base": "/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/GATK-4.2.4.1",
"parallelize_by_chromosome": true,
"patient_id": "TWGSAMIN000001",
"picard_version": "2.26.10",
"pipeval_version": "4.0.0-rc.2",
"proc_resource_params": {
"deduplicate_records_SAMtools": {
"cpus": "2",
"memory": "27.9 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"extract_GenomeIntervals": {
"cpus": "1",
"memory": "1 GB"
},
"remove_intermediate_files": {
"cpus": "1",
"memory": "1 GB"
},
"remove_merged_BAM": {
"cpus": "1",
"memory": "1 GB"
},
"remove_unmerged_BAMs": {
"cpus": "1",
"memory": "1 GB"
},
"run_ApplyBQSR_GATK": {
"cpus": "1",
"memory": "2 GB",
"retry_strategy": {
"memory": {
"operand": "4",
"strategy": "exponential"
}
}
},
"run_BaseRecalibrator_GATK": {
"cpus": "1",
"memory": "27.9 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"run_CalculateContamination_GATK": {
"cpus": "1",
"memory": "14 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"run_DepthOfCoverage_GATK": {
"cpus": "1",
"memory": "14 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"run_GetPileupSummaries_GATK": {
"cpus": "1",
"memory": "14 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"run_IndelRealigner_GATK": {
"cpus": "2",
"memory": "4 GB",
"retry_strategy": {
"memory": {
"operand": "4",
"strategy": "exponential"
}
}
},
"run_MergeSamFiles_Picard": {
"cpus": "2",
"memory": "27.9 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"run_RealignerTargetCreator_GATK": {
"cpus": "2",
"memory": "4 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"run_SplitIntervals_GATK": {
"cpus": "1",
"memory": "1 GB"
},
"run_index_SAMtools": {
"cpus": "1",
"memory": "2 GB",
"retry_strategy": {
"memory": {
"operand": "2",
"strategy": "exponential"
}
}
},
"run_validate_PipeVal": {
"cpus": "1",
"memory": "1 GB"
}
},
"reference_fasta": "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta",
"samples_to_process": [
{
"id": "4915723",
"path": "/hot/resource/SMC-HET/tumours/A-mini/bams/n1/output/S2.T-n1.bam",
"sample_type": "tumor"
}
],
"samtools_version": "1.17",
"save_intermediate_files": false,
"scatter_count": "50",
"split_intervals_extra_args": "",
"ucla_cds": true,
"use_recal_tables": false,
"work_dir": "/scratch/851543"
},
"params_schema": {
"aligner": {
"help": "Aligner used to align input BAMs. Provided as <Aligner>-<Aligner-version>",
"required": true,
"type": "AlignerTool"
},
"base_resource_update": {
"elements": {
"cpus": {
"help": "List of CPU updates",
"required": false,
"type": "ResourceUpdateList"
},
"memory": {
"help": "List of memory updates",
"required": false,
"type": "ResourceUpdateList"
}
},
"help": "User-defined modifications for adjusting base resource allocations for processes",
"required": false,
"type": "ResourceUpdateNamespace"
},
"bundle_contest_hapmap_3p3_vcf_gz": {
"help": "Absolute path to ConEst HapMap 3p3 VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"bundle_known_indels_vcf_gz": {
"help": "Absolute path to known INDELs VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"bundle_mills_and_1000g_gold_standard_indels_vcf_gz": {
"help": "Absolute path to Mills and 1000g gold standard INDELs VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"bundle_v0_dbsnp138_vcf_gz": {
"help": "Absolute path to v0 dbSNP 138 VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"dataset_id": {
"help": "Dataset ID",
"required": true,
"type": "String"
},
"gatk_ir_compression": {
"choices": [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9"
],
"default": "1",
"help": "",
"required": false,
"type": "Integer"
},
"input": {
"elements": {
"BAM": {
"elements": {
"normal": {
"help": "Input normal BAMs",
"required": false,
"type": "BAMEntryList"
},
"tumor": {
"help": "Input tumor BAMs",
"required": false,
"type": "BAMEntryList"
}
},
"help": "Input BAMs for calling",
"required": true,
"type": "InputBAMNamespace"
},
"recalibration_table": {
"allow_empty": false,
"help": "List of any available recalibration tables",
"required": false,
"type": "RecalibrationTableList"
}
},
"help": "Input samples",
"required": true,
"type": "InputNamespace"
},
"intervals": {
"allow_empty": true,
"help": "Target intervals to process for DNA panel/targeted sequencing samples; leave empty for WGS",
"required": true,
"type": "String"
},
"is_DOC_run": {
"default": false,
"help": "Whether to run the DepthOfCoverage process, which is very time-consuming for large BAMs",
"required": true,
"type": "Bool"
},
"is_emit_original_quals": {
"default": true,
"help": "Whether to emit original quality scores after recalibration",
"required": true,
"type": "Bool"
},
"metapipeline_delete_input_bams": {
"default": false,
"help": "Whether to delete the input BAMs",
"required": true,
"type": "Bool"
},
"metapipeline_final_output_dir": {
"help": "Directory containing final outputs to check before input deletion",
"required": false,
"type": "String"
},
"metapipeline_states_to_delete": {
"choice": [
"normal",
"tumor"
],
"default": [
"normal",
"tumor"
],
"help": "List of states for which to delete input BAMs",
"required": true,
"type": "List"
},
"output_dir": {
"help": "Absolute path to output directory",
"mode": "w",
"required": true,
"type": "Path"
},
"patient_id": {
"help": "Patient ID",
"required": true,
"type": "String"
},
"reference_fasta": {
"help": "Absolute path to reference genome fasta",
"mode": "r",
"required": true,
"type": "Path"
},
"save_intermediate_files": {
"default": false,
"help": "Whether to save intermediate files",
"required": true,
"type": "Bool"
},
"scatter_count": {
"default": "50",
"help": "How many intervals to divide the genome into for parallelization",
"required": true,
"type": "Integer"
},
"split_intervals_extra_args": {
"allow_empty": true,
"help": "Extra arguments for interval splitting",
"required": false,
"type": "String"
}
},
"proc_name_keys": [
"withName:run_validate_PipeVal",
"withName:extract_GenomeIntervals",
"withName:run_SplitIntervals_GATK",
"withName:run_RealignerTargetCreator_GATK",
"withName:run_IndelRealigner_GATK",
"withName:run_BaseRecalibrator_GATK",
"withName:run_ApplyBQSR_GATK",
"withName:run_MergeSamFiles_Picard",
"withName:deduplicate_records_SAMtools",
"withName:run_index_SAMtools",
"withName:run_GetPileupSummaries_GATK",
"withName:run_CalculateContamination_GATK",
"withName:run_DepthOfCoverage_GATK",
"withName:remove_intermediate_files",
"withName:remove_unmerged_BAMs",
"withName:remove_merged_BAM"
],
"proc_names": "[Ljava.lang.String;@7cf166db",
"process": {
"cache": false,
"containerOptions": {
"1": "--cpu-shares 1024 --cpus $task.cpus",
"2": "--cpu-shares 1024 --cpus $task.cpus",
"3": "--cpu-shares 1024 --cpus $task.cpus",
"closure": "--cpu-shares 1024 --cpus $task.cpus"
},
"cpus": {
"1": "1",
"2": "2",
"3": "3",
"closure": "closure()"
},
"errorStrategy": {
"1": "terminate",
"2": "terminate",
"3": "terminate",
"closure": "terminate"
},
"executor": "local",
"maxRetries": "1",
"memory": "31 GB",
"withLabel:process_high": {
"cpus": {
"1": "12",
"2": "12",
"3": "12",
"closure": "retry_updater(12, add, 0, $task.attempt, cpus)"
},
"memory": {
"1": "31 GB",
"2": "31 GB",
"3": "31 GB",
"closure": "retry_updater(84 GB, exponential, 2, $task.attempt, memory)"
}
},
"withLabel:process_low": {
"cpus": {
"1": "2",
"2": "2",
"3": "2",
"closure": "retry_updater(2, add, 0, $task.attempt, cpus)"
},
"memory": {
"1": "3 GB",
"2": "6 GB",
"3": "12 GB",
"closure": "retry_updater(3 GB, exponential, 2, $task.attempt, memory)"
}
},
"withLabel:process_medium": {
"cpus": {
"1": "6",
"2": "6",
"3": "6",
"closure": "retry_updater(6, add, 0, $task.attempt, cpus)"
},
"memory": {
"1": "31 GB",
"2": "31 GB",
"3": "31 GB",
"closure": "retry_updater(42 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:deduplicate_records_SAMtools": {
"cpus": "2",
"memory": {
"1": "27.9 GB",
"2": "31 GB",
"3": "31 GB",
"closure": "retry_updater(27.9 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:extract_GenomeIntervals": {
"cpus": "1",
"memory": "1 GB"
},
"withName:remove_intermediate_files": {
"cpus": "1",
"memory": "1 GB"
},
"withName:remove_merged_BAM": {
"cpus": "1",
"memory": "1 GB"
},
"withName:remove_unmerged_BAMs": {
"cpus": "1",
"memory": "1 GB"
},
"withName:run_ApplyBQSR_GATK": {
"cpus": "1",
"memory": {
"1": "2 GB",
"2": "8 GB",
"3": "31 GB",
"closure": "retry_updater(2 GB, exponential, 4, $task.attempt, memory)"
}
},
"withName:run_BaseRecalibrator_GATK": {
"cpus": "1",
"memory": {
"1": "27.9 GB",
"2": "31 GB",
"3": "31 GB",
"closure": "retry_updater(27.9 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:run_CalculateContamination_GATK": {
"cpus": "1",
"memory": {
"1": "14 GB",
"2": "27.9 GB",
"3": "31 GB",
"closure": "retry_updater(14 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:run_DepthOfCoverage_GATK": {
"cpus": "1",
"memory": {
"1": "14 GB",
"2": "27.9 GB",
"3": "31 GB",
"closure": "retry_updater(14 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:run_GetPileupSummaries_GATK": {
"cpus": "1",
"memory": {
"1": "14 GB",
"2": "27.9 GB",
"3": "31 GB",
"closure": "retry_updater(14 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:run_IndelRealigner_GATK": {
"cpus": "2",
"memory": {
"1": "4 GB",
"2": "16 GB",
"3": "31 GB",
"closure": "retry_updater(4 GB, exponential, 4, $task.attempt, memory)"
}
},
"withName:run_MergeSamFiles_Picard": {
"cpus": "2",
"memory": {
"1": "27.9 GB",
"2": "31 GB",
"3": "31 GB",
"closure": "retry_updater(27.9 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:run_RealignerTargetCreator_GATK": {
"cpus": "2",
"memory": {
"1": "4 GB",
"2": "8 GB",
"3": "16 GB",
"closure": "retry_updater(4 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:run_SplitIntervals_GATK": {
"cpus": "1",
"memory": "1 GB"
},
"withName:run_index_SAMtools": {
"cpus": "1",
"memory": {
"1": "2 GB",
"2": "4 GB",
"3": "8 GB",
"closure": "retry_updater(2 GB, exponential, 2, $task.attempt, memory)"
}
},
"withName:run_validate_PipeVal": {
"cpus": "1",
"memory": "1 GB"
}
},
"report": {
"enabled": true,
"file": "/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/log-recalibrate-BAM-1.0.0-rc.4-20240214T213139Z/nextflow-log/report.html"
},
"timeline": {
"enabled": true,
"file": "/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/log-recalibrate-BAM-1.0.0-rc.4-20240214T213139Z/nextflow-log/timeline.html"
},
"trace": {
"enabled": true,
"file": "/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/log-recalibrate-BAM-1.0.0-rc.4-20240214T213139Z/nextflow-log/trace.txt"
},
"workDir": "/scratch/851543",
"yaml": {
"aligner": {
"help": "Aligner used to align input BAMs. Provided as <Aligner>-<Aligner-version>",
"required": true,
"type": "AlignerTool"
},
"base_resource_update": {
"elements": {
"cpus": {
"help": "List of CPU updates",
"required": false,
"type": "ResourceUpdateList"
},
"memory": {
"help": "List of memory updates",
"required": false,
"type": "ResourceUpdateList"
}
},
"help": "User-defined modifications for adjusting base resource allocations for processes",
"required": false,
"type": "ResourceUpdateNamespace"
},
"bundle_contest_hapmap_3p3_vcf_gz": {
"help": "Absolute path to ConEst HapMap 3p3 VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"bundle_known_indels_vcf_gz": {
"help": "Absolute path to known INDELs VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"bundle_mills_and_1000g_gold_standard_indels_vcf_gz": {
"help": "Absolute path to Mills and 1000g gold standard INDELs VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"bundle_v0_dbsnp138_vcf_gz": {
"help": "Absolute path to v0 dbSNP 138 VCF",
"mode": "r",
"required": true,
"type": "Path"
},
"dataset_id": {
"help": "Dataset ID",
"required": true,
"type": "String"
},
"gatk_ir_compression": {
"choices": [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9"
],
"default": "1",
"help": "",
"required": false,
"type": "Integer"
},
"input": {
"elements": {
"BAM": {
"elements": {
"normal": {
"help": "Input normal BAMs",
"required": false,
"type": "BAMEntryList"
},
"tumor": {
"help": "Input tumor BAMs",
"required": false,
"type": "BAMEntryList"
}
},
"help": "Input BAMs for calling",
"required": true,
"type": "InputBAMNamespace"
},
"recalibration_table": {
"allow_empty": false,
"help": "List of any available recalibration tables",
"required": false,
"type": "RecalibrationTableList"
}
},
"help": "Input samples",
"required": true,
"type": "InputNamespace"
},
"intervals": {
"allow_empty": true,
"help": "Target intervals to process for DNA panel/targeted sequencing samples; leave empty for WGS",
"required": true,
"type": "String"
},
"is_DOC_run": {
"default": false,
"help": "Whether to run the DepthOfCoverage process, which is very time-consuming for large BAMs",
"required": true,
"type": "Bool"
},
"is_emit_original_quals": {
"default": true,
"help": "Whether to emit original quality scores after recalibration",
"required": true,
"type": "Bool"
},
"metapipeline_delete_input_bams": {
"default": false,
"help": "Whether to delete the input BAMs",
"required": true,
"type": "Bool"
},
"metapipeline_final_output_dir": {
"help": "Directory containing final outputs to check before input deletion",
"required": false,
"type": "String"
},
"metapipeline_states_to_delete": {
"choice": [
"normal",
"tumor"
],
"default": [
"normal",
"tumor"
],
"help": "List of states for which to delete input BAMs",
"required": true,
"type": "List"
},
"output_dir": {
"help": "Absolute path to output directory",
"mode": "w",
"required": true,
"type": "Path"
},
"patient_id": {
"help": "Patient ID",
"required": true,
"type": "String"
},
"reference_fasta": {
"help": "Absolute path to reference genome fasta",
"mode": "r",
"required": true,
"type": "Path"
},
"save_intermediate_files": {
"default": false,
"help": "Whether to save intermediate files",
"required": true,
"type": "Bool"
},
"scatter_count": {
"default": "50",
"help": "How many intervals to divide the genome into for parallelization",
"required": true,
"type": "Integer"
},
"split_intervals_extra_args": {
"allow_empty": true,
"help": "Extra arguments for interval splitting",
"required": false,
"type": "String"
}
}
}

Due to our metaprogramming madness many of the memory and cpu values under process are closures that return different results depending upon task.attempt - in those cases the reported value is a map containing the closure code itself (heavy caveats) and the computed values for attempts 1, 2, and 3:

"withName:run_index_SAMtools": {
"cpus": "1",
"memory": {
"1": "2 GB",
"2": "4 GB",
"3": "8 GB",
"closure": "retry_updater(2 GB, exponential, 2, $task.attempt, memory)"
}

Current Usage

The entry.py script compares the bundled tests against a local checkout of the recalibrate BAM pipeline. With a clean checkout that goes well (all examples are running on my mac laptop, detached from the cluster):

$ ./entry.py ~/src/pipeline-recalibrate-BAM/
No changes!

After I make a small change to the pipeline...

--- a/config/F16.config
+++ b/config/F16.config
@@ -17,7 +17,7 @@ process {
         retry_strategy {
             memory {
                 strategy = 'exponential'
-                operand = 2
+                operand = 4
             }
         }
     }
$ ./entry.py ~/src/pipeline-recalibrate-BAM/
.params.proc_resource_params.run_RealignerTargetCreator_GATK.retry_strategy.memory.operand
2
4
------
.process.withName:run_RealignerTargetCreator_GATK.memory.2
8 GB
16 GB
------
.process.withName:run_RealignerTargetCreator_GATK.memory.3
16 GB
31 GB
------
.process.withName:run_RealignerTargetCreator_GATK.memory.closure
retry_updater(4 GB, exponential, 2, $task.attempt, memory)
retry_updater(4 GB, exponential, 4, $task.attempt, memory)
------
Saving updated file to /Users/nwiltsie/src/tool-Nextflow-action/run-nextflow-tests/recalibrate-bam-out.json

When differences are found, the script saves a complete test file as "[testfile]-out.json", which makes it easy to copy over the original if the changes were intended.

Internals

The key steps are:

  1. Build the docker image (currently done on-the-fly)
  2. Generate a temporary config file with the appropriate mocks and includeConfigs
  3. Within the image, run a groovy script that's kinda-sorta like nextflow config -properties
  4. Outside the image, parse the properties and diff them against the expected values

Docker Image

This tool runs in a modified version of the nextflow/nextflow:23.10.0 image, and can be updated if desired.

The key changes are to download and include mockito on the classpath, and to swap the JAR entrypoint for the nextflow script from nextflow.cli.Launcher to groovy.ui.GroovyMain (what you get if you call groovy from the command line).

This is definitely hackish, but the goal is to execute an arbitrary groovy script with exactly the same JVM arguments and settings as nextflow. The nextflow script does some tricky business, caching all of those arguments in /.nextflow/tmp/launcher/nextflow-one_${NEXTFLOW_VERSION}/buildkitsandbox/classpath-${NEXTFLOW_MD5}, where ${NEXTFLOW_MD5} is the hash of a bunch of environment variables. I'd prefer to use $NXF_CLASSPATH rather than sed to splice in the extra JARs, but doing so would result in an unpredictable launcher and make it harder to swap out the entrypoint.

Runtime Container

At runtime, this tool creates a container from the image and bind-mounts:

  • The pipeline to the same path
  • An empty NamedTemporaryFile to each "empty file"
  • Individual pipeline files to the absolute paths defined by "mapped_files"
  • A temporary directory to /mnt/bl_tests

Within that temporary directory, it creates:

  • docker_test.config - the temporary config file described below
  • cli_params.json - A JSON file of the Nextflow command-line parameters
  • test_mocks.json - A JSON file of the test-specific method mocks

The container also gets the following environment variables:

  • Everything specified by the test's "envvars"
  • BL_PIPELINE_DIR - Path to the pipeline
  • BL_CONFIG_FILE - The container path to docker_test.config
  • BL_MOCKS_FILE - The container path to test_mocks.json
  • BL_CLI_PARAMS_FILE - The container path to cli_params.json
  • BL_PARAMS_FILE - The container path to the "params file" (if set)

Temporary Config File

The generated config file looks something like this:

import nextflow.util.SysHelper
import nextflow.util.MemoryUnit
import static org.mockito.Mockito.*
import org.mockito.MockedStatic

try (MockedStatic dummyhelper = mockStatic(
        SysHelper.class,
        CALLS_REAL_METHODS)) {
    dummyhelper
        .when(SysHelper::getAvailCpus)
        .thenReturn(16);
    dummyhelper
        .when(SysHelper::getAvailMemory)
        .thenReturn(MemoryUnit.of("31GB"));
    includeConfig "/Users/nwiltsie/src/pipeline-recalibrate-BAM/test/nftest.config"
}

System.out.println("=========SENTINEL_OUTPUT==========")

The two key parts here:

  1. The use of mockito to mock out static Java libraries (it's mindblowing that it works) with the CPU and memory values from the test configuration.
  2. Explicitly printing a sentinel value after the true config files are loaded. Our configs are littered with random print statements, so the sentinel indicates the boundary between all that junk and the structured output to follow.

Groovy Script

Here's the main madness. This script is a chimera of nextflow config and a portion of nextflow run with some major changes:

  1. I don't know how to do parameter parsing in groovy, so the parameters and config files are read from the BL_* environment variables and injected into the nextflow.cli.CliOptions and nextflow.cli.CmdRun objects in the same way that they would have been using CLI parameters.
  2. The actual building of the ConfigObject (where all of our config code is executed) is wrapped in an Interceptor that gets to spy on every method call. If any of those method names are in the tests's "mocks", the actual method is ignored and the pre-computed result is returned.

At this point the ConfigObject is as complete as Nextflow would ever make it. However, if it were to be serialized at this point every closure would be represented like Script1E713595D312C8FDC3D2D9EDAF11D59D$_run_closure1$_closure11@65293ca5, which is not ideal.

To solve that I take one more step and walk over the process block with another Interceptor. During that process I take the following steps:

  1. For top-level keys (direct children of the process block), cache their name (e.g. withLabel:process_medium) in the interceptor while traversing their children.
  2. For every closure value, attempt to replace it with the following, moving onto the next if there is an issue:
    1. The result of evaluating the closure, raising an exception if it tries to get a property named task.
    2. A map with keys closure, 1, 2, and 3 mapping to results of evaluating the closure, with some serious skullduggery to intercept ConfigObject.get("task") to inject the value [process: <saved name>, attempt: '???', cpus: '$task.cpus'] (making the closure not-so-closed). The value for cpus is the literal string '$task.cpus', but the value for attempt varies:
      • For closure, attempt is set to the literal string '$task.attempt'. While evaluating the closure any calls to methods named check_limits or retry_updater are mocked out for strings like "$name(${args.join(', ')})" (e.g. "retry_updater(12, add, 0, $task.attempt, cpus)").
      • For 1, 2, and 3, attempt is set to that integer and the closure is evaluated without any other shenanigans.
  3. If we still fail, it's because of a closure like cpus = { methods.check_limits( 1 * task.attempt, 'cpus' ) }. 1 * task.attempt is evaluated with the static Java method java.lang.Integer.multiply and can't be intercepted without mockito, so at this point I just bail and return "closure()".

Finally, the fully-rendered configuration is printed to stdout in Java Properties File format.

Parsing and Diffing

Outside of the docker container, the groovy output is split using the sentinel value and parsed into a dictionary. A lot of junk shows up here - seemingly any variable defined without a def (example, another example) bubbles up to the global namespace. For sanity I'm not parsing that json_object variable, but we should just add the def onto it.

Once parsed, I pop off any configuration-only namespaces we've defined, as they're just noise:

# These are namespaces defined in the common submodules
boring_keys = {
'csv_parser',
'custom_schema_types',
'methods',
'retry',
'schema',
'bam_parser',
'json_extractor',
}
for key in boring_keys:
result.pop(key, None)

Finally, I generate a list of differences, using re.sub to remove any dates from the "dated_fields".

Where to go from here

As-is, this is not an Action. However, it is shockingly complicated, and before I go any further I need feedback from the lot of you. Feel free to dive into any aspect of this, but here are some specific questions:

  • Is this at core a terrible idea? I don't really want to throw this away, but it will only work if the rest of you are invested in the utility of it.
  • Does the top-level interface seem reasonable? My vague plan is to have the Action look for all nf-configtest-*.json files (or something) in the repository. That would allow each test file to be independent.

Copy link

@sorelfitzgibbon sorelfitzgibbon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superficially, which is as far as I'm capable of going with this, looks great to me. How does this work interplay, if at all, with the plan to capture all input configuration for every run? Can we use some of this to capture both input values and post configuration values to store? Or can we already extract those values just from the normal run?

@nwiltsie
Copy link
Member Author

nwiltsie commented Feb 15, 2024

How does this work interplay, if at all, with the plan to capture all input configuration for every run?

This is definitely related work. My mental framework is as follows:

  1. We write configuration and parameter files for the pipelines (our intended inputs)
  2. Each pipeline's configuration process renders our inputs into variety of processes to be run and inputs for each (the rendered configuration)
  3. Nextflow executes each process (nextflow reports, pipeline outputs)

There are three problems there:

  • On the input side, we control the intended inputs, but the pipelines actually work with the rendered configuration. Something might go wrong in that translation or change inadvertently.
  • On the output side, we can only infer pieces of the rendered configuration from the Nextflow reports.
  • On the output side, even assuming we have the full rendered configuration, we can't back-calculate the original human inputs.

This work partially addresses the first problem by helping us be consistent in the rendering/translation process (although it doesn't say anything about it being right). Sufficient test coverage with this tool might convince us of the "rightness".

@j2salmingo's work with uclahs-cds/pipeline-Nextflow-config#43 addresses the second problem by capturing the full rendered configuration with the pipeline results.

To my knowledge, the third problem is currently unaddressed.

@sorelfitzgibbon
Copy link

Thanks, I agree it would be helpful to solve all three problems.

@kiarod
Copy link

kiarod commented Feb 15, 2024

Is this at core a terrible idea? I don't really want to throw this away, but it will only work if the rest of you are invested in the utility of it.

On the first point, this definitely is not a terrible idea, and I don't think we can throw it away if we wanted to. This would be an awesome utility to take the guesswork out of the downstream effects of configuration changes, and would dramatically improve the quality of code reviews in a number of ways(speed, certainty, trackability of changes).

Does the top-level interface seem reasonable? My vague plan is to have the Action look for all nf-configtest-*.json files (or something) in the repository. That would allow each test file to be independent.

In terms of top level interface, did you compose the test json by hand or did you have an automated way of constructing the json or at least a skeleton of it. Some of the components seem to be JSONified portions of the schema.yaml

Overall this is amazing work Nick! I think if this were actionified it would seriously raise the professional standard of our pipeline work. I do feel it would be worthwhile to present a demo on this work at Nextflow WG meeting so we can stop and ask you implementation level questions

@nwiltsie
Copy link
Member Author

nwiltsie commented Feb 15, 2024

In terms of top level interface, did you compose the test json by hand or did you have an automated way of constructing the json or at least a skeleton of it. Some of the components seem to be JSONified portions of the schema.yaml

I hand-composed a file like this:

{
  "config": [
    "test/nftest.config"
  ],
  "params_file": "test/single.yaml",
  "cpus": 16,
  "memory_gb": 31,
  "empty_files": [
    "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta",
    "/hot/ref/tool-specific-input/GATK/GRCh38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "/hot/ref/tool-specific-input/GATK/GRCh38/Homo_sapiens_assembly38.known_indels.vcf.gz",
    "/hot/ref/tool-specific-input/GATK/GRCh38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz",
    "/hot/ref/tool-specific-input/GATK/GRCh38/Biallelic/hapmap_3.3.hg38.BIALLELIC.PASS.2021-09-01.vcf.gz",
    "/hot/resource/SMC-HET/tumours/A-mini/bams/n1/output/S2.T-n1.bam"
  ],
  "mapped_files": [],
  "nf_params": {
    "output_dir": "/tmp/outputs"
  },
  "envvars": {
    "SLURM_JOB_ID": "851543"
  },
  "mocks": {
    "parse_bam_header": {
      "read_group": [
        {
          "SM": "4915723"
        }
      ]
    }
  },
  "dated_fields": [
    "params.log_output_dir",
    "report.file",
    "timeline.file",
    "trace.file"
  ],
  "expected_result": {}
}

So that's all of the inputs and an expected result of nothing. When I ran the tool it complained about all of the unexpected results, and wrote them to a new file for me (I trimmed out most of the JSON from the console output below), which I turned right around and committed. That definitely makes these regression tests - I don't swear that it's the correct output, but it's consistent output.

$ ./entry.py ~/src/pipeline-recalibrate-BAM/
docker
None
{'all_group_ids': '$(for i in `id --real --groups`; do echo -n "--group-add=$i "; done)', 'enabled': True, 'runOptions': '-u $(id -u):$(id -g) $(for i in `id --real --groups`; do echo -n "--group-add=$i "; done)', 'uid_and_gid': '-u $(id -u):$(id -g)'}
------
manifest
None
{'author': 'Yash Patel', 'description': 'Nextflow pipeline to perform Indel Realignment and Base Quality Score Recalibration', 'name': 'recalibrate-BAM', 'version': '1.0.0-rc.4'}
------
params
None
[full content trimmed]
------
params_schema
None
[full content trimmed]
------
process
None
[full content trimmed]
------
report
None
{'enabled': True, 'file': '/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/log-recalibrate-BAM-1.0.0-rc.4-20240215T215116Z/nextflow-log/report.html'}
------
timeline
None
{'enabled': True, 'file': '/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/log-recalibrate-BAM-1.0.0-rc.4-20240215T215116Z/nextflow-log/timeline.html'}
------
trace
None
{'enabled': True, 'file': '/tmp/outputs/recalibrate-BAM-1.0.0-rc.4/TWGSAMIN000001/log-recalibrate-BAM-1.0.0-rc.4-20240215T215116Z/nextflow-log/trace.txt'}
------
workDir
None
/scratch/851543
------
yaml
None
[full content trimmed]
------
Saving updated file to /Users/nwiltsie/src/tool-Nextflow-action/run-nextflow-tests/recalibrate-bam-out.json

So yes, that yaml block is the JSONified version of the schema.yaml, because this line of code defines it without a def.

@nwiltsie
Copy link
Member Author

(Expanding on that - that line of code should have a def on it, and we shouldn't include that yaml key in the test files. This is going to quickly turn up a lot of global variables that we didn't intend to be global.)

@zhuchcn
Copy link
Member

zhuchcn commented Feb 18, 2024

This is absolutely amazing work! Have you considered incorporating this into our nf-test? It is an already published software, so we could even consider another publication of nf-test v2 in the future if more functionalities are added. Are you also expect users to write a series of test cases for different input parameters and enviroments? We can benefit from it when developing locally if this can be accessed from command line directly.

A couple of silly questions:

  1. Does every closure needs to be provided in 'mocks'? What will happen if a cluster is not mocked?
  2. I don't understand what entry.py is comparing against. What is a local checkout?

@nwiltsie
Copy link
Member Author

Have you considered incorporating this into our nf-test? It is an already published software, so we could even consider another publication of nf-test v2 in the future if more functionalities are added.

I've vaguely considered it - part of my goal in pushing this up now was to solicit feedback on how best to use it. I'm open to whatever makes the most sense!

Are you also expect users to write a series of test cases for different input parameters and enviroments? We can benefit from it when developing locally if this can be accessed from command line directly.

Yes, the expectation is that we'd have a representative set of tests stored in each pipeline - different CPU counts, different inputs, etc. I agree that it should be available locally as well as in the cloud.

  1. Does every closure needs to be provided in 'mocks'? What will happen if a cluster is not mocked?

No, not every closure needs to be mocked (in fact, as few as possible should be mocked out). Anything not mocked will be evaluated in the same way it would be in a real pipeline run.

  1. I don't understand what entry.py is comparing against. What is a local checkout?

entry.py is just a small demo for this pull request. In practice this should work like NFTest, with each pipeline storing its own test files. Specifically for this pull request, to demonstrate the functionality, I bundled a test case for the recalibrate BAM pipeline and wrote entry.py as a simple test runner. That means you can follow these steps to run the test:

#!/bin/bash

# Clone this pull request
git clone \
    --branch nwiltsie-nextflow-regression-logic \
    --recurse-submodules \
    git@github.com:uclahs-cds/tool-Nextflow-action.git

# Clone the recalibrate-BAM pipeline (with submodules)
git clone \
    --recurse-submodules \
    git@github.com:uclahs-cds/pipeline-recalibrate-BAM.git

# Run the test bundled with the pull request
./tool-Nextflow-action/run-nextflow-tests/entry.py \
    "$(readlink -f ./pipeline-recalibrate-BAM/)"

@nwiltsie
Copy link
Member Author

uclahs-cds/pipeline-call-gSV#118 is a perfect example of where this can be useful. @Faizal-Eeman and I created the test files from cd08fa7 and a187b4e to match pipeline-call-gSV's current main branch. When run against the branch from that pull request, the output is...

$ ./entry.py ../../../pipelines/pipeline-call-gSV ./call-gsv-F32.json
.paramsproc_resource_params
None
{'call_gSV_Delly': {'cpus': '1', 'memory': '30 GB', 'retry_strategy': {'memory': {'operand': '2', 'strategy': 'exponential'}}}, 'call_gSV_Manta': {'cpus': '1', 'memory': '30 GB', 'retry_strategy': {'memory': {'operand': '2', 'strategy': 'exponential'}}}, 'run_validate_PipeVal': {'cpus': '1', 'memory': '1 GB'}}
------
.process.withName:call_gSV_Delly.memory
30 GB
{'1': '30 GB', '2': '60 GB', '3': '64 GB', 'closure': 'retry_updater(30 GB, exponential, 2, $task.attempt, memory)'}
------
.process.withName:call_gSV_Manta.memory
30 GB
{'1': '30 GB', '2': '60 GB', '3': '64 GB', 'closure': 'retry_updater(30 GB, exponential, 2, $task.attempt, memory)'}
------
proc_names
None
[Ljava.lang.String;@47fc9ce
------
Saving updated file to call-gsv-F32-out.json

So the differences are:

  1. The branch newly defines the params.proc_resource_params variable (and my script clearly has a bug in assembling that variable name).
  2. The call_gSV_Delly and call_gSV_Manta processes previously had a flat 30 GB memory allocation, and now have a progressive allocation of 30 GB, 60 GB, and 64 GB on the first three attempts (this is on an F32 node).
  3. The proc_names variable is newly set to an opaque pointer value, indicating that it's a variable missing a def (link).

Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I think we can proceed with this version and make updates as necessary once we begin implementing across pipelines

run-nextflow-tests/entry.py Show resolved Hide resolved
@nwiltsie nwiltsie merged commit d84ad41 into main Feb 27, 2024
1 check passed
@nwiltsie nwiltsie deleted the nwiltsie-nextflow-regression-logic branch February 27, 2024 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants