AWS HPC Blog

Linter rules for Nextflow to improve the detection of errors before runtime

Nextflow is a popular domain-specific language (DSL) and runtime used to define workflows that string together multiple processing steps into a pipeline. This makes it well suited to complex genomics and scientific analyses, including machine-learning workloads.

Workflows defined in Nextflow code can leverage container orchestration technologies to deploy containerized workloads across clusters, clouds, or HPC environments. Because Nextflow is an interpreted language, errors in a script are only revealed at runtime. This increases the time and cost of developing and debugging a workflow, both of which could be reduced if errors were spotted earlier.

In this post, we’ll discuss how we created linter rules for Nextflow DSL 2, how these rules can be extended and how you can use them to check your scripts before runtime.

Background

Nextflow is commonly used by customers on AWS. Some use the AWS HealthOmics managed workflow service and others choose to deploy their own Nextflow engine, integrated with AWS Batch. The HealthOmics service provides a fully-managed experience while AWS Batch provides a more generalized mechanism for running batch computing tasks.

A key difference between the Nextflow DSL and a traditional programming language is that Nextflow code is interpreted at runtime rather than compiled. This provides flexibility, because workflows can be developed quickly without a build step. But it also means that you might not detect errors in the Nextflow script until the interpreter lands on that specific part of the code during an actual workflow run.

Because genomics workflows often involve processing large volumes of data, a workflow can run for hours (and hours) before failing on a simple coding mistake. Having to restart these long-running analyses is frustrating and can be costly. This runtime-evaluation model argues for a static analysis tool, known as a linter, that can scan Nextflow code before workflows are run to detect issues early.

Introducing a linter for Nextflow

To address the need for static analysis, we’ve developed a static linter that can analyze Nextflow workflow code by taking advantage of the fact that the Nextflow DSL uses Groovy, a dynamic language for the Java Virtual Machine, as its underlying implementation language. By building on Groovy, we can parse Nextflow code and analyze it, syntactically and semantically, without needing to execute the workflow itself.

We built our linter using CodeNarc, an open-source static analysis tool used to enforce Groovy coding standards and best practices. Under the covers, CodeNarc uses Groovy’s own parser to generate an abstract syntax tree (AST) representing the structure of the code. It then provides a framework for coding semantic rules that can traverse this AST using the visitor pattern and check for rule violations. We can take advantage of this capability to analyze Nextflow scripts for potential issues.

Understanding the Nextflow AST

An abstract syntax tree is a hierarchical data structure that models code at increasing levels of granularity. The root node represents the entire script or program. Child nodes represent top-level statements and expressions with their children representing sub-expressions. By traversing this tree, and examining nodes with semantic meaning, rules can gather the information we need to detect problems.

We built a simple application called AST Echo to determine which AST node types need to be visited to evaluate different Nextflow language elements. Our app parses Nextflow code and prints out the AST hierarchy along with the Groovy node-type associated with each Nextflow construct.

Using this approach, we were able to discover that a Nextflow process declaration is a Groovy MethodCallExpression and that a Nextflow process block maps to a Groovy ClosureExpression containing a BlockExpression. Here’s a partial example of a Nextflow DSL script. We’ve made the full source available on GitHub.

nextflow.enable.dsl=2

process foo {
    container 'ubuntu:latest'
    cpus 1

    output:
    path 'foo.txt'

    script:
    """
    your_command > foo.txt
    """
}

// other processes …

workflow {
    data = channel.fromPath('/some/path/*.txt')
    foo()
    // further process calls …
}

Providing this as input to our AST Echo app will produce something like this.

(∅:∅)-(∅:∅) <BlockStatement>:  { (nextflow.enable.dsl = 2); this.process(this.foo({ -> ... })); this.process(this.bar({ -> ... })); this.workflow({ -> ... }) }
  -> (1:1)-(1:22) <ExpressionStatement>:  (nextflow.enable.dsl = 2)
    -> (1:1)-(1:22) <BinaryExpression>: (nextflow.enable.dsl = 2)
      -> (1:1)-(1:20) <PropertyExpression>: nextflow.enable.dsl
        -> (1:9)-(1:16) <PropertyExpression>: nextflow.enable
          -> (1:1)-(1:9) <VariableExpression>: nextflow
          -> (1:10)-(1:16) <ConstantExpression>: enable
        -> (1:17)-(1:20) <ConstantExpression>: dsl
      -> (1:21)-(1:22) <ConstantExpression>: 2
  -> (3:1)-(14:2) <ExpressionStatement>:  this.process(this.foo({ -> ... }))
    -> (3:1)-(14:2) <MethodCallExpression>: this.process(this.foo({ -> ... }))
      -> (3:1)-(3:1) <VariableExpression>: this
      -> (3:1)-(3:8) <ConstantExpression>: process
      -> (3:9)-(14:2) <ArgumentListExpression>: (this.foo({ -> ... }))
        -> (3:9)-(14:2) <MethodCallExpression>: this.foo({ -> ... })
          -> (3:9)-(3:9) <VariableExpression>: this
          -> (3:9)-(3:12) <ConstantExpression>: foo
          -> (3:13)-(14:2) <ArgumentListExpression>: ({ -> ... })
            -> (3:13)-(14:2) <ClosureExpression>: { -> ... }
              -> (4:5)-(14:1) <BlockStatement>:  { this.container(ubuntu:latest); this.cpus(1); this.path(foo.txt); 
    your_command > foo.txt
     }
                -> (4:5)-(4:30) <ExpressionStatement>:  this.container(ubuntu:latest)

The root of the AST is on the first line. Each subsequent line is a node of the tree indented by the depth of the node.

The start and stop locations of the code contained in the node are displayed first with their line number and character offset separated by ‘:’. The start and stop are separated by ‘-’. The type of Groovy expression or statement is displayed surrounded by angle brackets. Following this is the code fragment, possibly truncated, contained by the node as interpreted by the Groovy parser.

Writing linter rules

Armed with these mappings we can write CodeNarc rules targeting the relevant AST node types.

CodeNarc rules define a set of methods matching AST node types which will be called by CodeNarc when the corresponding nodes are visited during AST traversal. In the body of these visit methods, we can write logic to gather information and detect rule violations. Any issues can be reported by adding them to CodeNarc’s violation list.

For example, we can write a rule that checks whether CPU or memory resource requests for processes fall within valid ranges. The rule would override the visitMethodCallExpression method of CodeNarc’s AbstractAstVisitor, which is called for each method call expression in the Groovy AST. It would check whether the method call is requesting CPU resources and then evaluate whether the arguments are valid.

Let’s look at part of a Groovy implementation of a rule to check the cpus directive by overriding visitMethodCallExpression:

import org.codehaus.groovy.ast.expr.ConstantExpression
import org.codehaus.groovy.ast.expr.Expression
import org.codehaus.groovy.ast.expr.MethodCallExpression
import org.codenarc.rule.AbstractAstVisitor
import org.codenarc.util.AstUtil

class CpuAstVisitor extends AbstractAstVisitor {
    def MIN_CPU = 2
    def MAX_CPU = 96

    @Override
    void visitMethodCallExpression(MethodCallExpression expression) {
        // Only inspect calls to the 'cpus' directive
        if (expression.getMethodAsString() == 'cpus') {
            checkOneArgument(expression)
        }

        super.visitMethodCallExpression(expression)
    }

    private checkOneArgument(final MethodCallExpression expression) {
        def methodArguments = AstUtil.getMethodArguments(expression)
        if (methodArguments.size() == 0) {
            addViolation(expression, 'the cpus directive must have one argument')
            return    // nothing further to check
        } else if (methodArguments.size() > 1) {
            addViolation(expression, 'the cpus directive must have only one argument')
        }

        if (methodArguments.first() instanceof ConstantExpression) {
            checkNumeric((ConstantExpression) methodArguments.first())
        }
    }

    private checkNumeric(ConstantExpression expression) {
        try {
            def val = Integer.parseInt(expression.value.toString())
            checkMinMax(expression, val)
        } catch (NumberFormatException ignored) {
            addViolation(expression,
                    "'${expression.value}' is not a valid number.")
        }
    }

    private void checkMinMax(Expression exp, final int val) {
        if (val < MIN_CPU) {
            addViolation(exp,
                    "The minimum CPU count is '$MIN_CPU'.")
        } else if (val > MAX_CPU) {
            addViolation(exp,
                    "The maximum CPU count is '$MAX_CPU'.")
        }
    }
}

During the check, we validate several conditions. Violations are recorded by calling the addViolation method. At the end of the checks, the super-class’s visitMethodCallExpression is invoked to continue the traversal of the AST. We’ve put the full source on GitHub.
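The visitor is registered with CodeNarc through a small companion rule class. Here’s a minimal sketch of what that class looks like; the class name, rule name, and priority shown are illustrative, and the actual definitions are in the GitHub repository.

import org.codenarc.rule.AbstractAstVisitorRule

// Registers CpuAstVisitor with CodeNarc under an illustrative rule name and priority.
class CpuRule extends AbstractAstVisitorRule {
    String name = 'CpuRule'
    int priority = 2
    Class astVisitorClass = CpuAstVisitor
}

CodeNarc instantiates the configured astVisitorClass and invokes its visit methods as it traverses the AST of each analyzed source file.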

It’s critical to rigorously test rules to ensure they work as expected. CodeNarc provides a convenient rule-testing harness. It allows us to define test Nextflow code snippets along with expected rule violations. CodeNarc then runs analysis on these code fragments and compares the actual violations it finds to the expected results, ensuring that all expected violations and no unexpected violations are detected.

For example, we can test that the rule from our previous code listing is violated when the cpus value is 97, but not violated when the value is 2.

@Test
void cpuRule_MaxViolation(){
    final SOURCE =
'''
process MY_PROCESS {
 cpus 97
}
'''
    assertSingleViolation(SOURCE, 3, 'cpus 97', "The maximum CPU count is '96'")
}

@Test
void cpuRule_NoViolationsMin(){
    final SOURCE =
'''
process MY_PROCESS {
 cpus 2
}
'''
    assertNoViolations(SOURCE)
}

This runs the rule against the code defined by SOURCE and asserts that a single violation occurs when cpus is set to 97, and that no violations occur when it is set to 2. We’ve put the full source of this test on GitHub.
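These test methods live inside a test class built on CodeNarc’s testing harness. Here’s a minimal sketch of that wrapper, assuming the rule class is named CpuRule (the actual class names are in the repository):

import org.codenarc.rule.AbstractRuleTestCase

// AbstractRuleTestCase supplies assertSingleViolation and assertNoViolations
// and runs the rule under test against each SOURCE snippet.
class CpuRuleTest extends AbstractRuleTestCase<CpuRule> {

    @Override
    protected CpuRule createRule() {
        new CpuRule()
    }

    // ... the @Test methods shown above go here ...
}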

Running the linter rules

To run the linter rules, build the package and then run it from the command line with:

java  -Dorg.slf4j.simpleLogger.defaultLogLevel=error \
  -classpath ./linter-rules/build/libs/linter-rules-0.1.jar:CodeNarc-3.3.0-all.jar:slf4j-api-1.7.36.jar:slf4j-simple-1.7.36.jar \
  org.codenarc.CodeNarc \
  -report=text:stdout \
  -rulesetfiles=rulesets/healthomics.xml \
  -includes=**/**.nf

The -includes argument tells the linter to inspect all filenames matching the *.nf pattern in the current working directory and any sub-directories. To change which rules are run, set -rulesetfiles to either a custom ruleset in CodeNarc format or a prebuilt set like rulesets/general.xml.
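Besides the XML format used by the prebuilt sets, CodeNarc also accepts rulesets written in its Groovy DSL. As an illustrative sketch (the package and rule class names below are hypothetical), a custom ruleset that reuses the general rules and adds one extra rule might look like this:

// my-ruleset.groovy: a custom CodeNarc ruleset (names are illustrative only)
ruleset {
    description 'Project-specific Nextflow linting rules'

    // pull in an existing ruleset file from the classpath
    ruleset('rulesets/general.xml')

    // add an individual rule by class
    rule(com.example.linter.CpuRule)
}

You would then point -rulesetfiles at this file; CodeNarc resolves ruleset paths on the classpath, or on the file system when the path is prefixed with file:.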

To simplify this, we’ve provided a Dockerfile that includes all required dependencies with a convenient entry point along with build and usage instructions.

Using this mechanism, the command to run with Docker becomes:

docker run -v $PWD:/data -e ruleset=healthomics linter-rules-for-nextflow

You can find a prebuilt image in the Amazon ECR Public Gallery.

Detecting runtime compatibility issues

Nextflow provides portability across environments, but workflows can still be written in ways that tie them to certain runtime capabilities.

We’ve developed rules that can detect the use of Nextflow features that may not be supported in AWS HealthOmics. For example, AWS HealthOmics ignores certain process directives that are only relevant to other runtime environments. Our rules warn when these directives are used so workflows designed for HealthOmics avoid incompatibility issues.

The AWS HealthOmics workflow service provides a managed environment optimized for security, scalability, and cost efficiency. Workflows must adhere to certain constraints around storage, containers, and infrastructure. We’ve built rules that check Nextflow code for patterns that violate HealthOmics environment policies.

For example, one rule checks that pipeline inputs are loaded from Amazon Simple Storage Service (Amazon S3) buckets, or HealthOmics Sequence Stores, and that pipeline container images are being pulled from Amazon ECR Private repositories.
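As a hypothetical illustration (the registry URI below is a placeholder), such a rule would flag the first process and pass the second:

process FLAGGED {
    // public registry image: a HealthOmics-focused rule would warn here
    container 'ubuntu:latest'

    script:
    """
    echo hello
    """
}

process COMPLIANT {
    // image pulled from a private Amazon ECR repository (placeholder URI)
    container '123456789012.dkr.ecr.us-east-1.amazonaws.com/ubuntu:latest'

    script:
    """
    echo hello
    """
}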

Other rules check that resource requests don’t exceed HealthOmics instance type capabilities. Configuring these as linter rules allows workflows to be pre-validated rather than discovering issues mid-execution.

In addition to AWS HealthOmics specific checks, we’ve also created some general rules that detect invalid syntax, potential portability issues, and other common anti-patterns. For example, one rule detects the use of undefined process directives, which often indicates a typo or a misunderstanding of the DSL.
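As a hypothetical example, the misspelled directive below would be reported because cpu is not a valid Nextflow process directive (the correct directive is cpus):

process TYPO {
    cpu 4   // flagged: 'cpu' is not a recognized process directive

    script:
    """
    echo hello
    """
}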

Flexible, extensible, and open-source

The linter lets you include only the rule sets you’re interested in. You can focus strictly on general Nextflow rules, or apply both the general rules and the HealthOmics best practices. Over time, we expect the rules library will grow to cover more scenarios.

We architected our linter so that rules can easily be contributed by the Nextflow community.

We’ve released the linter rules and the AST Echo utility as open-source projects on GitHub.

The code we’ve provided there uses the Apache 2.0 license. We welcome issues, pull requests, and additional rules from the Nextflow community.

Conclusion

By providing early feedback on Nextflow code, this linter can improve developer productivity, reduce errors, and make workflows more portable. We encourage all Nextflow pipeline authors to integrate it into their continuous integration pipelines and contribute to its evolution.

Static analysis of domain-specific languages like Nextflow opens up new possibilities for accelerating scientific advancements through code quality and collaboration.

Mark Schreiber

Mark Schreiber is a senior genomics consultant working in the Amazon Web Services (AWS) health artificial intelligence (AI) team. Mark specializes in genomics and life sciences applications and data. Prior to joining AWS, he worked for several years with pharmaceutical and biotech companies. Mark is also a frequent contributor to open-source projects.