Add new WDL recipe for string matching (find, matches, sub) #10

121 changes: 121 additions & 0 deletions recipes/12-string-matching.md
@@ -0,0 +1,121 @@
# Using find, matches, and sub for String Matching in WDL

**When processing genomic data (or any data that involves string manipulation),** you may need to check whether your input strings follow a certain pattern, retrieve part of them, or replace certain substrings. WDL provides three built-in functions for this: `find`, `matches`, and `sub`, all of which use POSIX Extended Regular Expressions (ERE). `find` and `matches` were added in **WDL 1.2**; `sub` is also available in earlier versions.

In this recipe, we’ll look at how to:
- Detect if a pattern exists in your input strings (`matches`)
- Extract the first match if it exists (`find`)
- Replace occurrences of a regex pattern (`sub`)

Just like any other WDL recipe, we’ll illustrate **why** you’d use these functions, **how** to integrate them into a minimal workflow, and **what** pitfalls to watch out for (like double-escaping backslashes).

---

## Problem

You need a straightforward way to test or manipulate strings based on a regex pattern in your WDL workflow. Specifically, you want to:

1. **Check** if a string contains some substring matching a certain pattern (returning true/false).
2. **Find** the first substring that matches a certain pattern, returning that substring or None.
3. **Replace** all occurrences of a certain pattern within a string.

Before WDL 1.2, only `sub` was built in; checking for a match or extracting one meant shelling out in a task command or calling an external script. WDL 1.2 adds `matches` and `find`, so all three operations are now available in the standard library.

---

## Solution

WDL 1.2 offers **three** core functions in its standard library:

1. **`matches(String input, String pattern) → Boolean`**
   Returns `true` if `pattern` occurs at least once in `input`, otherwise `false`.

2. **`find(String input, String pattern) → String?`**
   Returns the *first* matching substring if `pattern` is found, or `None` if not found.

3. **`sub(String input, String pattern, String replace) → String`**
   Replaces *all* non-overlapping occurrences of `pattern` in `input` with `replace`.

**Important**:
- Patterns are **POSIX Extended Regular Expressions** written as ordinary WDL strings, so backslashes must be **double-escaped** (write `"\\t"` to pass `\t` to the regex engine).
- The WDL file must declare `version 1.2` or higher, because `matches` and `find` are not available in older versions.
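
As a quick illustration of the three signatures above, here is a minimal sketch that applies each function to a literal string. The workflow name, the `filename` input, and its default value are hypothetical, chosen only for this example; the commented results follow from the function descriptions above.

```wdl
version 1.2

workflow RegexQuickLook {
  input {
    # Hypothetical sample value, used only for illustration
    String filename = "sample_R1.fastq"
  }

  output {
    # true: "_R1" occurs somewhere in the string
    Boolean has_r1 = matches(filename, "_R1")

    # "R1": the first substring made of "R" followed by 1 or 2
    String? read_tag = find(filename, "R[12]")

    # None: nothing in the string matches ".bam" at the end
    String? missing = find(filename, "\\.bam$")

    # "sample_R1.fq": the trailing ".fastq" is replaced
    String renamed = sub(filename, "\\.fastq$", ".fq")
  }
}
```

Note the `"\\."` in the last two patterns: the doubled backslash in the WDL string becomes a single `\.` for the regex engine, which matches a literal dot rather than any character.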

---

## Example: Minimal Workflow

Below is a small example that demonstrates each function in a single workflow. The user provides an `input_string` plus one pattern per function, and the workflow outputs the result of each call.

```wdl
version 1.2

workflow StringRegexExample {
  input {
    String input_string       # e.g. "Hello\tBob, sample_R1.fastq world!"
    String pattern_for_find   # e.g. "\\t"
    String pattern_for_match  # e.g. "_R1"
    String pattern_for_sub    # e.g. "world"
    String replace_string     # e.g. "WDL"
  }

  # Perform a find operation:
  String? found_substring = find(input_string, pattern_for_find)

  # Check if we have a match:
  Boolean matched = matches(input_string, pattern_for_match)

  # Replace any occurrence of pattern_for_sub:
  String replaced_result = sub(input_string, pattern_for_sub, replace_string)

  output {
    String? found = found_substring
    Boolean is_matched = matched
    String replaced = replaced_result
  }
}
```


### Sample Inputs

```json
{
  "StringRegexExample.input_string": "Hello\tBob, sample_R1.fastq world!",
  "StringRegexExample.pattern_for_find": "\\t",
  "StringRegexExample.pattern_for_match": "_R1",
  "StringRegexExample.pattern_for_sub": "world",
  "StringRegexExample.replace_string": "WDL"
}
```


### Running Locally with Cromwell

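```bash
java -jar cromwell.jar run StringRegexExample.wdl --inputs inputs.json
```

This command assumes a Cromwell build that accepts `version 1.2`. If yours rejects the version line, run the workflow with an engine that implements WDL 1.2.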


**Expected output** for the sample inputs above:

- `found`: `"\t"` (a single tab character; it would be `None` if the input contained no tab)
- `is_matched`: `true` (since `_R1` is found in `"sample_R1.fastq"`)
- `replaced`: `"Hello\tBob, sample_R1.fastq WDL!"`

---

## Key Points

1. **Regex Anchors**: A pattern can match anywhere in the string. To require the entire string to match, anchor it with `^` and `$` (e.g., `matches("abc123", "^a.+3$")`).
2. **Double-Escaping**: Patterns are ordinary WDL strings, so each backslash meant for the regex engine must be written twice. `"\\t"` in WDL reaches the regex engine as `\t`.
3. **Conditional**: To branch on a match, use `if (matches(...)) { ... }`, or test whether `find(...)` returned a value with `defined(find(...))`.
4. **Return Types**:
   - `matches(...)` returns a `Boolean`
   - `find(...)` returns a `String?` (optional)
   - `sub(...)` returns a plain `String`
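
The points above can be combined in one place. The following is a minimal sketch, not a definitive pattern: the workflow name, the `filename` input, and the `_R1`/`_R2` naming convention are assumptions made for illustration. It anchors a pattern to the whole string, converts the optional result of `find` into a Boolean with `defined`, and branches on the result of `matches`.

```wdl
version 1.2

workflow BranchOnPattern {
  input {
    # Hypothetical sample value, used only for illustration
    String filename = "sample_R1.fastq"
  }

  # Anchored: the whole string must end in _R1.fastq or _R2.fastq
  Boolean is_paired_name = matches(filename, "^.+_R[12]\\.fastq$")

  # find returns String?; defined() reports whether a match was found
  Boolean has_read_tag = defined(find(filename, "_R[12]"))

  if (is_paired_name) {
    # Strip the read tag and extension to leave a sample ID
    String sample_id = sub(filename, "_R[12]\\.fastq$", "")
  }

  output {
    Boolean paired = is_paired_name
    Boolean tagged = has_read_tag
    # Optional because sample_id is declared inside the conditional
    String? id = sample_id
  }
}
```

With the default input, `paired` and `tagged` are both `true` and `id` is `"sample"`. For a name that does not fit the anchored pattern, the `if` block is skipped and `id` is `None`.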

**That’s all there is to it!** By using WDL 1.2, you can easily incorporate string pattern matching and replacements into your workflows without resorting to external scripting.

---

## Additional Resources

- [Official WDL 1.2 Specification](https://github.com/openwdl/wdl/blob/wdl-1.2/SPEC.md)
- [Regex Syntax: POSIX Extended Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended)