diff --git a/recipes/12-string-matching.md b/recipes/12-string-matching.md
new file mode 100644
index 0000000..b8e156c
--- /dev/null
+++ b/recipes/12-string-matching.md
@@ -0,0 +1,121 @@
+# Using find, matches, and sub for String Matching in WDL
+
+**When processing genomic data (or any data that involves string manipulation),** you may need to check whether input strings follow a certain pattern, extract part of them, or replace certain substrings. **WDL 1.2** covers all three cases with built-in functions: `find`, `matches`, and `sub`, which use POSIX Extended Regular Expressions (ERE).
+
+In this recipe, we’ll look at how to:
+- Detect whether a pattern exists in your input strings (`matches`)
+- Extract the first match if it exists (`find`)
+- Replace occurrences of a regex pattern (`sub`)
+
+Just like any other WDL recipe, we’ll illustrate **why** you’d use these functions, **how** to integrate them into a minimal workflow, and **what** pitfalls to watch out for (like double-escaping backslashes).
+
+---
+
+## Problem
+
+You need a straightforward way to test or manipulate strings based on a regex pattern in your WDL workflow. Specifically, you want to:
+
+1. **Check** whether a string contains a substring matching a pattern (returning true/false).
+2. **Find** the first substring that matches a pattern, returning that substring or `None`.
+3. **Replace** all occurrences of a pattern within a string.
+
+Before WDL 1.2, only `sub` was built in; testing for a match or extracting one meant relying on system calls or external scripts. WDL 1.2 adds `matches` and `find`, so all three operations are now available in the standard library.
+
+---
+
+## Solution
+
+WDL 1.2 offers **three** core functions in its standard library:
+
+1. **matches(String input, String pattern) → Boolean**
+   Returns `true` if `pattern` occurs at least once in `input`, otherwise `false`.
+
+2. **find(String input, String pattern) → String?**
+   Returns the *first* matching substring if `pattern` is found, or `None` if it is not.
+
+3. **sub(String input, String pattern, String replace) → String**
+   Replaces *all* non-overlapping occurrences of `pattern` in `input` with `replace`.
+
+**Important**:
+- These functions rely on **POSIX Extended Regular Expressions**, so watch out for **double-escaping** backslashes.
+- The WDL file must declare version 1.2 or higher, because `matches` and `find` are not available in older versions.
+
+---
+
+## Example: Minimal Workflow
+
+Below is a small example that demonstrates each function in a single workflow. The user provides an `input_string` and some patterns, and the workflow outputs the result of each operation.
+
+```wdl
+version 1.2
+
+workflow StringRegexExample {
+  input {
+    String input_string       # e.g. "Hello\tBob, sample_R1.fastq world!"
+    String pattern_for_find   # e.g. "\\t"
+    String pattern_for_match  # e.g. "_R1"
+    String pattern_for_sub    # e.g. "world"
+    String replace_string     # e.g. "WDL"
+  }
+
+  # Return the first substring matching pattern_for_find, or None:
+  String? found_substring = find(input_string, pattern_for_find)
+
+  # Check whether pattern_for_match occurs anywhere in the input:
+  Boolean matched = matches(input_string, pattern_for_match)
+
+  # Replace every occurrence of pattern_for_sub:
+  String replaced_result = sub(input_string, pattern_for_sub, replace_string)
+
+  output {
+    String? found = found_substring
+    Boolean is_matched = matched
+    String replaced = replaced_result
+  }
+}
+```
+
+### Sample Inputs
+
+```json
+{
+  "StringRegexExample.input_string": "Hello\tBob, sample_R1.fastq world!",
+  "StringRegexExample.pattern_for_find": "\\t",
+  "StringRegexExample.pattern_for_match": "_R1",
+  "StringRegexExample.pattern_for_sub": "world",
+  "StringRegexExample.replace_string": "WDL"
+}
+```
+
+### Running Locally with Cromwell
+
+```bash
+java -jar cromwell.jar run StringRegexExample.wdl --inputs inputs.json
+```
+
+**Possible Output** (assuming your engine build supports WDL 1.2):
+
+- found: `"\t"` (the tab character), or `None` if not found
+- is_matched: `true` (since `_R1` is found in `"sample_R1.fastq"`)
+- replaced: `"Hello\tBob, sample_R1.fastq WDL!"`
+
+---
+
+## Key Points
+
+1. **Regex Anchors**: `matches` succeeds on a partial match. To require that the entire string matches, anchor the pattern with `^` and `$` (e.g., `matches("abc123", "^a.+3$")`).
+2. **Double-Escaping**: A backslash inside a WDL string literal must itself be escaped, so `"\\t"` in WDL reaches the regex engine as `\t`. The same applies to patterns supplied through a JSON inputs file.
+3. **Conditional**: To branch on whether something matches, use `if (matches(...)) { ... }`, or test `defined(find(...))`, since `find` returns `None` when nothing matches.
+4. **Return Types**:
+   - `matches(...)` returns a `Boolean`
+   - `find(...)` returns a `String?` (optional)
+   - `sub(...)` returns a plain `String`
+
+**That’s all there is to it!** With WDL 1.2, you can incorporate string pattern matching and replacement into your workflows without resorting to external scripting.
+
+---
+
+## Additional Resources
+
+- [Official WDL 1.2 Specification](https://github.com/openwdl/wdl/blob/wdl-1.2/SPEC.md)
+- [Regex Syntax: POSIX Extended Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended)
\ No newline at end of file