TAN-regex

Version 1.01

TAN-regex is an XSLT library that extends regular expressions used by XPath functions matches(), replace(), tokenize(), and analyze-string(). The parallel TAN-regex functions, rgx:matches(), rgx:replace(), rgx:tokenize(), and rgx:analyze-string(), behave exactly like the standard XPath functions, but permit the escape character \u{}, which takes four types of constructions.

hexadecimal codepoints, e.g., \u{3f-4a 1faa}.
Unicode name words, e.g., \u{.omega!greek} (any Unicode character whose name includes the word "OMEGA" but not the word "GREEK").
Unicode composites, e.g., \u{+b} (any Unicode character that can be decomposed to a "b", e.g., bᵇḃḅḇ.
Unicode simple decompositions, e.g., \u{-ǡḃčď} (converts the character class to 'abcd').

If a particular version of Unicode is desired, use the $flags parameter, e.g., rgx:matches($input-text, '\u{.bottle}', '11.0').

A construction may take multiple items, space delimited, e.g., \u{+* .pizza} (any character with a plus as a component and any character with "pizza" in the name).

Functions are in the TAN namespace, tan:textalign.net,2015:ns. The prefix rgx is recommended.

Other useful functions:

rgx:string-to-components(). Takes each character in an input string and returns a concatenation of its decomposed components.
rgx:string-to-composites(). Takes each character in an input string and returns a concatenation of characters that can decompose to that character.
rgx:string-base(). Changes in an input string any characters that can decompose to a single base character.
Key get-chars-by-name, e.g., key('get-chars-by-name', ('parenthesis'), $default-ucd-names-db). Returns a tree fragment with Unicode characters with matching words in their names.

See the subdirectory tests for examples and the XSLT function library for all the functions, with documentation.

TAN-regex has been developed in service to the Text Alignment Network (http://textalign.net), but can be used independent of TAN. The functions are encapsulated, so can be incorporated by any XSLT stylesheet via <include> or <import>.

For more on TAN-regex see Joel Kalvesmaki, “A New \u: Extending XPath Regular Expressions for Unicode.” Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Kalvesmaki01.

I thank the following individuals for suggestions that significantly improved TAN-regex: C. M. Sperberg-McQueen, David Birnbaum.

Change history

1.01: 2021-07-06: set visibility on all functions; demoted numeric functions (better supported in the TAN function library).

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
tests		tests
ucd		ucd
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
TAN-regex.xsl		TAN-regex.xsl
regex-ext-tan-functions.xsl		regex-ext-tan-functions.xsl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TAN-regex

Change history

About

Releases

Packages

Languages

License

textalign/TAN-regex

Folders and files

Latest commit

History

Repository files navigation

TAN-regex

Change history

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages