TAN-regex

Version 1.01

TAN-regex is an XSLT library that extends the regular expressions accepted by the XPath functions matches(), replace(), tokenize(), and analyze-string(). The parallel TAN-regex functions, rgx:matches(), rgx:replace(), rgx:tokenize(), and rgx:analyze-string(), behave exactly like the standard XPath functions, but also permit the escape sequence \u{}, which takes four types of constructions:

  1. hexadecimal codepoints, e.g., \u{3f-4a 1faa}.
  2. Unicode name words, e.g., \u{.omega!greek} (any Unicode character whose name includes the word "OMEGA" but not the word "GREEK").
  3. Unicode composites, e.g., \u{+b} (any Unicode character that can be decomposed to a "b", e.g., bᵇḃḅḇ).
  4. Unicode simple decompositions, e.g., \u{-ǡḃčď} (converts the character class to 'abcd').
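The first two constructions can be illustrated outside XSLT. The following is a minimal Python sketch, not the library's implementation: the function names `hex_items_to_class` and `chars_by_name_words` are hypothetical, the character names come from whatever Unicode version ships with the Python interpreter, and splitting names into words on spaces and hyphens is an assumption about how name words are delimited.

```python
import re
import unicodedata


def hex_items_to_class(items: str) -> str:
    """Turn space-delimited hex codepoints or ranges into a regex character class."""
    parts = []
    for item in items.split():
        low, _, high = item.partition("-")
        start = int(low, 16)
        end = int(high, 16) if high else start
        part = re.escape(chr(start))
        if end != start:
            part += "-" + re.escape(chr(end))
        parts.append(part)
    return "[" + "".join(parts) + "]"


def chars_by_name_words(include, exclude=()):
    """Return characters whose names contain every word in `include` and none in `exclude`.

    Assumption: a character name is split into words on spaces and hyphens.
    """
    wanted = {word.upper() for word in include}
    unwanted = {word.upper() for word in exclude}
    found = []
    for codepoint in range(0x110000):
        # Unnamed or unassigned codepoints get an empty name and never match.
        name = unicodedata.name(chr(codepoint), "")
        words = set(re.split(r"[ -]", name))
        if wanted <= words and not (unwanted & words):
            found.append(chr(codepoint))
    return found


# Rough analogue of \u{3f-4a 1faa}
hex_class = hex_items_to_class("3f-4a 1faa")

# Rough analogue of \u{.omega!greek}
omega_not_greek = chars_by_name_words(["omega"], ["greek"])
```

Under these assumptions, `hex_class` matches "A" (U+0041 lies between U+003F and U+004A) but not "K", and `omega_not_greek` contains CYRILLIC CAPITAL LETTER OMEGA (U+0460) but not GREEK CAPITAL LETTER OMEGA (U+03A9).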

If a particular version of Unicode is desired, use the $flags parameter, e.g., rgx:matches($input-text, '\u{.bottle}', '11.0').

A construction may take multiple items, space delimited, e.g., \u{+* .pizza} (any character that can be decomposed to an asterisk, together with any character with "pizza" in its name).

Functions are in the TAN namespace, tag:textalign.net,2015:ns. The prefix rgx is recommended.

Other useful functions:

  • rgx:string-to-components(). Takes each character in an input string and returns a concatenation of its decomposed components.
  • rgx:string-to-composites(). Takes each character in an input string and returns a concatenation of characters that can decompose to that character.
  • rgx:string-base(). Replaces each character in an input string that can be decomposed with its single base character.
  • Key get-chars-by-name, e.g., key('get-chars-by-name', ('parenthesis'), $default-ucd-names-db). Returns a tree fragment containing the Unicode characters whose names include the matching words.
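The three string functions above can be approximated with Python's standard `unicodedata` module. This is a rough sketch of the idea only, not the library's algorithm: the Python function names simply mirror the XSLT ones, the use of compatibility decomposition (NFKD) is an assumption, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def string_to_components(text: str) -> str:
    """Concatenate the decomposed components of each character (assumes NFKD)."""
    return "".join(unicodedata.normalize("NFKD", ch) for ch in text)


def string_base(text: str) -> str:
    """Replace each character with its base character when exactly one exists."""
    result = []
    for ch in text:
        decomposed = unicodedata.normalize("NFKD", ch)
        # Non-combining components are treated as candidate base characters.
        bases = [c for c in decomposed if unicodedata.combining(c) == 0]
        result.append(bases[0] if len(bases) == 1 else ch)
    return "".join(result)


def string_to_composites(text: str) -> str:
    """For each input character, list every character whose base is that character."""
    found = {target: [] for target in set(text)}
    for codepoint in range(0x110000):
        if 0xD800 <= codepoint <= 0xDFFF:
            continue  # skip surrogate codepoints
        ch = chr(codepoint)
        base = string_base(ch)
        if base in found:
            found[base].append(ch)
    return "".join("".join(found[ch]) for ch in text)


base_form = string_base("ǡḃčď")
components = string_to_components("ḃ")
composites_of_b = string_to_composites("b")
```

With these definitions, `base_form` is "abcd", `components` is "b" followed by U+0307 COMBINING DOT ABOVE, and `composites_of_b` includes b, ᵇ, ḃ, ḅ, and ḇ among other characters. A ligature such as "ﬁ" is left unchanged by `string_base` because it decomposes to two base characters rather than one.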

See the subdirectory tests for examples and the XSLT function library for all the functions, with documentation.

TAN-regex has been developed in service to the Text Alignment Network (http://textalign.net), but can be used independently of TAN. The functions are encapsulated, so they can be incorporated into any XSLT stylesheet via <xsl:include> or <xsl:import>.

For more on TAN-regex see Joel Kalvesmaki, “A New \u: Extending XPath Regular Expressions for Unicode.” Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Kalvesmaki01.

I thank the following individuals for suggestions that significantly improved TAN-regex: C. M. Sperberg-McQueen, David Birnbaum.

Change history

1.01: 2021-07-06: set visibility on all functions; demoted numeric functions (better supported in the TAN function library).
