
Replace TokenDatatype pattern #770

Open · Rojax opened this issue Oct 11, 2024 · 4 comments
Labels: enhancement (New feature or request)

@Rojax commented Oct 11, 2024

User Story:

As a Metaschema user, I want to use the OSCAL catalog schema to validate my catalog files. I use https://github.com/python-jsonschema/check-jsonschema to validate my catalog against the schema https://github.com/usnistgov/OSCAL/releases/download/v1.1.2/oscal_catalog_schema.json.

However, Python's re module, for example, does not support \p{L} and \p{N} directly.

Error: schemafile was not valid: '^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$' is not a 'regex'
SchemaError: '^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$' is not a 'regex'
Failed validating 'format' in metaschema['properties']['definitions']['additionalProperties']['properties']['pattern']:
    {'type': 'string', 'format': 'regex'}
On schema['definitions']['TokenDatatype']['pattern']:
    '^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$'
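
For context, a minimal reproduction of the underlying limitation (my own sketch, not part of the validator output; assumes the third-party regex package is installed):

    import re

    import regex  # third-party: pip install regex

    PATTERN = r"^(\p{L}|_)(\p{L}|\p{N}|[.\-_])*$"

    # The stdlib re module rejects Unicode property escapes outright:
    try:
        re.compile(PATTERN)
    except re.error as err:
        print("re:", err)  # bad escape \p at position 2

    # The third-party regex module supports \p{L}/\p{N}, so the same pattern compiles:
    print(bool(regex.match(PATTERN, "Größe")))  # True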

Also, all other patterns in the same file use [a-zA-Z] and [0-9] instead of \p{L} and \p{N}. That's why I'm opening this issue here and not at https://github.com/python-jsonschema/check-jsonschema.

Goals:

"pattern": "^(\\p{L}|_)(\\p{L}|\\p{N}|[.\\-_])*$"

I suggest replacing the above line with

"pattern": "^([a-zA-Z_])([a-zA-Z0-9.\\-_])*$"

This way it's more consistent with the other patterns, and more regex validators support it.
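
For illustration, a quick check of mine: the proposed pattern avoids Unicode property escapes entirely, so Python's stdlib re compiles it without complaint:

    import re

    # The ASCII-only replacement compiles under the stdlib re module,
    # so validators built on it (like check-jsonschema) can accept the schema.
    proposed = re.compile(r"^([a-zA-Z_])([a-zA-Z0-9.\-_])*$")
    print(bool(proposed.match("my-token_1")))  # True
    print(bool(proposed.match("1token")))      # False: may not start with a digit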

Dependencies:

I'm not sure about the dependencies; this is my first issue here.

Acceptance Criteria:

  • All website and readme documentation affected by the changes in this issue have been updated. Changes to the website can be made in the docs/content directory of your branch.
  • A Pull Request (PR) is submitted that fully addresses the goals of this User Story. This issue is referenced in the PR.
  • The CI-CD build process runs without any reported errors on the PR. This can be confirmed by reviewing that all checks have passed in the PR.
Rojax added the enhancement (New feature or request) label on Oct 11, 2024
@wendellpiez (Collaborator) commented:

IIRC, using the Unicode character categories here (\p{L} and \p{N}) was deliberate, inasmuch as we wanted tokens (which are sometimes user-facing) to support all Unicode 'letters' and 'numbers', not just those matching [A-Za-z0-9]+ ('lower ASCII'). Otherwise, tokens are as tight as we thought we could make them, to align with the XML 'name' construct (a more restricted value space than keys in JSON).

This being the status quo, the main problem with the proposal as given is that it breaks backward compatibility for any data sets that already have tokens with special characters (which are of course not 'special' to their users). A secondary problem is that such characters couldn't be used in the future. Depending on your requirements and planned uses for your data (anything declared as a token), this may or may not be a real problem.
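
To make the compatibility concern concrete, a small sketch (mine, not a definitive test; the third-party regex package stands in for an engine that supports \p{..}):

    import re

    import regex  # third-party; the stdlib re cannot compile \p{..}

    current = regex.compile(r"^(\p{L}|_)(\p{L}|\p{N}|[.\-_])*$")
    proposed = re.compile(r"^([a-zA-Z_])([a-zA-Z0-9.\-_])*$")

    # Tokens that are valid today but would be rejected by the ASCII-only pattern:
    for token in ["control-1", "contrôle-1", "管制-1"]:
        print(token, bool(current.match(token)), bool(proposed.match(token)))
    # control-1   True  True
    # contrôle-1  True  False
    # 管制-1      True  False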

This leads me to ask: what would an actual equivalent be, one that captures all the Unicode blocks matched by \p{L} and \p{N}, and that more libraries (or any preferred library) would support?

This would be very useful information even if you are just patching a schema. Whether the released schemas can be altered (compatibly) depends on whether such an equivalent exists.

@RS-Credentive, IIRC you had some info bearing on this?

Note also: you could make this change in a local schema variant and you would only face problems receiving tokens using accented characters or characters in many/most writing systems....

@Rojax (Author) commented Oct 17, 2024

Thanks for the insights, much appreciated!

> Note also: you could make this change in a local schema variant and you would only face problems receiving tokens using accented characters or characters in many/most writing systems....

Thanks for the hint. I already did this but opened this issue to save others the trouble.

However, I don't think it's feasible to support \p{L} and \p{N} simply by replacing them with custom character ranges. It might therefore be a better option to use another regex engine such as https://pypi.org/project/regex/, which supports \p{L} and \p{N}. After researching further with the insights from your reply, I also found a related issue: python-jsonschema/check-jsonschema#353.

@RS-Credentive (Contributor) commented:

@wendellpiez, thanks for tagging me on this. It was indeed a challenge for me to handle \p{L} in Python. I discovered a library called "elementpath", which the "xmlschema" package on PyPI depends on.

I can process the patterns in Python like this:

    import elementpath

    # 'datatype' is an xmlschema datatype object carrying the XSD-defined patterns
    xml_pattern = datatype.patterns.regexps[0]
    pcre_pattern = elementpath.regex.translate_pattern(xml_pattern)
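
A self-contained variant (my sketch; assumes a recent elementpath release where translate_pattern is exposed under elementpath.regex):

    import re

    import elementpath.regex  # pip install elementpath

    # XSD-flavoured token pattern, without anchors, since XSD patterns are
    # implicitly anchored; re.fullmatch supplies the anchoring below.
    xsd_pattern = r"(\p{L}|_)(\p{L}|\p{N}|[.\-_])*"

    # translate_pattern() expands \p{L}/\p{N} into explicit character
    # ranges that the stdlib re module can compile.
    py_pattern = elementpath.regex.translate_pattern(xsd_pattern)

    print(bool(re.fullmatch(py_pattern, "contrôle-1")))  # True
    print(bool(re.fullmatch(py_pattern, "1token")))      # False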

The equivalent of \p{L} is approximately (may be garbled due to cut and paste):

A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮٯٱ-ۓەۥۦۮۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࢽऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএঐও-নপ-রলশ-হঽৎড়ঢ়য়-ৡৰৱৼਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલળવ-હઽૐૠૡૹଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହଽଡ଼ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠೡೱೲഅ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาำเ-ๆກຂຄຆ-ຊຌ-ຣລວ-ະາຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢄᢇ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々〆〱-〵〻〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆺㇰ-ㇿ㐀-䶵一-鿯ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪꘫꙀ-ꙮꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-Ᶎꟷ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭧꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּסּףּפּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼ

I think that Python specifically may include a lot of the XML characters in the pattern [A-Za-z], but I'm not sure that is true of every language, or that every character in the list above would be matched by [A-Za-z] in Python. Regardless, the regex [A-Za-z0-9] would not include international character sets when used in XML, as Wendell says.
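
A quick check suggests Python's [A-Za-z] stays strictly ASCII even in (default) Unicode mode:

    import re

    # 'ô' is a Unicode letter (\p{L}) but falls outside the ASCII class [A-Za-z].
    print(bool(re.match(r"[A-Za-z]", "ô")))  # False
    print("ô".isalpha())                      # True: Python still treats it as a letter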

In other languages, YMMV.

@wendellpiez (Collaborator) commented:

@Rojax you are quite welcome, thanks again for posting.

"Hints", of course, are not only for you ... trying to spell it all out for the record and other readers also, who knows? 🤔

@RS-Credentive this is very helpful indeed, thanks to you as well.
