Why not just escape every character? #15

Closed
domenic opened this issue Jun 16, 2015 · 7 comments

Comments

@domenic
Member

domenic commented Jun 16, 2015

Is there any reason to only escape a specific subset? It's harmless to add slashes, right?

@benjamingr
Collaborator

It makes the resulting string longer; other than that, it's harmless.

Some programming languages do this (Python escapes all non-alphanumeric characters), whereas others escape only a strict set of characters (like C#).
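To make the trade-off concrete, here is a minimal sketch in Python of the two strategies. The function names and the fixed metacharacter set are assumptions made for illustration; they are not the API of any library and not the set proposed in this repository.

```python
import re

# Hypothetical fixed set of metacharacters, chosen only for this illustration.
SYNTAX_CHARS = set(r"\^$.*+?()[]{}|")


def escape_non_alnum(text: str) -> str:
    """Backslash-escape everything except ASCII letters, digits and '_'."""
    return "".join(
        ch if (ch.isascii() and ch.isalnum()) or ch == "_" else "\\" + ch
        for ch in text
    )


def escape_fixed_set(text: str) -> str:
    """Backslash-escape only the characters listed in SYNTAX_CHARS."""
    return "".join("\\" + ch if ch in SYNTAX_CHARS else ch for ch in text)


sample = "a+b = c (maybe)"
broad = escape_non_alnum(sample)
narrow = escape_fixed_set(sample)

# Both escaped forms match the original text literally.
assert re.fullmatch(broad, sample)
assert re.fullmatch(narrow, sample)

# The broad strategy only costs extra length.
print(len(sample), len(narrow), len(broad))
```

In this sample the only observable difference is length: the broad strategy also escapes the spaces and the `=` sign.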

@domenic
Member Author

domenic commented Jun 16, 2015

Might be worth mentioning this as a design alternative in the readme, with the pro that it's more future-proof.

@benjamingr
Collaborator

Good idea, I'll add that when I'm in front of a computer :) (you're welcome to if you'd like of course).

@benjamingr
Collaborator

Updated the README. I'll leave this open for a week to see if anyone has any further input on it.

@benjamingr
Collaborator

Following the research in https://github.com/benjamingr/RegExp.escape/blob/master/data/other_languages/discussions.md, it appears that other languages that used to escape every character have either made exceptions (like Python) or changed their behavior (like Perl). The discussion notes contain links to posts explaining why the changes were made.

@mjpieters

Python's new regex engine (under development) gives you a choice: escape either all non-alphanumerics or only metacharacters (and NUL). See https://bitbucket.org/mrabarnett/mrab-regex/src/6193ea4246da272cf18a190c46aa116737067780/regex_3/Python/regex.py?at=default#cl-342

In your discussion you mentioned a problem with wide characters. You ran into the limitations of Python's re module with UCS-2 vs. UCS-4 builds (all Python versions up to 3.2 use one or the other, based on a compile-time switch). The regular expression engine does not handle code points but code *units*, which in a UCS-2 build means two per non-BMP character. The escaping is correct for their respective builds.
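The code point vs. code unit distinction can be illustrated on a modern Python (3.3 or later), where strings are indexed by code point. This is only a sketch: encoding to UTF-16 stands in for what a UCS-2 build would have stored, and the chosen character is an arbitrary non-BMP example.

```python
import struct

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"

# On Python 3.3+ this is a single code point.
code_points = len(clef)

# UTF-16 splits it into a surrogate pair, i.e. two 16-bit code units.
raw = clef.encode("utf-16-le")
code_units = struct.unpack("<%dH" % (len(raw) // 2), raw)

print(code_points)                    # number of code points
print([hex(u) for u in code_units])   # the individual UTF-16 code units
```

An escaper that walks code units would therefore emit two escapes for this character, while one that walks code points would emit one.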

@benjamingr
Collaborator

I think we're good with not escaping every character. I want to focus on the discussion about big set vs readable set.
