The xclass caseless repetition problem #469
Assuming that by "META phase" you refer to the phase discussed above: with that, the pattern provided will be "rewritten" as "[\pC\x{17b}-\x{17c}]" before "compiling" into opcodes. Obviously, to avoid duplicating work, ALL case expansion for classes would need to be done at that phase, and not while "compiling", which treats the content of classes like a blob.
Also curious: where is this duplication problematic, and is it worth doing any optimization to prevent it (especially if it happens after compilation)? For example, wouldn't the following also be technically more "efficient" after rewriting?
I expect Zoltan is thinking that duplication is wasted effort at match time. All the handling of case-independence is done at compile time so that the match phase doesn't have to keep track of where it does and does not apply. (An earlier PCRE did, but pulling it out to compile time simplified things and must have benefited performance.) Because wide characters (and ranges) are just listed in an XCLASS, the compiler adds the other case to the list, but as has been pointed out, there is no duplication checking. Your suggestion above of a different kind of optimization is also legitimate. However, how far should we go with such things? Who is actually going to write [\p{Cn}\pC]? If we are thinking of programs generating patterns, then it really should be the generating program that does its own optimization, IMHO.
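One way to avoid the duplicates entirely would be a single sort-and-merge pass over the expanded ranges at compile time. The sketch below is illustrative only; `range_t` and `merge_ranges` are invented names, not PCRE2 internals:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t start, end; } range_t;  /* inclusive bounds */

static int cmp_range(const void *a, const void *b)
{
  const range_t *ra = a, *rb = b;
  return (ra->start > rb->start) - (ra->start < rb->start);
}

/* Sort the ranges and coalesce overlapping or adjacent ones in place,
   returning the new count, so that case expansion can never leave a
   duplicate entry behind. (Code points stay well below UINT32_MAX, so
   r[out].end + 1 cannot overflow.) */
static size_t merge_ranges(range_t *r, size_t n)
{
  if (n == 0) return 0;
  qsort(r, n, sizeof(*r), cmp_range);
  size_t out = 0;
  for (size_t i = 1; i < n; i++) {
    if (r[i].start <= r[out].end + 1) {           /* overlaps or touches */
      if (r[i].end > r[out].end) r[out].end = r[i].end;
    } else {
      r[++out] = r[i];
    }
  }
  return out + 1;
}
```

For instance, caseless expansion of [\x{17b}-\x{17c}] re-adds both case partners, giving the list {017B-017C, 017C, 017B}; a merge pass collapses this back to the single range 017B-017C.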
I would like to optimize actual ranges only, not properties. The ranges should be sorted, so we can use logarithmic search, which is much faster when there are many ranges.
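A minimal sketch of that logarithmic lookup over a sorted range list (the types and names here are illustrative, not PCRE2 internals):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t start, end; } range_t;  /* inclusive bounds */

/* Return 1 if c falls inside any range. The list must be sorted by start
   and non-overlapping, which makes the lookup O(log n) instead of the
   linear scan over XCLASS items. */
static int range_member(const range_t *r, size_t n, uint32_t c)
{
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (c < r[mid].start)      hi = mid;        /* c is left of this range */
    else if (c > r[mid].end)   lo = mid + 1;    /* c is right of it */
    else return 1;                              /* c is inside it */
  }
  return 0;
}
```

With e.g. the three sorted ranges {0041-005A, 0061-007A, 017B-017C}, membership of any character costs at most two comparisons per bisection step.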
Since character sizes are not fixed in UTF, the whole binary search should be implemented as byte code. The structure could be something like this: the range list starts with a pointer, followed by the operation types. Since we have only two operations, perhaps we could encode them without an operation byte in some way. LESS_THAN and GREATER_OR_EQUAL_THAN probably come in pairs, and could be encoded that way.
Btw, the 0..MAX_UTF range cannot be encoded in the structure above, since MAX_UTF+1 cannot be encoded in UTF-16. But this is the same as matching ALL characters. If we have at least one range, 'c' is always <= MAX_UTF. Maybe for a few ranges (<= 2) we could simply use the current method.
There is progress on this; the issue can be closed.
I was talking about the binary search problem for xclass ranges: it should be relatively simple (or the binary search will not be efficient), but at the same time it should not increase the code size too much. I have the following proposal for the problem. There should be four range lists:
Each range list contains characters and ranges in increasing order. 16-bit character matching can only use the A and B lists, while Unicode matching can only use the A, B and C lists. The D list is only used in 32-bit, non-UTF mode.

The character values are shifted left by 1, and the lowest bit is cleared for range starts. The binary search then searches this encoded list. When a range intersects with multiple range lists, a range is created for it in each of those lists. The worst case is that a single range is present in the A, B, C and D lists as well; that should be a rare case.

All range lists contain characters whose size is less than or equal to the size of the Unicode representation of that character. For a large number of characters/ranges, this should consume less space than the current implementation. For a low number of characters/ranges (<= 8), we could keep the current code.

We need some header for the range lists; I haven't decided on it yet. Probably the first character represents the number of elements in the list (e.g. 16 bits is enough for 16-bit range lists, since 32K items is the maximum amount). The data for the range lists must be naturally aligned. Let me know if you have any suggestions.
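The shifted-left-by-1 encoding above can be sketched like this. Under my reading of the proposal, a range [s, e] contributes the two entries s << 1 and (e << 1) | 1, and a single character x contributes (x << 1) | 1; the membership test, names and types below are my assumptions, not PCRE2 code:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical boundary list: range starts have bit 0 clear, range ends
   and single characters have bit 0 set, entries sorted ascending.
   Assumes c fits in 31 bits, which holds for Unicode code points. */
static int xclass_member(const uint32_t *list, size_t n, uint32_t c)
{
  uint32_t key = (c << 1) | 1;
  size_t lo = 0, hi = n;

  /* Bisect: find how many entries are <= key. */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (list[mid] <= key) lo = mid + 1; else hi = mid;
  }
  if (lo == 0) return 0;                /* c is below every boundary */

  uint32_t v = list[lo - 1];
  if ((v & 1) == 0) return 1;           /* last boundary is a range start:
                                           c sits inside an open range */
  return v == key;                      /* exact hit on an end or a
                                           single character */
}
```

For the class [\x{5}-\x{9}\x{c}] the encoded list would be {10, 19, 25}: characters 7, 9 and 12 are reported as members, while 10 is not.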
Genius! It's self-synchronising, so you can bisect it and recover whether you have a range or a single character. How important is it to use tricks like this to shave off bytes? It seems OK, but not super-necessary. We can encode it like this: #define XCL_NOT 0x01 /* Flag: this is a negative class */
#define XCL_MAP 0x02 /* Flag: a 32-byte map is present */
#define XCL_HASPROP 0x04 /* Flag: property checks are present. */
#define XCL_HASNOTPROP 0x08 /* Flag: not property checks are present. */
#define XCL_HASRANGE_A 0x10 /* Flag: Zoltan's "range A". */
#define XCL_HASRANGE_D 0x80 /* Flag: Zoltan's "range D". */
get rid of XCL_END and the others...
The sections don't need any tags to indicate what they are; we just always include them in the same order, based on whether the corresponding flag says the section is present.
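A reader for that fixed-order, flag-driven layout might look like the sketch below. The one-byte flags field, the 32-byte map size and the flag names follow the proposal above; everything else (the function, the exact layout) is invented for illustration:

```c
#include <stdint.h>
#include <stddef.h>

#define XCL_MAP        0x02  /* Flag: a 32-byte map is present */
#define XCL_HASRANGE_A 0x10  /* Flag: range list A is present */

/* Hypothetical reader: sections carry no tag bytes, so the cursor simply
   skips each earlier optional section whose flag is set, in the fixed
   order, until it reaches the section it wants. */
static const uint8_t *find_range_a(const uint8_t *data)
{
  uint8_t flags = *data++;
  if (flags & XCL_MAP) data += 32;              /* skip the bitmap */
  if (!(flags & XCL_HASRANGE_A)) return NULL;   /* section absent */
  return data;                                  /* A starts right here */
}
```

The same skip-by-flag pattern extends to the B, C and D lists, once each list's element count is known from its header.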
It would be great to do something about caseless repetitions in xclass. Example:
The question is where. During compilation, in the META phase, the buffer might be too small to hold the proper ranges. In the byte-code generation phase, computing them twice would be costly. Can we reallocate the buffer in the META phase and do all caseless checks there? Maybe do all checks there, including the class-to-single-character optimization.