
re

contents

related file

  • cpython/Lib/re.py
  • cpython/Lib/sre_compile.py
  • cpython/Lib/sre_constants.py
  • cpython/Lib/sre_parse.py
  • cpython/Modules/sre.h
  • cpython/Modules/sre_constants.h
  • cpython/Modules/sre_lib.h
  • cpython/Modules/_sre.c
  • cpython/Modules/clinic/_sre.c.h

how regex works

what happens when you execute the following code?

import re
re.search(r"abc\d+abc", "xxabc123abcd")  # pass re.DEBUG as the third argument to print debug output for the pattern

the call stack

(image: call_stack)

the overview of different phases of re.search

(image: overview)

let's see what's going on in each step

parse

the core code in parse phase is in cpython/Lib/sre_parse.py

there is a class named Tokenizer that splits your regex text into tokens, and the _parse function is in charge of mapping each token to the corresponding sre_constants code

# pseudo code of the parse loop
tokenizer = Tokenizer(pattern, flags)
while True:
    next_token = tokenizer.get_next_token()  # the real method is Tokenizer.get()
    if next_token is None:                   # None means the pattern is exhausted
        break
    parse(next_token)                        # map the token onto sre_constants codes
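
you can poke at the tokenizer directly; a small sketch, using the private sre_parse module of Python 3.8 (renamed to re._parser in 3.11+):

import sre_parse

tok = sre_parse.Tokenizer(r"abc\d+abc")
while True:
    token = tok.get()      # next token, or None when the pattern is exhausted
    if token is None:
        break
    print(token)
# prints a, b, c, \d, +, a, b, c (one per line; the escape "\d" stays a single token)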

let's run the parse phase with regex pattern abc\d+abc

next_token: a
remaining pattern: bc\d+abc
parse result: [(LITERAL, 97)]

next_token: b
remaining pattern: c\d+abc
parse result: [(LITERAL, 97), (LITERAL, 98)]

next_token: c
remaining pattern: \d+abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]

next_token: \d
remaining pattern: +abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (IN, [(CATEGORY, CATEGORY_DIGIT)])]

now the next_token is '+'; the parse function pops the last item off the parse result and puts it back wrapped in the repeat opcode

item = parse_result[-1:]                           # the last parsed item, as a one-element list
min, max = 1, MAXREPEAT                            # '+' means at least one repetition, no upper bound
parse_result[-1] = (MAX_REPEAT, (min, max, item))  # wrap it in a MAX_REPEAT node

next_token: +
remaining pattern: abc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])]))]

next_token: a
remaining pattern: bc
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97)]

next_token: b
remaining pattern: c
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98)]

next_token: c
remaining pattern: None
parse result: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]
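
the whole trace above can be reproduced in one call; a quick sketch, again with Python 3.8's private sre_parse module:

import sre_parse

p = sre_parse.parse(r"abc\d+abc")   # returns a SubPattern object
print(p.data)                       # the same (opcode, argument) list as above
p.dump()                            # pretty-printed tree of the parse result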

compile

this is the input of the compile phase

[(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]

the compile function in sre_compile.py

def _code(p, flags):

    flags = p.state.flags | flags
    code = []

    # compile info block
    _compile_info(code, p, flags)

    # compile the pattern
    _compile(code, p.data, flags)

    code.append(SUCCESS)

    return code

the info block has the following structure: [INFO, skip, mask_indicating_whether_there_is_a_prefix, min_length_of_match, max_length_of_match, prefix_length, prefix_skip, prefix1, prefix2, ..., prefixn, overlap_table_for_prefix]

after the _compile_info function, code becomes

code == [INFO, 12, 1, 7, MAXREPEAT, 3, 3, 97, 98, 99, 0, 0, 0]

12 means that the rest of the info block (everything after the INFO opcode itself) is 12 code words long; the engine uses this value to skip over the info block and reach the pattern code

1 is the mask value; a value of 1 indicates that the current pattern has a literal prefix

7 is the minimum length of a match for the pattern "abc\d+abc", e.g. "abc1abc" has length 7

MAXREPEAT is the maximum length of a match for the pattern "abc\d+abc"; the constant MAXREPEAT stands in for +∞ here (its actual value is finite)

the first 3 is the length of the literal prefix ("abc", which has length 3)

the second 3 is prefix_skip, the number of pattern elements already covered by the prefix, so the matcher can skip them once the prefix is found; here it happens to equal the prefix length

the following 97, 98, 99 are the character codes of the prefix "abc"

and the trailing 0, 0, 0 is the overlap table generated by _generate_overlap_table; it records how much the prefix overlaps with itself at each position ("abc" has no self-overlap, so every entry is 0)
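
a prefix that does overlap with itself makes the table easier to see; a small sketch calling the private helper directly (Python 3.8's sre_compile, renamed to re._compiler in 3.11+):

import sre_compile

print(sre_compile._generate_overlap_table("abc"))   # [0, 0, 0]     no self-overlap
print(sre_compile._generate_overlap_table("abab"))  # [0, 0, 1, 2]  "ab" overlaps the start
print(sre_compile._generate_overlap_table("aaaa"))  # [0, 1, 2, 3]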

after compiling the info block, the pattern itself is compiled by _compile(code, p.data, flags)

for op, av in pattern:
    compile_and_fill_code_according_to_op_and_av()

# pattern here is: [(LITERAL, 97), (LITERAL, 98), (LITERAL, 99), (MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])), (LITERAL, 97), (LITERAL, 98), (LITERAL, 99)]

op: LITERAL av: 97
code: [LITERAL, 97]

op: LITERAL av: 98
code: [LITERAL, 97, LITERAL, 98]

op: LITERAL av: 99
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99]

op: MAX_REPEAT av: (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_DIGIT)])])

the handling of MAX_REPEAT is a little different from the other opcodes; a simplified version of the code handling this case is shown below

if op is MAX_REPEAT:
    emit(REPEAT_ONE)
else:
    emit(MIN_REPEAT_ONE)
skip = _len(code); emit(0)
emit(av[0])
emit(av[1])
# position 1
_compile(code, av[2], flags) # recursive call
emit(SUCCESS)
code[skip] = _len(code) - skip # change the length
# position 2

the code list at position 1

code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 0, 1, MAXREPEAT]

after the recursive call and the trailing SUCCESS, at position 2 the placeholder at index skip is filled in with _len(code) - skip, which is now 9

op: IN av: [(CATEGORY, CATEGORY_DIGIT)]
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS]

op: LITERAL av: 97
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97]

op: LITERAL av: 98
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98]

op: LITERAL av: 99
code: [LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98, LITERAL, 99, SUCCESS]

combining _compile_info and _compile (plus the trailing SUCCESS appended by _code), the code passed to the next phase is

[INFO, 12, 1, 7, MAXREPEAT, 3, 3, 97, 98, 99, 0, 0, 0, LITERAL, 97, LITERAL, 98, LITERAL, 99, REPEAT_ONE, 9, 1, MAXREPEAT, IN, 4, CATEGORY, CATEGORY_UNI_DIGIT, FAILURE, SUCCESS, LITERAL, 97, LITERAL, 98, LITERAL, 99, SUCCESS]
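
the same list can be dumped from the interpreter; a sketch using Python 3.8's private helpers (the opcodes print as plain ints rather than the constant names shown above):

import sre_compile, sre_parse

p = sre_parse.parse(r"abc\d+abc")
code = sre_compile._code(p, 0)   # info block + pattern code + SUCCESS, as a flat list of ints
print(len(code), code)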

match

the match phase is written in C

it's a big for loop that acts as a state machine: the compiled opcodes are dispatched one by one by this loop, which examines the subject string according to each opcode; a mismatch in any state jumps to the failure label

if everything succeeds it keeps going; there will be another article about the match phase (if I have time)

for (;;) {
    switch (pattern[i]) {
        case SRE_OP_MARK:
            /* ... */
        case SRE_OP_LITERAL:
            /* ... */
        case SRE_OP_REPEAT_ONE:
            /* ... */
        /* ... */
    }
}

code detail

sre_ucs1_search

follow the call stack to here

status = sre_search(&state, PatternObject_GetCode(self));

the search delegates to one of three functions according to the character size of the unicode object (for details of how unicode objects are stored in memory, please refer to how unicode is implemented); the definition of sre_search is

LOCAL(Py_ssize_t)
sre_search(SRE_STATE* state, SRE_CODE* pattern)
{
    if (state->charsize == 1)
        return sre_ucs1_search(state, pattern);
    if (state->charsize == 2)
        return sre_ucs2_search(state, pattern);
    assert(state->charsize == 4);
    return sre_ucs4_search(state, pattern);
}
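
from pure Python you can't observe which branch is taken, but the branch is decided by the subject string's storage kind (PEP 393): the widest code point in the string determines charsize; a small illustration (the subject strings are made up for this example, the comments mark which expansion handles each call):

import re

pat = re.compile(r"\d+")
pat.search("abc123")            # all code points < 256      -> charsize 1, sre_ucs1_search
pat.search("abc123\u0394")      # Δ is a BMP code point      -> charsize 2, sre_ucs2_search
pat.search("abc123\U0001F600")  # emoji is outside the BMP   -> charsize 4, sre_ucs4_search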

if we grep for sre_ucs1_search in the source tree, we can't find its definition anywhere (only the call site)

find . -name '*' -exec grep -nHr "sre_ucs1_search" {} \;
Binary file ./libpython3.8m.a matches
./Modules/_sre.c:572:        return sre_ucs1_search(state, pattern);

the trick here is to reuse the common code in sre_lib.h by including it several times with different macro definitions

/* generate 8-bit version */

#define SRE_CHAR Py_UCS1
#define SIZEOF_SRE_CHAR 1
#define SRE(F) sre_ucs1_##F
#include "sre_lib.h"

/* generate 16-bit unicode version */

#define SRE_CHAR Py_UCS2
#define SIZEOF_SRE_CHAR 2
#define SRE(F) sre_ucs2_##F
#include "sre_lib.h"

/* generate 32-bit unicode version */

#define SRE_CHAR Py_UCS4
#define SIZEOF_SRE_CHAR 4
#define SRE(F) sre_ucs4_##F
#include "sre_lib.h"

we can find the definition of search in sre_lib.h

LOCAL(Py_ssize_t)
SRE(search)(SRE_STATE* state, SRE_CODE* pattern)

which will be expanded into three different forms

inline Py_ssize_t sre_ucs1_search(SRE_STATE* state, SRE_CODE* pattern)
inline Py_ssize_t sre_ucs2_search(SRE_STATE* state, SRE_CODE* pattern)
inline Py_ssize_t sre_ucs4_search(SRE_STATE* state, SRE_CODE* pattern)

stepping into the expanded sre_ucs1_search, we find the match phase described above

fifo cache

when you call re.compile, the compiled pattern is stored in a FIFO cache at the Python level

_cache is a plain dictionary, and because dict now preserves insertion order (see the dict implementation), it can act as a FIFO queue

>>> r1 = re.compile("a\d+b")
>>> id(r1)
4409698992
>>> r2 = re.compile("b\d+b")
>>> id(r2)
4409699184
>>> r3 = re.compile("a\d+b")
>>> id(r3)
4409698992 # same as r1
>>> r4 = re.compile("b\d+b")
>>> id(r4)
4409699184 # same as r2

the cache mechanism is written in pure Python

_cache = {}  # ordered!

_MAXCACHE = 512
def _compile(pattern, flags):
    # omit
    try:
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    # omit
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest item
            try:
                del _cache[next(iter(_cache))]
            except (StopIteration, RuntimeError, KeyError):
                pass
        _cache[type(pattern), pattern, flags] = p
    return p
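
a small sketch of the eviction behaviour (hypothetical pattern strings): compiling more than _MAXCACHE distinct patterns pushes the oldest entry out, so recompiling it yields a new object

import re

old = re.compile("pattern_0")
for i in range(1, 600):     # 599 more distinct patterns, more than _MAXCACHE (512)
    re.compile(f"pattern_{i}")
new = re.compile("pattern_0")
print(old is new)           # False: "pattern_0" was evicted from _cache and recompiled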

read more