Skip to content

nasciiboy/RecursiveRegexpRaptor-4

Repository files navigation

Recursive Regexp Raptor (regexp4)

lang: es

regexp3 (C-lang, Go-lang) and regexp4 (C-lang, Go-lang)

raptor-book (draft (spanish)) : here

benchmarks ==> here

Characteristics

  • Easy to use.
  • No error checking.
  • only regexp
  • The most compact and clear code in a human regexp library.
  • Zero dependencies. Neither the standard C library is present PURE C.
  • No explicit dynamic memory management. No malloc, calloc, free, …
  • Count matches
  • Catchs
  • Replacement catch
  • Placement of specific catches within an array
  • Backreferences
  • Support UTF8

Introduction

Recurseve Regexp Raptor is a library search, capture and replacement Regular expressions written in C language from zero, trying to achieve what following:

  • Having most of the features present in any other regexp library.
  • Elegant Code: simple, clear and endowed with grace.
  • Avoid explicit request dynamic memory.
  • Avoid using any external libraries, including the standard library.
  • Be a useful learning material.

There are two parallel developments of this library the first (regexp3) focuses on simplicity and code, the second (this) still in beta seeks achieve the maximum speed possible implementing a “table of instructions.” In both cases the algorithm is from scratch, and only use C, enjoy!

Motivation

C does not have a standard library of regular expressions, although there are several implementations, such as pcre, the regexp.h library of the GNU project, regexp of the Plan 9 operating system, and some other more, the author of this work (which is a little bit retard) found in all code farfetched and mystical divided into several files full of macros, scripts low and cryptic variables. Unable to understand anything and after a retreat to the island of onanista meditacion the author intended to make your own library with casinos and Japanese schoolgirls.

Development and Testing

Has been used GNU Emacs (the only true operating system), gcc (6.3.1) & clang compiler (LLVM) 3.8.1, konsole and fish, running in Freidora 25.

There are two tests for the library, the first ascii test battery is used in the file ascii_test.c.

to test the ascii library

gcc ascii_test.c regexp4_ascii.c

to the ut8 vercion

gcc ascii_test.c regexp4_utf8.c

the second battery of tests is exclusive of regexp4_utf8.c

gcc utf8_test.c regexp4_utf8.c

in either case run with

./a.out

Use

To include Recursive Regexp Raptor in their code, place the files regexp4.h, charUtils.h and regexp4_ascii.c or regexp4_utf8.c inside the folder of your draft. You must include the header

#include "regexp4.h"

and finally compile well with

gcc myProyect.c regexp4_ascii.c

or

gcc myProyect.c regexp4_utf8.c

obviously compile with optimization provides a significant decline, runtime, try -O3

The regexp4() function

This the only search function, its prototype is:

int regexp4( const char *txt, const char *re );
txt
pointer to string on which to perform the search, must end with the sign of termination ‘\0’.
re
pointer to string containing the regular expression search, You must end with the sign of termination ‘\0’.

The function returns the number of matches 0 (none) o n matches.

The standard syntax for regular expressions using the character ’\’, unfortunately this sign goes into “conflict” with the syntax of C, by this and trying to keep simple the code, has opted for a alternate syntax detailed below

Syntax

  • Text search in any location:
    regexp4( "Raptor Test", "Raptor" );
        
  • Multiple search options “exp1|exp2”
    regexp4( "Raptor Test", "Dinosaur|T Rex|Raptor|Triceratops" );
        
  • Matches any character ‘.’
    regexp4( "Raptor Test", "R.ptor" );
        
  • Zero or one coincidences ‘?’
    regexp4( "Raptor Test", "Ra?ptor" );
        
  • One or more coincidences ‘+’
    regexp4( "Raaaptor Test", "Ra+ptor" );
        
  • Zero or more coincidences ‘*’
    regexp4( "Raaaptor Test", "Ra*ptor" );
        
  • Range of coincidences “{n1,n2}”
    regexp4( "Raaaptor Test", "Ra{0,100}ptor" );
        
  • Number of specific matches ‘{n1}’
    regexp4( "Raptor Test", "Ra{1}ptor" );
        
  • Minimum Number of matches ‘{n1,}’
    regexp4( "Raaaptor Test", "Ra{2,}ptor" );
        
  • Sets.
    • Character Set “[abc]”
      regexp4( "Raptor Test", "R[uoiea]ptor" );
              
    • Range within a set of characters “[a-b]”
      regexp4( "Raptor Test", "R[a-z]ptor" );
              
    • Metacaracter within a set of characters “[:meta]”
      regexp4( "Raptor Test", "R[:w]ptor" );
              
    • Investment character set “[^abc]”
      regexp4( "Raptor Test", "R[^uoie]ptor" );
              
  • UTF8 characters
    regexp4( "R△ptor Test", "R△ptor" );
        

    also

    regexp4( "R△ptor Test", "R[△]ptor" );
        
  • Coinciding with a character that is a letter “:a”
    regexp4( "RAptor Test", "R:aptor" );
        
  • Coinciding with a character that is not a letter “:A”
    regexp4( "R△ptor Test", "R:Aptor" );
        
  • Coinciding with a character that is a number “:d”
    regexp4( "R4ptor Test", "R:dptor" );
        
  • Coinciding with a character other than a number “:D”
    regexp4( "Raptor Test", "R:Dptor" );
        
  • Coinciding with an alphanumeric character “:w”
    regexp4( "Raptor Test", "R:wptor" );
        
  • Coinciding with a non-alphanumeric character “:W”
    regexp4( "R△ptor Test", "R:Wptor" );
        
  • Coinciding with a character that is a space “:s”
    regexp4( "R ptor Test", "R:sptor" );
        
  • Coinciding with a character other than a space “:S”
    regexp4( "Raptor Test", "R:Sptor" );
        
  • Coincidence with utf8 character “:&”
    regexp4( "R△ptor Test", "R:&ptor" );
        
  • Escape character with special meaning “:character”

    the characters ‘|’, ‘(‘, ‘)’, ‘<’, ‘>’, ‘[‘, ‘]’, ‘?’, ‘+’, ‘*’, ‘{‘, ‘}’, ‘-‘, ‘#’ and ‘@’ as a especial characters, placing one of these characters as is, regardless one correct syntax within the exprecion, can generate infinite loops and other errors.

    regexp4( ":#()|<>", ":::#:(:):|:<:>" );
        

    The special characters (except the metacharacter) lose their meaning within a set

    regexp4( "()<>[]|{}*#@?+", "[()<>:[:]|{}*?+#@]" );
        
  • Grouping “(exp)”
    regexp4( "Raptor Test", "(Raptor)" );
        
  • Grouping with capture “<exp>”
    regexp4( "Raptor Test", "<Raptor>" );
        
  • Backreferences “@id”

    the backreferences need one previously captured expression “<exp>”, then the number of capture is placed, preceded by ‘@’

    regexp4( "ae_ea", "<a><e>_@2@1" )
        
  • Behavior modifiers

    There are two types of modifiers. The first affects globally the exprecion behaviour, the second affects specific sections. In either case, the syntax is the same, the sign ‘#’, followed by modifiers,

    modifiers global reach is placed at the beginning, the whole and are as follows exprecion

    • Search only the beginning ‘#^exp’
      regexp4( "Raptor Test", "#^Raptor" );
              
    • Search only at the end ‘#$exp’
      regexp4( "Raptor Test", "#$Test" );
              
    • Search the beginning and end “#^$exp”
      regexp4( "Raptor Test", "#^$Raptor Test" );
              
    • Stop with the first match “#?exp”
      regexp4( "Raptor Test", "#?Raptor Test" );
              
    • Search for the string, character by character “#~”

      By default, when a exprecion coincides with a region of text search, the search continues from the end of that coincidence to ignore this behavior, making the search always be character by character this switch is used

      regexp4( "aaaaa", "#~a*" );
              

      in this example, without modifying the result it would be a coincidence, however with this switch continuous search immediately after returning character representations of the following five matches.

    • Ignore case sensitive “#*exp”
      regexp4( "Raptor Test", "#*RaPtOr TeSt" );
              

all of the above switches are compatible with each other ie could search

regexp4( "Raptor Test", "#^$*?~RaPtOr TeSt" );

however modifiers ‘~’ and ‘?’ lose sense because the presence of ‘^’ and/or ‘$’.

one exprecion type:

regexp4( "Raptor Test", "#$RaPtOr|#$TeSt" );

is erroneous, the modifier after the ‘|’ section would apply between ‘|’ and ‘#’, ie zero, with a return of wrong

local modifiers are placed after the repeat indicator (if there) and affect the same region affecting indicators repetition, ie characters, sets or groups.

  • Ignore case sensitive “exp#*”
    regexp4( "Raptor Test", "(RaPtOr)#* TeS#*t" );
        
  • Not ignore case sensitive “exp#/”
    regexp4( "RaPtOr TeSt", "#*(RaPtOr)#/ TES#/T" );
        

Captures

Catches are indexed according to the order of appearance in the expression for example:

<   <   >  | <   <   >   >   >
= 1 ==========================
    = 2==    = 2 =========
                 = 3 =

If the exprecion matches more than one occasion in the search text index is increased according to their appearance that is:

<   <   >  | <   >   >   <   <   >  | <   >   >   <   <   >  | <   >   >
= 1 ==================   = 3 ==================   = 5 ==================
    = 2==    = 2==           = 4==    = 4==           = 6==    = 6==
coincidencia uno         coincidencia dos         coincidencia tres

cpytCatch function makes a copy of a catch into an array character, here its prototype:

char * cpyCatch( char * str, const int index )
str
pointer capable of holding the largest capture.
index
index of the grouping (1 to n).

function returns a pointer to the capture terminated ‘\0’. an index incorrect return a pointer that begins in ‘\0’.

to get the number of catches in a search, using totCatch:

int totCatch();

returning a value of 0 a n.

Could use this and the previous function to print all catches with a function like this:

void printCatch(){
  char str[128];
  int i = 0, max = totCatch();

  while( ++i <= max )
    printf( "[%d] >%s<\n", i, cpyCatch( str, i ) );
}

gpsCatch() y lenCatch()

functions gpsCatch() and lenCatch() perform the same work cpyCatch with the variant not use an array, instead the first returns a pointer to the initial position of capture within the text of search and the second returns the length of the capture.

int          lenCatch( const int index );
const char * gpsCatch( const int index );

the above example with these fuciones, would:

void printCatch(){
  int i = 0, max = totCatch();

  while( ++i <= max )
    printf( "[%d] >%.*s<\n", i, lenCatch( i ), gpsCatch( i ) );
}

Place catches in a string

char * putCatch( char * newStr, const char * putStr );

putStr argument contains the text with which to form the new chain as well as indicators which you catch place. To indicate the insertion a coke capture the ‘#’ sign followed the capture index. for example putStr argument could be

char *putStr = "catch 1 >>#1<< catch 2 >>#2<< catch 747 >>#747<<";

newStr is an character array large enough to contain the string + catches. the function returns a pointer to the starting position of this arrangement, which ends with the sign of completion ‘\0’.

to place the character ‘#’ within the escape string ‘#’ with ‘#’ further, ie:

"## Comment" -> "# comment"

Replace a catch

Replacement operates on an array of characters in which is placed the text search modifying a specified catch by a string text, the function in charge of this work is rplCatch, its prototype is:

char * rplCatch( char * newStr, const char * rplStr, const int id );
newStr
character array dimension text is placed dende original on which is carried out and the replacement text of catches.
rplStr
replacement text capture.
id
Capture identifier after the order of appearance within regular exprecion. Spend a wrong index, place a unaltered copy of the search string on the settlement = Newstr =.

in this case the use of the argument id unlike function getCatch does not refer to a “catch” in specific, that is no matter how much of occasions that has captured a exprecion, the identifier indicates the position within the exprecion itself, ie:

   <   <   >  | <   <   >   >   >
id = 1 ==========================
id     = 2==    = 2 =========
id                  = 3 =
capturing position within the exprecion

The amendment affects so

<   <   >  | <   >   >       <   <   >  | <   >   >      <   <   >  | <   >   >
= 1 ==================       = 1 ==================      = 1 ==================
    = 2==    = 2==               = 2==    = 2==              = 2==    = 2==
capture one                  "..." two                   "..." Three

Metacharacters search

:d
digit from 0 to 9.
:D
any character other than a digit from 0 to 9.
:a
any character is a letter (a-z, A-Z)
:A
any character other than a letter
:w
any alphanumeric character.
:W
any non-alphanumeric character.
:s
any blank space character.
:S
any character other than a blank.
:&
Non-ASCII character (in UTF8 version only).
:|
Vertical bar
:^
Caret
:$
Dollar sign
:(
Left parenthesis
:)
Right parenthesis
:<
Greater than
:>
Less than
:[
Left bracket
:]
Right bracket
:.
Point
:?
Interrogacion
:+
More
:-
Less
:*
Asterisk
:{
Left key
:}
Right key
:#
Modifier
::
Colons

additionally use the proper c syntax to place characters new line, tab, …, etc. Similarly you can use the c syntax for “placing” characters in octal, hexadecimal or unicode.

Examples of use

ascii_test.c file contains a wide variety of tests that are useful as examples of use, these include the next:

regexp4( "07-07-1777", "<0?[1-9]|[12][0-9]|3[01]><[/:-\\]><0?[1-9]|1[012]>@2<[12][0-9]{3}>" );

captures a date format string, separately day, stripper, month and year. The separator has to coincider the two occasions that appears

regexp4( "https://en.wikipedia.org/wiki/Regular_expression", "(https?|ftp):://<[^:s/:<:>]+></[^:s:.:<:>,/]+>*<.>*" );

capture something like a web link

regexp4( "<mail>nasciiboy@gmail.com</mail>", "<[_A-Za-z0-9:-]+(:.[_A-Za-z0-9:-]+)*>:@<[A-Za-z0-9]+>:.<[A-Za-z0-9]+><:.[A-Za-z0-9]{2}>*" );

capture sections (user, site, domain) something like an email.

Hacking

algorithm

Flow Diagram

     ┌────┐
     │init│
     └────┘
        │◀───────────────────────────────────┐
        ▼                                    │
 ┌──────────────┐                            │
 │loop in string│                            │
 └──────────────┘                            │
        │                                    │
        ▼                                    │
 ┌─────────────┐  no   ┌─────────────┐       │
<│end of string│>────▶<│search regexp│>──────┘
 └─────────────┘       └─────────────┘ no match
        │ yes                 │ match
        ▼                     ▼
┌────────────────┐     ┌─────────────┐
│report: no match│     │report: match│
└────────────────┘     └─────────────┘
        │                     │
        │◀────────────────────┘
        ▼
      ┌───┐
      │end│
      └───┘

search regexp version one

                                                        ┌──────────────────────────────┐
┏━━━━━━━━━━━━━┓                                         ▼                              │
┃search regexp┃                                  ┌───────────┐                         │
┗━━━━━━━━━━━━━┛                                  │get builder│                         │
                                                 └───────────┘                         │
                                                        │                              │
                                                        ▼                              │
                                                ┌───────────────┐  no  ┌────────────┐  │
                                               <│we have builder│>────▶│finish: the │  │
                                                └───────────────┘      │path matches│  │
                                                        │ yes          └────────────┘  │
                              ┌────────┬─────┬──────────┼────────────┬──────────┐      │
                              ▼        ▼     ▼          ▼            ▼          ▼      │
                        ┌───────────┐┌───┐┌─────┐┌─────────────┐┌─────────┐┌────────┐  │
                        │alternation││set││point││metacharacter││character││grouping│  │
                        └───────────┘└───┘└─────┘└─────────────┘└─────────┘└────────┘  │
                              │        │     │          │            │          │      │
                              ▼        └─────┴──────────┼────────────┘          └──────┤
                     ┌────────────────┐                 │                              │
            ┌────────│ save position  │                 ▼                              │
            │        └────────────────┘          ┌─────────────┐  no match             │
            │        ┌────────────────┐         <│match builder│>──────────┐           │
            ▼◀───────│restore position│◀────┐    └─────────────┘           │           │
     ┌──────────────┐└────────────────┘     │           │ match            │           │
     │loop in paths │                       │           ▼                  ▼           │
     └──────────────┘                       │   ┌─────────────────┐ ┌───────────────┐  │
            │                               │   │advance in string│ │finish, the    │  │
            ▼                               │   └─────────────────┘ │path no matches│  │
      ┌────────────┐ yes  ┌─────────────┐   │           │           └───────────────┘  │
     <│we have path│>───▶<│search regexp│>──┘           └──────────────────────────────┘
      └────────────┘      └─────────────┘ no match
            │ no          match │
            ▼                   ▼
┌───────────────────────┐ ┌────────────┐
│finish, without matches│ │finish, the │
└───────────────────────┘ │path matches│
                          └────────────┘

search regexp version two

               ┌─────────────┐
               │save position│                             ┏━━━━━━━━━━━━━┓
               └─────────────┘                             ┃search regexp┃
        ┌────────────▶│                                    ┗━━━━━━━━━━━━━┛
        │             ▼
        │      ┌──────────────┐
        │      │loop in paths │
        │      └──────────────┘
        │             │                       ┌────────────────────────────────┐
        │             ▼                       ▼                                │
        │       ┌────────────┐   yes    ┌───────────┐                          │
        │      <│we have path│>────────▶│get builder│                          │
        │       └────────────┘          └───────────┘                          │
        │             │ no                    │                                │
        │             ▼                       ▼                                │
        │  ┌───────────────────────┐   ┌───────────────┐ no  ┌─────────────┐   │
        │  │finish: without matches│  <│we have builder│>───▶│finish: the  │   │
        │  └───────────────────────┘   └───────────────┘     │path matches │   │
        │                                     │ yes          └─────────────┘   │
        │                    ┌─────┬──────────┼────────────┬─────────┐         │
        │                    ▼     ▼          ▼            ▼         ▼         │
┌────────────────┐        ┌───┐┌─────┐┌─────────────┐┌─────────┐┌────────┐     │
│restore position│        │set││point││metacharacter││character││grouping│     │
└────────────────┘        └───┘└─────┘└─────────────┘└─────────┘└────────┘     │
        ▲                    │     │          │            │         │         │
        │                    └─────┴──────────┼────────────┘         │         │
        │                                     ▼                      ▼         │
 ┌───────────────┐      no match       ┌─────────────┐        ┌─────────────┐  │
 │finish: the    │◀────────┬──────────<│match builder│>  ┌───<│search regexp│> │
 │path no matches│         │           └─────────────┘   │    └─────────────┘  │
 └───────────────┘         │                  │ match    │           │         │
                           └────────────────┈┈│┈┈────────┘           │ match   │
                                              ▼                      │         │
                                     ┌─────────────────┐             └─────────┤
                                     │advance in string│                       │
                                     └─────────────────┘                       │
                                              │                                │
                                              └────────────────────────────────┘

License

This project is not “open source” is free software, and according to this, use the GNU GPL Version 3. Any work that includes used or resulting code of this library, you must comply with the terms of this license.

Contact, contribution and other things

mailto:nasciiboy@gmail.com

About

regexp4 engine (C-lang)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages