Eigenstate : libregex

Libregex is a simple regex API built on a parallel NFA implementation. This means that while it is not blazingly fast, it does not exhibit the pathological behavior that many common regex APIs show on regexes like (aa|aab?)*.
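To make the "parallel NFA" idea concrete, here is a minimal Python sketch of lockstep NFA simulation for the regex (aa|aab?)*. It is not libregex's code: the state numbering, the NFA table, and the helper names closure and matches are all made up for illustration. The point is only that every live state advances together on each input character, so the work is bounded by the text length times the number of states, with no backtracking.

```python
# Hand-built NFA for (aa|aab?)*. State 0 is both the start and accept state.
# This table is an illustrative assumption, not libregex's internal format.
EPS = None
NFA = {
    0: [(EPS, 1), (EPS, 3)],  # choose a branch: "aa" or "aab?"
    1: [("a", 2)],
    2: [("a", 0)],            # finished "aa", loop back
    3: [("a", 4)],
    4: [("a", 5)],
    5: [(EPS, 0), ("b", 0)],  # optional trailing "b", then loop back
}
ACCEPT = 0


def closure(states):
    """Follow epsilon edges until no new states appear."""
    seen = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for label, dst in NFA[s]:
            if label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def matches(text):
    """Advance every live state in lockstep, one input character at a time."""
    current = closure({0})
    for ch in text:
        stepped = {dst for s in current for label, dst in NFA[s] if label == ch}
        current = closure(stepped)
        if not current:
            return False  # no live states left, so no match is possible
    return ACCEPT in current
```

Because the set of live states can never grow beyond the six states in the table, an input such as forty a's followed by a c is rejected after a single pass over the text.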

API Reference

pkg regex =
    const compile   : (re : byte[:] -> std.error(regex#, status))
    const dbgcompile    : (re : byte[:] -> std.error(regex#, status))
    const free  : (re : regex# -> void)
    const exec  : (re : regex#, str : byte[:] -> std.option(byte[:][:]))
;;

compile takes a regex string and converts it to a compiled regex, returning `std.Success regex if the regex was valid, or a `std.Failure r with the reason that the compilation failed. dbgcompile is similar; however, the regex is compiled so that it will spit out a good deal of debugging output. Unless you are intent on debugging the internals of the regex engine, this is likely only of academic interest.

free must be called on a compiled regex to release its resources after you are finished using it.

exec runs the regex over the specified text, returning `std.Some matches if the text matched, or `std.None if the text did not match. matches[0] is always the full text that was matched, and will always be returned regardless of whether capture groups are specified.

search is not yet implemented. It should produce results equivalent to running exec with a regex of the form '.*(regex).*'; however, it does not need to scan the full string, and is therefore more efficient.
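The relationship between searching and whole-string matching can be sketched in Python, using the standard re module purely as a stand-in (libregex's search does not exist yet, and the function names below are hypothetical). Wrapping the pattern in leading and trailing "match anything" fragments and requiring a match of the whole string gives the same yes/no answer as an unanchored search.

```python
import re


def search_via_exec(pattern, text):
    """Emulate a search by matching .*(pattern).* against the whole string."""
    wrapped = ".*(" + pattern + ").*"
    return re.fullmatch(wrapped, text, re.DOTALL) is not None


def search_direct(pattern, text):
    """Unanchored search, which may stop as soon as a match is found."""
    return re.search(pattern, text, re.DOTALL) is not None


# Both approaches agree on whether the pattern occurs somewhere in the text.
for sample in ["xxabcccyy", "abc", "ab", ""]:
    assert search_via_exec("abc+", sample) == search_direct("abc+", sample)
```

Only the boolean result is compared here. Which substring ends up in the capture group can differ between the two approaches, because a greedy leading '.*' prefers a later occurrence.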

Example:

use std
use regex

const main = {
    match regex.compile("abc+")
    | `std.Success re:
        runwith(re, "abccc")
        /* release the compiled regex once we are done with it */
        regex.free(re)
    | `std.Failure msg:
        std.fatal("Failed to compile regex\n")
    ;;
}

const runwith = {re, txt
    match regex.exec(re, txt)
    | `std.Some matches:
        std.put("matched %s, got %i matches\n", txt, matches.len)
        /* matches[0] is the full match; any capture groups follow */
        for m in matches
            std.put("Match: %s\n", m)
        ;;
    | `std.None:
        std.put("%s did not match\n", txt)
    ;;
}

Regex Syntax

The grammar for regexes that are accepted is sketched out below.

       regex       : altexpr
       altexpr     : catexpr ('|' altexpr)?
       catexpr     : repexpr (catexpr)?
       repexpr     : baseexpr ([*+?][?]?)?
       baseexpr    : literal
                   | charclass
                   | charrange
                   | '.'
                   | '^'
                   | '$'
                   | '(' regex ')'
       charclass   : see below
       charrange   : '[' (literal('-' literal)?)+']'

The following metacharacters have the meanings listed below:

    Metachar  Description
    .         Matches a single unicode character.
    ^         Matches the beginning of a line. Does not consume any characters.
    $         Matches the end of a line. Does not consume any characters.
    *         Matches any number of repetitions of the preceding regex fragment.
    +         Matches one or more repetitions of the preceding regex fragment.
    ?         Matches zero or one of the preceding regex fragment.

In order to match a literal metacharacter, it needs to be preceded by a '\' character.
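The repetition operators and the backslash escape behave the same way in most regex dialects, so a few cases can be checked with Python's re module as a stand-in. This is an illustration of the semantics described above, not a test of libregex itself; the helper name full is hypothetical.

```python
import re


def full(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


examples = [
    (r"ab*", "a", True),     # '*' allows zero repetitions
    (r"ab+", "a", False),    # '+' needs at least one
    (r"ab?", "abb", False),  # '?' allows at most one
    (r"a\+b", "a+b", True),  # escaped '+' matches a literal plus sign
    (r"a\+b", "aab", False), # and no longer acts as a repetition operator
]

for pattern, text, expected in examples:
    assert full(pattern, text) == expected
```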

The following character classes are supported:

    Charclass  Description
    \d         ASCII digits
    \D         Negation of ASCII digits
    \x         ASCII Hex digits
    \X         Negation of ASCII Hex digits
    \s         ASCII spaces
    \S         Negation of ASCII spaces
    \w         ASCII word characters
    \W         Negation of ASCII word characters
    \h         ASCII whitespace characters
    \H         Negation of ASCII whitespace characters
    \pX        Characters with unicode property 'X'
    \PX        Negation of characters with property 'X'

The currently supported Unicode character classes X are:

    Abbrev  Full name         Description
    L       Letter            All letters, including lowercase, uppercase, titlecase, and uncased.
    Lu      Uppercase_Letter  All uppercase letters.
    Ll      Lowercase_Letter  All lowercase letters.
    Lt      Titlecase_Letter  All titlecase letters.
    N       Number            All numbers.
    Z       Separator         All separators, including spaces and control characters.
    Zs      Space_Separator   All space separators, including tabs and ASCII spaces.
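The abbreviations above follow the naming of Unicode general categories, where a one-letter class such as L covers every two-letter category that starts with it. The Python sketch below illustrates that prefix relationship with the standard unicodedata module. It assumes the classes correspond to general categories; libregex's own tables may classify some characters differently (tabs and control characters, for example), and has_property is a hypothetical helper, not part of libregex.

```python
import unicodedata


def has_property(ch, prop):
    """True if the general category of ch equals prop or starts with it."""
    return unicodedata.category(ch).startswith(prop)


# 'A' is an uppercase letter, so it is in both Lu and the broader class L.
assert has_property("A", "Lu") and has_property("A", "L")
# 'a' is a letter, but not an uppercase one.
assert has_property("a", "L") and not has_property("a", "Lu")
# U+01C5 is a titlecase letter.
assert has_property("\u01c5", "Lt")
# Digits fall under N; an ASCII space is a space separator.
assert has_property("7", "N")
assert has_property(" ", "Zs") and has_property(" ", "Z")
```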