Libregex is a simple regex API built on a parallel NFA implementation. This
means that while it is not blazingly fast, it does not exhibit the pathological
behavior on regexes like `(aa|aab?)*` that many common backtracking regex APIs
do.
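To illustrate why a parallel NFA simulation avoids this blowup, here is a minimal Python sketch. It is not libregex's implementation: the instruction names (`char`, `split`, `jmp`, `match`), the hand-compiled program `PROG`, and the helper names are all assumptions made for illustration. All active states advance together, one input character at a time, so the work per character is bounded by the size of the program.

```python
# Hypothetical instruction encoding for (aa|aab?)*, compiled by hand.
PROG = [
    ("split", 1, 10),  # 0: enter the loop body or leave the loop
    ("split", 2, 5),   # 1: choose between the two alternatives
    ("char", "a"),     # 2: first alternative: aa
    ("char", "a"),     # 3
    ("jmp", 0),        # 4: back to the top of the loop
    ("char", "a"),     # 5: second alternative: aab?
    ("char", "a"),     # 6
    ("split", 8, 9),   # 7: optional b
    ("char", "b"),     # 8
    ("jmp", 0),        # 9: back to the top of the loop
    ("match",),        # 10: accept
]


def add_thread(prog, pc, threads):
    """Add pc to the state set, following jmp/split without consuming input."""
    stack = [pc]
    while stack:
        pc = stack.pop()
        if pc in threads:
            continue  # each state is added at most once per step
        threads.add(pc)
        op = prog[pc]
        if op[0] == "jmp":
            stack.append(op[1])
        elif op[0] == "split":
            stack.extend(op[1:])


def exec_nfa(prog, text):
    """Return True if the whole of text is accepted by prog."""
    threads = set()
    add_thread(prog, 0, threads)
    for ch in text:
        nxt = set()
        for pc in threads:
            op = prog[pc]
            if op[0] == "char" and op[1] == ch:
                add_thread(prog, pc + 1, nxt)
        threads = nxt
        if not threads:
            return False  # no live states remain
    return any(prog[pc][0] == "match" for pc in threads)
```

Because the state set can never hold more entries than there are instructions, an input such as ten thousand `a` characters is processed in time proportional to its length, with no backtracking.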
API Reference

    pkg regex =
        const compile    : (re : byte[:] -> std.error(regex#, status))
        const dbgcompile : (re : byte[:] -> std.error(regex#, status))
        const free       : (re : regex# -> void)
        const exec       : (re : regex#, str : byte[:] -> std.option(byte[:][:]))
    ;;

`compile` takes a regex string and converts it to a compiled regex, returning
`` `std.Success regex `` if the regex was valid, or `` `std.Failure r `` with the
reason that the compilation failed. `dbgcompile` is similar; however, the
regex is compiled so that it emits a good deal of debugging output. Unless
you are intent on debugging the internals of the regex engine, this is likely
only of academic interest.

`free` must be called on a compiled regex to release its resources after you
are finished using it.

`exec` runs the regex over the specified text, returning `` `std.Some matches ``
if the text matched, or `` `std.None `` if the text did not match. `matches[0]` is
always the full text that was matched, and is always returned regardless
of whether capture groups are specified.

`search` is not yet implemented. It should produce results equivalent to
`exec`ing a regex of the form `.*(regex).*`; however, it does not need to scan
the full string, and is therefore more efficient.
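
The equivalence described for `search` can be illustrated with Python's standard `re` module, used here only as a stand-in because libregex's `search` does not exist yet. The sketch assumes that `exec` must match the entire input and that `.` matches any character including newlines; the helper names are hypothetical.

```python
import re


def exec_full(pattern, text):
    """Stand-in for exec: succeeds only if the pattern matches all of text."""
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None


def search_via_exec(pattern, text):
    """Emulate search by wrapping the pattern as .*(pattern).* and exec-ing it."""
    return exec_full(".*(" + pattern + ").*", text)


def search_direct(pattern, text):
    """A direct search, which may stop as soon as a match is found."""
    return re.search(pattern, text, flags=re.DOTALL) is not None
```

Both helpers agree on whether a match exists. The difference is only in how much of the input has to be examined to reach that answer.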
Example:

    use std
    use regex

    const main = {
        match regex.compile("abc+")
        | `std.Success re:
            runwith(re, "abccc")
            /* release the compiled regex once we are done with it */
            regex.free(re)
        | `std.Failure msg:
            std.fatal("Failed to compile regex\n")
        ;;
    }

    const runwith = {re, txt
        match regex.exec(re, txt)
        | `std.Some matches:
            std.put("matched %s, got %i matches\n", txt, matches.len)
            for m in matches
                /* each m is one matched substring */
                std.put("Match: %s\n", m)
            ;;
        | `std.None:
            std.put("%s did not match\n", txt)
        ;;
    }

Regex Syntax
The grammar for regexes that are accepted is sketched out below.

    regex       : altexpr
    altexpr     : catexpr ('|' altexpr)?
    catexpr     : repexpr (catexpr)?
    repexpr     : baseexpr[*+?][?]
    baseexpr    : literal
                | charclass
                | charrange
                | '.'
                | '^'
                | '$'
                | '(' regex ')'
    charclass   : see below
    charrange   : '[' (literal('-' literal)?)+']'

The following metacharacters have the meanings listed below:
Metachar | Description |
---|---|
. | Matches a single unicode character. |
^ | Matches the beginning of a line. Does not consume any characters. |
$ | Matches the end of a line. Does not consume any characters. |
* | Matches zero or more repetitions of the preceding regex fragment. |
+ | Matches one or more repetitions of the preceding regex fragment. |
? | Matches zero or one of the preceding regex fragment. |
To match a metacharacter literally, precede it with a '\' character.
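
As a small illustration of this escaping rule, the following Python sketch prefixes metacharacters in arbitrary text with a backslash. The function name and the exact character set are assumptions: the set combines the metacharacters listed above with the punctuation used by the grammar (alternation, grouping, ranges) and the backslash itself.

```python
# Assumed metacharacter set: listed metacharacters plus grammar punctuation.
METACHARS = set(".^$*+?|()[]\\")


def escape_literal(text):
    """Prefix every assumed metacharacter in text with a backslash."""
    return "".join("\\" + ch if ch in METACHARS else ch for ch in text)
```

A helper like this is useful when a pattern is assembled from user-supplied text that should be matched verbatim.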
The following character classes are supported:
Charclass | Description |
---|---|
\d | ASCII digits |
\D | Negation of ASCII digits |
\x | ASCII Hex digits |
\X | Negation of ASCII Hex digits |
\s | ASCII spaces |
\S | Negation of ASCII spaces |
\w | ASCII word characters |
\W | Negation of ASCII word characters |
\h | ASCII whitespace characters |
\H | Negation of ASCII whitespace characters |
\pX | Characters with unicode property 'X' |
\PX | Negation of characters with property 'X' |
The currently supported Unicode character classes `X` are:
Abbrev | Full name | Description |
---|---|---|
L | Letter | All letters, including lowercase, uppercase, titlecase, and uncased. |
Lu | Uppercase_Letter | All uppercase letters. |
Ll | Lowercase_Letter | All lowercase letters. |
Lt | Titlecase_Letter | All titlecase letters. |
N | Number | All numbers. |
Z | Separator | All separators, including spaces and control characters. |
Zs | Space_Separator | All space separators, including tabs and ASCII spaces. |
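
To see how one-letter and two-letter abbreviations relate, the Python sketch below uses the standard `unicodedata` module. It reflects the Unicode general categories as Python reports them, which may differ in detail from the tables libregex ships; the helper name `has_property` is hypothetical.

```python
import unicodedata


def has_property(ch, prop):
    """True if ch's general category equals prop or belongs to the group prop.

    Categories are two letters (for example 'Lu'), so a one-letter prop such
    as 'L' selects every category that starts with that letter.
    """
    return unicodedata.category(ch).startswith(prop)
```

For instance, `has_property("A", "Lu")` and `has_property("A", "L")` both hold, while `has_property("a", "Lu")` does not.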