Regular Expressions made easy: a declarative approach

Published on 11 Feb, 2020 · 4 minutes read


Valentino Pellegrino
Software Craftsman / Continuous Learner @BMC Software

Be honest: every time you find a regular expression in the code, you start wondering whether you can avoid changing it, or whether a colleague can help you understand it.
How many seconds do you need to understand that
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
is a regex to grab HTML tags?
If you are searching for a smart way to write and maintain a regular expression, relax and continue reading.

First of all - What is a Regular Expression?

I know Regular Expressions - XKCD comic

“A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is ^.*\.txt$” - https://www.regular-expressions.info/
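
To make the comparison in the quote concrete, here is a small sketch in JavaScript that applies the quoted regex to a handful of file names. The file names are made up for illustration only.

```javascript
// The regex equivalent of the *.txt wildcard, as given in the quote above.
const txtFile = /^.*\.txt$/;

// Hypothetical file names, used only for illustration.
const fileNames = ['notes.txt', 'report.pdf', 'todo.txt.bak', 'readme.txt'];

// Keep only the names that end in ".txt".
const textFiles = fileNames.filter((name) => txtFile.test(name));

console.log(textFiles); // [ 'notes.txt', 'readme.txt' ]
```

Note that todo.txt.bak is rejected: the $ anchor requires the name to end right after .txt.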

There are a lot of use cases where regular expressions fit well:

  • You want to analyze command lines.
  • In general, you want to parse user input.
  • A huge text file: let’s parse it to find some useful stuff (e.g. specific logged errors).
  • Pattern matching (e.g. you want a password to follow a specific format).
  • Replace a repeating sub-string in a character sequence.
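
A few of the use cases above can be sketched with JavaScript’s built-in RegExp. The log content, the password policy and the sample strings below are assumptions made up for this example.

```javascript
// Hypothetical log content, used only for illustration.
const log = [
  '2020-02-11 10:00:01 INFO  Server started',
  '2020-02-11 10:00:05 ERROR Connection refused',
  '2020-02-11 10:00:09 ERROR Disk quota exceeded',
].join('\n');

// Find logged errors: capture the message that follows "ERROR" on each line.
const errorLine = /^\S+ \S+ ERROR\s+(.*)$/gm;
const errors = [...log.matchAll(errorLine)].map((match) => match[1]);
console.log(errors); // [ 'Connection refused', 'Disk quota exceeded' ]

// Pattern matching: an assumed password policy of at least 8 characters,
// with at least one digit and one uppercase letter.
const strongPassword = /^(?=.*\d)(?=.*[A-Z]).{8,}$/;
console.log(strongPassword.test('Secret123')); // true
console.log(strongPassword.test('secret')); // false

// Replace a repeating sub-string: collapse runs of spaces into a single one.
const collapsed = 'too    many   spaces'.replace(/ {2,}/g, ' ');
console.log(collapsed); // too many spaces
```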

In order to use regex, you have to understand and remember a lot of symbols and methods:

Regular expression methods - www.ks.uiuc.edu

Why are Regular Expressions used so much?

Regular expressions are widely used largely because of their performance. The more precise your regex is, the less likely you are to accidentally match text that you didn’t mean to match.
Regexes are really fast when they are accurate. Good regular expressions are often longer than bad ones because they use specific characters and character classes and have more structure. This lets good regular expressions run faster, because they describe their input more accurately and leave the engine fewer alternatives to try.
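
As a small illustration of precision, the sketch below extracts a quoted string from a made-up sentence, first with a loose pattern and then with a more specific one. The input string is an assumption chosen for the example.

```javascript
const line = 'say "hello" and then "goodbye"';

// Loose: ".*" is greedy, so it runs from the first quote to the last one.
const loose = line.match(/".*"/)[0];
console.log(loose); // "hello" and then "goodbye"

// Precise: [^"]* can never cross a quote, so the match stops at the first closing quote.
const precise = line.match(/"[^"]*"/)[0];
console.log(precise); // "hello"
```

The precise version is slightly longer, but it matches only what was intended. It also gives the engine less to undo: the greedy .* first consumes the rest of the line and then has to backtrack to find a closing quote.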

VerbalExpressions

VerbalExpressions is a set of libraries that represents an easy way to write readable regex. It can ease the pain of regex, and actually make writing expressions fun again.
VerbalExpressions has been ported to so many other languages that a GitHub organization (https://github.com/VerbalExpressions) was created just to host them all.
Obviously, there is also an implementation of the library for JavaScript (https://github.com/VerbalExpressions/JSVerbalExpressions).
Take a complex regex that checks for a valid URL: /^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$/
Let’s see how easy it is to write it using the library:

const urlTester = VerEx()
    .startOfLine()
    .then('http')
    .maybe('s')
    .then('://')
    .maybe('www.')
    .anythingBut(' ')
    .endOfLine();
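
To give an intuition of what happens behind such a fluent chain, here is a minimal sketch of a builder that assembles a pattern step by step. MiniVerEx and escapeRegex are hypothetical names written for this article; this is not the library’s code and it only mimics the handful of methods used above.

```javascript
// Escape characters that have a special meaning inside a regex.
const escapeRegex = (text) => text.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');

// MiniVerEx is a hypothetical, stripped-down builder (NOT the real library).
class MiniVerEx {
  constructor() {
    this.source = '';
  }
  add(fragment) {
    this.source += fragment;
    return this; // returning "this" is what makes the calls chainable
  }
  startOfLine() { return this.add('^'); }
  endOfLine() { return this.add('$'); }
  then(text) { return this.add(`(?:${escapeRegex(text)})`); }
  maybe(text) { return this.add(`(?:${escapeRegex(text)})?`); }
  anythingBut(chars) { return this.add(`(?:[^${escapeRegex(chars)}]*)`); }
  toRegExp() { return new RegExp(this.source); }
}

const urlTester = new MiniVerEx()
  .startOfLine()
  .then('http')
  .maybe('s')
  .then('://')
  .maybe('www.')
  .anythingBut(' ')
  .endOfLine()
  .toRegExp();

// The hand-written regex from the article, for comparison.
const handWritten = /^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$/;

// Sample inputs are made up; both patterns should agree on each of them.
for (const candidate of ['https://www.google.com', 'ftp://example.com', 'http://bad url.com']) {
  console.log(candidate, urlTester.test(candidate), handWritten.test(candidate));
}
```

Each method only appends a small, well-understood fragment, which is why the chained version reads like a sentence while still producing an ordinary regular expression.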

How to use it

There are several ways to use the library:

  • You can download it and import it using a standard script tag.
  • You can use a Content Delivery Network, like http://www.jsdelivr.com/projects/jsverbalexpressions
  • You can install it using NPM and use it in any Node-based application:
    npm install verbal-expressions

You can also use it live on the site https://verbalregex.com/

Chatbot Expenses - Simple bot for collecting expenses typed in the terminal

In this example (https://github.com/vpellegrino/chatbot-expenses), I show how to build complex parsing features for a simple Node.js application with a prompt interface, which collects and reports expenses for a group of users.
Imagine you want to offer a list of commands, like the ones defined below.
Store expense

<EXPENSE>=<PARTICIPANT>[,<PARTICIPANT>...][ "<MESSAGE>"]

For each participant, you can also specify a different split for the costs, by using the modifiers + and *.
Examples:

84.20=MR,VP+0.20 "Pizza"

This means that VP has paid 84.20 USD for a pizza, of which 42.00 USD is owed by MR.

MR> 20=VP "Hamburger"

In this example, MR has paid 20 USD for a hamburger eaten by VP.
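
As a rough sketch, the store-expense command could be parsed with a plain regex as shown below. This is not the code from the chatbot-expenses repository. It assumes that participants are two uppercase letters (as in the examples), that a modifier is + or * followed by a number, and that +x means the participant pays x on top of an equal share of the rest, which is the reading consistent with the 42.00 USD figure above. The * modifier is parsed but not interpreted here.

```javascript
// Building blocks, kept as strings so they can be embedded in a larger pattern.
const NUMBER = '\\d+(?:\\.\\d+)?';
const PARTICIPANT = `[A-Z]{2}(?:[+*]${NUMBER})?`;

// <EXPENSE>=<PARTICIPANT>[,<PARTICIPANT>...][ "<MESSAGE>"]
const expenseCommand = new RegExp(
  `^(${NUMBER})=(${PARTICIPANT}(?:,${PARTICIPANT})*)(?: "([^"]*)")?$`
);

function parseExpense(line) {
  const match = expenseCommand.exec(line);
  if (match === null) return null; // not a well-formed expense command

  const [, amount, participantList, message = ''] = match;
  const participants = participantList.split(',').map((entry) => {
    const [, initials, modifier = null, value = null] =
      /^([A-Z]{2})(?:([+*])(.+))?$/.exec(entry);
    return { initials, modifier, value: value === null ? null : Number(value) };
  });
  return { amount: Number(amount), participants, message };
}

// Split in cents to avoid floating-point rounding. Only "+" is interpreted (assumption).
function splitWithExtras({ amount, participants }) {
  const totalCents = Math.round(amount * 100);
  const extras = participants.map((p) =>
    p.modifier === '+' ? Math.round(p.value * 100) : 0
  );
  const extraCents = extras.reduce((sum, cents) => sum + cents, 0);
  const equalShare = (totalCents - extraCents) / participants.length;
  return Object.fromEntries(
    participants.map((p, i) => [p.initials, (equalShare + extras[i]) / 100])
  );
}

const pizza = parseExpense('84.20=MR,VP+0.20 "Pizza"');
console.log(pizza.message); // Pizza
console.log(splitWithExtras(pizza)); // { MR: 42, VP: 42.2 }
```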

Retrieve the list of expenses

HISTORY

Retrieve the group balance

BALANCE

This is the most important command, since behind the scenes it uses an algorithm similar to those for the Bin Packing and Partition problems. The goal is to print the minimal set of transactions needed to pay all debts inside the group.
Example:

Alice -> Bill $10
Bill -> Alice $1
Bill -> Charles $5
Charles -> Alice $5

The solution would be:

Alice = $4
Bill = $-4
Charles = $0
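
The balances in this example can be reproduced with a few lines of JavaScript. The sketch below reads "A -> B $x" as "A has paid x on behalf of B", which is the reading that matches the solution shown. The settle function is a simple greedy pairing added for illustration: it is not the algorithm used in the repository, it is not guaranteed to produce a minimal set of transactions in general, and it assumes integer amounts.

```javascript
// Each entry: "from" has paid "amount" on behalf of "to" (assumed reading).
const payments = [
  { from: 'Alice', to: 'Bill', amount: 10 },
  { from: 'Bill', to: 'Alice', amount: 1 },
  { from: 'Bill', to: 'Charles', amount: 5 },
  { from: 'Charles', to: 'Alice', amount: 5 },
];

// Net balance per person: positive means "is owed money", negative means "owes money".
function netBalances(entries) {
  const balances = {};
  for (const { from, to, amount } of entries) {
    balances[from] = (balances[from] || 0) + amount;
    balances[to] = (balances[to] || 0) - amount;
  }
  return balances;
}

// Greedy pairing: walk through debtors and creditors, largest amounts first,
// and let each debtor pay as much as possible to the current creditor.
function settle(balances) {
  const byAmountDesc = (a, b) => b.left - a.left;
  const creditors = Object.entries(balances)
    .filter(([, balance]) => balance > 0)
    .map(([name, balance]) => ({ name, left: balance }))
    .sort(byAmountDesc);
  const debtors = Object.entries(balances)
    .filter(([, balance]) => balance < 0)
    .map(([name, balance]) => ({ name, left: -balance }))
    .sort(byAmountDesc);

  const transactions = [];
  let c = 0;
  let d = 0;
  while (c < creditors.length && d < debtors.length) {
    const amount = Math.min(creditors[c].left, debtors[d].left);
    transactions.push({ from: debtors[d].name, to: creditors[c].name, amount });
    creditors[c].left -= amount;
    debtors[d].left -= amount;
    if (creditors[c].left === 0) c += 1;
    if (debtors[d].left === 0) d += 1;
  }
  return transactions;
}

const balances = netBalances(payments);
console.log(balances); // { Alice: 4, Bill: -4, Charles: 0 }
console.log(settle(balances)); // [ { from: 'Bill', to: 'Alice', amount: 4 } ]
```

Four recorded payments collapse into a single transaction: Bill pays Alice $4, and Charles is already even.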

Declarative Regular Expressions

The service that is responsible for providing all checks for well-formed commands and for grabbing user input is src/services/regExpService.js.
A series of constants (that can be reused in other complex expressions) has been defined. For instance:

const twoLetters = new VerbalExpression()
                      .then(new VerbalExpression().range('A', 'Z').repeatPrevious(2));

These constants are then combined into more complex functions that are still easy to read (or at least easier than the equivalent regex).
For example, given a line of text, the function below captures two elements: the sender’s initials and the message they sent:

function parseSenderInitialsAndText(line) {
    return new VerbalExpression()
        .startOfLine()
        .beginCapture().then(twoLetters).endCapture().then(ARROW).maybe(WHITESPACE)
        .beginCapture().then(new VerbalExpression().anything()).endCapture()
        .endOfLine().exec(line);
}
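
For comparison, here is a sketch of the same parsing done with the built-in RegExp only. The ARROW and WHITESPACE constants are not shown in this article, so their values below are assumptions based on the MR> 20=VP "Hamburger" example: a > character and a single whitespace character.

```javascript
// Reusable fragments, kept as strings so they can be embedded in larger patterns.
const TWO_LETTERS = '[A-Z]{2}'; // counterpart of the twoLetters constant
const ARROW = '>'; // assumption: the separator seen in `MR> 20=VP "Hamburger"`
const WHITESPACE = '\\s'; // assumption: one whitespace character

const senderAndText = new RegExp(`^(${TWO_LETTERS})${ARROW}${WHITESPACE}?(.*)$`);

function parseSenderInitialsAndText(line) {
  const match = senderAndText.exec(line);
  // Drop the full match at index 0 and keep only the two captured groups.
  return match === null ? null : match.slice(1);
}

console.log(parseSenderInitialsAndText('MR> 20=VP "Hamburger"'));
// [ 'MR', '20=VP "Hamburger"' ]
console.log(parseSenderInitialsAndText('hello')); // null
```

The assembled pattern is ^([A-Z]{2})>\s?(.*)$, which is compact but says much less about its intent than the chained version above.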

It is quite easy to switch from a standard regex to VerbalExpression() and vice versa. So it is definitely easy to combine them when you don’t know exactly how a specific regex works but still need to extend it.

Conclusion

Regular Expressions are mathematically sound and fast. But they suck 😁 really hard in terms of ease of use and maintainability.
So, for good performance, we need longer regular expressions. 😮
But, for good maintainability, we need shorter regular expressions. 🤔
VerbalExpressions represents a good solution 😎 that enables you to use regex without the pain of maintaining them. With a declarative approach, you can simply write your statement, describing the way you expect to check or grab a certain character or group of characters.