For many developers, regular expressions are arcane magic at best, impenetrable nonsense at worst. I once felt the same way, until in a caffeinated mania I plunged in headfirst and discovered a fascinating and endlessly useful world under the surface. Five years later, regex is a trusted tool I rely on for all kinds of development tasks, like an old Swiss army knife – complex, but wonderfully adaptable. I’m here to show you how you can tame these intimidating globs of characters by breaking them down into manageable chunks, then putting them to work to make your life easier and maybe even save the day.
What is a regular expression?
A regular expression is simply a way of representing a pattern in a string. Many people get a lot of use out of control-F to search a web page for a key term, or the search and replace feature of a word processor. Imagine how much more you could do if you weren’t limited to specifying an exact string. You could get more general, and find everything that looked like an email address or a full name. You could also drill down to the specific and eliminate those false positives that might screw up a search and replace.
Of course, that’s only scratching the surface. The real power of regular expressions lies in the ability to use them in scripts.
Real-world examples of Regular Expressions
One of the first tasks I had at Shockoe was to move some of our documentation from a proprietary service to a self-hosted solution. The old service used a nonstandard flavor of markdown, so part of my approach was to convert all of the documents from that to CommonMark, which plays nicely with a lot of open-source tools like mdbook, which I eventually used to host the files.
This would have been a monumental undertaking by hand, and simple find-and-replace wasn’t going to cut it either. There were lots of changes to make. I’ll start with a simple one: a markdown header in most flavors is written # Header, but the proprietary flavor didn’t include a space between the # and the rest. I needed my script to identify these spots so it could insert a space. Here’s the regex:
^#+(?![ ]) (check out an interactive version on regexr)
Let’s break it down:
| Pattern | What it does |
|---|---|
| ^ | matches the beginning of a line, so we avoid anything in a paragraph body |
| #+ | matches the literal character #; the + means one or more times |
| (?!…) | this is a ‘lookahead’. It checks that whatever came previously is followed by (or in this case not followed by, because of the !) whatever is inside the parentheses. |
| [ ] | a character set containing a literal space |
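So, taken together, this matches one or more # at the beginning of a line, as long as they are not followed by a space. One caveat: on an already correct multi-level header such as ## Header, the engine can backtrack and match just the first #, because that one is followed by another # rather than a space. Widening the character set to [ #] rules this out. If there’s a match, the script can insert a space after the match, and we’ve taken care of this whole class of changes across every file.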
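Here is a minimal sketch of how a script might apply this pattern, assuming Python and its standard `re` module. The function name `add_header_spaces` is made up for illustration. The lookahead set is widened from `[ ]` to `[ #]`, which is a change from the pattern above: without it, the engine can backtrack on `## Header` and match the first `#` alone. `re.MULTILINE` is needed so that `^` matches at the start of every line rather than only at the start of the string.

```python
import re

# The pattern from the article with '#' added to the lookahead set
# (an assumption of this sketch, to stop backtracking on '## Header').
HEADER_WITHOUT_SPACE = re.compile(r"^#+(?![ #])", re.MULTILINE)


def add_header_spaces(text):
    """Insert a space after any leading run of '#' that lacks one."""
    # \g<0> is the whole match (the run of '#'); re-emit it and append a space.
    return HEADER_WITHOUT_SPACE.sub(r"\g<0> ", text)


sample = "#Title\n##Section\n## Already fine\nBody with a #hashtag\n"
print(add_header_spaces(sample))
```

Lines that already have the space, and any # that appears mid-line, are left alone.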
Let’s try a more generally useful one. In this case, I needed to scrape a bunch of markdown image links to get their urls and associated alt text. They look like this:
![alt text](url)
The good news is a regex can capture both of these at once. Here it is:
!\[(.*)\]\((.+)\) (interactive version on regexr)
| Pattern | What it does |
|---|---|
| ! | match the literal ! that marks the start of an image |
| \[ \] \( \) | brackets are special characters in regex, so we have to escape them when we want to match literal brackets. The same is true for parentheses. |
| (.*) | match 0 or more of any character and remember the match. 0 or more because alt text isn’t required, although you should have it because accessibility is important! |
| (.+) | match 1 or more of any character and remember the match |
So, this regex looks for a segment formatted like a markdown image, and remembers whatever was inside the brackets (the alt text) and whatever was inside the parentheses (the url). The most important bit is the parentheses (the non-escaped ones): those tell the regex to remember whatever matched inside of them. From there you can use the values as you like. The exact method varies by language, but here it is for Python:
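import re

def extract_image(inputString):
    # Group 1 is the alt text, group 2 is the url of the first image found
    match = re.search(r'!\[(.*)\]\((.+)\)', inputString)
    if match:
        return match.groups()
    return None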
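The snippet above finds only the first image. To scrape every image in a document, one option is `re.finditer`, sketched below. This version swaps `.*` and `.+` for negated character classes, which is a substitution and not the pattern from the article. The reason is that `.*` is greedy, so with two images on one line it would run from the first `[` all the way to the last `](`. The name `find_images` and the sample paths are made up for illustration.

```python
import re

# Variant of the article's pattern: [^\]]* and [^)]+ stop at the first
# closing bracket or parenthesis, so each image is matched on its own.
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)]+)\)")


def find_images(markdown):
    """Return a list of (alt_text, url) tuples, one per image found."""
    return [m.groups() for m in IMAGE.finditer(markdown)]


sample = "![logo](img/logo.png) and ![](img/blank.png)"
for alt, url in find_images(sample):
    print(repr(alt), url)
```

The trade-off is that this variant will not handle alt text containing a literal ] or a url containing a literal ), which the greedy original sometimes can.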
Making your own Regexes
As with any skill, practice makes perfect; that said, you can make things much easier on yourself by getting a sense for all of the tools available to you. For that, my favorite reference is the MDN web docs’ regex reference. It details every special character and what it does, with examples. It’s a good idea to keep this open while you work.
Next, use a special editor while you’re developing your regex. I’m fond of regexr. It color codes each component of the expression and runs it on sample text live as you edit. This is a great way to test your regex, so make sure to include edge cases the same as you would when writing unit tests for code.
Once you have your special editor and the docs are handy, it’s time to study the text you’re trying to match. Think broadly about the patterns you can see: try to express them in natural language first; then, gradually get more granular until your natural language maps onto the tools you have available. “Match every string where some characters are followed by an @, some more characters, a period, and a few more characters” is a much more straightforward thing to convert to regex than “match every string that looks like an email address.” You’ll also be well prepared to document your regexes, which is one of the most important things you can do to make your code containing them approachable and maintainable.
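As a rough illustration of that translation, here is the plain-language description turned into a pattern piece by piece, using Python's `re.VERBOSE` flag so each piece can carry its own comment. Reading "some characters" as "one or more non-whitespace characters" (`\S+`) is an assumption of this sketch. The pattern is deliberately loose and is not a validator for real email addresses.

```python
import re

# Each line mirrors one clause of the natural-language description.
EMAIL_LIKE = re.compile(
    r"""
    \S+     # some characters (non-whitespace is an assumption here)
    @       # a literal @
    \S+     # some more characters
    \.      # a literal period
    \S+     # a few more characters
    """,
    re.VERBOSE,
)


def find_email_like(text):
    """Return every substring that fits the loose description above."""
    return EMAIL_LIKE.findall(text)


print(find_email_like("Write to ada@example.com or bob@test.org today"))
```

Writing the pattern this way doubles as documentation: the comments are the natural-language version, sitting right next to the regex they describe.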
This is another area where practice helps. I recommend playing some regex golf to practice both writing regexes and spotting and expressing patterns.
Finally, a word of warning: not every pattern has a regex that can find it. Regular expressions are not Turing complete, meaning they don’t have the full expressive power of a programming language. As a rule of thumb, use them for small, specific tasks rather than trying to tackle big general ones, and definitely don’t try to parse arbitrary code with regex. If you want to learn more about the reasons for this, strap in for a heavy dose of theory.
One final, really cool example
That said, regex is capable of quite a lot! I’ll leave you with one of the coolest I’ve ever seen, which illustrates one of the most useful tools in regex, the back-reference (or backref):
^x?$|^(xx+?)\1+$ (interactive version on regexr)
It’s a prime number detector! Or at least a composite number detector. This will match on any composite number of x’s in a row (e.g. xxxx), and not match on any prime (e.g. xxxxx). Let’s break it down one more time.
| Pattern | What it does |
|---|---|
| ^x?$ and the pipe after it | Match the start of the line, 0 or 1 x, and the end of the line. This handles the special cases of 0 and 1, which are non-prime by definition. The pipe character at the end means or, letting the regex move on to use our general algorithm when it doesn’t meet this special case. |
| ^(xx+?) | Match 2 or more x at the start of the line and remember the match thanks to the parentheses. The ? makes the + non-greedy, so it will stop at xx if it hasn’t already, then xxx, and so on. This generates the test divisors in our algorithm. |
| \1+$ | Uses a backref (\1) to repeat the remembered match one or more times until the end of the line. If this works, it means the input can be divided evenly by the match, which is always at least 2, so the number is composite. |
Our algorithm here, outside of the 0 and 1 cases, is to take the smallest chunk of two or more we haven’t tried yet, then try to match it as many times as we can against the rest of the string. If this works, it means the number can be divided evenly by the test value, and is therefore not prime. If it doesn’t, we keep incrementing our test divisor and trying again. If the number is prime, eventually the test divisor will be as long as the string itself, and the regex will stop since it has tried every possibility without finding a match.
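To try this outside of regexr, here is a small sketch in Python. The pattern is the one assembled from the pieces in the breakdown above, and the helper name `is_prime` is made up for illustration. The number n is written as a string of n x’s, and the number is prime exactly when the composite detector fails to match.

```python
import re

# Special cases 0 and 1, or a chunk of 2+ x's repeated at least once more.
COMPOSITE = re.compile(r"^x?$|^(xx+?)\1+$")


def is_prime(n):
    """Return True if n is prime, using the unary-string regex trick."""
    return COMPOSITE.match("x" * n) is None


print([n for n in range(20) if is_prime(n)])
# prints [2, 3, 5, 7, 11, 13, 17, 19]
```

This is a curiosity, not a practical primality test: the engine backtracks through every candidate divisor, so it slows down quickly as n grows.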
Now, go forth with your regex Swiss army knife freshly sharpened. Mess about, make mistakes, and have fun with it until it’s a trusted tool you can reach for like any other.