This is the first part of a series of articles about regular expressions in Vim.
“We should never write regular expressions. They are difficult to learn, understand, and maintain. They can spiral in a wormhole of quantum complexity. We should burn them to the ground!”
That was Dave, your colleague developer, patronizing you for using a simple regex to parse an HTML file. Even if you don’t like the tone of your colleague, you ask yourself: are regular expressions that bad? Should we use them?
It’s true that writing regular expressions (also called “regex” or “regexp”) in a codebase can be problematic. Like any piece of code, they will change over time, and eventually get more and more complex. Also, their conciseness can make them difficult for our poor brains to parse.
That said, we don’t have to write regexes in a codebase; we can also use them for one-off tasks. You need to find a specific entry in a gigantic pile of logs? You need to replace a specific HTML attribute with another one, whatever its value? Regexes can help you!
That’s what we’ll focus on in this article: writing regular expressions for different one-off tasks.
To write regexes, we need tools supporting them; we’ll use mostly Vim here, but also GNU Grep to see the difference between two (similar) regex flavors.
In this article, we’ll try to answer these questions:
- What are regular expressions for?
- What are metacharacters? Why are they different from the characters we all know and love?
- What are the most common regex engines (also called “regex flavors”) available?
- What are the common metacharacters we can use in our regexes?
To understand how regular expressions work, we’ll use, throughout this article, this example file you can download and open in Vim (or copy-paste if you want). That way, you won’t read a boring article in the worst passive way, but you’ll be the acting Hero of the Regular Expression Journey™. I’ve written some small exercises you’ll find throughout this article, for you to test your knowledge, and to remember what we talk about.
Each exercise will have a solution using Vim’s regex engine as well as the PCRE one, using GNU Grep (and sometimes the Perl CLI when we need to substitute some text). As a result, to follow along, make sure that you have Vim (or Neovim), GNU Grep, and possibly Perl installed on your computer.
One of the most important things to keep in mind when creating your regexes: the perfect regex is rarely necessary. Try something good enough to solve your problem, instead of crafting the most general regex for five hours.
Regexes are not that difficult once you get the hang of them, but there is a lot to cover; that’s why this article is part of a series of articles about regexes, and more specifically regexes in Vim.
That’s it! We’re now ready to craft our regexes like a Plain Text God©.
The Goal of Regexes
The goal of a regex is to match some text to perform some action(s). For example:
- Searching a specific text pattern in a text file.
- Searching and replacing a specific pattern (using the substitute command `:s` in Vim, for example).
- Operating on the lines matching a specific pattern (using the global command `:g` in Vim, for example).
That’s why regular expressions are so powerful: they allow us to define abstract text patterns to do what we want to do, effectively and efficiently.
Characters and Metacharacters
The simplest regex patterns we can create are composed of characters. In that case, the characters of your pattern will try to match the characters of your text.
For example, the regex `vim` will match the three consecutive characters `v`, `i`, and `m` in your text. To search for these characters in Vim, you can first open the HTML file provided for this article, and then search by running the command `/vim`. Indeed, Vim’s search supports regexes by default.
But using only characters in a regex forces us to know the exact characters (and their order) we want to search for, or act upon. What if we want to match a more general pattern, where we know some characters, but others could be anything and everything?
That’s where metacharacters become useful. They offer us the means to match a bunch of characters, even if we don’t know precisely what these characters are.
For example, if you run `/[abc]` in Vim, you won’t match the consecutive characters `[`, `a`, `b`, `c`, and `]` this time. Instead, you’ll match every single character which is either `a`, `b`, or `c`. The square brackets are metacharacters: they won’t match `[` or `]` in our text, but they have a special meaning. The regex engine, which is in charge of interpreting regexes, will understand that we want to use a character class here; more on that later.
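To see the difference between literal characters and a character class outside of Vim, here is a minimal sketch using GNU Grep with its PCRE engine (`-P`), plus the option `-o` to print each match on its own line. The sample text and the variable names are made up for the illustration; they don’t come from example.html.

```shell
# Made-up sample text, standing in for a line of example.html.
sample='vim and neovim are cabs'

# Literal characters: matches the exact sequence v, i, m.
literal=$(printf '%s\n' "$sample" | grep -oP 'vim')

# Character class: each match is ONE character among a, b, or c.
class=$(printf '%s\n' "$sample" | grep -oP '[abc]')

printf 'literal matches:\n%s\n' "$literal"
printf 'character class matches:\n%s\n' "$class"
```

The first regex finds “vim” twice (once inside “neovim”), while the second one returns single characters only.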
So, metacharacters have a meaning, unlike literal characters we want to match one to one. With metacharacters, we can make our regular expressions more abstract, and, therefore, more general, more suitable for many different possible text patterns.
You’ll notice that a metacharacter could also be a character we want to match in our text. Said differently, a metacharacter always has its “character counterpart”. For example, what if we don’t want to use the metacharacters `[` or `]`, but we really want to match the character `[` or `]` in our text? This is a question which can get tricky quickly, so we’ll come back to that.
This concept of metacharacter is important to understand; we’ll often use the term “characters” and “metacharacters” in this series of articles. If you don’t understand the difference while reading the rest of this article, you can try to come back to this section to clarify your doubts.
Also, don’t be surprised to find the concept of “atom” in Vim’s help. It’s almost a synonym for “metacharacter”, except that it always matches one character in the text.
:help pattern-searches
:help atom
Regex Engines
There are many different regex engines (also called “regex flavors”) out there. As we saw above, these engines interpret the different metacharacters of a regex. That is, depending on the engine you’re using, you won’t have the same metacharacters available. The meaning of the different metacharacters can also be different from one engine to another. To make everything more complicated, regex engines can also have different versions, offering slightly different takes on their metacharacters.
But there are islands of hope in this lake of confusion: many regex flavors are based on the Perl regular expression engine. It’s not really a standard, but it’s considered as such by many. That’s why many resources out there will simplify everything by saying that the most used regex engines are all “Perl-style”, or PCRE, for Perl Compatible Regular Expressions.
In fact, PCRE is simply a library written in C; many developers used it to create their own regex engine implementations in different tools or programming languages. That’s why it’s considered one of the most common regex engines.
That said, all these “Perl-style” regex engines don’t necessarily have everything PCRE defines. In the case of Vim, its regex engine is not even based on PCRE, but it still has many metacharacters in common with PCRE. Yet, it also has some new, unique metacharacters, and weird quirks.
What’s important to remember: a regex that works as expected in your tool of choice won’t necessarily work the same way in another tool. Look at the specifics of the regex flavors you’re working with, and adjust your regexes accordingly.
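As a small illustration of this point, the sketch below runs the same regex through two flavors shipped with GNU Grep: its default one (basic regular expressions) and PCRE (`-P`). The default flavor isn’t covered in this article; it’s only used here as a convenient second engine, and the sample text is made up.

```shell
# Made-up sample: "ab" repeated twice, then the regex itself spelled out as text.
sample='abab (ab){2}'

# Default flavor of GNU Grep: unescaped parentheses and curly
# brackets are plain characters, so the regex matches itself.
basic=$(printf '%s\n' "$sample" | grep -o '(ab){2}')

# Same flavor, escaped: they now act as a group and a quantifier.
basic_escaped=$(printf '%s\n' "$sample" | grep -o '\(ab\)\{2\}')

# PCRE flavor (-P): unescaped, they are already metacharacters.
pcre=$(printf '%s\n' "$sample" | grep -oP '(ab){2}')

printf 'basic: %s\nbasic, escaped: %s\npcre: %s\n' "$basic" "$basic_escaped" "$pcre"
```

The same characters give two different matches depending on the engine, which is exactly why it’s worth checking the flavor before reusing a regex elsewhere.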
In this article, we’ll mostly see the basics of regex in Vim. To show the differences with a “Perl-style” regex engine, all the solutions of the exercises are also using GNU Grep with its PCRE engine. It will give you the general knowledge for you to be able to work with most regex engines out there.
Vim Regex Engine or the Escaping Hell
Vim’s regex flavor is based on an old engine from the equally old “ed”, the standard editor. This engine was created even before PCRE was a thing, and, therefore, it has some specificities you won’t find anywhere else.
If you look at some Vim regexes online, the most obvious oddity will jump out at you: they have many escaped characters. For example, if you want to use both parentheses “(” and “)” as metacharacters for grouping (we’ll see what it means below in this article), you’ll need to escape them: `(` will become `\(` and `)` will become `\)`.
Do you think it’s annoying? I do. Especially since not every character needs to be escaped to get its metacharacter counterpart, only some of them. Talk about inconsistency! Who wants to learn by heart what should be escaped and what shouldn’t? Not me, not Dave (your colleague developer), and maybe not you.
There is a “magic” way to go around this weird quirk however, to get a more consistent behavior. We need to use `\v` at the beginning of our regexes. This stands for “very magic”. What? Are we magicians, now? Well, maybe you always have been, deep inside.
We’ll look more in depth at this concept of “magic” in the second article of this series. For now, just remember this: with the “very magic” `\v` at the beginning of a regex pattern, every character which can be a metacharacter will be a metacharacter. If you need to match the character itself, and not use the metacharacter, you’ll need to escape it. In every case.
For example, if I want to use the metacharacters `(` and `)`, I can simply run the following in Vim:
/\v(whatIsearch)
Here, the parentheses are metacharacters (we’ll see what they mean later). On the other hand, if I want to match the characters `(` and `)` in my text, I’ll need to escape them with a backslash `\`:
/\v\(someContent\)
We’ll always use the “very magic” `\v` before each pattern in this article if it contains any metacharacter, for the sake of consistency.
Last important point: to look at Vim’s help for a specific metacharacter, you’ll often have to prefix it with the search symbol `/`, and then the metacharacter you want to look at. Sometimes, you’ll also need to escape it! I included the different help commands you can run in Vim to find these metacharacters throughout the article, so you’ll see what I’m talking about.
Common Regex Metacharacters
Let’s now look at the common metacharacters we can use to create our awesome regular expressions. This section is meant to be pretty general; you’ll have access to these metacharacters in most regex engines, including Vim. A general knowledge about these metacharacters will help you tremendously if you need to manipulate plain text. You know, like a developer searching a specific pattern in a mountain of delicious XML. It can also be very useful for writers.
If you often use a shell (or if you want to), many CLIs support regexes, too.
The Full Stop
The full stop “.” might be the easiest metacharacter to understand: it represents any character. For example, what would you do if you wanted to match the characters `vim` and the character that directly follows, whatever this character is? This will be our first exercise; you can use Vim’s search to solve it, and/or GNU Grep with the option `-P` to use its PCRE engine. Searching in the file example.html will give you a couple of matches.
/\vvim.
grep -P 'vim.' example.html
With this regex, you’ll match “vim” followed by any character. If you search in the file example.html, the regex will match `vim-` or `vim_`, for example. It also matches substrings: the characters “vim)” from “Neovim)” will be matched.
:help /\.
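If you don’t have example.html at hand, here is a minimal sketch of the same search on a few made-up lines. The option `-o` asks Grep to print only the matched part, one match per line.

```shell
# Four made-up lines; the last one has nothing after "vim".
sample=$(printf '%s\n' 'vim-plug' 'vim_config' '(Neovim)' 'vim')

# "vim" followed by any single character.
matches=$(printf '%s\n' "$sample" | grep -oP 'vim.')

printf '%s\n' "$matches"
```

The last line doesn’t match: the full stop needs one character to consume, and there is none left on that line.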
Character Classes
It’s nice to have the full stop to match any character, but you’ll often want to match some specific characters, not all of them. Character classes (also called character sets) are great for that.
General Concept
A character class represents only one character (like the full stop) from a specific set of characters. To create this set, you need to use the metacharacter `[` followed by the character(s) you want to include. To close the set, use `]`.
For example, if you want to search and match the characters “v”, “i”, or “m”, you can run the following in Vim:
/\v[vim]
grep -P '[vim]' example.html
Keep in mind that a character class always represents one character from a set, whatever the number of characters in the set. That’s why our example matches the characters `v`, `i`, or `m`, whether or not they are consecutive characters. That’s also why, when you hit `n` or `N` in Vim, you’ll jump from one character to another, not from one word `vim` to another. Another consequence of this fact: the order of characters in your character class doesn’t matter. The following also works:
/\v[ivm]
Ranges
Some characters are metacharacters inside the square brackets, and only inside them. The hyphen `-` is one of them. If you search for it outside a character class, it will match the character `-`, as follows:
/\v-
No metacharacters here. But inside the character class, the hyphen represents a range, if and only if it’s surrounded by two other characters. These two characters are the beginning and the end of the range. For example:
/\v[a-z]
This regex will match a single character in the range from `a` to `z`, inclusive. Said differently, it will match any lowercase letter.
If the hyphen is not surrounded by two characters, it won’t be a metacharacter. For example, `/[a-]` and `/[-z]` will match the characters “a” or “-”, and “-” or “z”, respectively.
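The three cases can be compared side by side with GNU Grep and its PCRE engine. This is only a sketch on a made-up three-character string; the variable names are hypothetical.

```shell
# Made-up sample containing a lowercase letter, a hyphen, and another letter.
sample='a-z'

# Hyphen between two characters: a range (the hyphen itself is NOT matched).
range=$(printf '%s\n' "$sample" | grep -oP '[a-z]')

# Hyphen at the end or at the beginning of the class: a plain hyphen.
trailing=$(printf '%s\n' "$sample" | grep -oP '[a-]')
leading=$(printf '%s\n' "$sample" | grep -oP '[-z]')

printf 'range:\n%s\ntrailing hyphen:\n%s\nleading hyphen:\n%s\n' "$range" "$trailing" "$leading"
```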
Now, how would you match all uppercase letters of the alphabet? What about matching all digits from 0 to 9? How would you match both ranges?
To match all uppercase letters:
/\v[A-Z]
grep -P '[A-Z]' example.html
To match all numbers from 0 to 9:
/\v[0-9]
grep -P '[0-9]' example.html
To match both ranges:
/\v[0-9A-Z]
grep -P '[0-9A-Z]' example.html
The order doesn’t matter; `/\v[A-Z0-9]` works too, for example.
Negated Character Classes
The hyphen is not the only metacharacter which is exclusive to the character class. You can also use the caret `^` after the opening square bracket `[` to negate the character class. That is, instead of saying “I want to match these characters”, you’re saying “I want to match every character except these characters”.
This is exactly what the following does:
/\v[^abc]
The order is important here: to act as a metacharacter, the caret `^` needs to be placed just after the opening square bracket `[`. The regex `[a^bc]` is not equivalent to the one above, for example.
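Here is a quick way to check the position rule with GNU Grep and its PCRE engine; the sample string is made up so that it contains a literal caret.

```shell
# Made-up sample containing a literal caret.
sample='a^bcd'

# Caret right after the opening bracket: negation.
negated=$(printf '%s\n' "$sample" | grep -oP '[^abc]')

# Caret anywhere else in the class: just another character of the set.
literal_caret=$(printf '%s\n' "$sample" | grep -oP '[a^bc]')

printf 'negated class:\n%s\ncaret as a character:\n%s\n' "$negated" "$literal_caret"
```

The negated class matches everything except `a`, `b`, and `c` (so the caret and `d` here), while the second class matches `a`, `^`, `b`, and `c`.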
I’ve a question I always wanted to ask you: how would you match any character, except the cursed characters `e`, `m`, `a`, `c`, and `s`?
/\v[^smeca]
grep -P '[^smeca]' example.html
You can put the letters in any order, the result will be the same. Always put the character `^` after the opening bracket of the character class, however. Do I repeat myself? Yes, but it’s for our own good.
Shorthand Classes
It can be quite tedious to write all the possible ranges we want in our regexes. Luckily, there are also a bunch of shorthand classes we can use.
We’ll see here the most general shorthands; Vim has many more up its sleeve. We’ll see them in the next article of this series, focusing on Vim specifically. For now, here are the shorthands you can try out:
Character class | Description | Equivalent |
---|---|---|
\s | Whitespace characters | |
\d | Digits from 0 to 9 | [0-9] |
\w | Word characters | [0-9A-Za-z_] |
The uppercase version of these shorthands can be used to negate the character class. For example, to match everything except digits, we can use `\D`, which is equivalent to `[^0-9]`.
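To get a feel for a shorthand and a negated one, here is a small sketch with GNU Grep’s PCRE engine on a made-up string. It shows `\d` next to `\W`, the negated version of `\w`.

```shell
# Made-up sample mixing word characters, digits, punctuation, and a space.
sample='id_42: ok!'

# \d: digits only.
digits=$(printf '%s\n' "$sample" | grep -oP '\d')

# \W: everything which is NOT a word character (so not a letter, digit, or underscore).
non_word=$(printf '%s\n' "$sample" | grep -oP '\W')

printf 'digits:\n%s\nnon-word characters:\n%s\n' "$digits" "$non_word"
```

The second list contains the colon, the space, and the exclamation mark; the underscore isn’t there because it counts as a word character.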
We can also use these common POSIX character classes:
Character class | Description | Equivalent |
---|---|---|
[:alnum:] | Uppercase and lowercase letters, as well as digits | A-Za-z0-9 |
[:alpha:] | Uppercase and lowercase letters | A-Za-z |
[:digit:] | Digits from 0 to 9 | 0-9 |
[:lower:] | Lowercase letters | a-z |
[:upper:] | Uppercase letters | A-Z |
[:blank:] | Space and tab | \t and the space character |
[:punct:] | Punctuation characters (all graphic characters except letters and digits) | |
[:space:] | Whitespace characters (space, tab, newline, carriage return, vertical tab, and form feed) | \t\n\r\v\f and the space character |
[:xdigit:] | Hexadecimal digits | A-Fa-f0-9 |
Let’s do our usual exercises to keep our athletic shape. Using the shorthands described above, how would you:
- Search for dates with the format “2022-01-01”?
- Search for three characters: an uppercase letter, followed by two lowercase ones?
To search for the date:
/\v\d\d\d\d-\d\d-\d\d
grep -P '\d\d\d\d-\d\d-\d\d' example.html
For the uppercase and lowercase letters:
/\v[[:upper:]][[:lower:]][[:lower:]]
grep -P '[[:upper:]][[:lower:]][[:lower:]]' example.html
Here, we need to create three character classes, one for each character.
:help /\[]
:help whitespace
:help [:alnum:]
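Both solutions can be tried without the example file. The sketch below runs them on one made-up line with GNU Grep’s PCRE engine.

```shell
# Made-up line containing a date and two capitalized words.
sample='Released 2022-01-01 by Bob'

# Four digits, a hyphen, two digits, a hyphen, two digits.
dates=$(printf '%s\n' "$sample" | grep -oP '\d\d\d\d-\d\d-\d\d')

# One uppercase letter followed by two lowercase ones.
# Note the double brackets: [:upper:] only works INSIDE a character class.
capitalized=$(printf '%s\n' "$sample" | grep -oP '[[:upper:]][[:lower:]][[:lower:]]')

printf 'dates:\n%s\ncapitalized:\n%s\n' "$dates" "$capitalized"
```

The second regex matches “Rel” (the beginning of “Released”) and “Bob”: nothing in it says that the match should be a whole word.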
Quantifiers
We’ve seen how to match a single character using different metacharacters, like the full stop or character classes. What about matching the same character multiple times?
For example, let’s say that we want to search for every year in our example file. A year is a number with four digits; we could do the following:
/[0-9][0-9][0-9][0-9]
It works, but it’s quite verbose. Imagine if we want to replace a full datetime in a file! We would need to spawn character ranges until our fingers are on fire.
We could improve it by using the shorthands we’ve already seen, but we still need to repeat them a couple of times.
/\d\d\d\d
It’s a good use case for quantifiers. They allow us to match multiple times the character preceding them. Here’s a list of these quantifiers:
Metacharacter | Description |
---|---|
* | Matches the preceding (meta)character 0 or more times. |
+ | Matches the preceding (meta)character 1 or more times. |
= | Matches the preceding (meta)character 0 or 1 time. |
{n,m} | Matches the preceding (meta)character from n to m times. |
{n} | Matches the preceding (meta)character exactly n times. |
{,m} | Matches the preceding (meta)character from 0 to m times. |
Note that the quantifier `=` doesn’t exist in the PCRE world, which uses `?` instead.
Now that you have The Knowledge™, can you rewrite our regular expression to find the years composed of four digits?
/\v[0-9]{4}
grep -P '[0-9]{4}' example.html
Using a shorthand for more swag:
/\v\d{4}
grep -P '\d{4}' example.html
Here’s another exercise: can you replace every single year in the file by “2023”, using a regular expression with a quantifier?
:%s/\v\d{4}/2023/g
We use here the substitution command, allowing us to use regular expressions to match what we want to replace. The flag `g` at the end replaces every match on a line, not only the first one.
If you have Perl installed, you can run PCRE substitution in your shell, as follows:
perl -pe 's/\d{4}/2023/g' example.html
It will output the whole file after the substitution. We can also pipe the above to Grep with the replacement pattern, to only output what was replaced:
perl -pe 's/\d{4}/2023/g' example.html | grep 2023
Another exercise: using shorthands, how would you match a date of the format “2022.01”, with any number of digits (from 1 to infinity) after the dot character `.`?
/\v\d{4}\.\d+
grep -P '\d{4}\.\d+' example.html
We need to escape the dot character `.` here: we want to match the character, not use the metacharacter. You can try not to escape the dot to see the difference.
:help /multi
Grouping and Backreference
Grouping can be useful to repeat different characters using a group and a quantifier, and backreferencing our groups can also be useful to get part of the search pattern and add it to the replacement pattern, when we do some substitutions.
Grouping
We’ve looked at quantifiers above: they are applied to the character (or metacharacter) just before them. What if we want to apply them to multiple different characters? That’s where grouping can be useful.
Let’s say that we want to find all the datetimes containing the digits and characters “00:00:”. We can see that the pattern `00:` is repeated two times, so we can use a group with a quantifier as follows:
/\v(00:){2}
I’ve terrible news! The HTML of our file example.html is incorrect: there are some divs which are empty and never closed!
Using a quantifier with a group, how would you match the characters `<div><div>`? How would you delete them?
/\v(\<div\>){2}
grep -P '(<div>){2}' example.html
In Vim (with the “very magic” `\v`), the characters ‘<’ and ‘>’ can be used as metacharacters for word boundaries; we’ll see them below in this article. Here, we want to match the characters ‘<’ and ‘>’, so we need to escape them. Notice that word boundaries use different metacharacters in the PCRE engine; that’s why we don’t need to escape `<` and `>` with Grep.
We can delete these aberrations as follows:
:%s/\v(\<div\>){2}
perl -pe 's/(<div>){2}//' example.html
Backreference
What if we only want to replace the first two digits of a year? How do we match the whole pattern, but only replace a part of it? That’s where grouping and backreferences come in handy.
Again, to search and replace in our HTML file, let’s use the substitution command in Vim.
First, we need to create one or more groups in the search pattern. Then, we can refer to these groups in the replacement pattern, using the metacharacters `\1` to `\9`. The first group can be referenced using `\1`, the second one with `\2`, and so on.
For example, here’s how you would replace the first two digits of our years:
:%s/\v\d\d(\d\d)/19\1
Let’s look at the replacement pattern `19\1`. We replace here the whole four-digit number matched by `\d\d(\d\d)` with `19`, followed by a backreference `\1` to our first (and only) group. This effectively appends the last two digits captured by the group to `19`.
This is tricky to explain, so let’s try to solve another problem. How would you replace the value of every HTML attribute `style` with `text-align: left`?
:%s/\v(style\=").+"/\1text-align: left"
We can also use Perl to do some substitution using the PCRE engine, and pipe it to Grep to only output what we changed:
perl -pe 's/(style\=").+"/\1text-align: left"/' example.html | grep 'style='
:help /\(
Anchors
Metacharacters are not only meant to match one or multiple characters. We can also match positions. In that case, you don’t include characters in the match; instead, you want your match to target a specific position. Because of that, these metacharacters are said to have “zero-width”.
Anchors belong to this kind of metacharacter. Here are the two most common ones:
Metacharacter | Description |
---|---|
^ | Start of the line |
$ | End of the line |
These metacharacters effectively anchor the matches at these positions. Let’s try it: can you match every single HTML tag at the beginning of a line?
/\v^\<.+\>
grep -P '^<.+>' example.html
Again, the characters `<` and `>` are word-boundary metacharacters in Vim, but not in the PCRE engine.
We can also substitute anchors. In that case, we won’t replace any characters, but insert characters at a given position. It can be interesting to add characters at the beginning of each line, or at the end.
Here’s an interesting exercise: using Vim’s global command, how can you delete every empty line in our example file?
:%g/\v^$/d
The regex `^$` only matches lines which have no character between the beginning and the end of the line; they are the empty lines we want to delete. We run the command `d` (delete) on each of them to… delete them.
An empty line is not necessarily devoid of characters, however. We could have spaces or tabs between the beginning and the end of the line. If we also want to delete these lines, we could run the following:
:%g/\v^[[:space:]]*$/d
All the lines composed only of 0 or more whitespace characters (including spaces and tabs) will be deleted.
:help /^
:help /$
:help /zero-width
Word Boundaries
Word boundaries are another good example of zero-width metacharacters. Let’s first try to search for “vim” in our example file by running the following command in Vim:
/vim
You’ll notice that you’ll match all the words “vim”, but also the substring “vim” in “Neovim”. What if we don’t want to match any substring, that is, we only want to match the word “vim”?
That’s where word boundaries are useful. Using the PCRE flavor, you can use the metacharacter `\b` as follows:
grep -P '\bvim\b' example.html
Vim uses a different syntax. To mark the boundaries of our words, we need to use the metacharacter `<` at the beginning, and the metacharacter `>` at the end. For example, to search for the word “vim” in Vim (excluding the substring “vim”):
/\v<vim>
By the way, how would you match the word ‘of’ (but not the substring ‘of’) in our example file?
/\v<of>
grep -P '\bof\b' example.html
:help /\<
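To see what the boundaries change, here is a minimal sketch comparing the two searches with GNU Grep’s PCRE engine. Without `-o`, Grep prints the whole matching lines; the four lines are made up.

```shell
# Made-up lines: "vim" as a word and as a substring.
sample=$(printf '%s\n' 'vim' 'neovim' 'vimrc' 'my vim setup')

# No boundaries: every line containing the characters v, i, m matches.
anywhere=$(printf '%s\n' "$sample" | grep -P 'vim')

# With boundaries: "vim" must not be glued to other word characters.
whole_word=$(printf '%s\n' "$sample" | grep -P '\bvim\b')

printf 'anywhere:\n%s\nwhole word:\n%s\n' "$anywhere" "$whole_word"
```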
Alternations
The metacharacter `|` allows us to match one of several alternative regexes. For example, if we want to match `class` or `href`, we can do the following:
/\vclass|href
How would you match, using Vim’s search, the HTML class “la” (and not the substring “la”), as well as the class “la-twitter”? You shouldn’t match any other class beginning with “la” and followed by a dash “-”.
/\v<la>[^-]|la-twitter
grep -P '\bla\b[^-]|la-twitter' example.html
:help /\|
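Here is a sketch of both alternations with GNU Grep’s PCRE engine, on a made-up HTML tag. The option `-o` prints each match on its own line, which makes it easier to see which alternative matched.

```shell
# Made-up tag with three classes beginning with "la".
sample='<a class="la la-twitter la-github" href="#">'

# Either "class" or "href".
attributes=$(printf '%s\n' "$sample" | grep -oP 'class|href')

# Either the word "la" followed by anything but a dash, or "la-twitter".
classes=$(printf '%s\n' "$sample" | grep -oP '\bla\b[^-]|la-twitter')

printf 'attributes:\n%s\nclasses:\n%s\n' "$attributes" "$classes"
```

The first match of the second regex is “la” plus the space which follows: the negated character class `[^-]` isn’t zero-width, so it consumes one character. The class “la-github” isn’t matched at all.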
Greedy And Non-Greedy Quantifiers
We’ve already seen what a quantifier is, but we didn’t look at how quantifiers work; and, trust me, we should care about that.
Let’s try to search in our example file using Vim, to match all HTML attributes “name” and their values:
/\vname\=".*"
If you look at the 6th line of the file, you’ll see that we’ve matched `name="twitter:card" content="summary_large_image"`. Why did we also match the HTML attribute “content” and its value here?
All the quantifiers we’ve seen until now are “greedy”. To explain what it means, we need to look at how the regex engine works:
- The engine matches the consecutive characters `n`, `a`, `m`, `e`, `=`, and `"`.
- The engine reaches `.*`. The star `*` is a greedy quantifier, so it will match the preceding character (or metacharacter) as much as possible. Here, the preceding metacharacter is a full stop `.` (matching any character), so it will match every character until the end of the line.
- The engine reaches the last character of the regex, the double quote `"` in our example. Everything until the end of the line is already matched, so it will backtrack and “un-match” everything until it finds the character `"`.
It’s interesting to note that if the greedy quantifier were followed by more characters (instead of a single double quote `"`), it would backtrack, match everything, and backtrack again for each of them!
It’s the same for the PCRE engine, and many other flavors. I’m not lying; here, try it for yourself:
grep -P 'name\=".*"' example.html
Grep also works line by line, but it’s not the case for every tool. Other ones work on multiple lines; if the full stop is also allowed to match newlines there, backtracking won’t happen before the match reaches the end of the file. Needless to say that even more content could be matched in that case.
To solve our problem, we need to use non-greedy quantifiers. In Vim, we’ll need to use the curly brackets notation `{}`, with a hyphen `-` right after the opening bracket.
This notation might look weird to those who are already familiar with other regex engines. With Perl-style flavors, you need to add a question mark after the quantifier itself to get the non-greedy version. For example:
grep -P 'name=".*?"' example.html
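The difference is easier to see with both versions side by side. This sketch runs them with GNU Grep’s PCRE engine on a made-up line which mimics the one from the example file.

```shell
# Made-up line with two attributes on it.
sample='<meta name="twitter:card" content="summary_large_image">'

# Greedy: .* runs to the end of the line, then backtracks to the LAST double quote.
greedy=$(printf '%s\n' "$sample" | grep -oP 'name=".*"')

# Non-greedy: .*? stops at the FIRST double quote it can end on.
lazy=$(printf '%s\n' "$sample" | grep -oP 'name=".*?"')

printf 'greedy: %s\nnon-greedy: %s\n' "$greedy" "$lazy"
```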
Here’s the list of non-greedy quantifiers for Vim, and their greedy counterparts:
Greedy quantifier | Non-greedy quantifier | Description |
---|---|---|
* | {-} | Match the preceding (meta)character 0 or more times. |
+ | {-1,} | Match the preceding (meta)character 1 or more times. |
= | {-0,1} | Match the preceding (meta)character 0 or 1 time. |
{n,m} | {-n,m} | Match the preceding (meta)character from n to m times. |
{n} | {-n} | Match the preceding (meta)character exactly n times. |
{,m} | {-,m} | Match the preceding (meta)character from 0 to m times. |
Now that we have our non-greedy quantifiers for Vim, let’s use them! How would you match the HTML attributes “name”, their values, and their surrounding double quotes?
/\vname\=".{-}"
grep -P 'name=".*?"' example.html
Another, slightly more complicated exercise: how would you match every HTML attribute “property”, their values, and the surrounding double quotes? Another rule you need to follow: the value should begin by “og:” and be followed by at least one character; this character shouldn’t be a double quote.
Said differently, your regex should match `property="og:type"`, but not `property="og:"`.
/\vproperty\="og:[^"]{-1,}"
First, we use the non-greedy quantifier `{-1,}`; its greedy counterpart is `+`.
Second, we don’t use the full stop to match any character, because we don’t want to match any character. We want to match any character except the double quotes. As a rule of thumb, when you want to exclude characters, you need to use a negated character class with the characters you want to blacklist.
In the PCRE world, we need to use the quantifier “+” followed by a question mark to make it non-greedy:
grep -P 'property="og:[^"]+?"' example.html
:help non-greedy
:help /\{-
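To see why the negated character class matters in this exercise, here is a sketch comparing it with the full stop, using GNU Grep’s PCRE engine on two made-up lines. The second line has an empty value after “og:”, so it shouldn’t match.

```shell
# Two made-up lines: a valid property, and one with nothing after "og:".
sample=$(printf '%s\n' \
  '<meta property="og:type" content="article">' \
  '<meta property="og:" content="empty">')

# Negated class: at least one character which is not a double quote.
with_class=$(printf '%s\n' "$sample" | grep -oP 'property="og:[^"]+?"')

# Full stop: the closing quote of the empty value counts as "any character",
# so the match spills over to the next attribute.
with_dot=$(printf '%s\n' "$sample" | grep -oP 'property="og:.+?"')

printf 'negated class:\n%s\nfull stop:\n%s\n' "$with_class" "$with_dot"
```

Even a non-greedy quantifier will match more than we want if the (meta)character it repeats is too permissive.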
Vim Regexes Have Been Unraveled
Even if the metacharacters you can use (and their meaning) depend on the regex engine you use, this article stays general enough to give us a good understanding of the very basics of regular expressions. So, what are the important generalities we should try to remember?
- Regular expressions are powerful to perform some general operations on plain text, using specific text patterns.
- A metacharacter is a character which has a special meaning in a regular expression.
- There are many regex engines (or regex flavors) out there. They are often similar, but they don’t necessarily have the same set of metacharacters. These metacharacters can also have a different meaning.
- The most common type of regex engine is called “Perl-style”, or PCRE.
- Vim’s regex engine isn’t based on PCRE: it shares many metacharacters with it, but there are many differences too.
- Some characters can be used as metacharacters in a specific context, like the caret `^` for negating character classes. Outside a character class, the caret has another meaning.
- There are many different ways to represent the same range in a character class: `0-9` is equivalent to `[:digit:]`, which is equivalent to `\d`.
- Vim uses the characters `<` and `>` for word boundaries, PCRE uses `\b` and… `\b`.
- A greedy quantifier matches as many characters as possible (depending on what precedes the quantifier), and then backtracks until the characters (or metacharacters) following the quantifier can match.
- A non-greedy quantifier matches as few characters as possible: it stops as soon as the characters (or metacharacters) following the quantifier can match.
- Grouping is useful if you need to repeat more than one character multiple times, by using a quantifier after your group.
- You can also backreference a group. It’s useful if you want to reference a group of characters in a replacement pattern for example.
- There are metacharacters which don’t match any character, but instead match a position. These metacharacters are called ‘zero-width’.
- Anchors are good examples of zero-width metacharacters.
You can now appreciate the power of regular expressions to quickly isolate some (more or less) general text patterns in your plain text, to replace them, or to perform any action on them. You can use Vim, Grep, or Perl to do so, three powerful tools which can quickly solve your problems.