This is the second article of a series about regular expressions in Vim.
We’ve seen, in the first article, general metacharacters we can use in our Vim regexes, as well as equivalent Perl-style regexes we can use with other tools (like GNU grep). This time, we’ll dive deeper into Vim’s regex engine by looking at more metacharacters we can use in our favorite editor.
As a reminder, I don’t advise writing 32908309 regexes in your codebases, but instead using regexes with Vim (or other tools) for one-off tasks. A codebase is often a system in constant evolution, and regexes can ramp up in complexity fairly quickly.
We’ll see, in this article:
- How to do some substitutions in Vim using Perl-style regexes.
- What is the “level of magic” of a regex in Vim, and what “level of magic” to choose.
- Some characters class shorthands relying on Vim options.
- What are lookaround assertions in Vim and Perl-style regex engines, and how to use them.
- How to match a pattern on multiple lines.
- How to match a pattern in a visual selection.
I recommend that you actively follow along by trying the different regexes directly in Vim. There are also many exercises throughout the article; you can try to solve them using this example file. Only trying will help you understand and memorize the different metacharacters, even if you don’t try very hard!
Are you ready to dive into the depths of Vim’s regular expressions?
Vim and Perl Regex Engines
We’ve seen, in the first article of this series, some differences between Vim’s regex engine and a Perl-style regex engine (using the PCRE implementation of GNU grep). We can also look at Vim’s help to see the differences between both engines at a glance; to do so, simply run :help perl-patterns.
That’s not all: we can also use Perl’s regex engine directly in Vim, thanks to the :perldo command. It can be useful if you really don’t want to deal with the quirks of Vim’s regexes in your substitutions.
For example, if you want to substitute “regex” with “replacement”, you can run:
:perldo s/regex/replacement/g
You’ll need to have Perl installed on your system for this command to work. If you use Neovim, you’ll also need to install the Perl provider. You can run :checkhealth provider to make sure the provider is indeed available.
:help perl-patterns
:help :perldo
It’s a Kind of Magic
The first article of this series left an elephant in the room: the “level of magic” for Vim regexes. Let’s address this oddity now.
I was recommending prefixing our Vim regexes with \v (making the regex “very magic”) to be able to use every metacharacter available. What does this “very magic” mean? What mechanism is used here?
First, a little reminder: some characters can act as metacharacters in a regex pattern. For example, the character full stop (or dot) . can be interpreted by our regex engine as a literal character; in that case, it will match a dot in your text. It can also be interpreted as a metacharacter; in that case, the full stop means “any character”.
Depending on the “level of magic” of the regex, we might need to escape some literal characters to use them as metacharacters, or to escape some metacharacters to use them as literal characters. To escape a character, you simply need to add a backslash in front of it (for example \.).
As a rule, I always try to avoid escaping any character (literal or metacharacters) in my regexes. Backslashes make the pattern more difficult to read and understand. As a result, I often change the “level of magic” of my regexes depending on what I need: do I want to match more literal characters, or do I want to use mostly metacharacters?
From there, we can ask ourselves many questions:
- What “levels of magic” force us to escape characters to use their metacharacter counterparts?
- What “levels of magic” force us to escape metacharacters to use their literal character counterparts?
- Do we have to escape all characters (or metacharacters), or only some of them?
- How do we change this level of magic for a specific regex?
We can already answer the last question: we need to add a prefix to a regex in order to change its “level of magic”. The following table tries to answer the other questions:
Level of magic | Prefix | Description |
---|---|---|
Very magic | \v | All possible metacharacters are available without escaping them. |
Magic | \m | Only some metacharacters are available without escaping them. The others need to be escaped. |
Nomagic | \M | Only some literal characters are available without escaping them. The others need to be escaped. |
Very nomagic | \V | All possible literal characters are available without escaping them. |
By default, if we don’t add any of these prefixes to our regexes, they’re considered “magic”. That’s a shame, because I think that “magic” and “nomagic” are quite confusing: for the first, you need to escape some (but not all) metacharacters to match the literal ones, and for the second it’s the contrary. It’s not consistent, because not all metacharacters (or literal characters) need to be escaped; only some of them, and you need to learn them by heart.
As a result, I try to avoid “magic” and “nomagic” as much as possible; I always try to add the prefix \v (for “very magic”) or \V (for “very nomagic”) to my regexes.
For “very magic” patterns, most characters are considered as metacharacters by the regex engine, without the need to escape anything. It’s useful when you need to use many metacharacters in your regex. The downside: you’ll need to escape every literal character which would otherwise be interpreted as a metacharacter.
If you want to mostly match literal characters instead of relying on metacharacters, you can use “very nomagic”. This time, most characters will be interpreted as literal characters by the regex engine, without the need to escape anything. The downside: you have to escape any metacharacter you want to use.
To drive the point home, here are some equivalent regexes using different “levels of magic”: “magic”, “very magic”, and “very nomagic”. The first regex tries to match a year with four digits, and the second one tries to match a literal string { Mouseless }.
Magic | Very magic | Very nomagic |
---|---|---|
/[0-9]\{4} | /\v[0-9]{4} | /\V\[0-9]\{4} |
/{ Mouseless } | /\v\{ Mouseless \} | /\V{ Mouseless } |
Let’s solidify our understanding with an exercise: using Vim’s search in our example file, how would you match all HTML attributes rel, their values, and the quotes surrounding them? Would you use a “very magic” or a “very nomagic” pattern?
We first need to match the name of the attribute rel followed by the literal character equal =. Then, we need to use metacharacters to match any number of characters (using the full stop . and the quantifier *) between the literal double quotes ".
Here’s the “very magic” pattern:
/\vrel\=".*"
Here’s the “very nomagic” equivalent:
/\Vrel="\.\*"
We match a literal character which could be interpreted as a metacharacter (the equal character =), and we need to use two metacharacters to match any character between the two double quotes ". As a result, we have more metacharacters than literal characters; the “very magic” pattern is arguably more readable, because we escape fewer characters.
This regex won’t work for every HTML file, however. For example, if there is anything after the rel attribute, it will be matched too, because we use the greedy quantifier *. Here’s the equivalent using a non-greedy one:
/\vrel\=".{-}"
Here’s the “very nomagic” equivalent:
/\Vrel="\.\{-\}"
You can read more about greedy and non-greedy quantifiers in the previous article.
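If you want to see the difference between greedy and non-greedy matching outside of Vim, here’s a minimal sketch using Python’s re module. Python isn’t used anywhere else in this article, so take it as an assumption for illustration only: its non-greedy quantifier is written *? instead of Vim’s {-}, and the sample line is made up (it’s not taken from the example file).

```python
import re

# Hypothetical sample line, invented for this illustration.
line = '<a rel="nofollow" href="https://example.com">link</a>'

# Greedy: .* runs up to the last double quote on the line.
greedy = re.search(r'rel=".*"', line).group()

# Non-greedy: .*? stops at the first closing double quote.
lazy = re.search(r'rel=".*?"', line).group()

print(greedy)
print(lazy)
```

The greedy version swallows the href attribute too, while the non-greedy one stops right after the value of rel.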
All the regexes here work well enough for the task at hand. That’s what we should aim for, instead of trying to craft the best regex which could work on any HTML file. If you want to systematically parse HTML, you should use an HTML parser. Again, I think regexes are best used in one-off tasks.
Here’s another exercise: how would you match any text surrounded by literal stars *? Would you use “very magic” or “very nomagic”?
Using “very magic”:
/\v\*.+\*
Using “very nomagic”:
/\V*\.\+*
In both cases, we want to match two literal characters which could be interpreted as metacharacters (the two surrounding stars), and two metacharacters which could be interpreted as literal characters (the full stop . and the quantifier +). Whatever level of magic you use, you’ll have to escape the same number of characters; it’s a draw!
That said, I still prefer using “very magic” when I use any metacharacter, to stay consistent. When I look at one of my regexes, I know that the characters escaped are literal characters I want to match; I don’t need to think about it.
:help magic
Character Class Shorthands and Vim Options
We’ve already seen some shorthands for character classes in the last article. There are more available, specific to Vim; the characters they include depend on the value of some Vim options:
Character class | Description | Option |
---|---|---|
\f | Filename characters | isfname |
\i | Identifier characters | isident |
\k | Keyword characters | iskeyword |
\p | Printable characters | isprint |
If you look at the value of these options, you might see weird ranges of numbers. For example, if I run :set isfname?, I’ll get @,48-57,/,.,-,_,+,,,#,$,%,~,=. The range 48-57 means that the ASCII characters from 48 to 57 are included (the digits 0 to 9). The characters separated by commas , are also included in the character class.
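If you don’t want to open an ASCII table to check what a range covers, a quick sketch like the following can do it. It’s written in Python, which is my assumption here (nothing in Vim requires it); it only converts the codes 48 to 57 back to characters.

```python
# ASCII codes 48 to 57 (inclusive) should be the ten digits.
digits = "".join(chr(code) for code in range(48, 58))

print(digits)
```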
You can learn more about Vim options in this other article I wrote. Another tip: you can have access to a fancy ASCII table by running man ascii in your terminal.
I’ll finish this short section with a word of caution: these options are sometimes used by other commands or plugins, and changing their values can have unforeseen consequences. For example, the keystroke gf uses the option isfname under the hood.
Vim Lookaround Assertions
What if we want to match a pattern only if it’s before (or after) another pattern? That’s where lookaround assertions enter the chat.
There are two sorts of these assertions:
- Positive lookaround: the pattern you want to match needs to be before (or after) another pattern.
- Negative lookaround: the pattern you want to match shouldn’t be before (or after) another pattern.
We’ll first look at the handiest metacharacters we can use for positive lookaround assertions; then, we’ll talk about other metacharacters for both positive and negative lookaround.
Start and End of the Matched Pattern
We can use two different atoms to mark the beginning or the end of the pattern we want to match:
Metacharacter | Description |
---|---|
\zs | Set the start of the match after \zs; the pattern before it will need to be in the text, but won’t be matched. |
\ze | Set the end of the match before \ze; the pattern after it will need to be in the text, but won’t be matched. |
Here’s an exercise: using Vim’s search and our example file (as always), how would you match all non-empty strings surrounded by double quotes " and beginning with the string application/? We don’t want to match application/ however, just what’s after.
For example, for the attribute type="application/ld+json", we want to match ld+json.
/\v"application\/\zs.{-1,}\ze"
Let’s decompose the pattern:
- \zs - Everything after \zs will be matched, but not what’s before ("application\/).
- \/ - We need to escape the slash / to match it as a literal character; unescaped, it would end the search pattern.
- .{-1,} - We want to match one or more characters between the literal slash / and the closing double quote " in a non-greedy way; {-1,} is the non-greedy equivalent of the quantifier +.
- \ze" - The closing double quote needs to be in the text, but \ze ends the match just before it, so the quote itself is not matched.
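For comparison, here’s a minimal sketch of the same idea with Perl-style syntax, using Python’s re module (my assumption for illustration; the article itself only uses Vim and GNU grep). There’s no \zs or \ze there, so a fixed-length lookbehind and a lookahead play their roles. The sample line is modeled on the attribute mentioned above.

```python
import re

# Sample line built around the attribute used in the exercise.
line = '<script type="application/ld+json">'

# (?<=...) plays the role of \zs: the prefix must be in the text,
# but it isn't part of the match.
# (?=...) plays the role of \ze: the closing quote must follow,
# but it isn't part of the match either.
match = re.search(r'(?<="application/).+?(?=")', line)

print(match.group())
```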
Lookahead and Lookbehind
There are other atoms we can use in Vim to look around (ahead or behind) the pattern we want to match. Here’s the complete list (assuming, as always, that you’re in “very magic” mode):
Metacharacters | Type | Description |
---|---|---|
(<pattern>)@<=<match> | Positive lookbehind | The pattern <match> will be matched only if the pattern <pattern> precedes it. |
<match>(<pattern>)@= | Positive lookahead | The pattern <match> will be matched only if the pattern <pattern> follows it. |
(<pattern>)@<!<match> | Negative lookbehind | The pattern <match> will be matched only if the pattern <pattern> doesn’t precede it. |
<match>(<pattern>)@! | Negative lookahead | The pattern <match> will be matched only if the pattern <pattern> doesn’t follow it. |
The two strings <pattern> and <match> are just placeholders here; you should replace them with what you want.
It’s time for another exercise! Still using the example file, Vim’s search, and the lookahead and lookbehind assertions described above, how would you match every pattern tag followed by a space and the pattern small? Another constraint: the match should not be after the pattern article:.
For example, section:tag small should match, or even tag small, but not article:tag small.
/\v(article:)@<!tag(\ssmall)@=
Some explanations:
- (article:)@<! - We use the negative lookbehind @<! to ensure that the pattern in parentheses is not before the pattern tag we want to match.
- (\ssmall)@= - We use @= to assert that the pattern \ssmall (the literal characters small preceded by a space, \s) needs to be after the pattern tag we want to match.
:help \@=
Equivalent Perl-style Metacharacters
Lookahead and lookbehind assertions are not unique to Vim: you can also use them in Perl-style regexes, like PCRE.
Here’s a table summarizing the equivalent syntaxes:
Vim syntax | PCRE syntax |
---|---|
(<pattern>)@<=<match> | (?<=<pattern>)<match> |
<match>(<pattern>)@= | <match>(?=<pattern>) |
(<pattern>)@<!<match> | (?<!<pattern>)<match> |
<match>(<pattern>)@! | <match>(?!<pattern>) |
As you can see, the two syntaxes have some similarities; learning one can help to memorize the other.
Let’s practice with this exercise: using GNU grep with the PCRE engine (using the option -P), how would you solve the previous exercise?
grep -P '(?<!article:)tag(?=\ssmall)' example.html
Here’s the Vim version:
/\v(article:)@<!tag(\ssmall)@=
You can also substitute the match using :perldo with a Perl-style regex, as we’ve seen at the beginning of this article. For example:
:perldo s/(?<!article:)tag(?=\ssmall)/another-thing/
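To check how these assertions behave, here’s a small sketch in Python, whose re module uses the same (?<!...) and (?=...) syntax as PCRE. Python is my assumption for illustration only; the three sample strings come from the exercise above, and the replacement string is the one used with :perldo.

```python
import re

# Negative lookbehind (?<!article:) and positive lookahead (?=\ssmall).
pattern = re.compile(r'(?<!article:)tag(?=\ssmall)')

samples = ["section:tag small", "tag small", "article:tag small"]

# For each sample, record whether the pattern matches somewhere.
results = {sample: bool(pattern.search(sample)) for sample in samples}

# Only the word "tag" is replaced: assertions never consume text.
replaced = pattern.sub("another-thing", "section:tag small")

print(results)
print(replaced)
```

Notice that the space and small survive the substitution: the lookahead only checks that they’re there.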
The PCRE metacharacter \K can also be used for lookbehind assertions. In fact, you have to use it if you want to have an assertion which is not fixed in length.
For example, let’s say that you only want to output the content of your <h1> tag. You could try something like this:
grep -Po '(?<=<h1.*>).*(?=</h1>)' example.html
It doesn’t work, warning you that lookbehind assertion is not fixed length. It’s because you use the quantifier * in the assertion itself. But the following works:
grep -Po '<h1.*>\K.*(?=</h1>)' example.html
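The fixed-length restriction isn’t specific to grep. As a hedged illustration, here’s a sketch with Python’s built-in re module (an assumption on my part, it’s not a tool used in this article): it also rejects a lookbehind containing *, and it has no \K, so the sketch swaps in a different technique, a capture group, to get the content of the tag. The sample line is made up.

```python
import re

# Hypothetical sample line, invented for this illustration.
line = '<h1 class="title">A Vim Guide</h1>'

# A lookbehind with a quantifier of variable length is refused at compile time.
try:
    re.compile(r'(?<=<h1.*>).*(?=</h1>)')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround (not \K): match the opening tag normally,
# and capture only what follows it.
match = re.search(r'<h1.*?>(.*)(?=</h1>)', line)
content = match.group(1)

print(lookbehind_ok)
print(content)
```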
Regexes Matching On Multiple Lines
Until now, every regex we’ve written has been limited to a single line. We can match the same pattern on different lines, but we can’t yet create a pattern which matches multiple lines.
There are other metacharacters we can use in Vim for that purpose. Many of them are similar to the metacharacters operating on a single line, except that they’re prefixed with \_. It means that these metacharacters will also match end-of-lines:
Metacharacter | Description |
---|---|
%^ | Match the beginning of the file. |
%$ | Match the end of the file. |
\n | Match an end-of-line. It can be used in a character class. |
\_. | Match any character (including end-of-lines). |
\_[] | Match a character class as well as end-of-lines. |
\_^ | Match any start-of-line; not only for the current line, but any other line included in the match. |
\_$ | Match any end-of-line; not only for the current line, but any other line included in the match. |
We can also prefix character class shorthands with \_. For example, if you want the character class \s (for whitespaces) to match multiple lines (that is, it also includes end-of-lines), you can use \_s.
Let’s try to solve some exercises, shall we? Using Vim’s search in our example file, how would you select the opening and closing HTML tag <head>, as well as every line in between?
/\v\<head\>\_.*\<\/head\>
The atom \_. matches any character, including end-of-lines. Said differently, it will match any character (exactly like the full stop .) but on multiple lines.
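Perl-style engines have the same distinction, even if the syntax differs. Here’s a minimal sketch in Python (my assumption for illustration), where the full stop doesn’t match a newline unless we pass the flag re.DOTALL; that flag makes . behave roughly like Vim’s \_. does. The HTML snippet is invented.

```python
import re

# Hypothetical document, invented for this illustration.
html = "<html>\n<head>\n<title>Example</title>\n</head>\n<body></body>\n</html>"

# By default, . doesn't match a newline: no match across lines.
single_line = re.search(r'<head>.*</head>', html)

# With re.DOTALL, . also matches newlines, like \_. in Vim.
multi_line = re.search(r'<head>.*</head>', html, re.DOTALL)

print(single_line)
print(multi_line.group())
```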
Another exercise: how would you match HTML link tags a and their possible attributes, but only when they don’t have any inner HTML? That is, there is nothing except possible spaces, tabs, or newlines between the opening tag and the closing one.
/\v\<a.*\>\_s+\<\/a\>
We use \_s here because it includes spaces, tabs, and end-of-lines; if we had used \s, it would have only matched spaces and tabs on the same line.
If you’re not tired of exercises yet, here’s another one: what about matching list tags li? More precisely, we want to match the opening tag (and its possible attributes), the closing tag, and everything else in between. Let’s add more constraints: we don’t want anything else on the line of the opening tag, and we don’t want to match list tags which are only on one line.
For example, the following is a valid match:
<li class="nav-opened" role="presentation">
<a href="https://thevaluable.dev/tags/fundamentals">Fundamentals</a>
</li>
The following is not a valid match, because the opening tag is not alone on its line; there’s also the string “Some content on the same line”.
<li class="nav-opened" role="other">Some content on the same line
<a href="https://thevaluable.dev/tags/others">Others</a>
</li>
/\v\<li [^<]+\>\_$\_.{-}\<\/li\>
Some explanations:
- [^<] - Ensure that we don’t have any other opening tag on the line. It’s basically saying “I want to match any character except <”.
- \_$ - Ensure that the opening list tag is at the end of the line. Contrary to $, \_$ can be used anywhere in the pattern, which allows us to continue our match on the lines below.
- {-} - This is the non-greedy equivalent of *. With the greedy version, the match would continue until the end of the file, and then backtrack to the last closing list tag the engine can find. That would match too much!
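To test this logic against the two examples above, here’s a rough Perl-style translation in Python (an assumption for illustration, not an exact equivalent of the Vim pattern). The flag re.MULTILINE lets $ match at the end of any line, re.DOTALL lets . cross lines, and I exclude \n from the character class because, unlike in Vim, a negated class would otherwise match newlines.

```python
import re

valid = (
    '<li class="nav-opened" role="presentation">\n'
    '<a href="https://thevaluable.dev/tags/fundamentals">Fundamentals</a>\n'
    '</li>'
)
invalid = (
    '<li class="nav-opened" role="other">Some content on the same line\n'
    '<a href="https://thevaluable.dev/tags/others">Others</a>\n'
    '</li>'
)

# Opening tag alone on its line, then everything up to the first </li>.
pattern = re.compile(r'<li [^<\n]+>$.*?</li>', re.MULTILINE | re.DOTALL)

valid_match = pattern.search(valid)
invalid_match = pattern.search(invalid)

print(valid_match is not None)
print(invalid_match is not None)
```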
A last exercise: how would you match the first opening HTML link tag a and its possible attributes in our example file?
/\v%^\_.{-}\zs\<a.{-}\>
The usual explanations:
- %^ - We want to search from the beginning of the file, to be sure we match the first link tag. Note that we need to escape this atom if we’re not in “very magic” mode (\%^).
- \_.{-} - We match all characters on multiple lines, using a non-greedy quantifier.
- \zs\<a - We don’t want to match everything from the beginning of the file until the first tag, that’s why we only begin the match when we see the first link tag.
- .{-}\> - Ensure that we also match the possible attributes of the link tag.
:help \%^
:help \%$
:help \_
Only Matching the Visual Selection
What about matching a pattern, but only if it’s visually selected? Here’s what you need:
Metacharacter | Description |
---|---|
%V | Match inside the VISUAL mode selection (or the previous one if you’ve nothing selected). |
Let’s look at a real-life example: when I try to rename a file in my shell (Zsh), I often end up with this kind of command:
mv my-file-name.jpg my-file-name.jpg
I simply type mv in the shell, and then I use Zsh completion to get the filename I want to rename. I insert it twice, so I can modify the second filename instead of typing it entirely. I then use a keystroke to directly edit this command in Vim.
The goal here is to rename the second string my-file-name.jpg to my_file_name.jpg, replacing the hyphens - with underscores _. I could do that manually, but I think using a substitution is easier and less prone to errors.
Here’s what I tried at the beginning:
- Switch to VISUAL mode.
- Select the second my-file-name.jpg.
- Run the following:
:'<,'>s/\V-/_/g
Too bad: it doesn’t work. This substitution will replace every hyphen on the line because of the g flag. It doesn’t matter what’s selected: the range '<,'> only designates lines, so the substitution command always operates on whole lines. If I don’t use the g flag, it will only substitute the first hyphen.
That’s where %V can be useful:
:'<,'>s/\v%V-/_/g
It works as expected: the pattern only matches what I’ve selected in VISUAL mode, thanks to the atom %V.
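If it helps to picture what restricting a substitution to the selection means, here’s a tiny sketch in Python (purely my assumption for illustration; Vim doesn’t work with Python slices). The hypothetical variables start and end stand for the boundaries of the visual selection; only the text between them is modified.

```python
line = "mv my-file-name.jpg my-file-name.jpg"

# Hypothetical selection: the second filename, up to the end of the line.
start = line.rindex("my-file-name.jpg")
end = len(line)

# Replace hyphens only inside the "selected" slice; the rest is untouched.
result = line[:start] + line[start:end].replace("-", "_") + line[end:]

print(result)
```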
Here’s the last exercise: on line 74 of the example file, how would you substitute nav-and-other with nav_and_other, without modifying any other attribute in the h3 tag?
First, we need to switch to VISUAL mode and select nav-and-other. We can then switch to COMMAND-LINE mode and run the following:
:'<,'>s/\v%V-/_/g
Note that you need to escape the atom %V if you’re not in “very magic” mode. For example:
:'<,'>s/\%V-/_/g
:help \%V
Ready for Vim’s Regexes?
We’ve seen, throughout this series, the most important metacharacters we can use in our Vim regexes, from the basics to the more complicated ones. Now, instead of writing a script to parse some plain text files, you can directly open them in Vim and use some good regexes! You’ll also be able to use tools using a Perl-style regex engine; it’s not that different, in principle, from the Vim regexes!
So, what did we see in this article?
- We can make some substitutions using Perl-style regexes directly in Vim, thanks to the Ex command :perldo. Handy if you’re allergic to Vim’s regex engine.
- A Vim regex can have different levels of magic: “very magic” (using the prefix \v) gives you all metacharacters without the need to escape anything, “very nomagic” (using the prefix \V) gives you all literal characters without the need to escape anything.
- There are a couple of Vim character class shorthands which depend on the value of some Vim options.
- We can try to match a pattern if another pattern is just before or after. To do so, we can use \zs (before the start of the pattern for a lookbehind assertion) or \ze (after the end of the pattern for a lookahead assertion).
- Lookaround assertions are not exclusive to Vim: we can use some similar syntax with a Perl-style regex engine, like PCRE.
- Vim introduced, in Vim 6, the possibility to match a pattern on multiple lines. Most of these new atoms begin with \_.
- We can try to match our regexes in a visual selection only, using the atom %V (\%V if we’re not in “very magic” mode).
I’m curious: what are the most useful regexes you often use in Vim? Don’t hesitate to share with the community in the comment section just below! You know, sharing is caring.