Measuring Software Complexity: What Metrics to Use?

Should we rebuild everything too complex?

This article is part of a series about complexity metrics.

“This part of the codebase doesn’t feel right!”

This was Dave, your fellow developer, arguing in yet another never-ending meeting for rewriting a part of your company’s codebase. His arguments? Technical debt, high entropy, and the fear of the legacy system.

Our work as developers pushes us to make many decisions, from the architectural design to the code implementation. How do we make these decisions? Most of the time, we follow what “feels right”; that is, we rely on our intuition. It comes from our experience, an important source of information. But the same intuition can be the source of many problems too. We’re humans, and we’re subject to many biases leading to wrong assumptions.

That’s why, to make important decisions, like rewriting a whole chunk of a codebase, we need more unbiased information. Measuring complexity can bring the information we need to be sure we’re headed in the right direction. That’s the subject of this article: how do we measure this complexity? More precisely, we’ll see:

How complexity metrics can be complementary to experience and experimentation.
What we’re aiming for when we’re measuring complexity.
What the most common complexity metrics are, and what their limitations are.

The goal of the article is to give you a broad introduction to the different complexity metrics you can use. As a result, I’m aiming for a general overview without going into too many details.

I’ll speak mostly about modules here, which can be seen as units of code. A function, a class, an object, a package, a service (or microservice) can be a module.

This article is also the first of a series about measuring complexity. In the following articles, we’ll see more concrete examples of how to combine these metrics to get the information we want.

Now, prepare your favorite beverage, get your measuring tools, and let’s go!

Ask Your Questions First

All sorts of data are widely available in our information-driven world. We can look at dashboards full of trendy metrics for everything and anything, all day long, if we want to. It seems that, with these metrics, we’re searching for something, but we’re often not really sure what. Maybe the pleasure of seeing that the cyclomatic complexity of our codebase is low? The thrill of seeing more and more people visiting our landing page?

Looking at metrics without having any question in mind is like searching in a massive haystack to see if there are some needles, somewhere, even if in reality we need plastic ducks. Instead, we should define the problems we want to solve, and only then try to find the information we need to solve them. In short, metrics should be used mainly to inform our decisions.

We, as beautiful humans, are very sensitive to numbers. We like it when they go up. It gives us a good feeling of accomplishment, even if the metrics don’t provide us with useful information for making decisions. These metrics are commonly called vanity metrics. It’s very easy to fall into this trap; that’s another reason why it’s important to explore the problem space first, before deciding what metrics to look at.

So let’s ask: by measuring complexity in our codebase, what problems are we trying to solve? Two things, mainly:

Finding the complex parts of our codebase to make the code easier to understand and reason about.
Preventing new bugs from creeping into the codebase.

The first point is important: our brain didn’t evolve to create an accurate mental model of a codebase with hundreds of different states, processes, and loopholes. According to the cognitive load theory, our working memory can’t retain many pieces of information at once. That’s why reducing the complexity of our systems is essential.

According to this study and many others, the defects in a codebase obey the Pareto principle; that is, most defects (80% to 100% of them) often come from a few modules (10% to 20% of the codebase). Measuring the complexity of a codebase can help us isolate these problematic modules.

The second point is not easier to achieve. Predicting the future evolution of complex systems is far from being a solved problem. As such, the goal is not to acquire absolute certainty about the modules which could bring horrible bugs, but to reduce the uncertainty.

Reducing the Uncertainty

Measuring

The definition of measuring from the book How to Measure Anything is on point:

A quantitatively expressed reduction of uncertainty based on one or more observations.
Douglas W. Hubbard

We often think of measuring as an action providing certainty, but it’s not the case. You can find certainty in the abstract world of Mathematics but, in our world, nothing is certain. Said differently, we don’t need perfect formulas, but imperfect heuristics to get the information we need. The goal is to get some clues about what to do next: what part of the codebase we need to refactor or, at least, to monitor more closely for future bugs.

Research about measuring complexity in codebases has been ongoing since the 1970s. It’s not a solved problem either. That’s partly why measuring is useful to isolate problematic code, but not enough to know what to do with it. We also need to use the information we acquired in the past through our own experience, the experience of others, past measurements, and possibly past experiments.

The difficulty is, of course, that the ‘real world’ is not entirely formal, in the sense that we cannot model it with precise mathematical relationships. The best we can hope for is engineering approximations.
Martin John Shepperd

Experimenting

Another way to acquire more information and reduce our uncertainty is by experimenting. I advocate experimentation strongly in many articles in this blog, because it can give us immediate feedback on what works and what doesn’t. It can show us the wrong assumptions and cognitive biases we have.

Sometimes, experimenting doesn’t give us more information. It might be seen as a waste of time. But it’s also the most efficient way to see quickly if we’re heading in the right direction.

Experimenting can mean trying something you’ve never tried before, like creating a small system to verify some assumptions. Don’t hesitate to throw the system away if it’s not useful anymore. We’re searching for information here; we’re not trying to write the next unicorn.

Using Our Experience

Our experience can also be used to fill the gaps in the information we have and to reduce the uncertainty even further. If our reasoning was good enough, and our memory not too fuzzy, it can help us make good decisions.

This is an important piece of the puzzle of the decision process. Be careful with it, however: again, we have many biases. If we don’t keep in mind that our brain likes to take shortcuts, we might end up on the wrong reasoning path.

It’s also useful to keep in mind that your experience is not necessarily better than anybody else’s.

Measuring Complexity With Code Metrics

Let’s look now at the popular metrics used to measure complexity in our codebases and their limits.

The Halstead Metrics

If you search for code metrics, you’ll stumble upon the Halstead Metrics. Even if they’re not always explicitly used in your favorite static analysis tools, they’re often used under the hood to calculate some sort of “maintainability index” or “complexity index”.

These metrics were invented in 1977 by Maurice Halstead, at a time when almost everything was procedural. Codebases were often written in COBOL in a couple of files. Functions (or, more precisely, procedures) were the main constructs used for abstraction.

A codebase, for Halstead, was a sequence of two types of tokens: operators and operands. All his metrics are based on this simple idea. For example, in the operation 1 + 2, 1 and 2 are operands, + is an operator.

From there, we can calculate the following metrics for a specific module:

Length - Total number of operators and operands
Vocabulary - Number of unique operators plus number of unique operands
Difficulty - (unique operators / 2) * (total operands / unique operands)

These metrics are the basis for other, more meaningful ones:

Halstead Volume - How much information the reader has to absorb to understand the code.
Halstead Effort - Amount of effort needed to rewrite a codebase (excluding all the related work, like understanding specifications).
Halstead Bugs - An estimate of how many bugs there are in the system.
Halstead Time - How much time is needed to rewrite a codebase. This one is heavily criticized and never used in practice.
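To make these definitions more concrete, here is a minimal sketch of how they are usually computed. It assumes the operators and operands have already been counted (which is the hard part, as we’ll see), and it uses the commonly cited formulas for volume, effort, time, and bugs. The function name `halstead_metrics` and its parameters are mine, not part of any tool.

```python
import math


def halstead_metrics(n1: int, n2: int, N1: int, N2: int) -> dict[str, float]:
    """Compute the Halstead metrics from token counts.

    n1, n2: number of unique operators / unique operands
    N1, N2: total number of operators / total number of operands
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume
    return {
        "length": length,
        "vocabulary": vocabulary,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "time_seconds": effort / 18,  # commonly cited constant
        "bugs": volume / 3000,        # commonly cited constant
    }


# The operation `1 + 2`: one operator (+) and two operands (1 and 2).
metrics = halstead_metrics(n1=1, n2=2, N1=1, N2=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The arithmetic is trivial; everything depends on how the four counts are obtained.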

Don’t take these metrics at face value. In my experience, they’re not the best ones to measure complexity, mainly because it’s hard to decide what the operators and operands are in modern languages. Take the following code:

<?php declare(strict_types=1);

class Parser
{
    public function count(string $filepath) {
        printf("I'm counting lines of %s pretty hard!", $filepath);
    }
}

What’s the token ;? An operator? An operand? It has semantics, and therefore should contribute to the amount of information a reader has to absorb, for example. What about the token <?php? class?

Halstead stays vague on the subject. According to him, the distinction between operators and operands should be “intuitively obvious”. This is surprising: as we saw, we shouldn’t use our intuition when we begin to measure something. The goal is to have unbiased information.

Because of this ambiguity, static analysis tools count these tokens in different ways. You can have two different results on the same codebase for the same metrics, depending on the tools you’re using.

That’s not all. The interpretation of the results often differs from one tool to another, too. For some, the Halstead Length is a measure of complexity; for others, it’s just the number of operators and operands.

I also wonder if there is more information emerging from the combination of these operators and operands, information not appearing when we only count the different tokens of the codebase. For example, a system “A” with 10 tightly coupled classes is not the same as a system “B” with 10 independent classes. Yet, there is the same number of classes in both systems.

In the end, be wary when you see complexity metrics built on the Halstead metrics. They can indicate possible problems in a codebase but, due to their limitations, further inspection is often necessary. Personally, I don’t rely on them at all.

The Cyclomatic Complexity

Ah! The mythical cyclomatic complexity. You’ll see this one in every static analysis tool you can imagine.

It was developed by Thomas McCabe and popularized in 1976 thanks to the paper A Complexity Measure. McCabe’s goal was to propose a metric for complexity using the control flow graph of a module.

At that time, complexity was often measured by counting the lines of code (LOC). McCabe wanted to create something more accurate to replace this metric. The main goal was to inform when to modularize a piece of a system, for the resulting modules to be more maintainable and testable.

To understand how it works, let’s take this example:

<?php

class Cyclo
{
    public function cyclo()
    {
        for ($i = 0; $i < 10; $i++) {
            if ($i == 2 || $i == 4) {
                echo 'Hello!';
            } else {
                echo 'Bye!';
            }
        }
        echo 'this is the end';
    }
}

To find out the cyclomatic complexity of our method cyclo, we would need to draw its control flow graph, count the edges (E) and the nodes (N), and apply McCabe’s formula, which is E - N + 2 for a single method. The result is the number of linearly independent paths the program can take at runtime, which gives us an idea of how many test cases we need to exercise its branches.

But there’s an easier way:

Count the number of branch points in the code (if, while, for, and boolean operators like && or ||).
Add 1.

What are the branch points in this example?

The for loop
$i == 2
$i == 4

Add one to that and you get the fantastic cyclomatic complexity of 4. Wonderful. A question remains, however: what result do we need for our code to be considered too complex? According to McCabe, the “reasonable upper limit” is 10, without much more explanation.
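The counting shortcut is simple enough to sketch in a few lines. The following is a deliberately naive approximation, not how real static analysis tools work: it scans the source text with a regular expression, so it would also count keywords appearing in strings or comments. The list of keywords (including `elseif`, `foreach`, and `case`) and the function name are my own assumptions.

```python
import re

# Branch points: conditionals, loops, and short-circuit boolean operators.
BRANCH_POINT = re.compile(r"\b(?:if|elseif|while|for|foreach|case)\b|&&|\|\|")


def cyclomatic_complexity(source: str) -> int:
    """Approximate the cyclomatic complexity: branch points + 1."""
    return len(BRANCH_POINT.findall(source)) + 1


PHP_SNIPPET = """
for ($i = 0; $i < 10; $i++) {
    if ($i == 2 || $i == 4) {
        echo 'Hello!';
    } else {
        echo 'Bye!';
    }
}
echo 'this is the end';
"""

print(cyclomatic_complexity(PHP_SNIPPET))  # 4: for, if, || and the added 1
```

On the body of our cyclo method, it finds the same three branch points and reports 4.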

This is where we begin to stumble upon the drawbacks of the cyclomatic complexity: there is no empirical evidence that this upper bound is meaningful. More generally, there is no evidence that the number of paths of a codebase is correlated with its complexity.

Additionally, the cyclomatic complexity doesn’t take into account the nesting of the different branch points. In my experience, confusion can arise quickly when you have nested conditionals, much more than from the number of conditionals by itself.

Other studies (like this one) also show that the cyclomatic complexity is not correlated with the number of defects. Additionally, even if McCabe wanted to stop counting lines of code to measure complexity, it has been shown that LOC is correlated with the cyclomatic complexity. In some situations, measuring these lines of code is even better! What’s the point of measuring the cyclomatic complexity in that case?

Here’s my advice: take the cyclomatic complexity into account when it’s really high (more than 20 or 30). In most cases, look at the LOC instead.

Counting Lines of Code

Counting the lines of code of a module is one of the easiest complexity metrics you can compute. Modules with a high LOC are possibly complex modules. It might indicate that they have too many responsibilities, and that these responsibilities shouldn’t be entangled together (lack of cohesion).

Another advantage of the LOC metric: you can use the same tools to calculate it whatever the programming language you’re using. Many tools, like cloc for example, can do that for you.

Keep in mind, however, that this metric works best when you have modules which are much bigger than the others. Additionally, a codebase with small modules doesn’t mean that it’s a codebase without complexity.
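If you want to see what “much bigger than the others” could look like in practice, here is a small sketch. It counts non-blank lines (comments included, for simplicity) and flags the modules whose LOC exceeds a multiple of the median. The factor of 3 is an arbitrary assumption of mine, and the file names are made up for the example.

```python
from statistics import median


def count_loc(source: str) -> int:
    """Count the non-blank lines of a module (comments are not excluded)."""
    return sum(1 for line in source.splitlines() if line.strip())


def loc_outliers(modules: dict[str, str], factor: float = 3.0) -> list[tuple[str, int]]:
    """Return the modules whose LOC is above `factor` times the median LOC."""
    loc = {name: count_loc(source) for name, source in modules.items()}
    threshold = factor * median(loc.values())
    big = [(name, lines) for name, lines in loc.items() if lines > threshold]
    return sorted(big, key=lambda item: item[1], reverse=True)


# Hypothetical modules, given as name -> source code.
modules = {
    "small_a.php": "<?php\necho 1;\n",
    "small_b.php": "<?php\n\necho 2;\necho 3;\n",
    "huge.php": "<?php\n" + "echo 'x';\n" * 40,
}

print(loc_outliers(modules))
```

Comparing against the median rather than a fixed number keeps the sketch independent of the size of the codebase.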

Code Shape

Are you afraid when the code you’re reading is unexpectedly attracted by the right side of your screen? You should be. In many common programming languages, heavily indented lines often indicate nested structures (like conditionals).

I like to think of complexity as many different modules intertwined with each other, forming a blob of chaos. Nested constructs are definitely in this ballpark.

It’s pretty easy to measure, too: you just have to count the levels of logical indentation (spaces or tabs) of a module, line by line. You can then find out the maximum level of nesting in a file and the total amount of nesting, and compare this information with other, healthier modules.
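Counting indentation can be sketched as follows. This assumes the code is consistently formatted, with one logical level equal to four spaces or one tab; both the `indent_width` parameter and the function names are my own choices.

```python
def nesting_levels(source: str, indent_width: int = 4) -> list[int]:
    """Return the logical indentation level of every non-blank line."""
    levels = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no shape information
        expanded = line.expandtabs(indent_width)
        leading_spaces = len(expanded) - len(expanded.lstrip(" "))
        levels.append(leading_spaces // indent_width)
    return levels


def code_shape(source: str, indent_width: int = 4) -> dict[str, int]:
    """Summarize the shape of a module: maximum and total nesting."""
    levels = nesting_levels(source, indent_width)
    return {"max_nesting": max(levels, default=0), "total_nesting": sum(levels)}


SAMPLE = "\n".join([
    "foreach ($links as $url) {",
    "    if ($url) {",
    "        if ($headers) {",
    "            $type = 'x';",
    "        }",
    "    }",
    "}",
])

print(code_shape(SAMPLE))
```

Since it only looks at whitespace, the same sketch works for any language formatted this way.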

Consider the following code:

<?php

foreach ( (array) $post_links as $url ) {
    $url = strip_fragment_from_url( $url );

    if ( '' !== $url && ! $wpdb->get_var( $wpdb->prepare( "SELECT post_id FROM $wpdb->postmeta WHERE post_id = %d AND meta_key = 'enclosure' AND meta_value LIKE %s", $post->ID, $wpdb->esc_like( $url ) . '%' ) ) ) {

        $headers = wp_get_http_headers( $url );
        if ( $headers ) {
            $len           = isset( $headers['content-length'] ) ? (int) $headers['content-length'] : 0;
            $type          = isset( $headers['content-type'] ) ? $headers['content-type'] : '';
            $allowed_types = array( 'video', 'audio' );

            // Check to see if we can figure out the mime type from the extension.
            $url_parts = parse_url( $url );
            if ( false !== $url_parts && ! empty( $url_parts['path'] ) ) {
                $extension = pathinfo( $url_parts['path'], PATHINFO_EXTENSION );
                if ( ! empty( $extension ) ) {
                    foreach ( wp_get_mime_types() as $exts => $mime ) {
                        if ( preg_match( '!^(' . $exts . ')$!i', $extension ) ) {
                            $type = $mime;
                            break;
                        }
                    }
                }
            }
        }
    }
}

This is an extract from the WordPress source code. Our assumptions are correct here: we can all agree that this code is quite complex. I’ve found it by:

Looking at files with the highest LOC metric.
Looking at lines having more indentations than the others.

Simple metrics like LOC and code shape can give you more information than more complex ones.

Coupling and Cohesion

Things which belong together should be together (cohesion); if not, they should be independent (not coupled).

We’re now leaving the nitty-gritty details of the implementation itself to zoom out a bit and look at the architecture. As we already saw, complexity emerges from many elements intertwined with each other. The number of elements and their relationships make the complexity; if everything was independent in our systems, we wouldn’t have a hard time modifying anything we want.

Modules should be independent of each other as much as possible. At the same time, we need to group the things which evolve together, to increase the cohesion of our code, inside the modules themselves. Analyzing the coupling in our codebase can show us where it could be avoided.

If you want to dive deeper into both cohesion and coupling, I’ve written another article about them.

This study classifies coupling into four different categories I find quite useful.

Structural Coupling: Static Analysis of the Codebase

A coupling is structural when it can be found directly by analyzing the codebase (static analysis). In other words, you don’t need to run your code to find it.

Unfortunately, the tools to calculate the different levels of coupling in codebases are often language-dependent, contrary to other metrics like LOC or code shape.

Here are possible categories of coupling you can find:

Content coupling - Modules accessing the content of each other.
Common coupling - Modules mutating common variables with a bigger scope (like global variables).
Control coupling - Modules controlling the logic of other ones.
External coupling - Modules exchanging information with one another using an external means, like a file.
Stamp coupling - Modules exchanging elements, but the receiving end doesn’t act on all elements. For example, an array passed to a module which doesn’t use all the array’s elements.
Data coupling - Modules exchanging elements, and the receiving end uses all of them.

These types of coupling are supposedly ranked from worst to best, even if I believe that content coupling is better than common coupling, because it’s more difficult to know what is affected by global constructs.

These categories are useful while writing our code and while analyzing it, to understand what kind of coupling we’re creating.
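To give a feeling for the difference between some of these categories, here is a small illustrative sketch. Every name in it (`TAX_RATE`, `price_with_tax_common`, and so on) is made up for the example; it only contrasts common, stamp, and data coupling.

```python
# Common coupling: the function silently depends on a shared, mutable global.
TAX_RATE = {"value": 0.25}


def price_with_tax_common(net: float) -> float:
    return net * (1 + TAX_RATE["value"])


# Stamp coupling: the whole order is passed, but only one field is used.
def shipping_cost_stamp(order: dict) -> float:
    return 5.0 if order["weight_kg"] <= 1 else 9.0


# Data coupling: the function receives exactly what it needs, nothing more.
def shipping_cost_data(weight_kg: float) -> float:
    return 5.0 if weight_kg <= 1 else 9.0


order = {"id": 42, "customer": "Dave", "weight_kg": 0.5}
print(price_with_tax_common(100.0))
print(shipping_cost_stamp(order))
print(shipping_cost_data(order["weight_kg"]))

# Changing the global changes the behavior of every function reading it.
TAX_RATE["value"] = 0.5
print(price_with_tax_common(100.0))
```

The last two lines show why global constructs are hard to reason about: nothing in the signature of `price_with_tax_common` tells you that its result can change.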

What about OOP? Since the rise of the paradigm, many metrics have been designed specifically around it. We’re not speaking about general “modules” anymore, but about classes and objects:

CBO (Coupling Between Objects) - How much one object acts upon another.
CBE (Coupling Between Elements) - A more precise variation of the CBO. It considers two (or more) elements coupled if there is any dependency between them, like accessing or modifying each other’s implementation details.
CTM (Coupling Through Message passing) - Measures the number of messages sent by a considered class to the other classes in the system.
IC (Inheritance Coupling) - Calculates the coupling due to inheritance. I’ve already written about coupling and inheritance in another article.

Dynamic Coupling: The Coupling at Runtime

Dynamic coupling concerns every coupling happening at runtime, like dynamic binding or polymorphism. These metrics don’t differ much from the structural metrics seen above, except that they can’t be computed statically (from the code alone).

In my experience, you shouldn’t need these metrics very often, except if you tried to generalize everything in your codebase using polymorphism. It’s definitely something I wouldn’t advise.

Logical Coupling

Modules coupled logically are modules changing together frequently, even if there is no structural coupling between them. To find out logical coupling, we need to look at historical information, like Git’s history for example.

A tool like code-maat can give you this logical coupling by analyzing files committed together. The assumption is that if the same files are part of multiple commits, changing one obliges you to change the others. As a result, they’re logically coupled.
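The idea can be sketched in a few lines. To be clear, this is not code-maat’s algorithm, only a simplified illustration of the same assumption: it takes a list of commits (each one being the list of files it touched, which you could extract from the output of `git log --name-only`), counts how often two files change together, and divides by the number of revisions of the least changed file. The `min_shared` threshold and the file names are assumptions of mine.

```python
from collections import Counter
from itertools import combinations


def logical_coupling(commits: list[list[str]], min_shared: int = 2):
    """Return (file_a, file_b, shared_commits, degree) for co-changing files."""
    revisions = Counter()  # how many commits touched each file
    shared = Counter()     # how many commits touched both files of a pair
    for files in commits:
        unique = sorted(set(files))
        revisions.update(unique)
        shared.update(combinations(unique, 2))

    result = []
    for (a, b), count in shared.items():
        if count < min_shared:
            continue  # a single shared commit is probably noise
        degree = count / min(revisions[a], revisions[b])
        result.append((a, b, count, round(degree, 2)))
    return sorted(result, key=lambda row: (-row[2], row[0], row[1]))


# Hypothetical history: each inner list is the set of files of one commit.
history = [
    ["Order.php", "Invoice.php"],
    ["Order.php", "Invoice.php", "README.md"],
    ["Order.php", "Invoice.php"],
    ["README.md"],
    ["Order.php", "Cart.php"],
]

for row in logical_coupling(history):
    print(row)
```

Here, every commit touching Invoice.php also touches Order.php, so the pair gets the highest possible degree.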

Like any other metric, this one has its flaws. You can’t be sure that developers will put the files they often modify together in the same commit, for example. If the commits are squashed automatically each time they’re merged to the main branch, you’ll get wrong results, too. That said, it’s still a useful metric if you keep these flaws in mind.

In my experience, this kind of coupling happens quite often. If you find some, the best option is often to merge the logically coupled modules, to increase their cohesion. It can also help you detect violations of the DRY principle, that is, the same knowledge appearing in multiple places of the codebase.

Semantic Coupling

Sometimes, some modules use the knowledge of other modules, increasing the coupling between them and decreasing their cohesion. They’re often logically coupled, but not always. This is called semantic coupling.

Analyzing shared semantics between modules is hard, but some techniques are being developed, like machine learning models which are able to analyze relations between comments or names, for example.

Here are some metrics used for semantic coupling:

CCM - Conceptual Coupling Between Methods.
CCMC - Conceptual Coupling Between a Method and a Class.
CCBC - Conceptual Coupling Between two Classes, also called CSBC (Conceptual Similarity Between two Classes).
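The metrics above rely on more elaborate techniques than what fits in a few lines. Still, the intuition behind them can be illustrated with a much cruder substitute: splitting the identifiers of two pieces of code into words and measuring how much vocabulary they share (Jaccard similarity). This is my own simplification, not the definition of CCM, CCMC, or CCBC, and the PHP snippets are invented for the example.

```python
import re


def vocabulary(source: str) -> set[str]:
    """Split identifiers (camelCase, snake_case) into lowercase words."""
    words = set()
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for part in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", identifier):
            words.add(part.lower())
    return words


def shared_vocabulary(source_a: str, source_b: str) -> float:
    """Jaccard similarity between the vocabularies of two pieces of code."""
    a, b = vocabulary(source_a), vocabulary(source_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


compute_total = "function computeInvoiceTotal($invoice) { return $invoice->total; }"
print_invoice = "function printInvoice($invoice) { echo $invoice->total; }"
send_email = "function sendEmail($address) { mail($address); }"

print(shared_vocabulary(compute_total, print_invoice))
print(shared_vocabulary(compute_total, send_email))
```

Language keywords like `function` add noise to the score; a more serious attempt would at least filter them out.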

I never measured the semantic coupling of my modules, the other metrics being good enough for my needs.

Measuring Complexity is Only the Beginning

If you only take one thing from this article, grab this one: complexity metrics are all flawed; they can give you false positives if you’re not careful. That said, they can be useful to isolate a part of your codebase which might be more prone to defects. But, at the end of the day, it’s a combination of metrics, your experience, and your experiments which will show you the potential problems in your codebase.

When I want to find where the complexity is hiding, I always try the simplest metrics first. In order of preference:

LOC
Code shape
Structural coupling (common and content coupling)
Logical coupling

The Halstead metrics (or the ones based on them) and the cyclomatic complexity can be useful too, but only if they’re abnormally large.

Looking at complexity metrics alone is, however, not enough. A codebase is like a living organism: some parts change more or less often, modified by different actors with different purposes, understanding, and styles. That’s what we’ll look at in the next article of the series: code metrics in the context of a social, ever-changing environment.
