
Measuring Software Complexity: The Impact of the Environment

The environment brings complexity over time

This article is part of a series about complexity metrics.

The alarm rings suddenly and intensely, waking you up from the food coma you were settling into. Red lights are on, a sign of a major crisis. What an idea, eating a massive burger on release day!

Everybody’s running in every direction, but nobody seems to be going anywhere. Developers begin to frenetically type nonsense into their terminals. You glance at the metrics: the CI is in the red, the logs are nuts, the dependency graph is through the roof. Everything’s crashing.

You begin to hear screams of terror accompanying the annoying alarm. It’s the customer service, no doubt!

The day you’ve anxiously anticipated has finally come. It’s over. Everything’s going down. The system collapses on itself. The Apocalypse.

Fast-forward through a horrible three-month death march: your managers, instead of blaming the developers as they usually do, finally seem to understand that software complexity needs to be managed. They assign you the task of finding and reporting the most complex parts of your massive codebase. Then, depending on the findings, a war council will be created to decide on a strategy.

Instinctively, you begin to launch your good old static analysis tools. But your general knowledge about systems makes you pause a moment. Even if looking at the complexity inside the system is useful, it might be even more useful to also consider the environment influencing the system itself.

We often think that isolating the parts of a codebase is the key to good software, but total isolation is neither possible nor desirable. It begs the question: how do we find the most complex parts of our codebases while considering the impact of their environment?

We’ll try to answer this question in this article. More precisely, we’ll see:

  • Pitfalls when considering the social environment of a system, like a codebase.
  • How to mix the rate of change of a codebase with complexity metrics.
  • The possible benefits and drawbacks of aging code.
  • Studies about the correlation between the number of developers and defects.
  • What the cognitive complexity metric is.
  • Studies about capturing the thoughts and emotions of a team of developers.

Are you ready to look at software in a slightly different way? Me neither, but let’s jump in anyway.

Pitfalls when Considering the Environment

When we consider real-world systems, we always need to keep in mind that the environment (what’s outside the system of interest) has some impact. It’s the case for a codebase, too:

  • A codebase is written by humans for a machine to execute, and for other humans to understand. As a result, there is a strong social context in software development.
  • A codebase evolves over time, and its entropy changes with it. Yet, static analysis tools often only give information about a snapshot of the codebase, as if it were frozen in time.

When we begin to add social metrics to find complexity, we need to be careful not to use these metrics against our peers. Blaming developers because we came up with some numbers “proving” that they did a bad job will only create a toxic, competitive culture. It has been shown that collaboration should, instead, be praised.

Additionally, measuring complexity in its social context is difficult, like any social study. Our social world is a highly dynamic, ever-changing system. It’s difficult to isolate what you want to measure, and it’s even more difficult to reduce the uncertainty of the consequences of these measurements. Many invisible factors can play a role in the consequences of our actions. Because of this complexity, our intuition, in this context, is likely to be wrong too.

In short, don’t use metrics related to productivity, cognition, or other human qualities to squeeze as much work as possible from your exhausted team. Instead, they should be used to improve the processes and the quality of the codebases. Code and processes are easier to change than people, and speaking about the defects of a codebase is often better received than speaking about people’s “defects”. Some team leaders out there sometimes forget that developers are people, not programs. Don’t make this mistake.

Tools are not inherently evil. That said, you can use a hammer to fix a house, or to look at someone’s brain. I wouldn’t advise the latter use.

Measuring Changes With Code Churn

We saw, in the previous article, some complexity metrics we can use on a codebase at a precise point in time. Thinking about it, working with a snapshot is the easiest thing to do, because it’s easier to observe static systems. Yet, what’s easy to measure is not necessarily what we should measure, even if we have a tendency, as humans, to conflate the two.

If a codebase and its environment were really static, nothing would change after a first period of creation. No need to use complexity metrics anymore! But, in reality, a codebase changes often, creating both the opportunity to adapt to and reflect The Ever Changing Real World™, and the complexity which will give you headaches and nightmares.

That’s why, according to studies, the best way to find complexity spots in your codebase is to measure code churn. This metric counts the total number of lines of code added and deleted during a given period of time.

One of these studies shows that measuring code churn is indeed more accurate than counting lines of code to find complexity spots. Another study shows that code churn is also effective at predicting future defects in a codebase.

There are two types of churn we can consider: absolute and relative churn (the same counts, but normalized, for example by the total size of each file). According to the studies above, they behave quite similarly. To clear any misunderstanding, when I speak about code churn in this article, it refers to absolute code churn.

I can already see the question in your mind: how do we calculate the code churn of a codebase?

Good old Git can give us the information we seek if we ask it politely. Many tools out there can calculate it for us, but I think it’s useful to know how to do it ourselves. We’ll use a shell, Git, and a bunch of GNU tools. It might work with their BSD counterparts too; I’ve no idea.

Here’s a Git command you can run in your favorite project’s root directory:

git log --pretty='' --date=short --numstat

It will output the churn for every file of the project, from the first commit to the last. The first number displayed is the count of lines added; the second one represents the lines deleted. The data is raw and not really useful yet, but it’s a good start.
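For illustration, the output looks something like this (hypothetical file names and counts):

12      3       src/parser.c
0       45      docs/old-spec.md
7       7       README.md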

We can also add two options to this command: --before, --after, or both. They allow us to display the churn for a specific period of time only. For example, to display the code churn from 2015 to 2020 inclusive:

git log --after=2015-01-01 --before=2021-01-01 --pretty='' --date=short --numstat

Adding awk and sort to the mix, we can go crazy and display the count of added and deleted lines per file:

git log --pretty='' --date=short --numstat \
| awk '{print $3,$1,$2}' \
| awk 'BEGIN{print "added","file","deleted"} {added[$1] += $2} {deleted[$1] += $3} END{for (i in added) print added[i], i, deleted[i]}' \
| sort -rn

The resulting output is already more useful: it will show you which files change the most in your codebase. But it’s also likely that you’ll get many files you don’t really care about, like config files or minified CSS files. After all, the files changing the most in your codebase are the ones which might grow in complexity, but only if there is any sort of complexity there in the first place.

Now, it could be interesting to filter the files modified in a specific period of time and use some static analysis to get the most complex ones. It could reveal the real complexity spots we’re after.

Combining Changes and Complexity Metrics

We saw in the previous article that no metric can show us the complexity of our code with great reliability. That said, some of them can still give us more information to make a more informed decision. As a result, to get an idea of the complexity of the code in each file, we can simply use the easiest complexity metric we can find: the venerable lines of code (LOC).

It’s far from being the perfect complexity metric, no doubt about that. Don’t expect 100% accuracy in finding complexity spots for every possible piece of code by only measuring LOC. But, according to studies, no static complexity metric does better, and it still gives us a useful heuristic to find this damn complexity. That’s what we should aim for: usefulness, instead of the impossible perfection.

So, how do we sort the files of our codebase by complexity (using LOC) and by number of changes? Again, we can use Git and some common CLIs. First, we can look at how many times each file in your favorite project was modified, and save the output to a file called “churn”:

git log --pretty='' --date=short --numstat | awk '{print $3}' | sort | uniq -c | sort -rn | awk '{ print "./" $2,$1 }' > churn
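The resulting file contains one line per file with its modification count; something like this (hypothetical paths and counts):

./src/router.js 42
./README.md 17
./config/app.yml 9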

We don’t look at added and deleted lines this time; the concept of “modified file” is enough for now.

Next, we can count the lines of code per file and save the result into another file, “comp”. We use the CLI cloc to do so:

cloc ./ --by-file --quiet --csv | awk -F ',' 'NR > 2 { print $2,$5 }' | head -n -1 > comp

Finally, we can merge both results and output a CSV using awk and sed:

awk '{files[$1]=(files[$1]?files[$1]FS$2:$2)} END { for (i in files) print files[i],i }' churn comp \
| tr ' ' ',' \
| sed '/^.*,.*,.*$/!d' \
| sort -rg

Running the above command, we get some columns representing, in order:

  1. How many times each file was modified.
  2. The lines of code.
  3. The filename.
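For illustration, the merged output might look like this (hypothetical values):

42,380,./src/router.js
35,1200,./src/legacy/orders.js
9,52,./config/app.yml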

If you’d like a ridiculous one-liner combining the three commands above, here you go (note that the =(...) constructs are Zsh process substitutions; in Bash, the <(...) form should work instead):

awk '{files[$1]=(files[$1]?files[$1]FS$2:$2)} END { for (i in files) print files[i],i }' \
=(git log --pretty='' --date=short --numstat | awk '{print $3}' | sort | uniq -c | sort -rn | awk '{ print "./" $2,$1 }') \
=(cloc ./ --by-file --quiet --csv | awk -F ',' 'NR > 2 { print $2,$5 }' | head -n -1) \
| tr ' ' ',' \
| sed '/^.*,.*,.*$/!d' \
| sort -rg \
| awk 'BEGIN{print "CHANGES,LINES,FILENAMES"} {print}'

The result is sorted by the number of changes here, but you can save the output in a CSV file and open it in an Excel-like application to sort the result according to your needs.

You might have noticed that we didn’t choose a time period using the options --before and --after in our Git commands. I would encourage you to do so, to clearly see the complexity trend of specific periods of time. How do we choose this time period? As a rule of thumb, we could look around important events; for example, around releases, or around the implementation of a complex functionality.
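For example, here’s a hypothetical look at the modification counts during the quarter following a big release:

git log --after=2020-01-01 --before=2020-04-01 --pretty='' --date=short --numstat \
| awk '{print $3}' | sort | uniq -c | sort -rn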

This information can be useful if you have a complex functionality to develop, or if you want to know what you should refactor to improve the stability of your application. In general, you need to pay extra attention to the files with both a high modification rate and high complexity. Make sure they’re properly tested, and refactor them if needed.

You can even plot different time periods to see if the complexity of some parts of your codebase increases or decreases over time. You don’t have to do that manually: many programs can do it for you. I like to use code-maat, a tool developed by Adam Tornhill (author of great books about measuring complexity, which I definitely recommend). If you want nice graphs and other visualizations instead of raw data in your shell, I sometimes use code forensics, a wrapper around code-maat.

As always, use your knowledge (both technical and domain related) to assess if the results you get are indeed complexity spots. False positives are common, whatever technique you’re using.

Aging Code

Older code is still there because it's so good... or nobody wants to touch it

Speaking of time periods, what about code which is simply old? How can time influence our codebase?

According to studies like this one, code seems to decay over time: the older the code, the more likely bugs are to appear.

On the bright side, ancient code can also mean that the code is good! It’s possible that nobody felt the need to refactor it because of its quality. Again, you’ll need to use your expertise to judge if the aging parts of your codebase are wonderful antique treasures or old rotten cucumbers.
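To locate these aging parts in the first place, here’s a rough sketch using only standard Git commands. It prints every tracked file with the date of its last commit, oldest first (it runs one git log per file, so it can be slow on large repositories):

git ls-files | while read -r f; do
  echo "$(git log -1 --format=%ad --date=short -- "$f") $f"
done | sort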

There is another potential problem with aging code, regardless of whether it’s “good” or “bad”. The older it gets, the less likely it is that the authors will still be around. That’s a shame, especially if you need some explanations of how it works, or why it was designed that way.

As humans, we have a tendency to forget easily. That’s why it’s important to capture the present context when functionalities are being codified in a codebase, by using good naming and writing good requirements or high-level documentation. The future developers who need to maintain (or, even worse, change) your code will thank you. Logging all your decisions in a journal can also be a good tool serving generations to come.

More Developers, More Bugs?

I thought for a long time that the more developers there are on a project, the more complex the codebase gets. After all, we need to understand each other (a daily challenge in itself), and we need to coordinate. This can quickly create misunderstandings or misconceptions, potential sources of bugs and wrong implementations.

But if we look at the studies on the subject, nobody has found empirical evidence that more developers working on the code correlates with more defects.

It can still be useful to know who the developers are who wrote most of the code you’re working on. A simple git blame can go a long way if you need some explanations about a piece of code, or about the context at the time it was written.
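For example, with a hypothetical path: git blame annotates each line with its last author, while git shortlog counts the commits per author for that file:

git blame src/parser.c
git shortlog -sn HEAD -- src/parser.c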

As an aside, I prefer to think of the Git “blame” command as a “praise” command. Going into endless debates about the subjective question of “code quality” or “code cleanliness” is often not the best idea. Instead, it’s better to focus on improving the code than on blaming its authors.

Cognitive Complexity, Comprehension, and Readability

It’s now time to reveal the root of all our problems as developers. It’s no surprise: our brains didn’t evolve to reason accurately about stacks of abstractions with hundreds of states. That’s why we try to automate things as much as possible (so we don’t have to think about them), and hide complexity as much as we can, praying that it won’t jump back into our neurons.

It’s nice to study complexity in our codebases, but it’s even more useful to understand the complexity we can handle in our poor brains. As I stated at the beginning of this article, humans write code for other humans, not only for a dummy compiler.

Understanding which parts of our codebase ask for more brain time (or more mental energy) would certainly be beneficial, helping us work around our limitations. But, as always, studying our own way of thinking is difficult, and far from being a solved problem. If it were, we would have drunk robots all over the place.

Let’s take understandability: according to this study, we spend 50% of our time trying to understand the code we’re reading or modifying. It wouldn’t be too bad to reduce this number. So let’s ask: how do we find the parts of our codebase which are the most difficult to understand?

If you let your intuition run wild to answer this question, you’ll come up with a personal opinion on the subject. This opinion won’t necessarily be shared by your colleagues, which can create new endless debates full of subjectivity. And even if everybody agrees, how can you know that you’re right in general?

Is there any way to measure this cognitive complexity more objectively?

A couple of years ago, SonarSource (the company developing SonarQube, one of the most used applications for measuring complexity) came up with the “cognitive complexity” metric, based on this study. It’s essentially an updated version of cyclomatic complexity, adding more complexity for nested constructs.
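To give a rough idea of the difference, here’s a sketch of the scoring on a contrived shell function (the exact rules are defined in SonarSource’s specification):

process() {
  for item in "$@"; do                 # cyclomatic +1, cognitive +1
    if [ -n "$item" ]; then            # cyclomatic +1, cognitive +2 (+1 for nesting)
      if [ "$item" = "stop" ]; then    # cyclomatic +1, cognitive +3 (+2 for nesting)
        return
      fi
    fi
  done
}
# Cyclomatic complexity: 4 (one per branch, plus one).
# Cognitive complexity: 6; the nesting penalty dominates.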

This metric was empirically studied on 22 projects, and the developers of 17 of them found the metric accurate. But the author points out that the sample size was too small for any statistical significance. Despite this, many applications measuring complexity offer this metric.

From there, other papers tried to validate the metric, like this one. It looked at many papers about the relation between code and understandability, and tried to compare the measurements they made with the cognitive complexity metric.

In the end, it seems that this metric is correlated with the time developers spend on code to understand it. It also seems to match which parts of the codebase developers find complex.

That said, if you ask those same developers some questions to check if they really understood the code, the results won’t be correlated with the metric itself. In short, we sometimes think that code is complex even if we understand it, and other times we think code is not complex even if we don’t really understand it. Damn it, brain!

All in all, if you come across “cognitive complexity” in one of your tools, it seems to be slightly better than the usual cyclomatic complexity. But, again, I would rely more on code churn or aging code than on this metric.

The question of understandability is even more problematic when we ask: is it even possible to come up with a general understandability metric, accurate for everybody?

Another interesting study, using mathematical models this time (instead of asking developers if they feel that the code is complex), ranks some constructs as way more complex to understand than others. The worst of all are recursion, parallel processing, and interrupt processing (a second process interrupting a first one, doing some computations until it ends, and then letting the first process continue). I would agree with this ranking; but again, it’s only my flawed brain speaking here.

Readability can also be a source of complexity for our biological neural networks. According to this other study, lexical inconsistencies have more impact than hundreds of other metrics coming from your favorite static analysis tools.

Here are some examples of lexical inconsistencies (with a small sketch after the list):

  • Getters doing other operations before returning the value you want (side effects).
  • Setters returning something.
  • Misleading comments.
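As a contrived shell sketch of the first kind (get_user_count, audit.log, and users.txt are all hypothetical), the name promises a pure read, but the function quietly does something else along the way:

get_user_count() {
  echo "$(date): get_user_count called" >> audit.log   # hidden side effect
  wc -l < users.txt                                    # the advertised "get"
}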

In short, it’s about being misled by missing or wrong semantics. Keeping a tight consistency between the naming and the behavior of our code seems more important than anything else. As a result, it’s worth spending some time on the documentation (in a very large sense, including comments and naming) and refactoring it when necessary.

Emotional Awareness via Git Commit Messages

It's important to understand the feelings of developers on a project

Our brain is not only a cold machine trying to logically process the outside world. Big news: we have emotions, too.

Even if we prefer to think of ourselves as logical beasts, our emotions are a big part of our conscious life, whether we want it or not. According to Daniel Goleman in his famous book Emotional Intelligence, emotions are often the first thing influencing our decisions, before the analytical part of our brain has time to react.

This was quite useful for survival in the past, and it’s still useful today, but not necessarily when we design our beloved applications.

How can we be aware of the negative emotions of our colleagues, to make sure they won’t be the indirect cause of some complexity in the codebase? More and more studies look at this question, like this one, which analyzed 60,425 different Git commits using sentiment analysis algorithms.

This study (and others before it) shows that strong emotions affect the quality of the codebase and, potentially, its complexity. It’s not very clear yet in what ways, but it’s still useful to keep that in mind.

More importantly, this study answers one of the oldest questions bothering humanity since the beginning of time: yes, Git commit messages are more negative on Mondays!

Weird jokes aside, this study also shows that when different nationalities work on the same project, the GitHub comments are more positive. A good argument to support diversity in our teams.

You don’t need to use machine learning models to mine the commit messages of your own project. You can simply create word clouds from them. It can give you an idea of the mindset of a developer team.

To give you a quick idea, you can split your commit messages into words, count how many times each of them appears, and output the first 100:

git log --pretty=format:'%s' | tr ' ' '\n' | sed 's/.*/\L&/' | sort | uniq -c | sort -rg | head -n 100

Again, if you want to look at commit messages from specific periods of time (during a death march, for example, or when the team seemed to enjoy coding some functionality), you can do so with the usual options --before and --after.

Then, we can quickly clean the data to get rid of the words we don’t care about (like articles or prepositions), and plug the result into an application creating word clouds (like this one).
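As a quick-and-dirty sketch of the cleaning step (stopwords.txt is a hypothetical file containing one word to ignore per line):

git log --pretty=format:'%s' \
| tr ' ' '\n' \
| sed 's/.*/\L&/' \
| grep -vxF -f stopwords.txt \
| sort | uniq -c | sort -rg | head -n 100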

What do you want to see? Words related to the domain of the codebase. If you see a lot of “fix”, you can ask your colleagues to squash their commits when they fix their most recent changes. It might also mean that developers spend a lot of time maintaining the codebase (which is very often the case).

The Mysterious Environment

Trying to assess the overall context around the development of an application, and how much it affects the codebase itself, can seem like a daunting task. But, as we saw, there are interesting techniques and metrics out there which can shed new light on our projects.

What did we see in this article?

  • According to studies, measuring churn is the best way to find complexity spots and to forecast future defects.
  • Combining churn with some complexity metrics (like the count of lines of code, or LOC) can help us single out the interesting complexity spots.
  • Code decay needs to be taken into consideration. Old code may not have changed either because developers don’t want to touch it (due to its complexity), or because its quality is good enough.
  • Even if it seems useful, be careful with the “cognitive complexity” metric implemented in many applications. It’s not even certain that a cognitive complexity metric can be applied universally to all developers.
  • Aiming for consistency when writing your code seems to be one of the most important factors of a good codebase. Well-chosen names correctly reflecting the behavior they abstract are key, without omitting the important details.

In general, we shouldn’t forget that the rate of change, the business context, the culture of the company we work for, and the developers themselves have a great influence on the codebase. Even if this influence can be challenging to measure, it shouldn’t be discarded.
