The Valuable Dev

Measuring Software Complexity at the Command Line

Should we rebuild everything too complex?

This is the third article of my series about complexity metrics:

I began a new job last January. New job means new colleagues, new offices, new business domain, and new codebases.

It wasn’t the first time I changed jobs. It’s quite common for developers nowadays to jump from company to company. But it doesn’t change the fact that, in these situations, we need to adapt to basically everything. It’s quite tiring; that’s why having some process to get through this period more easily is always welcome.

To get a quick overview of a new codebase, I like to use simple metrics we’ve already seen in this series of articles. They can give us some assumptions about the complexity of the codebase, which developers seem to have some knowledge about it, and the general mood of the project.

It also allows us to see quickly how honest the developers were during the job interview, and to see if they have an accurate view of their own systems.

First, it’s useful to come up with questions which could be partially answered by some metrics. As I was saying in the first article of this series, if there is no question, there is nothing to answer; so there is no need to measure.

In this article, I’ll show this process using two open source projects: Devdash and Kubernetes.

For each project, we’ll measure:

  • How long the project has been around.
  • The development activity per year.
  • Who could answer our questions.
  • How big the project is (in terms of lines of code).
  • The aggregation of churn and a complexity metric to uncover the hotspots of complexity.
  • The Git comments, to get the general mood of the project, and to see what the most important domain concepts are.

We’ll use some basic GNU command line tools to measure all these points.

Our goal is to get some useful assumptions about the codebase; they’re not absolute truths. We can then use our experience and our intuition to dig deeper, by looking at the code itself.

It’s useful to write a summary of these assumptions somewhere, to keep the ideas at hand when we have to modify the new codebase. The sections “In a Nutshell” would be these summaries.

With all these important points out of the way, let’s begin the analysis.

Analyzing Devdash

First, we’ll look at the open source project Devdash. I recommend that you follow along by cloning the project and running all the commands you’ll see in this article. Even better: you can come up with your own questions and try to modify the different commands to answer them.

Also, don’t hesitate to share what you’ve found useful to measure in the comment section.

To get the results you’ll see in this article, you’ll need to rewind your Git history to match the time period when these measures were done. For that, you can run the following command at the root of the project:

git checkout $(git rev-list -n 1 --before="2022-04-21" master)

Code Ageism: How Old is Devdash?

Looking at the age of the entire codebase is a good introduction to its possible complexity. So let’s ask our first question: how old is the codebase?

Let’s print the first and last commit using Git’s history:

git log --pretty="format:%cd %h %s"  --date=short --reverse \
    | awk 'NR == 1 {print} END {print}'

First, we get Git’s logs with the following information:

  • %cd - Committer date.
  • %h - Short commit hash.
  • %s - Subject (the first line of a commit message).

We use GNU awk here to filter the logs, only printing the first and last one. A line is a “record” in awk terminology; printing the record when the number of records read so far (NR) equals one prints the first line, and the END block prints the last one. I’ve written an introductory article about awk if you want to know more about this fantastic tool.
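As a small variation on the filter above, here is a sketch that stores the last record explicitly instead of relying on what awk keeps in the END block, and that prints a single commit only once. The function name first_and_last is made up for this example.

```shell
#!/usr/bin/env bash

# Print the first and the last line read from standard input.
# first_and_last is a hypothetical helper name, not part of any tool.
first_and_last() {
    awk 'NR == 1 { print }                 # first record
         { last = $0 }                     # remember the most recent record
         END { if (NR > 1) print last }'   # last record, unless it is also the first
}

# Possible usage inside a Git repository:
#   git log --pretty="format:%cd %h %s" --date=short --reverse | first_and_last
```

With a history containing a single commit, the original one-liner would print that commit twice; this version prints it once.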

Here’s the result:

2018-08-06 c176d6b [master] First commit
2021-10-26 d8b871c Merge branch 'master' of

We can already come up with a couple of assumptions here:

The first comment “First commit” seems to indicate that we have indeed all the commits for the project. Some (old) projects switch VCS (Version Control System) during development, sometimes losing all the commits created beforehand. If we mine the Git commits to find some answers, it’s important to know if we have access to all of them.

Second, Devdash hasn’t been around for a long time: a bit less than 4 years at the time this article was written. It’s not a greenfield project, but I wouldn’t expect much complexity either. If further analysis proves the contrary, asking questions related to the history of the project can give us more useful information.

Third, the project didn’t change much, lately. We could think of two different explanations:

  1. It’s abandoned.
  2. It’s stable enough and it answers most needs for now.

If it’s abandoned, we can ask ourselves if the system is still used, or even if it still works. If it’s stable enough, it might be the sign of a healthy project, maybe maintained by individual(s) (or a company) who knows when to stop implementing features.

All in all, since the project hasn’t been modified for a while, dependencies might be out of date. If that’s the case, they should be updated as soon as possible, or it could lead to security problems. The longer we wait to update dependencies, the more difficult it is to update them safely, without breaking parts of the codebase.

For example, I remember a project which used a version of a framework released more than 10 years ago. Nobody took the time to bump it to the next version. When you’re relying on a framework deprecated for a decade, you’re unlikely to be able to update it without rewriting most of your application.

Security holes are no joke.
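To put a number on the age of a codebase, we can subtract two dates. This is only a sketch: it assumes a date command that accepts -d with a YYYY-MM-DD string (GNU date does), and days_between is a name invented for the example.

```shell
#!/usr/bin/env bash

# Number of whole days between two YYYY-MM-DD dates.
# Assumption: `date -d` is available (GNU coreutils); -u avoids daylight-saving surprises.
days_between() {
    local start end
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( (end - start) / 86400 ))
}

# From Devdash's first commit to the date used for the checkout above:
#   days_between 2018-08-06 2022-04-21   # → 1354
```

1354 days is roughly 3.7 years.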

Number of Commits Over Time

We know how long the project has been around, but we don’t know how active it has been. So let’s ask the following question: when was the project mostly implemented? It’s likely that bugs and complexity were introduced during the most active periods.

Let’s run the following to get the count of commits per year:

git log --pretty="format:%ad %h %s" --date=short --reverse \
    | awk '{print $1}' \
    | awk 'BEGIN{FS="-"} {print $1}' \
    | sort -n \
    | uniq -c \
    | awk 'BEGIN{print "commits","year"} {print}'

We use awk here again, first to get the date of each commit, then to get the year only, and finally to add some headers to the output.

If you want to output a CSV instead, to edit it in a spreadsheet editor for example, you can use sed to delete the indentation. Then, we use the command tr to replace the spaces with commas:

git log --pretty="format:%ad %h %s" --date=short --reverse \
    | awk '{print $1}' \
    | awk 'BEGIN{FS="-"} {print $1}' \
    | sort -n \
    | uniq -c \
    | sed 's/^\s*//' \
    | awk 'BEGIN{print "commits","year"} {print}' \
    | tr ' ' ','
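The same count can also be done in a single awk program, which avoids cleaning up the indentation added by uniq -c. Here is a minimal sketch, assuming one YYYY-MM-DD date per line on standard input; commits_per_year is a hypothetical name.

```shell
#!/usr/bin/env bash

# Count dates per year and print the result as CSV, oldest year first.
# commits_per_year is a made-up helper name for this example.
commits_per_year() {
    echo "commits,year"
    awk -F '-' '{ count[$1]++ }
                END { for (year in count) print count[year] "," year }' \
        | sort -t ',' -k2,2n
}

# Possible usage inside a Git repository:
#   git log --pretty="format:%ad" --date=short | commits_per_year
```

The explicit sort is needed because awk doesn’t guarantee any order when looping over an array.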

Here’s the output:


It seems that the developers were most active in 2019 and 2020. These years will be our reference to determine what part of the project changed the most. We could also look at the period from the creation of the project till now, but it might be too much data to analyze. It would work for a project of this size, however.

Asking Questions

When we go deeper into the details of the project and look at the code, it’s very likely that we’ll have some questions. The best would be to find the developers who were the most involved in the project and ask them directly, if we can.

Let’s get the number of commits per author with this simple command line:

git shortlog --after=2018-12-31 --before=2021-01-01 -sn --all

We limit the results to the time range we’ve defined: 2019 and 2020. Also, the option -s only displays a summary with the count of commits, and -n sorts the output by number of commits (instead of ordering alphabetically by author’s name).

Here’s the result:

   156  Matthieu
    89  matthieu
    32  Matthieu Cneude

It looks like this “Matthieu” guy is the only developer on the project. Incidentally, he’s also the author of the article you’re reading! What a coincidence.

So, if you have a question, just ask me.
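The same person appears under several names in the output above, which is common when a developer commits from different machines. As a rough sketch, we can at least merge the names which differ only by letter case. The function name merge_author_case is invented for this example, and it expects the output of git shortlog -sn on standard input.

```shell
#!/usr/bin/env bash

# Sum the commit counts of authors whose names differ only by case.
# Input: lines of "<count> <name>", as printed by `git shortlog -sn`.
merge_author_case() {
    awk '{
             count = $1
             $1 = ""                  # drop the count, keep the name
             sub(/^[ \t]+/, "")       # trim the separator left in front of the name
             total[tolower($0)] += count
         }
         END { for (name in total) print total[name] "\t" name }' \
        | sort -rn
}

# Possible usage inside a Git repository:
#   git shortlog -sn --all | merge_author_case
```

With the three entries above, “Matthieu” and “matthieu” would be merged into one line with 245 commits, while “Matthieu Cneude” would stay separate. A .mailmap file at the root of the repository is the more thorough way to merge identities, since Git itself then applies the mapping.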

Project Size: The Lines of Code

It’s time to look at the complexity of the project, by trying to answer this question: how big is the project and, therefore, how complex might it be?

As I’ve already described in the very first article in this series, any complexity metric we can get from a snapshot of a codebase is more or less as accurate as counting the project’s lines of code.

I know what you might think: “Counting lines of code to measure the complexity? Are you crazy, or just dumb? It won’t show anything! And I don’t want developers to write one-liners impossible to understand, just for the codebase to appear less complex!”

I believe that developers would try to lower the complexity this way only if the company culture has a good amount of toxicity. Nobody should use the count of source lines of code (SLOC) to judge developers, or to have any definitive opinion on anything. The point is to give a direction: some clues about what’s happening in the codebase, without reading every single file.

Let’s see how this count can be useful. First, we need to install a CLI tool which can count SLOC for many languages: cloc.

Let’s run the following in the root of the project:

cloc .

Here’s the result:

v 1.92  T=0.11 s (714.0 files/s, 133306.2 lines/s)
Language                     files          blank        comment           code
Go                              49           1429            298           8499
JSON                            14              0              0           3712
YAML                            11             11              7            461
Markdown                         3             85              0            219
Bourne Shell                     2              5              4             19
SUM:                            79           1530            309          12910

The first obvious observation: it’s a small Go project. The size is not surprising, since it was developed only by one person in two years.

What’s more surprising is the sheer amount of JSON and YAML in there. A quick recursive search for JSON files, thanks to the command ls **/*.json, shows that they’re mostly fixtures for unit tests. Doing the same for YAML files shows that most of them are in the example folder.

It seems that Devdash can be configured using YAML in different ways. Looking quickly at the files (cat $(ls **/*.yml)), we can see that the configuration is quite flexible. In my experience, anything flexible also brings a great deal of complexity, because there are many use cases the code needs to cover.

The JSON files are used as fixtures, maybe the responses from some APIs. Requests to external APIs are important indicators too, because we don’t control the result we get from them. It’s an external coupling which can bring a lot of communication overhead with the maintainers of the API. In short, and in my experience: it can be painful.

Calculating Churn

As we saw in the second part of this series of articles, it seems that code churn is the best indicator of complexity and possible bugs over time. It doesn’t measure a snapshot of the codebase anymore, but analyzes the codebase over time.

Intuitively, it makes sense: when the code changes, bugs creep in. It’s not only the code; as we hinted above, if any external dependency (APIs, libraries…) changes, it can also introduce bugs.

So let’s ask: what parts of the codebase change the most? Therefore, what parts of the codebase seem to gather the most complexity?

In the previous article, I came up with a command which outputs how many times each file changed, aggregated with its size in SLOC. Let’s reuse it for our purposes; I modified it slightly to add the time period of interest for both the Git and the cloc commands.

awk '{files[$1]=(files[$1]?files[$1]FS$2:$2)} END { for (i in files) print files[i],i }' \
    =(git log --after=2018-12-31 --before=2021-01-01 --pretty='' --date=short --numstat \
        | awk '{print $3}' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print "./" $2,$1 }') \
    =(cloc ./ --by-file --quiet --csv \
        | awk -F ',' 'NR > 2 { print $2,$5 }' \
        | head -n -1) \
    | tr ' ' ',' \
    | sed '/^.*,.*,.*$/!d' \
    | sort -rg \
    | awk 'BEGIN{print "CHANGES,LINES,FILENAME"} {print}'
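The =(...) syntax used above is specific to zsh: it runs a command and replaces itself with the name of a temporary file containing the output. In Bash, <(...) can play a similar role. To make the joining step easier to follow, here is a sketch which isolates it in a function taking two files: one with “path changes” lines, one with “path lines-of-code” lines. The name churn_with_sloc is hypothetical, and the sketch assumes paths without spaces.

```shell
#!/usr/bin/env bash

# Join a churn listing with a SLOC listing on the file path.
# $1: file containing "<path> <number of changes>" lines
# $2: file containing "<path> <lines of code>" lines
# Paths present in only one of the two listings are dropped.
churn_with_sloc() {
    echo "CHANGES,LINES,FILENAME"
    awk 'NR == FNR { changes[$1] = $2; next }    # first file: remember the churn
         ($1 in changes) { print changes[$1] "," $2 "," $1 }' "$1" "$2" \
        | sort -t ',' -k1,1nr
}

# Possible usage in Bash, reusing the commands of the article:
#   churn_with_sloc \
#       <(git log --after=2018-12-31 --before=2021-01-01 --pretty='' --numstat \
#           | awk 'NF == 3 { print "./" $3 }' | sort | uniq -c | awk '{ print $2, $1 }') \
#       <(cloc ./ --by-file --quiet --csv | awk -F ',' 'NR > 2 { print $2, $5 }' | head -n -1)
```

Files which were changed during the period but don’t exist anymore are dropped, as are files which exist but never changed; it’s the same filtering the sed command does in the original pipeline.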

Here are the first lines of output:


We now have the most complex files which change the most often. We don’t really care about the Markdown files, but the other files are interesting. It seems that internal/ga_widget.go, internal/project.go and internal/tui.go are the files which changed the most.

From there, we should make sure that:

  • These files are well tested. Do they have some sort of automatic tests?
  • These files are understandable for a newcomer. If not, asking questions to the main developer of the project at the time could be useful.

In short, if we have to modify these files at one point, we should be careful doing so. Monitoring them from time to time can be useful too, to see if the complexity increases. If it does, splitting them to improve their readability could be a good refactoring to do.

Git Comments: Mood and Domain Concepts

Let’s ask two more questions to get a first high-level picture of the project:

  • What are the most important domain concepts of this project?
  • Did the developers spend their time maintaining the project, or adding new functionalities?

We can look at the different Git comments to begin to answer these questions. The following command is also from the previous article of this series:

git log --pretty=format:'%s' --after=2018-12-31 --before=2021-01-01 \
    | tr ' ' '\n' \
    | sed 's/.*/\L&/' \
    | sort \
    | uniq -c \
    | sort -rg \
    | head -n 100

The output shows us the frequency of the words used in the Git comments. There are more add than fix words, which already gives us some information: more features have been added, and fewer fixes might have been needed.
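The \L used in the sed command is a GNU extension, and splitting only on spaces keeps the punctuation glued to the words. Here is a possible variant using only tr, as a sketch; word_freq is a name invented for this example.

```shell
#!/usr/bin/env bash

# Count how often each word appears on standard input, most frequent first.
# Every run of non-alphanumeric characters is treated as a word separator.
word_freq() {
    tr -cs '[:alnum:]' '\n' \
        | tr '[:upper:]' '[:lower:]' \
        | grep -v '^$' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Possible usage inside a Git repository:
#   git log --pretty=format:'%s' --after=2018-12-31 --before=2021-01-01 | word_freq | head -n 100
```

The trade-off is that hyphenated terms are split into several words, and issue numbers like #42 are counted as plain numbers.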

We can also output all the words without aggregating them, and put them into a word cloud generator, to visualize the result. Let’s run the following:

git log --pretty=format:'%s' --after=2018-12-31 --before=2021-01-01 \
    | tr ' ' '\n' \
    | sed 's/.*/\L&/'

We can then use a word cloud generator like this online tool. Additionally, it will filter the data, throwing away useless common words. Here’s the result:

Word cloud for the project Devdash

We can see here that the term “widget” is often used. You’ll notice that some of the files which changed the most and had the most complexity were about “widget”, too. It seems that this concept is central for Devdash’s domain.

There’s also the word “github” repeated quite often. Does Devdash call a GitHub API? Is it only some noise? As you can see, answering questions often brings more questions. These are not crucial for now, but they might be relevant later.

A last reassuring trend: it seems that the README was often modified, which could mean that an emphasis was made to write good documentation.

It’s also reassuring to see domain concepts in comments; it’s what we should find. If there are more words related to bugs or maintenance, the project might be difficult to maintain.

In a Nutshell

Thanks to our commands, we now have more information about this project:

  1. It’s a young project which was mostly implemented in 2019 and 2020. Not much happened in 2021 or even this year (2022).
  2. It is implemented and maintained by one developer.
  3. It’s a small Go project manipulating widgets, an important domain concept.
  4. It seems that it calls quite a few external APIs we don’t control, which could lead to problems down the road.
  5. The configuration files for Devdash look flexible and highly customizable, again a potential source of complexity in the codebase itself.

I would write this summary somewhere, and refer to it for the first changes I might need to make in the codebase. I would also add the files which have the biggest aggregation of lines of code and churn to this summary.

We’ve uncovered important information about the project without beginning to read the code. The next step would be to look at the structure of the project, and have a glimpse at the biggest files which change most often.

Of course, you can adapt all these measures (and what to measure) depending on the task at hand.

Analyzing Kubernetes

Let’s now switch gears and look at a bigger project: Kubernetes. As with Devdash, you can rewind the Git history if you want to follow along, and get the same results as the ones in this article. Simply run the following command:

git checkout $(git rev-list -n 1 --before="2022-04-21" master)

We’ll try to answer the same questions we asked for Devdash, and see how things differ when analyzing a bigger project like Kubernetes.

Code Ageism: How Old is Kubernetes?

Let’s display the first and last commit of the project with the following command:

git log --pretty="format:%ad %h %s" --date=short --reverse \
    | awk 'NR == 1 {print} END {print}'

Here’s the output:

2014-06-06 2c4b3a562ce First commit
2022-04-15 a750d8054a6 Merge pull request #109487 from alculquicondor/disable-job-tracking

First, it seems that we indeed have all the commits of the project. Second, Kubernetes has been around for quite some time: almost 8 years already!

We can also see that the project changed recently. The last commit was a couple of days ago.

Number of Commits Over Time

Let’s now look at the activity throughout the years with the command:

git log --pretty="format:%ad %h %s" --date=short --reverse \
    | awk '{print $1}' \
    | awk 'BEGIN{FS="-"} {print $1}' \
    | sort -n | uniq -c \
    |  sed 's/^\s*//' \
    | awk 'BEGIN{print "commits","year"} {print}' \
    | tr ' ' ','

Here’s the result:


The activity was very high from 2015 to 2019, and began to decline slowly from 2018 on. In comparison, not as many commits were created this year.

From there, we can decide to measure what happened in 2021. It would allow us to get the most recent activities while having a big enough sample, hopefully limiting the false positives in our results. We could also analyze everything from the beginning of 2021 until now; that’s what we’ll do for all the other measurements.

Asking Questions

Let’s now get the number of commits per author:

git shortlog -sn --after="2020-12-31" --all

Here are the biggest contributors to Kubernetes:

5470  Kubernetes Prow Robot
372  Jordan Liggitt
239  Anago GCB
200  Antonio Ojea
185  Tim Hockin

The first contributor is not human; let’s ignore it. Anago GCB doesn’t look like a person either; the name suggests some release automation. Then, Jordan Liggitt and Antonio Ojea might be the people who can answer our questions. To get their email, we can add the option --email to the command above.

Project Size: The Lines of Code

Now that we have some basic information about the project, let’s look at the codebase itself. We can again compute the source lines of code with cloc at the root of the project:

cloc .

The command will take quite some time to complete because of the size of the project. Here are the first lines of output:

Language                      files          blank        comment           code
Go                            14879         500624         947818        3811811
JSON                            448              3              0         890831
YAML                           1295            678           1208         132809
Markdown                        465          20157              0          71196
Bourne Shell                    334           6349          12339          31217

A lot of Go in there! That’s normal, Kubernetes is primarily a Go project. There’s a lot of JSON and YAML, too. For what?

If we run a couple of commands, we’ll see that they are mostly used for testing:

  • find . -name "*.json" | wc -l: 585 JSON files.
  • find . -name "*.json" | grep -F "test" | wc -l: 509 JSON files which have the word test in their paths.
  • find . -name "*.yaml" | wc -l: 3704 YAML files.
  • find . -name "*.yaml" | grep -F "test" | wc -l: 3523 YAML files which have the word test in their paths.
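The counts above amount to about 87% of the JSON files (509 out of 585) and 95% of the YAML files (3523 out of 3704). To get this proportion in one pass, we could use something like the following sketch; test_share is a hypothetical name, and “contains the word test in its path” is the same rough criterion as above.

```shell
#!/usr/bin/env bash

# Read file paths on standard input and print how many contain "test".
# Output format: <matching>/<total> (<percentage>%)
test_share() {
    awk '/test/ { tests++ }
         END {
             if (NR == 0) { print "0/0"; exit }
             printf "%d/%d (%.0f%%)\n", tests, NR, 100 * tests / NR
         }'
}

# Possible usage at the root of a project:
#   find . -name "*.json" | test_share
```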

We can also see that there are many Bash scripts in there. Let’s run the following command to see what part of the project they come from:

find . -name "*.sh" | awk 'BEGIN{FS="/"} {print $2}' | sort | uniq -c | sort -rg

The result:

136 hack
 81 test
 40 cluster
 33 staging
 29 vendor
 13 build
  3 third_party
  1 plugin
  1 pkg

It seems that these Bash scripts are primarily in the hack directory. If we look at this directory, we’ll see that it contains mostly Bash scripts, for many purposes.

Here are two assumptions we can make from there:

  1. The name hack doesn’t inspire confidence. These scripts might go around some problems or limitations the developers have.
  2. Whatever we need to do in this codebase, we might need to run (or modify) these Bash scripts. Additionally, it doesn’t seem that these scripts are unit tested.

Looking at the scripts’ names, it seems that we can divide them into two categories:

  1. The “verify” scripts.
  2. The “update” scripts.

The first category might be linked to some kind of test. This assumption is reinforced by the fact that the test directory contains the second largest number of Bash scripts.

The second category is more worrying; these scripts might need to run in some specific situations. We should definitely write that in our project’s summary.

Calculating Churn

If we look at the aggregation of churn and complexity per file, it might be better to limit our analysis to some part of the project. Otherwise, it will time out. The cost of a huge codebase!

In Go projects, a big part of the logic is implemented, most of the time, in the pkg directory. So let’s analyze it:

awk '{files[$1]=(files[$1]?files[$1]FS$2:$2)} END { for (i in files) print files[i],i }' \
    =(git log --after="2020-12-31" --pretty='' --date=short --numstat ./pkg \
        | awk '{print $3}' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print "./" $2,$1 }') \
    =(cloc ./pkg --by-file --quiet --csv \
        | awk -F ',' 'NR > 2 { print $2,$5 }' \
        | head -n -1) \
    | tr ' ' ',' \
    | sed '/^.*,.*,.*$/!d' \
    | sort -rg \
    | awk 'BEGIN{print "CHANGES,LINES,FILENAMES"} {print}'

Here are the first lines of the output:


We can already see that some files have many lines of code. The output is sorted here by the number of changes, but we could also sort by the number of lines. To do so, replace sort -rg \ at the end of the command with sort -t ',' -k2 -rg \.

First, the file kube_feature.go seems to be logically coupled with a lot of things: it’s not very big, but it changes very often, compared to the other files. It could be some sort of configuration file.

Next, the files in apis/core seem quite complex, especially validation.go. Fortunately, it seems to be tested as well thanks to the huge validation_test.go (the biggest file in the whole codebase). With a test file of this size, our hope is that the tests are well defined, commented, and not coupled to one another.

The other files which seem to change often are the ones from the /pkg/kubelet package. They might be logically coupled if they change together. The same could be said about the files in pkg/proxy or pkg/apis. If I needed to work on these files, I would analyze the kubelet package further to better understand the logical dependencies.

All in all, I would have expected more changes on a codebase of that size. It seems to me that the changes are not always happening on the same files: it’s a good sign. It means that there are no “god files” which gather all the logic.

As I said, the Bash scripts can be a little worrying (huge Bash scripts are often a pain), especially when they’re all in some weirdly named “hack” directory. So it might be worthwhile to also analyze them:

awk '{files[$1]=(files[$1]?files[$1]FS$2:$2)} END { for (i in files) print files[i],i }' \
    =(git log --after="2020-12-31" --pretty='' --date=short --numstat ./hack \
        | awk '{print $3}' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print "./" $2,$1 }') \
    =(cloc ./hack --by-file --quiet --csv \
        | awk -F ',' 'NR > 2 { print $2,$5 }' \
        | head -n -1) \
    | tr ' ' ',' \
    | sed '/^.*,.*,.*$/!d' \
    | sort -rg \
    | head -n 30 \
    | awk 'BEGIN{print "CHANGES,LINES,FILENAMES"} {print}'

Here are the first lines of the output:


I would keep an eye on the first two files, because of their rate of change and their size. If they continue to grow over time, it might be a good idea to split them, to make them more understandable.

Git Comments: Mood and Domain Concepts

Last stop: the Git comments. Let’s run the following command:

git log --pretty=format:'%s' --after="2020-12-31" \
    | tr ' ' '\n' \
    | sed 's/.*/\L&/' \
    | sort \
    | uniq -c \
    | sort -rg \
    | head -n 100

There’s quite a lot of noise in the result. In general, the bigger the project is, the more noise we’ll have in our results. It’s important to keep that in mind, because more noise potentially means false positives and wrong assumptions.

We can try to visualize these words using a word cloud, to see more clearly in this mess. Let’s plug in the result of the following commands:

git log --pretty=format:'%s' --after="2020-12-31" \
    | tr ' ' '\n' \
    | sed 's/.*/\L&/'

We could also filter the words we don’t find interesting for our analysis, by piping the command above to something like grep -Ev "merge|pull|from", for example. These words are likely to be from Git itself, not really from the developers.
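One caveat with the grep above: without further options, it throws away every line containing one of these strings, so a word like “frontend” would disappear because it contains “from”. Here is a sketch matching whole lines only, and ignoring case; the function name filter_noise and the list of words are only examples.

```shell
#!/usr/bin/env bash

# Remove the lines which consist entirely of one of the listed words.
# -x matches whole lines only, -i ignores case, -v inverts the match.
filter_noise() {
    grep -Evix 'merge|pull|request|from|branch'
}

# Possible usage inside a Git repository:
#   git log --pretty=format:'%s' --after="2020-12-31" | tr ' ' '\n' | filter_noise
```

Since the pipeline puts one word per line, matching whole lines is the same as matching whole words.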

Then, let’s plug the output in this tool. Here’s a possible result:

Word cloud for the Kubernetes project

First, the Kubernetes team seems to speak about tests quite often. It correlates well with the other analyses we’ve done. That’s always good news.

Looking at the important domain concepts, kubelet seems quite important (as we already saw when we analyzed the aggregation of complexity and churn at the file level), as well as node, pod, metric, proxy, and scheduler. If you know a bit about how Kubernetes works, it makes sense.

The sheer number of fixes is a bit alarming, but it also makes sense that a tool as big as Kubernetes needs quite some maintenance. It’s not a greenfield project anymore.

In a Nutshell

It’s already more difficult to get useful information from a bigger project like Kubernetes, but it’s also difficult to do anything in this kind of project without any information. In that context, the measurements we did can be really useful.

Let’s try to make our summary:

  1. Kubernetes is a very well-known project that has been around for almost 8 years, which might explain its size.
  2. The activity on the project began to slow down in 2018, and even more in 2020.
  3. It’s mainly written in Golang, with a lot of fixtures in JSON and YAML for the tests.
  4. It includes a large number of Bash scripts, mostly for “verifying” and “updating” things, and for tests. We should dig deeper (reading code, asking questions) to know in what context we might need to use or to change them.
  5. In general, it seems that there are many tests. It’s fortunate, for a project of this size!
  6. The churn indicates that a couple of packages are often changed: apis/core, pkg/proxy, pkg/apis, and pkg/kubelet. There might be some logical coupling going on there.
  7. The Git comments show us important domain concepts, like kubelet, node, pod, and metric.

Again, I would write that somewhere and use it as reference for any change I need to do in Kubernetes. I would add to the summary the biggest files which change most often, too, or maybe what seems to be the most changed packages.

Do We Need to Know Everything?

It can be difficult to measure everything we want only using basic CLI tools, but they’re great to quickly get some useful information about a codebase. The results can give us interesting clues about what to be careful with, especially if we need to modify the system.

From there, it could be useful to use other tools to get even more information if we need to answer more questions. My favorite tool for that is another CLI, code-maat. For example, it can give more useful information about logical coupling, and perform analysis on bigger entities than files.

It’s important to understand that measuring, by itself, is far from being enough in many situations. We can only make educated assumptions from our measurements. Then, we can use our experience and tacit knowledge to confirm what we need to do, and set our priorities right.

So, what did we see in this article?

  • The longer a system is around, the more complex it might be. Looking at the age of the codebase can already give us a little idea.
  • Looking at the number of commits per year can give us a glimpse of the codebase’s activity.
  • Asking questions to the main developers of a codebase can be an effective way to dig deeper. We can find this information in Git’s history directly.
  • According to many studies, static complexity metrics are more or less correlated to the source lines of code (SLOC). Looking at the size of the project can offer us another glimpse of its complexity.
  • Again, according to some studies, churn (amount of code added and deleted) is one of the best metrics to get the possible hotspots of complexity. It allows us to have a more precise idea of where the complexity might be.
  • Mining Git comments can also give us useful information, even if there might also be a lot of noise in there.

As much as I like code-maat, I find it a bit clunky to install and use (especially since I always need to first create a file with Git output before using it). I’m thinking more and more about developing a CLI tool which could calculate important metrics. The metrics would be more or less “important” according to studies, my experience, and the experience of others.

What do you think about that? What metrics would be useful for you? The comment section is waiting for your input.
