
Cheesemake: a Declarative Build Tool for C Programs

Key Takeaways

  • Writing a build script for a C program can be a challenge for a beginner
  • Declarative build tools have overtaken the imperative style in recent years, but the trend has yet to reach the C world
  • A large proportion of build cases can be covered with a simple script
  • A plugin system allows extensibility to other cases
  • I had to do something during lockdown so I did this

This article will describe how I came to spend some of the last few months writing a build tool for C programs. Although it is a small project, it nicely solves the problem from my point of view and could well be useful for others looking for a simple way to build their C projects. Along the way, I'll also try to say something about getting a software project off the ground, how to tackle technical problems as they arise, and some of the steps on the path to working software.

Getting started is the hardest part

It’s been a long lockdown. Those of us fortunate enough to have suffered nothing more than a long spell indoors during this terrible time have still had to contend with the need to fill all those hours with something more than Netflix and Animal Crossing. In theory this is a great opportunity to start projects that might otherwise have seemed too daunting. There might never again be enough time in life to write a compiler or just learn a new programming language. Personally, I feel that if I can try to create something that I wouldn’t have attempted in normal circumstances then I can tell myself that this summer wasn’t all just a waste. 

The interest that I harbour, but which previously seemed destined to remain unexplored, is GPU programming. It’s not particularly an interest in graphics or computer games or anything like that, but more a curiosity about how parallel computing differs in practice from the sequential kind, and whether the potential speedups are accessible without losing all the programming techniques that make solving complex problems tractable. Perhaps some AI applications will turn out to be interesting, but the intention isn’t to head directly in that direction, more to become familiar with the landscape of GPU programming, get used to the tools, then see what kind of programs look possible from there.

This is all a long way from the web applications that I’m more used to working on, and none of the relevant tools are particularly familiar to me. GPU programming requires a low-level language like C, which in turn necessitates familiarity with compilers, linkers, and even the underlying operating system. Not beginner stuff. Still, I don’t intend to take any shortcuts. I won’t rely on high-level libraries with Python bindings. If I’m going to learn something new, then I might as well learn it from the bottom up.

Initially, things were going well. I bought a Tegra chip from Nvidia, since my XPS13 doesn’t offer much in the way of graphical processing power, and installed all the tools, including Linux, vim, and a C compiler, that I imagined a real hardcore GPU programmer needs. Then, I ran straight into my first hurdle: how do I build C programs?

I know enough to feed some source into a compiler and get an executable out the other end, but I’m used to working in a world where whole projects are compiled, tested, and packaged with one command. None of the build tools available for C offer that kind of convenience out of the box. It seems the usual procedure is to learn an arcane language to configure the build tool, run it to generate an incomprehensible script, then issue a collection of commands that will hopefully result in a working binary.

There also seems to be no default format for C projects. Certainly you can put your source code in a src directory and test sources in test. However, compiling the sources, and linking them against the various tests with the correct flags etc., seems to be a non-standard operation, even though most C projects must do such a thing. Even if you know how to do this with gcc, getting a build tool to output the right Makefile to do the same thing automatically is more difficult than I imagined it would be. Perhaps some of the fault lies with me; no doubt I’m a spoiled millennial who doesn’t want to put in the work to learn something a little tricky. I’m used to programming tools that work at the push of a button, but maybe those tools lack the integrity of ones that require expertise to configure.
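For the sake of concreteness, here is roughly the manual gcc version of such a build for a tiny project. The file names are illustrative, and the check test framework is assumed:

mkdir -p build

# compile each source file to an object with warnings and debug info
gcc -g -Wall -c src/list.c -o build/list.o

# link the objects into the program executable
gcc -g -Wall src/main.c build/list.o -o build/example

# compile and run the tests, linking the object under test
# against check, whose flags come from pkg-config
gcc -g -Wall test/list_test.c build/list.o $(pkg-config --cflags --libs check) -o build/list_test
./build/list_test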

Key features

Cheesemake is my attempt to gather the knowledge needed to build a well laid out, properly tooled C project and present it in a way that requires little or no technical knowledge to use. Cheesemake projects can be configured with a JSON file called recipe.json and built with a simple command. The rest (compiling, linking, and testing) is taken care of by Cheesemake.

Cheesemake employs the “convention over configuration” approach that is common in other parts of the programming world, so sources can’t go anywhere other than src and tests have to live in test. This is a limitation, but it’s also important to make it easy for other programmers to figure out how things work. As mentioned, all the configuration for the build goes in the recipe.json file. Here’s a simple example:

{
    "name": "example",
    "compiler": "gcc -g -Wall",
    "packaging": "executable",
    "dependencies":
        [
            {
                "package": "glib-2.0"
            },
            {
                "package": "check",
                "scope": "test"
            }
        ]
}

My hope is that nothing in the recipe file needs much explanation. The name is the name of the project and of the binary that it outputs. Packages have a scope to determine whether they should be linked with the source or just the tests. Needless to say, there are no arcane directives or mysterious symbols.
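Under the conventions described above, a project built from this recipe needs nothing more than the following layout (the C file names are illustrative):

example/
    recipe.json
    src/
        main.c
    test/
        main_test.c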

Cheesemake has six phases, which are:

  1. Validate, during which static analysis tools can run
  2. Compile, when files which have been changed or added since the last compilation are compiled
  3. Test, when the tests are compiled, linked with the program binaries, and executed
  4. Package, when the compiled binaries are linked into a library or executable
  5. Verify, during which dynamic analysis tools can run
  6. Run, when the program is run.

Calling Cheesemake with one of its phases as an argument, e.g., cmk package, runs every phase in order up to the one specified. Alternatively, cmk clean package will do the same, but it will recompile everything from scratch. Arranging things this way means that the entire build can be run with just one command, which makes it easy to notice if a test or anything else gets broken.
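So a test run, a packaged build, and a build from scratch look like this:

cmk test           # validate, compile, then run the tests
cmk package        # every phase up to linking the binary
cmk clean package  # the same, but recompiling everything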

Implementation

Cheesemake is written in Bash, mainly to avoid adding more dependencies; it seems reasonable to assume that everyone can run Bash these days, since we live in a world where Linux runs on Windows. Having decided to write my script in Bash, one thing that occurred to me is that the core abstraction of Bash is a text stream, so could Bash make a serviceable functional programming language? The answer is a tentative yes, and I’ll describe what I did to emphasise that aspect of the language in my use of it.

The first thing a functional programmer would want when dealing with a text stream is the ability to map an operation over the stream. In this context that would mean taking a function and applying it to every line and piping out the result. Unfortunately map does not seem to be built in to Bash, but it is easy to define:

map()
{
    # apply the function named in the arguments to each line of
    # stdin; word splitting passes the line's columns to the
    # function as $1, $2 and so on
    while read line; do
        $@ $line
    done
}

This is not the only possible way of defining map; in particular, you could map over functions that read stdin, like this:

map()
{
    # feed each line to the function's stdin instead of
    # passing it as arguments
    while read line; do
        echo $line | $@
    done
}

But there are a couple of advantages to doing it the first way rather than the second. Firstly, you can reuse functions from elsewhere that already take parameters, and secondly, the space-separated columns of a line will be assigned to the numbered parameters, so you can reformat lines without resorting to awkward awk expressions. For example, this is how you could use this feature:

format()
{
    # $1 and $2 are the first two columns of each input line
    echo $2 follows $1
}

printf 'one two\nthree four\n' | map format
# prints "two follows one" then "four follows three"

Even better would be a flatmap. That would mean the operation could vary the number of lines in the stream instead of just mapping each input line to one output line. Can you see how to define flatmap? In fact, the answer is that the map we already defined is actually a flatmap. If the operation echos more than once, then multiple lines replace a single line, and if it doesn’t echo at all, then that line is omitted.
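That means a mapped function that only sometimes echoes acts as a filter. Here is a small sketch (the function name is my own):

multi_word_only()
{
    # echo the line only when it has a second column,
    # so single-word lines disappear from the stream
    if [ -n "$2" ]; then
        echo $@
    fi
}

printf 'one\ntwo three\n' | map multi_word_only
# prints just "two three"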

We’ve actually skipped over the most fundamental aspect of any functional language: first class functions. We’ve already seen that we can pass a reference to a function just by passing its name. Since in Bash everything is text, the name of a function is implicitly a reference to it. We don’t have real anonymous lambda functions, but that has the advantage of forcing you to name every function, which is no bad thing I’d say.

Function references turn out to be quite important to writing good Bash for one simple reason: passing data to Bash functions is a bit of a disaster. Aside from the fact that Bash functions don’t have named parameters, an omission we will pass over in silence, Bash functions behave in a way that would surprise programmers from pretty much any other language. It’s easy for a newcomer to the language to write something like:

do_something expression1 expression2

and find that it doesn’t do at all what was expected. Firstly, if expression1 contains a newline, then everything after it will be ignored. Then, if there are any spaces in the one remaining line of expression1, the text will be split up and passed in separate pieces. If this doesn’t seem problematic to you, consider trying to pass JSON data as expression1. You can try your hardest to pass your data as $1 and $2 rather than as $@:

do_something "$(expression1 | tr -d '\n')" "$(expression2 | tr -d '\n')"

This really isn’t worth the effort; just pass a function that gets the data instead:

get_data1()
{
    echo expression1
}

get_data2()
{
    echo expression2
}

do_something get_data1 get_data2
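For completeness, here is a sketch of what the receiving side of this pattern might look like; the command substitutions are quoted, so the data survives intact apart from trailing newlines:

do_something()
{
    # call the callbacks to recover the data; quoting the
    # substitutions preserves spaces and inner newlines
    local first="$($1)"
    local second="$($2)"
    printf '%s\n%s\n' "$first" "$second"
}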

Examples like these show how passing callbacks is an important part of writing robust Bash code. We haven’t quite made a functional language out of Bash (that would require an exploration of algebraic data types), but using some ideas from FP did make the problem of writing good Bash programs a bit more tractable. At the very least, it shows the value of having more than one paradigm up one’s sleeve for when the situation calls for it.

Making it useful in the real world

Back to Cheesemake: I have to acknowledge that it won’t be able to build anything more than simple projects without some kind of module system.

A recipe.json file describes how to build a single binary, whether a library or an executable, but many projects are likely to have more than one output. Furthermore, those outputs may depend on each other or share common code between them.

Cheesemake handles these cases with a simple module system. Each module has its own recipe.json, which can declare dependencies on other modules. Cheesemake doesn’t do anything fancy like determine the dependency order of submodules. Instead, modules are built in the order they are declared in the recipe.json file, and dependencies between modules are resolved simply by copying the build/lib directory between parent and child modules. As a consequence, a module can depend on other modules as long as they are either its own submodules, or siblings that come before it in their parent's recipe.
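So a parent with a single submodule might be laid out like this, with each level getting its own recipe (the names match the recipe below):

example/
    recipe.json
    src/
    test/
    submod/
        recipe.json
        src/
        test/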

Here is a more complete recipe file with a couple of module dependencies:

{
    "name": "example",
    "compiler": "gcc -g -Wall",
    "args": "--no-worries",
    "packaging": "executable",
    "modules":
        [
            "submod"
        ],
    "dependencies":
        [
            {
                "package": "glib-2.0"
            },
            {
                "package": "check",
                "scope": "test"
            },
            {
                "package": "subxample",
                "scope": "module"
            },
            {
                "package": "subsubxample",
                "scope": "module"
            }
        ],
    "plugins":
        [
            {
                "name": "cppcheck",
                "phase": "validate"
            },
            {
                "name": "valgrind",
                "phase": "verify",
                "config":
                    {
                        "args": "wibble"
                    }
            },
            {
                "name": "gcovr",
                "phase": "verify"
            },
            {
                "name": "gprof",
                "phase": "verify"
            }
        ]
}

Finally, let’s review Cheesemake’s plugin system. Cheesemake comes with plugins for splint, cppcheck, valgrind, cuda-memcheck, gprof, and gcovr, and the user is free to define any extra ones locally. I think this is where Cheesemake really shows its value. Normally, configuring all these tools would be plenty of work, even for someone familiar with them, and more so for a beginner. That’s a real shame, because it’s beginners who are particularly in need of good tooling to catch inevitable mistakes. Programming tools are much more valuable when they’re run from an automated build script, because that gives them the best chance of catching errors early while letting the programmer get on with writing the code.

Plugins are bound to either the validate phase, which runs before compilation and is appropriate for static analysis tools, or the verify phase, which runs after the test phase and is appropriate for dynamic analysis tools. Most plugins can be configured with the config property.
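For comparison, this is roughly the kind of invocation the valgrind plugin saves you from typing after every build; the flag and binary path here are illustrative rather than exactly what the plugin runs:

# run the packaged binary under valgrind to catch memory errors
valgrind --leak-check=full ./build/example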

Making it work in the real world

Taking a piece of software from being an interesting toy to a production-ready tool can be one of the most difficult parts of a software project. It's something of a chicken-and-egg problem, because to be sure that everything works you really need real users using the software in anger, but to get those users you need to convince them that it works well enough to use. The best you can do as a developer is to find as many bugs as possible before someone else does.

It would be nice if Cheesemake ran anywhere Bash can run, but in reality that's a tall order, since Bash can run on all sorts of Unices. The situation is made more complicated by the fact that much of what we might think of as Bash is actually separate utilities that come with the OS. The cp command, for example, is implemented differently on Linux and BSD, lacking the useful -u flag on the latter. In some cases it's worth detecting platform-specific features and using them when available, but more often it's better to code to the lowest common denominator so the software behaves the same wherever it's run.
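As a sketch of what coding to the lowest common denominator can look like, a helper like this (the name is my own) recovers the effect of GNU cp's -u flag using only Bash's builtin test, which behaves the same on every platform Bash runs on:

copy_if_newer()
{
    # copy only when the destination is missing or older than
    # the source, which is what GNU cp's -u flag does
    if [ ! -e "$2" ] || [ "$1" -nt "$2" ]; then
        cp "$1" "$2"
    fi
}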

There's no substitute for actually running the software on different OSs. In practice this means creating lots of VMs to try to find any assumptions in the code that don't hold true for all platforms. Cheesemake was tested on Fedora, Debian, Ubuntu on ARM, FreeBSD, and Alpine Linux. That is by no means everything, but it covers a nice range of common choices for desktop, server, and embedded systems. There are also several different compilers that we'd like it to work with, including gcc, clang, and nvcc, so that adds another dimension to the number of possible test scenarios. If you find any glitches in Cheesemake, bug reports are very welcome on GitHub.

Why did I do all this?

It’s been a long lockdown and I’m still not much closer to becoming a GPU programmer. On the bright side I have learned a lot about C compilers and I’ve become a better Bash and C programmer. It may seem strange to start a project by spending weeks writing a build script for it, but the more time I spend writing software the more I realise that building working software is like building a house; it’s important to get the foundations right so you can build on top of them. In fact, the stronger you build the foundations, the higher you can build your edifice.

I tend to believe that when programming is hard it’s often because we’ve made it hard for ourselves. That is not to minimise the effort and technical knowledge required to create working software, but software also has a unique capacity to create abstractions that simplify those difficult tasks. We should be able to present advanced tools in a way that’s accessible to beginners. Enabling others to write great code is an essential part of being a programmer, because there’s a limit to how much a single person with one pair of hands can get done on their own.

I don’t know if my script is the right approach to solve the problem of building C programs, but I feel that I’ve covered enough of the main use cases in 500 lines of code to show that a more usable solution is possible. I know that a lot of programmers today don’t need to write in C and those that do are probably already comfortable with the tool they’re currently using, but C is still a widely used language and plenty of important software projects require knowledge of it or its sibling C++. There is therefore plenty of potential value to be gained by making it easier to get started with the language, not least giving more people the chance to learn about low-level systems programming.

About the Author

Martin Rixham works in IT consulting, applying years of experience in web development and various JVM languages. He divides his time between London and Bangalore where he has the privilege of being part of two great IT communities.
