At the November LLVM developers meeting, Doug Gregor of Apple gave a presentation on adding modules to C. From the talk's abstract:
The C preprocessor has long been a source of problems for programmers and tools alike. Programmers must contend with widespread macro pollution and include-ordering problems due to ill-behaved headers. Developers habitually employ various preprocessor workarounds, such as LONG_MACRO_PREFIXES, include guards, and the occasional #undef of a library macro to mitigate these problems.
Tools, on the other hand, must cope with the inherent scalability problems associated with parsing the same headers repeatedly, because each different preprocessing context could affect how a header is interpreted, even though the programmer rarely wants that.
Modules seeks to solve this problem by isolating the interface of a particular library and compiling it (once) into an efficient, serialized representation that can be efficiently imported whenever that library is used, improving both the programmer's experience and the scalability of the compilation process.
The basic premise of the proposal is to avoid having to use the pre-processor to include vast quantities of headers when compiling even the simplest files, as a way of speeding up compilation and allowing for re-use of previously parsed headers. In the canonical "Hello World" example, he highlights how a 64-character C program becomes 11,074 characters after pre-processing, and an 81-character C++ program becomes 1,161,033 characters after pre-processing. He also notes that re-parsing headers can lead to fragility, since the inclusion depends on the state of the pre-processor at the time (for example, with #define FILE "myfile.txt" before #include <stdio.h>, the pre-processor will mangle the headers and result in a failed build).
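As a minimal sketch of that fragility (the file name and the surrounding code are illustrative only), the macro rewrites every later occurrence of FILE, including the typedef inside stdio.h itself, so the header no longer parses:

/* fragile.c – demonstrates why the build fails; this does not compile */
#define FILE "myfile.txt"   /* intended as an innocent configuration macro */
#include <stdio.h>          /* every FILE inside the header is now the string
                               literal "myfile.txt", so its declarations are
                               syntax errors */

int main(void) {
    FILE *fp = fopen("myfile.txt", "r");  /* FILE is also mangled here */
    if (fp) fclose(fp);
    return 0;
}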
The proposal is to use a new keyword, import, to load a module. Instead of a textual inclusion performed by the pre-processor, the compiler understands that the module's contents are fixed and so parses the module only once; additional uses of the same module re-use the previously parsed data structures instead of starting anew each time.
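Under the proposal, the "Hello World" example from the talk would look something like the following; the import syntax follows the slides and is not (yet) standard C:

// hello.c – hypothetical client code under the modules proposal
import std.stdio;                // loads the pre-compiled std.stdio module

int main(void) {
    printf("Hello, World!\n");   // printf comes from the module, not a textual #include
    return 0;
}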
Modules can also be nested, which allows a sub-module to be imported; in the example given, he demonstrates a std module with a stdio submodule, leading to inclusion with import std.stdio. Importing a module brings all of its public API into the client while hiding its non-public APIs. For that to happen, the module needs to declare what is and isn't public, for which the public keyword can be used:
// stdio.c
export std.stdio:

public:

typedef struct { … } FILE;
int printf(const char*, …) { … }
Note that in this example, it is sufficient to provide just the implementation file – no headers are needed. The export line names the module (std.stdio in this case) and public demarcates the public from the non-public parts of the API. This can be compiled into both the library itself and enough metadata for the function types and macros to be available in client code.
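As a sketch of how that split might look in practice (the mylib.logger module and its functions are hypothetical, following the syntax shown in the talk), anything declared before the public: marker stays internal to the module:

// logger.c – hypothetical module with both internal and exported functions
export mylib.logger:

// internal: not visible to clients that import mylib.logger
void write_raw(const char *msg) { /* write to the log backend */ }

public:

// exported: part of the module's public API
void log_info(const char *msg) { write_raw(msg); }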
Of course, this is a suggestion for the future and not yet a standard, so how will the transition happen? The proposal is to keep using headers as the public API of existing libraries, and to define modules as sets of headers:
// /usr/include/module.map
module std {
    module stdio  { header "stdio.h" }
    module stdlib { header "stdlib.h" }
    module math   { header "math.h" }
    exclude header "assert.h"
}

module ClangAST {
    umbrella "AST/AST.h"
    module * { }
}
// can use "import ClangAST.Decl" for AST/Decl.h
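With such a map in place, client code could import just the submodule it needs; the following fragment (illustrative only, using the proposed syntax) pulls in the math.h API via the std.math submodule:

// client.c – assumes the module map above is installed
import std.math;                      // maps to math.h through /usr/include/module.map

double hypotenuse(double a, double b) {
    return sqrt(a * a + b * b);       // sqrt() is provided by the std.math module
}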
To facilitate module generation in the future (and in part to make it easy for Objective-C frameworks to export modules), an 'umbrella' header allows every header in a directory to be exported as a single module, as the ClangAST example above shows.
Compilers that are adapted to process modules can take advantage of a single pass over the headers to construct the module, and then re-use that module information in subsequent compilations. (The format of the compiled module isn't yet specified, and may be compiler-specific.) Modules may also carry additional meta-information, such as which libraries are required for the module to run – this will allow the compiler to pick up the linking flags needed for each of the modules without the user having to build up a large array of -l flags at link time.
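One way such metadata might be expressed is alongside the header declarations in the module map; the link declaration below is purely illustrative, since the exact spelling of this information was not fixed at the time of the talk:

// module.map (sketch) – records the library a module's clients must link against
module zlib {
    header "zlib.h"
    link "z"        // importing zlib would implicitly add -lz at link time
}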
To take advantage of modules, the only change required on the client side is to replace each #include with an equivalent import. Additionally, modules can provide better compiler diagnostics: since the compiler knows which modules export which functions and types, errors and IDE quick fixes can suggest adding the required import instead of falling over.
Finally, re-using module information also permits debugging information to be associated with the module, instead of with every object file that replicates its contents. This results in both the compiler and linker emitting less debugging information, which in turn speeds up compilation. It also provides additional type information to the debugger, so the debugger will report the correct types as defined in the module (instead of types reconstructed from the text inlined into each object file).
The net effect of the modules proposal is to provide a transition path and compatibility with existing tools, whilst providing benefits (primarily in the form of compilation speed and improved diagnostics and debugging) with very little change required by users. It also provides an incremental way for files to be upgraded, switching from individual pre-processor inclusions to module-based imports.

Whilst no compilation speed measurements were given as part of the presentation, work is already underway in LLVM to implement modules. And although the proposal does not take on either versioning or name-spacing (largely due to the need for backward compatibility), the benefits, if widely adopted, should give a significant boost to the compilation speed of C and C++ based programs. Furthermore, the backward compatibility story has been explicitly laid out and, like the LLVM blocks specification, is likely to be made available for inclusion in other compilers or standards as and when required. However, of the widely used C and C++ compilers, the LLVM toolchain has been the only one innovating and leading by example; whether these features are introduced in other compilers will likely depend on the success of the LLVM implementation and the benefits it confers.