Key Takeaways
- Code can be redundant for a variety of reasons ranging from unused variables to incomplete changes and abandoned development.
- Redundant code has a number of consequences including bloated source code, reduced reliable and reduced maintainability. In some cases dead code can also effect performance.
- To detect it the author developed a tool that created the Abstract Syntax Tree for C# source by using Roslyn and trained it using a number of GitHub projects including Roslyn and MSBuild.
- Once detected redundant code can be removed either manually by deleting it or commenting it out, or by using an automatic tool. The use of feature branches can help to prevent redundant code getting checked into the main code branch.
Introduction
Some time ago I developed a tool that analysed dependencies within source code. It created the Abstract Syntax Tree for C# source by using Roslyn and C++ source by using libclang. To validate that it was working as intended I then implemented functionality to identify unused methods. This showed that the parsing of my C# code was much more accurate than that of my C++ code, hence I chose to focus on the further development of the C# parser and other people’s more sophisticated C# code.
Initially the tool would flag the lines where the redundant methods existed, as the scale of the problem became apparent an option to automatically delete those lines was implemented. A typical analysis would involve running the tool repeatedly to prune back the source tree as brutally as possible. This was then followed by several cycles of reverting changes so as to get successful builds and then passing tests. The reasons for failure being that the tool had behaved incorrectly or there was a known limitation, examples of the latter being reflection or the existence of a code contract.
The tool was trained on various GitHub repositories for C# projects that were chosen on the basis that I had used them and thus wanted to contribute back. Ultimately a pull request was submitted to the community asking for discussion of the changes in my branch. As the tool is brutal and I was engaging online with people for the first time this was where diplomacy was required and hopefully I didn’t offend too many people. By contributing and being involved in the subsequent discussions my understanding of the issues was greatly increased and this article is an attempt to feed that through into the wider community.
The GitHub projects analysed
Mono.Cecil decompiles .NET code to C# and written by JB Evain. It was suggested that a mere 36 lines be deleted and JB chose to independently add some of the changes after review rather than merge in the branch.
Automatic Graph Layout is an official Microsoft project developed by Lev Nachmanson, Sergey Pupyrev, Tim Dwyer, Ted Hart, and Roman Prutkin for drawing graphs and directed graphs and is used by Visual Studio for displaying various interactive diagrams. The pull request was for deletion of 4674 lines of code, a number of which were to do with SilverLight (the deprecation of which was announced in 2015). The branch was merged in without modification or discussion.
Roslyn is the modern C# compiler maintained by a team from the .NET foundation. In this instance the pull request was for 18364 lines to be deleted, which resulted in a good discussion and led to most of the categorisation which is discussed below. Obviously it was too large to be merged and instead led to a number of individual issues being raised.
MSBuild is an official Microsoft project that any user of Visual Studio should be familiar with. Analysis led to a pull request for 3722 lines to be deleted, unfortunately the team didn’t have the capacity at the time to review the suggested changes.
The last to be analysed was the System.XML assembly within the .NET Core foundational libraries. These libraries are maintained by the .NET foundation and the team had a tracking issue raised to eliminate dead code. The issue was addressed by a per assembly process that involved trimming the assembly (more commonly known as dead code elimination) and then creating a difference between the untrimmed and trimmed assemblies to identify what compiled code had been removed. This difference then informs the source code to be deleted, the work typically undertaken by the volunteer community.
Given a sample of only five projects it is not wise to draw any conclusions, especially given a lack of reliable statistics:
Mono.Cecil | MSAGL | Roslyn | MSBuild | CoreFX (System.XML) | |
Deleted | 36 | 4674 | 18364 | 3722 | 427 |
Initial commit | 2010/04/12 | 2015/02/22 | 2014/03/18 | 2015/03/12 | 2014/11/08 |
Authors (main) | 39 (1) | 25 (4) | 285 (31) | 90 (6) | 526 (29) |
The initial commit date is notable for MSBuild given its long history. Just counting the number of authors without assessing their contribution is almost certainly meaningless so a rough estimate of the significant contributors is given. Having said that I would speculate that:
- Newly developed projects have more redundant code than mature projects
- Code is written assuming it will be needed
- Fewer tests corresponding to fewer bugs found
- Larger teams produce more redundant code
- Team communications scale quadratically with team size, see Brooks
Categorisation of test results
The tool tests for redundant methods and like any test it can be analysed for success and failure.
The true negative:
- Code is useful.
The false negatives:
- Test only code
- Dead code
- Although the code is called it does nothing useful and the call can be omitted.
- Could be duplicate code.
- Note that deprecated code can be considered to be one step away from being dead code.
The true positives:
- Intentionally abandoned development
- Methods are added but change of circumstance means they are never used.
- Incomplete change
- As code is changed some methods have all their references deleted but the subsequent commit hasn't taken this into account. This is sometimes called oxbow code. It may be an unintended artefact of refactoring.
- Unused variables
- Hopefully this will generate a compiler warning and/or be reported by the IDE.
The false positives:
- Deficiencies in the tool
- The tool does not detect the use of reflection or code contracts and relies on errors in build or test to detect these.
- Regression
- This was seen to occur when, during the course of development, a method had been altered to no longer call the method identified as dead. Obviously there was no test implemented to pick up this regression.
- Lack of conditional compilation
- Typically this referred to a lack of #if DEBUG guards. Roslyn had been used to build Release flavours and this had identified bloat in the corresponding binary.
- Unintentionally abandoned development
- This was identified with the specific case of test data but no corresponding test. In the general case I would hope this was a temporary problem as more code gets added.
- Unused in the subsystem
- Insufficient code was analysed to find anything using the method. This is mainly an issue when analysing public methods although it can also occur due to reflection. A common example is when test helper functionality is processed without the corresponding test code although that raises a design issue of where the test helper code should be located.
- Debugger only methods
- Some projects have methods to print out state when the debugger is attached (potentially to a release build). These should be marked in some way so that they are not analysed.
It is to be noted that where the tool fails is not necessarily due to a lack of inference. Some of the cases should be covered by other tools. e.g. compiler warnings should flag some of the issues, duplicate code detectors are also useful.
The effects of redundant code
The false negatives and true positives can be collectively viewed as instances of YAGNI and raise the following considerations:
- Bloated source
- In the past I’ve experienced text editors which couldn’t open large files.
- Developer cognition is limited by the number of independent entities that can be held simultaneously in the brain. Usually this is taken to be seven so removing methods will lead to better ability to reason.
- Any time taken by the developer to read and understand the source is most probably wasted.
- Modern IDEs typically create Abstract Syntax Trees from source so redundant code will slow them down.
- Bloated executables
- This is particularly a problem for mobile phones where upgrades lead to the removal of apps so as to free up storage, ultimately leading to having to upgrade the device when functionality is too compromised.
- Bloated runtimes
- A greater attack surface for the hacker raises security concerns.
- Some applications for embedded devices allocate all dynamic memory at start so redundant code will reduce the size of the available heap.
- Reduced performance
- In the specific case of dead code, execution will waste processor cycles.
- Reduced reliability
- In the specific case of dead code, execution may cause a crash.
- Reduced maintainability
- If the code unexpectedly stops being redundant then the behaviour is not predicted. e.g. Knight Capital losing $440 million.
- The percentage of code coverage is potentially increased by false negatives and decreased by false positives.
- Unnecessary tests will slow the execution of the test suite.
- Static analysis tools will have to process more and thus work more slowly.
- The design / architecture of the software appears more complicated than it actually is.
Given the wide variety of problems that may occur due to redundant code it is best to treat it as a code smell.
Management of existing redundant code
There are four ways of dealing with redundant code:
- Ignore it
- At least it will still compile although there is an associated maintenance cost.
- Not recommended.
- Automatic removal from executable
- If a compiled language use any dead code elimination option in the linker to remove it.
- If a dynamic language use tree shaking to eliminate at run-time.
- Again, not recommended.
- Block it out
- Either comment it out or use pre-compiler directives.
- Use code / comment folding in the IDE to avoid seeing it.
- Delete it
- Rely on source code control to keep the old copy for reference if you know to look for it and can find it.
- Alternatively you may never need it again.
If the source code is going to be changed then any changes should be reviewed by someone familiar with that section of the code to ensure that the code is really redundant.
Management of new development
Assuming the aim is not to introduce new redundant code during development then what stratagems are to be deployed? If partial commits are allowed then it is likely that redundant code will be temporarily added to the code base. This can be mitigated by using feature branches and only merging into the master branch once the tests provide coverage and the source passes static analysis. Another advantage of feature branches is that when development is abandoned the branch can remain unmerged.
About the Author
Zebedee Mason is a numerical analyst with 25 years experience in the CAD/CAM/CAE industry who has recently moved into SLAM. Over the years he has worked on legacy code bases (some of which originated in the days of his childhood), re-written numerous pieces of academic code into components of commercial software and may even have written a few original algorithms (commercial confidence prevents confirmation). This background has given him an appreciation of software tools, of which he has written a number to increase productivity in spite of the interesting choices made by previous developers and IT managers.