.NET 4.6 comes with several CLR features to improve performance. Some are automatically enabled, while others, such as SIMD and async local storage, require changes to how you write your applications.
SIMD
A long-running point of pride for the Mono team has been its support for SIMD, or Single Instruction Multiple Data, vectors. This is a CPU-level construct for performing the same operation on up to 8 values at the same time. With version 4.6 of the .NET CLR, Windows developers can do the same.
To see SIMD in action, consider this example. Say you want to add two arrays together to produce a third array in the form c[i] = a[i] + b[i]. Using SIMD you would write this:
for (int i = 0; i < size; i += Vector<int>.Count)
{
    Vector<int> v = new Vector<int>(A, i) + new Vector<int>(B, i);
    v.CopyTo(C, i);
}
Notice how the loop increment is Vector<int>.Count. This will typically be 4 or 8 depending on your CPU. The .NET JIT compiler will recognize the CPU and emit code to add the arrays together in batches of 4 or 8. Also note that this loop assumes size is an exact multiple of Vector<int>.Count; any leftover elements need a separate scalar pass.
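A fuller, self-contained version of the loop might look like this sketch (the method and array names are illustrative, and a scalar tail loop handles lengths that are not a multiple of Vector<int>.Count):

```csharp
using System;
using System.Numerics; // on .NET 4.6, add a reference to System.Numerics.Vectors

class SimdAdd
{
    // Adds two equal-length arrays element-wise: result[i] = a[i] + b[i].
    public static int[] Add(int[] a, int[] b)
    {
        int[] c = new int[a.Length];

        // Process full SIMD-sized batches.
        int i = 0;
        int lastBlock = a.Length - (a.Length % Vector<int>.Count);
        for (; i < lastBlock; i += Vector<int>.Count)
        {
            Vector<int> v = new Vector<int>(a, i) + new Vector<int>(b, i);
            v.CopyTo(c, i);
        }

        // Scalar tail for leftover elements when the length
        // is not a multiple of Vector<int>.Count.
        for (; i < a.Length; i++)
            c[i] = a[i] + b[i];

        return c;
    }

    static void Main()
    {
        int[] a = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        int[] b = { 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 };
        Console.WriteLine(string.Join(",", Add(a, b))); // 11,22,33,...,110
    }
}
```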
This can be a bit tedious, so Microsoft also offers a set of helper types, including:
- Matrix3x2 Structure
- Matrix4x4 Structure
- Plane Structure
- Quaternion Structure
- Vector Class
- Vector(T) Structure
- Vector2 Structure
- Vector3 Structure
- Vector4 Structure
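As a quick illustration of these helper types, the Vector3 structure offers vectorized component-wise arithmetic and common geometric operations (a minimal sketch; the values are arbitrary):

```csharp
using System;
using System.Numerics;

class Vector3Demo
{
    static void Main()
    {
        var a = new Vector3(1f, 2f, 3f);
        var b = new Vector3(4f, 5f, 6f);

        Vector3 sum = a + b;            // component-wise addition: (5, 7, 9)
        float dot = Vector3.Dot(a, b);  // 1*4 + 2*5 + 3*6 = 32
        float len = b.Length();         // sqrt(16 + 25 + 36)

        Console.WriteLine(dot); // 32
        Console.WriteLine(len);
    }
}
```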
Assembly Unloading
Most developers don’t know this, but .NET often loads the same assembly twice. This can happen when it first loads the IL version of an assembly and then loads the corresponding NGEN (i.e. precompiled) version of the same assembly. This is rather wasteful in terms of physical memory, especially for large 32-bit applications such as Visual Studio.
With .NET 4.6, the CLR will free the memory used by the IL version of the assembly once the NGEN version has been loaded.
Garbage Collection
In the past we’ve talked about the Sustained Low-Latency Garbage Collection for .NET 4.0. While this is certainly more reliable than completely turning off the GC for a period of time, it isn’t a complete solution for many GC scenarios.
In 4.6, you have a more sophisticated way to temporarily turn off the garbage collector. The GC.TryStartNoGCRegion method allows you to specify how much memory you need from the small and large object heaps.
If there isn’t enough memory available, the runtime will either return false or block until the GC has freed enough memory; you control this behavior by passing a flag to TryStartNoGCRegion. If you do successfully enter a no-GC region, you’ll need to call GC.EndNoGCRegion when you are done.
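The basic pattern looks like this sketch (the 16 MB budget is illustrative; the latency-mode check guards against calling EndNoGCRegion after the runtime has already exited the region):

```csharp
using System;
using System.Runtime; // for GCSettings and GCLatencyMode

class NoGCDemo
{
    public static void Run()
    {
        // Ask the GC to reserve roughly 16 MB so allocations in the
        // critical section below won't trigger a collection.
        if (GC.TryStartNoGCRegion(16 * 1024 * 1024))
        {
            try
            {
                // Latency-critical work that allocates less
                // than the reserved budget.
                var buffer = new byte[1024];
            }
            finally
            {
                // Must be paired with TryStartNoGCRegion; only call it
                // if the runtime is still in the no-GC region.
                if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                    GC.EndNoGCRegion();
            }
        }
    }

    static void Main()
    {
        Run();
        Console.WriteLine("done");
    }
}
```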
The documentation doesn’t indicate whether or not this technique is thread-safe. Given how the GC works, you’ll probably want to avoid having two threads trying to change the GC state at the same time.
Another improvement in the GC is how it handles pinned objects. Though not made clear in the documentation, when you pin an object you often effectively pin the adjacent objects. Rich Lander writes,
The GC now handles pinned objects in a more optimized way. It is now possible for the GC to compact more memory around pinned objects. This change can provide a surprisingly impactful improvement for large-scale workloads with significant use of pinning.
The GC is also smarter about how it uses memory in older generations. Rich continues,
Promotion of generation 1 objects to generation 2 has been updated to use memory more efficiently. The GC attempts to use free space in a given generation before allocating a new memory segment. A new algorithm has been adopted that uses a free space region to allocate an object that more closely matches the object size.
Async Local Storage
The final improvement isn’t directly related to performance, but it can be used to that effect. Before the popularization of asynchronous APIs, one could cache information in what’s known as thread local storage (TLS). This acts like a global that is scoped to a specific thread, which means you can access contextual information and caches without having to explicitly pass around a context object.
With async/await, thread local storage isn’t useful. Every time you call await, there is a chance that you’ll jump to another thread. Or at the very least, some other bit of code will use your thread and possibly corrupt your TLS.
The new version of .NET addresses this with the introduction of async local storage. ALS is semantically equivalent to thread local storage, but it travels with you across await calls. It is exposed via the AsyncLocal<T> class, which in turn uses the CallContext class to store the data.