Brian Goetz and Cliff Click spoke at the JavaOne conference last week about the concurrency revolution from a hardware perspective. They opened the discussion by noting that clock rates increased exponentially for decades but are no longer doing so. For years, CPU designers focused on raising sequential performance through techniques such as higher clock frequencies and Instruction Level Parallelism (ILP), but that approach has reached its limits. Going forward, designers will focus on parallelism to increase throughput.
Brian gave an overview of the main periods in CPU history, from the CISC and RISC eras to today's multi-core machines. Cliff discussed the impact of data caching on system performance. As a general principle, he said, developers should think about data, not code, and data locality should be a primary design concern for high-performance software. It is also important to follow the principle of "share less, mutate less": sharing mutable data is undesirable because it causes cache contention and requires synchronization.
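Cliff's data-locality point is straightforward to demonstrate. The sketch below is illustrative rather than taken from the talk: it sums the same matrix twice, and the row-major traversal, which follows the memory layout, typically runs several times faster than the column-major traversal, which strides across cache lines on every access.

```java
// Minimal data-locality sketch: both loops do the same arithmetic, but the
// row-major traversal walks memory sequentially while the column-major
// traversal jumps across cache lines on every access.
public class LocalityDemo {
    static final int N = 2048;
    static final int[][] matrix = new int[N][N];

    // Cache-friendly: the inner loop follows the memory layout of each row.
    static long sumRowMajor() {
        long sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += matrix[row][col];
        return sum;
    }

    // Cache-hostile: the inner loop strides across rows, defeating the cache.
    static long sumColumnMajor() {
        long sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += matrix[row][col];
        return sum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long a = sumRowMajor();
        System.out.println("row-major sum " + a + ": " + (System.nanoTime() - start) / 1000000 + " ms");

        start = System.nanoTime();
        long b = sumColumnMajor();
        System.out.println("column-major sum " + b + ": " + (System.nanoTime() - start) / 1000000 + " ms");
    }
}
```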
Some of the point solutions to achieve concurrency in applications are:
- Thread Pools and Worklists: The thread pool and work queue approach is a reasonable solution for coarse-grained concurrency, such as server applications handling medium-weight requests (database, file, and web servers). Library support was added in JDK 5 (a minimal executor sketch follows the list).
- Fork/Join: The fork/join technique is used for recursive decomposition. It is a good fit for lightweight, CPU-bound problems that fit in memory, but not for I/O-bound operations. The fork/join framework is already available in the OpenJDK project and will be part of the JDK 7 release (see the fork/join sketch after the list).
- Map/Reduce: The map/reduce approach decomposes data queries across a cluster. It was designed for very large, usually distributed, input data sets, and the framework handles concerns such as distribution, reliability, and scheduling. Open-source implementations such as Hadoop are available (an in-process word-count sketch follows the list).
- Actors: In the actor model, no state is shared; all mutable state is confined to individual actors, which communicate by sending messages to one another. The model works well in Erlang and Scala and is possible in Java, though it requires more discipline to implement (a hand-rolled sketch follows the list).
- Software Transactional Memory (STM): The software transactional memory approach has been sold as "garbage collection for concurrency". It works well in Clojure because the language is mostly functional and limits mutable state.
- Graphics Processing Units: Graphics Processing Units (GPUs), which have many simple cores, excel at applying the same operation to large amounts of data. They are widely supported by APIs such as CUDA, OpenCL, and Microsoft's DirectCompute.
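To make the thread pool item concrete, here is a minimal sketch using the executors added to java.util.concurrent in JDK 5; the request handling is a placeholder rather than anything from the talk.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Coarse-grained concurrency with a fixed thread pool and a work queue
// (java.util.concurrent, added in JDK 5). Each submitted task stands in for
// one medium-weight server request.
public class ThreadPoolSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (int i = 0; i < 100; i++) {
            final int requestId = i;
            pool.submit(new Runnable() {
                public void run() {
                    handleRequest(requestId); // placeholder for real request handling
                }
            });
        }

        pool.shutdown(); // stop accepting new work and let queued tasks drain
    }

    static void handleRequest(int id) {
        System.out.println("handled request " + id + " on " + Thread.currentThread().getName());
    }
}
```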
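For fork/join, the following sketch shows recursive decomposition with the JDK 7 ForkJoinPool and RecursiveTask API; the range-sum task and threshold are illustrative.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursive decomposition with the fork/join framework: a range sum is split
// in half until the pieces are small enough to compute sequentially.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10000;
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {          // small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;             // otherwise split into two subtasks
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                           // run the left half asynchronously
        return right.compute() + left.join();  // compute the right half, then join
    }

    public static void main(String[] args) {
        long[] data = new long[1000000];
        java.util.Arrays.fill(data, 1L);
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println("sum = " + pool.invoke(new SumTask(data, 0, data.length)));
    }
}
```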
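The map/reduce item is about cluster-scale frameworks, but the shape of the two phases can be shown in-process. The word-count sketch below is a local illustration only; a framework such as Hadoop would additionally handle distribution, scheduling, and failure recovery.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-process illustration of the map/reduce shape using the classic word count.
public class WordCountSketch {

    // Map phase: turn one input document into partial (word, count) pairs.
    static Map<String, Integer> map(String document) {
        Map<String, Integer> partial = new HashMap<String, Integer>();
        for (String word : document.split("\\s+")) {
            Integer n = partial.get(word);
            partial.put(word, n == null ? 1 : n + 1);
        }
        return partial;
    }

    // Reduce phase: merge the partial counts produced by the mappers.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                Integer n = totals.get(e.getKey());
                totals.put(e.getKey(), n == null ? e.getValue() : n + e.getValue());
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] documents = {
            "think about data not code",
            "share less mutate less",
            "data locality matters"
        };
        List<Map<String, Integer>> partials = new ArrayList<Map<String, Integer>>();
        for (String doc : documents) {
            partials.add(map(doc)); // each map call is independent and could run in parallel
        }
        System.out.println(reduce(partials));
    }
}
```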
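Finally, for actors in plain Java, here is a hand-rolled sketch (class and message names are made up for illustration) that confines all mutable state to a single thread behind a mailbox.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal actor sketch: the mutable counter is confined to the actor's own
// thread, and other threads interact with it only by sending messages.
public class CounterActor implements Runnable {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<String>();
    private long count = 0; // mutable state, touched only by the actor's thread

    public void send(String message) {
        mailbox.offer(message);
    }

    public void run() {
        try {
            while (true) {
                String message = mailbox.take();   // block until a message arrives
                if ("stop".equals(message)) break;
                count++;                           // no locks needed: single-threaded access
            }
            System.out.println("processed " + count + " messages");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CounterActor actor = new CounterActor();
        Thread actorThread = new Thread(actor);
        actorThread.start();

        for (int i = 0; i < 1000; i++) {
            actor.send("increment");               // other threads only send messages
        }
        actor.send("stop");
        actorThread.join();
    }
}
```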
The speakers concluded the presentation by saying that CPUs have changed significantly under the hood, the performance model has changed with them, and the new direction is more, simpler cores. They also suggested that developers think about parallel computing requirements from the earliest phases of the application development process.