Amongst the flurry of announcements at re:Invent 2016 was the launch of a developer preview for a new F1 instance type. The F1 comes with one to eight high-end Xilinx Field Programmable Gate Arrays (FPGAs) to provide programmable hardware to complement the Intel Xeon E5-2686 v4 processors, with up to 976 GiB of RAM and 4 TB of NVMe SSD storage. The FPGAs are likely to be used for risk management, simulation, search and machine learning applications, or anything else that can benefit from hardware-optimised coprocessors.
FPGAs have been around since the mid-1980s, and have proven popular as a means to prototype Application Specific Integrated Circuits (ASICs) or to create specialist hardware where the production volume doesn't justify fabricating an ASIC. For a typical application, dedicated hardware can be around 1,000 times faster than software running on a CPU, and FPGA fabrication has benefited from Moore's Law improvements over time, so that speed advantage has held over the last few decades. Unfortunately the FPGA speed advantage has been difficult and expensive to exploit, as the devices need to be programmed using a Hardware Description Language (HDL) such as Verilog or VHDL. Attempts have been made to program FPGAs in higher-level languages like C and frameworks such as OpenCL, but there's often a tradeoff between simplicity and speed.
The F1 instance type offers Xilinx UltraScale+ VU9P devices, which are fabricated using a 16nm process. Each chip offers just over 2.5 million logic elements. The FPGAs are connected to the CPU and memory by a PCIe x16 interface, delivering up to 12 Gbps of bidirectional communication with shared access to the same memory space. Where there are multiple FPGAs within an instance they can also share access to a 400 Gbps bidirectional ring for low-latency communication. Although that all sounds pretty impressive, it will be a disappointment to those who've been making use of FPGAs utilising QuickPath Interconnect (QPI), as QPI offers even higher bandwidth and lower latency for an FPGA placed into a CPU socket. The bidirectional ring will, however, prove advantageous to designs that harness more than one FPGA.
Getting a working tool chain to start FPGA programming can often be a challenge, and Amazon have addressed this by offering Amazon Machine Images (AMIs) containing design and verification tools. The FPGAs can be coded using VHDL or Verilog, and verified with the Xilinx Vivado Design Suite. Once a design is complete it can be packaged into an Amazon FPGA Image (AFI), a deployment artefact analogous to the familiar AMI. AFIs can then be listed in the AWS Marketplace so that FPGA-based applications can be offered to others.
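As a rough sketch of what the AFI registration step could look like programmatically, the snippet below uses the EC2 CreateFpgaImage call from the AWS SDK for Go; the bucket, key and image names are hypothetical placeholders, and the exact workflow during the developer preview may differ from what the SDKs now expose:

```go
// Minimal sketch of registering an Amazon FPGA Image (AFI) with the
// AWS SDK for Go. The bucket, key and names are hypothetical; the
// design checkpoint tarball would come out of the Vivado build flow.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-east-1"),
	}))
	svc := ec2.New(sess)

	out, err := svc.CreateFpgaImage(&ec2.CreateFpgaImageInput{
		Name:        aws.String("example-afi"),
		Description: aws.String("Example accelerator design"),
		// Design checkpoint tarball from the Vivado flow, uploaded to S3.
		InputStorageLocation: &ec2.StorageLocation{
			Bucket: aws.String("my-fpga-designs"), // hypothetical bucket
			Key:    aws.String("dcp/example_design.tar"),
		},
		LogsStorageLocation: &ec2.StorageLocation{
			Bucket: aws.String("my-fpga-designs"),
			Key:    aws.String("logs/"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// The AFI is built asynchronously; these identifiers are used to
	// track it and, once available, load it onto an F1 instance.
	fmt.Println(aws.StringValue(out.FpgaImageId), aws.StringValue(out.FpgaImageGlobalId))
}
```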
FPGAs get used in applications where the time investment to program the hardware pays off in scale or latency savings versus software on general-purpose CPUs. Typical examples are line-speed deep packet inspection in networks (e.g. Xilinx paper [pdf]), low-latency trading systems in financial services (e.g. University of Heidelberg paper [pdf]), optimisation of search engines (e.g. Microsoft Project Catapult), and machine learning (e.g. it seems that the Tensor Processing Unit [TPU] was prototyped with FPGAs). Some of those use cases are very network centric, which implies that the traffic would need to migrate to the AWS network for FPGAs to be useful. For example, AWS would have to displace existing colocation providers as the preferred venue for trading and financial market liquidity for low-latency trading to make sense as an application. In the short term, it's likely that F1 will mostly be used for machine learning and specialist data analysis applications.
One of the surprises with F1 is that it isn't based on Intel's Xeon chips that combine a CPU and FPGA in the same package, a product of Intel's 2015 acquisition of Altera; so perhaps there will be an F2 instance type not too far away. The availability of FPGAs as a service will also significantly lower the barriers to entry for using the technology, which could spur a fresh round of effort towards simplifying the developer experience. One early sign of this happening is Reconfigure.io's announcement of a platform that provides Go language support for FPGA accelerators, so Communicating Sequential Goroutines in software could become Communicating Parallel Goroutines in silicon. It's also worth noting that development boards like MyStorm are making FPGAs and their development tools accessible to hobbyists and enthusiasts.
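To illustrate the programming model Reconfigure.io is targeting, the sketch below is plain Go in which each pipeline stage is a goroutine communicating over channels; it is not Reconfigure.io's API, just the kind of CSP-style structure that such tooling aims to map onto parallel hardware blocks:

```go
// A tiny CSP-style pipeline: each stage runs as a goroutine and
// communicates over channels. In software the stages are scheduled
// sequentially on a CPU; on an FPGA they could become concurrent
// hardware blocks wired together by the channels.
package main

import "fmt"

// square reads values, emits their squares, and closes its output
// when the input is exhausted.
func square(in <-chan int, out chan<- int) {
	for v := range in {
		out <- v * v
	}
	close(out)
}

// sum accumulates everything it receives and reports the total.
func sum(in <-chan int, done chan<- int) {
	total := 0
	for v := range in {
		total += v
	}
	done <- total
}

func main() {
	nums := make(chan int)
	squares := make(chan int)
	done := make(chan int)

	go square(nums, squares)
	go sum(squares, done)

	for i := 1; i <= 8; i++ {
		nums <- i
	}
	close(nums)

	fmt.Println(<-done) // 204, the sum of the first eight squares
}
```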