In this podcast, we talk to Changhoon Kim, who is Director of System Architecture at Barefoot Networks and is actively involved in the P4 Language Consortium. We talk about the new Protocol Independent Switch Architecture (PISA), which promises multi-terabit switching, and about P4, a domain-specific programming language designed for networking.
Key Takeaways
- The P4 language lets you program the data plane of network switches that support the Protocol Independent Switch Architecture (PISA).
- P4 logic runs at full data rate, up to Tb/s.
- Interesting applications include in-band telemetry, which annotates packets with information that traces each packet's path and timing through the network with no extra processing overhead.
- Other applications are small but extremely fast caches inside the switch.
- Hardware supporting PISA is available as well as simulators and other tools.
- 1:40 You can introduce your own new ideas, or remove unneeded functionality.
- 1:50 Programmability is a natural blessing.
So why haven’t switches always been programmable?
- 2:00 It’s a great question. There used to be a non-trivial amount of penalty associated with programmability, especially for high-speed networking ASICs.
- 2:15 If you have to guarantee very low latency (sub-microsecond), programmability means additional logic, so performance in terms of packets per second would have been lower and power consumption higher.
- 2:45 We are at an inflection point in the silicon and networking industry; people have realised that programmability in the data plane is now achievable.
What is the difference between a data plane and a control plane?
- 3:15 Networking systems are composed of a data plane and a control plane. The control plane is a general-purpose CPU which runs a variant of Linux and a number of routing protocol daemons.
- 3:40 The interesting thing about networking systems is that they also need a data plane - essentially a high-performance networking chip (or ASIC) that is connected to the control plane through PCIe.
- 3:55 All the packets arrive at the data plane, pass through multiple stages of table matches, and then go back out onto the wire.
- 4:10 The data packets never go into the control plane - what it does is manage the routing tables which are downloaded to the data plane.
What are P4 and PISA?
- 4:45 The control plane is software-based, so you can download your own open-source software to create your own control plane.
- 5:10 The data plane has always been built with fixed-function chips, with the networking protocols implemented in hardware using Verilog or VHDL.
- 5:30 The data plane chip can now be built to be programmable, and this new kind of chip is what we call PISA (Protocol Independent Switch Architecture).
- 6:05 This is a domain-specific processor, so you need a domain-specific language to express what you need to do with this chip, so this language is called P4 (Programming Protocol-independent Packet Processors).
So is this similar to the domain-specific programming languages used for GPUs?
- 6:45 GPGPUs are batch-processing machines where you can do massively parallel processing, so they have their own architecture.
- 7:00 PISA is built for networking: it has packet parsers as well as multiple match-action units, and each stage is identical.
- 7:20 Architecturally, they are also different. An FPGA is a programmable gate array, so you can implement any kind of logic - whether it is processing-intensive or functionally complex.
- 7:35 At the same time, it’s less efficient in terms of number of gates than if you had to realise the architecture in silicon.
- 7:50 FPGAs allow for greater experimentation during the prototyping phases of a project.
- 7:55 PISA is on the other hand an ASIC, which is as efficient as it can be.
- 8:05 It’s also designed to be programmable as an ASIC.
How would you describe the P4 programming language?
- 8:20 It’s designed to express networking data processing - it’s mainly (but not completely) declarative.
- 8:35 If you look at the reference switch implementations written in P4, the whole suite is only 8kloc.
- 8:55 With P4 you can define your own new headers, which look like C structures, and you can define your own parsing logic as a finite state machine (see the sketch after this list).
- 9:10 Actions are associated with match patterns on the incoming packet data.
- 9:25 The actions are built from primitive instructions based on the PISA architecture.
- 9:40 The instruction set is relatively simple - basic boolean and integer operations, packet header manipulation, and some state management instructions.
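For illustration, here is a minimal P4_16 sketch of the header and parser style described above, written against the common v1model architecture; the header layouts and names are my own example, not the reference switch implementation.

```p4
#include <core.p4>
#include <v1model.p4>

const bit<16> TYPE_IPV4 = 0x800;

// Header definitions look like C structures.
header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

header ipv4_t {
    bit<4>  version;
    bit<4>  ihl;
    bit<8>  diffserv;
    bit<16> totalLen;
    bit<16> identification;
    bit<3>  flags;
    bit<13> fragOffset;
    bit<8>  ttl;
    bit<8>  protocol;
    bit<16> hdrChecksum;
    bit<32> srcAddr;
    bit<32> dstAddr;
}

struct metadata { }

struct headers {
    ethernet_t ethernet;
    ipv4_t     ipv4;
}

// The parsing logic is expressed as a finite state machine over the incoming bytes.
parser MyParser(packet_in packet,
                out headers hdr,
                inout metadata meta,
                inout standard_metadata_t standard_metadata) {
    state start {
        packet.extract(hdr.ethernet);
        transition select(hdr.ethernet.etherType) {
            TYPE_IPV4: parse_ipv4;
            default:   accept;
        }
    }
    state parse_ipv4 {
        packet.extract(hdr.ipv4);
        transition accept;
    }
}
```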
What kind of things can you build with this technology?
- 10:25 We built this to do programmable network processing - it can handle billions of packets per second, with predictable and guaranteed latency.
- 10:50 This accelerated performance is the unique value of this machine - and it also has some data structure support (arrays, hashes and so on).
- 11:05 So you can view it as a special type of computing machine, but which doesn’t support floating point, pointer or stack logic.
- 11:30 If you have requirements for constant-time data structure lookup, they can be implemented in P4 as well.
- 11:50 Even in networking, there are innovations that can happen. PISA and P4 allow people to experiment with packet processing without having to build their own hardware.
- 12:10 People came up with the idea of data plane telemetry as opposed to control plane telemetry.
- 12:15 This telemetry is running in the data plane, and is able to collect telemetry at full data rate.
- 12:25 So in-band network telemetry (INT) is an approach where packets carry the load information along with the packet data.
- 12:45 An INT switch will receive the packet and forward it on, but also adds its own metadata (which switch ID received the packet, which incoming and outgoing ports were used, what the arrival and departure times were and so on).
- 13:10 All this information is available at any switch, but getting it out of the box is impossible with a traditional switch because it's difficult to manipulate the packets.
- 13:20 This is possible to do with PISA and P4 - as long as you keep this metadata confined to your internal network, and strip it on egress, you can gain better visibility into what is happening (a sketch of this appears at the end of this list).
- 13:20 Imagine the possibilities that this opens up - it can give you much better visibility as to what your applications are doing.
- 14:00 A lot of applications these days are distributed and therefore involve networking to some degree.
- 14:10 You need to find out if any delays are due to your software, hardware, networking or routing problems, virtual or physical switches.
- 14:30 It's a stark contrast with the way people manage compute or storage applications.
- 14:40 Compute and storage are all software these days, and can be instrumented at almost every level.
- 14:50 When something goes wrong, they can analyse the data to find out what went wrong.
- 14:55 In networking, nothing like that exists - people use traceroute, ping and SNMP, which are thirty years old.
- 15:10 Data plane telemetry is offering to change all that - that’s the innovation that is happening in the networking field.
- 15:30 People are building extremely fast features - a load balancer, a DDoS detector, or a TCP SYN authenticator - because data plane programmability gives you those capabilities.
- 15:20 The value of this kind of device is that it runs at billions of packets per second, and is extremely robust against DDoS attacks.
- 16:15 Last but not least, there is the innovation that will happen when applications start to treat PISA as part of their computing or storage systems.
- 16:30 People have been talking about bringing the Paxos (distributed consensus) protocol down to the switches or the networking system.
- 16:40 The number of Paxos messages a switch can handle is two or three orders of magnitude higher than what a software-based system on a general-purpose CPU can handle.
- 16:55 In general, the Paxos algorithm is not CPU-intensive - it is mostly messages being sent and received as part of the consensus building.
- 17:00 It's essentially checking that sequence numbers are increasing - a message-handling workload, which is well suited to the PISA architecture.
- 17:20 People have also started talking about some storage applications or something between storage and compute.
- 17:30 We have written a paper at SIGCOMM describing NetCache, which realises a small but extremely fast memcached cache that load-balances multiple memcached servers located in the same rack. [http://conferences.sigcomm.org/events/apnet2017/slides/chang.pdf]
- 17:50 Today, if you have multiple memcached servers, the popularity of keys is skewed, so a small number of servers will hit full utilisation quickly.
- 18:05 At that point, the tail latency of reads or writes is high enough that you have to declare the cluster as having reached its maximum capacity.
- 18:10 Even if you have 40 servers, the aggregate load may effectively be carried by only 5 of them.
- 18:20 If you can introduce a small but extremely fast cache in front of these servers, then if you cache the N lg(N) hottest items (where N is the number of servers), the theory says that the utilisation of the servers becomes very similar (see the note at the end of this list).
- 18:50 So a small but extremely fast cache in front of 40 memcached servers lets the cluster scale to roughly 40 times the capacity of a single server.
- 18:55 Something like this can be realised with a PISA machine, because it has its own on-chip SRAM.
- 19:05 Jointly optimising your compute or storage application (along with the network) opens up a lot of possibilities.
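To make the in-band telemetry idea (12:25-13:20) concrete, here is a hedged P4_16/v1model sketch in the style of the public P4 tutorials: each switch appends one record of its own metadata to a header stack carried by the packet. The header layout, the int_stack name and the field widths are illustrative assumptions, not the INT specification.

```p4
// One per-hop telemetry record; the packet carries a stack of these.
// Assumes the surrounding program declares something like:
//   struct headers { ... int_record_t[8] int_stack; }
header int_record_t {
    bit<32> switch_id;
    bit<16> ingress_port;
    bit<16> egress_port;
    bit<48> ingress_ts;
    bit<48> egress_ts;
}

control TelemetryEgress(inout headers hdr,
                        inout metadata meta,
                        inout standard_metadata_t standard_metadata) {
    // Append this switch's metadata, read from v1model intrinsic metadata.
    action add_int_record(bit<32> switch_id) {
        hdr.int_stack.push_front(1);
        hdr.int_stack[0].setValid();
        hdr.int_stack[0].switch_id    = switch_id;
        hdr.int_stack[0].ingress_port = (bit<16>) standard_metadata.ingress_port;
        hdr.int_stack[0].egress_port  = (bit<16>) standard_metadata.egress_port;
        hdr.int_stack[0].ingress_ts   = standard_metadata.ingress_global_timestamp;
        hdr.int_stack[0].egress_ts    = standard_metadata.egress_global_timestamp;
    }

    table int_insert {
        actions = { add_int_record; NoAction; }
        default_action = NoAction();
    }

    apply {
        int_insert.apply();   // the edge of the network would strip the stack on egress
    }
}
```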
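On the NetCache load-balancing argument (18:20), the bound the speaker alludes to is in the spirit of the well-known "small cache, big effect" result; loosely restated (my paraphrase, not a quote from the episode), with n back-end servers:

```latex
\text{cache size} = O(n \log n)
\;\Longrightarrow\;
\max_{i} \mathrm{load}_i \;\approx\; \frac{1}{n} \sum_{j=1}^{n} \mathrm{load}_j
\qquad \text{regardless of key-popularity skew}
```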
There is no FPGA in your machines - how does the programmability work?
- 19:55 It’s something that is difficult to explain - essentially the key components that allow programming at full line speed are a programmable packet parser and a programmable match action unit.
- 20:15 The programmable parser is a component that receives a sequence of bytes and then generates a parsed result.
- 20:30 Given an incoming packet, you read through the first N bytes, and slice out header fields based on the specification.
- 20:40 You then move the header fields into the containers - and the programmable parser allows you to express your own state machine.
- 21:00 The other component is the programmable match-action units (MAUs) - each one is essentially a collection of SRAM or TCAM lookups combined with some ALU operations.
- 21:20 Each MAU can handle one lookup, which yields at most one match, and depending on the result it executes some action with the given arguments.
- 21:45 You have several of these MAUs in one stage, so you can parallelise things that can happen at the same time.
- 22:00 You then have several of these stages connected together in sequence, so you can realise some dependencies as well.
- 22:05 All the MAUs are identical, which makes it a good compiler target.
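As an illustration of the match-action logic such a stage executes, here is a hedged P4_16/v1model sketch in the style of the public P4 tutorials - one table keyed by a longest-prefix match on the destination address, a couple of actions, and the apply block. The table and action names are illustrative, and the header and metadata types are assumed to be declared as in the earlier parser sketch.

```p4
control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    action drop() {
        mark_to_drop(standard_metadata);
    }

    // Rewrite the MAC addresses, decrement the TTL and pick an output port.
    action ipv4_forward(bit<48> dstAddr, bit<9> port) {
        standard_metadata.egress_spec = port;
        hdr.ethernet.srcAddr = hdr.ethernet.dstAddr;
        hdr.ethernet.dstAddr = dstAddr;
        hdr.ipv4.ttl = hdr.ipv4.ttl - 1;
    }

    // One match-action table: the lookup selects which action (and arguments) to run.
    table ipv4_lpm {
        key = {
            hdr.ipv4.dstAddr: lpm;
        }
        actions = {
            ipv4_forward;
            drop;
            NoAction;
        }
        size = 1024;
        default_action = drop();
    }

    apply {
        if (hdr.ipv4.isValid()) {
            ipv4_lpm.apply();
        }
    }
}
```

The control plane then populates such a table at run time; on the open-source bmv2 software switch, for example, the simple_switch_CLI entry looks roughly like `table_add ipv4_lpm ipv4_forward 10.0.1.1/32 => 00:00:00:00:01:01 1` (the addresses and port number here are made up for illustration).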
Why is this only coming now - why wasn’t it around ten years ago?
- 22:50 When you build a high-speed networking ASIC there are a few functional units that you need.
- 23:00 For example, a serializer/deserializer (SerDes) is needed to convert the serial bit stream on the wire into parallel data the chip can process, and back again.
- 23:10 The reason this is hard to realise in the electronics world is that this needs extremely high speed processing.
- 23:30 This logic takes up about a third of any networking ASIC today, and its size is independent of whether the chip is programmable or not.
- 23:50 You have to have a certain amount of capacity to support 6 or 8 terabit per second chips.
- 24:00 The second portion is the memory that you need to host tables or packet buffers where the actual switching happens.
- 24:10 These tables and memory have nothing to do with programmability; you have to have them to support routing operations.
- 24:25 The third portion is wires for the routing - there are lots of wires working in parallel, because there is a lot of parsed data flowing in containers.
- 24:45 They have to go around every corner of the chip, which takes up that area on the chip.
- 25:00 All of these portions are unaffected by programmability.
- 25:05 The last portion is the logic that implements the protocol processing - IP, VLAN, queuing, MPLS, VXLAN and so on - and that is the part that programmability touches.
- 25:25 Relatively speaking, this logic portion gets smaller and smaller because the other portions grow larger over time.
- 25:45 Finally, we're at a point where the extra area that programmability adds to the logic portion is insignificant compared to the full ASIC size.
- 26:00 So we are now at a stage where the additional cost of programmable logic is negligible compared to the rest of the chip.
Where can people find out more information about P4?
- 26:20 There are people creating programmable FPGAs or NPUs - special kinds of CPUs optimised for networking workloads - that are used to build functions for networking purposes.
- 26:50 ASICs work at multiple terabits per second; FPGAs and NPUs work an order of magnitude slower.
- 27:05 Programming FPGAs is not trivial - you have to synthesise gates and worry about timing - but you can program them using P4, which makes it easier to innovate.
- 27:25 NPUs are the same - they offer a lot of flexibility, but you have to do microcoding, and you have to worry about performance degradation based on workload.
- 27:40 The more interesting aspect is that software-based switches - OVS is one of the most widely used - are also adopting P4.
- 28:00 So why would you introduce another layer of programmability such as P4 into a software routing stack?
- 28:10 It is not trivial to write Linux kernel code that can be maintained or even upstreamed.
- 28:30 So if OVS introduces a way to express this logic in P4, then you just have to write the P4 logic and it can be carried forward to newer versions of OVS.
- 28:50 That decouples a lot of hairy issues, which is why even software switches are adopting P4.
- 29:00 The website for the P4 language is http://p4.org, and everything is open source under the Apache 2.0 licence.