We divide the discussion of switch architectures into software and hardware switch architectures.
Software Switch Architectures
We first discuss the viability of software switching and then review dataflow graph abstractions, including Click, ClickOS, and software NICs. We proceed by revisiting the literature on match-action abstractions, discussing OVS and PISCES. We conclude with a review of packet I/O libraries.
Viability of Software Switching
Towards High Performance Virtual Routers on Commodity Hardware
The paper is the first to study the performance limitations when building both software routers and software virtual routers on commodity CPU platforms. The authors observe that the fundamental performance bottleneck is the memory system, and that through careful mapping of tasks to CPU cores one can achieve very high forwarding rates. The authors also identify principles for the construction of high-performance software router systems on commodity hardware.
Flow processing and the rise of commodity network hardware
The paper introduces the FlowStream switch architecture, which enables flow processing and forwarding with unprecedented flexibility and at low cost. FlowStream consolidates middlebox functionality (such as load balancing, packet inspection, and intrusion detection) and commodity switch technologies, which make it possible to control the switching of flows in a fine-grained manner, into a single integrated package deployed on commodity hardware.
RouteBricks: Exploiting Parallelism to Scale Software Routers
RouteBricks is concerned with enabling high-speed parallel processing in software routers, using a software router architecture that parallelizes router functionality both across multiple servers and across multiple cores within a single server. RouteBricks adopts a fully programmable Click/Linux environment and is built entirely from off-the-shelf, general-purpose server hardware.
The Dataflow Graph Abstraction
The click modular router
The paper introduces Click, a software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules (called elements) that implement simple router functions such as packet classification, queuing, and scheduling. The modules are organized into a directed graph: they sit at the vertices, and packets flow along the edges.
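The graph structure described above can be illustrated with a minimal Python sketch. It is not Click's actual C++ API or configuration language; the class names (Element, Classifier, DecTTL, Sink) and the dictionary-based packet representation are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional

Packet = Dict[str, int]  # hypothetical packet representation: header field -> value


class Element:
    """A vertex of the dataflow graph: processes a packet, then pushes it downstream."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.outputs: List["Element"] = []  # outgoing edges, indexed by output port

    def connect(self, downstream: "Element") -> "Element":
        self.outputs.append(downstream)
        return downstream  # returning the target allows chaining: a.connect(b).connect(c)

    def process(self, pkt: Packet) -> Optional[int]:
        """Return the output port to forward on, or None to stop (drop or consume)."""
        return 0

    def push(self, pkt: Packet) -> None:
        port = self.process(pkt)
        if port is not None and port < len(self.outputs):
            self.outputs[port].push(pkt)


class Classifier(Element):
    """Sends IPv4 packets to output 0 and everything else to output 1."""

    def process(self, pkt: Packet) -> Optional[int]:
        return 0 if pkt.get("ethertype") == 0x0800 else 1


class DecTTL(Element):
    """Decrements the TTL and drops the packet when it reaches zero."""

    def process(self, pkt: Packet) -> Optional[int]:
        pkt["ttl"] -= 1
        return 0 if pkt["ttl"] > 0 else None


class Sink(Element):
    """Collects packets; stands in for a transmit queue."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.received: List[Packet] = []

    def process(self, pkt: Packet) -> Optional[int]:
        self.received.append(pkt)
        return None


# Wire up the graph: classifier[0] -> DecTTL -> ip_out, classifier[1] -> other_out.
classifier = Classifier("classifier")
ip_out, other_out = Sink("ip_out"), Sink("other_out")
classifier.connect(DecTTL("dec_ttl")).connect(ip_out)
classifier.connect(other_out)

classifier.push({"ethertype": 0x0800, "ttl": 2})  # forwarded with ttl decremented
classifier.push({"ethertype": 0x0806, "ttl": 5})  # non-IPv4, goes to other_out
classifier.push({"ethertype": 0x0800, "ttl": 1})  # ttl expires, dropped
```

Each element only knows its own function and its outgoing edges, so changing the router's behavior amounts to rewiring the graph rather than modifying the elements.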
Fast and Flexible: Parallel Packet Processing with GPUs and Click
The paper introduces Snap, a framework for packet processing that exploits the parallelism available on modern GPUs, while remaining flexible, with packet processing tasks implemented as simple modular elements that are composed to build fully functional routers and switches. Snap is based on the Click modular router, which it extends by adding new architectural features that support batched packet processing, memory structures optimized for offloading to coprocessors, and asynchronous scheduling with in-order completion.
ClickOS and the Art of Network Function Virtualization
The paper introduces ClickOS, a high-performance, virtualized software middlebox platform. ClickOS virtual machines are small (5MB), boot quickly (about 30 milliseconds), add little delay (45 microseconds), and over one hundred of them can be concurrently run while saturating a 10Gb pipe on a commodity server. A wide range of middleboxes is implemented, including a firewall, a carrier-grade NAT and a load balancer, and the evaluations suggest that ClickOS can handle packets in the millions per second.
Berkeley Extensible Software Switch
BESS is the Berkeley Extensible Software Switch developed at the University of California, Berkeley and at Nefeli Networks. BESS is heavily inspired by the Click modular router, representing a packet processing pipeline as a dataflow (multi)graph that consists of modules, each of which implements a NIC feature, and ports that act as sources and sinks for this pipeline. Packets received at a port flow through the pipeline to another port, and each module in the pipeline performs module-specific operations on packets.
mSwitch: A Highly-scalable, Modular Software Switch
The authors observe that it is difficult to simultaneously provide high packet rates, high throughput, low CPU usage, high port density and a flexible data plane in the same architecture. A new architecture called mSwitch is proposed and four distinct modules are implemented on top: a learning bridge, an accelerated Open vSwitch module, a protocol demultiplexer for userspace protocol stacks, and a filtering module that can direct packets to virtualized middleboxes.
NetBricks: Taking the V out of NFV
NetBricks is an NFV framework adopting the "graph-based" pipeline abstraction and embracing type checking and safe runtimes to provide isolation efficiently in software, providing the same memory isolation as containers and VMs without incurring the same performance penalties. The new isolation technique is called zero-copy software isolation.
The Match-action Abstraction
The Design and Implementation of Open vSwitch
The paper describes the design and implementation of Open vSwitch, a multi-layer, open source virtual switch. The design document details the advanced flow classification and caching techniques that Open vSwitch uses to optimize its operations and conserve hypervisor resources.
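To illustrate the caching idea in general terms, here is a minimal sketch of an exact-match cache placed in front of a slower, priority-based classifier. It is only an analogy: the three-field flow key, the rule table and all names are hypothetical, and the sketch does not reproduce Open vSwitch's data structures or API.

```python
from typing import Dict, List, Optional, Tuple

FlowKey = Tuple[str, str, int]  # assumed key: (src_ip, dst_ip, dst_port)
Match = Tuple[Optional[str], Optional[str], Optional[int]]  # None means wildcard

# Hypothetical slow-path rule table: (priority, match, action).
RULES: List[Tuple[int, Match, str]] = [
    (100, (None, "10.0.0.2", 22), "drop"),
    (10, (None, "10.0.0.2", None), "output:2"),
    (0, (None, None, None), "controller"),
]


def slow_path_lookup(key: FlowKey) -> str:
    """Linear scan in decreasing priority order; the first matching rule wins."""
    for _prio, match, action in sorted(RULES, key=lambda r: -r[0]):
        if all(m is None or m == k for m, k in zip(match, key)):
            return action
    return "drop"


class CachedDatapath:
    """Exact-match cache: the first packet of a flow misses, later packets hit."""

    def __init__(self) -> None:
        self.cache: Dict[FlowKey, str] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, key: FlowKey) -> str:
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        action = slow_path_lookup(key)
        self.cache[key] = action  # install the result for subsequent packets
        return action


dp = CachedDatapath()
first = dp.lookup(("10.0.0.1", "10.0.0.2", 80))   # miss, classified on the slow path
second = dp.lookup(("10.0.0.1", "10.0.0.2", 80))  # hit, served from the cache
ssh = dp.lookup(("10.0.0.1", "10.0.0.2", 22))     # different key, another miss
```

A real switch must also invalidate cached entries when the rule table changes; that step is omitted here.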
PISCES: A Programmable, Protocol-Independent Software Switch
PISCES is a software switch, derived from the Open vSwitch (OVS) hypervisor switch, whose behavior is customized using P4. PISCES is not hard-wired to specific protocols; this independence makes it easy to add new features. The paper also shows how the compiler can analyze the high-level P4 specification to optimize forwarding performance; the evaluations show that PISCES performs comparably to OVS but PISCES programs are about 40 times shorter than equivalent OVS source code.
SoftFlow: A Middlebox Architecture for Open vSwitch
The paper presents SoftFlow, an extension to Open vSwitch that seamlessly integrates middlebox functionality while maintaining the familiar OpenFlow forwarding model and performing significantly better than alternative techniques for middlebox integration.
Dataplane Specialization for High-performance OpenFlow Software Switching
The authors argue that, instead of enforcing the same universal fast-path semantics on all OpenFlow applications and optimizing for the common case, as is done in Open vSwitch, a programmable software switch should automatically specialize its dataplane piecemeal with respect to the configured workload. They introduce ESwitch, a switch architecture that uses on-the-fly template-based code generation to compile any OpenFlow pipeline into efficient machine code, which can then be readily used as the switch fast-path, delivering superior packet processing speed, improved latency and CPU scalability, and predictable performance.
Dynamic Compilation and Optimization of Packet Processing Programs
The paper makes the observation that data-plane compilation is fundamentally static, i.e., the input of the compiler is a fixed description of the forwarding plane semantics and the output is code that can accommodate any packet processing behavior set by the controller at runtime. The authors advocate a dynamic approach to data plane compilation instead, where not just the semantics but the intended behavior is also input to the compiler, opening the door to a handful of runtime optimization opportunities that can be leveraged to improve the performance of custom-compiled datapaths beyond what is possible in a static setting.
Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
This paper presents the design of, and experience with, Andromeda, the network virtualization stack underlying the Google Cloud Platform. Andromeda is designed around the Hoverboard programming model, which uses gateways for the long tail of low-bandwidth flows, enabling the control plane to program network connectivity for tens of thousands of VMs in seconds, and applies per-flow processing to elephant flows only. The paper cites statistics indicating that more than 80% of VM pairs in a deployment never talk to each other and that only 1–2% generate sufficient traffic to warrant per-flow processing. The architecture also uses a high-performance OS-bypass software packet processing path for CPU-intensive per-packet operations, implemented on coprocessor threads.
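The split between gateway forwarding and per-flow processing can be sketched as a simple threshold policy. This is a loose illustration only: the class name, the byte-count threshold, and the decision to key on VM pairs are all assumptions, not the mechanism described in the paper.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

VmPair = Tuple[str, str]


class HoverboardSketch:
    """Sends traffic via a shared gateway until a VM pair proves to be heavy."""

    def __init__(self, offload_threshold_bytes: int) -> None:
        self.threshold = offload_threshold_bytes  # hypothetical tuning knob
        self.bytes_seen: DefaultDict[VmPair, int] = defaultdict(int)
        self.offloaded: Set[VmPair] = set()  # pairs with dedicated per-flow state

    def route(self, src_vm: str, dst_vm: str, nbytes: int) -> str:
        pair = (src_vm, dst_vm)
        if pair in self.offloaded:
            return "direct"  # per-flow state already installed on the host
        self.bytes_seen[pair] += nbytes
        if self.bytes_seen[pair] >= self.threshold:
            self.offloaded.add(pair)  # later packets bypass the gateway
        return "gateway"


hb = HoverboardSketch(offload_threshold_bytes=3000)
paths = [hb.route("vm1", "vm2", 1500) for _ in range(3)]  # heavy pair
mouse = hb.route("vm1", "vm3", 100)                       # light pair
```

Under this policy, the control plane only needs to install per-flow state for the small fraction of pairs that cross the threshold, which is the property the cited statistics motivate.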
Packet I/O libraries
Netmap: a novel framework for fast packet I/O
Netmap is a framework that enables commodity operating systems to handle millions of packets per second without requiring custom hardware or changes to applications. The idea is to eliminate inefficiencies in OSes' standard packet processing datapaths: per-packet dynamic memory allocations are removed by preallocating resources, system call overheads are amortized over large I/O batches, and memory copies are eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas.
Intel DPDK: Data Plane Development Kit
DPDK is a set of libraries and drivers for fast packet processing, including a multicore framework, huge page memory, ring buffers, and poll-mode drivers for networking I/O, crypto, and eventdev. DPDK can be used to receive and send packets within the minimum number of CPU cycles (usually less than 80 cycles), develop fast packet capture algorithms (like tcpdump), and run third-party fast path stacks.
The Fast Data Project
FD.io (Fast data – Input/Output) is a collection of several projects and libraries to support flexible, programmable and composable services on a generic hardware platform, using high-throughput, low-latency and resource-efficient I/O services suitable for many architectures (x86, ARM, and PowerPC) and deployment environments (bare metal, VM, container).
Hardware Switch Architectures
We start off by discussing a first incarnation of a programmable switch, PLUG, then discuss the SwitchBlade platform and the seminal paper on RMT (Reconfigurable Match Tables). We then review existing performance evaluation studies and literature dealing with performance monitoring and the issue of potential inconsistencies in reconfigurable networks. We conclude with a paper on Azure SmartNICs based on FPGAs.
PLUG: Flexible Lookup Modules for Rapid Deployment of New Protocols in High-speed Routers
The first incarnation of the "programmable switch". PLUG (Pipelined Lookup Grid) is a flexible lookup module that can achieve generality without losing efficiency, because various custom lookup modules have the same fundamental features that PLUG retains: area dominated by memories, simple processing, and strict access patterns defined by the data structure. The authors implemented IPv4, Ethernet, Ethane, and SEATTLE in a dataflow-based programming model for the PLUG and mapped them to the PLUG hardware, showing that the throughput, area, power, and latency of PLUGs are close to those of specialized lookup modules.
SwitchBlade: A Platform for Rapid Deployment of Network Protocols on Programmable Hardware
SwitchBlade is a platform for rapidly deploying custom protocols on programmable hardware. SwitchBlade uses a pipeline-based design that allows individual hardware modules to be enabled or disabled on the fly, integrates common packet-processing functions as hardware modules enabling different protocols to use these functions without having to resynthesize hardware, and uses a customizable forwarding engine that supports both longest-prefix matching in the packet header and exact matching on a hash value. SwitchBlade also allows multiple custom data planes to operate in parallel on the same physical hardware, while providing complete isolation for protocols running in parallel.
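The two lookup modes mentioned above, longest-prefix matching and exact matching on a hash value, can be sketched in a few lines of Python. This is a software illustration under assumed names and table layouts; it uses a linear scan and CRC32 for brevity and says nothing about how SwitchBlade implements these lookups in hardware.

```python
import ipaddress
import zlib
from typing import Dict, Optional


def longest_prefix_match(table: Dict[str, str], addr: str) -> Optional[str]:
    """Return the next hop of the most specific prefix containing addr."""
    ip = ipaddress.ip_address(addr)
    best_len, best_hop = -1, None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, next_hop
    return best_hop


def exact_match(table: Dict[int, str], header_bytes: bytes) -> Optional[str]:
    """Look up the hash of the selected header bytes (CRC32 is an assumption)."""
    return table.get(zlib.crc32(header_bytes))


# Hypothetical tables for illustration.
routes = {"0.0.0.0/0": "default", "10.0.0.0/8": "A", "10.1.0.0/16": "B"}
flows = {zlib.crc32(b"flow-A"): "port1"}

hop_specific = longest_prefix_match(routes, "10.1.2.3")  # matches /0, /8 and /16
hop_general = longest_prefix_match(routes, "10.2.0.1")   # matches /0 and /8
hop_default = longest_prefix_match(routes, "8.8.8.8")    # matches /0 only
known_flow = exact_match(flows, b"flow-A")
unknown_flow = exact_match(flows, b"flow-B")
```

Longest-prefix matching suits hierarchical addresses such as IP prefixes, while hashing selected header bytes gives a fixed-width key for protocols whose identifiers have no prefix structure.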
Forwarding Metamorphosis: Fast Programmable Match-action Processing in Hardware for SDN
This seminal paper presents RMT to overcome two limitations in current switching chips and OpenFlow: (1) conventional hardware switches are rigid, allowing "Match-Action" processing on only a fixed set of fields, and (2) the OpenFlow specification only defines a limited repertoire of packet processing actions. The RMT (Reconfigurable Match Tables) model is a RISC-inspired pipelined architecture for switching chips, including an essential minimal set of action primitives to specify how headers are processed in hardware. RMT allows the forwarding plane to be changed in the field without modifying hardware.
High-Speed Packet Processing using Reconfigurable Computing
The paper presents a tool chain that maps a domain-specific declarative packet-processing language with object-oriented semantics, called PX, to high-performance reconfigurable-computing architectures based on field-programmable gate array (FPGA) technology, including components for packet parsing, editing, and table lookups.
What You Need to Know About SDN Control and Data Planes
The definitive source on OpenFlow switches and the differences between them. The authors measure, report, and explain the performance characteristics of the control and data planes in three hardware OpenFlow switches. The results highlight differences between the OpenFlow specification and its implementations that, if ignored, pose a serious threat to network security and correctness.
BlueSwitch: Enabling Provably Consistent Configuration of Network Switches
The paper is motivated by the challenges involved in consistent updates of distributed network configurations, given the complexity of modern switch datapaths and the exposed opaque configuration mechanisms. The authors demonstrate that even simple rule updates result in inconsistent packet switching in multi-table datapaths. The main contribution of the paper is a hardware design that supports a transactional configuration mechanism, providing strong switch-level atomicity: all packets traversing the datapath will encounter either the old configuration or the new one, and never an inconsistent mix of the two. The approach is prototyped using the NetFPGA hardware platform.
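The atomicity guarantee described above can be mimicked in software with a shadow copy and a single reference swap. The sketch below is an analogy under assumed names (TransactionalDatapath, a two-table pipeline); it is not the BlueSwitch hardware design.

```python
from typing import Dict, List, Optional, Tuple

Table = Dict[str, str]


class TransactionalDatapath:
    """Two-table pipeline whose configuration changes only as a whole."""

    def __init__(self, tables: List[Table]) -> None:
        self._active: Tuple[Table, ...] = tuple(dict(t) for t in tables)
        self._shadow: Optional[List[Table]] = None

    def begin(self) -> None:
        # Updates are staged on a private copy, invisible to packets.
        self._shadow = [dict(t) for t in self._active]

    def update(self, table_idx: int, key: str, value: str) -> None:
        if self._shadow is None:
            raise RuntimeError("call begin() before update()")
        self._shadow[table_idx][key] = value

    def commit(self) -> None:
        if self._shadow is None:
            raise RuntimeError("nothing to commit")
        self._active = tuple(self._shadow)  # one reference swap publishes all tables
        self._shadow = None

    def process(self, pkt_key: str) -> Optional[str]:
        config = self._active  # the packet pins one configuration at ingress
        tag = config[0].get(pkt_key)
        return config[1].get(tag) if tag is not None else None


dp = TransactionalDatapath([{"h1": "red"}, {"red": "port1"}])
before = dp.process("h1")

dp.begin()
dp.update(0, "h1", "blue")     # table 0 now tags h1 as blue ...
dp.update(1, "blue", "port2")  # ... and table 1 learns what blue means
during = dp.process("h1")      # still served entirely by the old configuration
dp.commit()
after = dp.process("h1")
```

If the two updates were applied in place instead, a packet processed between them would be tagged "blue" in the first table and find no entry in the second, which is the kind of inconsistency the paper demonstrates.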
ClickNP: Highly Flexible and High Performance Network Processing with Reconfigurable Hardware
This paper focuses on accelerating NFs with FPGAs. FPGAs, however, are predominantly programmed using low-level hardware description languages (HDLs), which are hard to code and difficult to debug. More importantly, HDLs are almost inaccessible to most software programmers. The paper presents ClickNP, an FPGA-accelerated platform that is highly flexible, as it is completely programmable using high-level C-like languages and exposes a modular programming abstraction that resembles the Click modular router, while also delivering high performance.
dRMT: Disaggregated Programmable Switching
A follow-up to the RMT paper. dRMT (disaggregated Reconfigurable Match-Action Table) is a new architecture for programmable switches, which overcomes two important restrictions of RMT: (1) table memory is local to an RMT pipeline stage, implying that memory not used by one stage cannot be reclaimed by another, and (2) RMT is hardwired to always sequentially execute matches followed by actions as packets traverse pipeline stages. dRMT resolves both issues by disaggregating the memory and compute resources of a programmable switch, moving table memories out of pipeline stages and into a centralized pool that is accessible through a crossbar. In addition, dRMT replaces RMT's pipeline stages with a cluster of processors that can execute match and action operations in any order.
Language-Directed Hardware Design for Network Performance Monitoring
The authors ask what switch hardware primitives are required to support an expressive language of network performance questions. They present a performance query language, Marple, modeled on familiar functional constructs, backed by a new programmable key-value store primitive on switch hardware that performs flexible aggregations at line rate and scales to millions of keys. Marple can express switch queries that could previously run only on end hosts, while Marple queries only occupy a modest fraction of a switch's hardware resources.
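The kind of per-key aggregation such a key-value store supports can be sketched as a group-by followed by a fold over the packet stream. The function below is plain Python with assumed names and packet fields; it is not Marple syntax and ignores the hardware constraints the paper addresses.

```python
from typing import Callable, Dict, Hashable, Iterable, TypeVar

P = TypeVar("P")  # packet record type
S = TypeVar("S")  # per-key aggregation state


def groupby_fold(
    packets: Iterable[P],
    key_fn: Callable[[P], Hashable],
    fold_fn: Callable[[S, P], S],
    init: S,
) -> Dict[Hashable, S]:
    """Maintain one state value per key, updated once for every packet."""
    state: Dict[Hashable, S] = {}
    for pkt in packets:
        key = key_fn(pkt)
        state[key] = fold_fn(state.get(key, init), pkt)
    return state


# Hypothetical packet records with a flow identifier and an observed queue depth.
packets = [
    {"flow": "a", "qdepth": 3},
    {"flow": "b", "qdepth": 7},
    {"flow": "a", "qdepth": 5},
]

pkt_counts = groupby_fold(packets, lambda p: p["flow"], lambda s, p: s + 1, 0)
max_qdepth = groupby_fold(
    packets, lambda p: p["flow"], lambda s, p: max(s, p["qdepth"]), 0
)
```

Because the state is updated with a small, fixed amount of work per packet, this style of query is a natural candidate for execution at line rate.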
Azure Accelerated Networking: SmartNICs in the Public Cloud
Modern public cloud architectures rely on complex networking policies and running the necessary network stacks on CPU cores takes away processing power from VMs, increasing the cost of running cloud services, and adding latency and variability to network performance. The paper presents the design of AccelNet, the Azure Accelerated Networking scheme for offloading host networking to hardware, using custom Azure SmartNICs based on FPGAs, including the hardware/software co-design model, performance results on key workloads, and experiences and lessons learned from developing and deploying AccelNet.
Hybrid Hardware/Software Architectures
It is often believed that the performance of programmable network processors is lower than that of hard-coded chips. There is interesting literature that questions this assumption and explores these overheads empirically. We also discuss opportunities arising from Graphics Processing Unit (GPU) acceleration, e.g., for packet processing, as well as from hybrid hardware/software architectures in general.
PacketShader: A GPU-accelerated Software Router
PacketShader is a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration, exploiting the massively-parallel processing power of GPU to address the CPU bottleneck in software routers, combined with a high-performance packet I/O engine. The paper presents implementations for IPv4 and IPv6 forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the flexibility and performance advantage of PacketShader.
Cheap Silicon: A Myth or Reality? Picking the Right Data Plane Hardware for Software Defined Networking
Industry insight holds that programmable network processors are of lower performance than their hard-coded counterparts, such as Ethernet chips. The paper argues that, in contrast to the common view, the overhead of programmability is relatively low, and that the apparent difference between programmable and hard-coded chips is not primarily due to programmability itself, but arises because the internal balance of programmable network processors is tuned to more complex use cases.
Raising the Bar for Using GPUs in Software Packet Processing
The paper opens the debate as to whether Graphics Processing Units (GPUs) are useful for accelerating software-based routing and packet handling applications. The authors argue that for many such applications the benefits arise less from the GPU hardware itself than from the expression of the problem in a language such as CUDA or OpenCL that facilitates memory latency hiding and vectorization through massive concurrency. They then demonstrate that, by applying a similar style of optimization to algorithm implementations, a CPU-only implementation can be more resource efficient than the version running on the GPU.
CacheFlow: Dependency-Aware Rule-Caching for Software-Defined Networks
The paper presents an architecture to allow high-speed forwarding even with large rule tables and fast updates, by combining the best of hardware and software processing. The CacheFlow system caches the most popular rules in the small TCAM and relies on software to handle the small amount of cache-miss traffic. The authors observe that one cannot blindly apply existing cache-replacement algorithms, because of dependencies between rules with overlapping patterns. Rather, long dependency chains must be broken to cache smaller groups of rules while preserving the semantics of the policy.
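The dependency problem can be made concrete with a small sketch. If a low-priority rule is cached alone, packets that should match an overlapping higher-priority rule would wrongly hit the cached one. The code below computes a conservative set of rules that must be cached together, as a transitive closure over higher-priority overlapping rules. The rule names and ternary patterns are hypothetical, and this is a simplification: it shows why chains arise, not the chain-breaking algorithms the paper proposes.

```python
from typing import Dict, Set, Tuple

Rule = Tuple[int, str]  # (priority, ternary pattern over '0', '1', '*')


def overlaps(a: str, b: str) -> bool:
    """Two ternary patterns overlap if some packet matches both."""
    return all(x == y or x == "*" or y == "*" for x, y in zip(a, b))


def dependent_set(rules: Dict[str, Rule], target: str) -> Set[str]:
    """Rules that must accompany `target` in the cache (conservative closure)."""
    needed = {target}
    frontier = [target]
    while frontier:
        cur_prio, cur_pat = rules[frontier.pop()]
        for name, (prio, pat) in rules.items():
            if name not in needed and prio > cur_prio and overlaps(pat, cur_pat):
                needed.add(name)
                frontier.append(name)
    return needed


# Hypothetical policy: R1 refines R2, which refines R3; R4 is independent.
rules = {
    "R1": (3, "000"),
    "R2": (2, "00*"),
    "R3": (1, "0**"),
    "R4": (1, "1**"),
}

deps_r3 = dependent_set(rules, "R3")  # caching R3 drags in the whole chain
deps_r4 = dependent_set(rules, "R4")  # R4 can be cached on its own
deps_r1 = dependent_set(rules, "R1")  # the top of the chain has no dependencies
```

Caching the popular but low-priority R3 costs three TCAM entries here, which illustrates why long chains waste cache space.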
APUNet: Revitalizing GPU as Packet Processing Accelerator
This is the answer to the question raised by the "Raising the Bar for Using GPUs" paper. Kalia et al. argue that the key enabler for high packet-processing performance is the inherent feature of the GPU that automatically hides memory access latency, rather than its parallel computation power, and claim that a CPU can outperform, or achieve performance similar to, a GPU if its code is rearranged to run concurrently with memory access. This paper revisits these claims and finds, with eight popular algorithms widely used in network applications, that (a) there are many compute-bound algorithms that do benefit from the parallel computation capacity of the GPU while CPU-based optimizations fail to help, and (b) the relative performance advantage of the CPU over the GPU in most applications is due to the data transfer bottleneck in the PCIe communication of a discrete GPU rather than a lack of capacity of the GPU itself.
We Need to Talk About NICs
The paper's aim is to reinvigorate the discussion on the design of network interface cards (NICs) in the network and OS community. The authors argue that current operating systems fail to efficiently exploit and manage the considerable hardware resources provided by modern network interface controllers. They then describe Dragonet, a network stack that represents both the physical capabilities of the network hardware and the current protocol state of the machine as dataflow graphs.
NetFPGA SUME: Toward 100 Gbps as Research Commodity
The paper presents a reference design and implementation for a programmable NIC that is in wide-scale use today as an accessible development environment that both reuses existing codebases and enables new designs. NetFPGA SUME is an FPGA-based PCI Express board with I/O capabilities for 100 Gbps operation as a network interface card, multiport switch, firewall, or test and measurement environment.
SoftNIC: A Software NIC to Augment Hardware
SoftNIC is a hybrid software-hardware architecture to bridge the gap between limited hardware capabilities and ever-changing user demands. SoftNIC provides a programmable platform that allows applications to leverage NIC features implemented in software and hardware, without sacrificing performance. This paper serves as the foundation for the BESS software switch.
High Performance Packet Processing with FlexNIC
The authors argue that the primary reason for high memory and processing overheads inherent to packet processing applications is the inefficient use of the memory and I/O resources by commodity NICs. They propose FlexNIC, a flexible network DMA interface that can be used to reduce packet processing overheads; FlexNIC allows services to install packet processing rules into the NIC, which then executes simple operations on packets while exchanging them with host memory. This moves some of the packet processing traditionally done in software to the NIC, where it can be done flexibly and at high speed.
Floem: A Programming System for NIC-Accelerated Network Applications
The paper presents Floem, a set of programming abstractions for NIC-accelerated applications that eases the development of server applications that offload computation and data to a NIC accelerator. Floem simplifies typical offloading issues, like data placement and caching, partitioning of code and its parallelism, and communication strategies between program components across devices, by providing language-level abstractions for logical and physical queues, global per-packet state, remote caching, and interfacing with external application code. The paper also presents evaluations to demonstrate how these abstractions help explore a space of NIC-offloading designs for real-world applications, including a key-value store and a distributed real-time data analytics system.
Offloading Distributed Applications Onto SmartNICs Using iPipe
The paper presents iPipe, a generic actor-based offloading framework to run distributed applications on commodity SmartNICs. The paper details the iPipe design, built around a hybrid scheduler that combines first-come-first-served with deficit round-robin policies to schedule offloading tasks at microsecond-scale precision on SmartNICs. It presents three custom-built use cases (a real-time data analytics engine, a distributed transaction system, and a replicated key-value store) to show that SmartNIC offloading brings about significant performance benefits in terms of traffic control, computing capability, onboard memory, and host communication.
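For readers unfamiliar with deficit round-robin, one of the two policies the hybrid scheduler combines, the following sketch shows the textbook algorithm on queues of task costs. It is a generic illustration with hypothetical queue names and cost units; it does not model iPipe's scheduler, its first-come-first-served path, or how tasks move between the two.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def deficit_round_robin(
    queues: Dict[str, Deque[int]], quantum: int
) -> List[Tuple[str, int]]:
    """Drain all queues and return the service order as (queue, task cost).

    Assumes quantum > 0 and positive task costs; otherwise the loop may not end.
    """
    deficit = {name: 0 for name in queues}
    order: List[Tuple[str, int]] = []
    while any(queues.values()):
        for name, q in queues.items():
            if not q:
                continue
            deficit[name] += quantum  # each backlogged queue earns credit per round
            while q and q[0] <= deficit[name]:
                cost = q.popleft()
                deficit[name] -= cost  # spend credit on the task just served
                order.append((name, cost))
            if not q:
                deficit[name] = 0  # an emptied queue does not bank credit
    return order


# Queue "a" holds two expensive tasks, queue "b" four cheap ones.
queues = {"a": deque([3, 3]), "b": deque([1, 1, 1, 1])}
schedule = deficit_round_robin(queues, quantum=2)
```

In the first round, "a" cannot afford its 3-unit task with 2 units of credit, so "b" runs two cheap tasks; in the second round, "a" has accumulated 4 units and is served. Over time, each queue receives service proportional to its quantum regardless of task sizes.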
PicNIC: Predictable Virtualized NIC
The paper addresses the noisy neighbor problem in the context of NICs. Using data from a major public cloud provider, the paper systematically characterizes how performance isolation can break in virtualization stacks and finds a fundamental tradeoff between isolation and efficiency. The paper then proposes PicNIC, the Predictable Virtualized NIC, a new NIC design that shares resources efficiently in the common case while rapidly reacting to ensure isolation in order to provide predictable performance for isolated workloads. PicNIC builds on three constructs to quickly detect isolation breakdown and to enforce it when necessary: CPU-fair weighted fair queues at receivers, receiver-driven congestion control for backpressure, and sender-side admission control with shaping. Evaluations show that this combination ensures isolation for VMs at sub-millisecond timescales with negligible overhead.