Bridging Worlds: High-Performance Telco Networking in Kubernetes with SR-IOV, DRA, and NRI

The Telco Imperative: Why Bare-Metal Speed Matters in a Virtualized World

The telecommunications industry is undergoing a profound transformation, driven by the shift from proprietary, hardware-centric appliances to flexible, software-based solutions running on standard IT servers. This paradigm, known as Network Functions Virtualization (NFV), promises unprecedented agility, scalability, and cost efficiency. However, this transition introduces a formidable challenge: performance. For critical network functions, particularly those in the data plane of a 5G network like the User Plane Function (UPF), performance is not merely a feature—it is an absolute prerequisite. These functions must process millions of packets per second with minimal latency, a task for which traditional software-based virtual networking is often ill-equipped. This is where Single Root I/O Virtualization (SR-IOV) becomes an indispensable technology.

A Quick Primer on SR-IOV: Beyond the Virtual Switch

Single Root I/O Virtualization (SR-IOV) is a hardware-based virtualization technology, standardized by the PCI Special Interest Group (PCI-SIG), that allows a single physical PCIe device, such as a Network Interface Card (NIC), to present itself as multiple distinct virtual devices. The fundamental value of SR-IOV lies in its ability to bypass the hypervisor or host kernel’s software switch, which is a notorious performance bottleneck in virtualized environments.

In a conventional virtual networking setup, network packets from a workload (be it a Virtual Machine or a container) must traverse the host's kernel networking stack. This process involves multiple data copies (from the application's user space into kernel space, and again from the kernel to the I/O device) as well as significant CPU overhead for packet processing. This software-based path introduces latency and limits throughput, making it unsuitable for the demanding requirements of Telco data planes.

SR-IOV elegantly solves this problem by creating a direct, hardware-based data path from the workload to the physical NIC. It leverages Direct Memory Access (DMA) to eliminate kernel-space data copies, allowing packets to move directly between the application’s memory and the NIC hardware. This results in near-native, bare-metal I/O performance, providing the low latency and high throughput essential for network-intensive applications. The European Telecommunications Standards Institute (ETSI) explicitly acknowledges in its NFV Performance specification that for I/O-relevant workloads, “techniques relying on bypassing the OS, its I/O mechanism and its interruptions may become essential”. The existence and maturity of SR-IOV was a direct catalyst for the success of the entire NFV movement; without a viable way to bridge the performance gap created by software-based virtualization, Telco data planes would have remained locked into proprietary hardware.

The Architecture of Performance: Physical Functions (PFs) and Virtual Functions (VFs)

The SR-IOV architecture is composed of two primary components that work in concert to achieve device virtualization.

  • Physical Function (PF): The PF is the full-featured, traditional PCIe function on an SR-IOV capable device. It contains the SR-IOV extended capability structure and is responsible for the overall management and configuration of the device. The PF driver, which runs in the host operating system (or hypervisor), controls the enabling of SR-IOV and the creation, or “spawning,” of Virtual Functions. The PF retains full control over the physical hardware and is the only function that can configure the physical port.
  • Virtual Functions (VFs): VFs are lightweight PCIe functions that are derived from and managed by the PF. Each VF has its own dedicated resources, such as register sets and queues, and its own unique PCI configuration space, making it appear to the system as a separate, independent physical device. These VFs share the underlying physical resources of the device, like the physical network port, with the PF and other VFs. A key design feature of VFs is their simplicity, which allows a single physical device to support a large number of them—for instance, modern NVIDIA adapters can expose up to 127 VFs per physical port. Once created, a VF can be directly assigned, or “passed through,” to a workload, giving it direct hardware access.

To enable this functionality, the underlying hardware and host operating system must be properly configured. This includes enabling I/O virtualization extensions in the system’s BIOS/UEFI, such as Intel VT-d or AMD-V with IOMMU, and ensuring the kernel is booted with the appropriate parameters (e.g., intel_iommu=on).
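
To make the "spawning" of VFs concrete, the short Go sketch below drives the standard Linux sysfs interface that PF drivers expose: it reads sriov_totalvfs to learn how many VFs the hardware supports and writes the desired count to sriov_numvfs to create them. The interface name and VF count are illustrative placeholders, not values tied to any particular deployment.

// Minimal sketch: create SR-IOV VFs on a PF via the Linux sysfs interface.
// The interface name and VF count used in main() are illustrative placeholders.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func createVFs(pfName string, numVFs int) error {
	devDir := filepath.Join("/sys/class/net", pfName, "device")

	// sriov_totalvfs reports how many VFs the hardware supports on this PF.
	raw, err := os.ReadFile(filepath.Join(devDir, "sriov_totalvfs"))
	if err != nil {
		return fmt.Errorf("PF %s does not appear to be SR-IOV capable: %w", pfName, err)
	}
	total, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return err
	}
	if numVFs > total {
		return fmt.Errorf("requested %d VFs, but %s supports at most %d", numVFs, pfName, total)
	}

	// Writing to sriov_numvfs asks the PF driver to spawn that many VFs.
	// (If VFs already exist, the value must be reset to 0 before changing it.)
	return os.WriteFile(filepath.Join(devDir, "sriov_numvfs"),
		[]byte(strconv.Itoa(numVFs)), 0o644)
}

func main() {
	if err := createVFs("eth0", 4); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}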

Why SR-IOV is Non-Negotiable for NFV and 5G

For Communication Service Providers (CSPs), SR-IOV is not just a performance optimization; it is a foundational technology for building modern, cloud-native networks. The massive throughput and low-latency demands of 5G services, combined with the need to run a high density of Virtualized Network Functions (VNFs) or Containerized Network Functions (CNFs) on each server, make SR-IOV a critical enabler.

As network speeds escalate from 10 Gbit/s to 40, 100 Gbit/s, and beyond, the processing overhead on the host CPU becomes untenable with software-based switching. SR-IOV offloads this processing to the NIC hardware, freeing up CPU cores to run the actual network function logic. Furthermore, it allows a single high-speed physical port to be partitioned and shared among dozens of CNFs, providing the scalability needed to deploy complex service chains efficiently. For any serious NFV deployment handling data plane traffic, SR-IOV is the established standard for achieving the required performance and density.

The Evolution of Device Management in Kubernetes: From Plugins to Dynamic Allocation

While SR-IOV provides the necessary hardware capabilities, making those capabilities available to containerized workloads in a seamless, declarative, and “Kubernetes-native” way requires a sophisticated resource management framework. For years, this was handled by the Kubernetes device plugin framework. However, its limitations paved the way for a more powerful and flexible successor: Dynamic Resource Allocation (DRA).

Introducing Dynamic Resource Allocation (DRA): A Flexible, Storage-Inspired API

Dynamic Resource Allocation (DRA) is a built-in Kubernetes feature, now generally available, designed to fundamentally improve how specialized hardware resources are requested, managed, and consumed by workloads. It addresses the shortcomings of the device plugin model by introducing a flexible, extensible, and vendor-agnostic API for managing devices like GPUs, FPGAs, and high-performance NICs.

The design of DRA represents a significant paradigm shift, drawing direct inspiration from the mature and well-understood model for dynamic storage provisioning in Kubernetes, which uses StorageClass, PersistentVolumeClaim (PVC), and PersistentVolume (PV) objects. This analogy is powerful: just as a PVC allows a user to declaratively request storage with certain characteristics without needing to know the details of the underlying storage system, DRA allows a user to request a hardware device with specific attributes. This approach decouples the application’s resource request from the physical implementation, providing immense flexibility and simplifying the user experience.

Core DRA Concepts: DeviceClass, ResourceClaim, and Declarative Requests

The DRA framework is built upon a set of new API objects within the resource.k8s.io API group that work together to orchestrate the entire allocation lifecycle.

  • DeviceClass: This is a cluster-scoped resource typically created by a cluster administrator or installed as part of a device driver. It defines a category of available hardware and uses Common Expression Language (CEL) to specify selection criteria based on device attributes. For example, a DeviceClass could be created to represent all SR-IOV VFs from a specific vendor or those connected to a particular physical port.
  • ResourceClaim: This is a namespaced object created by a user to request an instance of a resource from a specific DeviceClass. It is the direct analogue to a PersistentVolumeClaim. A ResourceClaim can be created manually, and if so, it can be shared by multiple Pods, enabling use cases where several workloads need access to the same physical device.
  • ResourceClaimTemplate: For cases where each Pod needs its own dedicated device, the workload's specification (e.g., a Deployment or StatefulSet) can reference a ResourceClaimTemplate by name. Kubernetes will then automatically generate a unique ResourceClaim for each Pod created from the template, with the lifecycle of the claim tied to the lifecycle of the Pod.
  • ResourceSlice: This is the mechanism by which DRA drivers advertise available resources on a node to the rest of the cluster. A driver running on a node will discover the local hardware and publish one or more ResourceSlice objects containing detailed attributes about each available device. The Kubernetes scheduler consumes this information directly to make intelligent placement decisions. A short example of inspecting these published slices follows below.
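
Because ResourceSlices are ordinary API objects, they can be read back like anything else in the cluster. The minimal client-go sketch below, which assumes the resource.k8s.io/v1 API that went GA in Kubernetes 1.34 and a reachable kubeconfig, simply lists the slices published by drivers and how many devices each one advertises; treat it as an illustration rather than part of any driver.

// Minimal sketch: list the ResourceSlices advertised by DRA drivers.
// Assumes the resource.k8s.io/v1 API (GA in Kubernetes 1.34) and a valid kubeconfig.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Each slice is published by a node-local driver and describes a pool of devices.
	slices, err := client.ResourceV1().ResourceSlices().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, s := range slices.Items {
		fmt.Printf("slice %s: driver=%s devices=%d\n", s.Name, s.Spec.Driver, len(s.Spec.Devices))
	}
}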

This new model signifies a deeper architectural change. By abstracting diverse hardware behind a consistent, declarative, claim-based API—first with storage (CSI), then networking (CNI), and now specialized devices (DRA)—Kubernetes completes its evolution into a true “data center operating system.” It solidifies its role as a universal infrastructure abstraction layer, capable of managing not just stateless applications but also the complex, stateful, and hardware-dependent workloads that are increasingly common in modern computing.

Device Plugins vs. Dynamic Resource Allocation (DRA)

To fully appreciate the advancements brought by DRA, it is crucial to compare it directly with the legacy device plugin framework it is designed to replace.

| Feature | Device Plugin Model | Dynamic Resource Allocation (DRA) Model |
| --- | --- | --- |
| Allocation Unit | Exposes devices as a simple integer count (e.g., intel.com/sriov: 5); the whole device is allocated. | Allocates specific, named resources with rich attributes; allows for partitioning and fine-grained selection. |
| Resource Sharing | Not natively supported; a device is exclusively allocated to a single container. | Natively supports sharing a single device among multiple Pods via a shared ResourceClaim. |
| Configuration | No standard mechanism for passing parameters; relies on out-of-band methods like annotations or on-disk files. | Natively supports structured, vendor-specific parameters within the ResourceClaim and DeviceClass. |
| Scheduling | Kubelet advertises resources; the scheduler performs simple integer-based matching and is "topology blind". | The scheduler directly consumes ResourceSlice data, enabling intelligent, topology-aware placement decisions without external driver interaction during scheduling. |
| API Abstraction | Low-level and imperative; tightly coupled to the Kubelet. | High-level and declarative, analogous to the storage PV/PVC model, decoupling the user's request from the physical resource. |

Fine-Grained Control at the Edge: Understanding NRI Hooks

While DRA provides a powerful, cluster-wide mechanism for scheduling and allocating devices, there is often a need for more fine-grained, node-local configuration that occurs just before a container starts. This is the “last mile” of resource management, and it is addressed by another crucial technology: the Node Resource Interface (NRI).

Beyond the Kubelet: What is the Node Resource Interface (NRI)?

The Node Resource Interface (NRI) is a standardized plugin mechanism for OCI-compliant container runtimes, such as containerd and CRI-O. Its primary purpose is to allow external, domain-specific logic to hook into the lifecycle of pods and containers at the runtime level. This interception happens after the kubelet has made its scheduling decision and has instructed the container runtime to create the container, but before the container process is actually started.

This capability is essential because the default resource management provided by Kubernetes, while robust, is necessarily generic. It treats all vCPUs as equal and has a simplified view of memory topology, which is insufficient for high-performance workloads that require strict NUMA affinity or precise CPU pinning. The kubelet’s built-in managers, like the CPU Manager and Topology Manager, attempt to address this but can be rigid and complex to configure. NRI was created to fill this gap, providing a flexible, out-of-process extension point that decouples advanced, node-level resource management from the core Kubernetes components and their release cycles.

How NRI Enables Runtime-Level Resource Manipulation

An NRI plugin is a daemon-like process that registers with the container runtime and subscribes to specific lifecycle events, such as RunPodSandbox and StopPodSandbox. When one of these events occurs, the runtime pauses its operation and sends a request to the plugin, providing the full context of the pod or container, including its OCI specification.

The plugin can then inspect this specification and make “controlled changes” to it before returning it to the runtime. This allows the plugin to perform a wide range of advanced configurations, such as:

  • Adjusting CPU and memory resource allocations with NUMA awareness.
  • Pinning container processes to specific CPU cores.
  • Modifying cgroup parameters for fine-grained Quality of Service (QoS).
  • Injecting device-specific configurations or OCI hooks.

The topology-aware NRI plugin is a prime example of this, using its knowledge of the node’s hardware topology to optimize resource assignments for containers, a task far beyond the scope of the default Kubernetes scheduler.
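
To illustrate the mechanism, the skeleton below registers a plugin with the runtime through the containerd/nri stub library and subscribes to the two pod sandbox events discussed above. It is a sketch under assumptions: the plugin name and index are arbitrary, the handlers only log, and the exact stub method signatures can differ between NRI releases.

// Sketch of an NRI plugin that subscribes to pod sandbox lifecycle events.
// Assumes github.com/containerd/nri/pkg/{api,stub}; handler signatures may
// vary slightly between NRI versions, so treat this as illustrative.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/containerd/nri/pkg/api"
	"github.com/containerd/nri/pkg/stub"
)

type plugin struct {
	stub stub.Stub
}

// RunPodSandbox fires after the kubelet has asked the runtime to create the
// pod sandbox but before the workload containers start, the point at which
// a VF can still be configured and moved into the pod's network namespace.
func (p *plugin) RunPodSandbox(ctx context.Context, pod *api.PodSandbox) error {
	fmt.Printf("pod sandbox starting: %s/%s\n", pod.GetNamespace(), pod.GetName())
	// A real plugin would look up the prepared device for this pod here
	// (e.g. from a local checkpoint) and invoke CNI ADD to plumb the VF.
	return nil
}

// StopPodSandbox fires on teardown and is the natural place for CNI DEL.
func (p *plugin) StopPodSandbox(ctx context.Context, pod *api.PodSandbox) error {
	fmt.Printf("pod sandbox stopping: %s/%s\n", pod.GetNamespace(), pod.GetName())
	return nil
}

func main() {
	p := &plugin{}
	var err error
	// The plugin name and index are arbitrary example values.
	p.stub, err = stub.New(p,
		stub.WithPluginName("sriov-example"),
		stub.WithPluginIdx("90"),
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := p.stub.Run(context.Background()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}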

The Role of NRI in a Modern Kubernetes Networking Stack

In the context of an SR-IOV networking solution, the synergy between DRA and NRI becomes clear. DRA excels at the cluster-level task of allocating a suitable SR-IOV VF to a Pod and ensuring that Pod is scheduled onto the correct node. However, once the Pod is on the node, the VF may require specific configurations—such as setting a MAC address, assigning a VLAN tag, or enabling a particular QoS profile—before it can be used by the application.

This is where NRI provides the perfect integration point. A DRA driver can pass the desired configuration parameters to the node (for example, by adding annotations to the Pod object during the allocation phase). An associated NRI plugin on that node can then intercept the RunPodSandbox event, read those parameters, and execute the necessary commands (often via a CNI ADD call) to configure the VF precisely as requested, just moments before the container’s network namespace is finalized and the application starts. When the pod is terminated, the StopPodSandbox event is used to trigger a CNI DEL call to clean up the network interface. This creates a complete, end-to-end provisioning pipeline, from declarative user request to fine-grained runtime configuration.
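
The CNI side of that handoff can be approximated with the reference libcni library. The sketch below, with placeholder paths, container ID, and network namespace, loads a network configuration list from disk and performs an ADD (and later a DEL) against a pod's network namespace; the real driver derives these values from the runtime-provided pod context and the referenced NetworkAttachmentDefinition.

// Sketch: perform a CNI ADD/DEL for a pod network namespace using libcni.
// The paths, container ID, and netns below are illustrative placeholders.
package main

import (
	"context"
	"fmt"

	"github.com/containernetworking/cni/libcni"
)

func main() {
	// Where the CNI plugin binaries (e.g. the sriov CNI) are installed.
	cniNet := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)

	// In the DRA/NRI flow this configuration would come from the
	// NetworkAttachmentDefinition named in the VfConfig.
	confList, err := libcni.ConfListFromFile("/etc/cni/net.d/10-sriov.conflist")
	if err != nil {
		panic(err)
	}

	rt := &libcni.RuntimeConf{
		ContainerID: "example-container-id",
		NetNS:       "/var/run/netns/example",
		IfName:      "net1",
	}

	// CNI ADD: moves and configures the interface inside the pod's netns.
	result, err := cniNet.AddNetworkList(context.Background(), confList, rt)
	if err != nil {
		panic(err)
	}
	fmt.Println("CNI ADD result:", result)

	// CNI DEL on teardown (normally triggered from StopPodSandbox).
	if err := cniNet.DelNetworkList(context.Background(), confList, rt); err != nil {
		panic(err)
	}
}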

Deep Dive: The dra-driver-sriov Project

The dra-driver-sriov project, developed by Sebastian Sch, represents the synthesis of these three powerful technologies: SR-IOV for hardware performance, DRA for declarative allocation, and NRI for runtime configuration. It provides a complete solution for managing SR-IOV network devices in a modern, cloud-native fashion.

Note: As the specific GitHub repository is not publicly accessible at the time of this writing, the following architectural analysis is based on the standard design patterns for DRA drivers, as exemplified by the official kubernetes-sigs/dra-example-driver, and the logical integration points for NRI.

Architecture and Workflow: Deconstructing the Driver

A production-grade DRA driver typically consists of two main components, which are deployed and managed separately within the Kubernetes cluster.

  • Control-Plane Component (Controller): An optional component that usually runs as a Kubernetes Deployment and watches ResourceClaim objects cluster-wide. With DRA's structured-parameters model, the decision of which specific VF on which node satisfies a claim is made by the Kubernetes scheduler itself, based on the published ResourceSlices, so a central controller is mainly responsible for cluster-wide bookkeeping, validation, and cleanup rather than device selection.
  • Node-Local Component (Kubelet Plugin): This component is deployed as a DaemonSet, ensuring an instance runs on every relevant worker node in the cluster. It has several key responsibilities:
    1. Discovery: It scans the local node’s hardware to discover available SR-IOV VFs.
    2. Advertisement: It publishes information about these VFs and their attributes (e.g., parent PF, PCI address, driver in use) as ResourceSlice objects in the Kubernetes API server.
    3. Lifecycle Management: It implements the kubelet plugin gRPC interface to handle NodePrepareResource and NodeUnprepareResource calls from the kubelet, performing the necessary steps to make a device available to a container and clean it up afterward.

The combination of DRA and NRI creates a complete, two-stage provisioning pipeline that mirrors the architectural division of responsibility within Kubernetes itself. DRA handles the “control plane” stage—cluster-wide scheduling and allocation—while NRI handles the “node-level” stage—final, precise, runtime configuration. This separation of concerns is architecturally elegant and highly scalable. It allows Kubernetes to make intelligent placement decisions without needing to understand the low-level details of every device, while vendors can still provide the deep, device-specific tuning required for maximum performance via NRI plugins.

The End-to-End Flow

The entire process, from a user requesting a network interface to an application using it, follows a clear and orchestrated sequence of events:

  1. Discovery: On each worker node, the dra-driver-sriov DaemonSet pod starts up, inspects the local system (e.g., via the sysfs filesystem), and identifies all available SR-IOV VFs on the physical NICs.
  2. Advertisement: The driver creates ResourceSlice objects for these VFs, populating them with relevant attributes such as the parent PF’s interface name, the VF’s PCI address, and the currently bound driver. These slices are published to the Kubernetes API server.
  3. Request: A user, such as a Telco platform engineer, deploys a CNF. The Pod specification includes a resourceClaims section that references either an existing ResourceClaim or a ResourceClaimTemplate by name, requesting an SR-IOV VF from a specific DeviceClass.
  4. Scheduling: The Kubernetes scheduler sees the pending Pod and its associated resource claim. It consults the available ResourceSlice objects across the cluster to find a node that has a suitable, available VF. Once a match is found, the scheduler allocates that specific VF to the claim and binds the Pod to that node. This allocation is recorded in the status field of the ResourceClaim object.
  5. Preparation: The Kubelet on the selected node sees the bound Pod. It calls the NodePrepareResource gRPC endpoint on the local dra-driver-sriov DaemonSet pod, informing it which VF needs to be prepared for the incoming Pod.
  6. Configuration (The NRI Hook): As the container runtime proceeds to create the pod sandbox, the dra-driver-sriov NRI plugin intercepts the RunPodSandbox event. It retrieves the prepared device details and invokes a CNI ADD command. This CNI call moves the VF into the pod’s network namespace and configures it according to the parameters (e.g., VLAN ID, MAC address) specified in the ResourceClaim and the corresponding NetworkAttachmentDefinition.
  7. Attachment: The container runtime, having received the (potentially modified) OCI spec back from the NRI plugin, finalizes the container setup. The now fully configured VF is moved into the Pod’s network namespace, where it appears as a network interface (e.g., net1) ready for use by the application.
  8. Cleanup: When the Pod is terminated, the container runtime triggers the StopPodSandbox NRI event. The plugin intercepts this and invokes a CNI DEL command to remove the VF from the pod’s network namespace and release its network configuration. The Kubelet then calls NodeUnprepareResource on the DRA driver, which restores the VF’s original kernel driver and cleans up any other node-local state.

A Deeper Look: How dra-driver-sriov Works

An examination of the project’s structure reveals a sophisticated, modular architecture where each component has a distinct responsibility. The driver is deployed via a Helm chart and runs as a single binary on each node, handling discovery, allocation, and runtime configuration.

Key Components and Their Roles

  • Main Entrypoint (cmd/dra-driver-sriov/main.go): This is the heart of the driver, responsible for parsing command-line flags, initializing all other components, and starting the main process that runs the DRA Kubelet plugin and the NRI plugin.
  • Device Discovery and State (pkg/devicestate): The discovery.go file contains logic to scan the host system’s /sys/bus/pci/devices directory to find all SR-IOV capable Physical Functions (PFs) and their associated Virtual Functions (VFs). It gathers detailed attributes for each VF, such as its PCI address, vendor/device IDs, parent PF name, and NUMA node. This information is managed by state.go, which maintains an in-memory database of all allocatable devices on the node. A simplified sketch of this sysfs scan appears after this list.
  • Resource Filtering (pkg/controller/resourcefiltercontroller.go): This controller runs within the driver and watches for SriovResourceFilter custom resources. This CRD allows administrators to define which VFs are made available for allocation. The controller matches node labels against the nodeSelector in the CR and applies the specified filters (e.g., by pfNames, vendors, devices) to the discovered VFs. Only VFs that pass these filters are marked with a resourceName and become eligible for advertisement and allocation.
  • Core DRA Driver (pkg/driver): This component implements the gRPC server that communicates with the Kubelet.
    • driver.go handles the registration of the plugin and the publishing of available (filtered) VFs as ResourceSlice objects to the Kubernetes API server.
    • dra_hook.go implements the NodePrepareResource and NodeUnprepareResource gRPC calls. When NodePrepareResource is called for a ResourceClaim, it fetches the VfConfig parameters. It then uses the devicestate manager to perform actions like binding the VF’s PCI address to a specific kernel driver (e.g., vfio-pci) if requested.
  • Container Device Interface (pkg/cdi): After a device is prepared, the DRA driver calls the CDI handler. cdi.go is responsible for creating CDI JSON specification files in /var/run/cdi. These files instruct the container runtime (like CRI-O or containerd) on how to expose the device to the container. This includes injecting environment variables (e.g., SRIOVNETWORK_PCI_ADDRESSES) and creating device nodes (e.g., /dev/vfio/1) inside the container for drivers like vfio-pci.
  • Pod State Management (pkg/podmanager): This component acts as a state machine, tracking which devices have been prepared for which Pod and ResourceClaim. It uses a checkpoint file (checkpoint.json) to persist this state, ensuring that the driver can recover its state and correctly clean up resources after a restart.
  • Node Resource Interface (pkg/nri): The NRI plugin (nri.go) hooks into the container runtime’s lifecycle. It intercepts the RunPodSandbox and StopPodSandbox events. On RunPodSandbox, which occurs after the Kubelet has prepared the resources but before the container starts, the NRI plugin retrieves the prepared device details from the podmanager.
  • Container Network Interface (pkg/cni): The NRI plugin invokes the CNI logic in cni.go. During the RunPodSandbox hook, it executes a CNI ADD command, using the configuration from the NetworkAttachmentDefinition specified in the VfConfig. This crucial step moves the VF network interface into the Pod’s network namespace and configures its IP address, MAC address, and other network settings. During the StopPodSandbox hook, it executes a CNI DEL command to detach the interface and clean up the network configuration.
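
A simplified version of that discovery scan, relying only on the standard sysfs layout (the virtfn* links under a PF's PCI device directory plus its vendor, device, numa_node, and driver entries), might look like the following. It sketches the general technique rather than the project's actual code, and the PF PCI address in main() is a placeholder.

// Sketch: discover SR-IOV VFs by walking the sysfs PCI hierarchy.
// This mirrors the general technique, not the project's exact code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type vfInfo struct {
	PCIAddress string // e.g. 0000:3b:02.1
	ParentPF   string // PCI address of the physical function
	Vendor     string // e.g. 0x8086
	Device     string
	NUMANode   string
	Driver     string // currently bound kernel driver, if any
}

func readSys(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

func discoverVFs(pfPCIAddr string) ([]vfInfo, error) {
	pfDir := filepath.Join("/sys/bus/pci/devices", pfPCIAddr)

	// Each VF of a PF is exposed as a virtfnN symlink pointing at its PCI device.
	links, err := filepath.Glob(filepath.Join(pfDir, "virtfn*"))
	if err != nil {
		return nil, err
	}
	var vfs []vfInfo
	for _, link := range links {
		target, err := os.Readlink(link) // e.g. ../0000:3b:02.1
		if err != nil {
			continue
		}
		addr := filepath.Base(target)
		vfDir := filepath.Join("/sys/bus/pci/devices", addr)

		drv := ""
		if t, err := os.Readlink(filepath.Join(vfDir, "driver")); err == nil {
			drv = filepath.Base(t)
		}
		vfs = append(vfs, vfInfo{
			PCIAddress: addr,
			ParentPF:   pfPCIAddr,
			Vendor:     readSys(filepath.Join(vfDir, "vendor")),
			Device:     readSys(filepath.Join(vfDir, "device")),
			NUMANode:   readSys(filepath.Join(vfDir, "numa_node")),
			Driver:     drv,
		})
	}
	return vfs, nil
}

func main() {
	vfs, err := discoverVFs("0000:3b:00.0") // example PF PCI address
	if err != nil {
		panic(err)
	}
	for _, vf := range vfs {
		fmt.Printf("%+v\n", vf)
	}
}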

End-to-End Workflow Revisited

  1. Startup: The dra-driver-sriov binary starts on a node. The devicestate manager discovers all VFs. The resourcefiltercontroller applies any SriovResourceFilter policies, determining the final set of allocatable VFs. The driver publishes these VFs as ResourceSlice objects.
  2. Allocation: A Pod requests a VF via a ResourceClaim. The Kubernetes scheduler allocates a specific VF on a specific node.
  3. Preparation (Kubelet -> DRA): The Kubelet calls NodePrepareResource. The driver binds the VF to the requested kernel driver (e.g., vfio-pci) and calls the cdi handler to write the CDI spec file. The podmanager records the prepared state. A minimal example of such a CDI spec is sketched after this list.
  4. Attachment (Runtime -> NRI -> CNI): The container runtime triggers the RunPodSandbox NRI event. The nri plugin looks up the prepared device in the podmanager and calls the cni component to execute CNI ADD, plumbing the VF into the Pod’s network namespace.
  5. Cleanup: When the Pod is terminated, the StopPodSandbox NRI event triggers a CNI DEL. The Kubelet’s NodeUnprepareResource call triggers the driver to restore the VF’s original kernel driver and clean up state in the podmanager and cdi cache.
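
To make the CDI step tangible, the sketch below writes a minimal CDI specification for a VFIO device node into /var/run/cdi using hand-rolled structs that model only a small subset of the format. The kind, device name, environment variable value, and paths are illustrative; the real driver generates its specs from the claim's actual parameters.

// Sketch: write a minimal CDI spec exposing a VFIO device node to a container.
// The structs model only a small subset of the CDI spec; names and paths are
// illustrative placeholders.
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

type cdiDeviceNode struct {
	Path string `json:"path"`
}

type cdiContainerEdits struct {
	Env         []string        `json:"env,omitempty"`
	DeviceNodes []cdiDeviceNode `json:"deviceNodes,omitempty"`
}

type cdiDevice struct {
	Name           string            `json:"name"`
	ContainerEdits cdiContainerEdits `json:"containerEdits"`
}

type cdiSpec struct {
	Version string      `json:"cdiVersion"`
	Kind    string      `json:"kind"`
	Devices []cdiDevice `json:"devices"`
}

func main() {
	spec := cdiSpec{
		Version: "0.6.0",
		Kind:    "sriovnetwork.openshift.io/vf", // vendor/class pair is illustrative
		Devices: []cdiDevice{{
			Name: "0000-3b-02.1",
			ContainerEdits: cdiContainerEdits{
				// Tell the workload which VF it was given.
				Env: []string{"SRIOVNETWORK_PCI_ADDRESSES=0000:3b:02.1"},
				// Expose the VFIO group device node for userspace drivers.
				DeviceNodes: []cdiDeviceNode{{Path: "/dev/vfio/1"}},
			},
		}},
	}

	data, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	// The container runtime picks up specs from the CDI directory.
	out := filepath.Join("/var/run/cdi", "sriovnetwork.openshift.io-vf.json")
	if err := os.WriteFile(out, data, 0o644); err != nil {
		panic(err)
	}
}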

Practical Use-Cases and Configuration Examples

The flexibility of this model enables a variety of advanced, high-performance use cases, as demonstrated by the project’s examples.

Basic Usage: Claiming a Single VF

This is the most fundamental use case, where a single pod requests one SR-IOV VF. This is ideal for applications that need a single, hardware-accelerated network interface.

First, a NetworkAttachmentDefinition is created to define the CNI configuration, including the IPAM (IP Address Management) plugin.

# NetworkAttachmentDefinition for the SR-IOV CNI
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vf-test1
  namespace: vf-test1
spec:
  config: |-
    {
        "cniVersion": "1.0.0",
        "name": "vf-test1",
        "type": "sriov",
        "ipam": {
            "type": "host-local",
            "ranges": [[{"subnet": "10.0.1.0/24"}]]
        }
    }    

Next, a ResourceClaimTemplate is defined. This template will be used to dynamically generate a ResourceClaim for each pod that references it. The VfConfig parameters instruct the driver to name the interface net1 inside the container and to use the vf-test1 NetworkAttachmentDefinition for configuration.

# ResourceClaimTemplate to request a single VF
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: vf-test1
  name: single-vf
spec:
  spec:
    devices:
      requests:
      - name: vf
        exactly:
          deviceClassName: sriovnetwork.openshift.io
      config:
      - requests: ["vf"]
        opaque:
          driver: sriovnetwork.openshift.io
          parameters:
            apiVersion: sriovnetwork.openshift.io/v1alpha1
            kind: VfConfig
            ifName: net1
            netAttachDefName: vf-test1

Finally, the Pod consumes the claim. The resourceClaims field links the pod to the ResourceClaimTemplate, which triggers the allocation process.

# Pod consuming the single VF
apiVersion: v1
kind: Pod
metadata:
  namespace: vf-test1
  name: pod0
spec:
  containers:
  - name: ctr0
    image: quay.io/schseba/toolbox:latest
    command: ["/bin/bash", "-c", "sleep INF"]
    securityContext:
      capabilities: { add: ["NET_ADMIN"] } # adjust the capability list to the workload's needs
    resources:
      claims:
      - name: vf
  resourceClaims:
  - name: vf
    resourceClaimTemplateName: single-vf

Advanced Networking: Multiple VFs in a Single Pod

For applications requiring higher availability, load balancing, or network segmentation, a single pod can claim multiple VFs. This is achieved by setting count: 2 in the ResourceClaimTemplate. The driver will allocate two separate VFs, and the CNI will typically name them sequentially (e.g., net1, net2) inside the pod’s network namespace.

# ResourceClaimTemplate requesting two VFs
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: vf-test3
  name: multiple-vf
spec:
  spec:
    devices:
      requests:
      - name: vf
        exactly:
          deviceClassName: sriovnetwork.openshift.io
          count: 2 # Request exactly two VFs
      config:
        - requests: [ "vf" ]
          opaque:
            driver: sriovnetwork.openshift.io
            parameters:
              apiVersion: sriovnetwork.openshift.io/v1alpha1
              kind: VfConfig
              netAttachDefName: vf-test

Userspace Networking: Using the VFIO-PCI Driver

For high-performance userspace applications like DPDK, the kernel’s networking stack can be bypassed entirely. This requires binding the VF to the vfio-pci driver instead of a standard kernel network driver. The VfConfig allows this to be specified declaratively. The driver will handle unbinding from the kernel driver, binding to vfio-pci, and creating the necessary character device files (e.g., /dev/vfio/1) inside the container via CDI.

# ResourceClaimTemplate requesting a VF with the vfio-pci driver
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: vf-test2
  name: single-vf
spec:
  spec:
    devices:
      requests:
      - name: vf
        exactly:
          deviceClassName: vf-test2
      config:
      - requests: ["vf"]
        opaque:
          driver: sriovnetwork.openshift.io
          parameters:
            apiVersion: sriovnetwork.openshift.io/v1alpha1
            kind: VfConfig
            ifName: net1
            netAttachDefName: vf-test
            driver: vfio-pci # Specify the userspace driver
            addVhostMount: true # Optionally mount vhost devices
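
Under the hood, rebinding a VF to vfio-pci is a plain sysfs operation: unbind the device from its current driver, set driver_override, and ask the PCI core to re-probe it. The sketch below shows that sequence for a placeholder PCI address; the driver performs the equivalent steps during preparation and reverses them in NodeUnprepareResource.

// Sketch: rebind a VF to vfio-pci via sysfs (unbind -> driver_override -> probe).
// The PCI address is an illustrative placeholder.
package main

import (
	"os"
	"path/filepath"
)

func bindToVfioPci(pciAddr string) error {
	devDir := filepath.Join("/sys/bus/pci/devices", pciAddr)

	// 1. Unbind the device from its current driver, if one is bound.
	if _, err := os.Readlink(filepath.Join(devDir, "driver")); err == nil {
		if err := os.WriteFile(filepath.Join(devDir, "driver", "unbind"),
			[]byte(pciAddr), 0o200); err != nil {
			return err
		}
	}

	// 2. Force the next probe of this device to use vfio-pci.
	if err := os.WriteFile(filepath.Join(devDir, "driver_override"),
		[]byte("vfio-pci"), 0o644); err != nil {
		return err
	}

	// 3. Ask the PCI core to re-probe the device.
	return os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddr), 0o200)
}

func main() {
	if err := bindToVfioPci("0000:3b:02.1"); err != nil {
		panic(err)
	}
}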

Fine-Grained Selection: Filtering Available VFs

In complex environments, administrators may need to partition VFs into different pools. The SriovResourceFilter Custom Resource allows administrators to create named resource pools based on hardware attributes like vendor ID or the parent Physical Function name (pfNames).

# SriovResourceFilter to create two named resource pools
apiVersion: sriovnetwork.openshift.io/v1alpha1
kind: SriovResourceFilter
metadata:
  name: example-resource-filter
  namespace: dra-sriov-driver
spec:
  nodeSelector:
    kubernetes.io/hostname: dra-ctlplane-0.dra.lab
  configs:
  - resourceName: "eth0_resource"
    resourceFilters:
    - vendors: ["8086"]
      pfNames: ["eth0"]
  - resourceName: "eth1_resource"
    resourceFilters:
    - vendors: ["8086"]
      pfNames: ["eth1"]

Workloads can then target a specific pool using a CEL expression in the ResourceClaimTemplate’s selectors field, ensuring they receive a VF with the desired characteristics.

# ResourceClaimTemplate selecting a VF from a filtered pool
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: vf-test4
  name: single-vf
spec:
  spec:
    devices:
      requests:
      - name: vf
        exactly:
          deviceClassName: sriovnetwork.openshift.io
          count: 1
          selectors:
          - cel: # Use CEL to select the desired resource pool
              expression: device.attributes["sriovnetwork.openshift.io"].resourceName == "eth1_resource"
      config:
      - requests: ["vf"]
        opaque:
          driver: sriovnetwork.openshift.io
          parameters:
            apiVersion: sriovnetwork.openshift.io/v1alpha1
            kind: VfConfig
            ifName: net1
            netAttachDefName: vf-test1

Conclusion: The Future of High-Performance Networking in Kubernetes

The dra-driver-sriov project and the underlying technologies it integrates—SR-IOV, DRA, and NRI—represent a significant milestone in the evolution of cloud-native infrastructure. It demonstrates a complete, architecturally sound solution to one of the most persistent challenges in the container ecosystem: providing high-performance, hardware-accelerated networking for demanding workloads in a declarative, automated, and Kubernetes-native manner.

By synthesizing SR-IOV’s raw hardware speed, DRA’s elegant, claim-based allocation model, and NRI’s precise, runtime control, this approach solves a critical problem for the telecommunications industry’s transition to 5G and NFV. It provides a clear path for deploying latency-sensitive and high-throughput CNFs on Kubernetes without compromising on performance or manageability.

More broadly, this project is a powerful indicator of a larger trend: the maturation of Kubernetes as the de facto operating system for distributed systems, capable of managing not only stateless web applications but the full spectrum of enterprise and service provider workloads. The shift from imperative, script-driven configuration of specialized hardware to a declarative, API-centric model is a paradigm shift. It lowers the barrier to entry for complex, hardware-dependent applications and will undoubtedly accelerate the adoption of cloud-native principles in industries like telecommunications, high-performance computing, and artificial intelligence for years to come.
