Version 3 (modified by rjs51, 7 years ago)

Update to reflect that DPDK v1.5 or newer is now required and the changes made to the build system.

Notes on Libtrace Intel Data Plane Development Kit (DPDK) Support

This format is considered experimental and has limitations that should be understood before using it.

The Intel Data Plane Development Kit format allows packets to be captured in a truly zero-copy manner and provides direct access to every packet with almost zero overhead. This leaves more CPU time for your application to process packets. Libtrace's Intel DPDK capture format works in a very similar way to the DAG capture format. The format supports most Intel NICs; see the DPDK release notes PDF.

Documentation and source code for the Intel DPDK can be downloaded from http://www.intel.com/go/DPDK; the links are in a box at the bottom of the page.

System Requirements

  • gettimeofday() and/or clock_gettime() must be implemented as virtual system calls on your Linux kernel. These are called for every packet received, so the advantage of using DPDK will be lost if a real system call still has to be made.
  • DPDK is a polling format, so a multicore system is highly recommended so that other processes can run on the remaining cores.
  • For better performance, the CPU core that DPDK is bound to should run only DPDK; interrupts could then be disabled on this core.
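
A rough way to check the first requirement is to time a batch of clock_gettime() calls. The following is a minimal sketch and not part of libtrace: the helper name avg_clock_gettime_ns is hypothetical, and what counts as "cheap enough" is left to the reader. As a rule of thumb (an assumption, not a figure measured for any particular system), a virtual system call tends to cost tens of nanoseconds, while a real system call costs noticeably more.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a struct timespec to a single nanosecond count. */
static uint64_t timespec_to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;
}

/*
 * Hypothetical helper: average cost in nanoseconds of one
 * clock_gettime(CLOCK_REALTIME) call, measured over `iterations` calls.
 * CLOCK_MONOTONIC brackets the loop so that wall-clock adjustments
 * do not skew the measurement.
 */
static double avg_clock_gettime_ns(unsigned long iterations)
{
    struct timespec start, end, scratch;
    int rc;

    assert(iterations > 0);
    rc = clock_gettime(CLOCK_MONOTONIC, &start);
    assert(rc == 0);
    for (unsigned long i = 0; i < iterations; i++) {
        rc = clock_gettime(CLOCK_REALTIME, &scratch);
        assert(rc == 0);
    }
    rc = clock_gettime(CLOCK_MONOTONIC, &end);
    assert(rc == 0);
    (void)rc; /* silence unused warnings when NDEBUG disables assert */

    return (double)(timespec_to_ns(&end) - timespec_to_ns(&start))
           / (double)iterations;
}
```

Calling avg_clock_gettime_ns(1000000) from a small main() and printing the result gives a per-call figure that can be compared between machines or kernel configurations.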

Libtrace application requirements

  • DPDK v1.5 or newer is required
  • The same thread must create and start the trace, read/write packets, and make all other calls to libtrace format-dependent functions.
  • Minimal processing should be done on the thread interacting with libtrace and the DPDK format, for two main reasons:
    1. Packets will be dropped when queues fill up (this applies to all formats).
    2. The timestamping of DPDK packets occurs when trace_read_packet() is called (using gettimeofday()), so the longer packet processing takes, the less accurate the timestamps are, unless hardware timestamping is being used.
  • When using the DPDK format the system should remain on at all times; don't put it to sleep or into hibernation.
  • The DPDK format only allows one trace to be created at any given time. This means only a single interface can be reading or writing (but not both) using the DPDK format at once. This does not stop other libtrace formats from being used.

Basic Setup Guide for libtrace with Intel DPDK

It is strongly recommended that you build and test Intel DPDK with its included samples and verify they are functioning correctly before attempting to build libtrace with DPDK.

  1. Read the DPDK Getting Started Guide and make sure the prerequisites, such as hugepages, are met.
  2. Download DPDK from the Intel website http://www.intel.com/go/DPDK or dpdk.org http://www.dpdk.org/
  3. Extract the archive:
    ~unzip DPDK-1.6.0-18 -d IntelDPDK
    ~cd IntelDPDK/DPDK-1.6.0
  4. Apply optional patches (for a specific card, hardware timestamping, etc.). Make sure the matching changes are also made to the libtrace defines where needed.
  5. Build the DPDK library with CONFIG_RTE_BUILD_COMBINE_LIBS=y and EXTRA_CFLAGS="-fPIC" added. This should create the static library x86_64-default-linuxapp-gcc/libs/libintel_dpdk.a required by libtrace. Note that prior to DPDK v1.5, CONFIG_RTE_BUILD_COMBINE_LIBS is not supported and this library will not be created.
    ~make install T=x86_64-default-linuxapp-gcc CONFIG_RTE_BUILD_COMBINE_LIBS=y EXTRA_CFLAGS="-fPIC" 
  6. Export RTE_SDK and RTE_TARGET
    ~export RTE_SDK=`pwd`
    ~export RTE_TARGET=x86_64-default-linuxapp-gcc
  7. Set any advanced options within libtrace if you have applied patches (defines at the top of ./lib/format_dpdk.c)
  8. Configure and build. RTE_SDK and RTE_TARGET must be set in the environment for Intel DPDK to be detected
    ~cd ../../libtrace-svn/
    ~./configure
    ~make
    ~sudo make install
  9. Load the DPDK modules
    ~cd $RTE_SDK/$RTE_TARGET/kmod
    ~sudo modprobe uio
    ~sudo insmod ./igb_uio.ko
  10. Use the pci_unbind.py tool (found in IntelDPDK/tools/) to bind the port you want to use to the igb_uio driver
    ~cd $RTE_SDK/tools
    ~sudo ./pci_unbind.py --status
    Network devices using IGB_UIO driver
    ====================================
    <none>

    Network devices using kernel driver
    ===================================
    0000:01:00.0 '82580 Gigabit Network Connection' if=eth1 drv=igb unused=igb_uio 
    0000:01:00.1 '82580 Gigabit Network Connection' if=eth2 drv=igb unused=igb_uio 
    0000:03:00.0 'NetXtreme BCM5754 Gigabit Ethernet PCI Express' if=eth0 drv=tg3 unused=<none> *Active*

    Other network devices
    =====================
    <none>
    ~sudo ./pci_unbind.py -b igb_uio 0000:01:00.0
    ~sudo ./pci_unbind.py --status
    Network devices using IGB_UIO driver
    ====================================
    0000:01:00.0 '82580 Gigabit Network Connection' drv=igb_uio unused=

    Network devices using kernel driver
    ===================================
    0000:01:00.1 '82580 Gigabit Network Connection' if=eth2 drv=igb unused=igb_uio 
    0000:03:00.0 'NetXtreme BCM5754 Gigabit Ethernet PCI Express' if=eth0 drv=tg3 unused=<none> *Active*

    Other network devices
    =====================
    <none>

  11. Test a libtrace tool; the PCI address can be found with the pci_unbind.py tool
    ~/tracesummary dpdk:0000:01:00.0

Advanced Settings (defines at the top of libtrace/lib/format_dpdk.c)

This is based upon testing using the Intel DPDK 1.3.1_7 (no longer supported by libtrace) and an Intel 82580-based Ethernet controller. Some of these settings are not supported by all controllers.

NB_RX_MBUF - Number of memory buffers i.e. number of packets in the ring buffer

Patch included: libtrace/Intel DPDK Patches/larger_ring.patch

NB_RX_MBUF controls the maximum number of packets the DPDK format can buffer at one time. In general, the larger this is, the lower the packet drop rate (ideally this becomes 0). The PMD driver limits NB_RX_MBUF to 4k per RX ring. For the IGB driver this limit is controlled by a define located in IntelDPDK/lib/librte_pmd_e1000/igb_rxtx.c, line 1063:

#define IGB_MAX_RING_DESC

It appears this can be increased without any side effects (except more memory usage). There is a limit of 65535 because DPDK uses a uint16_t to represent this size. To exceed this, multiple queues would need to be used (not supported by libtrace). NOTE: 65535 itself cannot be used directly because of the alignment size; however, 65536 - alignment (for example, 65536 - 128) can be used. If you want to use this setting on your Intel NIC, check the documentation to make sure there isn't a hardware limit placed on this value.
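
The arithmetic in the note above can be sketched as follows. This is an illustration only: the helper name max_aligned_ring_size is hypothetical, and the alignment your driver actually requires is an assumption you need to confirm in the DPDK source (128 is simply the example value used in the text).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the largest descriptor count that still fits in a
 * uint16_t and is a multiple of `alignment`. For a power-of-two alignment
 * this equals 65536 - alignment.
 */
static uint16_t max_aligned_ring_size(uint16_t alignment)
{
    assert(alignment != 0);
    /* Round UINT16_MAX (65535) down to the nearest multiple of alignment. */
    return (uint16_t)(UINT16_MAX - (UINT16_MAX % alignment));
}
```

With an alignment of 128, max_aligned_ring_size(128) returns 65408, which is 65536 - 128.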

Capturing Bad Packets - Those with an Ethernet checksum mismatch

A minor change can be made to the PMD driver IntelDPDK/lib/librte_pmd_e1000/igb_rxtx.c to keep packets with bad Ethernet checksums, which would otherwise be dropped by default. Simply change rctl &= ~E1000_RCTL_SBP; to rctl |= E1000_RCTL_SBP;
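
The effect of that one-line change is to set, rather than clear, the "store bad packets" bit in the receive control register value. The sketch below shows the two operations side by side on a plain integer. It does not touch hardware, the function name is hypothetical, and SKETCH_RCTL_SBP is a stand-in whose value is assumed here; use E1000_RCTL_SBP from the driver headers in real code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for E1000_RCTL_SBP; the value is an assumption for this sketch. */
#define SKETCH_RCTL_SBP 0x00000004u

/*
 * Hypothetical helper: return the register value with the
 * "store bad packets" bit set (keep_bad) or cleared (driver default).
 */
static uint32_t rctl_store_bad_packets(uint32_t rctl, bool keep_bad)
{
    if (keep_bad)
        rctl |= SKETCH_RCTL_SBP;   /* patched behaviour: keep bad frames */
    else
        rctl &= ~SKETCH_RCTL_SBP;  /* default behaviour: drop bad frames */
    return rctl;
}
```

All other bits in the register value are left untouched by either operation.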

NOTE: Bad packets don't appear to get timestamped, so this will cause problems if used with hardware timestamping, because there is no way of knowing whether a packet is bad, and hence whether a timestamp is sitting in front of the packet.

HAS_HW_TIMESTAMPS_82580 - Hardware Timestamping Packets (Implemented for Intel 82580 based NICs)

To get a hardware timestamp from the Intel DPDK, a change must be made to the PMD driver. I've made a patch for Intel 82580-based NICs; see libtrace/Intel DPDK Patches/hardware_timestamp.patch. Apply this patch to DPDK first, then set the HAS_HW_TIMESTAMPS_82580 define in format_dpdk.c to 1. Once applied, the libtrace DPDK format can only be used with Intel 82580 controllers. Packets must be read by calling trace_read_packet() within half of the hardware clock's wrap-around time, which for the Intel 82580 controller is 18/2 = 9 seconds.

In order to use timestamping, the Intel NIC must support Receive Packet Timestamp in Buffer. This means the NIC will place the timestamp in a header before the packet data. Libtrace then needs to interpret this header correctly. Things that need to be considered are:

  • Clock resolution - convert this to nanoseconds
  • Synchronizing with the current time - record the time of the first packet you've received and add this to all packets after it.
  • Timer wrap around - Compare system time to that of the last packet received and estimate how many times the timer has (possibly) wrapped around then pick what makes sense.
  • Consider what happens after the device is paused - You need to restart timestamps because the clock will be reset when starting it again.

The current implementation gets a system timestamp (hopefully via vsyscall) every time a packet is received. On a system that doesn't implement vsyscalls, this could be done differently by starting a background thread that increments a counter (i.e. does what estimated_wraps does) every 18 seconds, when the clock is expected to wrap around. At that point the thread should get the system time to make sure it stays correctly in sync, and the next sleep should be based on the difference.
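
The synchronisation and wrap-around considerations above can be sketched as follows. This is not the code from libtrace: the struct and function names are hypothetical, the hardware value is assumed to have already been converted to nanoseconds, and the wrap period is passed in rather than hard-coded. The sketch rounds to the nearest whole number of wraps, which only gives the right answer when the packet is read within half a wrap period of its arrival; that is where the "half of the wrap-around time" rule comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-capture state, recorded when the first packet is read. */
struct hw_clock_state {
    uint64_t wrap_ns;      /* hardware counter period in nanoseconds */
    uint64_t first_sys_ns; /* system time when the first packet was read */
    uint64_t first_hw_ns;  /* hardware counter value of the first packet */
};

/*
 * Hypothetical helper: map a wrapped hardware timestamp (hw_ns) to
 * wall-clock nanoseconds, using the system time at which the packet is
 * being read (now_sys_ns) to estimate how many times the counter wrapped.
 */
static uint64_t hw_to_wall_ns(const struct hw_clock_state *st,
                              uint64_t hw_ns, uint64_t now_sys_ns)
{
    assert(st->wrap_ns > 0);
    assert(hw_ns < st->wrap_ns && st->first_hw_ns < st->wrap_ns);

    /* Time since the first packet according to the system clock. */
    uint64_t sys_elapsed = now_sys_ns - st->first_sys_ns;

    /* Time since the first packet according to the NIC, modulo one wrap. */
    uint64_t hw_delta = (hw_ns + st->wrap_ns - st->first_hw_ns) % st->wrap_ns;

    /* Pick the wrap count that brings the NIC time closest to system time. */
    uint64_t wraps = 0;
    if (sys_elapsed > hw_delta)
        wraps = (sys_elapsed - hw_delta + st->wrap_ns / 2) / st->wrap_ns;

    return st->first_sys_ns + hw_delta + wraps * st->wrap_ns;
}
```

For example, with an 18 second wrap period, a first packet at hardware time 5 s, and a later packet stamped 9 s by the hardware but read 42 s after the first, the sketch infers two wraps and places the packet 40 s after the first one.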

GET_MAC_CRC_CHECKSUM

This option can be turned on by setting the define GET_MAC_CRC_CHECKSUM to 1. This gets the full packet including the checksum. This is safe to turn on; however, note that when writing to native interfaces such as int: and ring:, it is assumed that there is no checksum.

USE_CLOCK_GETTIME

Use clock_gettime() instead of gettimeofday() (nanoseconds vs microseconds). This should only be considered if clock_gettime() is a virtual system call on your system. Remember that this timestamp is added by libtrace when trace_read_packet() is called, so it is unlikely to be close enough to the hardware receive time to justify nanosecond precision anyway. If you require timestamps accurate to the nanosecond, hardware timestamping is the only way to truly achieve this.

NOTE: This setting has no effect if hardware timestamping is already being used.
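
The difference between the two clock sources can be sketched with a compile-time switch. This mirrors the idea of the USE_CLOCK_GETTIME define but is not the libtrace implementation; the function name software_timestamp_ns is hypothetical, and the default value of the define in this sketch is an assumption.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

#ifndef USE_CLOCK_GETTIME
#define USE_CLOCK_GETTIME 1 /* sketch default; set to 0 to use gettimeofday() */
#endif

/* Hypothetical helper: current wall-clock time in nanoseconds. */
static uint64_t software_timestamp_ns(void)
{
#if USE_CLOCK_GETTIME
    /* Nanosecond resolution. */
    struct timespec ts;
    int rc = clock_gettime(CLOCK_REALTIME, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#else
    /* Microsecond resolution, scaled up to nanoseconds. */
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)tv.tv_sec * 1000000000ULL + (uint64_t)tv.tv_usec * 1000ULL;
#endif
}
```

Either branch returns a value in the same unit, so the rest of the code does not need to know which clock source was compiled in; only the resolution of the low-order digits differs.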

Capturing Jumbo Frames

Jumbo frames can be captured by setting TRACE_OPTION_SNAPLEN using trace_config(). The size specified here excludes the checksum size and is limited to around 9k by most Intel NICs. TRACE_OPTION_SNAPLEN may be set to less than the maximum Ethernet packet size of 1514; however, the NIC will then drop any packet larger than that setting. For example, if snaplen is set to 100, any packet over 100 bytes + 4 bytes (Ethernet CRC) will be dropped automatically by the NIC.
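
The drop rule described above can be written as a one-line check. This is a sketch of the arithmetic only, not of the NIC or libtrace internals; the function and constant names are hypothetical. On the libtrace side the snaplen itself is set through trace_config() with TRACE_OPTION_SNAPLEN, as stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Length of the Ethernet CRC that follows the frame data. */
#define SKETCH_ETHER_CRC_LEN 4u

/*
 * Hypothetical helper: true if a frame of `frame_len_with_crc` bytes
 * (including the 4 byte CRC) exceeds the configured snaplen and would
 * therefore be dropped by the NIC rather than truncated.
 */
static bool nic_drops_frame(uint32_t frame_len_with_crc, uint32_t snaplen)
{
    return frame_len_with_crc > snaplen + SKETCH_ETHER_CRC_LEN;
}
```

With a snaplen of 100, a 104 byte frame (100 bytes plus CRC) is kept and a 105 byte frame is dropped, so the snaplen behaves as a size filter here rather than as truncation.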