Test Environment¶
Physical Testbeds¶
FD.io CSIT performance tests are executed on physical testbeds hosted by the Linux Foundation (LF) for the FD.io project. Two physical testbed topology types are used:
- 3-Node Topology: Two servers acting as SUTs (Systems Under Test) and one server acting as TG (Traffic Generator), all connected in a ring topology.
- 2-Node Topology: One server acting as SUT and one server acting as TG, both connected in a ring topology.
Tested SUT servers are based on a range of processors, including Intel Xeon Haswell-SP, Intel Xeon Skylake-SP, Arm, and Intel Atom. A more detailed description is provided in Physical Testbeds. Tested logical topologies are described in Logical Topologies.
Server Specifications¶
Complete technical specifications of the compute servers used in CSIT physical testbeds are maintained on the FD.io wiki pages CSIT/Testbeds: Xeon Hsw, VIRL and CSIT Testbeds: Xeon Skx, Arm, Atom.
Pre-Test Server Calibration¶
A number of SUT server sub-system runtime parameters have been identified as impacting data plane performance tests. Calibrating those parameters is part of FD.io CSIT pre-test activities and includes measuring and reporting the following:
- System-level core jitter – measure the duration of core interrupts by Linux, in clock cycles, and how often the interrupts happen, using the CPU core jitter tool.
- Memory bandwidth – measure bandwidth with Intel MLC tool.
- Memory latency – measure memory latency with Intel MLC tool.
- Cache latency at all levels (L1, L2, and Last Level Cache) – measure cache latency with Intel MLC tool.
Measured values of the listed parameters are especially important for repeatable zero packet loss throughput measurements across multiple system instances. More generally, they are useful as background data when comparing data plane performance results across disparate servers.
The following sections include measured calibration data for Intel Xeon Haswell and Intel Xeon Skylake testbeds.
Calibration Data - Haswell¶
The following sections include sample calibration data measured on the t1-sut1 server running in one of the Intel Xeon Haswell testbeds, as specified in CSIT/Testbeds: Xeon Hsw, VIRL.
Calibration data obtained from all other servers in the Haswell testbeds shows the same or similar values.
Linux cmdline¶
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.4.0-72-generic root=UUID=efb7e8b3-3548-4440-98f6-6ebe102e9ec6 ro isolcpus=1-17,19-35 nohz_full=1-17,19-35 rcu_nocbs=1-17,19-35 intel_pstate=disable console=tty0 console=ttyS0,115200n8
Linux uname¶
$ uname -a
Linux t3-sut2 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
System-level core jitter¶
$ sudo taskset -c 3 /home/testuser/pma_tools/jitter/jitter -i 30
Linux Jitter testing program version 1.8
Iterations=30
The pragram will execute a dummy function 80000 times
Display is updated every 20000 displayUpdate intervals
Timings are in CPU Core cycles
Inst_Min: Minimum Excution time during the display update interval(default is ~1 second)
Inst_Max: Maximum Excution time during the display update interval(default is ~1 second)
Inst_jitter: Jitter in the Excution time during rhe display update interval. This is the value of interest
last_Exec: The Excution time of last iteration just before the display update
Abs_Min: Absolute Minimum Excution time since the program started or statistics were reset
Abs_Max: Absolute Maximum Excution time since the program started or statistics were reset
tmp: Cumulative value calcualted by the dummy function
Interval: Time interval between the display updates in Core Cycles
Sample No: Sample number
Inst_Min Inst_Max Inst_jitter last_Exec Abs_min Abs_max tmp Interval Sample No
160024 172636 12612 160028 160024 172636 1573060608 3205463144 1
160024 188236 28212 160028 160024 188236 958595072 3205500844 2
160024 185676 25652 160028 160024 188236 344129536 3205485976 3
160024 172608 12584 160024 160024 188236 4024631296 3205472740 4
160024 179260 19236 160028 160024 188236 3410165760 3205502164 5
160024 172432 12408 160024 160024 188236 2795700224 3205452036 6
160024 178820 18796 160024 160024 188236 2181234688 3205455408 7
160024 172512 12488 160028 160024 188236 1566769152 3205461528 8
160024 172636 12612 160028 160024 188236 952303616 3205478820 9
160024 173676 13652 160028 160024 188236 337838080 3205470412 10
160024 178776 18752 160028 160024 188236 4018339840 3205481472 11
160024 172788 12764 160028 160024 188236 3403874304 3205492336 12
160024 174616 14592 160028 160024 188236 2789408768 3205474904 13
160024 174440 14416 160028 160024 188236 2174943232 3205479448 14
160024 178748 18724 160024 160024 188236 1560477696 3205482668 15
160024 172588 12564 169404 160024 188236 946012160 3205510496 16
160024 172636 12612 160024 160024 188236 331546624 3205472204 17
160024 172480 12456 160024 160024 188236 4012048384 3205455864 18
160024 172740 12716 160028 160024 188236 3397582848 3205464932 19
160024 179200 19176 160028 160024 188236 2783117312 3205476012 20
160024 172480 12456 160028 160024 188236 2168651776 3205465632 21
160024 172728 12704 160024 160024 188236 1554186240 3205497204 22
160024 172620 12596 160028 160024 188236 939720704 3205466972 23
160024 172640 12616 160028 160024 188236 325255168 3205471216 24
160024 172484 12460 160028 160024 188236 4005756928 3205467388 25
160024 172636 12612 160028 160024 188236 3391291392 3205482748 26
160024 179056 19032 160024 160024 188236 2776825856 3205467152 27
160024 172672 12648 160024 160024 188236 2162360320 3205483268 28
160024 176932 16908 160024 160024 188236 1547894784 3205488536 29
160024 172452 12428 160028 160024 188236 933429248 3205440636 30
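The Inst_jitter column is the value of interest in the output above. As a minimal sketch (not part of the CSIT tooling), the rows can be summarised as follows. The function name `summarise_jitter` and the four-row sample are illustrative assumptions; the sample rows are copied from the first four samples above.

```python
# Summarise the Inst_jitter column of `jitter` tool output (values in core cycles).
# SAMPLE_ROWS holds the first four data rows of the Haswell run shown above.
SAMPLE_ROWS = """\
Inst_Min Inst_Max Inst_jitter last_Exec Abs_min Abs_max tmp Interval Sample No
160024 172636 12612 160028 160024 172636 1573060608 3205463144 1
160024 188236 28212 160028 160024 188236 958595072 3205500844 2
160024 185676 25652 160028 160024 188236 344129536 3205485976 3
160024 172608 12584 160024 160024 188236 4024631296 3205472740 4
"""


def summarise_jitter(text):
    """Return (min, max, mean) of the Inst_jitter column."""
    jitter = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows consist of exactly nine integer columns;
        # the header and legend lines are skipped.
        if len(fields) == 9 and all(f.isdigit() for f in fields):
            jitter.append(int(fields[2]))
    if not jitter:
        raise ValueError("no data rows found")
    return min(jitter), max(jitter), sum(jitter) / len(jitter)


low, high, mean = summarise_jitter(SAMPLE_ROWS)
print(f"Inst_jitter cycles: min={low} max={high} mean={mean:.1f}")
```

Feeding the complete tool output to the same function gives the summary for a whole run, which makes it easier to compare servers than reading the rows by eye.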
Memory bandwidth¶
$ sudo /home/testuser/mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --bandwidth_matrix
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Numa node
Numa node 0 1
0 57935.5 30265.2
1 30284.6 58409.9
$ sudo /home/testuser/mlc --peak_injection_bandwidth
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --peak_injection_bandwidth
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 115762.2
3:1 Reads-Writes : 106242.2
2:1 Reads-Writes : 103031.8
1:1 Reads-Writes : 87943.7
Stream-triad like: 100048.4
$ sudo /home/testuser/mlc --max_bandwidth
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --max_bandwidth
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Maximum Memory Bandwidths for the system
Will take several minutes to complete as multiple injection rates will be tried to get the best bandwidth
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 115782.41
3:1 Reads-Writes : 105965.78
2:1 Reads-Writes : 103162.38
1:1 Reads-Writes : 88255.82
Stream-triad like: 105608.10
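The bandwidth matrix above shows that accessing memory on the other NUMA node delivers roughly half of the local bandwidth. A short sketch of that arithmetic, using the read-only matrix values from the `--bandwidth_matrix` run; the helper name `remote_to_local_ratio` is an illustrative assumption.

```python
# Read-only bandwidth matrix from `mlc --bandwidth_matrix`, in MB/sec.
# The diagonal entries are local (same NUMA node) bandwidth.
BANDWIDTH = [
    [57935.5, 30265.2],
    [30284.6, 58409.9],
]


def remote_to_local_ratio(matrix, node):
    """Ratio of the best off-diagonal bandwidth in row `node` to the diagonal entry."""
    local = matrix[node][node]
    remote = max(bw for other, bw in enumerate(matrix[node]) if other != node)
    return remote / local


for node in range(len(BANDWIDTH)):
    ratio = remote_to_local_ratio(BANDWIDTH, node)
    print(f"node {node}: remote/local = {ratio:.2f}")
```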
Memory latency¶
$ sudo /home/testuser/mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --latency_matrix
Using buffer size of 200.000MB
Measuring idle latencies (in ns)...
Numa node
Numa node 0 1
0 101.0 132.0
1 141.2 98.8
$ sudo /home/testuser/mlc --idle_latency
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --idle_latency
Using buffer size of 200.000MB
Each iteration took 227.2 core clocks ( 99.0 ns)
$ sudo /home/testuser/mlc --loaded_latency
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --loaded_latency
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 294.08 115841.6
00002 294.27 115851.5
00008 293.67 115821.8
00015 278.92 115587.5
00050 246.80 113991.2
00100 206.86 104508.1
00200 123.72 72873.6
00300 113.35 52641.1
00400 108.89 41078.9
00500 108.11 33699.1
00700 106.19 24878.0
01000 104.75 17948.1
01300 103.72 14089.0
01700 102.95 11013.6
02500 102.25 7756.3
03500 101.81 5749.3
05000 101.46 4230.4
09000 101.05 2641.4
20000 100.77 1542.5
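The `--idle_latency` output above reports the same interval both in core clocks and in nanoseconds, so the clock rate implied by the measurement can be derived from it. The sketch below does that and then converts a jitter value from cycles to microseconds. It assumes, without confirmation from the tools' documentation, that the jitter tool and MLC count cycles at the same rate; the function names are illustrative.

```python
def implied_ghz(core_clocks, nanoseconds):
    """Clock rate in GHz implied by a duration given in clocks and in ns."""
    return core_clocks / nanoseconds


def cycles_to_us(cycles, ghz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / ghz / 1000.0


# Haswell sample: "Each iteration took 227.2 core clocks ( 99.0 ns)"
ghz = implied_ghz(227.2, 99.0)
print(f"implied clock rate: {ghz:.3f} GHz")

# Inst_jitter of the first Haswell jitter sample (12612 cycles), converted
# under the same-clock assumption stated above.
print(f"jitter: {cycles_to_us(12612, ghz):.2f} us")
```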
L1/L2/LLC latency¶
$ sudo /home/testuser/mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --c2c_latency
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 42.1
Local Socket L2->L2 HITM latency 47.0
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
Reader Numa Node
Writer Numa Node 0 1
0 - 108.0
1 106.9 -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
Reader Numa Node
Writer Numa Node 0 1
0 - 107.7
1 106.6 -
Calibration Data - Skylake¶
The following sections include sample calibration data measured on the s11-t31-sut1 server running in one of the Intel Xeon Skylake testbeds, as specified in CSIT Testbeds: Xeon Skx, Arm, Atom.
Calibration data obtained from all other servers in the Skylake testbeds shows the same or similar values.
Linux cmdline¶
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.0-23-generic root=UUID=759ad671-ad46-441b-a75b-9f54e81837bb ro isolcpus=1-27,29-55,57-83,85-111 nohz_full=1-27,29-55,57-83,85-111 rcu_nocbs=1-27,29-55,57-83,85-111 numa_balancing=disable intel_pstate=disable intel_iommu=on iommu=pt nmi_watchdog=0 audit=0 nosoftlockup processor.max_cstate=1 intel_idle.max_cstate=1 hpet=disable tsc=reliable mce=off console=tty0 console=ttyS0,115200n8
Linux uname¶
$ uname -a
Linux s5-t22-sut1 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
System-level core jitter¶
$ sudo taskset -c 3 /home/testuser/pma_tools/jitter/jitter -i 20
Linux Jitter testing program version 1.8
Iterations=20
The pragram will execute a dummy function 80000 times
Display is updated every 20000 displayUpdate intervals
Timings are in CPU Core cycles
Inst_Min: Minimum Excution time during the display update interval(default is ~1 second)
Inst_Max: Maximum Excution time during the display update interval(default is ~1 second)
Inst_jitter: Jitter in the Excution time during rhe display update interval. This is the value of interest
last_Exec: The Excution time of last iteration just before the display update
Abs_Min: Absolute Minimum Excution time since the program started or statistics were reset
Abs_Max: Absolute Maximum Excution time since the program started or statistics were reset
tmp: Cumulative value calcualted by the dummy function
Interval: Time interval between the display updates in Core Cycles
Sample No: Sample number
Inst_Min Inst_Max Inst_jitter last_Exec Abs_min Abs_max tmp Interval Sample No
160022 171330 11308 160022 160022 171330 2538733568 3204142750 1
160022 167294 7272 160026 160022 171330 328335360 3203873548 2
160022 167560 7538 160026 160022 171330 2412904448 3203878736 3
160022 169000 8978 160024 160022 171330 202506240 3203864588 4
160022 166572 6550 160026 160022 171330 2287075328 3203866224 5
160022 167460 7438 160026 160022 171330 76677120 3203854632 6
160022 168134 8112 160024 160022 171330 2161246208 3203874674 7
160022 169094 9072 160022 160022 171330 4245815296 3203878798 8
160022 172460 12438 160024 160022 172460 2035417088 3204112010 9
160022 167862 7840 160030 160022 172460 4119986176 3203856800 10
160022 168398 8376 160024 160022 172460 1909587968 3203854192 11
160022 167548 7526 160024 160022 172460 3994157056 3203847442 12
160022 167562 7540 160026 160022 172460 1783758848 3203862936 13
160022 167604 7582 160024 160022 172460 3868327936 3203859346 14
160022 168262 8240 160024 160022 172460 1657929728 3203851120 15
160022 169700 9678 160024 160022 172460 3742498816 3203877690 16
160022 170476 10454 160026 160022 172460 1532100608 3204088480 17
160022 167798 7776 160024 160022 172460 3616669696 3203862072 18
160022 166540 6518 160024 160022 172460 1406271488 3203836904 19
160022 167516 7494 160024 160022 172460 3490840576 3203848120 20
Memory bandwidth¶
$ sudo /home/testuser/mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --bandwidth_matrix
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Numa node
Numa node 0 1
0 107947.7 50951.5
1 50834.6 108183.4
$ sudo /home/testuser/mlc --peak_injection_bandwidth
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --peak_injection_bandwidth
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 215733.9
3:1 Reads-Writes : 182141.9
2:1 Reads-Writes : 178615.7
1:1 Reads-Writes : 149911.3
Stream-triad like: 159533.6
$ sudo /home/testuser/mlc --max_bandwidth
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --max_bandwidth
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Maximum Memory Bandwidths for the system
Will take several minutes to complete as multiple injection rates will be tried to get the best bandwidth
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 216875.73
3:1 Reads-Writes : 182615.14
2:1 Reads-Writes : 178745.67
1:1 Reads-Writes : 149485.27
Stream-triad like: 180057.87
Memory latency¶
$ sudo /home/testuser/mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --latency_matrix
Using buffer size of 2000.000MB
Measuring idle latencies (in ns)...
Numa node
Numa node 0 1
0 81.4 131.1
1 131.1 81.3
$ sudo /home/testuser/mlc --idle_latency
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --idle_latency
Using buffer size of 2000.000MB
Each iteration took 202.0 core clocks ( 80.8 ns)
$ sudo /home/testuser/mlc --loaded_latency
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --loaded_latency
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 282.66 215712.8
00002 282.14 215757.4
00008 280.21 215868.1
00015 279.20 216313.2
00050 275.25 216643.0
00100 227.05 215075.0
00200 121.92 160242.9
00300 101.21 111587.4
00400 95.48 85019.7
00500 94.46 68717.3
00700 92.27 49742.2
01000 91.03 35264.8
01300 90.11 27396.3
01700 89.34 21178.7
02500 90.15 14672.8
03500 89.00 10715.7
05000 82.00 7788.2
09000 81.46 4684.0
20000 81.40 2541.9
L1/L2/LLC latency¶
$ sudo /home/testuser/mlc --c2c_latency
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --c2c_latency
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 53.7
Local Socket L2->L2 HITM latency 53.7
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
Reader Numa Node
Writer Numa Node 0 1
0 - 113.9
1 113.9 -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
Reader Numa Node
Writer Numa Node 0 1
0 - 177.9
1 177.6 -
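Calibration data is meant to serve as background when comparing results across different servers. A minimal sketch of such a comparison, using two values from the Haswell and Skylake samples above: node 0 local idle latency from `--latency_matrix` and the ALL Reads figure from `--peak_injection_bandwidth`. The dictionary layout and the `relative` helper are illustrative assumptions.

```python
# Values copied from the Haswell and Skylake calibration samples.
CALIBRATION = {
    "haswell": {"local_latency_ns": 101.0, "all_reads_mb_s": 115762.2},
    "skylake": {"local_latency_ns": 81.4, "all_reads_mb_s": 215733.9},
}


def relative(metric, baseline="haswell", other="skylake"):
    """Value of `metric` on `other` divided by its value on `baseline`."""
    return CALIBRATION[other][metric] / CALIBRATION[baseline][metric]


for metric in ("local_latency_ns", "all_reads_mb_s"):
    print(f"{metric}: skylake/haswell = {relative(metric):.2f}")
```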
SUT Settings - Linux¶
System provisioning is done by a combination of PXE boot unattended install and Ansible, as described in CSIT Testbed Setup.
Below is a subset of the running configuration:
- Xeon Haswell - Ubuntu 16.04.1 LTS
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
- Xeon Skylake - Ubuntu 18.04 LTS
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04 LTS
Release: 18.04
Codename: bionic
Linux Boot Parameters¶
- isolcpus=<cpu number>-<cpu number> - Used for all CPU cores apart from the first core of each socket. The isolated cores are used for running VPP worker threads and Qemu/LXC processes. https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
- intel_pstate=disable - [X86] Do not enable intel_pstate as the default scaling driver for the supported processors. The Intel P-State driver decides what P-state (CPU core power state) to use based on the policy requested by the cpufreq core. [X86 - Either 32-bit or 64-bit x86] https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt
- nohz_full=<cpu number>-<cpu number> - [KNL,BOOT] In kernels built with CONFIG_NO_HZ_FULL=y, set the specified list of CPUs whose tick will be stopped whenever possible. The boot CPU will be forced outside the range to maintain the timekeeping. The CPUs in this range must also be included in the rcu_nocbs= set. Specifies the adaptive-ticks CPU cores, causing the kernel to avoid sending scheduling-clock interrupts to the listed cores as long as they have a single runnable task. [KNL - Is a kernel start-up parameter, SMP - The kernel is an SMP kernel]. https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
- rcu_nocbs - [KNL] In kernels built with CONFIG_RCU_NOCB_CPU=y, set the specified list of CPUs to be no-callback CPUs, that never queue RCU callbacks (read-copy update). https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
- numa_balancing=disable - [KNL,X86] Disable automatic NUMA balancing.
- intel_iommu=on - [DMAR] Enable Intel IOMMU driver (DMAR) option.
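- iommu=pt - [x86, IA-64] Use IOMMU pass-through mode: devices used by the host kernel are identity-mapped, so DMA remapping applies only to devices assigned to guests or to userspace drivers.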
- nmi_watchdog=0 - [KNL,BUGS=X86] Debugging features for SMP kernels. Turn hardlockup detector in nmi_watchdog off.
- nosoftlockup - [KNL] Disable the soft-lockup detector.
- tsc=reliable - Disable clocksource stability checks for TSC. [x86] reliable: mark tsc clocksource as reliable, this disables clocksource verification at runtime, as well as the stability checks done at bootup. Used to enable high-resolution timer mode on older hardware, and in virtualized environment.
- hpet=disable - [X86-32,HPET] Disable HPET and use PIT instead.
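The list above notes that CPUs in nohz_full must also be included in rcu_nocbs, and that isolcpus covers every core except the first core of each socket. A minimal sketch that checks both properties for a boot command line follows. The command line is a trimmed copy of the Haswell one (root UUID and console options omitted), the total of 36 logical CPUs is an assumption inferred from the highest listed CPU number, and the parser handles only plain `a-b,c` lists, not the full kernel CPU list syntax.

```python
# Trimmed copy of the Haswell /proc/cmdline (root UUID and console options omitted).
CMDLINE = ("BOOT_IMAGE=/vmlinuz-4.4.0-72-generic ro isolcpus=1-17,19-35 "
           "nohz_full=1-17,19-35 rcu_nocbs=1-17,19-35 intel_pstate=disable")


def parse_cpu_list(spec):
    """Expand a simple CPU list such as '1-17,19-35' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def cmdline_params(cmdline):
    """Split a kernel command line into a {name: value} dict."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


params = cmdline_params(CMDLINE)
isolated = parse_cpu_list(params["isolcpus"])
nohz_full = parse_cpu_list(params["nohz_full"])
rcu_nocbs = parse_cpu_list(params["rcu_nocbs"])

# Assumption: 36 logical CPUs (0-35) on this server.
housekeeping = sorted(set(range(36)) - isolated)
print("housekeeping cores:", housekeeping)
print("nohz_full within rcu_nocbs:", nohz_full <= rcu_nocbs)
```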
Applied Boot Cmdline¶
- Xeon Haswell - Ubuntu 16.04.1 LTS
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.4.0-72-generic root=UUID=35ea11e4-e44f-4f67-8cbe-12f09c49ed90 ro isolcpus=1-17,19-35 nohz_full=1-17,19-35 rcu_nocbs=1-17,19-35 intel_pstate=disable console=tty0 console=ttyS0,115200n8
- Xeon Skylake - Ubuntu 18.04 LTS
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.0-23-generic root=UUID=3fa246fd-1b80-4361-bb90-f339a6bbed51 ro isolcpus=1-27,29-55,57-83,85-111 nohz_full=1-27,29-55,57-83,85-111 rcu_nocbs=1-27,29-55,57-83,85-111 numa_balancing=disable intel_pstate=disable intel_iommu=on iommu=pt nmi_watchdog=0 audit=0 nosoftlockup processor.max_cstate=1 intel_idle.max_cstate=1 hpet=disable tsc=reliable mce=off console=tty0 console=ttyS0,115200n8
Host IRQ Affinity¶
IRQs are pinned to core 0. The same configuration is applied in host Linux and guest VM.
$ for l in `ls /proc/irq`; do echo 1 | sudo tee /proc/irq/$l/smp_affinity; done
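The value written to smp_affinity is a hexadecimal CPU bitmask, so `echo 1` selects bit 0, that is core 0. A small sketch of the conversion in both directions; the helper names are illustrative, and the decoder accepts the comma-separated 32-bit groups that the kernel prints when the file is read.

```python
def cpus_to_mask(cpus):
    """Encode an iterable of CPU numbers as a hex bitmask string."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Decode a hex bitmask (commas between 32-bit groups allowed) to CPU numbers."""
    value = int(mask_hex.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(cpus_to_mask([0]))                   # mask for core 0 only
print(mask_to_cpus("00000000,00000001"))   # as read back from smp_affinity
```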
Host RCU Affinity¶
RCUs are pinned to core 0. The same configuration is applied in host Linux and guest VM.
$ for i in $(pgrep 'rcu[^c]'); do sudo taskset -pc 0 "$i"; done
Host Writeback Affinity¶
Writebacks are pinned to core 0. The same configuration is applied in host Linux and guest VM.
$ echo 1 | sudo tee /sys/bus/workqueue/devices/writeback/cpumask
DUT Settings - DPDK¶
DPDK Version¶
DPDK 18.05
DPDK Compile Parameters¶
make install T=x86_64-native-linuxapp-gcc -j
Testpmd Startup Configuration¶
Testpmd startup configuration changes per test case, with different settings for $$CORES and $$RXQ, and for the max-pkt-len parameter if the test sends jumbo frames. Startup command template:
testpmd -c $$CORE_MASK -n 4 -- --numa --nb-ports=2 --portmask=0x3 --nb-cores=$$CORES --max-pkt-len=9000 --txqflags=0 --forward-mode=io --rxq=$$RXQ --txq=$$TXQ --burst=64 --rxd=1024 --txd=1024 --disable-link-check --auto-start
L3FWD Startup Configuration¶
L3FWD startup configuration changes per test case, with different settings for $$CORES, and for the enable-jumbo parameter if the test sends jumbo frames. Startup command template:
l3fwd -l $$CORE_LIST -n 4 -- -P -L -p 0x3 --config='${port_config}' --enable-jumbo --max-pkt-len=9000 --eth-dest=0,${adj_mac0} --eth-dest=1,${adj_mac1} --parse-ptype
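The `--config` option of l3fwd takes a comma-separated list of `(port,queue,lcore)` tuples, which is what `${port_config}` expands to in the template above. The sketch below builds such a string. How CSIT actually maps queues to lcores is not stated here, so the one-lcore-per-queue assignment and the function name `l3fwd_port_config` are assumptions made for illustration.

```python
def l3fwd_port_config(num_ports, rxq_per_port, lcores):
    """Build an l3fwd --config value, assigning one lcore per (port, queue) pair."""
    pairs = [(port, queue)
             for port in range(num_ports)
             for queue in range(rxq_per_port)]
    if len(lcores) != len(pairs):
        raise ValueError(
            f"need {len(pairs)} lcores for {num_ports} ports x "
            f"{rxq_per_port} queues, got {len(lcores)}")
    return ",".join(f"({port},{queue},{lcore})"
                    for (port, queue), lcore in zip(pairs, lcores))


# Two ports with one RX queue each, served by lcores 1 and 2 (example values).
print(l3fwd_port_config(2, 1, [1, 2]))
```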
TG Settings - TRex¶
TG Version¶
TRex v2.35
DPDK version¶
DPDK v17.11
TG Build Script used¶
TG Startup Configuration¶
$ cat /etc/trex_cfg.yaml
- port_limit : 2
  version : 2
  interfaces : ["0000:0d:00.0","0000:0d:00.1"]
  port_info :
    - dest_mac : [0x3c,0xfd,0xfe,0x9c,0xee,0xf5]
      src_mac : [0x3c,0xfd,0xfe,0x9c,0xee,0xf4]
    - dest_mac : [0x3c,0xfd,0xfe,0x9c,0xee,0xf4]
      src_mac : [0x3c,0xfd,0xfe,0x9c,0xee,0xf5]
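In the sample configuration above, each port's dest_mac equals the other port's src_mac. A minimal sketch that formats the octet lists as conventional MAC strings and checks that pairing; the `PORT_INFO` structure simply mirrors the port_info entries, and the helper names are illustrative assumptions.

```python
# port_info entries copied from the sample /etc/trex_cfg.yaml.
PORT_INFO = [
    {"dest_mac": [0x3c, 0xfd, 0xfe, 0x9c, 0xee, 0xf5],
     "src_mac": [0x3c, 0xfd, 0xfe, 0x9c, 0xee, 0xf4]},
    {"dest_mac": [0x3c, 0xfd, 0xfe, 0x9c, 0xee, 0xf4],
     "src_mac": [0x3c, 0xfd, 0xfe, 0x9c, 0xee, 0xf5]},
]


def mac_str(octets):
    """Format a list of six octets as a colon-separated MAC address."""
    return ":".join(f"{octet:02x}" for octet in octets)


def ports_are_cross_addressed(port_info):
    """True if each of two ports sends to the other port's source MAC."""
    first, second = port_info
    return (first["dest_mac"] == second["src_mac"]
            and second["dest_mac"] == first["src_mac"])


for index, port in enumerate(PORT_INFO):
    print(f"port {index}: {mac_str(port['src_mac'])} -> {mac_str(port['dest_mac'])}")
print("cross-addressed:", ports_are_cross_addressed(PORT_INFO))
```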
TG Startup Command¶
$ sh -c 'cd <t-rex-install-dir>/scripts/ && sudo nohup ./t-rex-64 -i -c 7 --iom 0 > /tmp/trex.log 2>&1 &'> /dev/null