Richard Sanger's blog

24 Feb 2015

This week I deprecated the various statistic functions trace_get_filtered/accepted/dropped/received and replaced them with a single function trace_get_statistics().

This allows all statistics to be retrieved in a single atomic snapshot, assuming the underlying format can provide this. I have also improved the documentation of what each statistic means. Internally I removed the unused get_captured_packets interface.
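
As a hypothetical sketch of how the consolidated call might be used (the exact signature and field names here are assumptions based on the counters mentioned above, not necessarily the final API):

```c
#include <libtrace.h>
#include <inttypes.h>
#include <stdio.h>

/* Sketch only: one call fills every counter from a single atomic
 * snapshot, so the counters are consistent with one another. */
void print_stats(libtrace_t *trace) {
    libtrace_stat_t stats;
    trace_get_statistics(trace, &stats);
    printf("received: %" PRIu64 "\n", stats.received);
    printf("accepted: %" PRIu64 "\n", stats.accepted);
    printf("filtered: %" PRIu64 "\n", stats.filtered);
    printf("dropped:  %" PRIu64 "\n", stats.dropped);
    printf("errors:   %" PRIu64 "\n", stats.errors);
}
```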

I updated ring: and int: to use the new statistics interface, and by reading the device statistics I managed to implement some extra counters. These include the number of packets filtered, which was previously unavailable, and the new packet errors counter.

I also updated the dag25: format to use the new interface, along with other minor fixes and testing of the DAG format.

17 Feb 2015

At the beginning of the week I worked on master libtrace, applying some important bug fixes for DPDK 1.7 which were already present in my branch, in preparation for a libtrace release. I also looked into DPDK 1.8 support and added this to master libtrace.

I finished refactoring the main loop code. The rest of the week I spent continuing Dan's work on splitting the int: and ring: formats into separate files. This branch also removed a lot of code duplication by making these formats and the DAG format stream based, removing most differences between the single-threaded and parallel frameworks. I ended up opting to split int: and ring: into three files: one for int:, one for ring:, and one for the common functions shared by both. This branch also included Dan's earlier work on the DAG format, which updated that format to support batches of packets.

I merged all the changes from master back into my branch, in preparation for the next release of libtrace which will include the parallel framework.

I looked into the statistic counter functions in int: and discovered their logic was not ideal in all but the simplest case: reading counts at the end of the trace. In a discussion with Shane we decided the best approach was to deprecate those functions and replace them with a single function that retrieves all statistics at once in an atomic manner. I will do this next week.
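
To illustrate the problem with the existing getters: each call samples its counter at a different instant, so mid-trace the values need not be consistent with one another.

```c
/* Mid-trace, these two reads happen at different instants: packets that
 * arrive between the calls may be reflected in 'dropped' but not in
 * 'accepted', so the two numbers describe slightly different points in
 * time and need not add up. */
uint64_t accepted = trace_get_accepted_packets(trace);
uint64_t dropped  = trace_get_dropped_packets(trace);
```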

I also looked at the possibility of giving a more realistic packet drop count for the Linux int: and ring: formats by getting the statistics from the card (via /proc/net/dev). This seemed to work flawlessly.
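
A minimal sketch of reading the receive drop counter from /proc/net/dev (error handling kept deliberately simple; the receive columns are bytes, packets, errs, drop, ...):

```c
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

/* Returns 0 and fills *drops with the rx drop count for ifname,
 * or -1 on failure. */
static int get_rx_drops(const char *ifname, uint64_t *drops) {
    FILE *f = fopen("/proc/net/dev", "r");
    char line[512];
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        char *colon = strchr(line, ':');
        if (!colon)
            continue; /* skip the two header lines */
        *colon = '\0';
        char *name = line;
        while (*name == ' ')
            name++; /* interface name is the token before the colon */
        if (strcmp(name, ifname) != 0)
            continue;
        uint64_t bytes, pkts, errs;
        if (sscanf(colon + 1, "%" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64,
                   &bytes, &pkts, &errs, drops) == 4) {
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}
```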

11 Feb 2015

Spent the week continuing to refactor code, in particular the packet loop.

I moved support for delaying packets (for tracetime playback) into this loop, since packets are now read in batches. I've reworked the code to allow messages to be received between packets while they are being delayed. When tracetime playback is not enabled, we do not check for messages between packets in a batch, for performance reasons.
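
An illustrative sketch of that loop (not the actual libtrace internals; the helper functions are hypothetical stand-ins for the real logic):

```c
#include <stdbool.h>

struct pkt; /* opaque; stands in for the real packet type */

/* Hypothetical helpers: */
bool time_to_deliver(struct pkt *p); /* has p's timestamp been reached? */
bool message_pending(void);
void handle_message(void);
void short_sleep(void);
void deliver_to_user(struct pkt *p);

void deliver_batch(struct pkt *batch[], int n, bool tracetime) {
    for (int i = 0; i < n; i++) {
        if (tracetime) {
            /* Wait until the packet's original timestamp, servicing
             * messages so requests such as pause are handled promptly
             * even while we are delaying. */
            while (!time_to_deliver(batch[i])) {
                if (message_pending())
                    handle_message();
                else
                    short_sleep();
            }
        }
        deliver_to_user(batch[i]);
    }
    /* Without tracetime, check for messages only once per batch to keep
     * the fast path cheap. */
    if (!tracetime && message_pending())
        handle_message();
}
```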

02 Feb 2015

Monday was a day off, and I spent most of Tuesday working on my slides and graphs as per Shane's suggestions for my NZNOG talk.

The rest of the week was spent attending NZNOG. The WAND presentation went well, though it ran slightly over time. There were many interesting talks and I enjoyed my time. I talked to a few people interested in my research after the presentation.

02 Feb 2015

Continued refactoring the pstart method and other related parts of the code.

I wrote slides for NZNOG for the practice presentation on Thursday. I got Dan to help run some last-minute results for the slides, though I was still missing one graph.

On the Friday I fixed my presentation as per the suggestions given after the practice presentation. I also worked on a Python Ostinato script to run tests and generate results automatically, and left this running through some situations over the weekend.

20 Jan 2015

Because DPDK timestamps are calculated for a batch of packets based upon the link speed, it is quite important to ensure that the link speed is correct and updated if it changes. As such, I registered a callback for link status changes with DPDK. This works well; however, the ixgbe DPDK driver has a large (1-3 second) delay before sending these notifications, to allow the status to settle, which seems excessive. I also found that ixgbe would not renegotiate connection speed (if it was changed) unless I stopped and started the interface again in DPDK; this is not a safe operation while reading from the format, so I did not include it. Ideally this should be fixed in the DPDK driver.
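
A sketch of registering for link status change (LSC) events against the DPDK 1.x API; what to do with the new speed is left as a comment, since the surrounding bookkeeping is libtrace-specific:

```c
#include <rte_ethdev.h>

/* Called by DPDK when the link status changes. */
static void lsc_callback(uint8_t port_id, enum rte_eth_event_type event,
                         void *cb_arg) {
    struct rte_eth_link link;
    (void) event; (void) cb_arg;
    rte_eth_link_get_nowait(port_id, &link); /* non-blocking status read */
    /* store link.link_speed somewhere the batch timestamping code can
     * see it */
}

/* At initialisation; LSC interrupts must be enabled beforehand via
 * rte_eth_conf.intr_conf.lsc = 1 when configuring the port. */
static void register_lsc(uint8_t port_id) {
    rte_eth_dev_callback_register(port_id, RTE_ETH_EVENT_INTR_LSC,
                                  lsc_callback, NULL);
}
```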

I've updated the int and ring formats to work again with the new batch read interface. I also fixed a handful of other minor bugs present in the parallel code.

I fixed a rare bug in parallel traceanon caused by unsafe data accesses. I also started refactoring the parallel code, adding proper error handling and documenting it.

07 Jan 2015

I finished up performance optimisations for the DPDK format for the time being. A libtrace application without any other load (tracestats with no filters) can capture almost 100% of 64-byte packets at 10Gbit on a single thread.
This is not possible via the parallel interface on a single thread, but can be reached easily with two threads.

As a summary of the last couple of weeks' work on DPDK, here is what has made the biggest impact:
* Rewriting the DPDK format to use batch reads; this has been built into the parallel interface and makes a huge difference.
* Rewrites of the libtrace parallel pipeline to avoid duplicating work.
* Introducing pauses after failed reads, which reduces load on memory and allows for faster capture rates.
* By default, CPU cores on the same NUMA node as the PCI card are preferred, which helped reach 100% capture.
* Timestamping is now done per batch, with packet spacing assumed to be at line rate (see the sketch after this list). This greatly reduced the number of gettimeofday/clock_gettime calls, which were slow.
* Simplified the layout of the DPDK packet by moving our additional header to sit straight after the DPDK header rather than being prepended to the start of the packet.
* Surprisingly, the NUMA configuration of mbuf memory seemed to make no difference, along with the number of memory channels you tell DPDK about. I'm not sure why.
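
As a sketch of the per-batch timestamping idea: take one clock reading for the whole batch and space earlier packets backwards at the wire rate. The struct, names, and the framing overhead constant are illustrative, not the libtrace implementation.

```c
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000ULL

struct pkt { uint32_t wire_len; uint64_t ts_ns; }; /* illustrative */

/* Time one packet occupies the wire: its bytes plus Ethernet preamble
 * and inter-frame gap (20 bytes), converted at the link speed in
 * bits per second. */
static uint64_t wire_time_ns(uint32_t wire_bytes, uint64_t link_bps) {
    return ((uint64_t)(wire_bytes + 20) * 8 * NS_PER_SEC) / link_bps;
}

/* One clock_gettime() per batch: the last packet gets "now" and each
 * earlier packet is assumed to have arrived back-to-back at line rate. */
static void stamp_batch(struct pkt *batch[], int n, uint64_t link_bps) {
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    uint64_t ts = (uint64_t)now.tv_sec * NS_PER_SEC + now.tv_nsec;
    for (int i = n - 1; i >= 0; i--) {
        batch[i]->ts_ns = ts;
        ts -= wire_time_ns(batch[i]->wire_len, link_bps);
    }
}
```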

10 Dec 2014

Continued work on the DPDK format this week. I ran into a few more issues with perf crashing the kernel, so I upgraded one machine to jessie with the 3.16 kernel. So far this has been working well, and it appears the developers have added support for more CPU counters.

I've removed some extra header information from the DPDK header which mainly existed for rt capabilities. A discussion with Shane and Dan resulted in the proposal that rt should be removed from libtrace itself and instead be supported via a libtrace application. For now rt might be broken; rt support will be reworked once the formats are working efficiently.

Most remaining performance issues with the DPDK format code appear to be due simply to running too many instructions per packet in the libtrace library, rather than an excessive number of cache/branch/TLB misses. I will continue to look at reducing this; for now, the easiest optimisation would be to reduce the number of calls to clock_gettime() to one per batch or fewer, essentially creating fake timestamps. What we currently do (calling clock_gettime() in a tight loop for a batch) is not any more accurate anyway. Of course, 1Gbit cards that support hardware timestamping of every packet will be able to get a more accurate timestamp; sadly, no Intel 10Gbit cards currently support this.

03 Dec 2014

Started work on batching packet processing for the DPDK format, or for that matter all parallel formats. Because I had previously allowed batching from single-threaded formats that are split by libtrace, this has not been too difficult to fit in.

However, performance was lacking, and I have been working on trimming the code down as much as possible and finding the bottlenecks. For this, perf has been very useful.

One large factor slowing performance was work being done on all packets in a batch at various stages in the pipeline, after they had already been kicked out of the CPU cache; so I have been moving all processing as close as possible to the point where the user reads the packet.
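
A contrived illustration of the difference (the per-packet stages and function names are hypothetical):

```c
void process_batch_eager(struct pkt *batch[], int n) {
    /* Cache-unfriendly: each stage makes a full pass over the batch, so a
     * later stage touches packets that have already been evicted. */
    for (int i = 0; i < n; i++) stamp_packet(batch[i]);
    for (int i = 0; i < n; i++) strip_header(batch[i]);
    for (int i = 0; i < n; i++) deliver_to_user(batch[i]);
}

void process_batch_lazy(struct pkt *batch[], int n) {
    /* Cache-friendly: do all per-packet work immediately before the user
     * sees the packet, while its data is still hot in cache. */
    for (int i = 0; i < n; i++) {
        stamp_packet(batch[i]);
        strip_header(batch[i]);
        deliver_to_user(batch[i]);
    }
}
```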

25 Nov 2014

Looking at the dpdk: format in more detail again this week trying to test/optimise its performance in libtrace.

Currently, with 64-byte packets at 10Gbit line rate, libtrace is unable to capture more than about 50% of the packets. This seems to be a bottleneck related to packet rate: doubling the packet size results in almost 100% capture.

I tested a simple application I had written directly on top of DPDK and saw similar performance. It appears that I need to batch-read packets from DPDK; otherwise, no matter how many threads/cores I add, performance caps out at about 50% capture. This will extend the batch support already used to reduce lock contention when reading from single-threaded formats, and will also be useful for other formats which support bulk reads. The intention currently is to hide all batching internally within libtrace, such that libtrace applications still receive a single packet at a time.
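
A sketch of batch reading from an already-configured DPDK port with rte_eth_rx_burst(); the burst size is arbitrary and process_packet() is a hypothetical stand-in for handing the packet to the application:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

void process_packet(struct rte_mbuf *m); /* hypothetical */

void read_loop(uint8_t port_id, uint16_t queue_id) {
    struct rte_mbuf *burst[BURST_SIZE];
    for (;;) {
        uint16_t nb = rte_eth_rx_burst(port_id, queue_id,
                                       burst, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++) {
            /* The batching stays hidden inside the library; the
             * application still sees one packet at a time. */
            process_packet(burst[i]);
            rte_pktmbuf_free(burst[i]);
        }
    }
}
```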

Using this simple program, I'm also looking into which settings are optimal, such as the processor socket used, memory allocation, and anything else DPDK lets you configure.