Dan Collins's blog

16 Jan 2015

I spent this week working with the new DAG 10X2-S cards. I spent a bunch of time reading through the documentation and then testing various input buffer sizes at line rate to measure the capture percentage. I have a feeling that dagsnap (the Endace tool provided for quickly capturing packets) isn't designed to be efficient: I required a 2GB input buffer to capture 100% of 10G line rate with 64-byte packets.

I then got back onto libtrace with the DAG format. The latest DAG API is missing DUCK support (as far as I could tell), so I temporarily disabled it. It doesn't get called in my code, so I'll have to investigate whether we need to keep it. In contrast to dagsnap, libtrace needed only a 128MB input buffer to capture 100% of line-rate 64-byte packets.
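
For reference, the libtrace side is just the usual capture loop. A minimal sketch (the device path is an assumption, and the stream buffer size is configured through the Endace tools rather than in this code):

```c
#include <inttypes.h>
#include <libtrace.h>
#include <stdio.h>

int main(void)
{
    uint64_t count = 0;

    /* "dag:" selects the DAG format; the device path is an assumption. */
    libtrace_t *trace = trace_create("dag:/dev/dag0");
    if (trace_is_err(trace)) {
        trace_perror(trace, "trace_create");
        return 1;
    }
    if (trace_start(trace) == -1) {
        trace_perror(trace, "trace_start");
        return 1;
    }

    libtrace_packet_t *packet = trace_create_packet();
    while (trace_read_packet(trace, packet) > 0)
        count++;

    printf("captured %" PRIu64 " packets\n", count);
    trace_destroy_packet(packet);
    trace_destroy(trace);
    return 0;
}
```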

Next week I'll start playing with the enhanced packet processing to try to balance the input stream across multiple CPUs. This will let me test parallel libtrace at a 10G input rate. Richard tells me he isn't able to get 100% capture using 8 cores, so it will be interesting to see how well the DAG cards perform.

I also wrote a little patch for the DAG library to get it running on new kernels. This was a simple matter of replacing a deprecated macro, and the patched driver still supports kernels from 2.6 upwards. Brad submitted the patch to Endace in case they're interested.
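
For illustration, the pattern looks something like this (the macro and struct shown here are examples, not necessarily the ones from the patch):

```c
#include <linux/semaphore.h>
#include <linux/version.h>

/* Illustrative only: init_MUTEX() is one example of a macro removed from
 * newer kernels (gone in 2.6.37, replaced by sema_init()); the struct and
 * field here are made up. The actual patch was the same kind of one-line
 * replacement, guarded so that older 2.6 kernels still build. */
struct card_state {
	struct semaphore lock;
};

static void card_lock_init(struct card_state *card)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 37)
	sema_init(&card->lock, 1);
#else
	init_MUTEX(&card->lock);
#endif
}
```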

09 Jan 2015

Over the break my patch made it into the kernel, which is nice. I was going to characterise the performance of TPACKET_V2 vs TPACKET_V3, but Ostinato wasn't up to the task (rate control was either line rate, 6 Mpps, or anything specified under 1 Mpps). So I spent some time changing the way packets are sent from Ostinato. Richard pointed out that Intel cards can be hardware rate limited (at 1Mbit resolution), and this allowed me to add very fine-grained control over packet rates in Ostinato. The format Ostinato uses to move packets around was a little convoluted, but I managed to calculate how many packets to send, giving us the ability to send a specified number of packets at a specified rate. The accuracy will be determined when we get the 10Gbit DAG cards.
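
For the curious, the arithmetic behind driving a hardware rate limiter from a target packet rate is simple once you account for the per-frame overhead on the wire (a sketch, not the actual Ostinato code):

```c
#include <stdint.h>
#include <stdio.h>

/* Convert a target packet rate into the wire bit rate a hardware rate
 * limiter must be set to. Every Ethernet frame occupies 20 extra bytes
 * on the wire: 7 preamble + 1 start-of-frame + 12 inter-frame gap. */
static uint64_t pps_to_bps(uint64_t pps, uint32_t frame_len)
{
    return pps * (frame_len + 20) * 8;
}

int main(void)
{
    /* 64-byte frames at 10G line rate: 14,880,952 pps is ~10 Gbit/s. */
    printf("%llu bit/s\n", (unsigned long long)pps_to_bps(14880952, 64));
    return 0;
}
```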

Once that was sorted I started looking for other things to do on libtrace. I started looking into the Rijndael implementation to try to fix the compiler warnings. Richard came up with the idea of using unions to remove the need to cast between uint8 and uint32 pointers. I like this idea because it may also make it possible to upgrade the library to use uint64 instead, which would halve the number of operations needed, assuming the calculations can be parallelised in this way.
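
A rough sketch of the union idea (the names are mine, not the actual Rijndael code):

```c
#include <stdint.h>

/* Sketch of the union idea: the same 16-byte Rijndael state viewed as
 * bytes (for byte-wise steps like the S-box) or as 32-bit words (for
 * word-wise steps like the round-key XOR), with no pointer casting.
 * A uint64 upgrade would just add a uint64_t d[2] member. */
typedef union {
    uint8_t  b[16];
    uint32_t w[4];
} state_t;

/* Word-wise round-key XOR: four operations instead of sixteen. */
static void add_round_key(state_t *s, const uint32_t rk[4])
{
    for (int i = 0; i < 4; i++)
        s->w[i] ^= rk[i];
}
```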

Also on the to do list is pulling changes from the vanilla libtrace into the parallel libtrace for the DAG cards, and testing libtrace at 10 gig when the new DAG cards arrive.

05 Jan 2015

This report is for the last week of last year.

I spent the week tracking down the bug in TPACKET_V3. This involved tracing the code path through net/packet/af_packet.c and also figuring out how the poll function works.

Poll works by first testing whether any of the requested events have already happened. If they have, poll returns immediately. If not, poll puts the calling task to sleep (for at most the specified poll timeout). To break the sleep early, something needs to signal the kernel that an event has occurred; the kernel then wakes the task and poll rescans for events.
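
From userspace this is the familiar pattern (minimal sketch):

```c
#include <poll.h>

/* Minimal sketch: wait up to timeout_ms for fd to become readable.
 * poll() returns immediately if data is already waiting; otherwise the
 * calling task sleeps until the kernel signals an event or the timeout
 * expires. Returns >0 on an event, 0 on timeout, <0 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, timeout_ms);
}
```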

In TPACKET_V3, poll wasn't woken when a block expired. This meant that, while waiting for data to be received, all of the blocks could time out, leaving no space for the next packet. That packet would be dropped, and only then would the code wake up and clear all of the blocks. If you sent packets slowly enough, you would drop all of them.

I made the patch and submitted it to David Miller, who then forwarded it to Linus Torvalds. WAND helped me a lot with actually submitting the patch, as it was rejected several times due to formatting problems. It has been accepted and can now be seen in the kernel: https://github.com/torvalds/linux/commit/da413eec729dae5dcb150e2eb34c5e7...

17 Dec 2014

I did some benchmarking of TPACKET_V2 and TPACKET_V3 and found that TPV3 performs much the same as TPV2, but with considerably lower CPU usage (due to processing blocks rather than individual packets). With really low traffic rates, however, TPV3 was dropping packets. To replicate: send just two packets each to TPV3 and TPV2, and notice that TPV3 never sees them.

This is a kernel bug! TPACKET_V3 has an issue where, if the kernel marks a block as timed out, it doesn't notify any of the poll watchers. If the kernel marks every block as timed out, the next packet received on the interface will be dropped (I'm not sure whether this extends to multiple packets, but I think it does). I wasn't able to track this down during the week and plan to continue on Monday (this blog is written in retrospect).

05 Dec 2014

I gave up on the ordered combiner. I tried a number of different things I thought would optimise my ring buffer's performance, only to be beaten by the deque using the same sorting algorithm.

I put my attention back into the Linux ring buffer formats - both adding TPACKET_V3 and fixing support for TPACKET_V2 when using parallel libtrace. TPACKET_V3 works differently enough that it almost warrants a new format, although the code is much the same, just with different structures. We want to keep support for TPACKET_V2, falling back to it if TPACKET_V3 is not available. One suggestion has been to use memcpy to overwrite the functions for linux_ring, replacing them with linux_ringv3 functions if TPACKET_V3 support is present.
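
Something like the following sketch, where format_ops stands in for libtrace's real per-format function table:

```c
#include <string.h>

/* Sketch only: format_ops stands in for libtrace's per-format function
 * table (the real structure has many more members). */
struct format_ops {
    int (*start_input)(void *libtrace);
    int (*read_packet)(void *libtrace, void *packet);
};

static struct format_ops linux_ring_ops = { 0 };          /* TPACKET_V2 functions */
static const struct format_ops linux_ringv3_ops = { 0 };  /* TPACKET_V3 functions */

/* If TPACKET_V3 is available at runtime, overwrite the V2 table with the
 * V3 one; the rest of the library keeps calling through the same pointers. */
static void select_ring_version(int v3_supported)
{
    if (v3_supported)
        memcpy(&linux_ring_ops, &linux_ringv3_ops, sizeof(linux_ring_ops));
}
```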

Before implementing anything, I decided to take some time to benchmark the existing implementation. linux_ring is broken in parallel libtrace, so I only benchmarked the performance of single-threaded standard libtrace. Using Ostinato on a standard interface (not using DPDK) I was able to generate a maximum data rate of around 430Mbit/s with 64-byte packets. This turned out to be enough, as I was seeing around 40% packet loss. I kept turning down the data rate until packet loss was no longer seen, at around 340Mbit/s.

Next I started building a simple packet counter program that makes use of TPACKET_V3. This isn't finished yet, but I have started to understand the details of TPACKET_V3 versus TPACKET_V2.
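
The core of the counter looks something like this (a sketch with error handling omitted; the block geometry and timeout are arbitrary choices):

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

/* Sketch of a TPACKET_V3 packet counter. V3 hands userspace whole blocks,
 * so we count per block rather than walking individual frames. */
int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    int ver = TPACKET_V3;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

    struct tpacket_req3 req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 1 << 22;    /* 4MB blocks, 64 of them */
    req.tp_block_nr = 64;
    req.tp_frame_size = 2048;       /* still required, though V3 packs frames itself */
    req.tp_frame_nr = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    req.tp_retire_blk_tov = 60;     /* block timeout, in milliseconds */
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    uint8_t *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    uint64_t count = 0;
    for (unsigned block = 0;; block = (block + 1) % req.tp_block_nr) {
        struct tpacket_block_desc *desc = (struct tpacket_block_desc *)
            (ring + (size_t)block * req.tp_block_size);

        /* Sleep until the kernel hands this block to userspace. */
        while (!(desc->hdr.bh1.block_status & TP_STATUS_USER)) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
        }

        count += desc->hdr.bh1.num_pkts;                /* whole block at once */
        printf("%llu packets so far\n", (unsigned long long)count);
        desc->hdr.bh1.block_status = TP_STATUS_KERNEL;  /* return the block */
    }
}
```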

01 Dec 2014

I spent this week analysing the performance of the ordered combiner used to combine the output of multiple capture streams into one (for writing to a file, for example). Richard had written a deque based on a doubly linked list, which I assumed could be improved on with an array-backed ring buffer. However, I was unable to match the performance of the deque, for reasons unknown.

Optimising the ring buffer involved isolating and testing it to try to improve insertion and removal speed. Even with no dynamic memory allocation and no modulo operations, the ring buffer was slower than the doubly linked list.
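
For reference, the "no modulo" version relies on a power-of-two capacity so that the wrap-around becomes a single AND (single-threaded sketch):

```c
#include <stddef.h>
#include <stdint.h>

/* With a power-of-two capacity the wrap becomes (index & (size - 1)).
 * Indices grow monotonically; only the masked value touches the array. */
#define RB_SIZE 1024            /* must be a power of two */

struct ring {
    void    *slot[RB_SIZE];
    uint64_t head;              /* next insert position */
    uint64_t tail;              /* next remove position */
};

static int rb_push(struct ring *rb, void *item)
{
    if (rb->head - rb->tail == RB_SIZE)
        return 0;                               /* full */
    rb->slot[rb->head & (RB_SIZE - 1)] = item;  /* AND instead of % */
    rb->head++;
    return 1;
}

static void *rb_pop(struct ring *rb)
{
    if (rb->head == rb->tail)
        return NULL;                            /* empty */
    return rb->slot[rb->tail++ & (RB_SIZE - 1)];
}
```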

I also came up with the idea of using a heap to sort the entries, rather than using selection sort. This involves keeping track of the smallest entry still left in the input queues, to ensure we don't emit any sorted values larger than that entry. I'm not sure if I'll continue with this, as the existing solution is already very good.
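
A sketch of the idea, with timestamps standing in for packets:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the heap idea using a tiny binary min-heap keyed on packet
 * timestamps. In the real combiner the payload would be a packet; here
 * the timestamp alone stands in for it. */
#define HEAP_MAX 1024
static uint64_t heap[HEAP_MAX];
static int heap_len;

static void heap_push(uint64_t ts)
{
    int i = heap_len++;
    while (i > 0 && heap[(i - 1) / 2] > ts) {   /* sift up */
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = ts;
}

static uint64_t heap_pop(void)
{
    uint64_t top = heap[0], last = heap[--heap_len];
    int i = 0;
    for (;;) {                                  /* sift down */
        int child = 2 * i + 1;
        if (child >= heap_len)
            break;
        if (child + 1 < heap_len && heap[child + 1] < heap[child])
            child++;
        if (last <= heap[child])
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return top;
}

/* Emit entries in order, but only while the smallest buffered entry is
 * no larger than the smallest timestamp still waiting in an input queue;
 * otherwise a later arrival could break the ordering. */
static void drain(uint64_t smallest_pending)
{
    while (heap_len > 0 && heap[0] <= smallest_pending)
        printf("emit %llu\n", (unsigned long long)heap_pop());
}
```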

21 Nov 2014

This week I took what I learned the previous week and added parallel libtrace support for the DAG 7.5G2 card. This largely involved copying the functionality of the existing format into something thread-safe. Endace made this simple by providing a thread-safe mechanism for accessing the receive streams.
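
The per-stream calls look roughly like this (a sketch from memory of the DAG API; the flags and exact signatures should be checked against dagapi.h):

```c
#include <dagapi.h>
#include <stddef.h>
#include <stdint.h>

/* Rough sketch of per-stream DAG access. Receive streams are
 * even-numbered, and each can be attached, started and advanced
 * independently, which is what makes a thread-per-stream design
 * straightforward. */
static void read_stream(int stream)
{
    int fd = dag_open("/dev/dag0");     /* device path assumed */

    dag_attach_stream(fd, stream, 0, 0);
    dag_start_stream(fd, stream);

    uint8_t *bottom = NULL;
    for (;;) {
        /* Returns the top of the available window; everything in
         * [bottom, top) is ERF records ready to process. */
        uint8_t *top = dag_advance_stream(fd, stream, &bottom);
        if (top == NULL)
            break;
        /* ... process ERF records between bottom and top ... */
        bottom = top;
    }
}
```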

Once I got this working I moved on to look at adding support for TPACKET_V3 to the Linux ring format, which is supposed to bring good performance gains. For the time being it looks as though parallel support for TPACKET_V2 is incomplete, and I was keen to work on something else, so I have stopped working on this format upgrade.

Instead I'm looking into optimising the performance of the ordered combiner used to combine multiple streams into one, which involves buffering and sorting the inbound streams. I have some ideas on how I might improve performance, but I want to write some test cases first to benchmark the existing code. I plan to do this statically, to remove variations caused by things such as the kernel (dropped packets, delays, etc. would skew the results).

18 Nov 2014

This week I spent some time familiarising myself with the operation of libtrace and the DAG cards. My current goal is getting parallel libtrace support working with a DAG card, which has involved thinking about bidirectional capture (ensuring both directions of a TCP stream are filtered into the same receive stream).

The DAG 7.5G2 uses a DSM module to split incoming data into one of two receive streams. It turns out the filtering algorithm works well for all traffic that is not VLAN tagged. Richard S and I spent some time creating a bitwise filter that can take a VLAN-tagged frame and split it into the two streams while still preserving bidirectional flows. A Python script was written to create arbitrary configuration files for the DSM module to facilitate this split.
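
The property the filter has to provide is easy to state in C, even though the DSM expresses it as bitwise tests on raw header bytes (a sketch of the behaviour, not the filter itself). A VLAN tag shifts the IP header four bytes deeper into the frame, which is why the untagged rules need tagged variants testing the same fields at shifted offsets:

```c
#include <stdint.h>

/* Sketch of the property the DSM filter must provide: a flow maps to the
 * same receive stream in both directions, with or without a VLAN tag. */
static int pick_stream(uint32_t src_ip, uint32_t dst_ip)
{
    /* XOR is symmetric: swapping src and dst gives the same value, so
     * both directions of a flow land in the same receive stream. */
    return (src_ip ^ dst_ip) & 1;
}
```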

Some further research showed that the DSM is only available on the 7.5G2 and that all other cards have a more effective filtering algorithm. Rather than continue working on adding the configuration generation to libtrace, it was decided that the 7.5G2 will not support bidirectional flows where packets are VLAN tagged, while other DAG cards will.