
Richard Sanger's blog




Started my summer scholarship this week. The goal is to complete parallel libtrace, particularly focusing on testing the live formats.

Brad set up the 10Gbit machines this week and I've had a play with Ostinato to generate traffic. It looks like the DPDK version of Ostinato can generate 64-byte packets at 10Gbit line rate, so it should make a suitable replacement for the Ixia traffic generator. Ostinato has a very similar interface to the one supplied by the Ixia, and we should be able to reach higher packet rates.

Fixed a couple of bugs in the DPDK format, particularly new ones introduced with the 1.7.1 library.

Discussed libtrace and DAG cards with Dan, who is working on dag as a natively parallel format.




This week the focus has moved to writing my honours report.

I've started from my mid-term report and am reusing some parts of the introductory chapters, which have changed little. Like a lot of honours students, the word count I use to track my progress is up at




I've moved all of the existing code across to the new combiner API, including tests. I'm currently working through removing obsolete options, some of which have never been used, and replacing them with a new method to set the combiner.




Updated the DPDK format in classic libtrace to support the latest library versions, as per Shane's request; we now have a fairly clean way of dealing with the differences between library versions. Also ported some other patches from my branch, such as support for multiple libtrace instances each running DPDK on different interfaces. Updated the documentation for the DPDK format and moved it to GitHub. Given that I have some 10Gbit machines on the way that I'll want to try with DPDK, this was a good format to get back into. Some of these changes I still need to pull into my own branch.

I've also been working on refactoring the combiner step in my parallel libtrace (between the perpkt threads and the reporter thread) to provide an API so users can supply their own. This removes the ability to call trace_get_results(), in favour of delivering results directly to the reporter function.
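The rough shape of the idea is a struct of callbacks that sits between the perpkt threads and the reporter. This is a minimal sketch only; the names, signatures and the `delivered` counter here are all hypothetical, not the actual parallel libtrace API:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical result type: a key (e.g. packet order) plus a payload. */
typedef struct result { uint64_t key; void *value; } result_t;

/* A combiner is a set of callbacks the library invokes; users can
 * supply their own to control how results reach the reporter. */
typedef struct combiner combiner_t;
struct combiner {
    /* Called by a perpkt thread to hand over a result. */
    void (*publish)(combiner_t *c, int thread_id, result_t *res);
    /* Called in the reporter thread to flush any buffered results. */
    void (*read)(combiner_t *c);
    void *data; /* combiner-private state, e.g. per-thread queues */
};

static size_t delivered; /* stands in for "handed to the reporter" */

/* Trivial "unordered" combiner: forward every result immediately,
 * with no attempt to restore ordering between threads. */
static void unordered_publish(combiner_t *c, int thread_id, result_t *res)
{
    (void)c; (void)thread_id; (void)res;
    delivered++; /* a real combiner would forward res to the reporter */
}
static void unordered_read(combiner_t *c) { (void)c; }

combiner_t unordered_combiner = { unordered_publish, unordered_read, NULL };
```

An ordered combiner would instead buffer results per thread inside `data` and release them from `read` in key order.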




I fixed the statistic counters in the Linux int: and ring: formats to print totals across all threads, not just the first.

Added support for passing the reporter method as an argument to trace_pstart.

Moved many of the new tuning options (such as buffer sizes) into a single structure that can be configured. Now that there are quite a few of these, a single structure makes more sense than a separate configuration option for each.
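Something along these lines, where the application tweaks only the fields it cares about before starting the trace. The struct and field names here are illustrative assumptions, not the real libtrace ones:

```c
#include <stddef.h>

/* Hypothetical sketch: gather the tuning knobs into one structure
 * rather than one configuration call per option. */
struct trace_thread_config {
    size_t perpkt_threads;    /* number of packet-processing threads */
    size_t packet_cache_size; /* packets kept by the memory recycler */
    size_t read_batch_size;   /* packets read per lock acquisition */
    size_t result_queue_size; /* buffered results per perpkt thread */
};

/* Sensible defaults; callers override individual fields afterwards. */
struct trace_thread_config trace_config_default(void)
{
    struct trace_thread_config c = { 4, 1024, 10, 256 };
    return c;
}
```

The advantage is that adding a new tuning knob later only grows the struct, rather than adding yet another option enum and setter.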

Fixed a rare timing bug: if the trace is using a hasher thread and is paused right as it's ending, it could deadlock.




Spent most of the week working on memory management, which I have now committed. Mostly spent time refactoring the code and fixing bugs.

I was told on Friday that I could purchase a 10Gig testbed, so I had a look at the Intel 10Gig NICs, mainly due to their compatibility with DPDK. The main downside is that they don't support timestamping every packet like the 1Gig cards do; instead they will only timestamp PTP packets.




I started the week working on the parallel/live formats, mainly ensuring that their statistic counters are correct. For this I stored copies of some statistics from the 'format' against the 'trace' itself; this handles the case where the format is paused or stopped before it is closed, such as when a live format is interrupted. Otherwise these counts would typically be lost, because the format is closed and loses its statistics. I also moved the accepted counter into each thread so that updating it doesn't require a lock. I still need to verify I'm grabbing the right statistics from the live formats, that is, summing each thread's counters where necessary.
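The per-thread accepted counter plus the saved copy combine at read time roughly like this. A minimal sketch with hypothetical names, not the actual libtrace structures:

```c
#include <stdint.h>
#include <stddef.h>

/* Each perpkt thread owns its counter, so incrementing needs no lock. */
struct perpkt_thread { uint64_t accepted; };

struct trace {
    struct perpkt_thread *threads;
    size_t nthreads;
    /* Copy saved against the trace when the format is paused/stopped,
     * so the count survives the format being closed. */
    uint64_t accepted_at_pause;
};

/* Statistics are only summed across threads when actually requested. */
uint64_t trace_accepted_packets(const struct trace *t)
{
    uint64_t total = t->accepted_at_pause;
    for (size_t i = 0; i < t->nthreads; i++)
        total += t->threads[i].accepted;
    return total;
}
```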

Later in the week I started to look at managing memory better and avoiding excessive calls to malloc and processing packets in batches. This is coming along well :).

I have added a batch size to reads from a single-threaded format, to reduce lock contention and improve performance. Accompanying this, I've finally got a nicer generic memory recycler, which is currently used for packets. This takes much of its design from DPDK's memory buffers, which include a thread-local cache to reduce accesses to the main pool. Unlike DPDK, however, this is all dynamic: any new thread will automagically create a thread-local cache with very little overhead, and it tidies itself up when the thread exits. This also allows the total number of packets to be limited, effectively putting a cap on memory usage; however, caution must be applied to ensure the pool doesn't run out.
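The thread-local caching idea can be sketched as follows. This is a simplified illustration modelled loosely on DPDK-style mempool caches, with hypothetical names, a fixed-size cache, and none of the dynamic per-thread setup/teardown described above:

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_SIZE 8 /* buffers kept per thread before touching the pool */

struct pool {
    pthread_mutex_t lock;
    void **items;            /* stack of free buffers */
    size_t count, capacity;  /* capacity caps total memory usage */
};

/* Per-thread cache: the fast path never takes the lock. */
static __thread void *cache[CACHE_SIZE];
static __thread size_t cache_count;

void *pool_alloc(struct pool *p)
{
    if (cache_count > 0)
        return cache[--cache_count];   /* fast path: no lock */
    pthread_mutex_lock(&p->lock);
    void *item = p->count ? p->items[--p->count] : NULL;
    pthread_mutex_unlock(&p->lock);
    return item;                       /* NULL if the pool ran dry */
}

void pool_free(struct pool *p, void *item)
{
    if (cache_count < CACHE_SIZE) {
        cache[cache_count++] = item;   /* fast path: no lock */
        return;
    }
    pthread_mutex_lock(&p->lock);
    if (p->count < p->capacity)
        p->items[p->count++] = item;   /* return to the shared pool */
    pthread_mutex_unlock(&p->lock);
}
```

In the real design the cache would be refilled and drained in bulk (not one buffer at a time), which is where most of the lock-contention savings come from.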

This is still a work in progress but is showing a significant improvement.




Finished the DPDK format's register and unregister thread functions, as I talked about in my last report. This means the parallel DPDK format should now be thread safe while still benefiting from thread-local memory caches. This whole thing is quite dependent on the version of the DPDK library in use, but I don't expect this part of DPDK to change. I still need to provide a public interface so users can register their own threads, but I'll deal with that when it next comes up.

Added storage for per-thread format data against the libtrace threads, and passed the thread to read_packet. This replaces an old thread lookup function which was unnecessary and slow.

Found and fixed a handful of bugs in my parallel implementation of int:.

Added some tests for the live Linux formats (int:, ring: and pcapint:) to vanilla libtrace; I will port these to parallel this week. These use network namespaces and virtual Ethernet adapters to simulate a direct connection between two ports. From this I found and fixed a bug with the wire length reported by int: when using TRACE_OPTION_SNAPLEN, and fixed a minor bug in pcapint:.

I will work on porting these to test my live parallel formats. I found that I will need to store format statistics against the trace when it ends (before it is paused), since a pause may lose this information. I will also look at batch-processing packets from single-threaded traces to reduce locking overhead.




Continued looking through the Intel DPDK code for the best way to implement the format. A thread id assigned by DPDK is needed for memory pools to work if caches are enabled. Each thread is assigned a number between 0 and 31, stored in TLS (the __thread storage class), and every memory pool contains a corresponding 32-element array of local caches. It would be nicer if every memory pool had its caches stored directly in TLS, but this is not possible since memory pools are dynamically created objects.
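The layout described above looks roughly like this. A simplified sketch only: real DPDK uses `RTE_MAX_LCORE` and `rte_lcore_id()`, and the names and the `set_lcore_id` helper here are my own illustration:

```c
#include <stddef.h>

#define MAX_LCORES 32 /* DPDK assigns each thread a number 0..31 */

/* The thread's id lives in TLS; -1 means "not registered with DPDK". */
static __thread int lcore_id = -1;

void set_lcore_id(int id) { lcore_id = id; }

struct cache { void *objs[64]; size_t len; };

/* Because pools are created dynamically, the per-thread caches cannot
 * live in TLS; instead each pool carries one cache slot per lcore id. */
struct mempool {
    struct cache local_cache[MAX_LCORES];
    /* ... plus the shared, lock-protected ring of free objects ... */
};

struct cache *mempool_get_cache(struct mempool *mp)
{
    if (lcore_id < 0 || lcore_id >= MAX_LCORES)
        return NULL; /* unregistered thread: must use the slow shared path */
    return &mp->local_cache[lcore_id];
}
```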

One option is to accept the performance hit and disable the caches (the cache memory would still be allocated). This seems like a very bad idea, because every packet read would then require a lock on the shared memory structure, whereas with caching the cache is refilled from main memory in bulk transactions whenever required. Instead it seems best (at least for the parallel implementation) to enable the cache.

This brings up the second problem with the DPDK system: threads are completely handled by DPDK, including starting them. Under this system, each physical core can have at most a single software thread associated with it. This is an issue because the main thread is used to schedule operations rather than read packets, meaning DPDK cannot be run with only one thread.

There are a couple of ways of dealing with this: simply accept it as a limitation, or create threads with the correct setup within libtrace.

So I'm working on solving all these problems by introducing a register thread function, so that we can register threads with the DPDK library; currently it seems this would just be a matter of setting the thread id and binding to a CPU core. This would allow caches to be enabled, the reducer thread to be used, and any other threads the user decides to add. Additionally, this would let us include user-created threads in the existing message passing system. In the future, any other format that requires TLS initialisation would also be able to use this system.
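A register thread function along these lines might look as follows. This is a hypothetical sketch, not the eventual libtrace API: the function name, the non-atomic slot table, and the affinity handling are all my own assumptions:

```c
#define _GNU_SOURCE /* for pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>

#define MAX_LCORES 32

static __thread int my_lcore = -1;           /* this thread's claimed id */
static unsigned char lcore_in_use[MAX_LCORES];

/* Claim a free DPDK-style lcore id for the calling thread and pin it
 * to a core, roughly what DPDK's own launched threads get.
 * Returns the claimed id, or -1 if all slots are taken. */
int register_thread(void)
{
    for (int i = 0; i < MAX_LCORES; i++) {
        if (!lcore_in_use[i]) {
            lcore_in_use[i] = 1;  /* a real version must do this atomically */
            my_lcore = i;
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((unsigned)i, &set);
            /* Bind to core i; failure on small machines is ignored here. */
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
            return i;
        }
    }
    return -1;
}
```

An unregister counterpart would clear the slot and the TLS id so the number can be reused by a later thread.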

These ideas of buffer handling and batch processing seem well worth experimenting with in my libtrace code's memory handling.




Added more tests to hit some edge cases, ensuring that things such as the hashing functions are applied correctly and that the single-threaded code path is sane. Part of this involved tidying up an edge case around pausing a trace when one or more threads have already finished by the time pause is called. Now pause only waits for threads that haven't finished, and finishing threads signal a condition variable to ensure their current states are visible.

I ran the code through valgrind and tidied up a minor memory leak. Removed the hasher thread and the call to the hashing function from the single-threaded code path. Removed some configuration options that are no longer needed.

Started looking into the DPDK format again. It looks like threads keep a local cache of free buffers for reuse, which I think is bound to an internal DPDK thread id. This could be an issue if threads that are not registered with DPDK try to free buffers; however, the caching can be turned off. It sounds like extra threads can reasonably safely be introduced as long as they each get a unique internal DPDK number.

I removed the DPDK restriction of one libtrace program per machine; it is now one per port, though a single running libtrace application is still limited to one port, either reading or writing (but not both).