Richard Sanger's blog




Spent this week working with the i.MX6 board. Looked into some limitations of the board and its gigabit Ethernet, and found out how to get a new kernel built and onto it.




Spent most of the week tidying, testing, and fixing the ndt Web10G code. Submitted the Web10G patch to the ndt developers.

I now have a NIC which supports network timestamping and I've started on getting that working. Got the drivers working; Linux is detecting the hardware timestamping capability. Working on a basic app which uses hardware timestamps to get an understanding of the API.




I now have a working version of ndt that uses web10g. Worked a bit on tidying up the new code.

I've been finding and fixing bugs in ndt for the rest of the week: both existing bugs present in ndt and bugs in my new web10g code. I've been submitting the existing bugs, along with patches, to the ndt developers.




Added detection and correction of an incorrect TCP congestion control algorithm into ndt. This is all done via the /proc/ interface. A server running ndt should be using Reno rather than CUBIC, which is often the default.

Moved my web10g code, which I had previously incorporated into ndt 3.6.4, into the latest svn version (which is where I've been making all these other changes). This isn't quite finished; I've still got some autotools configuration files to change and some compile errors/warnings to fix.




Continued on from last week comparing the web100 and web10g implementations of ndt on my kernel, as well as the ndt web100 server bundled with perfSONAR. This time with two directly connected machines on the emulation network, rather than across the red cables like I was previously.

I came to the conclusion that web100 and web10g are both giving ndt similar values which look to be accurate.

Did a bit of reading into Linux's implementation of TCP, and started trying to make improvements to ndt, beginning with warning about and correcting incorrect server setups. I've added detection and correction of packet coalescing settings. Coalescing should be off because packet pair timing is used to calculate the slowest link's speed, which relies on an accurate timestamp of each packet's arrival.
I still want to compare this to hardware timestamping, which should give even more accurate link speed detection while still allowing packet coalescing to be enabled.




More web10g stuff this week. Figured out the web10g library for the 3.2 kernel wasn't as complete as the one for the 3.5 kernel, so now I'm using that. Successfully modified ndt 3.6.4 to use web10g.

Built a web100 kernel to compare to my web10g version. Ran into issues with buffer bloat: TCP allocates a large receive buffer when ndt is doing throughput testing, resulting in 100ms+ latency for a local test which should be <1ms. Of course, this high ping that TCP observes only justifies the need for such a large buffer :(. Reducing the buffer size seemed to fix the latency issue. Still need to look into this more next week.

I also compared the latest web100 version of ndt, downloaded from their svn repository, to the older web100 3.6.4. The latest official release server was always timing out during client->server and server->client throughput testing and was unusable. The svn and 3.6.4 versions gave almost identical results, even when mixing the client and server versions. So it appears that modifying 3.6.4 to use web10g is still valid and there haven't been any major changes or bug fixes that affect the results.




Started looking at web10G this week, which aims to provide TCP information from the kernel to user processes, such as window size, RTT, etc. Built a 2.6 and a 3.2 kernel with the web10G patch; both seem to be working. Played with the sample applications, which just output this info.

The 2.6 kernel still seems to be using the /proc interface (like web100 did), which web10G is trying to move away from.

The 3.2 kernel is using netlink: special sockets used for inter-process communication. Currently their sample applications consist of a sender and a receiver (one to ask for stats about connection X and one to receive the stats). Started to try and put this all into one application. After this familiarisation I'll try to get the network diagnostic tester to use web10G and see if the results are better than what web100 gave.

Fixed a few bugs that Shane found in the libtrace PACKET_MMAP capturing; these are now committed into svn. Added a blog post with my PACKET_MMAP testing results.





In an attempt to increase the speed of libtrace's native Linux capture I've implemented PACKET_MMAP socket capturing. PACKET_MMAP consists of a ring buffer mapped into shared memory that both the kernel and a user program can access directly, which is ideally suited to packet capturing. It allows libtrace to check the status of a frame in the ring buffer without making a system call. This is also implemented as zero-copy within libtrace.

This should be included in the next version of libtrace as the 'ring:' URI input/output format. The current 'int:' URI will remain unchanged.

PACKET_MMAP is supported in recent Linux kernels, which provide both RX_RING (reading) and TX_RING (writing).


Ixia Traffic Generator sending a stream of packets across a 1 Gbit connection directly connected to the testing machine.

Running libtrace on a clean Debian install, Linux kernel 2.6.32-5-amd64.
Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, 4 GB RAM, Intel Corporation 82574L Gigabit Network Connection.


I used the libtrace tool tracestats to count packets and report the number dropped, and time to provide other useful statistics.
root@machine5:~# /usr/bin/time -v tracestats ring:eth1
This is ideal since tracestats only counts packets and does no additional processing. This gives a good benchmark for the libtrace library itself without the additional overhead caused by processing packets.

I also monitored the systems CPU usage average every 5 seconds using:
root@machine5:~# sar -u 5

Each test consisted of 20 seconds of data transmission. In this way CPU times are comparable even if tracestats runs longer in one test than another.
Since we have packets being sent for 20 seconds, I've taken the middle three 5 second intervals from sar to calculate the average CPU usage during packet capture.

Test 1:
Comparing the current capture method, recv() on a socket, to the new PACKET_MMAP method.
Traffic was generated at full 1 Gbit link speed and the packet size was varied from 64 bytes (1488095 packets/second) to 1518 bytes (81274 packets/second). A ring buffer of 128 frames was used (a frame being the space to hold a single captured packet), which is fairly small; if packet loss is an issue the buffer can easily be made 100 times that size.

Test 2:
Compare different PACKET_MMAP buffer sizes.
Traffic was generated at full 1 Gbit link speed and the packet size was randomly picked between 64 and 1518 bytes. This averaged 153940 packets/second.


Test 1:

Recv() graph
On the whole, when using PACKET_MMAP the CPU usage is a lot lower, leaving more CPU for processing the packets. The kernel itself started to lose packets (not even reporting them as dropped) at packet sizes of 150 bytes and below. Interesting things happen to PACKET_MMAP at the very small packet size of 64 bytes, no doubt caused by the extremely high packet rate.

Test 2:

PACKET_MMAP buffer sizes graph

Here 'int' refers to the current recv() capture method, while the other numbers refer to the PACKET_MMAP buffer size in frames. Anything over 8 frames seems to have almost no loss, at this traffic rate at least (153940 packets/second with random packet sizes). CPU usage is consistent across all buffer sizes.

I've attached all the graphs and the raw output from sar and tracestats.




More work on libtrace and PACKET_MMAP this week.

Got writing back to the interface working using the PACKET_MMAP method, a significant speed improvement: about 3x the speed of the current method.

Changed the buffer allocation for PACKET_MMAP to allow the maximum achievable buffer size (kernel dependent) and more efficient memory use.

Extended the existing code to use tpacket2_hdr rather than tpacket_hdr where available. Version 2 adds timestamping to the nanosecond, as opposed to v1 which was only to the nearest microsecond. It also fixes a major mistake in the v1 header: it wasn't the same size on 32- and 64-bit machines, which causes compatibility issues.

Had a play with the Ixia traffic generator. Ran some comparisons between PACKET_MMAP and the current capture method, as well as measuring the packet drop rate for different PACKET_MMAP buffer sizes.




Modified libtrace's native Linux capture to use the ring buffer system, PACKET_MMAP.
Using the ring buffer, libtrace can continuously read packets without making a kernel call until there are no more packets; since the buffer is shared between kernel and user memory, the kernel can continue to fill it at the same time libtrace reads from it. When the buffer is empty, libtrace asks the kernel to notify it when more data is available. This is faster at capturing packets than the traditional method currently used: asking the kernel to deliver each packet individually.

The ring system is outperforming the traditional method, resulting in less packet loss and less system CPU time.

I also started to add the ring buffer for writing back out to the interface. I'm not expecting to see much of a difference in performance compared to the current method.

Had a look at the Precision Time Protocol (PTP).