
Libtrace

Libtrace is a library for both capturing and processing packet traces. It supports a variety of common trace formats, including pcap, ERF, live DAG capture, native Linux and BSD sockets, TSH, and legacy ERF. Libtrace also supports reading and writing several compression formats, including gzip, bzip2 and lzo, and uses a multi-threaded approach to compressing and decompressing trace files to improve trace processing performance on multi-core CPUs.

The libtrace API provides functions for directly accessing the headers in a packet, up to and including the transport header.
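For example, a minimal program that counts the TCP packets in a (possibly compressed) trace might look like the following sketch; the trace URI is a placeholder and error handling is omitted for brevity.

    #include <stdio.h>
    #include <inttypes.h>
    #include <libtrace.h>

    int main(void) {
        /* The trace URI is a placeholder; libtrace detects gzip
         * compression from the file itself. */
        libtrace_t *trace = trace_create("pcapfile:example.pcap.gz");
        libtrace_packet_t *packet = trace_create_packet();
        uint64_t tcp_count = 0;

        trace_start(trace);
        while (trace_read_packet(trace, packet) > 0) {
            /* trace_get_tcp() returns NULL for non-TCP packets */
            if (trace_get_tcp(packet) != NULL)
                tcp_count++;
        }
        printf("TCP packets: %" PRIu64 "\n", tcp_count);

        trace_destroy_packet(packet);
        trace_destroy(trace);
        return 0;
    }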

Libtrace can also output packets using any supported output trace format, including pcap, ERF, DAG transmit and native sockets.

Libtrace is bundled with several tools for performing common trace processing and analysis tasks. These include tracesplit, tracemerge, traceanon, tracepktdump and tracereport (amongst others).

01 Jul 2015

Short week as I was on leave on Thursday and Friday.

Continued tweaking the event groups produced by netevmon. My main focus has been on ensuring that the start time for a group lines up with the start time of the earliest event in the group. When this doesn't happen, it suggests that there is an inconsistency in the logic for updating events and groups when a new detection is observed. The problem now happens only rarely -- good in the sense that I am making progress, but bad in that it takes a lot longer for a bad group to occur, so testing and debugging are much slower.

Spent a bit of time rewriting Yindong's python trace analysis using C++ and libflowmanager. My program was able to run much faster and use a lot less memory, which should mean that wraith won't be hosed for months while Yindong waits for his analysis to run.

Added a new API function to libtrace to strip VLAN and MPLS headers from packets. This makes the packets easier to analyse with BPF filters as you don't need to construct complicated filters to deal with the possible presence of VLAN tags that you don't care about.
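A rough sketch of how the new function would be used; I believe it is named trace_strip_packet(), but treat the name and the exact usage pattern here as assumptions:

    #include <stdint.h>
    #include <libtrace.h>

    /* Count port-80 TCP packets without writing VLAN-aware filter
     * variants. The function name trace_strip_packet() is my
     * assumption about the new API. */
    static uint64_t count_http(libtrace_t *trace, libtrace_packet_t *packet) {
        libtrace_filter_t *filter = trace_create_filter("tcp port 80");
        uint64_t matched = 0;

        while (trace_read_packet(trace, packet) > 0) {
            packet = trace_strip_packet(packet);  /* remove VLAN/MPLS headers */
            if (trace_apply_filter(filter, packet) > 0)
                matched++;
        }
        trace_destroy_filter(filter);
        return matched;
    }

The point is that a plain "tcp port 80" filter now matches regardless of whether the packet arrived with a VLAN tag in front of the IP header.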

Installed libtrace on the Endace probe and managed to get it happily processing packets from a virtual DAG without too much difficulty.

04 May 2015

Packet capture is commonly used by network operators to monitor the traffic that their users are producing, allowing operators to detect and monitor threats. Libtrace is a library which provides a simple programming interface for the capture and analysis of network packets.

This project aims to extend libtrace to support processing network packets in parallel, in order to improve libtrace's performance. We name the solution developed in this project parallel libtrace. An overview of parallel libtrace and the design process is presented, and the challenges encountered during the design are described in more detail. By designing a user-friendly and efficient library, we have been able to improve the performance of some libtrace applications.

Author(s): Richard Sanger

28 Apr 2015

Spent much of my week keeping an eye on BTM and dealing with new connections as they came online. Had a couple of false starts with the Wellington machine, as the management interface was up but was not allowing any inbound connections. This was finally sorted on Thursday night (turning the modem on and off again did the trick), so much of Friday was spent figuring out which Wellington connections were working and which were not.

A few of the BTM connections have a lot of difficulty running AMP tests to a few of the scheduled targets: AMP fails to resolve DNS properly for these targets, even though dig or ping gets the right results. Did some packet captures to see what was going on: it looks like the answer record appears in the wrong section of the response, and I guess libunbound doesn't deal with that too well. The problem seems to affect only connections using a specific brand of modem, so I suspect there is a bug in the DNS cache software on the modem.

Continued tracing my NNTSC live export problem. It soon became apparent that NNTSC itself was not the problem: instead, the client was not reading data from NNTSC, causing the receive window to fill up and preventing NNTSC from sending new data. A bit of profiling suggested that the HMM detector in netevmon was potentially the problem. After disabling that detector, I was able to keep things running over the long weekend without any problems.

Fixed a libwandio bug in the LZO writer. Turns out that the "compression" can sometimes result in a larger block than the original uncompressed data, especially when doing full payload capture. In that case, you are supposed to write out the original block instead but we were mistakenly writing the compressed block.
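The fix boils down to a guard along these lines. This is an illustrative sketch of the logic rather than the actual libwandio code; compress_block() and write_block() are hypothetical stand-ins:

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the real LZO compression call and the
     * block writer -- a sketch of the fixed logic, not the libwandio
     * source. */
    size_t compress_block(const void *in, size_t in_len, void *out);
    void write_block(FILE *f, const void *data, size_t len, size_t orig_len);

    static void write_lzo_block(FILE *f, const void *in, size_t in_len,
            void *scratch) {
        size_t clen = compress_block(in, in_len, scratch);

        if (clen >= in_len) {
            /* "Compression" grew the block (common with incompressible
             * full-payload capture): the LZO format requires storing
             * the original bytes, with stored length == original length. */
            write_block(f, in, in_len, in_len);
        } else {
            write_block(f, scratch, clen, in_len);
        }
    }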

08 Apr 2015

Continued hunting for the bug in the NNTSC live exporter, with mixed success. I've narrowed it down to the per-client queue, and it doesn't appear to be due to any obvious slowness when inserting into the queue. Unfortunately, the problem only occurs once or twice a day, so it takes a day before any changes or additional debugging take effect.

Went back to working on the Mechanical Turk app for event detection. Finally finished a tutorial that shows most of the basic event types and how to classify them properly. Got Brendon and Brad to run through the tutorial and tweaked it according to their feedback. The biggest problem is the length of the tutorial -- just running through it takes a decent chunk of our survey time, so I'm working on ways to speed it up a bit (as well as event classification in general). These include adding hot-keys for significance rating and using an imagemap to make the "start time" graph clickable.

Spent a decent chunk of my week trying to track down an obscure libtrace bug that affected a couple of 513 students, which would cause the threaded I/O to segfault whenever reading from the larger trace file. Replicating the bug proved quite difficult, as I didn't have much info about the systems they were working with. After going through a few VMs, I eventually figured out that the bug was specific to 32-bit little-endian architectures: due to some lazy #includes, the size of an off_t was either 4 or 8 bytes in different parts of the libwandio source code, which resulted in some very badly sized reads. Unfortunately, the bug was found and fixed a bit too late for the affected students.
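The failure mode is worth spelling out: on a 32-bit system, sizeof(off_t) depends on whether _FILE_OFFSET_BITS was defined to 64 before the system headers were included, so two source files with different #includes can disagree about the type's size. A contrived illustration (not the actual libwandio code):

    /* a.c -- defines _FILE_OFFSET_BITS before any system header */
    #define _FILE_OFFSET_BITS 64
    #include <sys/types.h>
    /* on 32-bit Linux, sizeof(off_t) == 8 in this file */

    /* b.c -- a "lazy" include with no define */
    #include <sys/types.h>
    /* on 32-bit Linux, sizeof(off_t) == 4 in this file, so any shared
     * struct containing an off_t has a different size and layout from
     * the one a.c was compiled against -- hence badly sized reads */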

30 Mar 2015

Continued developing code to group events by common AS path segments. Managed to add an "update tree" function to the suffix tree implementation I was using and then changed it to use ASNs rather than characters to reduce the number of comparisons required. Also developed code to query NNTSC for an AS path based on the source, destination and address family for a latency event, so all of the pieces are now in place.
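As a toy illustration of the underlying comparison (not the suffix tree itself), finding the longest common suffix of two AS paths, with paths treated as arrays of ASNs rather than strings of characters, looks something like this:

    #include <stddef.h>
    #include <stdint.h>

    /* Toy illustration: length of the longest common suffix of two AS
     * paths, each an array of ASNs ending at the destination AS. The
     * real code builds a suffix tree over many paths so that shared
     * segments are found without pairwise comparisons. */
    static size_t common_as_suffix(const uint32_t *a, size_t alen,
                                   const uint32_t *b, size_t blen) {
        size_t n = 0;
        while (n < alen && n < blen && a[alen - 1 - n] == b[blen - 1 - n])
            n++;
        return n;
    }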

In testing, I found a problem where live NNTSC exporting would occasionally fall several minutes behind the data that was being inserted into the database. Because this would only happen occasionally (and usually overnight), debugging the problem has taken a very long time. Found a potential cause in an unhandled EWOULDBLOCK on the client socket, so I've fixed that and am waiting to see whether that has resolved the problem.

Did some basic testing of libtrace 4 for Richard, mainly trying to build it on the various OSes that we currently support. This has created a whole bunch of extra work for him, due to the various ways in which pthreads are implemented on different systems. Wrote my first parallel libtrace program on Friday -- there was a bit of a learning curve, but I got it working in the end.
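For flavour, here is a minimal parallel libtrace program sketched from my understanding of the draft API; the callback-set function names are assumptions and may well change before the libtrace4 release:

    #include <libtrace_parallel.h>

    /* Per-packet callback, invoked concurrently by each processing
     * thread. Returning the packet hands it back to libtrace. */
    static libtrace_packet_t *per_packet(libtrace_t *trace,
            libtrace_thread_t *t, void *global, void *tls,
            libtrace_packet_t *packet) {
        /* ... per-packet analysis would go here ... */
        return packet;
    }

    int main(void) {
        libtrace_t *trace = trace_create("pcapfile:example.pcap.gz");
        libtrace_callback_set_t *cbs = trace_create_callback_set();

        trace_set_packet_cb(cbs, per_packet);
        trace_set_perpkt_threads(trace, 4);   /* four processing threads */

        trace_pstart(trace, NULL, cbs, NULL); /* start the threads */
        trace_join(trace);                    /* wait for completion */

        trace_destroy_callback_set(cbs);
        trace_destroy(trace);
        return 0;
    }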

16 Feb 2015

Tested and released a new libtrace version (3.0.22). The plan is that this will be the last release of libtrace3 (barring urgent bug fixes) and Richard and I can now focus on turning parallel libtrace into the first libtrace4 release.

Continued working on adding detector configuration to netevmon. Everything seems to be in place and working now, so will look to extend the configuration to cover other aspects of netevmon (e.g. eventing or anomalyfeed) in the near future.

Spent a bit of time on Friday refactoring some of Meena's eventing code to hopefully perform a bit better and avoid attempts to insert duplicate events into the database. The code currently catches the duplicate exception and continues on, but it would be much nicer if we weren't relying on the database to tell us that the event already exists.

10 Feb 2015

Libtrace 3.0.22 has been released today.

This is (hopefully) the final release of libtrace version 3, as we are now turning our attention to preparing to release libtrace 4 a.k.a. 'Parallel Libtrace'.

This release includes the following changes / fixes:
* Added protocol decoding support for GRE and VXLAN.
* DPDK format now supports 1.7.1 and 1.8.0 versions of DPDK.
* DAG format now supports DAG 5.2 libraries.
* Fixed degraded performance of the ring: format that was introduced in 3.0.21.
* DAG dropped packet count no longer includes packets observed while libtrace was not using the DAG card.
* Fixed bad PCI addressing in DPDK format.
* libwandio now reports an error when reading from a truncated gzip-compressed file, so it is now consistent with zlib-based tools.

The full list of changes in this release can be found in the libtrace ChangeLog.

You can download the new version of libtrace from the libtrace website.

09 Feb 2015

Generated some fresh DS probabilities based on the results of the new DistDiff detector. Turns out it isn't quite as good as I was originally hoping (credibility is around 56%) but we'll see how the additional detector pans out for us in the long run.

Started adding a proper configuration system to netevmon, so that we can easily enable and tweak the individual detectors via a config file (as opposed to the parameters all being hard-coded). I'm using libyaml to parse the config file itself, using Brendon's AMP schedule parsing code as an example.
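The basic libyaml event loop is short; this is a generic sketch of the pattern (with a placeholder filename), not netevmon's actual parser:

    #include <stdio.h>
    #include <yaml.h>

    /* Sketch of the standard libyaml event loop: prints every scalar
     * in the file. A real config parser would track mapping keys and
     * values as the events arrive. */
    int main(void) {
        FILE *fh = fopen("netevmon.conf", "r");   /* placeholder name */
        yaml_parser_t parser;
        yaml_event_t event;
        int done = 0;

        if (fh == NULL || !yaml_parser_initialize(&parser))
            return 1;
        yaml_parser_set_input_file(&parser, fh);

        while (!done) {
            if (!yaml_parser_parse(&parser, &event))
                break;                  /* parse error */
            if (event.type == YAML_SCALAR_EVENT)
                printf("%s\n", (char *)event.data.scalar.value);
            done = (event.type == YAML_STREAM_END_EVENT);
            yaml_event_delete(&event);
        }

        yaml_parser_delete(&parser);
        fclose(fh);
        return 0;
    }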

Spent a day looking into DUCK support for libtrace, since the newest major DAG library release had changed both the duckinf structure and the name and value of the ioctl to read it. When trying to test my changes, I found that we had broken DUCK in both libtrace and wdcap so managed to get all that working sensibly again. Whether this was worthwhile is a bit debatable, seeing as we don't really capture DUCK information anymore and nobody else had noticed this stuff was broken for months :)

03 Nov 2014

Continued the painful process of migrating my python prototype for mode detection over to C++ for inclusion in netevmon. Managed to get the embedded R portion working correctly, which should be the trickiest part.

Spent a bit of time with our new libtrace testbed, getting the DAG 7.5G2s configured and capturing correctly. Ran into some problems getting the card to steer packets captured on each interface into separate stream buffers, as the firmware we are currently running doesn't appear to support steering.

06 Oct 2014

Finished and submitted my PAM paper, after incorporating some feedback from Richard.

Fixed a minor libwandio bug where it was not giving any indication that a gzipped file was truncated early and content was missing.

Managed to get a new version of the amplet code from Brendon installed on my test amplet. Set up a full schedule of tests and found a few bugs that I reported back to the developer. By the end of the week, we were getting closer to having a full set of tests working properly -- just one or two outstanding bugs in the traceroute test.

Got netevmon running again on the test NNTSC. Noticed that we are getting a lot of false positives for the changepoint and mode detectors for test targets that are hosted on Akamai. This is because the series is fluctuating between two latency values and the detectors get confused as to which of the values is "normal" -- whenever it switches between them, we get an erroneous event. Added a new time series type to combat this: multimodal, where the series has 2 or 3 clear modes that it is always switching between. Multimodal series will not run the changepoint or mode detectors, but I hope to add a special multimode detector that alerts if a new and different mode appears (or an old mode disappears).

15 Sep 2014

Released a new version of libtrace on Tuesday that contains the most recent batch of bug fixes. Started moving the libtrace wiki from trac to github; only the tool pages are left to migrate.

Updated netevmon to support the new family-based streams in NNTSC. Since this new approach results in one time series per stream (as opposed to multiple streams having to be aggregated into each time series), this greatly simplified the anomalyfeed script. Added event detection for changes in AS paths which operates in much the same way as the old IP path event detection.

Started adding the ability to specify a subset of streams / collections for event detection in netevmon, rather than automatically running against all streams. The streams / collections of interest are provided via a config file and a SIGHUP will cause the file to be re-read and any necessary changes made. This also meant I had to add unsubscribe support to the NNTSC exporter, so that it would stop sending live updates for streams that had been removed from the config file.
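The SIGHUP handling follows the usual pattern: the signal handler only sets a flag, and the main loop re-reads the config file when it next notices the flag. A generic sketch (not the netevmon source):

    #include <signal.h>
    #include <string.h>

    /* Generic SIGHUP-reload pattern: the handler only sets a flag; the
     * main loop notices the flag, re-reads the config and diffs it
     * against the currently subscribed streams. */
    static volatile sig_atomic_t reload_pending = 0;

    static void handle_sighup(int sig) {
        (void)sig;
        reload_pending = 1;
    }

    static int install_sighup_handler(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handle_sighup;
        sa.sa_flags = SA_RESTART;   /* don't abort blocking reads */
        return sigaction(SIGHUP, &sa, NULL);
    }

    /* In the main loop:
     *     if (reload_pending) {
     *         reload_pending = 0;
     *         reread_config();    // then subscribe to new streams and
     *                             // unsubscribe from removed ones
     *     }
     */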

09 Sep 2014

Libtrace 3.0.21 has been released today.

This release fixes many bugs that have been reported by our users, including:
* trace_interrupt() now works properly for int, bpf, dag and ring formats.
* fixed double-counting of accepted packets when using the event API.
* fixed incorrect filtered packet counts for bpf format.
* fixed crash when performing very large reads with libwandio.
* fixed inconsistent behaviour if a bad filter string is used with int and dag formats.
* fixed potential infinite loop when combining filters, the event API and the pcapint format.
* fixed incorrect wire lengths when using SNAPLEN config option to truncate packets captured using the int format.

The full list of changes in this release can be found in the libtrace ChangeLog.

You can download the new version of libtrace from the libtrace website.

08 Sep 2014

Finished up a draft of the PAM paper, eventually managing to squeeze it into the 12 page limit.

Spent a bit of time learning about DPDK while investigating a build bug reported by someone trying to use libtrace's DPDK support. Turns out we were a little way behind current DPDK releases, but Richard S has managed to bring us more up-to-date over the past few days. Spent my Friday afternoon fixing up the last outstanding known issue in libtrace (trace_interrupt not working for most live formats) in preparation for a release in the next week or two.

11 Aug 2014

Added support for the new amp-tcpping test to ampy and amp-web.

Started on yet another major database schema change. This time, we're getting rid of address-based streams for amp collections and instead having one stream per address family per target. For example, instead of having an amp-icmp stream for every google address we observed, we'll just have two: one for IPv4 and one for IPv6.

This will hopefully result in some performance improvements. Firstly, we'll be doing a maximum of 2 inserts per test/source/dest combination, rather than anywhere up to 20 for some targets. We'll also have far fewer streams to search and process when starting up a NNTSC client. Finally, we should save a lot of time when querying for data, as almost all of our use cases were taking the old stream data and aggregating it based on address family anyway. Now our data is effectively pre-aggregated, and we will also have far fewer joins and unions across multiple tables.

By the end of the week, my test NNTSC was successfully collecting and storing data using this new schema. I also had ampy fetching data for amp-icmp and amp-tcpping, with amp-traceroute most of the way towards working. The main complexity with amp-traceroute is that we should be deploying Brendon's AS path traceroute next week, so I'm changing the rainbow graph to fetch AS path data and adding a method to query the IP path data that will support the monitor map graph that was implemented last summer.

Spent a day working on libtrace following some bug reports from Mike Schiffman at Farsight Security. Fixed some tricky bugs that popped up when using BPF filters with the event API.

Deployed the update-less version of NNTSC on skeptic finally. Unfortunately this initially made the performance even worse, as we were trying to keep the last timestamp cache up to date after every message. Changed it so that NNTSC only writes to the cache once every 5 mins of realtime, which seems to have solved the problem. In fact, we are now finally starting to (slowly) catch up on the message queue on skeptic.

15 Jul 2014

Released libtrace 3.0.20 on Monday.

Got most of our development environment up and running again after the power outage over the weekend. There are still a few non-critical things that need some assistance from Brad that I wasn't able to get going on Monday, but they can wait until next week when we're both here.

On leave from Tuesday for the rest of the week.

07 Jul 2014

Libtrace 3.0.20 has been released today.

This release fixes several bugs that have been reported by users, adds support for LZMA compression to libwandio and adds an API function for getting the fragment offset for an IP packet.

The bugs fixed in this release are:
* Fixed broken snaplen option for ring: input.
* Fixed trace_get_source_port and trace_get_destination_port returning bogus port numbers when given a fragmented packet.
* Fixed timestamp byte ordering on big endian architectures.
* Removed assert failure if a bad compression level or method is provided when configuring an output trace. A libtrace error is raised instead.
* Fixed broken compiler feature checking in configure script. Compiler features are also detected for compilers other than gcc, e.g. clang.
* Fixed potential segfaults in OSPF libpacketdump parser if the packet is truncated midway through the OSPF header.

The OSPF bug fix unfortunately resulted in the 'len' field in the libtrace_ospf_t structure being renamed to 'ospf_len' -- if you are using libtrace to process OSPF packets, please make sure you update your code accordingly.

The full list of changes in this release can be found in the libtrace ChangeLog.

You can download the new version of libtrace from the libtrace website.

07 Jul 2014

Carrying on from last week, storing a cache entry per stream turned out to be a bad idea. Some matrix meshes consist of hundreds of streams, so we spend a lot of time looking up cache entries. As a result, I rewrote the caching code to store one dictionary per collection, mapping stream ids to tuples containing the timestamps. This gets looked up once per query, so only one cache operation is required to generate a matrix.

Updating the cache when we have to query for missing values is a bit annoying: we cannot simply update the dictionary and put it back in the cache once the query is complete, because the data-inserting process may have updated other cache entries with new 'most recent data' timestamps while we were fulfilling our query. Instead, we have to re-fetch the dictionary, update the one stream we're changing and then immediately store the dictionary again.

Updated ampy to no longer keep track of active streams and removed support for ACTIVE_STREAMS queries from the NNTSC protocol.

Merged Perry's lzma support into libwandio. Started working towards a new libtrace release -- managed to build and pass all tests on our various development boxes so should be able to push out a release next week.

Spent a day reading over Meenakshee's thesis. Suggested a series of mostly minor edits and changes but overall it is looking pretty good.

30 Jun 2014

Found and fixed a large memory leak in netevmon that had caused prophet to run out of memory over the weekend. The problem was that I was allocating space for storing IPv6 address strings in the Traceroute detector but not freeing it properly if the address was already in our LRU. Also took the opportunity to make our memory use for Traceroute much more efficient, i.e. having a global hop LRU across all traceroute streams rather than one per stream, which was leading to a lot of duplication.

Started looking into our insertion speed problems. One obvious source of slowdowns is the UPDATE that we use to remember when we last inserted data for a stream. This update is being called once per measurement interval for each collection and becomes quite onerous when the streams table gets very large. Implemented a solution where the first and last insertion for each stream is stored in memcache instead of the database. If there is no entry in memcache when a query comes in for the stream, we can query the data table for that stream for min and max timestamp instead, although this is a slightly expensive operation.

Once I had that working, I removed the 'streams' table from NNTSC entirely, as it was no longer needed (each collection has its own stream table with specific details about each stream; the streams table was mainly for storing common properties across all collections, like lasttimestamp). This meant I had to remove or change all references to the streams table in the NNTSC database code, but it was otherwise straightforward.

Spent Friday fixing a bug in libtrace where trace_get_source_port and trace_get_destination_port would return bogus values if called on fragmented packets. Added a new API function for getting the fragment offset and the more-fragments flag from a packet. I needed this anyway for fixing the bug and, given the amount of bit-shifting, masking, multiplying and header parsing (for v6) involved, it will probably be useful to other people as well.
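For IPv4 the computation is short but easy to get wrong: the offset field is stored in 8-byte units alongside the flag bits. A sketch of the v4 case only (the real function also has to walk the IPv6 extension header chain):

    #include <stdint.h>
    #include <netinet/ip.h>
    #include <arpa/inet.h>

    /* Sketch of the IPv4 case only: the fragment offset is stored in
     * units of 8 bytes, IP_MF is the "more fragments" flag and
     * IP_OFFMASK masks off the flag bits. */
    static uint16_t ipv4_fragment_offset(const struct ip *ip, uint8_t *more) {
        uint16_t off = ntohs(ip->ip_off);
        *more = (off & IP_MF) ? 1 : 0;
        return (uint16_t)((off & IP_OFFMASK) * 8);  /* offset in bytes */
    }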

03 Jun 2014

Spent Mon-Wed on Jury service.

Continued fixing problems with gcc-isms in libtrace. Added proper checks for each of the various gcc attributes that we use in libtrace, e.g. 'pure', 'deprecated', 'unused'. Tested the changes on a variety of systems and they seem to be working as expected.
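The usual approach, and roughly what libtrace does, is to hide each attribute behind a macro that configure defines only if the compiler accepts it. The macro and configure-symbol names below are illustrative:

    /* Illustrative only -- the exact macro and configure-symbol names
     * in libtrace differ. Configure compiles a tiny test program with
     * each attribute and defines HAVE_ATTRIBUTE_PURE etc. on success. */
    #ifdef HAVE_ATTRIBUTE_PURE
    #  define SIMPLE_FUNCTION __attribute__((pure))
    #else
    #  define SIMPLE_FUNCTION
    #endif

    #ifdef HAVE_ATTRIBUTE_DEPRECATED
    #  define DEPRECATED __attribute__((deprecated))
    #else
    #  define DEPRECATED
    #endif

    /* Usage: uint16_t trace_get_ethertype(libtrace_packet_t *p) SIMPLE_FUNCTION; */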

Started testing the new ampy/amp-web on prophet. Found plenty of little bugs that needed fixing, but it now seems to be capable of drawing sensible graphs for most of the collections. Just a couple more to test, along with the matrix.

29 May 2014

Finished most of the ampy reimplementation. Implemented all of the remaining collections and documented everything that I hadn't done the previous week, including the external API. Added caching for stream->view and view->groups mappings, and added extra methods for querying aspects of the amp meta-data that I had forgotten about, e.g. site information and a list of available meshes.

Started re-working amp-web to use the new ampy API, tidying up a lot of the python side of amp-web as I went. In particular, I've removed a lot of web API functions that we don't use anymore and also broken the matrix handling code down into more manageable functions. Next job is to actually install and test the new ampy and amp-web.

Spent a decent chunk of time chasing down a libtrace bug on Mac OS X, which was proving difficult to replicate. Unfortunately, it turned out that I had already fixed the bug in libtrace 3.0.19 but the reporter didn't realise they were using 3.0.18 instead. Also received a patch to the libtrace build system to try and better support compilers other than gcc (e.g. clang), which prompted me to take a closer look at some of the gcc-isms in our build process. In the process, I found that our attempt to check whether -fvisibility is available was not working at all. Once I had replaced the configure check with something that works, the whole libtrace build broke because some function symbols were no longer being exported. Managed to get it all back working again late on Friday afternoon, but I'll need to make sure the new checks work properly on other systems, particularly FreeBSD 10 which only has clang by default.