Libtrace is a library for both capturing and processing packet traces. It supports a variety of common trace formats, including pcap, ERF, live DAG capture, native Linux and BSD sockets, TSH and legacy ERF formats. Libtrace also supports reading and writing using several different compression formats, including gzip, bzip2 and lzo. Libtrace uses a multi-threaded approach for decompressing and compressing trace files to improve trace processing performance on multi-core CPUs.
The libtrace API provides functions for accessing the headers in a packet directly,
up to and including the transport header.
Libtrace can also output packets using any supported output trace format, including
pcap, ERF, DAG transmit and native sockets.
Libtrace is bundled with several tools for performing common trace processing and analysis tasks. These include tracesplit, tracemerge, traceanon, tracepktdump and tracereport (amongst others).
Spent a bit of time testing out some of Brendon's AMP tutorial instructions, making sure that everything so far is sane and no steps are missing. I anticipate there will be a lot more of this next week as the tutorial gets closer to a complete draft.
Continued working on verifying and fixing the auto-generated FSMs. Going over the entire set of generated FSMs from my test dataset threw up a number of bogus-looking machines, so I've been working on investigating and (when necessary) fixing the problems. I've also managed to get self-repeating states working correctly for the most part; there are just one or two edge cases that still need to be detected and handled properly. Re-implemented tagging the original call logs with the FSMs that were matched by subsequences within the call log. The current implementation is naive in that it assumes any state within a machine could be a start state, which is not going to scale well, so I need to come up with a way to infer potential start states (or at least rule out definite non-start states).
Re-worked libflowmanager to be usable in a parallel situation. Previously, the flow map was a global variable. Now, you can have multiple flow maps so you can have one per thread and use libtrace's bidirectional hashing to ensure that each flow corresponds to only one thread, and therefore only one flow map.
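The one-flow-map-per-thread arrangement only works if both directions of a flow hash to the same thread. The Python sketch below illustrates that property only; `bidirectional_hash` and `thread_for_packet` are hypothetical names, and this is not the hashing code used by libtrace or libflowmanager (both of which are C libraries).

```python
import zlib

def bidirectional_hash(src_ip, src_port, dst_ip, dst_port, proto):
    """Hash a 5-tuple so that both directions of a flow give the same value."""
    # Order the two endpoints so that A->B and B->A build the same key.
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{proto}".encode()
    return zlib.crc32(key)

def thread_for_packet(five_tuple, num_threads):
    """Pick the processing thread (and therefore the flow map) for a packet."""
    return bidirectional_hash(*five_tuple) % num_threads

# One flow map per thread. Each flow only ever lands in one of them,
# so the maps never need to be shared between threads.
NUM_THREADS = 4
flow_maps = [dict() for _ in range(NUM_THREADS)]

forward = ("10.0.0.1", 1234, "10.0.0.2", 80, 6)
reverse = ("10.0.0.2", 80, "10.0.0.1", 1234, 6)
same_thread = thread_for_packet(forward, NUM_THREADS) == thread_for_packet(reverse, NUM_THREADS)
```

Here `same_thread` is true by construction, because the endpoints are sorted before hashing.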
Started experimenting with using parallel libtrace with libprotoident applications. I soon ran into a bug where using the built-in hasher thread to distribute packets could cause a deadlock, so spent most of Friday trying to track this down.
Finished up the libtrace4 and wandio releases and pushed them out.
Installed a mock version of skeptic on an openstack VM to test how InfluxDB copes with the full public AMP dataset. In general, InfluxDB seems to be coping OK when inserting / browsing data but the memory requirements of anomaly_ts are a bit larger than I would like so that's an avenue to chase up in the near future.
Continued implementing syscall FSMs manually to find out about other cases we need to consider when trying to automate the process. Added the ability to express a state as another FSM so we can build more complex machines from the smaller ones. Documented the code and put it into bitbucket so other people can start working with it.
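As a rough illustration of expressing a state as another FSM, the sketch below treats a machine as a linear list of steps, where each step is either a syscall name or a nested machine. The `Machine` class and the example syscall sequences are hypothetical simplifications; the real FSMs are not restricted to linear sequences.

```python
class Machine:
    """A linear FSM whose steps are syscall names or nested Machines."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = steps

    def flatten(self):
        """Expand nested machines into a flat list of syscall names."""
        out = []
        for step in self.steps:
            if isinstance(step, Machine):
                out.extend(step.flatten())
            else:
                out.append(step)
        return out

    def matches(self, calls):
        """True if the observed calls are exactly this machine's sequence."""
        return list(calls) == self.flatten()

# A small machine reused as a single state inside a larger one.
read_file = Machine("read_file", ["open", "read", "close"])
copy_file = Machine("copy_file", [read_file, "open", "write", "close"])
```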
Also started trying to use the FSMs on another dataset that Alan had collected. Turns out this dataset had a bunch of new syscalls that my previous parser hadn't seen before so it required a bit of updating.
Libtrace 4.0.0 is now out of beta and considered ready for general release.
We've fixed quite a few bugs over the course of the beta. More details can be found on the ChangeLog page on the libtrace wiki. However, while we're no longer in beta, there may still be a few bugs out there -- don't hesitate to report any problems you find to us at contact [at] wand [dot] net [dot] nz.
Another major change since the beta release is that we've re-licensed libtrace and libpacketdump to be under the LGPL v3 (rather than the GPL v2). Hopefully this will encourage people who were turned off by the restrictions of the GPL to now adopt libtrace for their packet capture and analysis needs.
This version of libtrace includes an all new API that resulted from Richard Sanger's Parallel Libtrace project, which aimed to add the ability to read and process packets in parallel to libtrace. Libtrace can now also better leverage any native parallelism in the packet source, e.g. multiple streams on DAG, DPDK pipelines or packet fanout on Linux interfaces.
Please note that the old libtrace 3 API is still entirely intact and will continue to be supported and maintained throughout the lifetime of libtrace 4. All of your old libtrace 3 programs should still build and run happily against libtrace 4; please let us know if this turns out to not be the case so we can fix it!
Learn about the new API and how parallel libtrace works by reading the Parallel Libtrace HOWTO.
Download the new release from the libtrace website.
Released new versions of libprotoident and libflowmanager with the new LGPL licensing. Also re-licensed and tested potential libtrace and wandio releases but haven't quite got to the stage where I want to push out the releases just yet.
Continued messing around with deriving FSMs from common system call patterns and turning them into runnable code. I've got 8 FSMs drawn up and have implemented 5 of them. Developed a bit of backend for applying my FSMs to the log data so that I can implement new FSMs with the least amount of coding possible (e.g. common actions like checking fd consistency and making sure parameters match expected values are all done within a parent FSM class and the child classes just list the relevant data to compare against). Hopefully this will help move towards automated generation of the FSM code.
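A minimal sketch of the parent/child split described above: the parent class does the fd consistency and parameter checks, and a child class only lists the data to compare against. The class names, the call-record layout (a dict with `name`, `fd` and parameter keys) and the example machine are all assumptions for illustration, not the code in the repository.

```python
class SyscallFSM:
    """Parent class: shared checks live here, children only list transitions."""

    # Each entry is (syscall name, dict of expected parameter values).
    transitions = []

    def __init__(self):
        self.state = 0   # index of the next expected transition
        self.fd = None   # fd seen on the first accepted call

    @property
    def done(self):
        return self.state == len(self.transitions)

    def feed(self, call):
        """Return True if the call was accepted as the next transition."""
        if self.done:
            return False
        name, expected = self.transitions[self.state]
        if call["name"] != name:
            return False
        # fd consistency: later calls must use the fd from the first call.
        if self.fd is not None and call.get("fd") != self.fd:
            return False
        # Parameter check: every listed parameter must match exactly.
        if any(call.get(key) != value for key, value in expected.items()):
            return False
        if self.fd is None:
            self.fd = call.get("fd")
        self.state += 1
        return True

class ReadFileFSM(SyscallFSM):
    """Child class: just the data to compare against."""
    transitions = [("open", {"flags": "O_RDONLY"}), ("read", {}), ("close", {})]
```

Feeding an `open`, `read` and `close` that share an fd drives `ReadFileFSM` to `done`; a `read` on a different fd is rejected and leaves the state unchanged.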
Had a few meetings where we discussed the FSM approach (and RA3 in general) with a few of the industry partners and they seem reasonably pleased with what we are trying to achieve so that's reassuring.
Helped Brendon try to debug some issues with data not appearing on graphs on the recently updated deployment. As a result of this, we've realised we need to re-think how we are storing and presenting traceroute data so that we can avoid these problems in the future.
Finished up the first release version of the event filtering for amp-web and rolled it out to lamp on Thursday morning. Most of this week's work was polishing up some of the rough edges and making sure the UI behaves in a reasonable fashion -- Brad was very helpful playing the role of an average user and finding bad behaviour.
Post-release, tracked down and fixed the issue that was causing netevmon to not run the loss detector. Added support for loss events to eventing and the dashboard.
Released a new version of libprotoident, which includes all of my recent additions from the unexpected traffic study.
Marked the last libtrace assignment and pushed out the marks to the students.
Marked the 513 libtrace assignments. Some students performed very well and I was glad to see that the investigative task proved to be very doable.
Started working on adding the ability to filter events and event groups on the amp-web dashboard. Most of my effort so far has been in producing a mock-up of the interface, which I showed to Nathan and Chris on Thursday afternoon. Started replacing some hard-coded filtering settings with a dynamic template that uses user preferences stored in a database on Friday.
Fixed a few little netevmon issues that cropped up when trying to restart netevmon on prophet prior to starting work on the dashboard filtering, mostly in relation to ensuring that the 'purge event database' option works sensibly.
Started writing up a short paper on the unexpected traffic analysis I've been doing for the past few weeks. Made decent progress -- I've got a mostly complete draft, just missing a conclusion and an abstract.
Spent a decent chunk of Thursday dealing with the fallout from upgrading influxdb to 0.11 on prophet. This broke most of our existing rollup tables, as the data type that we were now inserting (int) was no longer compatible with the data type that we apparently used to insert (float). Compounding matters was influxdb's lack of visibility into what data types are associated with any given column. Ended up trashing and re-creating the database (somewhat by accident) which fixed the problem, but not an ideal solution if we ever roll this out in production.
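One possible way to avoid this kind of int/float mismatch is to normalise numeric values to a single type before they are written. The sketch below is only an illustration of that idea; `coerce_fields` is a hypothetical helper, the field names are made up, and it makes no InfluxDB calls.

```python
def coerce_fields(fields):
    """Return a copy of fields with integer values converted to floats."""
    out = {}
    for key, value in fields.items():
        # bool is a subclass of int in Python, so leave booleans alone.
        if isinstance(value, int) and not isinstance(value, bool):
            out[key] = float(value)
        else:
            out[key] = value
    return out

# Hypothetical row of rolled-up values with mixed numeric types.
row = coerce_fields({"rtt": 12, "loss": 0.5, "ok": True, "host": "example"})
```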
513 assignment was due at 5pm on Friday, so dealt with a few final queries from students. 20 submissions in the end, so a bit of marking to do next week.
Continued working away at the Unknown traffic from my libprotoident port study. Added new protocols for Telegram Messenger and Kuguo, as well as improved DNS (especially TCP DNS) and NTP matching. I still have a bit more Unknown traffic to identify before I'd be comfortable putting the results in a paper, but we're getting closer.
Gave my 513 lectures this week. Looking forward to seeing how the class get on with my assignment.
Met with Ryan Jones who is doing an Honours project that will use netevmon to try and find events in the CSC data. Gave him access to the code and a few hints to start out, but I imagine I'll have to dedicate some more time to this over the course of the year.
Continued prepping for the trip to San Diego. Wrote a talk on AMP to present at AIMS, since Brendon won't be able to attend. Managed to finally settle on a project that I'll be working on at the BGP hackathon: adding useful filtering to the BGPstream software.
Met with Harris to talk about the CSC dataset and how he can go about looking for interesting events in the dataset. Wrote some example code to extract a metric from the data (syscalls per second for each major type) and added a module to NNTSC for the new metric. Hopefully Harris will be able to use that to start adding his own metrics.
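The metric itself is simple to express. A minimal sketch, assuming each record has already been parsed into a `(timestamp, syscall_type)` pair (the real CSC data format and the NNTSC module interface are not shown here, and `syscalls_per_second` is a hypothetical name):

```python
from collections import Counter

def syscalls_per_second(events):
    """Count events per (whole second, syscall type).

    events: iterable of (timestamp_in_seconds, syscall_type) pairs.
    """
    counts = Counter()
    for timestamp, syscall_type in events:
        counts[(int(timestamp), syscall_type)] += 1
    return counts

# Made-up sample input, purely to show the shape of the result.
sample = [(100.1, "open"), (100.7, "open"), (100.9, "read"), (101.2, "open")]
per_second = syscalls_per_second(sample)
```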
Noticed that there was a lot of variation in my rtstats performance test results. The achievable packet rate for ring: seems to fluctuate from test to test, but remains constant within a test. For example, in one test run I'll get 1 million packets per second consistently for 60 seconds, and in the next test run I'll get 40,000 packets per second for 60 seconds. Spent a lot of time looking into this further (including an afternoon with Richard S. trying out various things), but we're still unable to account for this inconsistency.
Ran some full experiments with the stats and rtstats workloads (using ring: to capture) to make sure the numbers match up with what Richard was seeing in his earlier tests. So far, we're getting the expected behaviour: adding more threads makes stats perform worse (due to threading overhead outweighing any performance gain), but helps with rtstats. Wrote the section in the paper that describes our evaluation methodology, so now I just need to fill it in with some results!
Wrote my talk on NNTSC for AIMS. It's a 10 minute talk that is meant to provoke some discussion, so it is pretty light on implementation details. At least I hope people will come away from the talk knowing that there are some battles in this space that we've already fought so they won't repeat our mistakes.
Helped Andy get started with NNTSC so he can try implementing some InfluxDB support for storing data. The idea so far is to keep postgres around for doing the things it does well (streams, traceroute data) and use Influx for the rest.
Tracked down a segfault in the Ostinato drone whenever I tried to halt packet generation on the DPDK interfaces. This took a lot longer than it normally would have, since valgrind doesn't work too well with DPDK and there are about 10 threads active when the problem occurs. It eventually proved to be a simple case of a '<=' being used instead of a '<', but that was enough to corrupt the return pointer for the function that was running at the time, causing the segfault.
Once I fixed that, I was able to write some scripts to orchestrate sending packets at specific rates for a period of time, while having a libtrace program running at the other end of the link trying to capture and process these packets. Once the packet generation is over, the libtrace program is halted. This will form the basis of my experiments to determine how much traffic we can capture and process with parallel libtrace. The experiments will use different capture methods (ring, DAG, DPDK, PF_RING etc.), different packet rates, different numbers of processing threads (from 1 to 16) and different workloads ranging from just counting packets to cryptopan anonymisation.
My initial tests have shown that the numbers of dropped packets are not particularly consistent across captures with otherwise identical parameters, so I'll have to run each experiment multiple times so that I can get some more statistically valid results.
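Summarising repeated runs could look something like the sketch below, which reports the mean and sample standard deviation of the dropped-packet counts for one parameter combination. The function name is hypothetical and the numbers in the example are placeholders, not measured results.

```python
import statistics

def summarise_drops(drop_counts):
    """Summarise dropped-packet counts from repeated runs of one experiment."""
    return {
        "runs": len(drop_counts),
        "mean": statistics.mean(drop_counts),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(drop_counts) if len(drop_counts) > 1 else 0.0,
        "min": min(drop_counts),
        "max": max(drop_counts),
    }

# Placeholder values for three runs with identical parameters.
summary = summarise_drops([10, 20, 30])
```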
Also spent a bit of time helping Brendon capture some traces of his ICMP packets to help figure out whether his timing issues are network-based or host-based.
Added a new graph type to the AMP website for showing loss as a percentage over time. This graph is now shown when clicking on a cell in the loss matrix, and can also be accessed through the graph browser. Fixed a complaint regarding the matrix where clicking on a cell in an IPv4 only matrix would take you to a graph showing lines for both IPv4 and IPv6 so you would never get the smokeping-style colouring via the matrix.
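The quantity being plotted is straightforward: packets lost divided by packets sent within each time bin. A minimal sketch, assuming each test result is a `(timestamp, packets_sent, packets_received)` tuple and a hypothetical five-minute bin; this is not the amp-web code.

```python
from collections import defaultdict

def loss_percent_series(results, bin_size=300):
    """Aggregate results into time bins and return (bin_start, loss %) pairs.

    results: iterable of (timestamp, packets_sent, packets_received).
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for timestamp, tx, rx in results:
        start = int(timestamp) - int(timestamp) % bin_size
        sent[start] += tx
        lost[start] += tx - rx
    # Skip bins where nothing was sent to avoid dividing by zero.
    return [(start, 100.0 * lost[start] / sent[start])
            for start in sorted(sent) if sent[start] > 0]

# Made-up measurements: two in the first bin, one in the second.
series = loss_percent_series([(0, 10, 10), (100, 10, 8), (300, 10, 5)])
```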
Started messing around with ostinato scripting on the 10g dev boxes and using DPDK to generate packets at 10G rates. Had a few issues initially because I was using an old version of the DPDK-enabled ostinato that Richard had lying around; updating to Dan's most recent version seemed to fix that.
Spent a bit of time looking at the data collected during the CSC and how it might be used as ground truth for developing some security event detection techniques.
Finished up the implementation chapter of the libtrace paper. Added a couple of diagrams to augment some of the textual explanations. Got Richard S. to read over what I've got so far and made a few tweaks based on his feedback.
Spent a decent chunk of time looking at Unknown UDP port 80 traffic in libprotoident. Found a clear pattern that was contributing most of the traffic, which I traced back to Tencent. Unfortunately Tencent publishes a lot of applications so that knowledge wasn't conclusive on its own.
My initial suspicion was that it might have been game traffic so I downloaded and played a few popular multiplayer games via the Tencent games client, capturing the network traffic and comparing it against my current unknown traffic. No luck, but then I had the bright idea to look a bit more closely at video call traffic in WeChat (a messaging app). Sure enough, once I was able to successfully create two WeChat accounts and get a video call going between them, I started seeing the traffic I wanted.
Also added rules for Acer Cloud and OpenTracker over UDP.
Wrote a skeleton for a centralised collector of progger data for Harris to start filling in with actual useful code.
Continued writing up the implementation chapter of the libtrace paper. It's turning out to be a pretty long paper, as there are a lot of design decisions that warrant discussion (memory management, combiners, hashers etc.).
Succumbed to my head cold on Thursday, so had a day at home to rest and recover.
Started writing some content for the parallel libtrace paper. Managed to churn out an introduction, a background and a little bit of the implementation section.
Fixed a couple of bugs in netevmon prior to the deployment: crashing when trying to reconnect to a restarted NNTSC and some confusing event descriptions for changepoint events.
Finished setting up a mobile app test environment for JP. I've configured my old iPhone to act as an extra client for 2-way communication apps (messaging etc.). So far the environment has already been helpful, as we've managed to identify one of the major outstanding patterns as being used by the Taobao mobile shopping app.
Finished up the demo for STRATUS forum and helped Harris put together both a video and a live website.
Spent a bit of time trying to fix some unintuitive traceroute events that we were seeing on lamp. The problem was arising when a normally unresponsive hop was responding to traceroute, which was inserting an extra AS transition into our "path".
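The effect can be reproduced with a toy AS path builder. In the sketch below, each hop is represented by its AS number, or `None` if the hop did not respond; the function and the AS numbers (taken from the range reserved for documentation) are illustrative assumptions rather than the netevmon code.

```python
def as_path(hop_asns):
    """Collapse a list of per-hop AS numbers into an AS-level path.

    Unresponsive hops (None) are skipped, and consecutive hops in the
    same AS are merged into one entry.
    """
    path = []
    for asn in hop_asns:
        if asn is None:
            continue
        if not path or path[-1] != asn:
            path.append(asn)
    return path

# Usual case: the middle hop never responds, so it is invisible.
usual = as_path([64500, None, 64501])
# The same hop responds for once and introduces an extra AS transition.
unusual = as_path([64500, 64510, 64501])
```

The physical route is the same in both cases, but the two AS paths differ, which is enough to look like a path change.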
Rebuilt DPDK and Ostinato on 10g-dev2 after Richard upgraded it to Jessie so that I can resume my parallel libtrace development and testing once he's done with his experiments.
Installed and tested a variety of Android emulators to try and set up an environment where JP and I can more easily capture mobile app traffic. Turned out Bluestacks on my iMac ended up being the most useful, as the others I tried either lacked the Google Play Store (so finding and installing the "official" apps would be hard) or needed more computing power than I had available.
Tested and fixed my vanilla PF_RING libtrace code. I've been able to get comparable performance with the pfcount tool included with the PF_RING release so I'm fairly happy with that. Started working on adding support for the ZC version of the PF_RING driver, which uses an entirely different API.
Helped Harris get his head around how NNTSC works so that he could add support for the Ceilometer data. Set myself up with an OpenStack VM so that I can start working on the web graphs to display the data now that it is in a NNTSC database. Also spent a bit of time writing up an explanation of how netevmon works so that Harris can start looking into running our detectors against the Ceilometer data.
Worked with Brendon on Friday to get NNTSC and netevmon installed and running on the lamp machine.
Wrote some scripts to do some basic exploration of the Ceilometer data to check which collections and series are most suitable for using with netevmon. I've found that around 30-35% of the series for CPU utilisation, network byte rates and disk read/write rates are at least long enough to be worth using -- this works out to ~600 series for each metric, so we'll have a reasonable sample size. The spacing between measurements is more of a concern, as it is very inconsistent. There are some parts of NNTSC and netevmon that assume a fairly constant measurement rate, so these will need to be re-evaluated.
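A check along these lines could be sketched as follows. The function name and the "largest gap more than twice the median" rule are assumptions chosen for illustration, not the criteria used by the actual scripts.

```python
import statistics

def spacing_summary(timestamps):
    """Describe how evenly spaced a series of measurement timestamps is."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not gaps:
        return None  # fewer than two measurements, nothing to compare
    median_gap = statistics.median(gaps)
    return {
        "points": len(timestamps),
        "median_gap": median_gap,
        "max_gap": max(gaps),
        # Assumed rule of thumb: flag series with a gap over 2x the median.
        "irregular": max(gaps) > 2 * median_gap,
    }

# Made-up series: regular 60 second spacing, then an 8 minute hole.
summary = spacing_summary([0, 60, 120, 600])
```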
Started adding PF_RING support to libtrace. For a start, I'm just working with the standard PF_RING driver (not the ZC extension) and I've written code that should work with the old API. Once I've tested that, I'll start adding native parallel support using one thread per receive queue in the driver.
Also spent a bit of time planning a paper on parallel libtrace. I'm anticipating the main narrative to be about how we've achieved better potential performance by adding parallelism (depending on the workload and the number of threads), while still maintaining the key design goals of the library (e.g. abstraction of complexity, format agnosticism, etc.). We'll show that the same parallel libtrace code can achieve better performance across multiple input formats, i.e. DAG, ring, PF_RING (once complete).
Spent a fair chunk of my week proof-reading, first a document responding to questions about the BTM project, then Dan and Darren's Honours reports.
Tracked down and fixed a bug in parallel libtrace where ticks were messing with the ordered combiner, causing some packets to be sent to the reporter out of order. Also managed to replicate and fix the memory leak bug that was causing Yindong's wdcap on wraith to invoke the OOM killer.
Continued poking at unknown port 443 and port 80 traffic in libprotoident. Most of my time was spent trying to install and capture traffic from various Chinese applications that I had reason to suspect were causing most of my remaining unknown traffic, with mixed success.
Finally released the libtrace4 beta on Tuesday, after doing some final testing with the DAG cards in the 10G dev machines.
Managed to find a few more protocols to add to libprotoident, but am now trying to move towards releasing a new version. Started having a closer look at TCP port 80 and TCP port 443 traffic in my Waikato traces, with the aim of trying to get as much traffic correctly classified as I can prior to doing an in-depth analysis of what is actually using those ports.
Spent Friday afternoon reading over Darren's honours report and providing some hopefully useful feedback.