Libprotoident is a library that performs application layer protocol identification for flows. Unlike many techniques that require capturing the entire packet payload, only the first four bytes of payload sent in each direction, the size of the first payload-bearing packet in each direction and the TCP or UDP port numbers for the flow are used by libprotoident. Libprotoident features a very simple API that is easy to use, enabling developers to quickly write code that can make use of the protocol identification rules present in the library without needing to know anything about the applications they are trying to identify.
This study aims to identify and quantify applications that are making use of port numbers that are typically associated with other major Internet applications (i.e. port 53, 80, 123, 443, 8000 and 8080) to bypass port-based traffic controls such as firewalls. We use lightweight packet inspection to examine each flow observed using these ports on our campus network over the course of a week in September 2015 and identify applications that are producing network traffic that does not match the expected application for each port. We find that there are numerous programs that co-opt the port numbers of major Internet applications on our campus, many of which are Chinese in origin and are not recognized by existing traffic classification tools. As a result of our investigation, new rules for identifying over 20 new applications have been made available to the research community.
Double report due to being away at IMC.
Gave a practice run of my IMC talk. Still needed to carve out some content and streamline it a bit, so spent more time working on that.
Documented the process for adding a new metric (i.e. a new graph/matrix for an existing collection) to the AMP website. Worked through an example by adding HTTP page size to my development website. Identified a number of issues with some of the terminology in the amp-web code that need to be fixed in the long run, but this will probably require a decent bit of code re-architecting.
Attended IMC in Santa Monica. Talk seemed to go over pretty well and it was great to catch up with people I had met at AIMS earlier this year, as well as some familiar faces from previous IMCs. Took the opportunity to have a brief holiday in L.A. afterwards.
Continued reading over versions of Dimeji's and Richard's papers and providing them with (hopefully) useful feedback.
Continued working on my IMC talk. All of the content is in place now, but it's probably a bit long. Will try to refine more before I leave.
More tweaking of the syscall analysis code to try and make sure the output matches my expectations in each example that I've tested it against. The algorithm still doesn't work too well in the presence of multi-syscall loops, so I'll need to think of an approach to recognise and represent loops rather than treating each iteration of the loop as a branch.
Released new versions of libprotoident and libflowmanager with the new LGPL licensing. Also re-licensed and tested potential libtrace and wandio releases but haven't quite got to the stage where I want to push out the releases just yet.
Continued messing around with deriving FSMs from common system call patterns and turning them into runnable code. I've got 8 FSMs drawn up and have implemented 5 of them. Developed a bit of backend for applying my FSMs to the log data so that I can implement new FSMs with the least amount of coding possible (e.g. common actions like checking fd consistency and making sure paramaters match expected values are all done within a parent FSM class and the child classes just list the relevant data to compare against). Hopefully this will help move towards automated generation of the FSM code.
Had a few meetings where we discussed the FSM approach (and RA3 in general) with a few of the industry partners and they seem reasonably pleased with what we are trying to achieve so that's reassuring.
Helped Brendon try to debug some issues with data not appearing on graphs on the recently updated deployment. As a result of this, we've realised we need to re-think how we are storing and presenting traceroute data so that we can't avoid these problems in the future.
Libprotoident 2.0.9 has been released today.
The biggest change in this release is that libprotoident is now using the LGPL v3 license rather than the GPL v2 license. We hope that this will be welcome news to some people who had previously wanted to use libprotoident in their software but were put off by the restrictions of the GPL license. Note that we are aware that our other libraries (libtrace, libflowmanager, wandio) that libprotoident depends on are still GPL -- rest assured that LGPL versions of these libraries will appear soon.
We've also added support for another 12 new application protocols, including Facebook Messenger, Facebook Zero, Overwatch and Baidu Yun P2P. We've improved the rules for a further 16 protocols such as Google Hangouts, Minecraft, QUIC, World of Warcraft and DOTA2.
As always, the full list of changes can be found in the libprotoident ChangeLog.
Started looking at the most common patterns in my example sysdig logs. It's pretty obvious that we can easily recognise some low-level actions based on the sequence of system calls and produce models that can be used to identify them. For example, loading a .so shared library will generally result in the same sequence of system calls (with some minor variations) and therefore that can be expressed as a finite state machine.
Developed FSMs for four low level actions: loading a .so library, loading a python module, receiving a typed character via ssh and reading a modprobe config file. Implemented the SSH action as code so I can now find and replace those sequences in the logs with a single SSHCharInput action.
Helped Brendon install NNTSC, ampy and amp-web packages on one of our existing deployments on Thursday. We ended up with a problem where NNTSC would not return query data to the web-site and it took a lot of time (and debugging) to find the source of our problem: incongruous versions of psycopg2 in pip vs the debian package.
Started prepping a libprotoident release. libprotoident is moving to an LGPL license so I've had to replace the blurb at the top of every source file. Been working through the usual pre-release testing and ChangeLog updating.
Spent Wednesday at the Honours conference. I thought all of our students presented well and gave good accounts of their work so far.
Worked on the camera-ready version of my IMC paper. Managed to add some nice content to address the reviewer feedback we got, only to find that I had been using a font size that was too small (i.e. the default font size that every previous IMC has ever used). Unfortunately, switching to the bigger font size would mean I would have to remove almost all the new content I had added, so I'm hoping the PC chairs will change their mind and revert back to the old font size.
Wrote the basic architecture for a provenance log parsing library that can be used with both live progger records and sysdig log files. This will replace the old progger-central which I had written as a hacky PoC which was in danger of becoming production code otherwise.
Got my script to extract common patterns from Sysdig logs working reasonably well. Took a few attempts to get some nicely formatted output that contains all the information I should need to track down what actions are causing the repeated patterns.
Spent a fair bit of time helping CROW get a handle on the Endace Probe, what it can do and how it might fit into their research goals.
Listened to our 520 student practice talks on Thursday. The projects themselves are pretty good -- just the usual issue of the students underselling just how much actual work they had put in to the development side of their project.
Got NNTSC and amp-web working with the sysdig data that Harris gave me, so we have a simple proof-of-concept. After talking with Harris some more, he is interested in finding patterns in the syscall logs that are "predictable" so that we can build models of known specific actions on a system (e.g. opening a file with vim, starting a python interpreter etc). Started working on a script to identify common patterns in the sysdig logs so that we can get an idea as to what these patterns look like and how hard they will be to recognise and identify.
Continued tracking down unknown traffic patterns with libprotoident. Managed to nail one pattern that had been bugging me for a long time: the Baidu Yun P2P protocol. Also added rules for YY, Overwatch, Zoom TCP and NetCat CCTV.
My IMC paper on unexpected traffic on well-known ports was accepted, which is great news. Spent Monday going over reviewer feedback and thinking about what revisions I need to make for the camera-ready version.
Continued working on integrating STRATUS with NNTSC. Spent way too much time trying to figure out why my data was not being inserted into the Influx database -- turns out the timestamp for the test data I was using was too old for the default retention policies so it was being automatically discarded. Fudged the test data times to be more recent and it finally worked.
Added file operations metric support to ampy and amp-web so we can now look at simple graphs of open frequency data. Found some scalability issues with our modal dialogs in cases where the number of possible options for a dropdown is very high, so I've gone back and added pagination support to all modal dropdowns so they only load 30 or so options at a time. This had some interesting flow-on effects, especially for the latency modal dialog which had a lot of custom code for populating the tabs for the different latency metrics. I think I've ironed out all of the extra wrinkles now.
Spent a little more time with the July traces to track down some more unknown protocols. Added a rule for the Netcore vulnerability scan (which happens a lot!) and updated rules for a lot of (mostly game-related) protocols.
Started working on integrating some of the STRATUS metrics into NNTSC so that we can explore using time-series based event detection to highlight potentially interesting file interactions. Going forward, I'm going to be splitting my time 50:50 between STRATUS development and WAND research work -- existing research might progress a bit slower as a result.
Continued poking at unknown flows in the July trace data. Added protocols for Final Fantasy XIV and Facebook Messenger. Noticed that we are still having issues with the vDAG pipe on the probe that services wdcap dropping packets so our captures are sometimes missing packets. Moving IP encryption off onto wraith seems to have helped with this, but is not an ideal solution.
Short week after taking leave on Monday and Tuesday.
Spent most of my remaining week looking at some new captures I took using the upgraded Probe. The main aim was to see whether there were any new protocols that libprotoident should be able to identify. Managed to find a handful of new protocols: Facebook Zero, Forticlient SSL VPN and Discord, as well as made some improvements to the rules for existing protocols (including the AMP throughput test!).
Most of my time was actually spent unsuccessfully hunting down what appears to be a new Chinese P2P protocol, which is a shame because it was contributing a very large amount of unknown traffic in my sample dataset.
Using BSOD on the live traffic feed also allowed me to spot a student that was doing vast quantities of torrenting on the campus network (which Brad reported to ITS) and our WITS FTP server being hammered with tons of download attempts from China. Fair to say, we've gotten some good milage of the upgraded Probe already.
Fixed a couple of outstanding bugs in amp-web. Should be ready to push some new packages out to skeptic and lamp early next week now.
Finished up the first release version of the event filtering for amp-web and rolled it out to lamp on Thursday morning. Most of this week's work was polishing up some of the rough edges and making sure the UI behaves in a reasonable fashion -- Brad was very helpful playing the role of an average user and finding bad behaviour.
Post-release, tracked down and fixed the issue that was causing netevmon to not run the loss detector. Added support for loss events to eventing and the dashboard.
Released a new version of libprotoident, which includes all of my recent additions from the unexpected traffic study.
Marked the last libtrace assignment and pushed out the marks to the students.
After what seems like forever, I've finally managed to put together a new libprotoident release that includes all of the new protocol rules I've developed over the past couple of years. This release adds support for around 70 new protocols, including QUIC, SPDY, Cisco SSL VPN, Weibo and Line. A further 28 protocols have had their rules refined and improved, including BitTorrent, QQ, WeChat, Xunlei and DNS.
The lpi_live tool has been removed in this release, as this has been decommissioned in favour of the lpicollector tool.
Also, please note that libflowmanager 2.0.4 is required to build the libprotoident tools. Older versions of libflowmanager will fail the configure check.
The full list of changes can be found in the libprotoident ChangeLog.
Started writing up a short paper on the unexpected traffic analysis I've been doing for the past few weeks. Made decent progress -- I've got a mostly complete draft, just missing a conclusion and an abstract.
Spent a decent chunk of Thursday dealing with the fallout from upgrading influxdb to 0.11 on prophet. This broke most of our existing rollup tables, as the data type that we were now inserting (int) was no longer compatible with the data type that we apparently used to insert (float). Compounding matters was influxdb's lack of visibility into what data types are associated with any given column. Ended up trashing and re-creating the database (somewhat by accident) which fixed the problem, but not an ideal solution if we ever roll this out in production.
513 assignment was due at 5pm on Friday, so dealt with a few final queries from students. 20 submissions in the end, so a bit of marking to do next week.
Continued making progress with my unidentified mice flows in libprotoident. Added a whole pile of new rules, mostly for various Chinese apps again. Have probably done enough now that I can draw a line under this and start writing the paper itself; there are a few obvious patterns that I would like to identify but this has consumed a lot of time already.
Answered a handful of questions from 513 students -- mostly intelligent ones, so I'm reasonably confident about how the class is going overall. Due date is this coming Friday, so we'll know for sure soon enough.
Helped finish off the funding proposal in the first half of the week.
Continued working with libprotoident. This week I gave up on the elephant flows and started looking at the mice flows. Found some interesting stuff; the highlight being a huge number of flows on TCP port 80 that seem to be associated with the Baidu web browser. The behaviour of these flows is particularly odd: connect to server, send a FIN with seqno N, retransmit FIN a few times, send a non-FIN packet with 1 byte of payload (0x00) and seqno N-1 (incredibly invalid TCP behaviour!), server sends a RST. End result is > 150,000 flows over a week on port 80 with a single outgoing byte of payload.
Added some filters on the Endace probe to see if we can find people doing this traffic on campus, as the Baidu browser is pretty well-known for having a tendency to leak all sorts of private data back to its masters. Found multiple staff PCs that appear to be doing this sort of traffic, so Brad and I will try to prepare a report for ITS next week.
Met with Nathan at Lightwire on Thursday afternoon re: AMP and netevmon. Came away with plenty of ideas and suggestions for improvements we can make and hopefully we also helped Nathan understand parts of our system better as well. The good news is that netevmon seems to mostly be picking up valid events, but even so the number and frequency of these events can be overwhelming so we need better control over what events are shown to the user.
Worked on the next MBIE funding proposal document. Still got a fair way to go so this will probably eat up a lot of next week too.
Continued trying to identify the remaining Unknown applications in the Waikato Sept 15 traces. Only managed to identify one new protocol (Xunlei Accelerated) but this did account for 14G of unknown traffic on TCP port 8080 so that has gotten rid of the biggest outstanding quantity of unknown traffic. The rest are looking like they might get the better of me -- it's almost all Chinese in origin and I can identify the parent company (Tencent, CERNET, Taobao etc) but actually figuring out which of the myriad of apps these companies own is mostly just trial and error at this stage.
Continued working away at the Unknown traffic from my libprotoident port study. Added new protocols for Telegram Messenger and Kuguo, as well as improved DNS (especially TCP DNS) and NTP matching. I still have a bit more Unknown traffic to identify before I'd be comfortable putting the results in a paper, but we're getting closer.
Gave my 513 lectures this week. Looking forward to seeing how the class get on with my assignment.
Met with Ryan Jones who is doing an Honours project that will use netevmon to try and find events in the CSC data. Gave him access to the code and a few hints to start out, but I imagine I'll have to dedicate some more time to this over the course of the year.
My fixes to Andy's InfluxDB code seems to be resulting in consistent and correct bins being stored in the rollup tables. Threw netevmon at the development system to see if it can cope, which it seems to be doing OK. There's still a bit of a concern around long-term memory usage, but I'll see how that pans out over the next couple of weeks.
Spent the rest of my week concentrating on finishing up JP's summer study on unexpected traffic on typically open ports. Managed to improve a few existing rules to recognise more traffic, as well as add new rules for QQ video chat and what appears to be a C&C covert channel for some Chinese malware using UDP port 53. Started framing up a paper for IMC based on this study.
Did some final prep work for the libtrace lectures and assignment for 513.
Finished up the implementation chapter of the libtrace paper. Added a couple of diagrams to augment some of the textual explanations. Got Richard S. to read over what I've got so far and made a few tweaks based on his feedback.
Spent a decent chunk of time looking at Unknown UDP port 80 traffic in libprotoident. Found a clear pattern that was contributing most of the traffic, which I traced back to Tencent. Unfortunately Tencent publishes a lot of applications so that knowledge wasn't conclusive on its own.
My initial suspicion was that it might have been game traffic so I downloaded and played a few popular multiplayer games via the Tencent games client, capturing the network traffic and comparing it against my current unknown traffic. No luck, but then I had the bright idea to look a bit more closely at video call traffic in WeChat (a messaging app). Sure enough, once I was able to successfully create two WeChat accounts and get a video call going between them, I started seeing the traffic I wanted.
Also added rules for Acer Cloud and OpenTracker over UDP.