Continued redevelopment of the NNTSC exporting code to be more robust and reliable. Replaced the live data pipes used by the dataparsers to push live data to the exporter with a RabbitMQ queue, which seems to be working well.
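The shape of that change is a classic producer/consumer decoupling. A minimal sketch of the pattern, with Python's stdlib queue standing in for the RabbitMQ queue and all names invented (this is not NNTSC's actual code):

```python
import queue
import threading

# Stand-in for the RabbitMQ queue between the dataparsers and the exporter.
live_queue = queue.Queue()

def dataparser(measurements):
    # Each dataparser pushes live measurements onto the shared queue
    # instead of writing them down a dedicated pipe to the exporter.
    for m in measurements:
        live_queue.put(m)
    live_queue.put(None)  # sentinel: no more data

def exporter(results):
    # The exporter consumes from the queue at its own pace, so a slow
    # consumer no longer blocks the dataparsers directly.
    while True:
        m = live_queue.get()
        if m is None:
            break
        results.append(m)

results = []
t = threading.Thread(target=exporter, args=(results,))
t.start()
dataparser([("stream1", 1), ("stream1", 2)])
t.join()
```

With RabbitMQ the queue additionally survives across processes, but the producer/consumer shape is the same.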
Modified the way that subscribing to streams works, to try to solve a problem we were having where data that arrived while historical data was being queried was not being pushed out to interested clients. Now we store any live data that arrives for streams that are still being queried and push it out as soon as we get an indication from the query thread that the query has finished.
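The buffering scheme amounts to holding live data back per stream until the query thread signals completion. A sketch of the idea, with hypothetical names (not NNTSC's actual classes):

```python
from collections import defaultdict

class StreamSubscription:
    """Buffer live data for a stream until its historical query completes."""

    def __init__(self):
        self.querying = set()             # streams with a history query in flight
        self.pending = defaultdict(list)  # live data held back per stream
        self.exported = []                # stand-in for data pushed to the client

    def subscribe(self, stream):
        self.querying.add(stream)

    def on_live_data(self, stream, datum):
        if stream in self.querying:
            # History is still being queried: hold the live datum back so
            # it isn't exported ahead of older historical measurements.
            self.pending[stream].append(datum)
        else:
            self.exported.append((stream, datum))

    def on_query_finished(self, stream):
        # Flush everything that arrived while the query was running.
        self.querying.discard(stream)
        for datum in self.pending.pop(stream, []):
            self.exported.append((stream, datum))
```

Once the flush has happened, subsequent live data for that stream is exported immediately.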
Unfortunately, we can still miss historical measurements if they haven't been committed to the database at the time when the final query begins. This often crops up if netevmon is resubscribing after NNTSC has been restarted, resulting in us missing out on the last historical measurement before the subscribe message arrives. Still looking for an elegant way to solve this one.
Added a version check message to the NNTSC protocol. This message is sent by the server as soon as a client connects and the client API has been updated to require its internal version to match the one received from the server. If not, the client stops and prints a message telling the user to update their client API. This should be helpful to the students who were previously getting weird broken behaviour with no apparent explanation whenever I made an API change to the production NNTSC on prophet.
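On the client side the check is just a strict equality test against the version the server announces. A minimal sketch (the constant and message wording are invented for illustration, not NNTSC's wire format):

```python
API_VERSION = "1.2"  # hypothetical version baked into the client API

class VersionMismatch(Exception):
    pass

def check_server_version(server_version, client_version=API_VERSION):
    # The server sends its version as soon as the client connects; the
    # client refuses to continue unless the two match exactly.
    if server_version != client_version:
        raise VersionMismatch(
            "NNTSC server is running version %s but this client API is "
            "version %s -- please update your client API" %
            (server_version, client_version))
```

Failing loudly here replaces the previous behaviour, where a stale client would limp along and break in confusing ways.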
Chased down a build issue with libtrace on FreeBSD 10. Turns out we had made the dist tarball with an old version of libtool which was stupidly written to never build shared libraries if the OS matched FreeBSD 1* (because FreeBSD 1.X didn't support shared libraries). Easy enough to fix, I just have to remember to make the libtrace distribution on something other than Debian Squeeze. Will start working on a new libtrace release in the near future so I don't keep getting emails from FreeBSD users.
Sam Russel solved my vlan problem for me. The issue was that I was using vlan 1, which the pronto treats as native. I tried my tests again using vlan 2 and everything worked perfectly.
I have, however, found a bigger issue with the pronto: the stats counters are extremely inaccurate, much more so than the problems I have been having with openvswitch.
Speaking of openvswitch, Joe has been trying to fix the counter inaccuracy for me, so I have been trying to help with that. I'm starting to come to grips with how the stats reporting is performed in OVS, but I still haven't got any ideas about what is causing the problem with the counters.
A start has been made on mathematically modelling a program, along the lines of Doubletree, that probes each load balancer only once. Initial attempts to model the distributions of the local and global sets are underway and some initial results have been collected.
A run of my version of fastmapping has been started. I am rechecking the validity of this approach to attempting to observe black holes in load balancers.
Looked into the DPDK format again. Since last time, a couple of new releases had been made, which broke our build process because it relied on patches. The main problems came from new NICs and features being added as separate libraries, and from a removed function. I've found a good solution that I expect will keep working for future releases: DPDK has an option to compile all of its libraries into a single one, so instead of patches we will just require a specific build command. For some reason I couldn't get autotools to detect the newest DPDK library, even though it detects the older version correctly; I can't figure out why.
Also started to hack together the parallel version of the DPDK format. I discovered I might not need to use DPDK itself to start the threads, but I need to look into how safe that really is.
Caught up with Perry about any suggestions and requests he had:
* Using empty tick packets to ensure that each thread receives something at a timed interval for live (or tracetime) traces, rather than only after a certain number of packets have arrived.
* Discussed having a message type for packets rather than a separate argument to the per packet function and details of how to best wait for either a message or a file descriptor without slowing performance.
* Link Up/Down messages
* Locking the creation of the filter is not enough, because complex filters might use memory, so one filter must be compiled per thread.
* Having a separate format for every thread which would allow snaplength filters etc to be set separately (I'm still not 100% sold on this)
* Wrapping single threaded formats inside another to split the trace out
* Trace Swiss Army Knife - all tools in one
* Wanted a copy on SVN so he could see the progress so far
The plan for this week is to tidy up the code and get a copy into an SVN branch, get the DPDK build fix working and back into trunk, and update the wiki docs.
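The empty-tick idea discussed above can be sketched as a reader that blocks on its input with a timeout and synthesises a tick message when no packet arrives in time. This is only an illustration of the concept in Python; the message names are made up, not the eventual libtrace API:

```python
import queue

PACKET, TICK = "packet", "tick"

def read_loop(inq, interval, max_messages):
    """Deliver packets, but emit a TICK whenever `interval` seconds pass
    with no packet, so per-thread code always wakes at a timed interval
    instead of only after a certain number of packets."""
    out = []
    while len(out) < max_messages:
        try:
            pkt = inq.get(timeout=interval)
            out.append((PACKET, pkt))
        except queue.Empty:
            # No packet within the interval: synthesise an empty tick.
            out.append((TICK, None))
    return out
```

This also shows why a message type (packet vs. tick vs. link up/down) is more natural than a packet-only per-packet callback.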
This week I have captured a number of traces of BitTorrent traffic, both encrypted and unencrypted, for training the model.
I have also been working on pre-processing these traces to remove flows with packet loss or retransmission as these are unsuitable for training the model.
I also now have the traces sorted into flows with libflowmanager, and a list of packet sizes and arrival times associated with each flow.
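libflowmanager does the actual flow matching in C++ (including expiry and matching both directions to one flow); the per-flow bookkeeping described here is conceptually just the following Python stand-in, which for brevity keys each direction separately:

```python
from collections import defaultdict

def sort_into_flows(packets):
    """Group packets into flows keyed on the usual 5-tuple and record
    the (arrival time, size) pairs seen on each flow.

    `packets` is an iterable of (src_ip, dst_ip, src_port, dst_port,
    proto, timestamp, size) tuples.
    """
    flows = defaultdict(list)
    for src, dst, sport, dport, proto, ts, size in packets:
        key = (src, dst, sport, dport, proto)
        flows[key].append((ts, size))
    return flows
```

The (time, size) sequences per flow are exactly the features needed for training the model.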
Tidied up the reporting done on the icmp, traceroute and dns tests in
AMP to use variable length strings for names, as well as properly
packing and byteswapping the reporting structures. The average report
message size should now be much smaller than it was. Also updated the
nntsc plugins for amp data to deal with the new format.
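On the Python side, a packed, network-byte-order report with a variable-length name decodes along these lines. The field layout below is invented for illustration; it is not AMP's actual reporting structure:

```python
import struct

def pack_report(name, latency_usec):
    """Pack a report as: uint16 name length, name bytes, uint32 latency,
    all in network byte order ('!' in struct format strings)."""
    encoded = name.encode("utf-8")
    return (struct.pack("!H", len(encoded)) + encoded +
            struct.pack("!I", latency_usec))

def unpack_report(data):
    (namelen,) = struct.unpack_from("!H", data, 0)
    name = data[2:2 + namelen].decode("utf-8")
    (latency,) = struct.unpack_from("!I", data, 2 + namelen)
    return name, latency
```

Variable-length names are the main win here: a short name no longer pays for a fixed maximum-size buffer in every report.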
Tweaked the parser for the http test to better ignore strings that look
like they are generated on the fly within the <script> block.
Started going through all the NNTSC exporting code and replacing any instances of blocking sends with non-blocking alternatives. This should ultimately make both NNTSC and netevmon more stable when processing large amounts of historical data. It is also proving a good opportunity to tidy up some of this code, which had gotten a little ropey with all the hacking done on it leading up to NZNOG.
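The non-blocking pattern is roughly: attempt the send, and buffer whatever the socket wouldn't accept instead of stalling the whole exporter. A minimal sketch (class name invented, not NNTSC's actual code):

```python
import socket

class NonBlockingSender:
    """Send on a non-blocking socket, buffering anything that doesn't
    fit so the caller never stalls in a blocking send()."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.backlog = b""

    def send(self, data):
        self.backlog += data
        try:
            sent = self.sock.send(self.backlog)
            self.backlog = self.backlog[sent:]
        except BlockingIOError:
            # Socket buffer full: keep the data and retry later, e.g.
            # when select/poll reports the socket writable again.
            pass
```

A slow or stalled client then just grows its own backlog rather than blocking the thread that serves everyone else.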
Spent a decent chunk of my week catching up on various support requests. Had two separate people email about issues with BSOD on Friday.
Wrote a draft version of this year's libtrace assignment for 513. I've changed it quite a bit from last year's, based on what the students managed to achieve then. The assignment itself should require a bit more work this time around, but should be easily doable in plain C rather than requiring the additional learning curve of the STL. It should also be much harder to just rip off the examples :)
Read through the full report on a study into traffic classifier accuracy that evaluated libprotoident along with a bunch of other classifiers ( http://vbn.aau.dk/files/179043085/TBU_Extended_dpi_report.pdf ). Pleased to see that libprotoident did extremely well in the cases where it would be expected to do well, i.e. non-web applications.
On Monday this week we set up the pronto so that I can examine its foibles in practice.
It turns out there are a few of them. Multiple tables are a wash; the feature simply doesn't seem to work.
I am currently testing what I can do with vlans. The documentation says vlan stripping doesn't work; however, as far as I can tell my packets are leaving the switch with no vlan tags. I figure this is something I am doing wrong with my test, because it really makes no sense to me at all otherwise, but I can't track the problem down.
I also thought a bit about what to do when you discover packet loss. I can definitely identify whether packet loss is caused by congestion or more mysterious means provided the level of congestion is not too much more significant than the amount of mysterious packet loss.
Was at NZNOG in Nelson for all of this week. Enjoyed the SDN workshop,
which made a lot of those concepts more concrete and real for me.
Presented part of our talk about AMP, and it sounds like there are a
few people who are keen to be involved in running it (as new hosts, or
even within their network as part of their own monitoring).
Spent most of the week tidying up things in preparation for the AMP
website to be demoed at NZNOG next week. Fixed up some graph colours,
labels and descriptions that were inconsistent across views so that they
all match. Tried to squeeze a bit more performance out of our current
database setup to make our queries faster, and wrote a quick script to
keep refreshing the matrix data into memcache.
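That refresh script boils down to recomputing the matrix data on a timer and writing it into the cache before the old entry goes stale, so page loads always hit warm cache instead of the database. A sketch of the loop, using a dict-backed stand-in for the memcache client (the key name, expiry and structure are all made up):

```python
import time

class FakeCache:
    """Dict-backed stand-in for a memcache client: just set() and get()."""
    def __init__(self):
        self.store = {}
    def set(self, key, value, expiry=0):
        self.store[key] = value
    def get(self, key):
        return self.store.get(key)

def refresh_matrix(cache, fetch_matrix, rounds, interval=0.0):
    # Re-run the expensive matrix query and overwrite the cached copy
    # on every pass, so clients never trigger the query themselves.
    for _ in range(rounds):
        cache.set("amp-matrix", fetch_matrix(), expiry=300)
        time.sleep(interval)
```

A real deployment would loop forever with an interval shorter than the cache expiry; `rounds` here just keeps the sketch finite.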
Added some error notifications for when ajax requests failed so that the
user has some feedback rather than waiting forever with no indication
that something might have gone wrong.
Fixed up a bug with the area selection in the summary graph that would
shrink the selection if it was at the edge of the graph.