Shane Alcock's Blog
Continued keeping an eye on BTM until Brendon got back on Thursday. Briefed Brendon on all the problems we had noticed and what we thought was required to fix them.
Finished up my event detection webapp. Started experimenting with automating the running of event detectors with a range of different parameter options. The first detector I'm looking at (Plateau) has about 15,000 different parameter combinations that I would like to try, so I'm going to have to be pretty smart about recognising events as being the same across different runs.
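A rough sketch of the two pieces involved: enumerating every parameter combination, and fingerprinting events so the same underlying event is recognised across runs. The parameter names and ranges here are made up for illustration, not the actual Plateau detector options, and bucketing start times is just one plausible way to match events.

```python
# Sketch: sweep a detector parameter grid and de-duplicate resulting events.
# Parameter names/values are hypothetical, not the real Plateau options.
import itertools

def parameter_grid(**options):
    """Yield every combination of the supplied parameter options as a dict."""
    names = sorted(options)
    for values in itertools.product(*(options[n] for n in names)):
        yield dict(zip(names, values))

def event_key(stream_id, start_ts, bucket=300):
    """Treat events on the same stream that start within the same 5 minute
    bucket as the same underlying event, regardless of which run found it."""
    return (stream_id, start_ts // bucket)

# Toy grid: 4 * 3 * 3 = 36 combinations
grid = list(parameter_grid(threshold=[2, 3, 4, 5],
                           window=[30, 60, 120],
                           sensitivity=[0.1, 0.25, 0.5]))
```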
Started adding worker threads to anomaly_ts so that we can be more parallel. Each stream will be hashed to a consistent worker thread so that measurements will always be evaluated in order, but I still have to consider the impact of the resulting events not being in strict chronological order across all streams.
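The stream-to-worker mapping can be sketched in a few lines; the queue setup and worker count here are illustrative, not the actual anomaly_ts internals.

```python
# Sketch: hash each stream to a fixed worker queue so measurements for a
# given stream are always processed in arrival order. Uses zlib.crc32 as
# a stable hash (Python's built-in hash() is salted per-process).
import queue
import zlib

NWORKERS = 4
work_queues = [queue.Queue() for _ in range(NWORKERS)]

def worker_for(stream_id):
    # Same stream always maps to the same worker
    return zlib.crc32(str(stream_id).encode()) % NWORKERS

def dispatch(stream_id, measurement):
    work_queues[worker_for(stream_id)].put((stream_id, measurement))
```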
Continued keeping an eye on the BTM monitors. Changed several connections to use the ISP's DNS server rather than relying on the modem to provide DNS, which seems to have resolved many of our DNS issues.
Spent a bit of time digging into the problem of intermittent latency results for Akamai sites. It appears that our latency tests are interfering with one another: moving one of the previously failing tests to a new offset away from the others fixed the problem for that test.
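One generic way to spread periodic tests out, rather than moving offsets by hand, is to derive a stable per-test offset within the test period. This is just a sketch of the idea, not how our scheduler actually assigns offsets, and the names are illustrative.

```python
# Sketch: give each test a stable start offset within its period so that
# tests to different targets don't all fire at the same instant.
import zlib

def start_offset(test_name, period=60):
    """Deterministic offset in [0, period) derived from the test name."""
    return zlib.crc32(test_name.encode()) % period
```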
Continued working on my Event Detection webapp. Added two new modes: one where the user does the tutorial and then rates 20 pre-chosen events, and one where the user rates the same events without doing the tutorial. This will hopefully give us some feedback on how useful the tutorial is and whether the time required to complete the tutorial is worth it. Also added proper user tracking, with the generation of a unique code at the end of the 'survey' that the user can enter into Mechanical Turk to indicate they have completed the task.
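One common way to generate that kind of completion code is an HMAC over the session id, so that a submitted code can later be verified as genuinely issued by the app. This is a sketch of the approach, not the app's actual implementation, and the secret is obviously a placeholder.

```python
# Sketch: verifiable Mechanical Turk completion codes via HMAC.
import hashlib
import hmac

SECRET = b"replace-with-a-real-secret"  # placeholder key

def completion_code(session_id):
    """Short code shown to the user at the end of the survey."""
    return hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()[:12]

def verify_code(session_id, code):
    """Check a code the worker pasted back into Mechanical Turk."""
    return hmac.compare_digest(completion_code(session_id), code)
```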
Spent much of my week keeping an eye on BTM and dealing with new connections as they come online. Had a couple of false starts with the Wellington machine, as the management interface was up but was not allowing any inbound connections. This was finally sorted on Thursday night (turning the modem on and off again did the trick), so much of Friday was spent figuring out which Wellington connections were working and which were not.
A few of the BTM connections have a lot of difficulty running AMP tests to a few of the scheduled targets: AMP fails to resolve DNS properly for these targets but using dig or ping gets the right results. Did some packet captures to see what was going on: it looks like the answer record appears in the wrong section of the response and I guess libunbound doesn't deal with that too well. The problem seems to affect only connections using a specific brand of modem, so I suspect there is a bug in the DNS cache software on the modem.
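The check implied by the packet captures can be sketched like this: given the parsed sections of a DNS response, flag replies where the record for the queried name turned up outside the answer section. The record tuples are a simplified stand-in for a real parsed message, not any actual DNS library's API.

```python
# Sketch: detect an answer record that landed in the wrong DNS section.
# 'sections' maps 'answer'/'authority'/'additional' to lists of
# (name, rtype, rdata) tuples -- a toy model of a parsed response.
def misplaced_answer(qname, sections):
    def matches(recs):
        return any(name == qname and rtype == "A" for name, rtype, _ in recs)
    if matches(sections.get("answer", [])):
        return False  # record is where a resolver expects it
    # Answer present but in the wrong section -- a strict resolver
    # (e.g. libunbound) may refuse to use it
    return any(matches(sections.get(s, [])) for s in ("authority", "additional"))
```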
Continued tracing my NNTSC live export problem. It soon became apparent that NNTSC itself was not the problem: instead, the client was not reading data from NNTSC, causing the receive window to fill up and preventing NNTSC from sending new data. A bit of profiling suggested that the HMM detector in netevmon was potentially the problem. After disabling that detector, I was able to keep things running over the long weekend without any problems.
Fixed a libwandio bug in the LZO writer. Turns out that the "compression" can sometimes result in a larger block than the original uncompressed data, especially when doing full payload capture. In that case, you are supposed to write out the original block instead but we were mistakenly writing the compressed block.
Much of my week was taken up with matters relating to the Wynyard meeting on Wednesday. The meeting itself went reasonably well and I definitely got the impression there was some interest in what we do and how we do it.
Continued marking the libtrace assignment for 513. Just a handful more to go.
Started getting familiar with the new AMP deployment, so I am better able to keep an eye on it while Brendon and Brad are away. Had a few connections come online on Friday which required a little attention, but overall I think it is still running smoothly.
Short week due to the Easter break.
Prepared an extended version of my latency event detection talk to give to Wynyard Group next week. It'll be nice to not be under so much time pressure when giving the talk this time around :)
Started marking the 513 libtrace assignment.
The live exporting bug in NNTSC remains unsolved. I've narrowed it down to the internal client queue not being read from for a decent chunk of time, but am not yet sure what the client thread is doing instead of reading from the queue.
Continued hunting for the bug in the NNTSC live exporter with mixed success. I've narrowed it down to definitely being the per-client queue that is the problem and it doesn't appear to be due to any obvious slowness inserting into the queue. Unfortunately, the problem seems to occur only once or twice a day, so it takes a day to see whether any changes or additional debugging have helped.
Went back to working on the Mechanical Turk app for event detection. Finally finished a tutorial that shows most of the basic event types and how to classify them properly. Got Brendon and Brad to run through the tutorial and tweaked it according to their feedback. The biggest problem is the length of the tutorial -- it takes a decent chunk of our survey time to just run through the tutorial so I'm working on ways to speed it up a bit (as well as event classification in general). These include adding hot-keys for significance rating and using an imagemap to make the "start time" graph clickable.
Spent a decent chunk of my week trying to track down an obscure libtrace bug that affected a couple of 513 students, which would cause the threaded I/O to segfault whenever reading from the larger trace file. Replicating the bug proved quite difficult as I didn't have much info about the systems they were working with. After going through a few VMs, I eventually figured out that the bug was specific to 32 bit little-endian architectures: due to some lazy #includes, the size of an off_t was either 4 or 8 bytes in different parts of the libwandio source code, which resulted in some very badly sized reads. The bug was found and fixed a bit too late for those affected students unfortunately.
Continued developing code to group events by common AS path segments. Managed to add an "update tree" function to the suffix tree implementation I was using and then changed it to use ASNs rather than characters to reduce the number of comparisons required. Also developed code to query NNTSC for an AS path based on the source, destination and address family for a latency event, so all of the pieces are now in place.
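A toy version of the token-based structure: AS paths are tuples of ASNs rather than strings, and new paths can be added to an existing index without rebuilding it from scratch. A real suffix tree is far more compact; this naive suffix index just illustrates the interface, and the class and method names are my own.

```python
# Sketch: a suffix index over AS paths, tokenised by ASN, supporting
# incremental updates and lookup of events sharing a path segment.
class ASPathIndex:
    def __init__(self):
        self.suffixes = {}  # path suffix (tuple of ASNs) -> set of event ids

    def add_path(self, event_id, as_path):
        """Add one event's AS path without rebuilding the index."""
        path = tuple(as_path)
        for i in range(len(path)):
            self.suffixes.setdefault(path[i:], set()).add(event_id)

    def events_sharing(self, segment):
        """Return ids of all events whose AS path contains this segment."""
        segment = tuple(segment)
        found = set()
        for suffix, ids in self.suffixes.items():
            if suffix[:len(segment)] == segment:
                found |= ids
        return found
```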
In testing, I found a problem where live NNTSC exporting would occasionally fall several minutes behind the data that was being inserted in the database. Because this would only happen occasionally (and usually overnight), debugging this problem has taken a very long time. Found a potential cause in an unhandled EWOULDBLOCK on the client socket, so I've fixed that and am waiting to see if that has resolved the problem.
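The general shape of that kind of fix: on a non-blocking socket, a send that raises EWOULDBLOCK (surfaced as BlockingIOError in Python) means the peer's receive buffer is full, and the unsent data must be kept queued and retried later rather than lost. A minimal sketch, not NNTSC's actual exporter code:

```python
# Sketch: handle EWOULDBLOCK on a non-blocking send by returning the
# unsent remainder so the caller can queue it for a later retry.
import socket

def send_some(sock, data):
    """Send as much of data as the socket will take; return what's left."""
    try:
        sent = sock.send(data)
        return data[sent:]
    except BlockingIOError:  # EAGAIN / EWOULDBLOCK
        return data          # nothing sent; keep everything queued
```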
Did some basic testing of libtrace 4 for Richard, mainly trying to build it on the various OSes that we currently support. This has created a whole bunch of extra work for him due to the various ways in which pthreads are implemented on different systems. Wrote my first parallel libtrace program on Friday -- there was a bit of a learning curve but I got it working in the end.
Back after a week on holiday. Spent a decent chunk of time catching up on emails, mostly from students having trouble with the 513 libtrace assignment.
Continued tweaking and testing the new eventing code. Discovered an issue where the "live" exporter was operating several hours behind the time data was arriving. Looks like there is a bottleneck with one of the internal queues when a client subscribes to a large number of streams, but still investigating this one.
prophet started to run out of disk space again, so had to stop our test data collection, purge some old data and wait for the database to finish vacuuming to regain some disk space. Discovering that we had a couple of GBs of rabbit logs wasn't ideal either.
While fixing the prophet problem, did some reading and experimenting with suffix trees created from AS paths with the aim of identifying common path segments that could be used to group latency events. There doesn't appear to be a python suffix tree module that does exactly what I want, but I'm hoping I can tweak one of the existing ones. The main thing I'm missing is the ability to update an existing suffix tree after concatenating a new string rather than having to create a whole new tree from scratch.
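The grouping idea in miniature: find the longest contiguous segment shared by two AS paths, treating paths as lists of ASNs. A suffix tree does this in linear time; this quadratic version just shows the intent, and the function name is my own.

```python
# Sketch: longest common contiguous segment of two AS paths (naive O(n*m)
# version of what a generalised suffix tree would give in linear time).
def longest_common_segment(path_a, path_b):
    best = ()
    for i in range(len(path_a)):
        for j in range(len(path_b)):
            k = 0
            while (i + k < len(path_a) and j + k < len(path_b)
                   and path_a[i + k] == path_b[j + k]):
                k += 1
            if k > len(best):
                best = tuple(path_a[i:i + k])
    return best
```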
Added graphs for the HTTP test to amp-web, which helped reveal a couple of problems with the HTTP test that Brendon duly fixed.
Updated amp-web to support the new eventing database schema. Managed to get eventing up and running successfully, with significant event groups appearing on the dashboard and events also marked on the graphs themselves.
Gave my annual libtrace lectures, which seemed to go fairly well. Already I've got students thinking about and working on the assignment so they are at least smart enough to start early :)
Continued to develop a tutorial for my event classification app. Found plenty of good examples of different types of events so now it is just a matter of writing all the explanatory text for each example.
Continued working on my Django app for crowd-sourcing event classifications from the general public. The core app is functional but I found that it was very difficult to explain all the intricacies of my personal classification approach using textual instructions alone. As a result, I'm working on adding a tutorial component so that users can be trained using a series of practical exercises where they encounter most of the various types of events we typically see.
Updated eventing to be able to handle more than just latency events and hopefully group events across different collections, i.e. group traceroute path changes with changes in latency. Started testing eventing with live data, so hopefully our event dashboard shouldn't be too far away from being up and running again.
Ryan has asked me to help out with teaching libtrace in 513 again this semester. Spent a day or so working on a new assignment and updating my slides. Looking forward to seeing how the students go with this year's assignment, especially the MSS analysis task.