
Shane Alcock's Blog




Fixed my remaining issues with threaded anomaly_ts. Had a few problems where a call to sscanf was interfering with some strtok_r calls I was making, but once I replaced the sscanf with some manual string parsing everything worked again.

Continued looking into my NNTSC live queue delays. Narrowed the problem down to there being a time delay between publishing a message to the live rabbit queue and the message actually appearing in the queue (thanks to the firehose feature in rabbitmq!). After doing a fair bit of reading and experimenting, I theorised that the cause was the live queue being 'durable'. Even though the published messages themselves are not marked as persistent, publishing to a durable queue seems to require touching disk which can be slow on a resource-constrained machine like prophet. Removed the durable flag from the live queue and managed to run successfully over the long weekend without ever falling behind.

Migrated all netevmon configuration to use a single YAML config file for all three components. Previously, each component supported a series of getopt command line arguments which was a bit unwieldy.
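
The combined file looks something along these lines (the keys below are illustrative placeholders, not the actual netevmon option names):

```yaml
# One YAML file shared by all three netevmon components,
# replacing three separate sets of getopt arguments.
nntsc:
  host: nntsc.example.org
  port: 61234
anomaly_ts:
  workers: 4
eventing:
  database: netevmon_events
```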




Continued refactoring the matrix javascript code in amp-web to be less of an embarrassment. This took quite a bit longer than anticipated because a) javascript and b) I was trying to ensure that switching between different matrix types would result in sensible meshes, metrics and splits being chosen based on past user choices. Eventually got to the stage where I'm pretty happy with the new code so now we just need to find a time to deploy some of the changes on BTM.

Started testing my new parallel anomaly_ts code. The main hiccup was that embedded R is not thread-safe, so I've had to wrap any calls out to R with a mutex. This creates a bit of a bottleneck in the parallel system so we may need to revisit writing our own implementation of the complex math that I've been fobbing off to R. After fixing that, latency time series seem to work fairly well in parallel but AS traceroute series definitely do not so I'll be looking into that some more next week.




Spent a week working on the amp-web matrix. First task was to add HTTP and Throughput test matrices so we could make the BTM website available to the various participants. This was a bit trickier than anticipated as a lot of the matrix code was written with just the ICMP test in mind so there were a lot of hard-coded references to IPv4/IPv6 splits that were not appropriate for either test.

Updated amp mesh database to list which tests were appropriate for each mesh. This enabled us to limit the mesh selection dropdowns to only contain meshes that were appropriate for the currently selected matrix, as there was little overlap between the targets for the latency, HTTP and throughput tests.

Once that was all done, I started going back over all of the matrix code to make it much more maintainable and extendable. Collection-specific code was moved into the existing collection modules that already handled other aspects of amp-web, rather than the previous approach of hideous if-else blocks all through the matrix and tooltip code. Finished fixing all the python code in amp-web and started on the javascript on Friday afternoon.




Continued keeping an eye on BTM until Brendon got back on Thursday. Briefed Brendon on all the problems we had noticed and what we thought was required to fix them.

Finished up my event detection webapp. Started experimenting with automating the running of event detectors with a range of different parameter options. The first detector I'm looking at (Plateau) has about 15,000 different parameter combinations that I would like to try, so I'm going to have to be pretty smart about recognising events as being the same across different runs.

Started adding worker threads to anomaly_ts so that we can be more parallel. Each stream will be hashed to a consistent worker thread so that measurements will always be evaluated in order, but I still have to consider the impact of the resulting events not being in strict chronological order across all streams.




Continued keeping an eye on the BTM monitors. Changed several connections to use the ISP's DNS server rather than relying on the modem to provide DNS, which seems to have resolved many of our DNS issues.

Spent a bit of time digging into the problem of intermittent latency results for Akamai sites. It appears that our latency tests are interfering with one another as moving one of the previously failing tests to a new offset away from the others fixed the problem for that test.

Continued working on my Event Detection webapp. Added two new modes: one where the user does the tutorial, then rates 20 pre-chosen events and one where the user rates the same events without doing the tutorial. This will hopefully give us some feedback on how useful the tutorial is and whether the time required to complete the tutorial is worth it. Also added proper user tracking, with the generation of a unique code at the end of the 'survey' that the user can enter into the Mechanical Turk to indicate they have completed the task.




Spent much of my week keeping an eye on BTM and dealing with new connections as they came online. Had a couple of false starts with the Wellington machine, as the management interface was up but was not allowing any inbound connections. This was finally sorted on Thursday night (turning the modem on and off again did the trick), so much of Friday was spent figuring out which Wellington connections were working and which were not.

A few of the BTM connections have a lot of difficulty running AMP tests to a few of the scheduled targets: AMP fails to resolve DNS properly for these targets but using dig or ping gets the right results. Did some packet captures to see what was going on: it looks like the answer record appears in the wrong section of the response and I guess libunbound doesn't deal with that too well. The problem seems to affect only connections using a specific brand of modem, so I suspect there is some bug in the DNS cache software on the modem.

Continued tracing my NNTSC live export problem. It soon became apparent that NNTSC itself was not the problem: instead, the client was not reading data from NNTSC, causing the receive window to fill up and preventing NNTSC from sending new data. A bit of profiling suggested that the HMM detector in netevmon was potentially the problem. After disabling that detector, I was able to keep things running over the long weekend without any problems.

Fixed a libwandio bug in the LZO writer. Turns out that the "compression" can sometimes result in a larger block than the original uncompressed data, especially when doing full payload capture. In that case, you are supposed to write out the original block instead but we were mistakenly writing the compressed block.




Much of my week was taken up with matters relating to the Wynyard meeting on Wednesday. The meeting itself went reasonably well and I definitely got the impression there was some interest in what we do and how we do it.

Continued marking the libtrace assignment for 513. Just a handful more to go.

Started getting familiar with the new AMP deployment, so I am better able to keep an eye on it while Brendon and Brad are away. Had a few connections come online on Friday which required a little attention, but overall I think it is still running smoothly.




Short week due to the Easter break.

Prepared an extended version of my latency event detection talk to give to Wynyard Group next week. It'll be nice to not be under so much time pressure when giving the talk this time around :)

Started marking the 513 libtrace assignment.

The live exporting bug in NNTSC remains unsolved. I've narrowed it down to the internal client queue not being read from for a decent chunk of time, but am not yet sure what the client thread is doing instead of reading from the queue.




Continued hunting for the bug in the NNTSC live exporter with mixed success. I've narrowed it down to definitely being the per-client queue that is the problem and it doesn't appear to be due to any obvious slowness inserting into the queue. Unfortunately, the problem seems to only occur once or twice a day, so it can take a day to see whether any changes or additional debugging have had any effect.

Went back to working on the Mechanical Turk app for event detection. Finally finished a tutorial that shows most of the basic event types and how to classify them properly. Got Brendon and Brad to run through the tutorial and tweaked it according to their feedback. The biggest problem is the length of the tutorial -- it takes a decent chunk of our survey time to just run through the tutorial so I'm working on ways to speed it up a bit (as well as event classification in general). These include adding hot-keys for significance rating and using an imagemap to make the "start time" graph clickable.

Spent a decent chunk of my week trying to track down an obscure libtrace bug that affected a couple of 513 students, which would cause the threaded I/O to segfault whenever reading from the larger trace file. Replicating the bug proved quite difficult as I didn't have much info about the systems they were working with. After going through a few VMs, I eventually figured out that the bug was specific to 32-bit little-endian architectures: due to some lazy #includes, the size of an off_t was either 4 or 8 bytes in different parts of the libwandio source code, which resulted in some very badly sized reads. The bug was found and fixed a bit too late for those affected students, unfortunately.




Continued developing code to group events by common AS path segments. Managed to add an "update tree" function to the suffix tree implementation I was using and then changed it to use ASNs rather than characters to reduce the number of comparisons required. Also developed code to query NNTSC for an AS path based on the source, destination and address family for a latency event, so all of the pieces are now in place.

In testing, I found a problem where live NNTSC exporting would occasionally fall several minutes behind the data that was being inserted into the database. Because this would only happen occasionally (and usually overnight), debugging this problem has taken a very long time. Found a potential cause in an unhandled EWOULDBLOCK on the client socket, so I've fixed that and am waiting to see if that has resolved the problem.

Did some basic testing of libtrace 4 for Richard, mainly trying to build it on the various OS's that we currently support. This has created a whole bunch of extra work for him due to the various ways in which pthreads are implemented on different systems. Wrote my first parallel libtrace program on Friday -- there was a bit of a learning curve but I got it working in the end.