I took the first step towards actually writing this week. Figured out my chapter layout and made a bunch of notes about what to include.
I also completed my first test. It took pretty much all week to run, so I may have to do slightly fewer tests somehow. Either fewer repetitions or fewer different values. Probably both.
I figured out a solution to my multiple paths algorithm, basically just by prioritising the paths from the current node and having everything fall into line with that. It makes it fairly slow, and it means that you end up prioritising longer paths over shorter ones in a lot of situations, but there is a limit to how bad the paths can be and it works. It may still be non-polynomial, but it is at least precalculable. Networks can't have all that many nodes, right?
Further work has been carried out on the black hole detection system based on a fast mapping approach. An initial data set has been collected and construction of an analysis routine has begun to investigate the series of MDA and Paris traceroute runs. Much of the code from the earlier routine can be reused; however, the new data sets have all the traces mixed in together, so the traces for analysis must first be identified and grouped by destination address. This is so that destinations where a black hole is found can be reported.
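The grouping step described above can be sketched as follows. The record format (dicts with a `dst` key and a `reached` flag) is purely hypothetical; the real MDA/Paris trace records will differ.

```python
from collections import defaultdict

def group_by_destination(traces):
    """Group a mixed list of trace records by destination address.

    Each record is assumed to be a dict with a 'dst' key (hypothetical
    format -- the real warts/MDA records will look different).
    """
    groups = defaultdict(list)
    for trace in traces:
        groups[trace["dst"]].append(trace)
    return dict(groups)

def blackhole_candidates(groups):
    """Example report step: a destination is a black hole candidate
    if none of its traces reached the destination (illustrative rule,
    not the actual detection criterion)."""
    return [dst for dst, traces in groups.items()
            if all(not t.get("reached", False) for t in traces)]
```

Grouping first means each destination's MDA and Paris runs can then be compared as a unit when deciding whether to report it.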
Another angle on this same work is the development of the drivers. It turns out that the program loop waits at some points if no new results need processing. This means that scheduled regular tasks will not be triggered if they rely on the loop circulating: in particular, changes to the targets list will not be processed and new targets will not be analysed. This will require investigating how to avoid the waiting at certain steps. Once that is achieved, some sleeps will also need to be added to avoid excessive CPU usage.
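One common way to address both problems at once is to replace the indefinite blocking wait with a wait-with-timeout, so the loop keeps circulating and the timeout itself prevents CPU spinning. A minimal sketch, where the queue, callbacks, and "stop" sentinel are all hypothetical stand-ins for the real driver:

```python
import queue

def run_loop(results, process, run_scheduled, poll_interval=0.5):
    """Event-loop sketch: rather than blocking forever on the result
    queue (which starves scheduled tasks such as target-list reloads),
    wait with a timeout so scheduled work always gets a turn."""
    while True:
        try:
            item = results.get(timeout=poll_interval)
        except queue.Empty:
            item = None            # nothing to process this time round
        if item is not None:
            if item == "stop":     # sentinel used only in this sketch
                break
            process(item)
        run_scheduled()            # scheduled tasks run every iteration
```

With this shape no separate sleep is needed when idle, because `get(timeout=...)` already yields the CPU while waiting.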
The Internet simulator appears to have carried out a successful simulation when the data set was reduced to a third. This success was achieved after having to change an existing assertion about some data variables. It seems that under certain circumstances the assertion was being triggered by an allowable condition. The assertion in question is: assert(firstHintTime <= simTime); There is a method, initialiseHints(void), which can occasionally reset firstHintTime and make it greater than simTime.
I have also started on an algorithm to process warts data and approximate a simulation without the great cost of processing packet by packet. This approach can still provide information about packet costs, as warts records the most commonly needed packet details.
Updated some configuration in amp-web to allow fully specifying how to
connect to the amp/views/event databases.
Set up some throughput tests to collect data for Shane to test inserting
the data. While doing so I found and fixed some small issues with
schedule parsing (test parameters that included the schedule delimiter
were being truncated) and test establishment (EADDRINUSE wasn't being
picked up in some situations).
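The EADDRINUSE fix amounts to treating "address in use" as a distinct, recoverable condition rather than a generic failure. A minimal sketch of that idea (the port-stepping retry policy is illustrative, not the actual amplet logic):

```python
import errno
import socket

def bind_test_socket(port, retries=3):
    """Bind a test socket, explicitly recognising EADDRINUSE and
    trying the next candidate port; any other error is re-raised.
    Hypothetical helper, not the real test-establishment code."""
    for attempt in range(retries):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port + attempt))
            return sock, port + attempt
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise              # genuinely unexpected error
            # else: address in use, try the next candidate port
    raise RuntimeError("no free port found")
```

The key line is the `errno` comparison: without it, an in-use address and a real failure look identical to the caller.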
Started adding configuration support for running multiple amplet clients
on a single machine. Some schedule configuration can be shared globally
between all clients, but they also need to be able to specify schedules
that belong only to a single client. Nametables, keys, etc. also need to
be set up so that each client knows where its own resources are.
Started writing code to configure rabbitmq on a client and isolate our
data from anything else that might already be on that broker (e.g.
another amplet client). Each amplet client should now operate within a
private vhost and no longer require permissions on the default one.
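The vhost isolation described above boils down to two standard rabbitmqctl invocations per client. A sketch of what the configuration code needs to run (the vhost/user naming scheme is just an example, not the real amplet convention):

```python
def rabbitmq_isolation_commands(vhost, user):
    """Return the rabbitmqctl invocations that give an amplet client a
    private vhost. The subcommands (add_vhost, set_permissions) are
    standard rabbitmqctl; names here are illustrative."""
    return [
        ["rabbitmqctl", "add_vhost", vhost],
        # full configure/write/read permissions, but only inside the
        # private vhost -- nothing is granted on the default "/" vhost
        ["rabbitmqctl", "set_permissions", "-p", vhost,
         user, ".*", ".*", ".*"],
    ]

# On the client these would be executed with e.g.
# subprocess.check_call(cmd) for cmd in rabbitmq_isolation_commands(...)
```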
Fixed problems we were having with netevmon causing NNTSC to fill up its queues and therefore use huge amounts of memory. There were two components to this fix: the most effective change was to modify netevmon to only ask for one stream at a time (previously we asked for them all at once because this was the most efficient way to query the old database schema). The other change was to compress the pickled query result before exporting it which reduced the queue footprint and also meant we could send the data faster, meaning that the queue would drain quicker.
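The compression side of that fix is straightforward to sketch: pickle the query result as before, but deflate it before it goes on the export queue. Measurement rows are highly repetitive, so the saving is substantial. (Function names here are illustrative, not the actual NNTSC code.)

```python
import pickle
import zlib

def export_result(rows):
    """Compress the pickled query result before queueing it for
    export, shrinking the queue footprint and the bytes on the wire."""
    return zlib.compress(pickle.dumps(rows))

def import_result(blob):
    """Reverse the export step on the receiving side."""
    return pickle.loads(zlib.decompress(blob))
```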
Fixed a bug in ampy that was preventing events from showing up on the graphs or the dashboard. We now have a fully functioning netevmon running on prophet again.
Spent a couple of days going over the AMP event ground truth I generated a few weeks back after Meena reported that there were a number of events being reported now that didn't have ground truth. This was due to the changes and improvements I had made to netevmon while working on the ground truth -- as a result, some events disappeared but there were also a few new ones that took their place. Noticed a few bugs in Meena's new eventing script while I was doing this where it was reporting incorrect stream properties, so I tracked those down for her while I was at it.
Wrote a NNTSC dataparser for the new AMP throughput test. Found a few bugs in the test itself for Brendon to solve, but both the test and the dataparser seem to be working in the most basic cases.
Had a play with Nevil's python-libtrace code and reported a few bugs and missing features back to him. Looking forward to those being fixed as it is pretty nifty otherwise.
I have tests up and running.
There was a bit of an issue with the fact that it relies on using os.system to call tc. The test activates loss after a random time period and then measures how long it takes for the loss to be noticed. There seemed to be a problem where, if the random time period was too short, the whole program would lock up. I solved that by enforcing a minimum sleep time. This seems like a bad way of fixing the problem, but it worked and I have no idea how else to get around it.
Anyway the test is running and I'm guessing it will be running all week.
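The driver described above can be sketched like this. The tc/netem invocation is the standard way to force packet loss; the device name, delay bounds, and the one-second minimum are example values, not the actual test configuration:

```python
import os
import random
import time

MIN_DELAY = 1.0  # seconds: the workaround is simply to never sleep
                 # less than this before enabling loss

def clamped_delay(raw, minimum=MIN_DELAY):
    """Clamp the random pre-loss delay to a safe minimum."""
    return max(minimum, raw)

def activate_loss(dev="eth0", max_delay=30.0):
    """Sleep a bounded random time, then enable 100% loss with netem
    and return the time loss started, for the detection-delay
    measurement. Requires root and the netem qdisc."""
    time.sleep(clamped_delay(random.uniform(0, max_delay)))
    os.system("tc qdisc add dev %s root netem loss 100%%" % dev)
    return time.time()
```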
There is a new Open vSwitch version and a new PicOS version, so I am checking all of our past issues against those again, just in case someone fixed them. Fingers crossed.
Also played around with our SDN with Brad. We fixed the issue with it constantly disconnecting from the controller. It seems our drop rule was taking precedence over the hidden flows Open vSwitch uses to allow in-band control, so all our control traffic was getting dropped. We had an awful workaround, but then the new PicOS version fixed it.
I have been working on the fastmapping-like approach. Two drivers are required for this. One detects Paris traceroute runs that are shorter than the original MDA (load balancer detecting) run, and the other uses the data from this to initiate a further series of Paris traceroute runs using the flow ID that was used in the short Paris run. Paris traceroute consistently uses the same flow ID within a trace. The first driver has worked correctly and the second is under test.
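The first driver's comparison can be sketched as follows. The record format (dicts keyed by destination, with hop counts and flow IDs) is hypothetical; only the comparison logic matters:

```python
def short_paris_runs(mda_lengths, paris_runs):
    """Flag Paris traceroute runs that stopped short of the hop count
    the MDA run reached for the same destination, keeping the flow ID
    so the second driver can reuse it for its follow-up Paris runs.
    Data layout is illustrative, not the real driver's."""
    flagged = []
    for run in paris_runs:
        mda_len = mda_lengths.get(run["dst"])
        if mda_len is not None and run["hops"] < mda_len:
            flagged.append((run["dst"], run["flowid"]))
    return flagged
```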
For the Internet simulator a shortage of virtual memory has been encountered when processing a complete team traceroute data set from Caida. Tony has said it may be possible to require less memory by adjusting the simulator program. The focus of this work is to try and quantify the cost of control packets when using Doubletree, and to compare this with Traceroute.
It was also decided recently to try and simulate Megatree using the Internet simulator. However the amount of data required to do this is much larger than in the already too large case above. Alternatively, it may be possible to simulate Megatree using a variation of my warts analysis programs, which operates at the discovered-topology level rather than necessarily the packet level. The warts data could be processed in the order it was collected and the Megatree savings could be made using the available data. An approximation of packet usage for a given load balancer will be required, as only the grand total of packets used for bringing forward is recorded. Bringing forward is the way that MDA gains access to nested load balancers and finds flow IDs that give access to the successor set of nodes for probing. The actual probing packets are recorded. This approach should require much less computing power and still give detailed information on the performance of Megatree with various factors and levels applied. These include sequential versus parallel probing of destinations (and various degrees of this), with and without distributed Megatree, and with and without Megatree.
Spent some time tidying up the code to adjust nameservers for AMP at
runtime, and adding in configuration options to allow them to be set.
While doing this I realised that name resolution wasn't necessarily
going to respect the interface/address bindings set up for the tests, so
looked into ways I could make this happen. The best/easiest way so far
seems to be to create my own sockets for the resolver to use and then
bind them how I like. This appears to work with my testing so far, but
is possibly getting a bit too specific to the internals of the libc
library I'm using.
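The socket side of that idea is simple to illustrate. This Python sketch only shows the "create and bind our own socket" step; actually handing the socket to the resolver is the libc-specific part and is not shown:

```python
import socket

def bound_resolver_socket(source_addr):
    """Create a UDP socket bound to the test's source address, so DNS
    queries sent over it use the same interface/address bindings as
    the test traffic. Illustrative sketch, not the amplet C code."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((source_addr, 0))    # port 0 = any free local port
    return sock
```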
Also wrote some unit tests around the ICMP test response packet
processing to help make sure that malformed or incorrect packets are
correctly dealt with.
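The style of check those unit tests exercise looks roughly like this. The parser below is a toy: it validates only the length, type, and identifier of an ICMP echo reply, whereas the real amplet code checks more fields:

```python
import struct

# ICMP echo reply header: type(8) code(8) checksum(16) id(16) seq(16)
ICMP_ECHOREPLY = 0

def parse_echo_reply(packet, expected_id):
    """Return the sequence number of a valid echo reply, or None for
    truncated, wrong-type, or wrong-identifier packets. Toy version of
    the kind of validation the unit tests cover."""
    if len(packet) < 8:
        return None                       # truncated / malformed
    ptype, code, _csum, ident, seq = struct.unpack("!BBHHH", packet[:8])
    if ptype != ICMP_ECHOREPLY or ident != expected_id:
        return None                       # not a reply to our probe
    return seq
```

Unit tests then feed in deliberately broken packets (short reads, wrong identifiers) and assert that the parser rejects them rather than crashing.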
Updated the AMP dataparser in NNTSC to process more messages in a single batch before committing. This should improve speed when working through a large message backlog, as well as save on some I/O time during normal operation. This change required some modification to the way we handle disconnects and other errors, as we now have to re-insert all the previously uncommitted messages so we can't just disconnect and retry the current message.
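The batching-with-re-insert pattern can be sketched as below. The four callables stand in for the real NNTSC queue and database operations; the point is that on failure the whole uncommitted batch is handed back for re-queueing, not just the current message:

```python
def process_messages(source, insert, commit, rollback, batch_size=100):
    """Insert up to batch_size messages per commit. On failure, roll
    back and return every message since the last commit so the caller
    can re-queue them. Sketch only, not the actual dataparser."""
    pending = []
    for msg in source:
        pending.append(msg)
        if len(pending) < batch_size:
            continue
        try:
            for m in pending:
                insert(m)
            commit()
            pending = []
        except Exception:
            rollback()
            return pending        # uncommitted batch goes back on the queue
    # flush any remainder smaller than a full batch
    for m in pending:
        insert(m)
    if pending:
        commit()
    return []
```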
Tried to bring our database cursor management in line with suggested best practice, i.e. closing cursors whenever we're done with them.
Improved exporting performance by limiting frequency calculations to the first 200 rows and using a RealDictCursor rather than a DictCursor to fetch query results. The RealDictCursor means we don't need to convert results into dictionaries ourselves -- they are already in the right format so we can avoid touching most rows by simply chucking them straight into our result.
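The first optimisation above (sampling instead of scanning) is easy to illustrate. This sketch assumes rows carry a `timestamp` key and takes the most common gap in the first 200 rows as the frequency; the real ampy/NNTSC heuristic may differ:

```python
from collections import Counter

def estimate_frequency(rows, sample=200):
    """Estimate the measurement frequency from only the first `sample`
    rows rather than the full result set. Illustrative sketch."""
    times = [r["timestamp"] for r in rows[:sample]]
    gaps = [b - a for a, b in zip(times, times[1:])]
    if not gaps:
        return None               # not enough rows to tell
    # the most common inter-measurement gap wins
    return Counter(gaps).most_common(1)[0][0]
```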
Spent some time helping Meena write a script to batch-process her event data. This should allow us to easily repeat her event grouping and significance calculations using various parameters without requiring manual intervention. Found a few bugs along the way which have now been fixed.
Was planning to work the short week between Easter and Anzac day but fell ill with a cold instead.
I've been looking more deeply into Contiki this week in an attempt to become more familiar with the code. One issue I found was that Contiki only seems to support one of the five buttons on the MB950 (application board) out of the box, so I wanted to see if it were possible to fix this. It seems as though some effort has gone into writing the code such that it can be extended to support more buttons easily (for example, the button that is supported can be interchanged with any one of the other buttons), but it stops short of actually implementing multiple buttons, which is really strange. In any case, I should stop getting held up with this and move on to more relevant things. Over the next few days I will start playing with IPv6 communication and then revisit some of the initial research I did to determine which direction to take the project.
Just been writing code to set up the tests and investigating how to programmatically generate packet loss in a way that will give me the best accuracy in terms of timing.
I haven't been able to find a better option than just calling tc from within Python with os.system. I am unsure how much that will affect the timing accuracy, or how I could even measure the effect. So that is a bit of a concern, but I am soldiering on in the meantime.
I also helped Brad with our new SDN some more. It won't stay up for more than a couple of minutes, because there seem to be issues with the routeflow rules interfering with the hidden rules (which have higher priorities than the routeflow rules and therefore should not be being interfered with). The upshot is that when the neighbour resolution on the switches times out, it can't re-resolve the address of the controller, so it loses the connection. It's a bit weird.