10 Sep 2013

I've been writing the proposal and thinking about how things will scale.

To poll both ends of a path I can add an extra flow on each switch specifying the ingress switch, so that when a packet leaves the system it is counted with the other packets that entered the fabric at that switch. This will require multiple tables and stacked MPLS labels (three layers of MPLS), though it could probably be made to work with two.
This way I can poll both ends of a path, but in between I am aggregating paths, since the alternative would require a number of flows on each switch that is quadratic in the number of switches in the fabric. This is going to be a complication for accurately locating problems just by polling counters.
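As a rough illustration of the scaling argument, here is a small Python sketch (the fabric model is a made-up simplification where every ordered pair of edge switches is a path; it is only meant to show linear versus quadratic growth in per-switch flow entries):

    # Rough flow-table sizing for a fabric of n edge switches, assuming every
    # ordered pair of edge switches is a path through the fabric (toy model).
    def per_path_flows(n):
        # one flow per (ingress, egress) pair on a switch: grows as n^2
        return n * (n - 1)

    def per_ingress_flows(n):
        # one aggregated flow per ingress switch: grows as n
        return n - 1

    for n in (4, 8, 16, 32):
        print(n, per_ingress_flows(n), per_path_flows(n))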

10 Sep 2013

The Internet simulator has been run using a complete data set from CAIDA. It was necessary to increase the allowed heap size, and the list of traces to be carried out was reduced to 100,000.

The next round of scamper data collection on CAIDA and PlanetLab has been initiated.

The analysis of scamper data has been further modified to count diamonds based on the same load-balancing node. There is a certain amount of trial and error involved, so it will take some more time to finish this.
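A minimal Python sketch of the kind of regrouping this involves (the diamond representation is assumed; the real analysis works on scamper's MDA traceroute output):

    from collections import defaultdict

    # Assume each diamond has been reduced to (divergence_node, convergence_node).
    diamonds = [
        ("198.51.100.1", "198.51.100.9"),
        ("198.51.100.1", "203.0.113.5"),
        ("192.0.2.17",   "203.0.113.5"),
    ]

    # Count diamonds per load-balancing (divergence) node rather than per
    # (divergence, convergence) pair.
    per_lb_node = defaultdict(int)
    for divergence, _convergence in diamonds:
        per_lb_node[divergence] += 1

    for node, count in sorted(per_lb_node.items()):
        print(node, count)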

09 Sep 2013

Spent some more time looking at the broken NAT in VirtualBox (it appears to use lwIP) and how it was affecting my traceroute tests. In doing so, I found and fixed a few issues where incorrectly sized packets could be sent, or correctly sized packets could be sent with incorrect length fields. After switching from NAT to bridged mode everything seems to work properly.

A lot more time than I would like was spent trying to differentiate between problems caused by the NAT and problems with my tests. The ability to run tests as standalone programs was very helpful for this and made it a lot easier to pinpoint problems such as the NAT incorrectly sizing embedded packets in ICMP responses.

09 Sep 2013

Spent most of the week on leave, so not much got done this week.

In the time I was here, I fixed a number of bugs with the auto-scaling summary graph that occurred when there was no data to plot in the detail view.

I implemented yet another new algorithm for trying to determine if a time series is constant or noisy, as the previous one was pretty awful at recognising that the time series had moved from constant to noisy. The new one is better at that, but still appears to have problems for some of our streams -- it now tends to flick between constant and noisy a little too frequently -- so it will be back to the drawing board somewhat on that one.
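For illustration, here is a very rough Python sketch of one way such a classifier could look (a sliding-window test on recent variability; the window size and threshold are arbitrary, and this is an assumption about the general approach rather than the algorithm actually in use):

    import statistics

    def classify(values, rel_threshold=0.05):
        # Label a window of measurements as "constant" or "noisy" by comparing
        # its standard deviation to a fraction of its mean.
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)
        if mean == 0:
            return "constant" if stdev == 0 else "noisy"
        return "constant" if stdev / abs(mean) < rel_threshold else "noisy"

    series = [10.1, 10.0, 10.2, 10.1, 14.8, 7.3, 12.9, 10.5, 15.2]
    window = 4
    for i in range(window, len(series) + 1):
        print(i, classify(series[i - window:i]))

A single hard threshold like this is prone to exactly the flicking between states described above; separate thresholds for entering and leaving the noisy state (hysteresis) are one common way to damp it.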

08 Sep 2013

Continued work on the change-point detection algorithm; memory usage has been optimised so that only a single vector of probabilities is now stored. Testing against data shows that it is too sensitive at the moment.
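The entry does not describe the model, so the following is only a generic Python sketch of an Adams & MacKay style online change-point detector, where the run-length probability vector (plus per-run running means) is all that is carried between observations; the actual detector may well differ:

    import math

    def normal_pdf(x, mean, var):
        return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

    def detect_changes(stream, hazard=1.0 / 250.0, obs_var=1.0):
        run_probs = [1.0]   # P(current run is r observations old), for each r
        run_means = [0.0]   # running mean of the observations within each run
        run_counts = [0]    # number of observations within each run
        prev_map = 0
        changes = []
        for t, x in enumerate(stream):
            # Predictive probability of x under each candidate run length.
            preds = [normal_pdf(x, run_means[r] if run_counts[r] else x, obs_var)
                     for r in range(len(run_probs))]
            growth = [run_probs[r] * preds[r] * (1.0 - hazard)
                      for r in range(len(run_probs))]
            change = hazard * sum(run_probs[r] * preds[r]
                                  for r in range(len(run_probs)))
            new_probs = [change] + growth
            total = sum(new_probs) or 1.0
            run_probs = [p / total for p in new_probs]
            # Update the per-run-length running means to include x.
            new_means, new_counts = [x], [1]
            for r in range(len(run_means)):
                n = run_counts[r] + 1
                new_means.append(run_means[r] + (x - run_means[r]) / n)
                new_counts.append(n)
            run_means, run_counts = new_means, new_counts
            # Naive detection rule: report when the most probable run length
            # suddenly drops, i.e. the detector believes a new run has started.
            map_r = max(range(len(run_probs)), key=lambda r: run_probs[r])
            if map_r + 5 < prev_map:
                changes.append(t)
            prev_map = map_r
        return changes

    series = [5.0] * 30 + [9.0] * 30
    print(detect_changes(series, obs_var=0.5))  # reports a change shortly after index 30

In a model like this, sensitivity is mostly governed by the hazard rate and the assumed observation variance, which is presumably the kind of tuning still needed.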

I got quite busy this week with assignments so I only worked Wednesday.

07 Sep 2013

Again, not a lot to report this week as my time was spent on other assignments.

07 Sep 2013

Tested out the effect of manual Nova compute access policy changes. Have decided on project-role-based access control, as opposed to project-based access control, in order to minimise changes to the access policy file. Also need to test the ability to create custom user accounts from the admin panel of the dashboard.
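For illustration only, a project-role-based rule of the kind being weighed up might look something like this in Nova's policy.json (the action and role names are invented examples, not the actual policy being tested):

    "compute:create": "role:project_member and project_id:%(project_id)s",
    "compute:delete": "role:project_member and project_id:%(project_id)s"

Scoping the check to a role within the project keeps the number of entries that need editing small, which matches the reasoning above for preferring it over purely project-based rules.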

05 Sep 2013

Tidied up a lot of the JavaScript within amp-web. Moved all of the external scripts (i.e. stuff not developed by us) into a separate lib directory and ensured that everything used consistent and specific terminology.

Added config options to amp-web for specifying the location of the netevmon and amp meta-data databases. Previously we had assumed these were on the local machine, which proved troublesome when Brad tried to get Cuz running on warlock.

Capped the maximum range of the summary graph to prevent users from zooming out into empty space.

Fixed some byte-ordering bugs in libpacketdump's RadioTap and 802.11 header parsing on big endian architectures.
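As an illustration of the class of bug (in Python rather than the C used in libpacketdump): radiotap header fields are defined as little-endian on the wire, so they have to be decoded with an explicit byte order rather than whatever the host happens to use.

    import struct

    # First four bytes of a radiotap header: version (u8), pad (u8), length (u16, little-endian).
    radiotap_prefix = b"\x00\x00\x20\x00"

    # Native byte order ("=") is wrong on big-endian hosts, where the 16-bit
    # length would be read as 0x2000 (8192) instead of 0x0020 (32).
    version, pad, length_native = struct.unpack("=BBH", radiotap_prefix)

    # Forcing little-endian ("<") decodes correctly on every architecture.
    version, pad, length_le = struct.unpack("<BBH", radiotap_prefix)

    print(length_native, length_le)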

03 Sep 2013

Got the AMP packaging for CentOS working well enough that I can now
build and install packages that almost work straight out of the box.
Split the package into two parts - one that operates with a local broker
and one without. Merged some of the required changes back into the
Debian versions too.

While testing the packaging on a CentOS virtual machine I found a few
interesting issues. Tests where the targets were resolved at runtime
were being run to two targets, but with the same address - for some
reason identical, duplicate IPv4 addresses were being returned if
getaddrinfo() was given AF_UNSPEC (which I do because I want both IPv4
and IPv6 addresses if available). Also, with the traceroute test I was
seeing some very high latency to some hops, extra hops at the end of my
paths, etc. Some of this was due to late responses arriving and being treated the same as on-time ones, and some of it appears to be related
to possibly broken behaviour in the VM NAT - ICMP TTL expired messages
are being received where the TTL in the embedded packet is too large to
have expired on that hop!
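A small Python illustration of the duplicate-address symptom and an obvious guard against it (the hostname is just a placeholder; the real tests do this in C):

    import socket

    # Ask for both IPv4 and IPv6 results, as the tests do with AF_UNSPEC.
    results = socket.getaddrinfo("example.com", None, socket.AF_UNSPEC,
                                 socket.SOCK_DGRAM)
    addresses = [sockaddr[0] for family, socktype, proto, canon, sockaddr in results]
    print("raw:", addresses)

    # De-duplicating the results guards against a resolver (or, as here, a
    # misbehaving VM NAT environment) returning the same address twice.
    unique = []
    for addr in addresses:
        if addr not in unique:
            unique.append(addr)
    print("unique:", unique)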

Spent some time trying to figure out exactly what was going on in the VM
case, and how best to make the test robust in these cases.

03 Sep 2013

Since doing my presentation I have done a bit more reading and have just started on writing the proposal.

So the idea is to add fault detection to the distributed router used for Cardigan, looking at packet counts hitting various flows/ports and injecting packets to determine when there is a problem. I have to look at how to make it quick to react without over-reporting or overwhelming the controller.
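A minimal Python sketch of the counter-comparison half of this (the counts and the loss tolerance are invented; a real version would poll the flow counters from the switches over OpenFlow):

    # Compare packets that entered the fabric for a path with packets that
    # left it; a persistent discrepancy beyond a tolerance suggests a fault.
    def check_path(ingress_count, egress_count, tolerance=0.01):
        if ingress_count == 0:
            return "ok"
        loss = (ingress_count - egress_count) / float(ingress_count)
        return "suspect" if loss > tolerance else "ok"

    samples = [(10000, 9998), (10500, 9100), (9800, 9795)]
    for ingress, egress in samples:
        print(ingress, egress, check_path(ingress, egress))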

I would like to use OpenFlow groups for this, since they have failover mechanisms; however, these are not implemented by anyone as far as I am aware.
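For reference, the failover mechanism in question is the OpenFlow fast-failover group type: a group holds an ordered list of buckets, each watching a port, and the first bucket whose watched port is still up is the one that forwards. A toy model of that selection logic in Python (not tied to any particular switch or controller):

    # Toy model of an OpenFlow fast-failover group: buckets are tried in
    # order and the first bucket whose watched port is live gets used.
    def select_bucket(buckets, live_ports):
        for watch_port, actions in buckets:
            if watch_port in live_ports:
                return actions
        return None   # no live bucket: traffic is dropped

    group = [(1, "output:1"), (2, "output:2")]
    print(select_bucket(group, live_ports={1, 2}))   # primary path used
    print(select_bucket(group, live_ports={2}))      # fails over to the backup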