Florin Zaicu's blog

20 Feb 2018

Carrying on from last week, I finished creating a multi-link failure scenario for a Fat Tree topology of k=4. I then collected recovery time stats, which I have cleaned up and graphed. While collecting the recovery stats for the topology I found, and fixed, a bug that was causing the VM to crash. The simulation framework stops the pktgen generation process by sending a SIGINT to its PID. Due to the way it was being recorded, this PID was occasionally incorrect, so the framework would terminate a process critical to the system, crashing the VM.
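As a sketch of the safer pattern (not necessarily how the framework now does it): keep the Popen handle for the generator and signal that handle directly, rather than re-reading a PID that may point at another process. The wrapper script and interface names are illustrative.

```python
import signal
import subprocess

# Start the traffic generator and hold on to the Popen handle, rather than
# recording a PID out-of-band and signalling it later. The handle always
# refers to the process we actually started, so we can never SIGINT an
# unrelated (possibly critical) process. 'pktgen_wrapper.sh' and 'h1-eth0'
# are illustrative names.
gen = subprocess.Popen(['./pktgen_wrapper.sh', 'h1-eth0'])

# ... run the failure experiment ...

# Stop generation: signal the tracked process and wait for it to exit.
gen.send_signal(signal.SIGINT)
gen.wait(timeout=10)
```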

Closer to the end of the week I started investigating further, separating the switches into their own Mininet namespaces. I found that this is not supported by Mininet, as Open vSwitch needs to be exposed in order to establish a connection to the OpenFlow controller. The only way to change this behaviour would be to modify Mininet itself, which doesn't seem like a good idea at this point in time. At the end of the week, I started looking at adding latency to the control channel to better simulate real network conditions.
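I have only started looking at this, but one possible approach, assuming the default Mininet setup where the switches reach a local controller over the loopback interface, is a netem qdisc on lo (the 20 ms figure is purely illustrative):

```python
import subprocess

# Add artificial latency to the OpenFlow control channel by delaying traffic
# on the loopback interface, which carries the switch-to-controller messages
# in a default single-machine Mininet setup.
subprocess.run(['tc', 'qdisc', 'add', 'dev', 'lo', 'root',
                'netem', 'delay', '20ms'], check=True)

# Remove the qdisc again once the experiment is done:
# subprocess.run(['tc', 'qdisc', 'del', 'dev', 'lo', 'root'], check=True)
```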

12 Feb 2018

This week I spent some time looking at larger topologies that are common in datacenter and carrier networks. I implemented a module which allows creating a dynamic k-ary FatTree topology in Mininet (common in DC networks). I then modified the controllers to find host locations in the topology dynamically. These used to be statically specified; however, with the new FatTree topo the pre-specified locations no longer worked. Host discovery uses a similar mechanism and approach to Ryu's link discovery (LLDP packets), but without a liveness check mechanism, so we only use the packets to figure out where the hosts are in our topology.
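For reference, a minimal sketch of a k-ary FatTree built as a Mininet Topo; the naming scheme and explicit dpid assignment are illustrative, not necessarily what my module does:

```python
from mininet.topo import Topo


class FatTreeTopo(Topo):
    """Sketch of a k-ary FatTree (k even): (k/2)^2 core switches, k pods of
    k/2 aggregation and k/2 edge switches, and k/2 hosts per edge switch."""

    def build(self, k=4):
        half = k // 2
        self._dpid = 0
        cores = [self._switch('c%d' % i) for i in range(half * half)]
        host_num = 0
        for pod in range(k):
            aggs = [self._switch('a%d_%d' % (pod, j)) for j in range(half)]
            edges = [self._switch('e%d_%d' % (pod, j)) for j in range(half)]
            for j, agg in enumerate(aggs):
                # Aggregation switch j connects to core switches
                # j*half .. j*half + half - 1.
                for c in range(half):
                    self.addLink(agg, cores[j * half + c])
                # Full mesh between aggregation and edge layers in the pod.
                for edge in edges:
                    self.addLink(agg, edge)
            # k/2 hosts per edge switch.
            for edge in edges:
                for _ in range(half):
                    host_num += 1
                    self.addLink(edge, self.addHost('h%d' % host_num))

    def _switch(self, name):
        # Explicit, globally unique dpids avoid clashes between the default
        # dpids Mininet would derive from the prefixed names.
        self._dpid += 1
        return self.addSwitch(name, dpid='%016x' % self._dpid)


topos = {'fattree': FatTreeTopo}
```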

I then spent the remainder of the week looking at writing failure scenarios for a FatTree topology of k=4. I have worked on several diagrams that show how path splicing will behave for specific link failure scenarios, as well as the paths taken by the reactive controller. I am currently in the process of finishing off a multi-link failure scenario for the new topology, after which I will collect and process recovery stats for it.

07 Feb 2018

This week I modified the simulation framework to allow me to perform multi-link timing tests. The failure config file has been modified to allow specifying multiple prober locations. After a bit of initial planning and testing, I found that when dealing with multi-link failures we need to be careful when selecting the logger locations, as a poorly placed logger may not be able to time a particular link failure. Multi-link failures are now performed sequentially, i.e. we take down the first link and time it, then we take down the next, and so on.
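For illustration, the kind of information a multi-link scenario now carries, expressed as a Python dict with made-up field names (not the framework's exact schema):

```python
# An illustrative multi-link failure scenario. Links are failed sequentially,
# each with its own pair of logger locations so every failure can be timed.
scenario = {
    'name': 'fattree_k4_multi_link',
    'failures': [
        {
            'link': ('e1', 'a1'),           # first link to take down
            'primary_logger': 'e1-eth3',    # logger on the primary path
            'secondary_logger': 'a2-eth1',  # logger on the recovery path
        },
        {
            'link': ('a1', 'c1'),           # second link, failed after the first
            'primary_logger': 'a1-eth1',
            'secondary_logger': 'c2-eth1',
        },
    ],
}
```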

I then modified the disabled multi-link failure scenario, collected recovery times for it, and cleaned up and processed the stats. With this, I have completed the set of recovery time stats for all current network and failure scenarios I have created.

30 Jan 2018

This week was very short due to NZNOG. I worked on cleaning and processing the recovery times I re-collected after the modifications to the timing method of the simulation framework. I have graphed and computed stats based on the collected raw data. I then quickly fixed up comments in my simulation framework and started looking at my multi-link failure scenario. This failure scenario is currently disabled, as the reactive and proactive controllers weren't timing the same thing. I am currently in the process of devising a better way to handle multi-link failure scenarios and modifying this one so that the recovery times are comparable between the two controller types.

23 Jan 2018

This week I looked into why the timing results that I collected vary by a large amount across sequential executions of the experiment. After analysing packet traces, I found that tcpdump wasn't showing all packets on the primary logger (the logger on the primary path directly connected to the failed link). When a link is taken down, tcpdump crashes silently. I was getting packet information by piping tcpdump's output to a file, and tcpdump buffers its output. The buffer doesn't get fully flushed when the app exits due to the link going down, so fewer packets appear in the output, making the results vary by a large margin depending on the state of the buffer and how many packets were captured. We can fix this problem by running tcpdump with the --immediate-mode flag, which disables the output buffering.
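A minimal sketch of starting a logger with the fix applied; the interface and output path are illustrative:

```python
import subprocess

# Start a logger with --immediate-mode so captured packets are handed over
# as they arrive rather than sitting in a buffer when the interface goes
# down. -tt prints raw timestamps, -n disables name resolution.
with open('/tmp/primary_logger.txt', 'w') as out:
    logger = subprocess.Popen(
        ['tcpdump', '--immediate-mode', '-tt', '-n', '-i', 's1-eth2'],
        stdout=out, stderr=subprocess.DEVNULL)
```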

Next, I looked at mitigating the effect of the loggers' location in the network on recovery time. Because I was using packet timestamps, the location of the logger could skew the recovery time calculation depending on the length of the recovery or detour path. I fixed this issue by re-implementing the loggers in libtrace and creating a libtrace app that calculates the recovery time. The app takes two packet traces and uses the pktgen timestamps to calculate the recovery time. It also computes packets lost by comparing the pktgen sequence numbers between the traces on the two loggers (the last packet on the primary logger and the first packet on the secondary logger). Using the pktgen fields also allows us to place the two loggers on separate switches; previously we had to keep them on the same switch to account for clock differences between the virtual switches.
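The app itself is written with libtrace; the sketch below shows the same calculation in Python with scapy, assuming the standard Linux pktgen payload layout (magic, sequence number, seconds, microseconds, all in network byte order):

```python
import struct

from scapy.all import UDP, rdpcap

PKTGEN_MAGIC = 0xbe9be955


def pktgen_fields(pkt):
    """Return (seq, timestamp) from the pktgen header carried in the UDP
    payload, or None if the packet is not a pktgen probe."""
    if UDP not in pkt:
        return None
    payload = bytes(pkt[UDP].payload)
    if len(payload) < 16:
        return None
    magic, seq, sec, usec = struct.unpack('!IIII', payload[:16])
    if magic != PKTGEN_MAGIC:
        return None
    return seq, sec + usec / 1e6


def recovery_stats(primary_trace, secondary_trace):
    """Recovery time is the gap between the last probe seen on the
    primary-path logger and the first probe seen on the secondary-path
    logger; loss is the gap in pktgen sequence numbers between them."""
    primary = [f for p in rdpcap(primary_trace) if (f := pktgen_fields(p))]
    secondary = [f for p in rdpcap(secondary_trace) if (f := pktgen_fields(p))]
    last_seq, last_ts = primary[-1]
    first_seq, first_ts = secondary[0]
    return first_ts - last_ts, first_seq - last_seq - 1
```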

After these modifications, I have recollected the recovery time stats on my lab machine. I am currently in the process of finishing off the cleaning and re-graphing of these stats.

16 Jan 2018

This week I finished the implementation of all parts of the optimisation mechanism of the proactive controller, which optimises installed recovery paths. While testing I have found several issues and bugs which I have resolved. I then refactored the controllers to clean them up and remove any duplicated code where appropriate. I have also ensured that the controllers have adequate comments.

I then modified the recovery time simulation framework to allow specifying the wait state for each individual topology. My recovery time framework uses a wait state file which defines the state the network switches (their flow and group rule elements) have to be in before starting the failure experiment. Before this modification the wait state was defined per controller. Differences between topologies could make the wait state mechanism inaccurate, and it would only have worked on topologies that are similar, which restricts our simulation testing.
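An illustrative per-topology wait state description, with made-up field names rather than the framework's exact format:

```python
# The experiment does not start until every switch listed here has reached
# the expected number of flow and group entries for this topology/controller
# combination. Counts and names are illustrative.
wait_state = {
    'topology': 'fattree_k4',
    'controller': 'proactive',
    'switches': {
        'e1': {'flows': 6, 'groups': 2},
        'a1': {'flows': 8, 'groups': 4},
        'c1': {'flows': 4, 'groups': 0},
        # ... one entry per switch in the topology
    },
}
```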

At the end of the week, I cleaned up, processed and graphed the stats I have collected, which look at the recovery time of the 3 controllers on 3 different topologies using 5 failure scenarios.

08 Jan 2018

This week was a shorter week. I finished implementing the modifications to the proactive controller to allow optimisation of recovery paths. The proactive controller will still use protection to quickly recover from failures at the switch level; however, the controller is now made aware of links that have failed. The failed link information is then used to update the controller's view of the topology and re-compute optimal paths in the network.
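A rough Ryu sketch of the idea (not the actual controller code): fast-failover groups handle the failure at the switch, while the controller reacts to port-down notifications by updating its topology view and recomputing paths. mark_link_failed and recompute_paths are placeholders for the controller's own bookkeeping.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class ProactiveOptimiser(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPortStatus, MAIN_DISPATCHER)
    def _port_status_handler(self, ev):
        msg = ev.msg
        ofproto = msg.datapath.ofproto
        if msg.desc.state & ofproto.OFPPS_LINK_DOWN:
            # Remove the failed link from the controller's topology view and
            # recompute optimal paths; both methods are placeholders.
            self.mark_link_failed(msg.datapath.id, msg.desc.port_no)
            self.recompute_paths()
```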

15 Dec 2017

At the start of this week, I finished modifying the proactive protection recovery controller to retrieve and use the network topology dynamically when computing paths through the network. I then worked on improving the path splicing algorithm by allowing it to consider more nodes from the primary and secondary paths as potential splice source and destination nodes. I finished implementing a new version of the path splice computation algorithm, which seems to produce paths that overlap less and end closer to the destination node.
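A simplified sketch of the splice computation with networkx (not the controller's exact algorithm): every node on the primary path is a potential splice source, every later node on the primary or secondary path a potential splice destination, and each splice is a shortest path computed with the primary path's links removed so it overlaps as little as possible.

```python
import networkx as nx


def candidate_splices(graph, primary, secondary):
    """Return {(src, dst): path} splice candidates for a primary path and
    its secondary path, avoiding the primary path's own links."""
    avoid = nx.Graph(graph)
    avoid.remove_edges_from(zip(primary, primary[1:]))
    targets = list(dict.fromkeys(primary[1:] + secondary[1:]))
    splices = {}
    for i, src in enumerate(primary[:-1]):
        for dst in targets:
            if dst in primary[:i + 1]:
                continue  # never splice back towards the source
            try:
                splices[(src, dst)] = nx.shortest_path(avoid, src, dst)
            except nx.NetworkXNoPath:
                continue
    return splices
```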

I also added a few more failure scenarios and network topologies to my testing framework. I then modified the failure scenario parser/loader and files of my simulation framework to allow specifying different tcpdump logger locations for various network topologies. The loggers are used to calculate recovery time under certain failure scenarios; this modification was needed because different controllers will produce different recovery paths. Several other fixes and changes were also made to resolve bugs and problems that I found.

At the end of the week, I started working on extending the proactive controller to receive link failure notifications and optimise the pre-installed recovery paths based on the new network topology.

07 Nov 2017

Last week I kept reading through more traffic engineering papers. The TE papers I am currently looking through cover different network types and look at various TE goals, such as resource utilisation optimisation, QoE maximisation and congestion minimisation. They were found through a literature review paper that looks at how SDN can benefit TE. The papers that I am focusing on look at OpenFlow and TE. Last week I also finished and ran my fast-failover group timing tests.

This week I am planning on finishing off reading the TE with OpenFlow papers from the literature review. I also want to try to run more tests and get a sense of how long a recovery-based error detection method (one not using fast-failover groups and precomputation) takes to complete. I currently have results for protection and would like to compare them with recovery. I am also in the process of looking for source code for some of the error detection and recovery systems presented in the papers I have read through. I would like to run some of these systems or methods and potentially assess their behaviour, problems and performance.

31 Oct 2017

Last week I carried on reading through a few more traffic engineering papers and also looked at a couple of SDN controller performance evaluation papers. At the end of the week, I started to set up VM environments to run some quick tests and hopefully benchmark some of the solutions available for error detection.

This week I have a few more TE papers to go through. I also want to run some tests to find out how long the fast-failover group takes to switch to a new bucket when a link goes down. I also want to try to run a few benchmarks on some of the current error detection and recovery methods/systems from the papers I have read.
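For context, this is the mechanism being timed, sketched as a Ryu helper that installs an OpenFlow 1.3 fast-failover group; the port numbers and group id are illustrative:

```python
def install_ff_group(datapath, group_id, primary_port, backup_port):
    """Install a fast-failover group with one bucket per port. The switch
    forwards out the first live bucket, so when primary_port goes down it
    switches to backup_port locally, without involving the controller."""
    ofp = datapath.ofproto
    parser = datapath.ofproto_parser
    buckets = [
        parser.OFPBucket(watch_port=primary_port,
                         actions=[parser.OFPActionOutput(primary_port)]),
        parser.OFPBucket(watch_port=backup_port,
                         actions=[parser.OFPActionOutput(backup_port)]),
    ]
    datapath.send_msg(parser.OFPGroupMod(datapath, ofp.OFPGC_ADD,
                                         ofp.OFPGT_FF, group_id, buckets))
```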