Finished gathering the field-testing data on Caida. Began the next run, which tests the efficiency of several flow ID selection modes. This required updating the driver previously used to collect this data type on Yoyo.
Continued work on the slide set for the conference: ran more statistical tests, produced more graphs from the raw data, and pruned the slide set further.
Managed to get a working implementation of Flott that performs the initialisation and calculations needed to obtain the t-entropy of a given string! It took longer than expected: I was right about the objects and functions I would need from the original source code, but I had missed a number of lines in different places, so the tokens and values used in the calculations were incorrect and the output was wrong. I spent many, many hours adding debugging output to both my implementation and the original code after each iteration, comparing the results to figure out what had gone wrong. That got me to a t-entropy value very close to the original program's output; going over the original code once more revealed a scaling factor I had missed, which fixed the last discrepancy.
Over the next week, the plan is to refactor the code and finalise it for addition to Netevmon.
Finished reformatting the data to remove some mess and unnecessary
layers of nesting that had crept in while trying different things. It
should now be set up to deal properly with representing multiple lines,
split up or grouped by however the backend wants to do so. Updated all
the tests to use the new data format.
Spent an afternoon with Shane and Brad designing how we are going to
represent graphs with multiple lines, in a way that will let us merge
and split data series based on how the user wants to view the data.
Tidied up the autogenerated colours for the smokeping graphs to use
consistent series colours across the summary and detail views, while
also being able to use the default smokeping colouring if there is only
a single series being plotted.
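One way to keep series colours consistent across the summary and detail views is to derive each colour deterministically from the series name rather than from its position in the legend. A minimal sketch of that idea (illustrative only, not the actual graphing code):

```python
import hashlib

def series_colour(series_name, palette):
    """Pick a colour for a data series deterministically from its name,
    so the same series gets the same colour on both the summary and
    detail graphs regardless of which other series are plotted."""
    digest = hashlib.md5(series_name.encode()).hexdigest()
    return palette[int(digest, 16) % len(palette)]
```

Because the colour depends only on the name, adding or removing other series from a graph never reshuffles the colours of the remaining ones.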
I added multiple-table support to RouteFlow, and am now trying to add my 591 work on top of that, but it is taking longer than I expected.
Having multiple tables does simplify the structure of things and fixes most of the interface issues I had with the older version of this code, but in spite of this I am having a lot of problems getting it to work.
The new psycopg2-based query system was generally working well but using significant amounts of memory. This turned out to be due to the default cursor being client-side, which meant that the entire result was being sent to the querier at once and stored in memory. I changed the large data queries to use a server-side cursor which immediately solved the memory problem. Instead, results are now shipped to the client in small chunks as needed -- since the NNTSC database and exporter process are typically located on the same host, this is not likely to be problematic.
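The change amounts to giving the cursor a name so psycopg2 creates it server-side. A minimal sketch (the function and query names are illustrative, not NNTSC's actual code):

```python
def stream_query(conn, query, chunk=2000):
    """Stream a large result set using a psycopg2 server-side cursor.

    A default (unnamed) cursor is client-side: the entire result set is
    pulled into client memory as soon as execute() runs. Naming the
    cursor makes it server-side, so rows stay on the server and are
    shipped to the client in batches of `itersize` as we iterate.
    """
    cur = conn.cursor(name="nntsc_export")  # named => server-side
    cur.itersize = chunk                    # rows fetched per round trip
    cur.execute(query)
    for row in cur:
        yield row
    cur.close()
```

The trade-off is extra round trips between client and server, which is cheap when both processes sit on the same host.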
Netevmon now tries to use the measurement frequency reported by NNTSC for the historical data wherever possible, rather than guesstimating the frequency from the time difference between the first two measurements. The previous approach was failing badly with our new one-stream-per-tested-address approach for AMP, as individual addresses were often tested intermittently. If there is no historical data, a new algorithm is used that simply finds the smallest difference between consecutive timestamps in the first N measurements and uses that.
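The fallback algorithm can be sketched as a standalone function like this (a hypothetical reconstruction; the function name and the value of N are illustrative):

```python
def estimate_frequency(timestamps, n=10):
    """Fallback frequency guess when no history is available: take the
    smallest positive gap between consecutive timestamps among the
    first n measurements. Using the minimum rather than the first gap
    tolerates intermittently tested streams, where early gaps can be
    several multiples of the true measurement interval."""
    recent = sorted(timestamps[:n])
    gaps = [b - a for a, b in zip(recent, recent[1:]) if b > a]
    return min(gaps) if gaps else None
```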
Changed the table structure for storing AMP traceroute data. The previous method was causing too many problems and required too much special treatment to query efficiently. In the end, we decided to bite the bullet and re-design the whole thing, at the cost of all of the traceroute data we had collected over the past few months (actually, it is still there but would be painful to convert over to the new format).
Had a long but fruitful meeting with Brendon and Brad where we worked out a 'view' system for describing what streams should be displayed on a graph. Users will be able to create and customise their own views and share them easily with other users. Stream selections will be described using expressions rather than explicitly listing stream ids as it is now (although listing specific streams will still be possible).
This will allow us to create a graph showing a single line aggregating all streams that match the expression: "collection=amp-icmp AND source=ampz.waikato.ac.nz AND destination=www.google.com AND family=ipv4". Our view could also include a second line for IPv6. By using expressions, we can have the view automatically update to include new streams that match the criteria after the view was created, e.g. new Google addresses.
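As a rough sketch of how such an expression could select streams once parsed into its ANDed equality terms (purely illustrative; the real view system is still being designed):

```python
def matches(stream, criteria):
    """Return True if a stream's properties satisfy every term of a
    view expression (a conjunction of equality terms, as in the
    example above)."""
    return all(stream.get(k) == v for k, v in criteria.items())

# One line of a hypothetical view: aggregate all matching streams.
view_line = {
    "collection": "amp-icmp",
    "source": "ampz.waikato.ac.nz",
    "destination": "www.google.com",
    "family": "ipv4",
}

streams = [
    {"id": 1, "collection": "amp-icmp", "source": "ampz.waikato.ac.nz",
     "destination": "www.google.com", "family": "ipv4"},
    {"id": 2, "collection": "amp-icmp", "source": "ampz.waikato.ac.nz",
     "destination": "www.google.com", "family": "ipv6"},
]

selected = [s["id"] for s in streams if matches(s, view_line)]
```

Because selection is re-evaluated against the expression, a newly added stream with matching properties is picked up automatically, with no change to the view.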
Finished all of my chapters this week and have had them reviewed, just a matter now of tidying up a few places and then I'll be done.
I gave the final version of my thesis to the printers so they can do the hardbound version.
Moved the multiple series line graphs back to using the smokegraph
module, but with colouring based on the series rather than to indicate
loss. This appears to work well for the smaller data series that I've
tested on, though I have yet to get a sensibly aggregated set of data
for those graphs with very large numbers of streams.
The new graphs with arbitrary numbers of data series had been
triggering event labels on mouseover for almost every series except the
first, which I fixed: now only a dummy series triggers mouse events, so
the graph doesn't try to display information about every single data
point. Through profiling I also found many extraneous loops and event
checks that could be avoided by properly disabling events on the
summary graph as well.
Also spent some time reading and critiquing honours reports, not long to go!
I have been monitoring the Caida run of the non-5-tuple field analysis and have started downloading the completed warts sets.
The data analysis of the warts files from Caida and PlanetLab now includes a dump of load balancer IDs and next hops. I wrote a Perl program that uses this data to count the load balancers found by only one of the vantage points, to help determine how effective our coverage is.
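The counting logic amounts to tallying how many vantage points observed each load balancer ID. A hypothetical Python equivalent of the Perl script (names are illustrative):

```python
from collections import Counter

def unique_to_one_vp(seen_by):
    """Count load balancers observed from exactly one vantage point.

    `seen_by` maps vantage point -> set of load balancer IDs observed
    from it. A high count here suggests the vantage points are seeing
    largely disjoint parts of the network.
    """
    counts = Counter(lb for lbs in seen_by.values() for lb in lbs)
    return sum(1 for c in counts.values() if c == 1)
```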
Data from the fourth scamper run has been processed and compared to the first run. These results have been incorporated into the conference slides. Furthermore, the introduction of the slide set has been extended, more graphs have been added, and some pruning has been carried out.
The large analysis of Caida runs on wraith is still running. This should provide information about how many new load balancers are added with each new vantage point. It is using quite a lot of memory, as the data from all the vantage points is held in memory at the same time, so wraith is ideal for the purpose.
Over the last two weeks, I have been working on the TEntropy detector.
During the first week, I used anomaly_ts and anomaly_feed to produce output for a number of different streams, using combinations of different metrics, string lengths, sliding window sizes, and range delimiters. After producing a string for each sliding-window sample, a Python script calls the external t_Entropy function with the string as a parameter to obtain the average t-entropy for each string, and pipes the output to a file. I then wrote another Python script to generate a Gnuplot script for time-series graphs so that I could inspect the results. At this point it was apparent that the t-entropy detector was a feasible option, so I had to start implementing the actual t-entropy calculations within Netevmon.
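The string-building step can be sketched as follows (an illustrative reconstruction; the actual metrics, tokens, and delimiters vary per configuration):

```python
def sliding_strings(values, window, delimiters):
    """Map each measurement to a token by comparing it against range
    delimiters, then emit one string per sliding-window position.
    Each resulting string is what gets passed to the t-entropy
    calculation."""
    def tokenise(v):
        # First delimiter whose upper bound exceeds v decides the token.
        for i, upper in enumerate(delimiters):
            if v < upper:
                return chr(ord('a') + i)
        return chr(ord('a') + len(delimiters))

    tokens = [tokenise(v) for v in values]
    return ["".join(tokens[i:i + window])
            for i in range(len(tokens) - window + 1)]
```

A stable measurement series yields strings of repeated tokens (low t-entropy), while an anomaly introduces new tokens and drives the entropy up, which is what the detector keys on.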
Spent last week going over the t-entropy library I found, called Fast Low Memory T-transform (flott), which computes the T-complexity of a string, from which the t-entropy is derived. Unfortunately, the library consists of around a dozen .c and header files, which made it somewhat tricky to determine which parts I would need. So, I spent around three days reading the source code and trying to understand it before starting to add the necessary bits to a new detector. I found the function that calculates the actual t-complexity, t-information and t-entropy values, and have been working on duplicating those calculations. However, a number of other initialisation functions are required before these values can be calculated, so I will have to look into them at some point.
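For reference, my current understanding of how the three quantities relate (based on my reading of the flott code and the T-complexity literature; the exact log bases and scaling constants are precisely the details I still need to verify against the source):

```latex
% Given the T-decomposition of a string x into n T-prefixes with
% copy exponents k_1, ..., k_n:
C_T(x) = \sum_{i=1}^{n} \log_2 (k_i + 1)
    % T-complexity of x
I_T(x) = \mathrm{li}^{-1}\!\bigl( C_T(x) \bigr)
    % t-information, via the inverse logarithmic integral
H_T(x) = \frac{I_T(x)}{|x|}
    % t-entropy: average t-information per symbol
```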
Also had a bunch of marking to do, so couldn't spend all week working on the flott adaptation.