Andrew Mackintosh's blog

02 Feb 2015

This week I have been extending the code for gathering flow information so that it provides a "Flow Fingerprint". This consists of the server IP address, the server port, the transport protocol and the application protocol as identified by libprotoident.
I surmise that these attributes are enough to identify the majority of elephant flows. For example, "TCP traffic to a port other than 80, directed toward a Dropbox server and using the HTTP protocol" is likely an elephant flow.

Combining this simple approach with the data gathered by the flow information scraper and a manually set threshold (C) should make it possible to identify common elephant flow configurations.
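As a rough illustration of the fingerprint idea, the sketch below counts how often each fingerprint is seen on a large flow and keeps those seen at least C times. The names `FlowFingerprint` and `elephant_fingerprints`, the input format, and the reading of C as a count of large flows are all assumptions made for the example, not the actual scraper code.

```python
from collections import Counter
from typing import NamedTuple


class FlowFingerprint(NamedTuple):
    """Hypothetical container for the four fingerprint attributes."""
    server_ip: str
    server_port: int
    transport: str    # e.g. "TCP"
    application: str  # protocol label, e.g. as reported by libprotoident


def elephant_fingerprints(flows, size_threshold, c):
    """Return fingerprints seen on at least `c` flows above `size_threshold` bytes.

    `flows` is an iterable of (FlowFingerprint, total_bytes) pairs
    (an assumed input format).
    """
    hits = Counter(fp for fp, total_bytes in flows if total_bytes > size_threshold)
    return {fp for fp, count in hits.items() if count >= c}
```

Both thresholds would need tuning against real traces; the values are left to the caller here.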

27 Jan 2015

I completed my presentation and delivered it to members of the machine learning group on Tuesday.
It was agreed that my investigation involves a number of complex problems, and that a simple approach would be required for the time being to create a training data set of elephant flows.

I was also advised to use the command line version of Weka when using large data sets, as it tends to be more efficient.

With a shift in focus, I will now be investigating simpler methods for identifying hot-spot IP address and port combinations. Essentially, each time a packet exceeds an elephant threshold, its IP address and port number will be logged. If a certain number of elephants are observed from this combination, every flow originating from there will be treated as an elephant.
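The hot-spot logging described above could look something like the following sketch. `HotSpotTracker` and `min_elephants` are hypothetical names, and the decision of when a flow counts as an elephant is assumed to be made elsewhere.

```python
from collections import defaultdict


class HotSpotTracker:
    """Counts elephants per (IP address, port) and flags hot spots."""

    def __init__(self, min_elephants):
        # Number of elephants that must be seen before a combination is a hot spot.
        self.min_elephants = min_elephants
        self.elephant_counts = defaultdict(int)

    def record_elephant(self, ip, port):
        """Log one elephant observed from this IP address and port."""
        self.elephant_counts[(ip, port)] += 1

    def is_hot_spot(self, ip, port):
        """True if new flows from this combination should be treated as elephants."""
        return self.elephant_counts.get((ip, port), 0) >= self.min_elephants
```

A new flow would be checked with `is_hot_spot` when it first appears, so no per-flow measurement is needed for combinations that are already known to be hot.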

19 Jan 2015

This week I have been doing a bit of tidying up.
I have spent some time tweaking the language in my literature review and fixing some of the references that got mangled.
I have also been adding more information to my presentation that I will be making to the Machine Learning group this Tuesday.

I have also fixed the build script for my flow-crunching program, which extracts attributes from the flows contained in a libtrace-compatible file. This is intended for offline use.

09 Jan 2015

After speaking with Bernhard at the end of the year, I have been working on a presentation to be made at the next Machine Learning Group meeting.

The presentation is intended to elicit any ideas that members of the ML group might have regarding a solution to the problem.
It summarises the problem that my research attempts to address and the progress I have made so far in my investigation, especially items taken from my literature review.

The feedback I receive from this presentation will inform the direction that any development takes.

15 Dec 2014

I have concluded the hard investigation for my literature review, and am now focusing on including more details in my summaries and general polish to the writing.

I am also investigating data stream mining after having a conversation with Sam Sarjent about the topic. I have arranged a meeting with Bernhard Pfahringer to discuss this area of research further, and to see if it can be used for this project.

This week I shall conclude writing the literature review, and continue investigating data stream mining techniques for flow classification. Hopefully by the end of the week I shall be able to make an informed decision about which type of flow classifier would be practical and accurate enough to implement.

08 Dec 2014

I continued my investigation into elephant flow detection by looking into prior work using machine learning techniques.
One paper I read provided an in-depth comparison of the classifiers present in WEKA for performing traffic classification, and importantly compared the performance of each rather than just the accuracy.

Flow clustering can also be used to identify elephant flows, by sorting traffic into a pair of clusters and inspecting the mean flow length of each. This technique was originally used to sort flows into multiple clusters of traffic types (such as interactive, bulk-transfer, etc).
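To make the two-cluster idea concrete, here is a minimal sketch that splits a list of flow lengths into two groups with a basic one-dimensional k-means (k = 2). This is only an illustration, not the method from the papers; the function name `two_means` and the choice to cluster on flow length alone are assumptions.

```python
def two_means(values, iterations=20):
    """Split 1-D values into two clusters using a basic k-means with k = 2.

    Returns (small, large), where `large` has the higher mean.
    Raises ValueError if `values` is empty.
    """
    low, high = min(values), max(values)  # initial centroids
    small, large = list(values), []
    for _ in range(iterations):
        # Assign each value to the nearer centroid (ties go to the low cluster).
        small = [v for v in values if abs(v - low) <= abs(v - high)]
        large = [v for v in values if abs(v - low) > abs(v - high)]
        if not large:  # every value is identical, nothing to split
            break
        low = sum(small) / len(small)
        high = sum(large) / len(large)
    return small, large
```

The cluster with the larger mean flow length would then be taken as the elephants. Because flow sizes are heavy-tailed, clustering on the logarithm of the length may separate the groups more sensibly.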

Flow clustering can also be used as part of "semi-supervised learning", where flows are sorted into clusters and pre-labeled data is used to determine which cluster corresponds to which protocol.
This could likely be adapted to work with flow size, but is less useful as there would be only two clusters.

The Emilie classification system uses a Support Vector Machine classifier with the first three packets of each flow to determine the protocol of the flow. This is attractive as it requires no deep packet inspection, and can act on the headers of the packets. A system such as this could be adapted for our purposes.

This week, I shall conclude my literature review by investigating possible ways to integrate the detectors I have looked at into existing SDN architectures.

01 Dec 2014

This week I have spent most of my time focusing on the statistical detection methods. It is widely considered infeasible to maintain a full flow table for measuring flow statistics, as devices on high bit rate links would not be able to process each packet in time.
On routers, techniques such as "Bloom Filters" can be used instead. A hash of the 4-tuple (assuming TCP only) identifying each flow is stored in a hash map. Samples are taken at particular intervals, and if a flow appears in two samples it is declared an elephant.
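A minimal sketch of the "seen in two samples" rule, using a small Bloom filter in place of a full flow table. The filter size, the number of hash functions and the names `BloomFilter` and `detect_elephants` are assumptions chosen for illustration; a router implementation would use cheap hardware-friendly hashes rather than SHA-256.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from the key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def detect_elephants(sampled_flows):
    """Return flows whose 4-tuple shows up in more than one sample.

    `sampled_flows` is an iterable of (src_ip, src_port, dst_ip, dst_port)
    tuples, one per sampled packet.
    """
    seen = BloomFilter()
    elephants = set()
    for flow in sampled_flows:
        key = repr(flow)
        if key in seen:
            elephants.add(flow)  # second sighting: declare an elephant
        else:
            seen.add(key)
    return elephants
```

The intuition is that a short flow is unlikely to be caught by two widely spaced samples, whereas a long-lived one will be. False positives from the filter can occasionally promote a mouse, which is the price paid for the fixed memory footprint.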

Spike detection is also used in the TCP case, which leverages the fact that the majority of mice flows complete during their Slow Start phase.
By monitoring the outgoing buffer of a router and carefully selecting a threshold value, it can be predicted whether the link is currently hosting an elephant flow.
It can do this before the buffer becomes full, because once the threshold is reached the next transfer will contain twice as many packets as the previous one.
This allows the device to react before any loss occurs, and can move the next burst of packets into a lower priority buffer.
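The doubling argument can be sketched with a little arithmetic. The functions below assume an idealised slow start in which each burst is exactly twice the size of the previous one; `next_burst_overflows` and `bursts_until_threshold` are hypothetical helper names, not part of any of the surveyed systems.

```python
def next_burst_overflows(current_burst, buffer_capacity):
    """Predict whether the next slow-start burst would exceed the buffer.

    Assumes the next burst holds twice as many packets as the current one.
    """
    return 2 * current_burst > buffer_capacity


def bursts_until_threshold(initial_window, threshold):
    """Count the doublings needed for the window to reach `threshold` packets."""
    if initial_window <= 0:
        raise ValueError("initial_window must be positive")
    window, rounds = initial_window, 0
    while window < threshold:
        window *= 2
        rounds += 1
    return rounds
```

For instance, with a 64-packet buffer a burst of 40 packets signals that the next burst (80 packets) would overflow, so the device can act one round trip early.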

I have also begun investigating machine learning techniques, including a paper which compares the performance of a number of simple classifiers from the WEKA suite.
Rather than simply comparing prediction accuracy (which is often very similar between classifiers) it also considers the time required to make a prediction, and the time required to build the classifier.

Next week I shall continue the investigation into machine learning techniques, including papers which propose solutions to detecting protocols, as these could potentially be used to predict elephant flows.

23 Nov 2014

This week I have continued work on my literature review of detecting elephant flows.
My initial findings reveal that there are two major methods for detecting elephants: statistical and machine learning.
I have also been investigating machine learning papers on detecting application protocols, as many of these techniques will be applicable to this investigation.

Unfortunately, progress has been delayed due to a family emergency this week.
Next week I shall continue investigating and writing the literature review.

14 Nov 2014

I have been investigating detecting elephant flows in networks in real time.
This week I have been focusing on processing trace data in WEKA in order to find any underlying predictors for elephant flows.
Currently I have two ARFF files generated from the output of the lpicollector program, which summarises the flows in a trace file and reports their statistics.

These ARFF files are particularly large (over 1,000,000 instances) and are too resource intensive for WEKA to process.
I have used the full data sets to form two clusters. These clusters tend to naturally contain the mice and elephant flows, saving the effort of manually classifying them.

In order to make predictions over these large data sets I have been resampling them into smaller samples (a few thousand instances) and experimenting with different classifiers and attribute sets.
This has not yielded any surprising results thus far.
The best predictors of flow size have been the destination IP address, the destination port number and the size of the payload in the first packet sent to the server.
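One way to draw a uniform sample of a few thousand instances from a file too large to load is reservoir sampling, sketched below. This is a swapped-in illustration rather than what WEKA's own resampling does; `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(instances, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Only k items are held in memory, so the input can be streamed
    line by line from a large file.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(instances):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item   # replace with probability k / (i + 1)
    return sample
```

Applied to an ARFF file, the header lines would need to be copied across unchanged and only the data rows passed through the sampler.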

Next week I will begin a literature review into methods for detecting elephant flows and contrast them against the findings of the WEKA experiments.

11 Feb 2014

This week I have captured a number of traces of BitTorrent traffic, both encrypted and unencrypted, for training the model.
I have also been working on pre-processing these traces to remove flows with packet loss or retransmission, as these are unsuitable for training the model.
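As a rough sketch of the retransmission check, the function below flags a flow direction when the same TCP sequence number carries payload more than once. It is a simple heuristic with an assumed input format, and `has_retransmission` is a hypothetical name; it does not catch partially overlapping segments or losses that are never retransmitted within the capture.

```python
def has_retransmission(segments):
    """Return True if any data-carrying sequence number appears twice.

    `segments` is an iterable of (seq, payload_len) pairs for one
    direction of a TCP flow, in capture order.
    """
    seen = set()
    for seq, payload_len in segments:
        if payload_len == 0:
            continue  # pure ACKs legitimately repeat a sequence number
        if seq in seen:
            return True
        seen.add(seq)
    return False
```

Flows for which this returns True in either direction would be dropped from the training set.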

I also now have the traces sorted into flows with libflowmanager, and a list of packet sizes and arrival times associated with each flow.
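The shape of the resulting data can be illustrated with a small sketch that groups packets by a direction-independent flow key and records the size and arrival time of each. This is not the libflowmanager API; `group_packets` and the packet tuple layout are assumptions, and a real flow manager would also deal with flow expiry.

```python
from collections import defaultdict


def group_packets(packets):
    """Group packets into bidirectional flows.

    `packets` is an iterable of
    (timestamp, src_ip, src_port, dst_ip, dst_port, proto, size) tuples.
    Returns a dict mapping a flow key to a list of (timestamp, size) pairs.
    """
    flows = defaultdict(list)
    for ts, src_ip, src_port, dst_ip, dst_port, proto, size in packets:
        # Sort the two endpoints so both directions share one key.
        end_a, end_b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
        flows[(end_a, end_b, proto)].append((ts, size))
    return dict(flows)
```

Each flow's list of (timestamp, size) pairs can then be turned into inter-arrival times and packet-size features for training.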