
Kieran Matherson's blog

31 Aug 2015

Finished my site with Django. First I added the functionality to display a log file on a page, both to get familiar with Django and because it can be extended later for things like filtering a single file.

Next I implemented the 60-second window search between two files. This takes the first file and steps through each window, comparing it to the corresponding window in the second file; specifically, it compares each event to each other event using Dice's coefficient and averages those scores to show the average similarity for that window.
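
A minimal sketch of that comparison (assuming an event is just a log line tokenised on whitespace and a window is the list of events falling inside one 60-second span; the function names are only illustrative):

    def dice(event_a, event_b):
        """Dice's coefficient over the token sets of two events."""
        ta, tb = set(event_a.split()), set(event_b.split())
        if not ta and not tb:
            return 0.0
        return 2 * len(ta & tb) / (len(ta) + len(tb))

    def window_similarity(window_a, window_b):
        """Average Dice score of every event in one window against
        every event in the other window."""
        scores = [dice(a, b) for a in window_a for b in window_b]
        return sum(scores) / len(scores) if scores else 0.0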

Next I linked each window to a separate page where the window is broken down into its individual events (each step breaks it down further). So at this stage we can compare one event to the events in the second file's window using the same similarity score.

Continuing the approach of breaking the files down into smaller parts so these huge files can be processed in manageable steps, in the future I think it would be nice to be able to visualise the difference between two events; Google's diff-match-patch API is a good example of showing the differences between two strings.
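
With the Python port of diff-match-patch (installed as the diff-match-patch package; the example events below are made up), visualising the difference between two events could look roughly like this:

    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    event_a = "sshd[1234]: Accepted password for user from 130.217.1.5"
    event_b = "sshd[1240]: Failed password for user from 130.217.1.9"

    diffs = dmp.diff_main(event_a, event_b)
    dmp.diff_cleanupSemantic(diffs)    # merge tiny edits into readable chunks
    print(dmp.diff_prettyHtml(diffs))  # HTML with insertions/deletions marked up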

And of course working on my presentation for the honours conference.

25 Aug 2015

Looked into string similarity measures like the Jaccard index and Dice's coefficient to use as metrics for comparing events. Then I looked into how to display the information, so I started building a website with Django that shows window similarity; once a window is selected it is broken down into its constituent events for further examination. I also examined Google's diff-match-patch API for visualising the difference between two strings.
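
The two measures, over the token sets of two events (a rough sketch; both return 1.0 for identical token sets and 0.0 for disjoint ones):

    def jaccard(event_a, event_b):
        ta, tb = set(event_a.split()), set(event_b.split())
        union = ta | tb
        return len(ta & tb) / len(union) if union else 0.0

    def dice(event_a, event_b):
        ta, tb = set(event_a.split()), set(event_b.split())
        total = len(ta) + len(tb)
        return 2 * len(ta & tb) / total if total else 0.0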

With the new data from Brad, which comes from multiple machines, I made a script to parse it into a database to use in conjunction with the site. The aim is to get this finished in time for the presentation.
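
A rough sketch of what the parsing script does (the tab-separated machine/timestamp/message layout and the table name here are assumptions, not the actual format of Brad's data):

    import sqlite3

    def load_logs(path, db_path="logs.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS events
                        (machine TEXT, timestamp TEXT, message TEXT)""")
        with open(path) as f:
            for line in f:
                parts = line.rstrip("\n").split("\t", 2)
                if len(parts) == 3:
                    conn.execute("INSERT INTO events VALUES (?, ?, ?)", parts)
        conn.commit()
        conn.close()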

17 Aug 2015

Added the functionality to compare a window across both logs, to see the relationship over a time period. From there, the part that compares a single event against a window can be used to break things down further.

Have learnt that, with the repetitiveness of log files, a lot of the connections being added as tokens weren't really telling us anything, so those were ignored and blacklisted to see what could still be connected; for example, looking only at IP addresses, and further narrowing that to IP addresses not on the 130.217 network.

To see if we can find anything really useful, other logs are being considered, since we want to see whether a meaningful event could be detected. As an example, from the syslog we could possibly detect a power outage if we saw that two machines logged a reboot in the same time period.

10 Aug 2015

Started examining connections between multiple files. In other words, if an event happens in one file and we look at other files around the same time, are there any relations?

Firstly, to be able to check a time window, all the log file timestamps needed to be standardised, so I made a script to process each timestamp and convert it into a unix timestamp. This allows for easy comparisons: subtracting one timestamp from the other gives the difference in seconds between two events.
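
A sketch of the conversion (the format strings here are just examples; each log type gets its own):

    from datetime import datetime

    FORMATS = ["%b %d %H:%M:%S", "%Y-%m-%d %H:%M:%S"]  # e.g. syslog-style, ISO-style

    def to_unix(stamp, year=2015):
        for fmt in FORMATS:
            try:
                dt = datetime.strptime(stamp, fmt)
                if dt.year == 1900:           # syslog stamps carry no year
                    dt = dt.replace(year=year)
                return int(dt.timestamp())
            except ValueError:
                continue
        raise ValueError("unrecognised timestamp: " + stamp)

    # the difference in seconds is then just a subtraction
    delta = abs(to_unix("Aug 10 14:03:21") - to_unix("Aug 10 14:02:50"))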

Once I had usable timestamps I created a program to compare two files against one another. It goes through the first file event by event and then scans the second file for events within a sixty-second window of each one. Once an event is found, a similarity between the two events is calculated and the tokens that matched are stored.
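
The scan looks roughly like this (using to_unix() and dice() from the sketches above; events are assumed to be (unix_time, text) tuples):

    WINDOW = 60  # seconds

    def compare_files(events_a, events_b):
        """Score every pair of events that fall within WINDOW seconds."""
        matches = []
        for time_a, text_a in events_a:
            for time_b, text_b in events_b:
                if abs(time_a - time_b) <= WINDOW:
                    matches.append((time_a, time_b, dice(text_a, text_b)))
        return matches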

A lot of the lines need to be stripped of characters like =, [, < etc. so that any useful information can be compared. I have also found that, since a lot of the IP addresses are on 130.217, there are a lot of matches on just that, so it may be useful to implement a blacklist for that network prefix, because IP addresses are assigned more weight than normal words.
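
A sketch of that clean-up step (the exact character set and the prefix test are simplified here):

    import re

    NOISE = r"[=\[\]<>]"          # characters stripped before comparison
    LOCAL_PREFIX = "130.217."     # blacklisted network prefix

    def tokens(event):
        cleaned = re.sub(NOISE, " ", event)
        return [tok for tok in cleaned.split()
                if not tok.startswith(LOCAL_PREFIX)]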

27 Jul 2015

Firstly I made the MultinomialNaiveBayes classifier sort its predictions by confidence and print them in that order to standard out. The application allows multiple files to be used for training, and once trained another file can be specified for testing.

Next, to allow the user to correct the program's mistakes, I made a GUI which displays the output list in a tabular format. The user can then go through each event and update whether or not it is safe. Finally, once the user is happy, they can save the instances to an ARFF file so it can be fed back in for training, allowing the classifier to improve on its previous mistakes.

Now I'm looking into grouping events that occur within some time period of one another, e.g. 60 seconds, from multiple files, to see how prevalent the connections between multiple file types are. Some measure of similarity between events will be needed, so each "word" will be treated as a token when comparing two events. This can be adjusted to weight certain parts of events, e.g. treating an IP address as four separate parts: instead of a weight of one when the whole thing matches, it would now have a weight of four, or a weight of two if just the network section matches.
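
A sketch of that weighting idea (splitting an address into octets so a full match contributes four shared tokens and a network-only match contributes two):

    def ip_tokens(token):
        # "130.217.1.5" -> ["130", "217", "1", "5"]; non-IP tokens stay whole
        return token.split(".") if token.count(".") == 3 else [token]

    def shared_weight(event_a, event_b):
        ta = {t for tok in event_a.split() for t in ip_tokens(tok)}
        tb = {t for tok in event_b.split() for t in ip_tokens(tok)}
        return len(ta & tb)       # IP octets now count individually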

To be able to find the time between events I will need to convert the timestamps into a usable format, which will need to be customised for log files with different timestamp formats.

06 Jul 2015

Built the classifier with WEKA using weka.classifiers.bayes.NaiveBayesMultinomialText, with the ArffLoader class loading instances one at a time rather than loading the whole data set, to help save memory.

Then I used the test data set to build a ranked list to check the output. I just need to polish that and evaluate the classifier.

29 Jun 2015

Talked to Brad about getting tagged log file data, and he modified bearwall to log both packets that were allowed and packets that were blocked; each event indicates which, and when a packet was blocked the reason is also given. So now I have a day's worth of tagged firewall logs (about 11,000 lines).

I then made an application to process this data and convert it into the ARFF format so that it is compatible with WEKA. Now I need to split this into separate sets for testing, parameter tuning and evaluation.
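
A sketch of the conversion (the attribute names and the allowed/blocked class values here are illustrative, not the exact schema):

    def write_arff(events, path):
        """events: list of (text, label) with label in {'allowed', 'blocked'}."""
        with open(path, "w") as f:
            f.write("@relation firewall_events\n\n")
            f.write("@attribute text string\n")
            f.write("@attribute class {allowed,blocked}\n\n")
            f.write("@data\n")
            for text, label in events:
                f.write("'%s',%s\n" % (text.replace("'", "\\'"), label))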

Now I will build the application with the WEKA framework to rank this data based on classification accuracy.

15 Jun 2015

Found a paper where they cluster event logs with word vector pairs; this approach compares each pair to every other pair in the supplied logs, allowing it to cluster lines with similar parts. There is also a toolkit associated with the paper that lets you specify the input files and the support threshold for forming a pair, and then outputs the clusters where that support is reached; the outlier clusters can also be output. This will need to be investigated further to see if it is a good possible solution.

Had a meeting with Antti Puurula about possible approaches, where we discussed outputting a ranking split into lists of safe and unsafe events. We discussed how this could be evaluated with a Mean Average Precision measurement, and then a few algorithms that could be used for scoring events: clustering if the feature space can be separated, supervised learning if all the data is tagged, or his recommendation of supervised learning where the user manually updates the list of safe/unsafe events and the classifier updates iteratively.
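
The evaluation idea in a nutshell (a sketch: "relevant" here is assumed to mean an event the user marked unsafe, and higher scores mean the unsafe events were ranked near the top):

    def average_precision(ranked_relevance):
        """ranked_relevance: 1 if the event at that rank is relevant, else 0."""
        hits, precisions = 0, []
        for rank, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / hits if hits else 0.0

    def mean_average_precision(rankings):
        return sum(average_precision(r) for r in rankings) / len(rankings)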

We then discussed how to also integrate non-language features like timestamps, by having another algorithm, such as naive Bayes, handle continuous features. This way we could identify events happening within a certain time period of one another to tie events between files.

25 May 2015

Finished off making the word count program that outputs comma-separated format: word, occurrences, frequency. This helps give a better understanding of the documents being worked with. Next I think it will be useful to get stats on the counts, e.g. how many words occur only once, how many occur more than 10 times, etc.
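
The core of it is roughly this (a minimal sketch; the real program also handles the per-file grouping):

    import csv
    from collections import Counter

    def word_counts(lines, out_path):
        counts = Counter(word for line in lines for word in line.split())
        total = sum(counts.values())
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["word", "occurrences", "frequency"])
            for word, n in counts.most_common():
                writer.writerow([word, n, n / total])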

On Friday 22/5 I had a meeting with Bob Durrant from the Statistics department to get his opinion on possible approaches etc. I now have a better idea of what I'm going to do next: ignoring topic modelling for now and looking at clustering, starting with only the bearwall firewall logs, then looking into comparing multiple types of log files later.

Firstly I will start with a simple version of K-Means and add functionality to it as I go along, e.g. seeing if there is more meaning in an IP address by separating the network and host portions, among many other possibilities. I will also need to look into the languages and libraries I have found for this type of work to decide what I will be using.

I have also been working on my presentation, which I will give on Wednesday the 27th.

18 May 2015

I am going to organise a meeting with Bob Durrant from Statistics to discuss my project and get his opinion on approaches etc., and Bob has suggested an expert on text mining who would be glad to discuss my project.

For these upcoming meetings, and in general, it was suggested that I make brief summaries of the information I am working on, so I created a program to process my log files and output a word frequency count for each set of input files belonging to a process; the output is in .csv format so it can easily be loaded into Excel to be sorted, plotted etc. It can also take a specific date as input so the logs for one specific day can be filtered.

Next I will use my program to make frequency tables for each of the processes in my log files (where the process would be considered the author in text mining terms). Then I will take a few lines from each process to give a short summary of each process's output, which will give an overview of the type of files I'm working with.