Kieran Matherson's blog




Researched machine learning algorithms for processing short documents. Since social media keeps growing in popularity, a lot of work is going into learning useful things from posts, for example learning about an event that is being posted about on Twitter. I was especially interested in the techniques used for Twitter, since a message cannot be more than 140 characters, is posted by a single user and has an interesting makeup of #hashtags etc. A log entry is similar in structure: each event is a short line, is written by one specific program and contains port numbers and IP addresses.

Some of the studies found better results by aggregating the "tweets" into one document, which is already the case with log files; others try to address the short length of each message by looking at biterm pairs, among many other techniques. Most of the algorithms are based on topic modeling, which infers the topics (groups of words) that could have generated the document, though I did find other clustering algorithms, like spherical K-Means, which I will look into further.
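To make the biterm idea concrete for myself, here is a minimal Python sketch. It assumes a biterm is simply an unordered pair of words that occur together in one short text; the sample lines and the function names are made up for illustration and are not taken from any of the studies.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return every unordered word pair from one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(lines):
    """Count biterms over a whole collection of short texts."""
    counts = Counter()
    for line in lines:
        counts.update(biterms(line.lower().split()))
    return counts


# Hypothetical, already-cleaned log lines used only as sample input.
sample = ["sshd failed password root", "sshd accepted password root"]
counts = count_biterms(sample)
```

Counting pairs over the whole collection, rather than words per document, is what lets this kind of model cope with texts that are only a few words long.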

Looked further into Mallet's API and into how the importer that creates the .mallet input file works. I was able to change the regex that parses tokens so that it includes IP addresses and other numbers, and found that it converts the token strings to integers for storage. After getting the IP addresses etc. included in the input I tried topic modeling on it, but it failed and the output was all weird characters, so I need to find out what effect the numeric and punctuation characters have on both the input generation and modeling steps.
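The two ideas in this step, a token regex that keeps IP addresses and numbers, and storing tokens as integers, can be sketched in plain Python. This is not Mallet's code and not the regex I used there; the pattern, the sample log line and the function names below are assumptions for illustration only.

```python
import re

# Assumed pattern: dotted IPv4-style addresses first, then words of two or
# more letters, then any remaining run of digits.
TOKEN_RE = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}|[A-Za-z][A-Za-z_-]*[A-Za-z]|\d+"
)


def tokenize(line):
    """Split one log line into word, number and IP-address tokens."""
    return TOKEN_RE.findall(line)


def index_tokens(tokens, vocab):
    """Replace each token with an integer id, growing vocab as needed."""
    return [vocab.setdefault(token, len(vocab)) for token in tokens]


# Hypothetical log line used only as sample input.
line = "sshd[4021]: Failed password for root from 192.168.1.10 port 22"
tokens = tokenize(line)
vocab = {}
ids = index_tokens(tokens, vocab)
```

Putting the IP alternative first matters: otherwise the plain digit pattern would split an address into four separate numbers.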




Started using Mallet on the files that I have collected. Tested it on the entire directory at once and it created a .mallet file quickly and a topic model in 1 hour 30 mins. I then ran it on the Bearwall logs, which I had unzipped, and that took a lot longer (over 3 hours), which makes me believe that it ignores the zipped files.
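One way to check how much of a log directory is actually compressed, before drawing conclusions from the timings, is to look at the first bytes of each file. The sketch below is an assumption about how I might do that in Python; it only recognises the gzip and zip signatures and says nothing about what Mallet itself does with such files.

```python
from pathlib import Path

# Standard file signatures (magic bytes) for the two formats checked here.
MAGIC = {b"\x1f\x8b": "gzip", b"PK\x03\x04": "zip"}


def compression_of(path):
    """Return 'gzip', 'zip' or None based on the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def group_by_compression(root):
    """Group every file under root by its detected compression type."""
    groups = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups.setdefault(compression_of(path), []).append(path)
    return groups
```

Calling `group_by_compression` on the log folder gives the plain files under the key `None`, which is the set I would expect a plain-text importer to be able to read.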

Then I looked at the topic keys it had generated. It had stripped out all the numbers and kept only words, so looking at the topics didn't really show anything useful.

So the next step is to look into other programs and methods that are better suited to log files, because it would be more useful to see events from multiple files of different applications being grouped together. That's not to say the topic modeling that Mallet is suited to won't be helpful; I will also need to look into whether there is an option to retain numbers etc. and, if so, re-test.




Spent this week working on other assignments to get them finished before the end of teaching recess, and didn't get to work on the project.

The next step is still to research popular document learning algorithms and, if they exist within Mallet, see how well they adapt to log files, adjusting them if needed for a better fit to this text format.




Went and saw Brad and got the /var/log log files from ns1 zipped up to start testing with. This lets me decide at the next stage whether or not I need Syslog logs; it is simpler to use the logs already available than to set up Syslog on a machine, so these will be a great start.

Unzipped them and am now trying to find the best way to combine them into a .mallet file. Tried to run the import on the whole folder but after a couple of hours it was still going. When I have time I will leave it running for a while, because it may just take a while to process 400MB of logs; for the moment I'll use subfolders. The examples I went through had all their files in the .txt format, but when testing on a single folder Mallet seems to be able to decompress and read files in the plain file format.

I ran Mallet on the single subfolder /var/log/kernel to create a simple topic model, which worked well but didn't have enough info to tell me anything interesting, so I will be looking for some bigger subsets to test on until I can get the entire log to combine.
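For reference, the two Mallet CLI steps for a run like this (import a folder, then train a topic model) can be scripted. The Python sketch below only builds the command lines; the paths, the number of topics and the `bin/mallet` location are assumptions, not the exact values I used. The flags shown (`import-dir`, `--keep-sequence`, `--token-regex`, `train-topics`, `--num-topics`, `--output-topic-keys`) are standard Mallet CLI options.

```python
import subprocess


def import_dir_command(log_dir, mallet_file, mallet_bin="bin/mallet",
                       token_regex=None):
    """Build the command that turns a folder of logs into a .mallet file."""
    cmd = [mallet_bin, "import-dir", "--input", log_dir,
           "--output", mallet_file, "--keep-sequence"]
    if token_regex is not None:
        # Optional custom tokenizer pattern (Java regex syntax).
        cmd += ["--token-regex", token_regex]
    return cmd


def train_topics_command(mallet_file, num_topics, keys_file,
                         mallet_bin="bin/mallet"):
    """Build the command that trains a topic model and writes topic keys."""
    return [mallet_bin, "train-topics", "--input", mallet_file,
            "--num-topics", str(num_topics),
            "--output-topic-keys", keys_file]


def run(cmd):
    """Run one command; needs a local Mallet install, so not called here."""
    subprocess.run(cmd, check=True)


# Hypothetical paths and topic count, for illustration only.
import_cmd = import_dir_command("var/log/kernel", "kernel.mallet")
train_cmd = train_topics_command("kernel.mallet", 10, "kernel_keys.txt")
```

Keeping the commands as lists makes it easy to loop over several subfolders and time each run separately.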

Now I will start researching and testing different document clustering algorithms for finding patterns within these logs.




First, to catch up on the previous weeks: I created my brief and proposal (see attached for more info).

This week my goal was to become familiar with using the tool Mallet which I will use to analyse log files.

There aren't many tutorials on Mallet but I found some good ones on using the command line interface (CLI), and I should be able to accomplish most of what I want to do with just the CLI. I will use it for now to start my testing and will possibly move on to the Java API if I need to do something more complicated than the CLI is capable of.

I then looked further into Mallet's Java API. There aren't really any tutorials on how to use it, but if I need to accomplish something more complicated, a combination of the example files and docs should be sufficient to make a useful program.

The next step is to get some log files from Syslog or /var/log and start converting them to Mallet's ".mallet" file format (the input can be a single file or an entire directory), and then run the data through some machine learning methods.