Weekly Report

02 May 2011

The major problem with the spam dataset I've been using appears to be that
greylisted flows are classified as spam. Greylisting happens very often to
incoming flows on our mail server, and mostly seems to occur early in the
connection, before any message data is sent. As a result, a vast number of
almost identical flows were being counted as spam. Removing these flows
from consideration leaves a smaller dataset, but one in which almost every
flow traverses at least one link that is classified entirely as spam or
entirely as ham. The few remaining flows that traverse no such link appear
to involve TLS and will require closer investigation. I will also need to
expand into newer and larger datasets, ideally some without greylisting
that see larger volumes of spam.
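
As a rough illustration of the filtering and link check described above,
here is a minimal Python sketch. The flow records, their field names
(`links`, `label`, `greylisted`) and the helper functions are all
hypothetical. In particular, it assumes greylisted flows have already been
flagged by some earlier step, and it is not the actual analysis code.

```python
from collections import defaultdict


def drop_greylisted(flows):
    """Keep only flows that are not flagged as greylisted (flag is assumed)."""
    return [f for f in flows if not f["greylisted"]]


def pure_links(flows):
    """Return the links whose flows all carry the same label (spam or ham)."""
    labels_by_link = defaultdict(set)
    for f in flows:
        for link in f["links"]:
            labels_by_link[link].add(f["label"])
    return {link for link, labels in labels_by_link.items() if len(labels) == 1}


def flows_without_pure_link(flows):
    """Return flows that traverse no link classified entirely as spam or ham."""
    pure = pure_links(flows)
    return [f for f in flows if not any(link in pure for link in f["links"])]


# Made-up example data, for illustration only.
sample_flows = [
    {"id": 1, "links": ["a", "b"], "label": "spam", "greylisted": False},
    {"id": 2, "links": ["b", "c"], "label": "ham", "greylisted": False},
    {"id": 3, "links": ["b"], "label": "spam", "greylisted": False},
    {"id": 4, "links": ["a"], "label": "spam", "greylisted": True},
]

kept = drop_greylisted(sample_flows)
needs_investigation = flows_without_pure_link(kept)
```

In this toy data, link `b` carries both spam and ham, so the flow that only
traverses `b` is the one left over for closer investigation.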