The project forum has now been running with the new filter system for more than six weeks. We encountered a few minor issues, but overall our experience with the new system has been good.
1. Issue: unlocking entries
When an entry is classified as ham, it gets unlocked if it was locked before. This happens when classifying old threads that were locked by hand or automatically (after a certain time, after a period of inactivity, or after a certain number of newer threads had been opened). For now, one can work around the problem by locking the thread again after classifying its last entry as ham. The next minor release will fix this issue.
2. Observation: the Bayes filter found many spam entries after a short time of training
From the very first days after putting the Bayes filter into service, we saw most spam attempts being hidden. On some days we see up to 40 correctly classified spam entries. In the first days the Bayes filter ran in parallel with the Akismet filter. After a few days we disabled the Akismet filter for forum entries. In addition, there is the classic bad word list, which has been part of the forum software from the very beginning (implemented since My Little Forum 1.x).
So the filters in combination caught many typical spam attempts from the beginning. These entries still have to be classified manually afterwards to train the Bayes filter. This is an ongoing task for the forum team.
Since we disabled the Akismet filter for forum entries, the combination of the Bayes filter and the bad word list finds nearly all spam entries, but not all of them. Currently we see between zero and three entries a day that are not classified correctly by the automatic filters. Classifying those spam entries manually is part of the ongoing task for the forum team mentioned above.
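To illustrate how a bad word list and a Bayes filter can work together, here is a minimal Python sketch. It is not the code of the forum software; the function name `is_spam`, the `threshold` value and the `spam_probability` callback are assumptions made for this example only.

```python
def is_spam(text, bad_words, spam_probability, threshold=0.9):
    """Combine a bad word list with a probability-based filter.

    bad_words: set of lower-case words that always mark an entry as spam.
    spam_probability: hypothetical callback returning a value in 0..1.
    threshold: assumed cut-off, not a value taken from the forum software.
    """
    words = set(text.lower().split())
    # First check: any listed bad word marks the entry as spam right away.
    if words & bad_words:
        return True
    # Second check: ask the probability-based filter.
    return spam_probability(text) >= threshold
```

An entry is hidden as soon as one of the two checks fires, which is why a combination can catch more than each filter alone.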
3. Training: diligent work of classifying existing entries as ham
Strictly speaking, the Bayes filter is not a spam filter but a system that computes probabilities, in our case the probability of an entry being ham or spam. There are words in spam entries that occur only with a very low probability in ham entries. On the other hand, there are many words, phrases or sentences that one will find in both kinds of entries. So the filter must be trained not only to recognise spam but also to recognise ham.
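How such a probability can be computed is shown in the following Python sketch of a simple naive Bayes classifier with add-one smoothing. It is only an illustration under simplifying assumptions (splitting on whitespace, no persistence) and not the implementation used in the forum software; the class name `TinyBayes` and its methods are made up for this example.

```python
import math
from collections import Counter


class TinyBayes:
    """Minimal naive Bayes sketch; names are hypothetical."""

    def __init__(self):
        self.words = {"ham": Counter(), "spam": Counter()}
        self.docs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # label must be "ham" or "spam"
        self.words[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = set(self.words["ham"]) | set(self.words["spam"])
        total_docs = self.docs["ham"] + self.docs["spam"]
        log_score = {}
        for label in ("ham", "spam"):
            # Prior probability of the class, smoothed so it is never zero.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            n = sum(self.words[label].values())
            for word in text.lower().split():
                # Add-one smoothing: unseen words get a small probability.
                count = self.words[label][word]
                score += math.log((count + 1) / (n + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability for "spam".
        diff = log_score["ham"] - log_score["spam"]
        diff = max(min(diff, 700.0), -700.0)  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(diff))
```

Usage: after `train("cheap pills buy now", "spam")` and `train("how do I install the forum update", "ham")`, the call `spam_probability("cheap pills")` returns a value above 0.5. Without any ham training the filter has nothing to compare the spam words against, which is why both classes need examples.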
Especially if one has been running a well-visited forum for a long time, there should be a fairly large number of entries that can be classified as ham (assuming the spam entries have been deleted regularly). Once all content of a forum has been classified, the Bayes filter has a proper basis for making its decisions. On the other hand, classifying all the content manually is very laborious work.
We have over 8200 entries in this forum (the project forum). Since the launch of the Bayes filter system I have been working on classifying the old entries and threads. While doing so I often thought about how to automate this process, but so far I have had no good idea how to do it. Keep in mind that you as the forum operator want to check whether an entry really is ham, and whether it has classifiable content, contains only a few words of agreement or disagreement, or has no content at all but should be kept anyway. For every single case you have to decide whether to classify the entry as ham and, if so, whether to additionally train the filter with it or not.
That means the work of looking into every single entry remains. The only way to avoid it is not to use the Bayes filter at all. But the filter puts you as a forum operator in a position to give up, at least partly, your dependency on external services. So we strongly recommend working with the filter and getting down to the business of classifying the old entries. Once that is done, only the daily work on current forum traffic remains. And this is a manageable job (exceptions prove the rule).
Finally: we currently have no experience with the "automated filter training".