Some information can be found on the developer's website.
I have seen this page before but didn't read all of it.
I'm curious to see the number of false positives and false negatives, and how much time and how many words it needs to become stable.
Yes, me too. It depends on the HAM and SPAM frequency of a forum. If you never train SPAM, all entries will be classified as HAM. Training both classes, spam and ham, is the most important task. For that reason, do NOT classify posts written by a sock puppet (Sockenpuppe) as spam. Restrict the training to SPAM written by bots.
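To illustrate why both classes need training, here is a minimal sketch of a word-based naive Bayes filter. It is not the forum script's actual code; the class name `NaiveBayesFilter`, its methods, and the sample posts are made up for this example, and I assume simple add-one smoothing.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy word-based Bayesian filter (hypothetical, for illustration only)."""

    def __init__(self):
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.entry_counts = {"ham": 0, "spam": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase and split into word tokens.
        return re.findall(r"\w+", text.lower())

    def train(self, text, label):
        # label must be "ham" or "spam".
        self.entry_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def classify(self, text):
        total = sum(self.entry_counts.values())
        vocab = set(self.word_counts["ham"]) | set(self.word_counts["spam"])
        scores = {}
        for label in ("ham", "spam"):
            if self.entry_counts[label] == 0:
                # A class that was never trained can never win.
                scores[label] = float("-inf")
                continue
            # Log prior: share of trained entries with this label.
            score = math.log(self.entry_counts[label] / total)
            n_words = sum(self.word_counts[label].values())
            # Add-one smoothing; the extra +1 reserves a slot for unseen words.
            denominator = n_words + len(vocab) + 1
            for word in self.tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        # On a tie (including an untrained filter) "ham" is returned.
        return max(scores, key=scores.get)


spam_filter = NaiveBayesFilter()
spam_filter.train("Thanks for the update, the patch works on my forum", "ham")
spam_filter.train("Wie kann ich das Skript aktualisieren", "ham")
before = spam_filter.classify("cheap pills buy now")

spam_filter.train("buy cheap pills now, best price casino", "spam")
after = spam_filter.classify("cheap pills buy now")
print(before, after)
```

With only ham trained, the obvious spam entry still comes back as ham; after a single spam entry has been trained, the same text is classified as spam. A real filter would need far more entries of both kinds before the results become stable.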
The entries I made were one-to-one copies of the originals from my forum.
I think the different languages in particular are a challenge for the script and the forum operators.
If a forum is run in, e.g., German and the spam entries are only in English, it will be quite easy to detect the spam (in my opinion).
But here, for example, we write mainly in English but also in German. The spam is often in English, Russian, or Ukrainian. So we get both ham and spam in English, at least. Here it is no problem because we use Akismet and Bad Behavior and do not store the spam messages. This may be different in the future, when we activate the feature (at least in the first days or weeks).
So I don't think one can give a more general answer on this topic.
A Russian forum does not benefit from a German or English database, at least at first glance, but it may benefit from examples of spam in English.
Never split "Müll" (garbage), because it has only one syllable!