Some information can be found on the developer's website.
I'm curious to see the number of false positives and false negatives, and how much time and how many words it needs to become stable.
Yes, me too. It depends on the HAM and SPAM frequency of a forum. If you never train SPAM, all entries will be classified as HAM. Training both categories, SPAM and HAM, is the most important task. For that reason, do NOT train the filter with spam posted by a sock puppet. Restrict the training to SPAM written by bots.
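To illustrate why both categories have to be trained, here is a minimal sketch of a generic naive Bayes filter with add-one smoothing. It is NOT the algorithm used by B8; the class `TinyBayes` and its methods are hypothetical names made up for this example, and the tokenizer simply splits on whitespace.

```python
import math
from collections import Counter


class TinyBayes:
    """Hypothetical toy filter, only meant to illustrate the training effect."""

    def __init__(self):
        self.tokens = {"ham": Counter(), "spam": Counter()}
        self.texts = {"ham": 0, "spam": 0}

    def learn(self, text, category):
        # Count every whitespace-separated token for the given category.
        self.tokens[category].update(text.lower().split())
        self.texts[category] += 1

    def spam_probability(self, text):
        # A category that was never trained has an estimated prior of zero,
        # so the posterior falls entirely to the other category.
        if self.texts["spam"] == 0:
            return 0.0
        if self.texts["ham"] == 0:
            return 1.0

        vocabulary = len(set(self.tokens["ham"]) | set(self.tokens["spam"]))
        total_texts = self.texts["ham"] + self.texts["spam"]
        log_score = {}
        for category in ("ham", "spam"):
            token_count = sum(self.tokens[category].values())
            score = math.log(self.texts[category] / total_texts)
            for token in text.lower().split():
                # Add-one smoothing keeps unseen tokens from zeroing the product.
                seen = self.tokens[category][token]
                score += math.log((seen + 1) / (token_count + vocabulary))
            log_score[category] = score

        # Turn the two log scores into P(spam | text); clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


bayes = TinyBayes()
bayes.learn("thanks for the helpful answer", "ham")
print(bayes.spam_probability("buy cheap pills now"))  # 0.0, no SPAM trained yet

bayes.learn("buy cheap pills now", "spam")
print(bayes.spam_probability("buy cheap pills online"))
print(bayes.spam_probability("thanks for the answer"))
```

As long as only HAM has been learned, every entry gets a spam probability of 0.0, no matter what it contains. After a single SPAM example, the two test entries already fall on different sides of 0.5.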
I think the different languages in particular are a challenge for the script and the forum operators.
If the forum is operated in, e.g., German and the spam entries are only in English, it will be quite easy to detect the spam (in my opinion). So I don't think one can give a more general answer to this topic.
often overlapping with the languages of the valid entries?
THAT is the challenge, which is (hopefully) solved by Bayesian statistics.
How can we provide a set of training data for the forum operators (in light of the different languages), so that they do not have to start from zero?
This point is discussed in the B8 documentation. In the end, it makes no sense to provide such a database because of the different languages. A Russian forum does not benefit from a German or English database.