But here, for example, we write mainly in English but also in German. The spam is often in English, Russian or Ukrainian.
If we never get SPAM in German, no German message will be flagged as SPAM, because none of its words have been classified as SPAM.
No, but at least messages in English can be ham or spam here in the project forum.
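To make the point about untrained languages concrete, here is a minimal sketch of a Bayesian-style word filter. It is not B8's actual code: the function names, the toy messages and the smoothing constants (x = 0.5 for unknown words, strength s = 1) are assumptions chosen for illustration only.

```python
from collections import Counter
from math import prod

def tokens(text):
    # Very naive tokenizer (assumption): lower-case, split on whitespace.
    return set(text.lower().split())

def train(counter, messages):
    # Count in how many messages each word occurs.
    for message in messages:
        counter.update(tokens(message))

def token_prob(tok, ham, spam, n_ham, n_spam, x=0.5, s=1.0):
    h, sp = ham[tok], spam[tok]      # Counter returns 0 for unseen words
    n = h + sp
    if n == 0:
        return x                     # never seen: neutral, no evidence
    rel_spam = sp / n_spam
    rel_ham = h / n_ham
    p = rel_spam / (rel_spam + rel_ham)
    return (s * x + n * p) / (s + n) # pull rarely seen words towards x

def spam_score(text, ham, spam, n_ham, n_spam):
    probs = [token_prob(t, ham, spam, n_ham, n_spam) for t in tokens(text)]
    p_spam = prod(probs)
    p_ham = prod(1 - p for p in probs)
    return p_spam / (p_spam + p_ham)

ham_msgs = ["the release fixes the login bug", "please update the language file"]
spam_msgs = ["buy cheap pills online now", "cheap casino bonus click now"]

ham, spam = Counter(), Counter()
train(ham, ham_msgs)
train(spam, spam_msgs)

english = spam_score("buy cheap pills now", ham, spam, len(ham_msgs), len(spam_msgs))
german = spam_score("kaufen sie jetzt billige pillen", ham, spam, len(ham_msgs), len(spam_msgs))
print(english)  # well above 0.5
print(german)   # 0.5
```

In this sketch the German message consists only of words the filter has never seen, so it stays at the neutral value and would be flagged neither as ham nor as spam, while the English message scores clearly as spam.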
Here it is not a problem because we use Akismet and we do not store the spam messages.
??? Akismet stores the messages.
No, the forum can be configured to store the spam temporarily (setting save_spam = 1). But in this forum, messages that are detected as spam are not stored and are not actually handled by B8, because we use Bad Behavior, Akismet and Stop Forum Spam.
However, I will remove Akismet for data privacy reasons (in my forum). For that reason, I need to store the training data, NOT the spam messages.
This is a reasonable step.
Maybe this is a misunderstanding: you have to flag the entry. If it is SPAM, you can delete the entry after flagging.
Yes, I know.
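The "flag first, delete afterwards" order can be sketched roughly as follows. This is only an illustration under assumptions: the class TrainingStore, its method learn and the toy forum dictionary are hypothetical names and do not reflect how B8 or the forum software actually store their data.

```python
from collections import Counter

class TrainingStore:
    """Keeps per-category word counts only; the message text itself is not kept."""

    def __init__(self):
        self.tokens = {"ham": Counter(), "spam": Counter()}
        self.messages = {"ham": 0, "spam": 0}

    def learn(self, text, category):
        # Count each word once per message, then forget the text.
        self.tokens[category].update(set(text.lower().split()))
        self.messages[category] += 1

store = TrainingStore()
forum = {1: "cheap pills buy now", 2: "how do I change the language file"}

store.learn(forum[1], "spam")  # flag the entry first ...
del forum[1]                   # ... then delete the spam entry
store.learn(forum[2], "ham")   # ham stays in the forum as usual

print(store.messages)          # {'ham': 1, 'spam': 1}
```

After the deletion the spam posting is gone from the forum, but the word counts learned from it remain available for classifying later messages.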
At least at first glance, but it may benefit from examples of spam in English.
No, please read up on the theory behind Bayesian filtering. A forum about, e.g., flowers does not benefit from a training database in which the HAM is derived from postings about, e.g., animals. It's all about content. Of course, the SPAM may be the same, but the HAM isn't, and you need both of them. For that reason, it makes no sense to provide a database with general entries - whatever "general" may be.
Ham will be different, spam will be the same. A dataset of spam as an addendum to the existing training data is, at least for me, imaginable. But that's only an idée fixe, nothing we must think about.
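As a rough illustration of the "spam as addendum" idea: a shared spam dataset could be added to a forum's own spam counts while the ham counts stay local. The variable names and all numbers below are invented for the example; no such shared dataset exists in the thread.

```python
from collections import Counter

# Local training data of a hypothetical flower forum (made-up counts).
local_ham = Counter({"rose": 12, "tulip": 9, "soil": 7})
local_spam = Counter({"pills": 2})
local_spam_messages = 2

# Hypothetical shared spam-only dataset (made-up counts).
shared_spam = Counter({"pills": 40, "casino": 31, "bonus": 25})
shared_spam_messages = 50

# The addendum only touches the spam side; the ham stays forum-specific.
combined_spam = local_spam + shared_spam
combined_spam_messages = local_spam_messages + shared_spam_messages

print(combined_spam["pills"], combined_spam_messages)  # 42 52
```

The ham side is deliberately left alone here, which matches the objection above that ham from a different forum carries no useful information.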
Never separate "Müll" (garbage), because it has only one syllable!