
New anti spam features and their effectivity (Technics)

by Auge ⌂ @, Tuesday, January 22, 2019, 15:31 (93 days ago)

Hello

After the introduction of separate minimum and maximum values for the time between requesting a form and sending it back to the server, the number of new registrations that never got activated dropped to 0. Even spam postings seem to be absent from this forum at the moment.

In my own, more or less unknown forum I also see no registrations, but on the other hand I get around 40 to 50 spam postings per day. All of them get recognised (by Akismet, Bad Behavior or Stop Forum Spam) and hidden.

This is the current status. All spam messages arrived between 2019-01-21 15:00 and 2019-01-22 16:25.

[image]

My minimum of 10 seconds for sending back the posting form is obviously not enough. Micha introduced a Bayes filter for version MLF 2.5 as a further spam prevention mechanism.
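For readers unfamiliar with the mechanism, the time-window check can be pictured roughly as follows. This is only a minimal sketch in Python (MLF itself is written in PHP); the function name `is_plausible_submission` and the one-hour maximum are assumptions for illustration, only the 10-second minimum comes from the post above.

```python
# Hypothetical limits: only the 10-second minimum is taken from the thread.
MIN_SECONDS = 10
MAX_SECONDS = 3600  # assumed maximum, purely illustrative


def is_plausible_submission(requested_at, submitted_at,
                            min_seconds=MIN_SECONDS,
                            max_seconds=MAX_SECONDS):
    """Return True if the form came back inside the allowed time window.

    Both timestamps are seconds since the epoch (e.g. from time.time()).
    """
    elapsed = submitted_at - requested_at
    return min_seconds <= elapsed <= max_seconds
```

A form that comes back after 3 seconds, or only after several hours, would fall outside the window and could be rejected or treated as suspicious.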

@Micha: Would it work with an experimental backport, or should I upgrade to the master branch of MLF?

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Micha ⌂, Tuesday, January 22, 2019, 20:04 (93 days ago) @ Auge

Hello,

After the introduction of separate minimum and maximum values for the time between requesting a form and sending it back to the server, the number of new registrations that never got activated dropped to 0.

I also set the character restriction for the password

[image]

@Micha: Would it work with an experimental backport, or should I upgrade to the master branch of MLF?

Does "backport" mean integrating the filter into the 2.4 branch? :confused: I'm not sure how to translate the word correctly in this context. :-(

As I have already noted, I don't like working on different versions at the same time. The 2.5 branch contains completely implemented features, but up to now we have withheld these features from users. I would support the switch-over to 2.5. This is my personal opinion!

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their effectivity

by Auge ⌂ @, Tuesday, January 22, 2019, 20:21 (93 days ago) @ Micha

Hello

After the introduction of separate minimum and maximum values for the time between requesting a form and sending it back to the server, the number of new registrations that never got activated dropped to 0.


I also set the character restriction for the password

At the moment there are no unwanted registrations in my forum, so until now I have seen no need to enforce additional rules for passwords (besides the enforced minimum length).

@Micha: Would it work with an experimental backport, or should I upgrade to the master branch of MLF?


Does "backport" mean integrating the filter into the 2.4 branch? :confused: I'm not sure how to translate the word correctly in this context. :-(

No, no integration in the 2.4 branch. I only wanted to test it in my own forum (which is necessarily a 2.4.x version).

As I have already noted, I don't like working on different versions at the same time. The 2.5 branch contains completely implemented features, but up to now we have withheld these features from users. I would support the switch-over to 2.5. This is my personal opinion!

I wanted to support the 2.4 branch with fixes only, once work on one or another new feature for 2.5 had begun. It has become more than a bit muddled because of the changes that were immediately necessary here and there. Let's focus on 2.5.

But that's not the point I wanted to talk about. Did you include the Bayes filter in your forum? Or, in other words, do you have a working copy and are you able to share first insights? I ask because the minimum time between requesting and sending back the posting form doesn't prevent spam postings in my forum. On the other hand, I haven't seen a single spam entry here in the project forum for several days. I've no clue what's going on and what the difference is between this forum and mine.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Micha ⌂, Tuesday, January 22, 2019, 20:37 (93 days ago) @ Auge

Hello,

Did you include the Bayes filter in your forum?

I only implemented the filter in my local installation. A lot of files must be changed manually; that's why I don't want to include the filter in my forum (without an official update).

Or, in other words, do you have a working copy and are you able to share first insights?

No.

I ask because the minimum time between requesting and sending back the posting form doesn't prevent spam postings in my forum.

But the implemented filters seem to work fine in your forum. All SPAM messages are flagged correctly. The new Bayes filter works the same way. Based on training data, the filter flags the messages. At the beginning, the filter may work poorly because no training data are available. The filter needs SPAM and HAM postings to learn the difference (i.e. to calculate the probability of the SPAM level).

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their effectivity

by Auge ⌂ @, Wednesday, January 23, 2019, 07:32 (93 days ago) @ Micha

Hello

Did you include the Bayes filter in your forum?


I only implemented the filter in my local installation. A lot of files must be changed manually; that's why I don't want to include the filter in my forum (without an official update).

O.k.

I ask because the minimum time between requesting and sending back the posting form doesn't prevent spam postings in my forum.


But the implemented filters seem to work fine in your forum. All SPAM messages are flagged correctly.

Yes, they do. I am a bit puzzled by the differences between here and there. This forum is better known than mine, but my forum gets flooded with (well detected) spam and this one does not.

The new Bayes filter works the same way. Based on training data, the filter flags the messages. At the beginning, the filter may work poorly because no training data are available. The filter needs SPAM and HAM postings to learn the difference (i.e. to calculate the probability of the SPAM level).

I see, I have to upgrade for training data only. :-)

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Micha ⌂, Wednesday, January 23, 2019, 09:18 (92 days ago) @ Auge

Hello,

I see, I have to upgrade for training data only. :-)

Please note: due to the SQL changes, you may not be able to downgrade to 2.4.

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their effectivity

by Auge ⌂ @, Wednesday, January 23, 2019, 10:05 (92 days ago) @ Micha

Hello

I see, I have to upgrade for training data only. :-)


Please note: due to the SQL changes, you may not be able to downgrade to 2.4.

That's true, but if and when I decide to upgrade, I will not downgrade again. On the other hand, nothing speaks against a fresh installation with an import of the existing database.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Micha ⌂, Wednesday, January 23, 2019, 10:20 (92 days ago) @ Auge

Hello,

That's true, but if and when I decide to upgrade, I will not downgrade again.

If you use 2.5, you have to update your forum manually because 2.4.xxx is always lower than 2.5. The update routine will exclude your version.

On the other hand, nothing speaks against a fresh installation with an import of the existing database.

It was only a warning ;-)

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their effectivity

by Auge ⌂ @, Sunday, February 03, 2019, 13:40 (81 days ago) @ Auge

Hello

… the minimum time between requesting and sending back the posting form doesn't prevent spam postings in my forum. On the other hand, I haven't seen a single spam entry here in the project forum for several days. I've no clue what's going on and what the difference is between this forum and mine.

Now I know the difference. In my forum the setting save_spam was enabled and here it is disabled. I enabled the setting in this forum too, and within half an hour I spotted the first spam entry. So the project forum gets spam messages too, but the script rejects them instantly.

I changed the setting in my forum to the same value and hope never to see that mass of spam again.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Micha ⌂, Sunday, February 03, 2019, 19:17 (81 days ago) @ Auge

Hello,

In my forum the setting save_spam was enabled and here it is disabled. I enabled the setting in this forum too

Really? The option "Spam speichern (als Spam gekennzeichnet und nicht angezeigt)?" ("Save spam (flagged as spam and not displayed)?") was disabled.

I have enabled the option now. This means: SPAM will be saved but not shown to the users (only mods and admins have access to these entries) (default setting of MLF).

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their effectivity

by Auge ⌂ @, Sunday, February 03, 2019, 20:12 (81 days ago) @ Micha

Hello

In my forum the setting save_spam was enabled and here it is disabled. I enabled the setting in this forum too


Really? The option "Spam speichern (als Spam gekennzeichnet und nicht angezeigt)?" ("Save spam (flagged as spam and not displayed)?") was disabled.

Yes, I enabled it here for testing and disabled it again after getting my answers.

I have enabled the option now. This means: SPAM will be saved but not shown to the users (only mods and admins have access to these entries) (default setting of MLF).

Yes, and in the meantime we got 4 spam messages (2019-02-03 21:10 CET). I'll disable it again; we don't need to save spam messages from Nirvana.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Micha ⌂, Sunday, February 03, 2019, 20:16 (81 days ago) @ Auge

Hello,

I'll disable it again; we don't need to save spam messages from Nirvana.

In this case, messages wrongly classified as SPAM are deleted, too. For that reason, I enabled the option. You don't have to do anything because the messages will be deleted automatically after 168 hours.
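The behaviour discussed here can be sketched roughly as follows. This is an illustrative Python sketch, not MLF's PHP code: the function names and return values are made up, and only the save_spam switch and the 168-hour retention are taken from the posts.

```python
from datetime import datetime, timedelta

# 168 hours (7 days), the retention period mentioned in the post.
SPAM_RETENTION = timedelta(hours=168)


def handle_posting(is_spam, save_spam):
    """Decide what happens to a new posting (hypothetical outcome labels)."""
    if not is_spam:
        return "publish"
    # With save_spam enabled, spam is kept but hidden from normal users,
    # so a wrongly flagged posting can still be recovered by a moderator.
    return "store_hidden" if save_spam else "reject"


def is_due_for_deletion(stored_at, now, retention=SPAM_RETENTION):
    """Return True once a stored spam entry is older than the retention."""
    return now - stored_at >= retention
```

With save_spam disabled, a false positive is lost immediately; with it enabled, there is a one-week window to review it.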

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their effectivity

by Auge ⌂ @, Sunday, February 03, 2019, 20:49 (81 days ago) @ Micha

Hello

In this case, messages wrongly classified as SPAM are deleted, too. For that reason, I enabled the option.

O.k.

You don't have to do anything because the messages will be deleted automatically after 168 hours.

No problem, but it looks untidy. ;-)

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Auge ⌂ @, Monday, February 04, 2019, 12:37 (80 days ago) @ Micha

Hello

In this case, messages wrongly classified as SPAM are deleted, too. For that reason, I enabled the option. You don't have to do anything because the messages will be deleted automatically after 168 hours.

It's IMHO a very confusing situation at the moment. Until late morning we got spam postings every few minutes. Since then there has been a deceptive silence. It "sounds" like the calm before the storm.

Maybe a botnet was taken down? We will see …

[edit]No, in the early afternoon it started again.[/edit]

tss, tss, tss

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Micha ⌂, Monday, February 04, 2019, 14:26 (80 days ago) @ Auge

Hi,

[edit]No, in the early afternoon it started again.[/edit]

93, at the moment. The filter works fine. ;-)

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their effectivity

by Magma, Wednesday, January 23, 2019, 01:10 (93 days ago) @ Auge

Hi, is it Akismet that puts spam into the list shown below? And do you pay for the key or do you use the free option?

[image]


Also, I wanted to ask: what is the setting called that changes the time after which a logged-in user is automatically logged out when they have not loaded a page?


New anti spam features and their effectivity

by Auge ⌂ @, Wednesday, January 23, 2019, 07:27 (93 days ago) @ Magma

Hello

Is it Akismet that puts spam into the list shown below?

[image]

No, it is the result of a request to Akismet.

and do you pay for the key or do you use the free option?

There is an option to pay for the service? Anyway, I use the free option.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their effectivity

by Magma, Saturday, February 02, 2019, 21:10 (82 days ago) @ Auge

Auge, what about this:

Also, I wanted to ask: what is the setting called that changes the time after which a logged-in user is automatically logged out when they have not loaded a page?

What is the default time after which an inactive logged-in user is automatically logged out?


New anti spam features and their effectivity

by Auge ⌂ @, Sunday, February 03, 2019, 12:34 (81 days ago) @ Magma

Hello

Also, I wanted to ask: what is the setting called that changes the time after which a logged-in user is automatically logged out when they have not loaded a page?


What is the default time after which an inactive logged-in user is automatically logged out?

There is no setting to explicitly log out a user after a certain time. It depends on the session lifetime defined in the php.ini of your server. The only time restriction that more or less relates to this question is the lifetime of the cookies, which defaults to 30 days.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their efficiency, @Micha

by Auge ⌂ @, Monday, February 11, 2019, 12:36 (73 days ago) @ Auge

Hello

Today I installed a first alpha version of MLF 2.5 to test the new features. Currently there are no features visible to the audience, only background improvements. I had to solve a few issues with the installation script (the fixes will be in the repo this evening), but in the end I succeeded. :-)

I activated the statistical spam filter (B8) and wrote five entries. The entries were rated as not being spam:

mlf25_B8_rating
---------------------------
eid | spam | training_type
---------------------------
  1 |    0 |             0
  2 |    0 |             0
  3 |    0 |             0
  4 |    0 |             0
  5 |    0 |             0

mlf25_B8_wordlist
--------------------------------------
    token    | count_ham | count_spam
--------------------------------------
b8*dbversion |         3 |       NULL
B8*texts     |         0 |          0

I've absolutely no clue how to interpret the values. OK, my entries seem not to be spam. But are these entries the training data? Where are the words that were found? Does the filter work or not?

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their efficiency, @Micha

by Micha ⌂, Monday, February 11, 2019, 12:50 (73 days ago) @ Auge

Hello,

I activated the statistical spam filter (B8) and wrote five entries. The entries were rated as not being spam:

mlf25_B8_rating
---------------------------
eid | spam | training_type
---------------------------
  1 |    0 |             0
  2 |    0 |             0
  3 |    0 |             0
  4 |    0 |             0
  5 |    0 |             0

mlf25_B8_wordlist
--------------------------------------
    token    | count_ham | count_spam
--------------------------------------
b8*dbversion |         3 |       NULL
B8*texts     |         0 |          0

I've absolutely no clue how to interpret the values. OK, my entries seem not to be spam. But are these entries the training data? Where are the words that were found? Does the filter work or not?

You have to train the filter manually. Currently, no training data are available (empty table mlf25_B8_wordlist). Open your five postings and flag them explicitly as HAM or even SPAM. Once you flag a posting, mlf25_B8_wordlist should contain the words of the posting and a counter: how often did a word occur in a SPAM/HAM message?
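What "training" adds to the word list can be pictured with a small sketch. It is written in Python purely for illustration and is not B8's code: the tokenizer, the `train` function and the rule of counting each word once per posting are assumptions; only the idea of a per-word ham counter and spam counter comes from the table above.

```python
import re


def tokenize(text):
    """Split a posting into lowercase word tokens (a simplification)."""
    return re.findall(r"\w+", text.lower())


def train(wordlist, text, category):
    """Record every distinct word of one flagged posting.

    wordlist maps token -> [count_ham, count_spam], loosely mirroring the
    columns of mlf25_B8_wordlist; this layout is an assumption.
    """
    if category not in ("ham", "spam"):
        raise ValueError("category must be 'ham' or 'spam'")
    column = 0 if category == "ham" else 1
    for token in set(tokenize(text)):
        counts = wordlist.setdefault(token, [0, 0])
        counts[column] += 1


wordlist = {}
train(wordlist, "Hello, how do I update the forum?", "ham")
train(wordlist, "Hello, buy cheap pills now", "spam")
print(wordlist["hello"])   # [1, 1]
print(wordlist["pills"])   # [0, 1]
```

Until at least some postings have been flagged in both categories, such a table stays empty and the filter has nothing to base a rating on.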

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their efficiency, @Micha

by Auge ⌂ @, Monday, February 11, 2019, 13:03 (73 days ago) @ Micha

Hello

You have to train the filter manually. Currently, no training data are available (empty table mlf25_B8_wordlist). Open your five postings and flag them explicitly as HAM or even SPAM. Once you flag a posting, mlf25_B8_wordlist should contain the words of the posting and a counter: how often did a word occur in a SPAM/HAM message?

Ahh, ok. When I click the "no spam" link I have to click one of two buttons: "Report and flag as ham" or "Flag as ham only". Which one is the correct button?

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


New anti spam features and their efficiency, @Micha

by Micha ⌂, Monday, February 11, 2019, 13:10 (73 days ago) @ Auge

Hello,

Which one is the correct button?

I think, you have to report your decision.

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


New anti spam features and their efficiency, @Micha

by Auge ⌂ @, Monday, February 11, 2019, 13:20 (73 days ago) @ Micha

Hello

Which one is the correct button?


I think, you have to report your decision.

I took this button (without taking a look into the code) because the label promised the desired functionality, and the word list is getting bigger and bigger. I enabled the save-spam function to get real-life examples of real-life spam.

I'm curious what will happen. Thank you for your help.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


Efficiency of new anti spam feature

by Auge ⌂ @, Monday, February 11, 2019, 14:31 (73 days ago) @ Micha

Hello

After playing around a bit and feeding the filter with only two similar(!) spam entries [1], the filter detected the second spam posting as what it is. A third spam entry, in Russian, was also detected as spam without previous training.

I'm curious to see the number of false positives and negatives, and how much time and how many words it needs to become stable. I think, especially the different languages are a challenge for the script and the forum operators. What is white and what is black when a forum stores valid entries in different languages and the spam also comes in different languages, often overlapping with the languages of the valid entries?

At first sight it's a nice feature.

How can we provide a set of training data for the forum operators (in the light of different languages), so that they don't have to start from zero?

Tschö, Auge

[1]: I copied them from my forum to the development forum.

--
Trenne niemals Müll, denn er hat nur eine Silbe!


Efficiency of new anti spam feature

by Micha ⌂, Monday, February 11, 2019, 15:05 (73 days ago) @ Auge

Hello

Some information can be found on the developer website.

I'm curious to see the number of false positives and negatives, and how much time and how many words it needs to become stable.

Yes, me too. It depends on the HAM AND SPAM frequency of a forum. If you never train SPAM, all entries will be classified as HAM. Training both, the detection of spam and of ham, is the most important task. For that reason, do NOT classify spam posted by a sock puppet. Restrict the training to SPAM written by bots.

I think, especially the different languages are a challenge for the script and the forum operators.

If the forum is operated in, e.g., German and the spam entries are only in English, it will be quite easy to detect the spam (my opinion). So I don't think that one can give a more general answer on this topic.

often overlapping with the languages of the valid entries?

THAT is the challenge which is (hopefully) solved by Bayes statistics ;-)

How can we provide a set of training data for the forum operators (in the light of different languages), so that they don't have to start from zero?

This point is discussed in the B8 documentation. In the end, it makes no sense to provide such a database because of the different languages. A Russian forum does not benefit from a German or English database.

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


Efficiency of new anti spam feature

by Auge ⌂ @, Monday, February 11, 2019, 15:24 (73 days ago) @ Micha

Hello

some information can be found at the developer website.

I saw this page before but didn't read the whole page.

I'm curious to see the number of false positives and negatives, and how much time and how many words it needs to become stable.


Yes, me too. It depends on the HAM AND SPAM frequency of a forum. If you never train SPAM, all entries will be classified as HAM. Training both, the detection of spam and of ham, is the most important task. For that reason, do NOT classify spam posted by a sock puppet. Restrict the training to SPAM written by bots.

The entries I made were one-to-one copies of the originals from my forum.

I think, especially the different languages are a challenge for the script and the forum operators.


If the forum is operated in, e.g., German and the spam entries are only in English, it will be quite easy to detect the spam (my opinion).

But here, for example, we write mainly in English but also in German. The spam is often in English, Russian or Ukrainian. At the very least, we get both ham and spam in English. Here it is no problem because we use Akismet and Bad Behavior and we do not store the spam messages. This may be different in the future, when we activate the feature (at least in the first days or weeks).

So I don't think that one can give a more general answer on this topic.

Full ACK

A Russian forum does not benefit from a German or English database.

At first glance at least, but it may benefit from examples of spam in English.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


Efficiency of new anti spam feature

by Micha ⌂, Monday, February 11, 2019, 15:46 (73 days ago) @ Auge

Hello,

But here, for example, we write mainly in English but also in German. The spam is often in English, Russian or Ukrainian.

If we never get/got SPAM in German, no German message will be flagged as SPAM because no words are classified as SPAM.

Here it is no problem because we use Akismet and we do not store the spam messages.

??? Akismet stores the messages. However, I will remove Akismet for reasons of data privacy (in my forum). For that reason, I need to store the training data, NOT the spam messages. Maybe there is a misunderstanding: you have to flag the entry. If it is SPAM, you can delete the entry after flagging.

At first glance at least, but it may benefit from examples of spam in English.

No, please read the theory about Bayes. A forum about, e.g., flowers does not benefit from a training database where HAM is derived from postings about, e.g., animals. It's all about content. ;-) Of course, SPAM may be the same, but HAM isn't, and you need both of them. For that reason, it makes no sense to provide a database with general entries, whatever "general" may be.

https://nasauber.de/opensource/b8/readme.php#tips-on-operation

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


Efficiency of new anti spam feature

by Auge ⌂ @, Monday, February 11, 2019, 17:35 (73 days ago) @ Micha

Hello

But here, for example, we write mainly in English but also in German. The spam is often in English, Russian or Ukrainian.


If we never get/got SPAM in German, no German message will be flagged as SPAM because no words are classified as SPAM.

No, but at least messages in English can be ham or spam in this, the project forum.

Here it is no problem because we use Akismet and we do not store the spam messages.

??? Akismet stores the messages.

No, the forum can be configured to store the spam temporarily (setting save_spam = 1). But in this forum, messages that are detected as spam are not stored and are currently not handled by B8, because we use Bad Behavior, Akismet and Stop Forum Spam.

However, I will remove Akismet for reasons of data privacy (in my forum). For that reason, I need to store the training data, NOT the spam messages.

This is a reasonable step.

Maybe there is a misunderstanding: you have to flag the entry. If it is SPAM, you can delete the entry after flagging.

Yes, I know.

At first glance at least, but it may benefit from examples of spam in English.


No, please read the theory about Bayes. A forum about, e.g., flowers does not benefit from a training database where HAM is derived from postings about, e.g., animals. It's all about content. ;-) Of course, SPAM may be the same, but HAM isn't, and you need both of them. For that reason, it makes no sense to provide a database with general entries, whatever "general" may be.

Ham will be different, spam will be the same. A dataset with spam as an addendum for the existing training data is, at least for me, imaginable. But that's only an idée fixe, nothing we must think about.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


Efficiency of new anti spam feature

by Micha ⌂, Monday, February 11, 2019, 18:07 (73 days ago) @ Auge

Hello,

I'll write this in German because it is easier for me to explain that way:

No, but at least messages in English can be ham or spam in this, the project forum.

Yes, and? It is not the language that matters to the filter but the words.

Ham will be different, spam will be the same. A dataset with spam as an addendum for the existing training data is, at least for me, imaginable.

The idea behind the filter is that the entries are rated. ALL entries that were previously used as training data form the population for the individual rating (the sample). Now every word in the text is rated relative to this population: did this word occur more often in SPAM or in HAM messages? This yields a ratio, e.g. 70 % HAM / 30 % SPAM. Every word is rated according to this scheme, the probabilities are accumulated, and from them the probability is computed that the whole posting is HAM or SPAM = (100 % - HAM) %.
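The rating scheme described above can be sketched in a few lines. This is a simplified illustration in Python, not B8's implementation: it uses a plain naive-Bayes style combination of per-word ratios, and the function names, the neutral value 0.5 for unknown words and the clamping to 0.01 and 0.99 are assumptions. B8's actual formula may differ.

```python
import math


def word_spam_probability(count_ham, count_spam):
    """Share of spam among the trained occurrences of one word."""
    total = count_ham + count_spam
    if total == 0:
        return 0.5  # unknown word: treated as neutral (assumption)
    # Clamp so that a single word can never force a result of 0 or 1.
    return min(0.99, max(0.01, count_spam / total))


def spam_probability(tokens, wordlist):
    """Combine the per-word ratios into one probability for the posting.

    wordlist maps token -> (count_ham, count_spam).
    """
    log_spam = 0.0
    log_ham = 0.0
    for token in tokens:
        ham, spam = wordlist.get(token, (0, 0))
        p = word_spam_probability(ham, spam)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # P = prod(p) / (prod(p) + prod(1 - p)), evaluated in log space.
    diff = log_ham - log_spam
    if diff > 700:  # avoid overflow in math.exp for overwhelmingly hammy text
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

For a posting consisting of a single word with a 70 % HAM / 30 % SPAM history, this yields a spam probability of 0.3, and the ham probability is the remainder.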

A forum is usually topic-specific. Certain "technical terms", "buzzwords" etc. therefore tend to appear only there. If these words are not contained in the training database, the filter cannot rate them, so it has to restrict its rating to the remaining filler words such as "Hello", "Bye" etc. But if these standard words also occur in SPAM messages, the filter will tend to misclassify HAM entries. Wrong or unsuitable training data therefore generally lead to a deterioration rather than an improvement.

You are looking at it only from the perspective of SPAM messages. But without meaningful (read: forum- or content-specific) HAM messages it does not work, and the hit rate gets worse. You have to start from the totality of SPAM and HAM messages. If you have only learned to add, you cannot multiply: you were trained wrongly.
If you take over HAM training data from a foreign database that practically never occur in your forum, you gain no advantage but in any case a disadvantage. The only exception is forums on related topics, but I would dismiss that as a special case.

You are welcome to try it, but I see no added value.

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences


Efficiency of new anti spam feature

by Auge ⌂ @, Monday, February 11, 2019, 18:26 (73 days ago) @ Micha

Hello

Ham will be different, spam will be the same. A dataset with spam as an addendum for the existing training data is, at least for me, imaginable.


The idea behind the filter is that the entries are rated. ALL entries that were previously used as training data form the population for the individual rating (the sample). Now every word in the text is rated relative to this population: did this word occur more often in SPAM or in HAM messages? This yields a ratio, e.g. 70 % HAM / 30 % SPAM. Every word is rated according to this scheme, the probabilities are accumulated, and from them the probability is computed that the whole posting is HAM or SPAM = (100 % - HAM) %.

So far, so clear.

You are looking at it only from the perspective of SPAM messages. But without meaningful (read: forum- or content-specific) HAM messages it does not work, and the hit rate gets worse. You have to start from the totality of SPAM and HAM messages.

That is exactly why I wrote, as left standing in the quote above, of spam data as an "addendum for the existing training data", i.e. a supplement to the training data that already exist. I am well aware that ham differs between forums. Spam usually does not.

If you have only learned to add, you cannot multiply ….

Strictly speaking, that is exactly what I can do. But those are mathematical subtleties. :-)

If you take over HAM training data from a foreign database that practically never occur in your forum, you gain no advantage but in any case a disadvantage. The only exception is forums on related topics, but I would dismiss that as a special case.

But that is not what I want to do at all.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!


Efficiency of new anti spam feature

by Micha ⌂, Monday, February 11, 2019, 18:36 (73 days ago) @ Auge

Hello,

That is exactly why I wrote, as left standing in the quote above, of spam data as an "addendum for the existing training data", i.e. a supplement to the training data that already exist. I am well aware that ham differs between forums. Spam usually does not.

But that, too, is a fallacy ;-) By doing this you increase, e.g. for filler words such as "Hello", the probability that they are subsequently classified as SPAM. I don't know how to explain it, but in this context a supplement generally means a deterioration.

Let's stay with just the word "Hello". Suppose this word has so far occurred only 10 times in your forum. Each time it was HAM and was classified correctly. Now you import only the SPAM training data from a database. Naturally, the word "Hello" also occurred in SPAM messages there. Since you use only the SPAM data, "Hello" now has, instead of the previous 10 to 0, possibly 10 to 10000 (HAM to SPAM). A word that previously tended to count as HAM suddenly (and wrongly) becomes SPAM because the training data are unsuitable.
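The numbers in this example can be checked with a few lines of arithmetic. The sketch below is in Python and uses a raw spam share per word; the function name `spam_share` is made up, and a real filter may additionally normalise by the number of trained ham and spam texts, which this sketch ignores.

```python
def spam_share(count_ham, count_spam):
    """Raw share of spam among all trained occurrences of a word."""
    return count_spam / (count_ham + count_spam)


# The greeting before the import: 10 ham occurrences, 0 spam occurrences.
before = spam_share(10, 0)
# After importing spam-only training data: 10 ham, 10000 spam occurrences.
after = spam_share(10, 10000)

print(f"before import: {before:.3f}")  # before import: 0.000
print(f"after import:  {after:.3f}")   # after import:  0.999
```

The forum's own ten ham occurrences are simply drowned out by the imported counts, which is the deterioration described above.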

/Micha

--
applied-geodesy.org - OpenSource Least-Squares Adjustment Software for Geodetic Sciences
