Facebook crawler (Technics)

by Alfie ⌂, Vienna, Austria, (193 days ago)

Dear all,

Two days ago my forum was flooded with requests from the Facebook crawler. Likely someone linked to the forum on Facebook (I can’t check because I don’t have a FB account and no intention of getting one). At first I saw only a bizarrely high number of online ‘users’, and in the end my server gave up (likely due to too many database connections) and responded with an HTTP 500 (Internal Server Error).

Search for the user-agent string facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) in your server’s access.log to check. My daily logs grew from ≈10MB to more than 150MB. A Google search showed that I was not alone…
The crawler is aggressive and doesn’t give a shit about robots.txt. Therefore,

User-agent: FacebookBot
Disallow: /

does not help.
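For what it’s worth, the user-agent token above may be part of the problem: FacebookBot is a different Meta bot, while the link-preview crawler identifies itself as facebookexternalhit. The matching entry would be:

```
User-agent: facebookexternalhit
Disallow: /
```

That said, Facebook has stated that the preview crawler may fetch linked pages regardless of robots.txt, so even the correct token is no guarantee.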

Finally I used the workaround suggested at stackoverflow.

--
Cheers,
Alfie (Helmut Schütz)
BEBA-Forum (v1.8β)

Facebook crawler

by Auge ⌂, (192 days ago) @ Alfie

Hello Alfie

Thank you for your report.

Two days ago my forum was flooded with requests from the Facebook crawler. ... At first I saw only a bizarrely high number of online ‘users’, and in the end my server gave up (likely due to too many database connections) and responded with an HTTP 500 (Internal Server Error).

Oha.

Search for the user-agent string facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) in your server’s access.log to check. My daily logs grew from ≈10MB to more than 150MB. A Google search showed that I was not alone…
The crawler is aggressive and doesn’t give a shit about robots.txt. Therefore ...

User-agent: FacebookBot
Disallow: /

does not help.

Oha again.

Finally I used the workaround suggested at stackoverflow.

Hmm, shouldn’t the forum’s internal spam protection work similarly to the described workaround?

Workaround: (as shown in the link above)

<?php
$ua = $_SERVER['HTTP_USER_AGENT'];

if (preg_match('/facebookexternalhit/si', $ua)) {
    header('Location: no_fb_page.php');
    die();
}
?>
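A small aside on that workaround: the Location header still makes the crawler issue a follow-up request for no_fb_page.php. A variant (my sketch, not from the Stack Overflow answer; the helper name is made up) that answers with 403 and stops immediately would be cheaper:

```php
<?php
// Sketch: detect Facebook's preview crawler and refuse with 403 instead of
// redirecting it to another page. Case-insensitive substring check, so it
// also matches future versions like facebookexternalhit/2.0.
function is_facebook_hit(string $ua): bool
{
    return stripos($ua, 'facebookexternalhit') !== false;
}

if (is_facebook_hit($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    http_response_code(403); // one cheap response, no follow-up request
    exit;
}
```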

MLF1 (version 1.7.9, inc.php line #58 ff.)

if (trim($data['list']) != '') {
 $banned_ips_array = explode(',',trim($data['list']));
 if (in_array($_SERVER["REMOTE_ADDR"], $banned_ips_array)) {
  die($lang['ip_no_access']);
 }
}

[edit]Forget MLF1; the banlist feature of the software supports usernames, IPs and words/strings, but no user agent strings.[/edit]

MLF2 (version 20220803.1, includes/main.php line #45 ff.)

 if (isset($user_agents) && !empty($user_agents) && trim($user_agents) != '') {
  $banned_user_agents = explode("\n", $user_agents);
  if (is_user_agent_banned($_SERVER['HTTP_USER_AGENT'], $banned_user_agents)) raise_error('403');
 }

... and function is_user_agent_banned (includes/functions.inc.php line #2167 ff.) ...

function is_user_agent_banned($user_agent, $banned_user_agents) {
  foreach ($banned_user_agents as $banned_user_agent) {
    if (strpos($user_agent, $banned_user_agent) !== false) { // case sensitive, faster
      return true;
    }
  }
  return false;
}

The function searches only for the exact string. Not ideal; a search for a partially matching string would be better. That way it would also match the following search strings.

+http://www.facebook.com/externalhit_uatext.php
facebookexternalhit
facebookexternalhit/1.1

Especially recognising facebookexternalhit would be nice, because this would make the check independent of the version string. Currently the match would break if Facebook ran a version with a UA string other than 1.1 (provided that the UA string otherwise remained unchanged).
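A case-insensitive partial match along those lines could look like this (a sketch of the idea, not MLF2 code; stripos instead of strpos trades a little speed for robustness):

```php
<?php
// Sketch: a banned entry matches anywhere in the user agent string,
// case-insensitively, so a single list entry "facebookexternalhit"
// catches 1.1, 2.0 or any future version.
function is_user_agent_banned_partial(string $user_agent, array $banned_user_agents): bool
{
    foreach ($banned_user_agents as $banned) {
        $banned = trim($banned);
        if ($banned !== '' && stripos($user_agent, $banned) !== false) {
            return true;
        }
    }
    return false;
}
```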

For now I added the user agent string you mentioned to the banned UA-strings list.

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!

Facebook crawler

by Alfie ⌂, Vienna, Austria, (192 days ago) @ Auge
edited by Alfie,

Hi Auge,

Thank you for your report.

Welcome.

Oha.
Oha again.

Right?

Hmm, shouldn’t the forum’s internal spam protection work similarly to the described workaround?

MLF1 (version 1.7.9, inc.php line #58 ff.)

if (trim($data['list']) != '') {
$banned_ips_array = explode(',',trim($data['list']));
if (in_array($_SERVER["REMOTE_ADDR"], $banned_ips_array)) {
die($lang['ip_no_access']);
}
}

Not sure. AFAIK the FB crawler uses 500+ IPv4 and 2,000+ IPv6 addresses. According to my access.log, my forum was crawled from 67 different IPs within three days.

MLF2

Not my cup of tea… ;-)

The function searches only for the exact string. Not ideal; a search for a partially matching string would be better. That way it would also match the following search strings.

+http://www.facebook.com/externalhit_uatext.php
facebookexternalhit
facebookexternalhit/1.1

Especially recognising facebookexternalhit would be nice, because this would make the check independent of the version string. Currently the match would break if Facebook ran a version with a UA string other than 1.1 (provided that the UA string otherwise remained unchanged).

Right, makes sense. However, regexes are not my friends.

--
Cheers,
Alfie (Helmut Schütz)
BEBA-Forum (v1.8β)

Facebook crawler

by Auge ⌂, (192 days ago) @ Alfie
edited by Alfie,

Hello Alfie

Hmm, shouldn’t the forum’s internal spam protection work similarly to the described workaround?

MLF1 (version 1.7.9, inc.php line #58 ff.)

if (trim($data['list']) != '') {
$banned_ips_array = explode(',',trim($data['list']));
if (in_array($_SERVER["REMOTE_ADDR"], $banned_ips_array)) {
die($lang['ip_no_access']);
}
}


Not sure. AFAIK the FB crawler uses 500+ IPv4 and 2,000+ IPv6 addresses. According to my access.log, my forum was crawled from 67 different IPs within three days.

Did you see my marked edit in my last posting before you wrote yours? After sending my posting to the forum, I became aware that MLF1 doesn’t have a banlist for user agents.

Handling this via IPs or IP ranges is a pointless task, not least because MLF1 does not support IP ranges.
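Just to illustrate what range support would involve: blocking by network rather than by literal address needs a CIDR test like the following sketch (the range in the comment is only an example, not necessarily one of Meta’s current allocations):

```php
<?php
// Sketch: true if the IPv4 address lies inside the CIDR block,
// e.g. ip_in_cidr('69.171.251.7', '69.171.224.0/19').
function ip_in_cidr(string $ip, string $cidr): bool
{
    [$subnet, $bits] = explode('/', $cidr, 2);
    // Build the netmask; a /0 prefix matches every address.
    $mask = ((int)$bits === 0) ? 0 : ((~0 << (32 - (int)$bits)) & 0xFFFFFFFF);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}
```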

The function searches only for the exact string. Not ideal; a search for a partially matching string would be better. That way it would also match the following search strings.

+http://www.facebook.com/externalhit_uatext.php
facebookexternalhit
facebookexternalhit/1.1

Especially recognising facebookexternalhit would be nice, because this would make the check independent of the version string. Currently the match would break if Facebook ran a version with a UA string other than 1.1 (provided that the UA string otherwise remained unchanged).


Right, makes sense. However, regexes are not my friends.

Me too, as you know. :-)

In principle it would be possible to search for the strings with plain string functions, but such a regex only has to be written once. For testing regular expressions I always (well, every now and then) use regex101.com.
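For example, a single pattern written once covers the UA with or without a version suffix (the pattern is my suggestion, worth pasting into regex101.com to check):

```php
<?php
// Sketch: case-insensitive match for facebookexternalhit with an optional
// version suffix; preg_match() returns 1 on a match and 0 on no match.
function matches_fb_crawler(string $ua): bool
{
    return preg_match('/facebookexternalhit(?:\/[\d.]+)?/i', $ua) === 1;
}
```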

Tschö, Auge

--
Trenne niemals Müll, denn er hat nur eine Silbe!

Facebook crawler

by Alfie ⌂, Vienna, Austria, (191 days ago) @ Auge

Hi Auge,

Did you see my marked edit in my last posting before you wrote yours?

No.

[…] regexes are not my friends.


Me too, as you know. :-)

In principle it would be possible to search for the strings with plain string functions, but such a regex only has to be written once. For testing regular expressions I always (well, every now and then) use regex101.com.

I forgot this goodie, THX! Therefore,

$ua = $_SERVER['HTTP_USER_AGENT'];
if (preg_match('/facebook/i', $ua)) {
  header('Location: no_fb_page.php');
  die();
}

--
Cheers,
Alfie (Helmut Schütz)
BEBA-Forum (v1.8β)
