Hello Alfie
Thank you for your report.
Two days ago my forum was flooded with requests from the Facebook crawler. ... First I saw only a bizarrely high number of online ‘users’, and in the end my server gave up (likely due to too many database connections) and responded with HTTP 500 (Internal Server Error).
Oha.
To check, search your server’s access.log for the user-agent string
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
My daily logs grew from ≈10 MB to more than 150 MB. A Google search showed that I was not alone…
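To give the log check a concrete shape, here is a small PHP sketch that counts matching lines. The helper name count_ua_hits and the sample log are mine; in practice you would point it at your server’s real access.log, whose path depends on your setup.

```php
<?php
// Sketch: count how many log lines contain a given user-agent substring.
// In practice, pass the path to your server's access.log instead.
function count_ua_hits(string $logfile, string $needle): int {
    $hits = 0;
    foreach (file($logfile, FILE_IGNORE_NEW_LINES) as $line) {
        if (strpos($line, $needle) !== false) {
            $hits++;
        }
    }
    return $hits;
}

// Demo with a tiny two-line sample log instead of a real access.log:
$sample = tempnam(sys_get_temp_dir(), 'log');
file_put_contents($sample,
    "1.2.3.4 - - \"GET /forum/ HTTP/1.1\" 200 \"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)\"\n" .
    "5.6.7.8 - - \"GET /forum/ HTTP/1.1\" 200 \"Mozilla/5.0\"\n");
echo count_ua_hits($sample, 'facebookexternalhit'); // prints 1
```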
The crawler is aggressive and doesn’t give a shit about the robots.txt. Therefore ...
User-agent: FacebookBot
Disallow: /
does not help.
Oha again.
Finally I used the workaround suggested on Stack Overflow.
Hmm, shouldn’t the forum’s internal spam protection work similarly to the described workaround?
Workaround: (as shown in the link above)
<?php
$ua = $_SERVER['HTTP_USER_AGENT'];
if (preg_match('/facebookexternalhit/si', $ua)) {
    header('Location: no_fb_page.php');
    die();
}
?>
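For completeness, a slightly different sketch of the same idea that answers with a 403 instead of redirecting, so the crawler gets no page at all. This is my variant, not the code from the linked answer; is_facebook_crawler is my helper name, and stripos makes the check case-insensitive and independent of the version number in the UA string.

```php
<?php
// Sketch: refuse the Facebook crawler with a 403 instead of a redirect.
// is_facebook_crawler() is a hypothetical helper, not part of any forum software.
function is_facebook_crawler(string $ua): bool {
    // Case-insensitive substring check, so any version of
    // "facebookexternalhit/x.y" matches.
    return stripos($ua, 'facebookexternalhit') !== false;
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (is_facebook_crawler($ua)) {
    http_response_code(403); // refuse instead of redirecting
    exit('Forbidden');
}
```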
MLF1 (version 1.7.9, inc.php line #58 ff.)
if (trim($data['list']) != '') {
    $banned_ips_array = explode(',', trim($data['list']));
    if (in_array($_SERVER["REMOTE_ADDR"], $banned_ips_array)) {
        die($lang['ip_no_access']);
    }
}
[edit]Forget MLF1; the banlist feature of that version supports usernames, IPs and words/strings, but not user-agent strings.[/edit]
MLF2 (version 20220803.1, includes/main.php line #45 ff.)
if (isset($user_agents) && !empty($user_agents) && trim($user_agents) != '') {
    $banned_user_agents = explode("\n", $user_agents);
    if (is_user_agent_banned($_SERVER['HTTP_USER_AGENT'], $banned_user_agents)) raise_error('403');
}
... and the function is_user_agent_banned
(includes/functions.inc.php line #2167 ff.) ...
function is_user_agent_banned($user_agent, $banned_user_agents) {
    foreach ($banned_user_agents as $banned_user_agent) {
        if (strpos($user_agent, $banned_user_agent) !== false) { // case sensitive, faster
            return true;
        }
    }
    return false;
}
The function searches only for the exact string. Not ideal; a search for a partially matching string would be better. That way it would also match the following search strings:
+http://www.facebook.com/externalhit_uatext.php
facebookexternalhit
facebookexternalhit/1.1
Especially recognising facebookexternalhit
would be nice, because that would make the check independent of the version string. Currently the match would break if Facebook ran a version with a UA string other than 1.1
(provided that the UA string otherwise remained unchanged).
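Such a version-independent check could look like the following. This is my sketch against the code quoted above, not a patch from the MLF2 tree; besides matching case-insensitively via stripos, it trims each banlist line, since explode("\n", ...) can leave a stray "\r" behind with Windows line endings.

```php
<?php
// Sketch of a more tolerant variant of is_user_agent_banned().
// Differences from the quoted original: each banlist line is trimmed
// (explode("\n", ...) can leave "\r" or spaces behind) and the match
// is case-insensitive via stripos().
function is_user_agent_banned_sketch(string $user_agent, array $banned_user_agents): bool {
    foreach ($banned_user_agents as $banned_user_agent) {
        $banned_user_agent = trim($banned_user_agent);
        if ($banned_user_agent === '') {
            continue; // skip empty banlist lines
        }
        if (stripos($user_agent, $banned_user_agent) !== false) {
            return true;
        }
    }
    return false;
}
```

With a bare facebookexternalhit entry in the banlist, any future version such as facebookexternalhit/1.2 would still be caught.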
For now I added the user agent string you mentioned to the banned UA-strings list.
Tschö, Auge