Facebook crawler (Technics)
Hello Alfie
Thank you for your report.
Two days ago my forum was flooded with requests from the Facebook crawler. ... First I saw only a bizarrely high number of online ‘users’, and in the end my server gave up (likely due to too many database connections) and responded with an HTTP 500 (Internal Server Error).
Oha.
Search for the user-agent string
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
in your server’s access.log
to check. My daily logs grew from ≈10 MB to more than 150 MB. A Google search showed that I was not alone…
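To get the numbers, one can count the matching lines in the log. A quick sketch (the helper name count_crawler_hits and the log path are my assumptions, adjust to your setup):

```php
<?php
// Sketch: count requests from the Facebook crawler in an access log.
// count_crawler_hits() is a made-up helper name and the log path below
// is an assumption -- adjust both to your own server.
function count_crawler_hits(string $logfile, string $needle = 'facebookexternalhit'): int
{
    $fh = fopen($logfile, 'r');
    if ($fh === false) {
        return 0;
    }
    $hits = 0;
    while (($line = fgets($fh)) !== false) {
        if (strpos($line, $needle) !== false) {
            $hits++;
        }
    }
    fclose($fh);
    return $hits;
}

if (is_readable('/var/log/apache2/access.log')) {
    echo count_crawler_hits('/var/log/apache2/access.log'), "\n";
}
```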
The crawler is aggressive and doesn’t give a shit about the robots.txt. Therefore ...

User-agent: FacebookBot
Disallow: /

does not help.
Oha again.
Finally I used the workaround suggested on Stack Overflow.
Hmm, shouldn’t the forum’s internal spam protection work similarly to the described workaround?
Workaround: (as shown in the link above)
<?php
$ua = $_SERVER['HTTP_USER_AGENT'];
if (preg_match('/facebookexternalhit/si', $ua)) {
    header('Location: no_fb_page.php');
    die();
}
?>
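If one doesn’t want to maintain an extra no_fb_page.php, a variant of the same idea can answer with HTTP 403 directly. A sketch (the helper name should_block_crawler is my invention, not from the linked answer):

```php
<?php
// Sketch: answer the crawler with a plain 403 instead of a redirect.
// should_block_crawler() is a hypothetical helper, not from the linked answer.
function should_block_crawler(string $ua): bool
{
    // Case-insensitive substring match, independent of the "/1.1" version suffix.
    return stripos($ua, 'facebookexternalhit') !== false;
}

if (should_block_crawler($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    http_response_code(403);
    exit;
}
?>
```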
MLF1 (version 1.7.9, inc.php line #58 ff.)
if (trim($data['list']) != '') {
    $banned_ips_array = explode(',', trim($data['list']));
    if (in_array($_SERVER["REMOTE_ADDR"], $banned_ips_array)) {
        die($lang['ip_no_access']);
    }
}
[edit]Forget MLF1: the software’s banlist feature supports usernames, IPs and words/strings, but not user agent strings.[/edit]
MLF2 (version 20220803.1, includes/main.php line #45 ff.)
if (isset($user_agents) && !empty($user_agents) && trim($user_agents) != '') {
    $banned_user_agents = explode("\n", $user_agents);
    if (is_user_agent_banned($_SERVER['HTTP_USER_AGENT'], $banned_user_agents))
        raise_error('403');
}
... and function is_user_agent_banned
(includes/functions.inc.php line #2167 ff.) ...
function is_user_agent_banned($user_agent, $banned_user_agents)
{
    foreach ($banned_user_agents as $banned_user_agent) {
        if (strpos($user_agent, $banned_user_agent) !== false) // case sensitive, faster
        {
            return true;
        }
    }
    return false;
}
The function performs a case-sensitive substring search (strpos), not an exact comparison, so each of the following banlist entries would already match the crawler’s UA string:
+http://www.facebook.com/externalhit_uatext.php
facebookexternalhit
facebookexternalhit/1.1
Banning just facebookexternalhit
would be the most robust choice, because it is independent of the version string. An entry like facebookexternalhit/1.1 would stop matching if Facebook ever ran a version with another UA string than 1.1
(provided that the UA string would otherwise remain unchanged).
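A version-independent and additionally case-insensitive check could look like this (a sketch, not a patch against the MLF2 source; is_user_agent_banned_ci is a made-up name):

```php
<?php
// Sketch: case-insensitive variant of MLF2's banlist check.
// is_user_agent_banned_ci() is a hypothetical name, not in the MLF2 source.
function is_user_agent_banned_ci(string $user_agent, array $banned_user_agents): bool
{
    foreach ($banned_user_agents as $banned_user_agent) {
        $banned_user_agent = trim($banned_user_agent);
        if ($banned_user_agent !== ''
            && stripos($user_agent, $banned_user_agent) !== false) {
            return true;
        }
    }
    return false;
}
```

With that, a single banlist entry facebookexternalhit would catch 1.1 as well as any later version.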
For now I added the user agent string you mentioned to the banned UA-strings list.
Tschö, Auge
--
Trenne niemals Müll, denn er hat nur eine Silbe! (Never separate garbage, for it has only one syllable!)