
Facebook crawler (Technics)

by Auge, (192 days ago) @ Alfie

Hello Alfie

Thank you for your report.

Two days ago my forum was flooded with requests from the Facebook crawler. ... First I saw only a bizarrely high number of online ‘users’, and in the end my server gave up (likely due to too many database connections) and responded with an HTTP 500 (Internal Server Error).

Oha.

Search for the user-agent string facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) in your server’s access.log to check. My daily logs grew from ≈10 MB to more than 150 MB. A Google search showed that I was not alone…
The crawler is aggressive and doesn’t give a shit about robots.txt. Therefore ...

User-agent: FacebookBot
Disallow: /

does not help.
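To put a number on the flood, you can filter the log for the crawler token. A minimal sketch — the sample log lines and the log path mentioned in the comment are made up for illustration, not taken from my server:

```php
<?php
// Illustration: count requests whose user agent contains the crawler token.
// The sample lines are invented; in practice you would read your real
// access.log (e.g. /var/log/apache2/access.log -- path is an assumption).
$lines = [
    '203.0.113.5 - - "GET /forum/index.php HTTP/1.1" 200 "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"',
    '198.51.100.7 - - "GET /forum/ HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
];

$hits = array_filter($lines, function ($line) {
    // strpos() !== false means "substring found anywhere in the line"
    return strpos($line, 'facebookexternalhit') !== false;
});

echo count($hits), "\n"; // prints 1
```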

Oha again.

Finally I used the workaround suggested on Stack Overflow.

Hmm, shouldn’t the forum’s internal spam-protection method work similarly to the described workaround?

Workaround (as shown in the link above):

<?php
$ua = $_SERVER['HTTP_USER_AGENT'];

if (preg_match('/facebookexternalhit/si', $ua)) {
    header('Location: no_fb_page.php');
    die();
}
?>
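A variant of the same idea that answers with a 403 directly instead of redirecting — just a sketch, and the helper function name is mine, not from the Stack Overflow answer:

```php
<?php
// Sketch: refuse the Facebook crawler with a 403 instead of a redirect.
// is_facebook_crawler() is a hypothetical helper, named here for clarity.
function is_facebook_crawler($ua) {
    return (bool) preg_match('/facebookexternalhit/i', (string) $ua);
}

if (is_facebook_crawler($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    http_response_code(403); // built into PHP since 5.4
    die('Forbidden');
}
```

This spares the crawler (and your server) the extra round trip of the redirect.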

MLF1 (version 1.7.9, inc.php line #58 ff.)

if (trim($data['list']) != '') {
    $banned_ips_array = explode(',', trim($data['list']));
    if (in_array($_SERVER["REMOTE_ADDR"], $banned_ips_array)) {
        die($lang['ip_no_access']);
    }
}

[edit]Forget MLF1: the banlist feature of the software supports usernames, IPs and words/strings, but no user-agent strings.[/edit]

MLF2 (version 20220803.1, includes/main.php line #45 ff.)

if (isset($user_agents) && !empty($user_agents) && trim($user_agents) != '') {
    $banned_user_agents = explode("\n", $user_agents);
    if (is_user_agent_banned($_SERVER['HTTP_USER_AGENT'], $banned_user_agents)) raise_error('403');
}

... and function is_user_agent_banned (includes/functions.inc.php line #2167 ff.) ...

function is_user_agent_banned($user_agent, $banned_user_agents) {
    foreach ($banned_user_agents as $banned_user_agent) {
        if (strpos($user_agent, $banned_user_agent) !== false) { // case sensitive, faster
            return true;
        }
    }
    return false;
}

The function searches only for exact strings. Not ideal; a search for a partially matching string would be better. That way it would also match the following search strings:

+http://www.facebook.com/externalhit_uatext.php
facebookexternalhit
facebookexternalhit/1.1

Especially recognising facebookexternalhit would be nice, because this would make the check independent of the version string. Currently the match would break if Facebook ran a version with a UA string other than 1.1 (provided the UA string otherwise remained unchanged).
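A case-insensitive substring check along those lines could look like this — a sketch only; the function name is mine, not MLF’s:

```php
<?php
// Sketch: substring match, so facebookexternalhit/1.1, /2.0, ... are all caught.
// ua_matches_ban() is a hypothetical name, not part of MLF2.
function ua_matches_ban($user_agent, array $banned_user_agents) {
    foreach ($banned_user_agents as $needle) {
        $needle = trim($needle);
        // stripos(): case-insensitive, found anywhere in the UA string
        if ($needle !== '' && stripos($user_agent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

$banned = ['facebookexternalhit'];
var_dump(ua_matches_ban('facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)', $banned)); // bool(true)
var_dump(ua_matches_ban('facebookexternalhit/2.0', $banned)); // bool(true)
var_dump(ua_matches_ban('Mozilla/5.0 (X11; Linux x86_64)', $banned)); // bool(false)
```

The trim() also guards against stray whitespace from the explode("\n", …) that builds the banlist.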

For now I have added the user-agent string you mentioned to the list of banned UA strings.

Tschö, Auge

--
Never separate Müll (waste), for it has only one syllable!

