Hi Auge and Göran!
Everything Auge said about indexing is correct.
That is actually causing us a problem. We have a database with 155,000 entries, and when the Googlebot starts to crawl the database, the forum can't be used by other users for about two hours. This happens every day.
I don't know how to avoid this daily crawl by search robots. Maybe the Google Webmaster Tools (in the case of Googlebot) offer some way to control it?
First get a Google account and go to Google's Webmaster Tools. After verifying that you are the owner of the site (you get a 2-byte file [containing just LF/CR] named googleXXXXXXXXXXXXXXXX.html, where XXXXXXXXXXXXXXXX is a unique ID; upload the file to your root directory), you can change the settings:
Dashboard > Site configuration > Settings > Crawl rate > [o] Set custom crawl rate (the slowest rate is 0.002 requests/second = 1 per 500 seconds).
If this measure doesn’t help, you have to go with an XML sitemap. For an example, see the one on my main site (I don’t need one for my forum with just 3000+ posts). In a sitemap you can set the crawling frequency for any resource to one of the following values: always, hourly, daily, weekly, monthly, yearly, never.
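For illustration, a minimal sitemap could look like the following sketch (the URLs, frequencies, and priority values are placeholders, not taken from any actual site):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- forum start page: changes often, crawl daily -->
  <url>
    <loc>http://www.example.com/forum/</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <!-- an old thread: rarely changes, crawl yearly at most -->
  <url>
    <loc>http://www.example.com/forum/index.php?mode=thread&amp;id=123</loc>
    <changefreq>yearly</changefreq>
    <priority>0.3</priority>
  </url>
</urlset>
```

Note that ampersands in query-string URLs must be escaped as &amp;amp; inside the XML, and that changefreq is a hint to the crawler, not a binding limit.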
Another hint: avoid double content (see this thread in the 1.x-forum).
Other search engines should offer comparable tools to control it.
A nasty bot is Yahoo! Slurp. The only way I found to decrease its access rate is two lines in robots.txt:
User-agent: Slurp
Crawl-delay: 10
According to Yahoo!, a value of 10 is the slowest rate; in my experience a higher number is ignored.
For MSN-Bot (formerly MS Live Search, now Bing Beta):
User-agent: msnbot
Crawl-delay: XXX
where XXX is the number of seconds between requests.
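Putting both together, a complete robots.txt throttling the two bots might look like this (the 120-second value for msnbot is just an example; choose whatever your server can handle, and separate the records with a blank line):

```text
# Throttle Yahoo! Slurp (10 is the slowest rate Yahoo! honours)
User-agent: Slurp
Crawl-delay: 10

# Throttle msnbot: one request every 120 seconds (example value)
User-agent: msnbot
Crawl-delay: 120
```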
@Alex: I would suggest modifying the scripts so that links to the contact form (whether to the admin or to a user) are given the attribute rel="nofollow", e.g. instead of
<a href="index.php?mode=contact" title="foo">bar</a>
use
<a href="index.php?mode=contact" title="foo" rel="nofollow">bar</a>