Do we have to have a robots.txt (Technics)

by crossitover, Thursday, February 05, 2009, 16:41 (5556 days ago)

Hi guys,

Here is my situation: the Google spider (known as Googlebot) visited my site a couple of times, but according to my www log it appears that robots.txt is the only file Googlebot is interested in, because every time it just tried to read robots.txt and no other files (like the content of my database, which I want to be indexed).

However, the thing is that I don't actually have a robots.txt file in any directory of my forum. Could this be the reason Google is not indexing my forum? My understanding is that if I don't use a robots.txt, I'm welcoming Google to visit and index everything on my site. But it looks like Google didn't find robots.txt and then left!

Note that I do have a 404 error page defined.

Any input would be highly appreciated...

Do we have to have a robots.txt

by Auge, Thursday, February 05, 2009, 20:07 (5556 days ago) @ crossitover

Hello

(like the content of my database, which I want to be indexed).

The database itself will not be indexed. A robot only indexes the content of your site's pages. That can include the forum (and the pages of its threads), too.

However, the thing is that I don't actually have a robots.txt file in any directory of my forum. Could this be the reason Google is not indexing my forum? My understanding is that if I don't use a robots.txt, I'm welcoming Google to visit and index everything on my site. But it looks like Google didn't find robots.txt and then left!

You could put a robots.txt in every directory, but crawlers only request the one in the root directory of the website (where you'll find your homepage, e.g. index.*, default.* ...); copies in subdirectories are ignored.

I don't know Googlebot's actual behaviour in detail, but I know that Googlebot reads robots.txt and follows its instructions. A missing robots.txt (a 404 response) is not a problem: crawlers then treat the whole site as allowed. Note: in robots.txt you can only forbid the crawling of directories and/or files.
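
If you want to make the "everything is allowed" policy explicit, a minimal robots.txt in the root directory does it (an empty Disallow line means nothing is blocked):

User-agent: *
Disallow:

Conversely, Disallow: / would lock the named user agent out of the whole site.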

Tschö, Auge

Do we have to have a robots.txt - Google

by Göran B, Monday, June 22, 2009, 20:28 (5419 days ago) @ Auge

The database itself will not be indexed. A robot only indexes the content of your site's pages. That can include the forum (and the pages of its threads), too.

I don't think that's true. All postings in our forum have been indexed by Google and can be found with a Google search.

That is actually causing us a problem. We have a database with 155,000 entries, and when Googlebot starts to search through the database, the forum can't be used by other users for about two hours. This happens every day.

We could stop it from indexing our forum with robots.txt, but the problem is that we would like to have our cake and eat it too: we want our forum to be indexed!

Does anyone have good advice on how to keep the forum indexed without running into performance problems? Are there any MySQL or server parameters to adjust when you are running mlf with so many entries and over 100 concurrent users?

Do we have to have a robots.txt - Google

by Auge, Monday, June 22, 2009, 23:10 (5419 days ago) @ Göran B

Hello

The database itself will not be indexed. A robot only indexes the content of your site's pages. That can include the forum (and the pages of its threads), too.


I don't think that's true. All postings in our forum have been indexed by Google and can be found with a Google search.

Just to make it clear:

The Googlebot (like other bots) is an HTTP client like any other browser. It finds a website via a link and follows all accessible links (with all potential parameters) on the pages of the site, unless a robots.txt forbids it[1] for some or all directories, or a meta tag with name="robots" does: its content attribute takes values from [index|follow|noindex|nofollow|all], where noindex forbids indexing of the page itself and nofollow forbids following the links on it.
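
As an illustration (this snippet is not from mlf itself), such a meta element in the page's head would look like this:

<meta name="robots" content="noindex, nofollow">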

So the robot follows the links on a page and reaches other pages that way. The robot does not read the database itself, only the pages which contain the values from the database. In the case of mlf, a robot will (if it's not forbidden) find and index the web pages with the postings, but not the postings in the database.

That is actually causing us a problem. We have a database with 155,000 entries, and when Googlebot starts to search through the database, the forum can't be used by other users for about two hours. This happens every day.

I don't know how to avoid the daily visits by search robots. Maybe the Google Webmaster Tools (in the case of Googlebot) offer a way to control it? Other search engines should offer comparable tools.

[1] Attention: a robot may follow the instructions in robots.txt or in the 'robots' meta element, but it is not obliged to!

Tschö, Auge


Do we have to have a robots.txt - Google

by Alfie, Vienna, Austria, Tuesday, June 23, 2009, 01:03 (5419 days ago) @ Auge
edited by Alfie, Tuesday, June 23, 2009, 13:46

Hi Auge and Göran!

Everything Auge said about indexing is correct.

That is actually causing us a problem. We have a database with 155,000 entries, and when Googlebot starts to search through the database, the forum can't be used by other users for about two hours. This happens every day.

I don't know how to avoid the daily visits by search robots. Maybe the Google Webmaster Tools (in the case of Googlebot) offer a way to control it?

Get a Google account first and go to Google’s webmaster tools. After verifying that you are the owner of the site (you get a 2-byte file [containing just LF/CR] named googleXXXXXXXXXXXXXXXX.html, where XXXXXXXXXXXXXXXX is a unique ID; upload the file to your root directory), you may change the settings:
Dashboard > Site configuration > Settings > Crawl rate > [o] Set custom crawl rate (the slowest rate is 0.002 requests/second = 1 per 500 seconds).

If this measure doesn’t help, you have to go with an XML sitemap. For an example, see the one on my main site (I don’t need one for my forum with just 3000+ posts). In a sitemap you can set the crawling frequency for any resource to one of the following values: always, hourly, daily, weekly, monthly, yearly, never.
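
A minimal sitemap entry following the sitemaps.org protocol could look like this (just a sketch; the URL is a placeholder, not a real forum address):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/forum/index.php</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Note that changefreq is only a hint to crawlers, not a command; Google may still crawl more or less often.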

Another hint: avoid duplicate content (see this thread in the 1.x forum).

Other search engines should offer comparable tools.

A nasty bot is Yahoo! Slurp. The only way I found to decrease its access rate is two lines in robots.txt:
User-agent: Slurp
Crawl-delay: 10

According to Yahoo!, a value of 10 is the slowest rate; in my experience, higher numbers are ignored.

For MSN-Bot (formerly MS Live Search, now Bing Beta):
User-agent: msnbot
Crawl-delay: XXX

where XXX is the number of seconds between requests.
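
Both blocks can live in the same robots.txt at the site root; each bot only obeys the block whose User-agent matches it. A sketch (the delay of 60 seconds for msnbot is just an example value):

User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 60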

@Alex: I would suggest modifying the scripts so that links to the contact form - whether to the admin or to a user - get the attribute rel="nofollow", e.g. instead of

<a href="index.php?mode=contact" title="foo">bar</a>

use

<a href="index.php?mode=contact" title="foo" rel="nofollow">bar</a>

--
Cheers,
Alfie (Helmut Schütz)
BEBA-Forum (v1.8β)
