Googlebot disregarding robots.txt
Alfred Molon

2006-08-18, 6:36 pm

For a few days now, Googlebot has been crawling through directories of
my site that are disallowed in robots.txt. Must be a bug in their
software, I'd guess. Does anybody else have this problem?
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe
David Cary Hart

2006-08-18, 6:36 pm

On Fri, 18 Aug 2006 22:40:55 +0200, Alfred Molon
<alfredDELETE_molon@yahoo.com> opined:
> For a few days now, Googlebot has been crawling through directories
> of my site that are disallowed in robots.txt. Must be a bug in their
> software, I'd guess. Does anybody else have this problem?


If you are subscribed to Google sitemaps, that will override
robots.txt

--
"Black Hole": The economic effect of administering a DNSBL
Our DNSBL - Eliminate Spam at the Source: http://www.TQMcube.com
Don't Subsidize Criminals: http://boulderpledge.org
Alfred Molon

2006-08-18, 6:36 pm

In article <fnmhr3-cdh.ln1@news.TQMcube.com>, David Cary Hart says...

>
> If you are subscribed to Google sitemaps, that will override
> robots.txt


I'm not subscribed to Google sitemaps.
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe
William Tasso

2006-08-18, 6:36 pm

Fleeing from the madness of the Posted via Supernews,
http://www.supernews.com jungle
Alfred Molon <alfredDELETE_molon@yahoo.com> stumbled into
news:alt.www.webmaster
and said:

> For a few days now, Googlebot has been crawling through directories of
> my site that are disallowed in robots.txt. Must be a bug in their
> software, I'd guess. Does anybody else have this problem?


Is it a 'new' robots.txt? I've heard tell that search engine bots are
fond of caching this file.

--
William Tasso

http://williamtasso.com/words/what-is-usenet.asp
Alfred Molon

2006-08-18, 10:47 pm

In article <op.teh2j2i0m9g4qz-wnt@tbdata.com>, William Tasso says...

>
> Is it a 'new' robots.txt? I've heard tell that search engine bots are
> fond of caching this file.


I do indeed make changes to robots.txt every two or three weeks. The
last change was on August 11th.

But this has never been a problem with Googlebot. In fact, on August 11th
I only added two new disallowed directories, and Googlebot has been
accessing other directories which have been disallowed for a long
time. So it can't be a caching issue. Must be a bug in the
Googlebot software.
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe
Jerry Stuckle

2006-08-18, 10:47 pm

Alfred Molon wrote:
> In article <op.teh2j2i0m9g4qz-wnt@tbdata.com>, William Tasso says...
>
> I do indeed make changes to robots.txt every two or three weeks. The
> last change was on August 11th.
>
> But this has never been a problem with Googlebot. In fact, on August 11th
> I only added two new disallowed directories, and Googlebot has been
> accessing other directories which have been disallowed for a long
> time. So it can't be a caching issue. Must be a bug in the
> Googlebot software.


Maybe a typo in the file?
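
A typo is easy to test for locally. The following is a minimal sketch,
assuming Python's standard urllib.robotparser as a stand-in parser; the
sample rules and the /private/ path are hypothetical, and Googlebot's own
parser may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not the real molon.de file).
GOOD_RULES = """\
User-agent: *
Disallow: /private/
"""

# Same rules with a misspelled directive name.
TYPO_RULES = """\
User-agent: *
Dissallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if these rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this parser an unrecognised directive such as "Dissallow" is
silently skipped, so the directory it was meant to protect looks allowed.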

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
Alfred Molon

2006-08-19, 3:35 am

In article <0pidnVTQI-3C8HvZnZ2dnUVZ_qidnZ2d@comcast.com>, Jerry Stuckle
says...
>
> Maybe a typo in the file?


Well no, but strangely enough this validator produces a 403 error:
http://www.sxw.org.uk/computing/robots/check.html

http://www.dcs.ed.ac.uk/cgi/sxw/par...ite=http%3A%2F%2Fwww.molon.de%2Frobots.txt
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe
Alfred Molon

2006-08-19, 3:35 am

In article <0pidnVTQI-3C8HvZnZ2dnUVZ_qidnZ2d@comcast.com>, Jerry Stuckle
says...

> Maybe a typo in the file?


By the way, is there a limit on the size of the robots.txt file?
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe
Charles Sweeney

2006-08-19, 6:44 pm

Alfred Molon wrote

> In article <0pidnVTQI-3C8HvZnZ2dnUVZ_qidnZ2d@comcast.com>, Jerry Stuckle
> says...
>
>
> By the way, is there a limit on the size of the robots.txt file?


http://www.robotstxt.org/wc/robots.html


--
Charles Sweeney
http://CharlesSweeney.com
Jerry Stuckle

2006-08-19, 6:44 pm

Alfred Molon wrote:
> In article <0pidnVTQI-3C8HvZnZ2dnUVZ_qidnZ2d@comcast.com>, Jerry Stuckle
> says...
>
>
>
> Well no, but strangely enough this validator produces a 403 error:
> http://www.sxw.org.uk/computing/robots/check.html
>
> http://www.dcs.ed.ac.uk/cgi/sxw/par...ite=http%3A%2F%2Fwww.molon.de%2Frobots.txt


That would explain a lot. If Google can't download the file, it can't
obey the directives.
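
One way to see what a crawler receives is to request the file yourself
and look at the status code. This is a rough sketch, assuming Python's
standard urllib; the function name robots_status and the default
User-Agent string are made up for illustration, and how a given crawler
reacts to a 403 is not something this code can tell you.

```python
import urllib.error
import urllib.request


def robots_status(url,
                  user_agent="Mozilla/5.0 (compatible; status-check)",
                  opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends for `url`.

    `opener` is injectable so the function can be exercised offline.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we want.
        return err.code
```

Calling it with the site's robots.txt URL, and again with a different
user_agent value, would show whether the server answers some clients
with a 403 while serving the file to others.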

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
Alfred Molon

2006-08-19, 6:44 pm

In article <SdqdnR02fJGJknrZnZ2dnUVZ_t-dnZ2d@comcast.com>, Jerry Stuckle
says...

> That would explain a lot. If Google can't download the file, it can't
> obey the directives.


I think I've now understood what the problem is. It's actually a chain
reaction of events. It seems that Googlebot scans my site several times
a day from different IP addresses. But Googlebot does not retrieve the
robots.txt file every day or at the beginning of each scan, so it can
happen that Googlebot works from a (temporarily) outdated copy of
robots.txt.
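
The effect described here can be illustrated with a toy cache. This is
only a sketch of how a crawler might reuse a stored copy of robots.txt;
the class name CachedRobots, the ttl parameter and the one-day default
are assumptions, not Googlebot's documented behaviour.

```python
import time


class CachedRobots:
    """Hypothetical crawler-side cache for a site's robots.txt."""

    def __init__(self, fetch, ttl=24 * 3600, clock=time.time):
        self._fetch = fetch        # callable returning the current file body
        self._ttl = ttl            # seconds a stored copy is trusted
        self._clock = clock        # injectable so the cache can be tested
        self._body = None
        self._fetched_at = None

    def get(self):
        """Return the stored copy, refetching only once it has expired."""
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._ttl)
        if expired:
            self._body = self._fetch()
            self._fetched_at = now
        return self._body
```

Any change made to the file inside the ttl window is invisible to a
crawler built this way until the stored copy expires.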
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe
Mark Goodge

2006-08-19, 6:44 pm

On Sat, 19 Aug 2006 17:14:43 +0200, Alfred Molon put finger to
keyboard and typed:

>In article <SdqdnR02fJGJknrZnZ2dnUVZ_t-dnZ2d@comcast.com>, Jerry Stuckle
>says...
>
>
>I think I've now understood what the problem is. It's actually a chain
>reaction of events. It seems that Googlebot scans my site several times
>a day from different IP addresses. But Googlebot does not retrieve the
>robots.txt file every day or at the beginning of each scan, so it can
>happen that Googlebot works from a (temporarily) outdated copy of
>robots.txt.


Are you sure it really is Google, and not something else pretending to
be Google? Faking the User-Agent string is pretty trivial, and quite a
lot of malware bots will pretend to be a nice one in order to escape
detection. And, on a more benign note, there's a plugin for Firefox
which allows you to switch the UA, and one of the main uses of that is
to pretend to be Google in order to catch out sites which display
different content to bots than they do to humans.
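
Because the User-Agent header proves nothing, a common check is a
reverse DNS lookup on the visiting IP followed by a forward lookup on
the returned name. A minimal sketch, assuming Python's standard socket
module; the function names are hypothetical, and the accepted domain
suffixes are an assumption based on the hostnames Google publishes for
its crawlers.

```python
import socket

# Assumed suffixes for genuine Google crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_hostname(hostname):
    """Return True if `hostname` sits under one of the assumed domains."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES)


def verify_googlebot(ip):
    """Reverse-resolve `ip`, check the domain, then confirm it resolves back."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not is_google_hostname(hostname):
        return False
    try:
        # IPv4 only; the third element is the list of addresses.
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP seen in the logs.
    return ip in forward_ips
```

The forward lookup matters because anyone who controls the reverse zone
for their own IP range can make it return any name they like.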

Mark
--
Please give me one! http://www.pleasegivemeone.com
Alfred Molon

2006-08-19, 6:44 pm

In article <mlnee2peblml69slc1ge9jqlko2la5tebr@news.markshouse.net>,
Mark Goodge says...

>
> Are you sure it really is Google, and not something else pretending to
> be Google?


Yes, I checked the IP address with WHOIS.
--

Alfred Molon
http://www.molon.de - Photos of Asia, Africa and Europe
cristina

2006-08-20, 3:34 am

Alfred Molon wrote:
> In article <mlnee2peblml69slc1ge9jqlko2la5tebr@news.markshouse.net>,
> Mark Goodge says...
>
>
> Yes, I checked the IP address with WHOIS.
> --
>
> Alfred Molon
> http://www.molon.de - Photos of Asia, Africa and Europe


If you submit your site to Google Sitemaps, you can use the robots.txt
analysis tool to see the content of the robots.txt file currently
cached by Google, and to check whether access to a URL is blocked to
Googlebot by the robots.txt file:
http://www.google.com/support/webma....txt%20analysis

coreybryant@gmail.com

2006-08-20, 6:37 pm

http://www.robotstxt.org/wc/meta-user.html

You might add:
<meta name="robots" content="noindex,nofollow">
to the <head> if you have not done so already

Corey
http://www.loudcommerce.com
