







Author robots.txt question
Cynode

2006-11-05, 11:46 pm

I use a large, sorta crappy host for a project website that I've been
playing with. I use them mostly because they are cheap.

A few days ago I went to play and noticed nothing but an empty /
directory, and I thought my site had gotten deleted. So I checked my email
and saw that it had been deactivated by their CSR because Google and
Yahoo bots were hitting the site so hard that it was bringing the server
down and causing problems for other websites.

About the website real quick: it's a gallery running the Menalto Gallery2
script. It has about 5,000 images and nothing else, no forums, no chat,
nothing, just the script. However, you figure each image has a
thumbnail, a resized image, and a full-size image, plus a mini
thumbnail menu on each resized image page, so I guess I could see the
bots chasing links to no end.

So, they said to upgrade to the latest version of the script and use a
robots.txt file to prevent the bots from scanning so much.

Now to my question! lol. I found a robots.txt on Google (irony?) that
excludes about 20 bots, including Google and Yahoo Slurp. Now, if I
upload this file, will these search engines entirely ignore my site,
or is there a way to just limit how deep they go?

I currently use ShortUrls so an album might look like this for the
thumbnail page:

www.domain.com/albums/sub1/

and like this for imageitem page:

www.domain.com/albums/sub1/imageitem.jpg.html

I would prefer to limit them to only going as far as the thumbnail
pages, and not to the resized or full image pages, but will limit them
to just the main page if I have to.

--
If it keeps up, man will atrophy all his limbs but the push-button finger.
~Frank Lloyd Wright - http://www.cynode.com
Mark Goodge

2006-11-05, 11:46 pm

On Sun, 22 Oct 2006 12:34:20 -0400, Cynode put finger to keyboard and
typed:
>
>Now to my question! lol. I found a robots.txt on Google (irony?) that
>excludes about 20 bots, including Google and Yahoo Slurp. Now, if I
>upload this file, will these search engines entirely ignore my site,
>or is there a way to just limit how deep they go?


http://www.robotstxt.org has all you need. But, in summary, you can
exclude at directory level or individual files. So, for example:

Disallow: /
disallows access to the entire site

Disallow: /mydirectory/
disallows access to all files in mydirectory

Disallow: /mydirectory/thisfile.html
disallows access to thisfile.html inside mydirectory
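
As a rough way to check how these three rules behave, here is a minimal Python sketch using the standard library's urllib.robotparser. The "User-agent: *" line, the helper name is_blocked, the agent name ExampleBot and the test paths are all assumptions added for illustration. Real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(rule, path, agent="ExampleBot"):
    """Return True if the single Disallow line `rule` blocks `path` for `agent`."""
    parser = RobotFileParser()
    # A rule only takes effect inside a User-agent record; "*" is assumed here.
    parser.parse(["User-agent: *", rule])
    return not parser.can_fetch(agent, path)

# Whole site excluded
print(is_blocked("Disallow: /", "/index.html"))                            # True
# Directory excluded: files inside it are blocked, files elsewhere are not
print(is_blocked("Disallow: /mydirectory/", "/mydirectory/other.html"))    # True
print(is_blocked("Disallow: /mydirectory/", "/elsewhere/page.html"))       # False
# Single file excluded: only that file is blocked
print(is_blocked("Disallow: /mydirectory/thisfile.html",
                 "/mydirectory/thisfile.html"))                            # True
print(is_blocked("Disallow: /mydirectory/thisfile.html",
                 "/mydirectory/other.html"))                               # False
```

The rules are matched as path prefixes, which is why a directory rule covers everything beneath it.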

>I currently use ShortUrls so an album might look like this for the
>thumbnail page:
>
>www.domain.com/albums/sub1/
>
>and like this for imageitem page:
>
>www.domain.com/albums/sub1/imageitem.jpg.html
>
>I would prefer to limit them to only going as far as the thumbnail
>pages, and not to the resized or full image pages, but will limit them
>to just the main page if I have to.


You can't easily do what you want without changing the structure, as

Disallow: /albums/sub1/

will exclude the index file in the directory (which is the thumbnail
page) as well as the individual file pages. But excluding at this
level will mean updating the robots.txt every time you add a new
album, as you can't use wildcards in directory and filenames - you'd
need to specify each album on its own line, like this:

Disallow: /albums/sub1/
Disallow: /albums/sub2/
Disallow: /albums/sub3/
Disallow: /albums/sub4/
Disallow: /albums/sub5/
etc...

so you might be better off just excluding the entire gallery, at this
level:

Disallow: /albums/
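
If you did want the per-album list anyway, it could be generated rather than maintained by hand. The following is only a sketch: the function name album_disallow_lines, the album names and the "/albums/" root are hypothetical, and in practice the names would come from wherever the gallery stores its album list.

```python
def album_disallow_lines(album_names, gallery_root="/albums/"):
    """Build one Disallow line per album directory under gallery_root."""
    return [f"Disallow: {gallery_root}{name}/" for name in album_names]

# Hypothetical album names, for illustration only.
albums = ["sub1", "sub2", "sub3"]

# Every record needs a User-agent line before its Disallow lines.
lines = ["User-agent: *"] + album_disallow_lines(albums)
print("\n".join(lines))
```

The script would have to be rerun whenever an album is added, which is the maintenance cost that the single "/albums/" rule avoids.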

However, to make things a bit easier, you can use an * in the
User-agent line, so you don't need to specify 20 different bots in the
file - just start robots.txt with this line:

User-agent: *

and then put all the directory exclusions underneath it. They will
then apply to any spider which conforms to the robots.txt
specification.
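
To see the effect of the catch-all record, here is a small sketch that parses a two-line robots.txt with Python's urllib.robotparser and asks what a few crawlers may fetch. The agent name SomeOtherBot is made up for illustration, and the paths follow the album layout described earlier in the thread. It only shows how this one parser reads the file, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /albums/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

results = {}
for agent in ("Googlebot", "Slurp", "SomeOtherBot"):
    results[agent] = (
        parser.can_fetch(agent, "/"),                                # main page
        parser.can_fetch(agent, "/albums/sub1/"),                    # thumbnail page
        parser.can_fetch(agent, "/albums/sub1/imageitem.jpg.html"),  # image page
    )
    print(agent, results[agent])
```

Each agent gets the same answer: the main page is allowed, and both the thumbnail page and the image page under /albums/ are blocked.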

Mark
--
Visit: http://www.MotorwayServices.info - read and share comments and opinions
"I need someone to hide under, should the sky fall on my car"
Cynode

2006-11-05, 11:46 pm

On Sun, 22 Oct 2006 19:00:28 +0100, Mark Goodge
<usenet@listmail.good-stuff.co.uk> wrote:

>On Sun, 22 Oct 2006 12:34:20 -0400, Cynode put finger to keyboard and
>typed:
>You can't easily do what you want without changing the structure, as
>
>Disallow: /albums/sub1/
>
>will exclude the index file in the directory (which is the thumbnail
>page) as well as the individual file pages. But excluding at this
>level will mean updating the robots.txt every time you add a new
>album, as you can't use wildcards in directory and filenames - you'd
>need to specify each album on its own line, like this:
>


Excellent information, thanks a lot!

But, basically to limit the robots to only the thumbnail pages I would
have to do this?:

thumb page:

/albums/sub1/thumbs/

image file page:

/albums/sub1/thumbs/resized/

--
If it keeps up, man will atrophy all his limbs but the push-button finger.
~Frank Lloyd Wright - http://www.cynode.com
Mark Goodge

2006-11-05, 11:46 pm

On Sun, 22 Oct 2006 14:37:04 -0400, Cynode put finger to keyboard and
typed:

>On Sun, 22 Oct 2006 19:00:28 +0100, Mark Goodge
><usenet@listmail.good-stuff.co.uk> wrote:
>
>
>
>Excellent information, thanks a lot!
>
>But, basically to limit the robots to only the thumbnail pages I would
>have to do this?:
>
>thumb page:
>
>/albums/sub1/thumbs/
>
>image file page:
>
>/albums/sub1/thumbs/resized/


Yes, that would do what you want.
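
For illustration, a quick sketch of that layout checked with Python's urllib.robotparser. The rule shown, the agent name ExampleBot and the file name under resized/ are assumptions, since the thread gives only the two directory paths.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    # Assumed rule: block only the image pages of album sub1.
    "Disallow: /albums/sub1/thumbs/resized/",
])

thumb_page_ok = parser.can_fetch("ExampleBot", "/albums/sub1/thumbs/")
image_page_ok = parser.can_fetch(
    "ExampleBot", "/albums/sub1/thumbs/resized/imageitem.jpg.html")
other_album_ok = parser.can_fetch(
    "ExampleBot", "/albums/sub2/thumbs/resized/imageitem.jpg.html")

print(thumb_page_ok, image_page_ok, other_album_ok)  # True False True
```

The thumbnail page stays crawlable while the image page is blocked. Because rules match by path prefix, each album would still need its own Disallow line under this layout.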

Mark
--
Visit: http://www.FridayFun.net - jokes, lyrics and ringtones
"Would you save my soul, tonight?"