
robots.txt question
 

Cynode





Old Post  11-06-06 - 04:46 AM  
I use a large, sorta crappy host for a project website that I've been
playing with. I use them mostly because they are cheap.

A few days ago I went to play and noticed nothing but an empty /
directory, so I thought my site had gotten deleted. Then I checked my
email and saw that it had been deactivated by their CSR because Google
and Yahoo bots were hitting the site so hard that it was bringing the
server down and causing problems for other websites.

About the website real quick: it's a gallery running the Menalto
Gallery2 script, with about 5,000 images and nothing else. No forums,
no chat, nothing, just the script. However, each image has a
thumbnail, a resized image, and a full-size image, plus a mini
thumbnail menu on each resized image page, so I guess I could see the
bots chasing links to no end.

So, they said to upgrade to the latest version of the script and use a
robots.txt file to prevent the bots from scanning so much.

Now to my question! lol. I found a robots.txt on Google (irony?) that
excludes about 20 bots, including Google and Yahoo Slurp. Now, if I
upload this file, will these search engines entirely ignore my site,
or is there a way to just limit how deep they go?

I currently use ShortUrls so an album might look like this for the
thumbnail page:

www.domain.com/albums/sub1/

and like this for imageitem page:

www.domain.com/albums/sub1/imageitem.jpg.html

I would prefer to limit them to only going as far as the thumbnail
pages, and not to the resized or full image pages, but will limit them
to just the main page if I have to.

--
If it keeps up, man will atrophy all his limbs but the push-button finger.
~Frank Lloyd Wright - http://www.cynode.com


Re: robots.txt question
 

Mark Goodge





Old Post  11-06-06 - 04:46 AM  
On Sun, 22 Oct 2006 12:34:20 -0400, Cynode put finger to keyboard and
typed:
>
>Now to my question! lol. I found a robots.txt on Google (irony?) that
>excludes about 20 bots, including Google and Yahoo Slurp. Now, if I
>upload this file, will these search engines entirely ignore my site,
>or is there a way to just limit how deep they go?

http://www.robotstxt.org has all you need. But, in summary, you can
exclude at directory level or individual files. So, for example:

Disallow: /
disallows access to the entire site

Disallow: /mydirectory/
disallows access to all the files in mydirectory

Disallow: /mydirectory/thisfile.html
disallows access to thisfile.html inside mydirectory
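
For anyone who wants to try rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name allowed, the agent name ExampleBot, and the test paths are made up for illustration. Each rule gets a "User-agent: *" line above it, because this parser only applies a Disallow line that follows a User-agent line. Real crawlers may differ in details, so treat this as a rough check only.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# One small robots.txt per rule (hypothetical test data).
whole_site = "User-agent: *\nDisallow: /\n"
one_directory = "User-agent: *\nDisallow: /mydirectory/\n"
one_file = "User-agent: *\nDisallow: /mydirectory/thisfile.html\n"

print(allowed(whole_site, "/index.html"))                  # False
print(allowed(one_directory, "/mydirectory/page.html"))    # False
print(allowed(one_directory, "/elsewhere/page.html"))      # True
print(allowed(one_file, "/mydirectory/thisfile.html"))     # False
print(allowed(one_file, "/mydirectory/otherfile.html"))    # True
```

The parser matches by path prefix, so a rule blocks every URL whose path starts with the text after "Disallow:".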

>I currently use ShortUrls so an album might look like this for the
>thumbnail page:
>
>www.domain.com/albums/sub1/
>
>and like this for imageitem page:
>
>www.domain.com/albums/sub1/imageitem.jpg.html
>
>I would prefer to limit them to only going as far as the thumbnail
>pages, and not to the resized or full image pages, but will limit them
>to just the main page if I have to.

You can't easily do what you want without changing the structure, as

Disallow: /albums/sub1/

will exclude the index file in the directory (which is the thumbnail
page) as well as the individual file pages. But excluding at this
level will mean updating the robots.txt every time you add a new
album, as you can't use wildcards in directory names or filenames -
you'd need to specify each album on its own line, like this:

Disallow: /albums/sub1/
Disallow: /albums/sub2/
Disallow: /albums/sub3/
Disallow: /albums/sub4/
Disallow: /albums/sub5/
etc...

so you might be better off just excluding the entire gallery, at this
level:

Disallow: /albums/
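
If the per-album route is still preferred, the file could be regenerated by a small script whenever an album is added. The sketch below is only an illustration: build_robots_txt, allowed, ExampleBot, and the album names are hypothetical, and the check uses Python's standard urllib.robotparser, which does plain prefix matching. It also shows the trade-off described above: a per-album rule blocks the thumbnail page along with the image pages, and an album that is not listed yet stays open to crawlers.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(album_names):
    """Build robots.txt text with one Disallow line per album directory."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: /albums/{name}/" for name in album_names]
    return "\n".join(lines) + "\n"


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


per_album = build_robots_txt(["sub1", "sub2"])       # assumed album names
whole_gallery = "User-agent: *\nDisallow: /albums/\n"

print(per_album)
print(allowed(per_album, "/albums/sub1/"))                    # False: thumbnail page blocked too
print(allowed(per_album, "/albums/sub1/imageitem.jpg.html"))  # False
print(allowed(per_album, "/albums/sub3/"))                    # True: new album not listed yet
print(allowed(whole_gallery, "/albums/sub3/"))                # False
```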

However, to make things a bit easier, you can use an * in the
User-agent line, so you don't need to specify 20 different bots in the
file - just start robots.txt with this line:

User-agent: *

and then put all the directory exclusions underneath it. That will
then make it apply to any spider which conforms to the robots.txt
specifications.
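
As a rough illustration of the difference between naming one bot and using the wildcard, the sketch below again uses Python's standard urllib.robotparser. The helper allowed and the two sample files are made up for this example, and the agent names Googlebot and Slurp are just used as test strings.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# A record that names one bot only applies to that bot.
named_bot = "User-agent: Slurp\nDisallow: /albums/\n"
# A record that uses * applies to every bot that honours robots.txt.
any_bot = "User-agent: *\nDisallow: /albums/\n"

print(allowed(named_bot, "/albums/sub1/", "Slurp"))      # False
print(allowed(named_bot, "/albums/sub1/", "Googlebot"))  # True
print(allowed(any_bot, "/albums/sub1/", "Slurp"))        # False
print(allowed(any_bot, "/albums/sub1/", "Googlebot"))    # False
print(allowed(any_bot, "/index.html", "Googlebot"))      # True
```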

Mark
--
Visit: http://www.MotorwayServices.info - read and share comments and opinions
"I need someone to hide under, should the sky fall on my car"


Re: robots.txt question
 

Cynode





Old Post  11-06-06 - 04:46 AM  
On Sun, 22 Oct 2006 19:00:28 +0100, Mark Goodge
<usenet@listmail.good-stuff.co.uk> wrote:

>On Sun, 22 Oct 2006 12:34:20 -0400, Cynode put finger to keyboard and
>typed: 

>You can't easily do what you want without changing the structure, as
>
>Disallow: /albums/sub1/
>
>will exclude the index file in the directory (which is the thumbnail
>page) as well as the individual file pages. But excluding at this
>level will mean updating the robots.txt every time you add a new
>album, as you can't use wildcards in directory and filenames - you'd
>need to specify each album on its own line, like this:
>

Excellent information, thanks a lot!

But, basically to limit the robots to only the thumbnail pages I would
have to do this?:

thumb page:

/albums/sub1/thumbs/

image file page:

/albums/sub1/thumbs/resized/

--
If it keeps up, man will atrophy all his limbs but the push-button finger.
~Frank Lloyd Wright - http://www.cynode.com


Re: robots.txt question
 

Mark Goodge





Old Post  11-06-06 - 04:46 AM  
On Sun, 22 Oct 2006 14:37:04 -0400, Cynode put finger to keyboard and
typed:

>On Sun, 22 Oct 2006 19:00:28 +0100, Mark Goodge
><usenet@listmail.good-stuff.co.uk> wrote:
> 
> 
>
>Excellent information, thanks a lot!
>
>But, basically to limit the robots to only the thumbnail pages I would
>have to do this?:
>
>thumb page:
>
>/albums/sub1/thumbs/
>
>image file page:
>
>/albums/sub1/thumbs/resized/

Yes, that would do what you want.
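
A quick way to sanity-check that restructured layout is sketched below, assuming the image item pages end up under a resized/ directory inside each album's thumbs/ directory. The file contents, the helper allowed, the agent name ExampleBot, and the page name imageitem.jpg.html in its new location are assumptions for illustration only. With plain prefix matching there would still be one Disallow line per album, since wildcards in paths are not available.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical robots.txt for the restructured gallery: block only the
# resized/ level of each album, leaving the thumbnail pages crawlable.
restructured = (
    "User-agent: *\n"
    "Disallow: /albums/sub1/thumbs/resized/\n"
    "Disallow: /albums/sub2/thumbs/resized/\n"
)

print(allowed(restructured, "/albums/sub1/thumbs/"))                            # True
print(allowed(restructured, "/albums/sub1/thumbs/resized/imageitem.jpg.html"))  # False
print(allowed(restructured, "/albums/sub3/thumbs/resized/imageitem.jpg.html"))  # True: sub3 not listed
```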

Mark
--
Visit: http://www.FridayFun.net - jokes, lyrics and ringtones
"Would you save my soul, tonight?"

