Hi All,

We are taking a bit of a hit on Google because of the amount of duplicate content on our site. As part of a phased approach, we want to start hiding non-essential pages from Googlebot.

Our first action is to disallow Googlebot from crawling/indexing any product page that is a 4th-generation copy (or later). We have done some initial research and think that using the * wildcard as shown below should be OK.

This is the robots.txt code we are proposing:

User-agent: *
Disallow: /copy_of_copy_of_copy_of_*.html

With that rule, we would expect the following:

http://www.kjbeckett.com/acatalog/bl...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/fred-perry_p2.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/mens-bags_p4.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/mens-bags.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD NOT be crawled (from http://www.kjbeckett.com/acatalog/messenger-bags.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD NOT be crawled (from http://www.kjbeckett.com/acatalog/fred-perry.html).

Do you think our usage of the * wildcard is correct? In other words, using the examples above, would the pages we want crawled still be crawled, and the pages we don’t want crawled be blocked?
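
To sanity-check the rule ourselves, here is a rough Python sketch of how we understand the matching to work. It assumes the behaviour Google documents for robots.txt: a rule is compared against the URL path starting from its first character, * matches any run of characters, and a trailing $ anchors the end. The function name rule_matches and the example-bag file names are made up for illustration (our real file names are truncated above), and this is only our approximation, not Googlebot's actual code.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Approximate check of a robots.txt path pattern against a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Treat every character literally, then turn each '*' into '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only succeeds at the start of the path, which mirrors
    # how a robots.txt rule is anchored to the beginning of the path.
    return re.match(regex, path) is not None

RULE = "/copy_of_copy_of_copy_of_*.html"

# Hypothetical paths standing in for our real product pages.
paths = [
    "/acatalog/example-bag.html",
    "/acatalog/copy_of_example-bag.html",
    "/acatalog/copy_of_copy_of_copy_of_example-bag.html",
    "/copy_of_copy_of_copy_of_example-bag.html",
]

for p in paths:
    print(p, "->", "blocked" if rule_matches(RULE, p) else "allowed")
```

If this reading is right, the rule only blocks the last path, because our product pages sit under /acatalog/ and the rule starts matching at the root. Under the same assumptions, a rule such as /acatalog/copy_of_copy_of_copy_of_ (no wildcard needed, since rules are prefix matches) would block the third path while leaving the first two alone. We would appreciate confirmation of whether that is how Googlebot really behaves.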

Any help would be greatly appreciated.

Cheers,
Liam


