If you’ve looked at your website’s access logs recently and wondered why weird-looking ‘spiders’ are visiting your site, chances are they are spam bots rather than anything legitimate.
Here are some real bots from the three major search engines which you should let on your site:
- Googlebot – from Google
- bingbot – from Microsoft (formerly known as MSNbot)
- Yahoo! Slurp – from Yahoo!
So what can you do about the others? Well, one technique that works well for me is to block them using your website’s .htaccess file. This file is essentially a web server configuration file that lives in the root folder of your website (for example, in your /var/www or /public_html folder), and most people will be able to edit it themselves using their favourite text editor and an FTP client to upload any changes.
Here is a copy of the .htaccess file I use; it works well for me on this and other sites running the Apache web server:
```apache
# Block visitors (bots) with no user-agent string
SetEnvIfNoCase User-Agent ^$ bad_bot

# Bad bots found accessing this domain previously
SetEnvIfNoCase User-Agent "^\*bot" bad_bot
SetEnvIfNoCase User-Agent "^bot\*" bad_bot
SetEnvIfNoCase User-Agent "^checker" bad_bot
SetEnvIfNoCase User-Agent "^crawl" bad_bot
SetEnvIfNoCase User-Agent "^discovery" bad_bot
SetEnvIfNoCase User-Agent "^hunter" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^Netcraft" bad_bot
SetEnvIfNoCase User-Agent "^robot" bad_bot
SetEnvIfNoCase User-Agent "^spider" bad_bot
SetEnvIfNoCase User-Agent "^WebCollage" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
```
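SetEnvIfNoCase performs a case-insensitive regular-expression match against the User-Agent header, and the `^` anchor means each pattern only matches at the very start of the string. If you want to check which user agents your list would catch, here is a rough Python sketch of the same matching logic (this only simulates the regex behaviour, it is not Apache itself):

```python
import re

# The patterns from the .htaccess above. SetEnvIfNoCase is
# case-insensitive, and each pattern is anchored with ^ so it
# only matches at the start of the user-agent string.
BAD_BOT_PATTERNS = [
    r"^\*bot", r"^bot\*", r"^checker", r"^crawl", r"^discovery",
    r"^hunter", r"^LinkWalker", r"^Netcraft", r"^robot", r"^spider",
    r"^WebCollage",
]

def is_bad_bot(user_agent: str) -> bool:
    """Return True if the user agent is empty or matches a bad-bot pattern."""
    if not user_agent:  # mirrors: SetEnvIfNoCase User-Agent ^$ bad_bot
        return True
    return any(re.search(p, user_agent, re.IGNORECASE)
               for p in BAD_BOT_PATTERNS)

print(is_bad_bot("crawl-bot/2.0"))   # True  - starts with "crawl"
print(is_bad_bot("Googlebot/2.1"))   # False - "bot" is not at the start
print(is_bad_bot(""))                # True  - no user-agent string at all
```

Note that Googlebot slips through precisely because the patterns are anchored: “bot” appears in its name, but not at the start.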
What all of that means is that if a bot identifies itself with a user-agent string matching any of these patterns (e.g. one starting with “bot*” or “crawl”), the web server will deny it access and return a standard “403 Forbidden” error page.
Each website attracts different bad bots, and the list above is by no means exhaustive. One or two of the entries are actually legitimate bots, but I block them because they either use too much bandwidth or I see no value in them crawling my website. You should be able to identify which bad bots visit your website from the amount of bandwidth they eat up, plus perhaps a quick Google search of their names.
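A simple way to spot the heavy hitters is to tally requests per user-agent string in your access log. Here is a minimal Python sketch, assuming the common Apache “combined” log format (where the user agent is the third double-quoted field); the sample lines and addresses are made up for illustration:

```python
from collections import Counter

# Hypothetical sample lines in Apache combined log format; in practice
# you would read these from your real access log file.
log_lines = [
    '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 200 123 "-" "crawl-bot/1.0"',
    '5.6.7.8 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 456 "-" "Mozilla/5.0"',
    '1.2.3.4 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 123 "-" "crawl-bot/1.0"',
]

def user_agent(line: str) -> str:
    # Splitting on double quotes leaves the user agent at index 5
    parts = line.split('"')
    return parts[5] if len(parts) > 5 else "-"

counts = Counter(user_agent(line) for line in log_lines)
for agent, n in counts.most_common():
    print(f"{n:6d}  {agent}")
```

Anything with a suspiciously high request count and an unfamiliar name is a good candidate for a quick search and, if it checks out as junk, a new SetEnvIfNoCase line.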
Note: if any of your bots use an asterisk as part of their user-agent string, you will need to “escape” it, otherwise the web server will return a “500 Internal Server Error” page. Escaping simply means putting a backslash before the character, so “*” becomes “\*”, as you can see in my example .htaccess content above.
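The reason is that “*” is a regular-expression metacharacter (“zero or more of the preceding item”), so a bare leading asterisk is not a valid pattern. Python’s re module illustrates the same rule, and its re.escape() helper will add the backslashes for you when building patterns from plain strings:

```python
import re

# The escaped "\*" matches a literal asterisk at the start of the string.
pattern = re.compile(r"^\*bot", re.IGNORECASE)

print(bool(pattern.search("*bot/1.0")))   # True  - literal "*" then "bot"
print(bool(pattern.search("Googlebot")))  # False - no literal asterisk

# re.escape() backslash-escapes metacharacters automatically
print(re.escape("*bot"))                  # prints: \*bot
```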