Chris Blankenship has been on a crusade lately against abusive spiders. I was interested in some of the fixes he had been applying, and a few weeks ago I got an email from him about a solution he was developing, 'GateKeeper'. I reviewed the code and it all looked good, but he wasn't ready to fully release it into the wild yet.
GateKeeper finally reached that point, and I installed it on my two DotNetBlogEngine.net blogs. So far I have been really impressed with it, and I'm interested to see how it affects my overall traffic. Right now I have four blocked user agents:
baiduspider, larbin, sogou, and sosospider, all of which came from Chris's recommendations. I then immediately got a Slurp violation, though I'm going to give Slurp one more failure before I block it. Chris also has MSN blocked, but a lot of my traffic comes from Live Search, so I'm a little scared to do that.
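For reference, the robots.txt entries for those four user agents end up looking something like the sketch below. This is my own hand-written illustration, not GateKeeper's generated output, and the exact user-agent tokens can vary by crawler (for example, Sogou documents "Sogou web spider"), so check each bot's documentation before relying on it.

```
# Illustrative robots.txt entries for the four blocked user agents.
# Token spellings are approximate; some crawlers document slightly
# different User-agent strings.

User-agent: baiduspider
Disallow: /

User-agent: larbin
Disallow: /

User-agent: sogou
Disallow: /

User-agent: sosospider
Disallow: /
```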
I did run into one issue with the solution, though. When I installed it, I had it set to automatically block violators. What neither Chris nor I knew is that Google caches the robots.txt file! Since Googlebot was still working from the old cached copy rather than my new robots.txt, it violated the new rules and got blocked! So the recommendation is to leave automatic blocking off for at least a few days after updating your robots.txt.
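If you want to sanity-check your rules before flipping automatic blocking on, something like the Python sketch below (my own, not part of GateKeeper) parses a robots.txt and confirms the bad bots are disallowed while the crawlers you care about are still allowed. The file path and user-agent strings are assumptions for illustration; the waiting period is still needed for crawlers that cache robots.txt, but at least this confirms the rules say what you think they say.

```python
# Sketch: verify robots.txt rules locally before enabling automatic blocking.
# Assumes robots.txt sits in the current directory; adjust the path as needed.
from urllib.robotparser import RobotFileParser

BLOCKED = ["baiduspider", "larbin", "sogou", "sosospider"]   # should be denied
ALLOWED = ["Googlebot", "Slurp", "msnbot"]                   # should still get in

def check(robots_path: str, test_url: str = "/") -> None:
    parser = RobotFileParser()
    with open(robots_path, encoding="utf-8") as f:
        parser.parse(f.read().splitlines())

    for agent in BLOCKED:
        status = "blocked" if not parser.can_fetch(agent, test_url) else "STILL ALLOWED?"
        print(f"{agent:12s} -> {status}")

    for agent in ALLOWED:
        status = "allowed" if parser.can_fetch(agent, test_url) else "BLOCKED BY MISTAKE?"
        print(f"{agent:12s} -> {status}")

if __name__ == "__main__":
    check("robots.txt")
```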
Related posts from Chris's site:
The Continued Struggle With Spiders
To catch a spider…
Abusive Web Crawlers
Blocking Bad UserAgents and IP Addresses
The elusive Robots.txt file