Wednesday, February 26, 2014

Nosy Web Spiders

The other day I was Googling for some technical detail that I'd written about (you think I remember where I put stuff?), and saw some surprising results: in addition to the public web page that I wanted, I also saw the PHP “fragment” that I use to build the public page. This was surprising because there aren't any links to that page on my site; the only reference to it is in a table that associates the page name to a filename.

Looking at the structure of my site, I realized what had happened. In my original site design, I had put all of the page-related content in the same directory as the fragment file. A few years ago I re-organized the files, but left one presentation in place; it was linked by a couple of sites, and I had no way at that time to do a 301 redirect. While that decision seemed harmless, it left me open to the following: - - [30/Aug/2013:15:21:34 -0700] "GET /programming/ HTTP/1.1" 200 1542 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +" - - [30/Aug/2013:15:23:03 -0700] "GET /programming/intro.frag HTTP/1.1" 200 2355 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +" - - [30/Aug/2013:15:23:57 -0700] "GET /programming/scm.git.frag HTTP/1.1" 200 34073 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"

To explain: they took the URL for the presentation, stripped off the filename component, and retrieved a directory listing. Then they crawled each of the files returned from that listing, and added them to their index. This is, quite frankly, a script-kiddie hack. In fact, the first time I saw this behavior it was a script-kiddie exploring my site.

I don't really care about script-kiddies: the directories that I want to keep private have directory listings disabled, and also have a dummy index.html that redirects back to my home page. If someone wants to learn my madd skillz with PHP, fine, have at it (but don't feel compelled to comment if you don't find my skillz that madd).

But why does Google (and Bing, and Baidu, and many others) feel the need to do this?

I try to follow their rules: I have a sitemap that lists the “interesting” pages on my site; I use noindex and nofollow meta tags (and canonical, which is a whole ‘nother rant); I have a robots.txt that lists directories that aren't covered by the meta tags. In short, I have presented them with my website as I want it to appear in their index.

But they choose to ignore that, poke around my website, and pick up whatever droppings they can find. Which ultimately makes their index, and the Internet as a whole, a worse place.

I suppose it's time to update robots.txt.

No comments: