Wednesday, February 26, 2014

Nosy Web Spiders

The other day I was Googling for some technical detail that I'd written about (you think I remember where I put stuff?), and saw some surprising results: in addition to the public web page that I wanted, I also saw the PHP “fragment” that I use to build that page. This was surprising because there aren't any links to the fragment on my site; the only reference to it is in a table that maps page names to filenames.

Looking at the structure of my site, I realized what had happened. In my original site design, I had put all of the page-related content in the same directory as the fragment file. A few years ago I re-organized the files, but left one presentation in place; it was linked by a couple of sites, and I had no way at that time to do a 301 redirect. While that decision seemed harmless, it left me open to the following:

66.249.73.164 - - [30/Aug/2013:15:21:34 -0700] "GET /programming/ HTTP/1.1" 200 1542 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.164 - - [30/Aug/2013:15:23:03 -0700] "GET /programming/intro.frag HTTP/1.1" 200 2355 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.164 - - [30/Aug/2013:15:23:57 -0700] "GET /programming/scm.git.frag HTTP/1.1" 200 34073 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

To explain: they took the URL for the presentation, stripped off the filename component, and retrieved a directory listing. Then they crawled each of the files returned from that listing, and added them to their index. This is, quite frankly, a script-kiddie hack. In fact, the first time I saw this behavior it was a script-kiddie exploring my site.

I don't really care about script-kiddies: the directories that I want to keep private have directory listings disabled, and also have a dummy index.html that redirects back to my home page. If someone wants to learn my madd skillz with PHP, fine, have at it (but don't feel compelled to comment if you don't find my skillz that madd).
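
To be clear about what that protection amounts to: assuming an Apache-style server (the log lines above are in Apache's combined format), it's nothing more than a per-directory option plus a throwaway index page. Something like the following, though the exact directives will depend on your configuration:

# .htaccess in the private directory: don't auto-generate listings
Options -Indexes

<!-- dummy index.html: anyone poking around gets bounced to the home page -->
<meta http-equiv="refresh" content="0; url=/">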

But why does Google (and Bing, and Baidu, and many others) feel the need to do this?

I try to follow their rules: I have a sitemap that lists the “interesting” pages on my site; I use noindex and nofollow meta tags (and canonical, which is a whole ‘nother rant); I have a robots.txt that lists directories that aren't covered by the meta tags. In short, I have presented them with my website as I want it to appear in their index.
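
For anyone who hasn't dealt with these, the per-page hints are a line or two in the page's head, along these lines (the URL is a placeholder, not one of my actual pages):

<!-- keep this page out of the index, and don't follow its links -->
<meta name="robots" content="noindex, nofollow">
<!-- point crawlers at the one "real" URL when content appears in several places -->
<link rel="canonical" href="http://www.example.com/programming/">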

But they choose to ignore that, poke around my website, and pick up whatever droppings they can find. Which ultimately makes their index, and the Internet as a whole, a worse place.

I suppose it's time to update robots.txt.
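
Something along these lines should do it, given that the fragment files all end in .frag (the * and $ wildcards aren't part of the original robots.txt spec, but the major crawlers honor them):

User-agent: *
Disallow: /*.frag$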

Monday, February 3, 2014

Coder vs Engineer

Stack Overflow is a fabulous resource for programmers. When I have programming questions, the first page of Google results is filled with links to its pages, and they usually have the answers I need. So why do I often feel depressed after browsing its questions?

The answer came to me this weekend: it's a hangout for coders, not engineers.

The question that prompted this revelation was yet another request for help with premature optimization. The program in question was tracking lap times for race cars, and the OP (original poster, for those not familiar with the acronym) was worried because he (she?) was extracting the list of cars and sorting it after every update. He saw this as a performance and garbage-collection hit that would happen “thousands of times a second.”

That last line raised a red flag for me: I'm not a huge race fan, but I can't imagine why you would expect to update lap times so frequently. The Daytona 500, for example, has approximately 40 cars, each of which takes approximately a minute per lap. Even if they draft, you have a maximum of 40 updates per second, for a rather small set of objects.

To me, this is one of the key differences between a coder and an engineer: not attempting to bound the problem. Those interview questions about counting gas stations in Manhattan are all about this. You don't have to be exact, but if you can't set a bound to a problem, you can't find an effective solution. Sure, updating and sorting thousands of cars, thousands of times a second, that might have performance issues. But that's not how real-world races work.
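
If you want to see how small the bounded problem really is, here's a back-of-the-envelope check (my sketch, not the OP's code): copy and sort a 40-car field 40 times a second, for an entire four-hour race, all in one go.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LapTimeBound
{
    public static void main(String[] argv)
    {
        // a 40-car field, per the Daytona example above
        List<Double> lapTimes = new ArrayList<>();
        for (int ii = 0 ; ii < 40 ; ii++)
            lapTimes.add(Math.random() * 60.0);

        // the worst case from above: 40 updates per second,
        // sustained for a four-hour race
        int updates = 40 * 60 * 60 * 4;

        long start = System.currentTimeMillis();
        for (int ii = 0 ; ii < updates ; ii++)
        {
            lapTimes.set(ii % lapTimes.size(), Math.random() * 60.0);
            Collections.sort(new ArrayList<>(lapTimes));   // copy, then sort the copy
        }
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(updates + " copy-and-sorts took " + elapsed + " ms");
    }
}

The exact timing doesn't matter; the point is that 576,000 sorts of a 40-element list is a workload a modern JVM shrugs off, so the “optimization” was aimed at a problem that doesn't exist.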

Another difference is that, having failed to bound the problem (indeed, even to identify whether there is a problem), the coder immediately jumps to writing code. And that was the case for the people who answered this particular question: they created a variety of solutions, each of which solved some interpretation of the OP's problem.

And I think that's what really bothers me: that coders will interpret a question in whatever way makes their coding easiest. This was driven home by a recent DZone puzzle. The question was how to remove duplicates from a linked list, “without using a buffer.” It's a rather poorly-worded question: what constitutes a buffer?

There were a few people who raised that question, but by far the majority started writing code. And some of the implementations were quite inventive in their interpretation of the question. The very first response limited the input list to integers, and used a bitset to track duplicates (at the worst, a bitset spanning the non-negative int range is 2^31 bits, nearly 300MB of RAM — a “buffer” seems modest by comparison). Another respondent seemed to believe that a Java ArrayList satisfied the “linked list” requirement.
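
For contrast, here's what I'd consider the strictest reading of “without using a buffer”: no auxiliary data structure at all, just O(n²) comparisons against the list itself. This is my own sketch of that interpretation (ints for brevity; the approach works for any type), not any respondent's actual answer.

public class DedupWithoutBuffer
{
    // a minimal singly-linked list node, the kind of thing the puzzle implies
    static class Node
    {
        int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    // removes duplicates in place: no set, no bitset, no extra storage,
    // at the cost of O(n^2) comparisons
    static void dedup(Node head)
    {
        for (Node current = head ; current != null ; current = current.next)
        {
            Node runner = current;
            while (runner.next != null)
            {
                if (runner.next.value == current.value)
                    runner.next = runner.next.next;   // unlink the duplicate
                else
                    runner = runner.next;
            }
        }
    }

    public static void main(String[] argv)
    {
        // build 1 -> 2 -> 1 -> 3 -> 2, dedup it, and print the survivors
        Node head = new Node(1);
        head.next = new Node(2);
        head.next.next = new Node(1);
        head.next.next.next = new Node(3);
        head.next.next.next.next = new Node(2);

        dedup(head);
        for (Node n = head ; n != null ; n = n.next)
            System.out.print(n.value + " ");
        System.out.println();
    }
}

Whether that's “better” than a HashSet depends entirely on what the person asking actually meant, which is exactly the point.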

Enough with the rant. Bottom line is that this industry needs to replace coders with engineers: people who take the time to understand the problems that they're tasked to solve. Before writing code.