Saturday, May 7, 2022

A Self-Made Bug

Chasing a bug can be one of the most frustrating parts of software development. All you have to start with is an observed behavior. Hopefully you can replicate that behavior. You then have to gather the information needed to understand what caused the bug, which is mostly about ruling out what didn't cause it. With luck, you have the tools (logging, system utilities) that let you do this.

And then, when you figure out what it is and fix it, most of the time your thought is “wow, that was dumb.” It's definitely not the same feeling you get when you finish a challenging feature.

This is the story of me tracking down a bug. I'm writing it as a form of catharsis, but also because I think it illustrates the lengths a developer has to go to to prove that something is a bug. In this case, it wasn't, really, just me doing something dumb and not realizing it.

The story starts with me implementing what I thought would be a simple web service to let a client upload large files to S3. Simple, because I started with an example that I'd written for a blog post. After a couple of hours to convert something intended as a teaching tool into something user-friendly and easy to deploy, and I was ready to test. Some small files went through with no problem, so I selected a 12.5 GB file and went for a bicycle ride.

When I got back, there was no browser running.

The first rule of debugging is that you need to reproduce the bug, so I started the upload again and switched my focus to another task. An hour or so later, I went back to the VM running my test, and again, no browser running.

I started the upload again, and paid careful attention to the requests flowing through the Network tab in the Firefox developer tools. Everything looked good, so I turned to top: when programs die unexpectedly, it's often due to memory errors. Sure enough, I saw that the resident set size (RSS) kept growing.

OK, probably a bug in my code. I'm not much of a JavaScript developer, and don't really understand how it manages memory, so I'm probably holding onto something unintentionally. Fortunately, the code is simple, and I was able to set a breakpoint in a "progress monitor" callback. After digging through various object trees accessible from my code, I was convinced that nothing I did caused the problem.

Time to turn to Google (and Duck Duck Go, which I'm trying as an alternative), to see if maybe there's something wrong with the AWS JavaScript library. I didn't expect to find anything: the second rule of debugging is that when you think there's a problem with a commonly-used library, it's you. But I was using an older version of the library, and perhaps uploading 12 GB from the browser was unusual enough to hit a corner case. Unfortunately, most of the search hits were from Stack Overflow, and were basic “how do I use this library?” questions.

One of the great things about having browsers built by different companies is that you can do comparison tests. I fired up Chrome, started the upload, and watched top; no change in its RSS, and the overall available memory remained constant. I started a Windows VM and did the same thing with Edge, and while the Windows Task Manager doesn't give the level of information that top does, it also appeared to handle the upload without a problem.

More Googling, now on Firefox memory issues. And from that, I learned about about:memory, which lets you see the memory usage of every Firefox process. With that page open in one window, I started my upload from another. After a gigabyte or so of the upload, I checked the memory dump … and it didn't show any excess memory consumption. Neither did top. WTF?!?

At this point it hit me: every time the browser crashed, I had Developer Tools open. So the browser was holding onto every chunk of the file that it uploaded, in case I wanted to go to the Network tab to inspect those requests.

Is there a lesson here? Not really; it's just an example of following the process I first wrote about a dozen years ago. But as I said, it's catharsis, and a reminder that most problems are PEBKAC: Problem Exists Between Keyboard And Chair.