Saturday, May 7, 2022

A Self-Made Bug

Chasing a bug can be one of the most frustrating parts of software development. All you have to start with is an observed behavior. Hopefully you can replicate that behavior. You then have to gather the information needed to understand what caused the bug, which is mostly about ruling out what didn't cause it. With luck, you have the tools (logging, system utilities) that let you do this.

And then, when you figure out what it is and fix it, most of the time your thought is “wow, that was dumb.” It's definitely not the same feeling you get when you finish a challenging feature.

This is the story of me tracking down a bug. I'm writing it as a form of catharsis, but also because I think it illustrates the lengths a developer has to go to to prove that something is a bug. In this case, it wasn't, really, just me doing something dumb and not realizing it.

The story starts with me implementing what I thought would be a simple web service to let a client upload large files to S3. Simple, because I started with an example that I'd written for a blog post. After a couple of hours to convert something intended as a teaching tool into something user-friendly and easy to deploy, and I was ready to test. Some small files went through with no problem, so I selected a 12.5 GB file and went for a bicycle ride.

When I got back, there was no browser running.

The first rule of debugging is that you need to reproduce the bug, so I started the upload again and switched my focus to another task. An hour or so later, I went back to the VM running my test, and again, no browser running.

I started the upload again, and paid careful attention to the requests flowing through the Network tab in the Firefox developer tools. Everything looked good, so I turned to top: when programs die unexpectedly, it's often due to memory errors. Sure enough, I saw that the resident set size (RSS) kept growing.

OK, probably a bug in my code. I'm not much of a JavaScript developer, and don't really understand how it manages memory, so I'm probably holding onto something unintentionally. Fortunately, the code is simple, and I was able to set a breakpoint in a "progress monitor" callback. After digging through various object trees accessible from my code, I was convinced that nothing I did caused the problem.

Time to turn to Google (and Duck Duck Go, which I'm trying as an alternative), to see if maybe there's something wrong with the AWS JavaScript library. I didn't expect to find anything: the second rule of debugging is that when you think there's a problem with a commonly-used library, it's you. But I was using an older version of the library, and perhaps uploading 12 GB from the browser was unusual enough to hit a corner case. Unfortunately, most of the search hits were from Stack Overflow, and were basic “how do I use this library?” questions.

One of the great things about having browsers built by different companies is that you can do comparison tests. I fired up Chrome, started the upload, and watched top; no change in its RSS, and the overall available memory remained constant. I started a Windows VM and did the same thing with Edge, and while the Windows Task Manager doesn't give the level of information that top does, it also appeared to handle the upload without a problem.

More Googling, now on Firefox memory issues. And from that, I learned about about:memory, which lets you see the memory usage of every Firefox process. With that page open in one window, I started my upload from another. After a gigabyte or so of the upload, I checked the memory dump … and it didn't show any excess memory consumption. Neither did top. WTF?!?

At this point it hit me: every time the browser crashed, I had Developer Tools open. So the browser was holding onto every chunk of the file that it uploaded, in case I wanted to go to the Network tab to inspect those requests.

Is there a lesson here? Not really; it's just an example of following the process I first wrote about a dozen years ago. But as I said, it's catharsis, and a reminder that most problems are PEBKAC: Problem Exists Between Keyboard And Chair.

Thursday, February 24, 2022

Leveraging try-with-resources to manage the mapped diagnostic context

The mapped diagnostic context is an under-appreciated feature of logging libraries. It's essentially a thread-local Map, to which you add key-value pairs that provide context about what your program is doing. Here's a trivial example:

logger.info("program starting");
File file = new File(argv[0]);

MDC.put("outputFile", file.getAbsolutePath());

try (FileOutputStream fos = new FileOutputStream(file);
        OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
        BufferedWriter out = new BufferedWriter(osr))
{
    logger.info("starting to write");
    out.write("hello, world");
    out.newLine();
    logger.info("finished writing");
}

MDC.remove("outputFile");
logger.info("program finished");

When you run this program, using a logging config that outputs the MDC, you'll see something like this:

2022-02-23 07:28:01,673 INFO  [main] c.kdgregory.example.Main -  - program starting
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - starting to write
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - finished writing
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main -  - program finished

Web-apps can benefit enormously from the MDC: before processing the request, you extract things like the invoking user, request path, session token, or a unique request ID. This makes it easier to track down problems, because you can filter the log by any of those parameters.

But I've recently been doing a lot of data engineering work: reading files or database tables, performing some transformation on them, and sending them to some destination. These tasks tend to be complex, with multiple steps, and they tend to be performed for multiple files or tables at a time. As with a web-app, it's really nice to add contextual information such as table name and selection criteria to the MDC, and let the logging framework write it with every message.

The problem with the MDC is that every put() has to be paired with a remove(). If you forget to remove a key it will “pollute” every subsequent log message from that thread, giving you potentially misleading information.

However, one thing that I noticed is that I tend to set the MDC just before a try-with-resources block. Or inside a loop that calls one or more functions that could be wrapped in such a block. This led me to the AutoMDC class:

logger.info("program starting");
File file = new File(argv[0]);

try (AutoMDC mdc = new AutoMDC("outputFile", file.getAbsolutePath());
        FileOutputStream fos = new FileOutputStream(file);
        OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
        BufferedWriter out = new BufferedWriter(osr))
{
    logger.info("starting to write");
    out.write("hello, world");
    out.newLine();
    logger.info("finished writing");
}

logger.info("program finished");

Try-with-resources works with classes that implement java.lang.AutoCloseable: an interface that declares the close() method. My AutoMDC class implements this interface, and uses a Set<String> to keep track of the keys. It provides two methods (and the “convenience” constructor shown above):

public AutoMDC put(String key, Object value)
{
    keys.add(key);
    MDC.put(key, String.valueOf(value));
    return this;
}

@Override
public void close() throws Exception
{
    for (String key : keys)
    {
        MDC.remove(key);
    }
}

That's it: you can add items to the MDC in the try-with-resources initializer, or anywhere inside the try block itself, and they'll be removed at the end. One downside is that the close() method is called before any catch or finally blocks; if you need to log contextual information in those places, you'll need to do it manually.

If you'd like to use this in your own programs, I've provided an implementation for Logback as a GitHub gist; the same basic code can be adapted to whichever logging framework you use.

I think the idea of leveraging try-with-resources is a useful one for other situations, especially those where you might otherwise turn to aspect-oriented programming. Plus, it encourages the good habit of wrapping your code in try/catch blocks.