Saturday, May 7, 2022

A Self-Made Bug

Chasing a bug can be one of the most frustrating parts of software development. All you have to start with is an observed behavior. Hopefully you can replicate that behavior. You then have to gather the information needed to understand what caused the bug, which is mostly about ruling out what didn't cause it. With luck, you have the tools (logging, system utilities) that let you do this.

And then, when you figure out what it is and fix it, most of the time your thought is “wow, that was dumb.” It's definitely not the same feeling you get when you finish a challenging feature.

This is the story of me tracking down a bug. I'm writing it as a form of catharsis, but also because I think it illustrates the lengths a developer has to go to to prove that something is a bug. In this case, it wasn't really a bug, just me doing something dumb and not realizing it.

The story starts with me implementing what I thought would be a simple web service to let a client upload large files to S3. Simple, because I started with an example that I'd written for a blog post. After a couple of hours spent converting something intended as a teaching tool into something user-friendly and easy to deploy, I was ready to test. Some small files went through with no problem, so I selected a 12.5 GB file and went for a bicycle ride.

When I got back, there was no browser running.

The first rule of debugging is that you need to reproduce the bug, so I started the upload again and switched my focus to another task. An hour or so later, I went back to the VM running my test, and again, no browser running.

I started the upload again, and paid careful attention to the requests flowing through the Network tab in the Firefox developer tools. Everything looked good, so I turned to top: when programs die unexpectedly, it's often due to memory errors. Sure enough, I saw that the resident set size (RSS) kept growing.

OK, probably a bug in my code. I'm not much of a JavaScript developer, and don't really understand how it manages memory, so I'm probably holding onto something unintentionally. Fortunately, the code is simple, and I was able to set a breakpoint in a "progress monitor" callback. After digging through various object trees accessible from my code, I was convinced that nothing I did caused the problem.

Time to turn to Google (and Duck Duck Go, which I'm trying as an alternative), to see if maybe there's something wrong with the AWS JavaScript library. I didn't expect to find anything: the second rule of debugging is that when you think there's a problem with a commonly-used library, it's you. But I was using an older version of the library, and perhaps uploading 12 GB from the browser was unusual enough to hit a corner case. Unfortunately, most of the search hits were from Stack Overflow, and were basic “how do I use this library?” questions.

One of the great things about having browsers built by different companies is that you can do comparison tests. I fired up Chrome, started the upload, and watched top; no change in its RSS, and the overall available memory remained constant. I started a Windows VM and did the same thing with Edge, and while the Windows Task Manager doesn't give the level of information that top does, it also appeared to handle the upload without a problem.

More Googling, now on Firefox memory issues. And from that, I learned about about:memory, which lets you see the memory usage of every Firefox process. With that page open in one window, I started my upload from another. After a gigabyte or so of the upload, I checked the memory dump … and it didn't show any excess memory consumption. Neither did top. WTF?!?

At this point it hit me: every time the browser crashed, I had Developer Tools open. So the browser was holding onto every chunk of the file that it uploaded, in case I wanted to go to the Network tab to inspect those requests.

Is there a lesson here? Not really; it's just an example of following the process I first wrote about a dozen years ago. But as I said, it's catharsis, and a reminder that most problems are PEBKAC: Problem Exists Between Keyboard And Chair.

Thursday, February 24, 2022

Leveraging try-with-resources to manage the mapped diagnostic context

The mapped diagnostic context is an under-appreciated feature of logging libraries. It's essentially a thread-local Map, to which you add key-value pairs that provide context about what your program is doing. Here's a trivial example:"program starting");

File file = new File(argv[0]);

MDC.put("outputFile", file.getAbsolutePath());

try (FileOutputStream fos = new FileOutputStream(file);
        OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
        BufferedWriter out = new BufferedWriter(osr))
{"starting to write");
    out.write("hello, world");
    out.newLine();"finished writing");
}

MDC.remove("outputFile");"program finished");

When you run this program, using a logging config that outputs the MDC, you'll see something like this:

2022-02-23 07:28:01,673 INFO  [main] c.kdgregory.example.Main -  - program starting
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - starting to write
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - finished writing
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main -  - program finished

Web-apps can benefit enormously from the MDC: before processing the request, you extract things like the invoking user, request path, session token, or a unique request ID. This makes it easier to track down problems, because you can filter the log by any of those parameters.
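The web-app pattern can be sketched without a servlet container. In this minimal, self-contained sketch, a ThreadLocal map stands in for the real org.slf4j.MDC, and the hypothetical handle() method plays the role of a servlet filter wrapping a request (with the real MDC, the put/remove pair would bracket chain.doFilter()):

```java
import java.util.HashMap;
import java.util.Map;

public class RequestContext {
    // stand-in for the real MDC: a thread-local map of context values
    static final ThreadLocal<Map<String, String>> MDC =
            ThreadLocal.withInitial(HashMap::new);

    static String handle(String user, String requestId) {
        MDC.get().put("user", user);
        MDC.get().put("requestId", requestId);
        try {
            // every log message written during the request can include this context
            return "handled request " + requestId + " with context " + MDC.get();
        } finally {
            MDC.get().clear();  // pooled threads must not leak stale context
        }
    }
}
```

The finally block is the important part: request-handling threads are pooled, so any context left behind would reappear in the next request's log messages.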

But I've recently been doing a lot of data engineering work: reading files or database tables, performing some transformation on them, and sending them to some destination. These tasks tend to be complex, with multiple steps, and they tend to be performed for multiple files or tables at a time. As with a web-app, it's really nice to add contextual information such as table name and selection criteria to the MDC, and let the logging framework write it with every message.

The problem with the MDC is that every put() has to be paired with a remove(). If you forget to remove a key it will “pollute” every subsequent log message from that thread, giving you potentially misleading information.

However, one thing that I noticed is that I tend to set the MDC just before a try-with-resources block, or inside a loop that calls one or more functions that could be wrapped in such a block. This led me to the AutoMDC class:"program starting");

File file = new File(argv[0]);

try (AutoMDC mdc = new AutoMDC("outputFile", file.getAbsolutePath());
        FileOutputStream fos = new FileOutputStream(file);
        OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
        BufferedWriter out = new BufferedWriter(osr))
{"starting to write");
    out.write("hello, world");
    out.newLine();"finished writing");
}"program finished");

Try-with-resources works with classes that implement java.lang.AutoCloseable: an interface that declares the close() method. My AutoMDC class implements this interface, and uses a Set<String> to keep track of the keys. It provides two methods (and the “convenience” constructor shown above):

public AutoMDC put(String key, Object value)
{
    keys.add(key);
    MDC.put(key, String.valueOf(value));
    return this;
}

public void close() throws Exception
{
    for (String key : keys)
        MDC.remove(key);
}
That's it: you can add items to the MDC in the try-with-resources initializer, or anywhere inside the try block itself, and they'll be removed at the end. One downside is that the close() method is called before any catch or finally blocks; if you need to log contextual information in those places, you'll need to do it manually.

If you'd like to use this in your own programs, I've provided an implementation for Logback as a GitHub gist; the same basic code can be adapted to whichever logging framework you use.

I think the idea of leveraging try-with-resources is a useful one for other situations, especially those where you might otherwise turn to aspect-oriented programming. Plus, it encourages the good habit of wrapping your code in try/catch blocks.
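As an illustration of that broader idea, here is a sketch of a hypothetical timing helper (not part of the gist above) that uses the same AutoCloseable trick to pair "start" and "finish" actions:

```java
// Sketch: try-with-resources applied to timing, another situation with
// paired actions. ElapsedTimer is a hypothetical class for illustration.
public class ElapsedTimer implements AutoCloseable {
    private final String label;
    private final long start = System.nanoTime();

    public ElapsedTimer(String label) {
        this.label = label;
    }

    public long elapsedMillis() {
        return (System.nanoTime() - start) / 1_000_000;
    }

    @Override
    public void close() {
        // a real implementation would log this rather than print it
        System.out.println(label + " took " + elapsedMillis() + " ms");
    }
}
```

Usage mirrors AutoMDC: `try (ElapsedTimer t = new ElapsedTimer("file copy")) { ... }`, and the elapsed time is reported no matter how the block exits.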

Saturday, December 18, 2021

My take on the Log4J 2.x vulnerability

A week ago, a lot of Java application developers learned that their applications harbored a severe vulnerability, courtesy of the Log4J 2.x logging framework. The vulnerability, CVE-2021-44228, allowed execution of arbitrary code on any system that logged unsanitized user input using the Log4J 2.x library. This prompted a lot of people to scramble to protect their systems, and a lot of people to take to Twitter with opinions of What It All Means.

As the maintainer of an open-source library that integrates with Log4J, I spent my Saturday morning understanding the bug and updating my examples to use versions of the library that mitigated the problem. Fortunately, my library is not directly affected, as long as its consumers don't use unsanitized data to configure the logging framework.

Having done that, and having read as much as I could on the issue (the LunaSec writeups are excellent), I decided to voice my own thoughts on the matter, some of which I haven't seen other people say, so they may be interesting to you.

First, I think that it's important to recognize that this vulnerability — like most things that end up in the news — is not the result of a single issue. Instead, it's a combination of features, some of which make perfect sense:

  1. Log4J 2.x provides string interpolation for configuration (aka “lookups”)

    This is a great idea; I implemented it for my own logging library. The idea is that you don't want your configuration files to contain hardcoded values, such as the address of your central syslog daemon (or, in my case, the name of a CloudWatch log group). Log4J 2.x provides a long list of lookups, ranging from environment variables to information from the Docker container running your application.

  2. One of the lookups retrieves data from the Java Naming and Directory Interface (JNDI) API

    In large deployments, it's nice to be able to centralize your configuration. There are lots of tools to do this, but Java Enterprise Edition settled on JNDI, also known as the javax.naming package. JNDI is an umbrella API for retrieving data from different sources, such as LDAP.

  3. javax.naming.spi.ObjectFactory supports loading code from a remote server

    This is a somewhat dubious idea, but the JNDI SPI (service provider interface) spec justifies it: it lets you define resources that can be retrieved directly from JNDI. The spec uses the example of a printer driver, which should be familiar to anyone who has installed a printer on their home computer (do you know that the driver you installed is trustworthy?).

    Note: this condition is necessary for remote-code execution, but not for data exfiltration.

  4. The Log4J 2.x PatternLayout also supports string interpolation, with the same set of lookups

    Here we're getting into questionable features in the library. I had no idea that this behavior existed, even though I dug deeply into the documentation and source code when implementing my appenders. The documentation for this layout class is quite long, and prior to the vulnerability, you had to infer the behavior based on the presence of the nolookups pattern option.

  5. Users log unsanitized data

    If users passed all of their logging messages through a sanitizer that looked for and removed the sequence ${, then the vulnerability wouldn't exist. Except nobody does that, because why would you have to? I call it out because I think it's important to consider what you're logging, and whether it might contain information that you don't want to log. As I say in my “effective logging” talk, passwords can hide in the darndest places.
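The sanitizing idea from item 5 can be sketched in a few lines; this is a hypothetical defense-in-depth helper, not a replacement for upgrading the library:

```java
// Sketch: strip the "${" sequence that triggers Log4J 2.x lookups
// before untrusted text reaches a log call.
public class LogSanitizer {
    public static String sanitize(String input) {
        if (input == null) return null;
        String result = input;
        // repeat until no "${" remains: a single pass over "$${{"
        // would leave behind a freshly-assembled "${"
        while (result.contains("${")) {
            result = result.replace("${", "");
        }
        return result;
    }
}
```

Note the loop: attackers quickly learned to obfuscate payloads, so a naive single substitution is not enough even for this simple rule.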

These are the things that need to happen to make this vulnerability affect you. If you can prevent one of them from happening, you're not vulnerable … to this specific issue. And there are ways to do this, although they aren't things that you can easily implement once the vulnerability is discovered. I'm writing a separate post on how Cloud deployments can mitigate similar vulnerabilities.

For now, however, I want to focus on two of the responses that I saw online: pay for open-source maintainers, and inspecting code before using it. In my opinion, neither of these are valid, and they distract from preparing for the next serious vulnerability.

Response #1: inspect code before you use it.

For any non-trivial application, this is simply impossible.

The current “trunk” revision of log4j-core has 89,778 lines of code, excluding test classes (measured using find and wc). That doesn't count any add-on libraries that you might use to write to your preferred destination, such as my log4j-aws-appenders (which has over 10,000 lines of mainline code). And logging is a tiny part of a modern application, which is typically built using a framework such as Spring, and runs on an application server such as Tomcat or Jetty.

And even if your company is willing to pay for the staff to read all of that source code, what is the chance that they will discover a vulnerability? What is the chance that they will even learn all of its behaviors? I've spent quite a bit of time with the Log4J 2.x code and documentation, both to write my library and on a post describing how to write Log4J plugins, and I never realized that it applied string interpolation to the raw log messages. After I knew about the vulnerability, of course, it was easy to find the code responsible.

Response #2: we — or at least large companies — should be paying the maintainers of open-source projects.

I don't know the situation of the core Log4J developers, and whether or not they get paid for their work on the project.¹ But I don't believe that this vulnerability was the result of an overworked, unpaid, solo developer. True, there was no discussion on the commit that introduced JNDI lookups (ignoring the tacky post-hoc commenters). But the code that applies string substitution to logging events has been in the library since the 2.0 release, and has been rewritten multiple times since then.

In fact, I think direct corporate sponsorship would lead to more unexpected behavior, because the sponsoring corporations will all have their own desires, and an expectation that those desires will be met. And if my experience in the corporate world is any guide, a developer that feels their paycheck is in jeopardy is much more likely to do something without giving it full thought.

So where does that leave us?

Unfortunately, I haven't seen very many practical responses (although, as I said, I'm writing a separate post to this effect). I think that's because the real answer puts the onus for vulnerabilities on us, the consumers of open-source software.

Linus's Law, coined by Eric S Raymond, is that “given enough eyeballs, all bugs are shallow.” And this played out in full view with this vulnerability: the Log4J team quickly found and patched the problem, and has been releasing additional patches since.² There have also been multiple third-parties that have written detailed evaluations of the vulnerability.³

But still, we, the consumers of this package, needed to update our builds to use the latest version. And then deploy those updated builds. If you didn't already have a process that allows you to quickly build and deploy a hotfix, then your weekend was shot. Even if you did get a hotfix out, you needed to spend time evaluating your systems to ensure that they weren't compromised (and beware that the most likely compromise was exfiltration of secrets!).

It's natural, in such circumstances, to feel resentment, and to look for external change to make such problems go away. Or maybe even to think about reducing your reliance on open-source projects.

But in my opinion, this is the wrong outcome. Instead, look at this event as a wake-up call to make your systems more robust. Be prepared to do a hotfix at a moment's notice. Utilize tools such as web application firewalls, which can be quickly updated to block malicious traffic. And improve your forensic logging, so that you can identify the effects of vulnerabilities after they appear (just don't log unsanitized input!).

Because this is not the last vulnerability that you will see.

1: You might be interested to learn that the Apache Software Foundation receives approximately $3 million a year (annual reports here). However, to the best of my knowledge, they do not pay stipends to core maintainers. I do not know how the core Log4J 2.x maintainers earn their living.

2: Log4J 2.x change report.

3: I particularly like the CloudFlare posts, especially the one that describes how attackers changed their payloads to avoid simple firewall rules. This post informed my belief that the goal of most attacks was to exfiltrate secrets rather than take over systems.