Thursday, July 13, 2023

Buggy Behavior From The CloudWatch Logs API

Tip 26: “select” isn't broken.
The OS is probably not broken. And the database is probably just fine.

The Pragmatic Programmer

Words to live by — although I did once find a problem with Oracle's index management, which manifested in select. In my opinion, it's the maxim that defines a senior developer. Junior developers always look outside themselves whenever they run into a problem. Senior developers do everything they can to find where they've made a mistake. After all, there are a lot of people doing exactly the same things that you are; surely they would have found the bug first.

But sometimes, it really is a bug in the system. This post is a story of how I determined the bug wasn't in my code. It's part catharsis, part “here's what you should do if you think select is broken,” and part calling out the issue for posterity in case someone else trips over it.

In January of this year (2023), AWS made a change in the CloudWatch Logs API: you no longer needed to provide a sequence token when calling PutLogEvents. Sequence tokens were always, I think, a “leaky abstraction”: a part of the internal implementation of CloudWatch Logs that found its way into the API. To summarize: you had to retrieve a sequence token before you could write events. When you wrote a batch of events, you received a sequence token back, which you could use to write the next batch — unless another process was writing to the same stream, in which case your sequence token would become invalid. Which meant that you had to retrieve a new token, using an API that had a relatively low limit on number of requests per second. With enough concurrent writers, hilarity would ensue as each would retrieve a token, try to use it, and fail because some other writer got there first.

This change directly impacted my logging library, but initially I wasn't planning to adopt it: the library worked, the old behavior was supported, and to be honest, releasing the library involves a lot of toil. But then someone found a bug with my sequence-token retrieval code if the logging library happened to re-initialize itself (Log4J2 FTW!). I fixed the immediate bug, and released a patch, but then decided it was a sign from the universe that I should remove the sequence token entirely. Plus, it always feels good to delete code!

However, once I changed the code, one of my integration tests — the one that pits concurrent writers against each other — started failing. When it tried to verify the events that were written, it would often come up short.

Failing integration tests are nothing new to this library: CloudWatch is an “eventually consistent” system, and while “eventually” often means seconds, it sometimes takes longer for the service to become consistent. As a result, failing tests don't clean up after themselves, so that I can look at what happened. I could then use my log reader utility, and verify that all of the events were eventually available.

This time, I didn't get all of the events back. Of course, I assumed that my reader was incorrect, and dove back into the documentation for the API calls it was using. I experimented with different arguments to the GetLogEvents call, and also tried FilterLogEvents. But nothing I did had an effect: I was still missing log entries.

Next step was to dig into the logging library. I already had it fairly well instrumented: if you enable debug mode in the logging framework it will report everything that it does, including (optionally) every batch that it sends. Everything looked good, but there was a corner case that worried me: CloudWatch can reject events, and unlike Kinesis, it doesn't tell you which events it rejects. The appender pre-processed batches to remove invalid events, but didn't log failures (because there shouldn't be any). I changed that code, but no failures appeared.

At this point I didn't think the bug was in my appender, but there was always the possibility that I was doing something incorrectly that I didn't log. My next step for debugging the appender would have been to turn up the logging level within the AWS SDK, to get a detailed report of what it was doing. However, this can become a case of drowning in detail, so first I looked at alternative methods of verifying the events.

In addition to the GetLogEvents and FilterLogEvents API calls, CloudWatch provides a query API. This API requires you to submit a query and then loop until it's done, so I didn't have a utility program that used it. But it's what the Console uses in the Insights page.

So I wrote a simple query to retrieve events from the stream; I was actually more interested in the reported count, rather than looking at all 20,000 expected events. But Insights told me that there weren't any events in the stream.

At this point, I opened a ticket with AWS Support (fortunately, I have access to an account with a Business support contract, since “Basic” support no longer allows technical questions). I received an answer that “as designed,” Insights does not return any messages with timestamps prior to the log group's creation time. WTF?!?

The back-story is that my Log4J 1.x appender lazily creates the log group and stream when it first writes a batch of messages, because Log4J 1.x does not have an explicit initialization phase. This is definitely something to be aware of if you use my library with Log4J 1.x (and, overwhelmingly, that's what my download counts indicate people are doing). The solution is to pre-create your log groups, so I modified the test to do so. This is not an issue with Log4J 2.x and Logback, which do have an initialization phase.

But the test continued to fail. However, I could view the logs in Insights, and saw that it was able to retrieve all of the events (yet another approach, which also succeeded, is export to S3). And more interesting, it was the initial few log messages that weren't being retrieved by GetLogEvents. I also called the DescribeLogStreams API, and it returned a later “first event timestamp” than expected, by a few milliseconds.

This seemed worth another Support ticket: if I could retrieve all log events using one technique, and not using another, then clearly it was a problem with CloudWatch and not me.

And now comes the most important lesson from this entire adventure: do not assume that the AWS Support rep can see what you're describing, and more important, do not expect them to tell you that they can't.

I had created a log group/stream that demonstrated the problem, referenced it in the ticket. I also attached the S3 export, the list of events from my log reader, and the initial events from an Insights query. The support rep seemed to agree that there was a problem. Then for a week they wrote nightly updates about how they were trying to re-upload the events and did not see a problem. They seemed to be trying random things, so I recommended building my library and running the integration test — at this point, my assumption was that AWS accepted that there was a problem based on my example group/stream, and were simply trying to re-create at will.

It turned out this was an incorrect assumption, which I learned when the original rep went dark and a new rep picked up the case. I had independently created a program that triggered the issue, so that they wouldn't have to run the integration tests. The new rep claimed to have run the program, did not see the issue, and pointed out that AWS did not have permission to view the original log group and stream that I had identified in the ticket!

Seeing that the “web” interactions were not having any effect, I decided to try a chat session with a screenshare. I showed the rep the original group and stream, and how different retrieval methods returned different results. At the time it didn't seem to have much effect: he made sounds of acknowledgement, but was unable to assume the role I created to let him read events, and would go silent for minutes at a time while he was “try[ing] something.” All-in-all, not a great experience from my side.

But it did seem to have an effect: the next morning I got a message from someone who claimed to be on the CloudWatch team, saying that they were able to re-create the problem. And that's where it stands today. For my part, the updates to my library are on hold until I hear back: while this may be superstition on my part, it seems to work better if I provide a sequence token (I suspect it's actually the delay from retrieving the token).

And if you run into a problem with GetLogEvents not returning all of your events, feel free to reference case #12882575541 if you file a Support ticket.

Update, 2023-08-22

AWS has resolved the problem: "We investigated and found that there was an issue with the service which in rare situations can incorrectly set the "firstEventTimestamp" on a log-stream [...] Our team has now fixed the issue with GetLogEvents not returning some events due to incorrectly set "firstEventTimestamp"." Have verified that my integration tests run (repeatedly) without a failure. It took just under three months (05/23 to 08/18) to resolve the case, two months (06/15) from the point that someone took it seriously. For a company the size of AWS, and a service with the reach of CloudWatch Logs, that seems pretty fast; I've had issue drag on far longer with in-house IT teams.