Tuesday, May 28, 2024

My Tests Are Good At Finding AWS Bugs *

I recently made a minor change to my logging library. And, minor though it was, I run a full suite of integration tests before any release. I wasn't expecting any problems, so I was surprised when the first test against Kinesis failed. More puzzling: the primary assertions in the test all passed; the records were properly written to the stream. What failed was an assertion that there weren't any internally-logged errors.

Digging in, I discovered that the failure was a ResourceNotFoundException: Kinesis was claiming that the stream didn't exist! This particular test created the stream, waited for it to become active, and then immediately tried to write a batch of messages to it. I looked through the code to see if there was any way that it could be writing the batch before the stream was active, and didn't see one. I then put debugging messages into the code to be sure. The stream was definitely active when the batch was written.
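For context, the test's create-and-wait step looks roughly like the following. This is a sketch assuming boto3, with a made-up stream name, not the library's actual test code:

    import time
    import boto3

    kinesis = boto3.client("kinesis")
    STREAM_NAME = "logging-integration-test"   # hypothetical name

    # create the stream, then poll until Kinesis reports it as ACTIVE
    kinesis.create_stream(StreamName=STREAM_NAME, ShardCount=1)
    while True:
        summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
        if summary["StreamDescriptionSummary"]["StreamStatus"] == "ACTIVE":
            break
        time.sleep(1)

Only after this loop exits does the test try to write its batch.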

I then extracted the code into a stand-alone program**, and discovered that PutRecords claimed that a newly-created stream didn't exist, even after a dozen or more calls to DescribeStreamSummary said that yes, it did. And a second call to PutRecords, invoked immediately afterward, succeeded. It seems that Kinesis's internal name->ARN lookup service has a race condition.
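The footnoted Python program is the actual reproduction; the sketch below just shows the shape of the failure, assuming boto3 and a freshly-created stream that has just reported ACTIVE:

    import boto3
    from botocore.exceptions import ClientError

    kinesis = boto3.client("kinesis")
    STREAM_NAME = "logging-integration-test"   # hypothetical name
    records = [{"Data": b"test message", "PartitionKey": "pk-1"}]

    try:
        # first call, issued immediately after DescribeStreamSummary says ACTIVE
        kinesis.put_records(StreamName=STREAM_NAME, Records=records)
        print("first PutRecords succeeded")
    except ClientError as ex:
        if ex.response["Error"]["Code"] == "ResourceNotFoundException":
            print("first PutRecords says the stream doesn't exist")
            # an immediate retry, with no delay at all, succeeds
            kinesis.put_records(StreamName=STREAM_NAME, Records=records)
            print("second PutRecords succeeded")
        else:
            raise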

Since my company has business-tier support with AWS, I submitted a ticket that included the demo program. And 24 hours later I received a message asking for additional information, including “a detailed system architecture.” I'm tempted to rant about AWS support, but I'm sure they're overwhelmed. At least they don't use Q to handle first-tier response — yet. I also pinged a friend who works for AWS, but haven't heard back via that channel either.

I also did some digging and discovered that I could work around the problem by (1) passing the stream ARN to PutRecords instead of the name, or (2) inserting a one-second sleep between receiving the “active” status and trying to send the batch.
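Workaround (1) looks something like this, again assuming boto3 (and a version of it recent enough that PutRecords accepts a stream ARN); the stream name is made up:

    import boto3

    kinesis = boto3.client("kinesis")
    STREAM_NAME = "logging-integration-test"   # hypothetical name

    # look up the ARN once the stream is ACTIVE, then address the stream
    # by ARN, which bypasses the name->ARN lookup
    summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
    stream_arn = summary["StreamDescriptionSummary"]["StreamARN"]

    kinesis.put_records(
        StreamARN=stream_arn,
        Records=[{"Data": b"test message", "PartitionKey": "pk-1"}],
    )

Workaround (2) is just a time.sleep(1) between seeing the “active” status and the first PutRecords call.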

As far as the log-writer itself is concerned, the error isn't a problem: it retries the send. That's why the primary assertions succeed: all of the messages are in fact written. So my decision is whether to wait for AWS to fix the problem or change my tests to work around it. Or change the library to use the stream ARN, on the assumption that that's the “preferred” argument for the PutRecords call.


* In case you're wondering about the title, you can read about my last experience here.

** A Python program that displays the behavior is here.