I recently made a minor change to my logging library. And, minor though it was, I run a full suite of integration tests before any release. I wasn't expecting any problems, so I was surprised when the first test against Kinesis failed. More puzzling: the primary assertions in the test all passed; the records were properly written to the stream. What failed was an assertion that there weren't any internally-logged errors.
Digging in, I discovered that the failure was a ResourceNotFoundException: Kinesis was claiming that the stream didn't exist! This particular test created the stream, waited for it to become active, and then immediately tried to write a batch of messages to it. I looked through the code to see if there was any way that it could be writing the batch before the stream was active, and didn't see one. I then put debugging messages into the code to be sure. The stream was definitely active when the batch was written.
I then extracted the code into a stand-alone program,** and discovered that PutRecords claimed that a newly-created stream didn't exist, even after a dozen or more calls to DescribeStreamSummary said that yes, it did. And a second call to PutRecords, invoked immediately afterward, succeeded. It seems that their internal name->ARN lookup service has a race condition.
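If you want to see the behavior for yourself, here's a minimal boto3 sketch of that same sequence. It is not the demo program I sent to AWS, and the stream name and record contents are placeholders, but the create / wait / write steps are the same:

```python
import time
import uuid

import boto3

client = boto3.client("kinesis")
stream_name = "race-demo-" + uuid.uuid4().hex[:8]   # placeholder name

# create the stream and poll until Kinesis reports it as ACTIVE
client.create_stream(StreamName=stream_name, ShardCount=1)
while True:
    summary = client.describe_stream_summary(StreamName=stream_name)
    if summary["StreamDescriptionSummary"]["StreamStatus"] == "ACTIVE":
        break
    time.sleep(1)

# immediately write a batch; this is where the spurious error shows up
records = [{"Data": b"test message", "PartitionKey": "key"}]
try:
    client.put_records(StreamName=stream_name, Records=records)
    print("first PutRecords succeeded")
except client.exceptions.ResourceNotFoundException:
    print("first PutRecords claimed the stream doesn't exist")
    # a second call, made immediately afterward, succeeds
    client.put_records(StreamName=stream_name, Records=records)
    print("second PutRecords succeeded")
```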
Since my company has business-tier support with AWS, I submitted a ticket that included the demo program. And 24 hours later I received a message asking for additional information, including “a detailed system architecture.” I'm tempted to rant about AWS support, but I'm sure they're overwhelmed. At least they don't use Q to handle first-tier response — yet. I also pinged a friend who works for AWS, but haven't heard back via that channel either.
I also did some digging and discovered that I could work around the problem by (1) passing the stream ARN to PutRecords instead of the name, or (2) inserting a one-second sleep between receiving the “active” status and trying to send the batch.
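Continuing the sketch above, the two workarounds look roughly like this (the first relies on the fact that PutRecords accepts a StreamARN parameter in the current API):

```python
# workaround 1: address the stream by ARN instead of by name
stream_arn = summary["StreamDescriptionSummary"]["StreamARN"]
client.put_records(StreamARN=stream_arn, Records=records)

# workaround 2: give the name->ARN lookup a moment to catch up
time.sleep(1)
client.put_records(StreamName=stream_name, Records=records)
```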
As far as the log-writer itself is concerned, the error isn't a problem: it retries the send. That's why the primary assertions succeed: all of the messages are in fact written. So my decision is whether to wait for AWS to fix the problem or change my tests to work around it. Or change the library to use the ARN, on the assumption that that's the “preferred” argument for the PutRecords call.
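For what it's worth, “retries the send” boils down to something like the following. This is a simplified sketch, not the library's actual code, and the attempt count and backoff are made up:

```python
def send_batch(client, stream_name, records, max_attempts=3):
    """Send one batch, retrying if Kinesis spuriously reports the stream missing."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.put_records(StreamName=stream_name, Records=records)
        except client.exceptions.ResourceNotFoundException:
            if attempt == max_attempts:
                raise
            time.sleep(0.25 * attempt)   # brief backoff before the next attempt
```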
* In case you're wondering about the title, you can read about my last experience here.