blog.kdgregory.com

Monday, March 4, 2024

Ordinary Users Don't Stand a Chance

I have a “tax PC”: an old HP running Windows 10 that spends ten months of the year in a closet, and two months hooked up to a segment of my home network where it's unlikely to be attacked by the various network-attached devices living in my house. Each February I spend a couple of days applying the year's accumulated Windows updates, then I install Turbo-Tax so that I can do my taxes.

This year the Turbo-Tax installation failed:

Turbo-Tax installation error

I Googled for the missing DLL, and ended up on a Microsoft-hosted discussion board where the first answer said that I needed to update my Visual C++ runtime, and provided a link to some third-party software site.

I looked in my computer's list of applications and saw that I had the “2010” runtime. Apparently that isn't something that Windows Update considers important. So I went to a Microsoft-hosted page, downloaded the x64 version of the runtime (because the Windows “About” page told me that I have an x64 computer with a 64-bit OS installed), and … got the same error.

Perhaps Turbo-Tax isn't using the latest Visual C++ (never mind that the runtime advertises itself as being for 2015 onward). So I downloaded and installed the x64 versions of the other runtimes, rebooted, and tried again. And the installer failed again.

Intuit provides a page titled “Troubleshoot installation and running TurboTax for Windows.” This page is the first result if you go to their Support page and search for the DLL name. It tells you to do things like “Check your TurboTax CD and computer's CD drive” and “Clean up and defragment your hard drive.”

I kept digging through search results, and eventually I found a discussion board posting in which someone says that they had to install the x86 version of the Visual C++ runtime. OK, that makes sense: lowest common denominator, x86 libraries work on an x64 processor. So I did that, installed, rebooted, and the installer ran successfully. According to a comment on that thread, the runtime should have been provided on the Turbo-Tax CD but wasn't.

I've been a professional software developer for 40 years. I don't use Windows, but I know what a DLL is, and I'm comfortable with searching the web for answers. More important, I can identify a sketchy answer, and don't download from random links on the Internet. But what about people that don't have that knowledge?

Maybe there aren't any other people who run Turbo-Tax locally, installing from physical media. Or maybe everybody else installs on their brand-new PC running Windows 11, with all the latest runtimes installed. But I have to believe that I'm not unique. There must be other people that have run into this problem.

What do they do?

Thursday, July 13, 2023

Buggy Behavior From The CloudWatch Logs API

Tip 26: “select” isn't broken.
The OS is probably not broken. And the database is probably just fine.

The Pragmatic Programmer

Words to live by — although I did once find a problem with Oracle's index management, which manifested in select. In my opinion, it's the maxim that defines a senior developer. Junior developers always look outside themselves whenever they run into a problem. Senior developers do everything they can to find where they've made a mistake. After all, there are a lot of people doing exactly the same things that you are; surely they would have found the bug first.

But sometimes, it really is a bug in the system. This post is a story of how I determined the bug wasn't in my code. It's part catharsis, part “here's what you should do if you think select is broken,” and part calling out the issue for posterity in case someone else trips over it.

In January of this year (2023), AWS made a change in the CloudWatch Logs API: you no longer needed to provide a sequence token when calling PutLogEvents. Sequence tokens were always, I think, a “leaky abstraction”: a part of the internal implementation of CloudWatch Logs that found its way into the API. To summarize: you had to retrieve a sequence token before you could write events. When you wrote a batch of events, you received a sequence token back, which you could use to write the next batch — unless another process was writing to the same stream, in which case your sequence token would become invalid. Which meant that you had to retrieve a new token, using an API that had a relatively low limit on number of requests per second. With enough concurrent writers, hilarity would ensue as each would retrieve a token, try to use it, and fail because some other writer got there first.
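My appenders are Java, but the old dance is easy to sketch in Python with boto3. This is a minimal illustration of the pre-change protocol, not my library's actual code:

import boto3

logs = boto3.client("logs")

def current_token(group, stream):
    # the token lives on the stream description; DescribeLogStreams has a low
    # requests-per-second limit, which is what made concurrent writers painful
    desc = logs.describe_log_streams(logGroupName=group, logStreamNamePrefix=stream)
    return desc["logStreams"][0].get("uploadSequenceToken")

def write_batch(group, stream, events, token):
    # events: [{"timestamp": millis, "message": str}, ...], sorted by timestamp
    while True:
        try:
            kwargs = {"sequenceToken": token} if token else {}
            resp = logs.put_log_events(logGroupName=group, logStreamName=stream,
                                       logEvents=events, **kwargs)
            return resp["nextSequenceToken"]   # feed this into the next batch
        except logs.exceptions.InvalidSequenceTokenException:
            # another writer got there first; fetch a fresh token and retry
            token = current_token(group, stream)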

This change directly impacted my logging library, but initially I wasn't planning to adopt it: the library worked, the old behavior was still supported, and to be honest, releasing the library involves a lot of toil. But then someone found a bug in my sequence-token retrieval code that appeared when the logging framework happened to re-initialize itself (Log4J2 FTW!). I fixed the immediate bug and released a patch, but then decided it was a sign from the universe that I should remove the sequence token entirely. Plus, it always feels good to delete code!

However, once I changed the code, one of my integration tests — the one that pits concurrent writers against each other — started failing. When it tried to verify the events that were written, it would often come up short.

Failing integration tests are nothing new to this library: CloudWatch is an “eventually consistent” system, and while “eventually” often means seconds, it sometimes takes longer for the service to become consistent. For that reason, failing tests don't clean up after themselves, so that I can look at what happened. In the past, I could then use my log reader utility and verify that all of the events were eventually available.

This time, I didn't get all of the events back. Of course, I assumed that my reader was incorrect, and dove back into the documentation for the API calls it was using. I experimented with different arguments to the GetLogEvents call, and also tried FilterLogEvents. But nothing I did had an effect: I was still missing log entries.
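The reader is built around GetLogEvents, and the basic pagination pattern is easy to show with boto3 (a sketch of the pattern, not the utility itself); note the termination check, since GetLogEvents keeps returning the same nextForwardToken once you reach the end of the stream:

import boto3

logs = boto3.client("logs")

def read_all_events(group, stream):
    # read from the beginning of the stream, following nextForwardToken until
    # it stops changing (that's how GetLogEvents signals end-of-stream)
    events, token = [], None
    while True:
        kwargs = {"nextToken": token} if token else {}
        resp = logs.get_log_events(logGroupName=group, logStreamName=stream,
                                   startFromHead=True, **kwargs)
        events += resp["events"]
        if resp["nextForwardToken"] == token:
            return events
        token = resp["nextForwardToken"]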

Next step was to dig into the logging library. I already had it fairly well instrumented: if you enable debug mode in the logging framework it will report everything that it does, including (optionally) every batch that it sends. Everything looked good, but there was a corner case that worried me: CloudWatch can reject events, and unlike Kinesis, it doesn't tell you which events it rejects. The appender pre-processed batches to remove invalid events, but didn't log failures (because there shouldn't be any). I changed that code, but no failures appeared.

At this point I didn't think the bug was in my appender, but there was always the possibility that I was doing something incorrectly that I didn't log. My next step for debugging the appender would have been to turn up the logging level within the AWS SDK, to get a detailed report of what it was doing. However, this can become a case of drowning in detail, so first I looked at alternative methods of verifying the events.

In addition to the GetLogEvents and FilterLogEvents API calls, CloudWatch provides a query API. This API requires you to submit a query and then loop until it completes, so I didn't have a utility program that used it. But it's what the Console uses on the Insights page.
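For reference, “submit and loop” looks roughly like this with boto3 (the count-only query string is just an illustration; a real program would also handle a Failed or Cancelled status):

import time
import boto3

logs = boto3.client("logs")

def count_events(group, start_ts, end_ts):
    # submit the query, then poll until CloudWatch Logs reports it complete
    query = logs.start_query(logGroupName=group,
                             startTime=start_ts, endTime=end_ts,   # epoch seconds
                             queryString="stats count(*)")
    while True:
        resp = logs.get_query_results(queryId=query["queryId"])
        if resp["status"] == "Complete":
            return resp["results"]
        time.sleep(1)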

So I wrote a simple query to retrieve events from the stream; I was actually more interested in the reported count than in looking at all 20,000 expected events. But Insights told me that there weren't any events in the stream.

At this point, I opened a ticket with AWS Support (fortunately, I have access to an account with a Business support contract, since “Basic” support no longer allows technical questions). I received an answer that “as designed,” Insights does not return any messages with timestamps prior to the log group's creation time. WTF?!?

The back-story is that my Log4J 1.x appender lazily creates the log group and stream when it first writes a batch of messages, because Log4J 1.x does not have an explicit initialization phase. This is definitely something to be aware of if you use my library with Log4J 1.x (and, overwhelmingly, that's what my download counts indicate people are doing). The solution is to pre-create your log groups, so I modified the test to do so. This is not an issue with Log4J 2.x and Logback, which do have an initialization phase.
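Pre-creating the group (and, if you like, the stream) is a one-time operation; with boto3 it looks something like this (a sketch, not the actual test code):

import boto3

logs = boto3.client("logs")

def ensure_log_destination(group, stream):
    # create the group and stream up front; swallowing "already exists"
    # makes this safe to run repeatedly
    try:
        logs.create_log_group(logGroupName=group)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass
    try:
        logs.create_log_stream(logGroupName=group, logStreamName=stream)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass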

But the test continued to fail. However, I could view the logs in Insights, and saw that it was able to retrieve all of the events (yet another approach that also succeeded was exporting to S3). More interesting, it was the initial few log messages that weren't being retrieved by GetLogEvents. I also called the DescribeLogStreams API, and it returned a later “first event timestamp” than expected, by a few milliseconds.
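That check is easy to reproduce if you recorded the timestamp of the first event you wrote; a small boto3 sketch (the function is mine, purely for illustration):

import boto3

logs = boto3.client("logs")

def first_event_skew(group, stream, expected_first_ts):
    # compare the stream's reported firstEventTimestamp (milliseconds) with the
    # timestamp of the first event actually written; a positive result means
    # the service thinks the stream starts later than it really does
    desc = logs.describe_log_streams(logGroupName=group, logStreamNamePrefix=stream)
    return desc["logStreams"][0]["firstEventTimestamp"] - expected_first_ts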

This seemed worth another Support ticket: if I could retrieve all log events using one technique, and not using another, then clearly it was a problem with CloudWatch and not me.

And now comes the most important lesson from this entire adventure: do not assume that the AWS Support rep can see what you're describing, and more important, do not expect them to tell you that they can't.

I had created a log group/stream that demonstrated the problem, and referenced it in the ticket. I also attached the S3 export, the list of events from my log reader, and the initial events from an Insights query. The support rep seemed to agree that there was a problem. Then for a week they wrote nightly updates about how they were trying to re-upload the events and did not see a problem. They seemed to be trying random things, so I recommended building my library and running the integration test — at this point, my assumption was that AWS accepted that there was a problem based on my example group/stream, and were simply trying to re-create it at will.

It turned out this was an incorrect assumption, which I learned when the original rep went dark and a new rep picked up the case. I had independently created a program that triggered the issue, so that they wouldn't have to run the integration tests. The new rep claimed to have run the program, did not see the issue, and pointed out that AWS did not have permission to view the original log group and stream that I had identified in the ticket!

Seeing that the “web” interactions were not having any effect, I decided to try a chat session with a screenshare. I showed the rep the original group and stream, and how different retrieval methods returned different results. At the time it didn't seem to have much effect: he made sounds of acknowledgement, but was unable to assume the role I created to let him read events, and would go silent for minutes at a time while he was “try[ing] something.” All in all, not a great experience from my side.

But it did seem to have an effect: the next morning I got a message from someone who claimed to be on the CloudWatch team, saying that they were able to re-create the problem. And that's where it stands today. For my part, the updates to my library are on hold until I hear back: while this may be superstition on my part, it seems to work better if I provide a sequence token (I suspect it's actually the delay from retrieving the token).

And if you run into a problem with GetLogEvents not returning all of your events, feel free to reference case #12882575541 if you file a Support ticket.

Update, 2023-08-22

AWS has resolved the problem: "We investigated and found that there was an issue with the service which in rare situations can incorrectly set the "firstEventTimestamp" on a log-stream [...] Our team has now fixed the issue with GetLogEvents not returning some events due to incorrectly set "firstEventTimestamp"." I have verified that my integration tests run (repeatedly) without a failure. It took just under three months (05/23 to 08/18) to resolve the case, and two months (06/15) from the point that someone took it seriously. For a company the size of AWS, and a service with the reach of CloudWatch Logs, that seems pretty fast; I've had issues drag on far longer with in-house IT teams.

Tuesday, January 3, 2023

Using EventBridge Pipes to route CloudWatch Logs events to Kinesis

Several years ago I wrote about connecting CloudWatch Logs into a logging pipeline built on Kinesis Streams and Kinesis Firehose. To make that work, I wrote a Lambda.

AWS announced EventBridge Pipes at re:Invent 2022, as an easy way to connect messaging services. Connecting CloudWatch Logs to Kinesis seemed like a good way to kick the tires on this service. It still requires a custom Lambda, but that Lambda is now much simpler, because it doesn't need to concern itself with the quirks of PutRecords.

The Problem

When you subscribe a Kinesis stream to a log group, CloudWatch writes messages that look like this:

H4sIAAAAAAAAA62QTU7DMBCF95zC8jqO7LHd2NkFNVRCdNWyggqF1gqG/MlxW6qqd2cK7QWA5bw382a+OdLWjWNVu+VhcDSn02JZvMzLxaKYlTSh/b5zAWUBUulJZiwXgHLT17PQbwd0imFw3caF8rNqh8b9mIsYXNWi+9DX6l6wi8mAAwgAxXx88413HZNaWDHBqXH7Oq6DH6LvuzvfRBdGmj/Ra+zqO7fcuS6e9SP1G4yXGRgltdLcGK41SKu0AQ2KC55hrYTNsJCWc5CCW5CgteEG90WP4BHDaS4mmTDGAOKBTa4PwfjzuUwAA0W4yUHl0iTYQqbl7eOMPLkLFV+Rdd+mH5s6uLoPhxQvVe9ptR/TS0s6r3xHGNlVzdYRPxINzx09JX/DsP+LIX6LsTrdfAHRwAP2RwIAAA==

That's a GZipped and Base64-encoded JSON object. When decoded, unzipped, and pretty-printed, it looks like this:

{
  "messageType": "DATA_MESSAGE",
  "owner": "123456789012",
  "logGroup": "AppenderExample",
  "logStream": "Log4J1-Example-20221224-ithilien-351916",
  "subscriptionFilters": [
    "Example"
  ],
  "logEvents": [
    {
      "id": "37284354508805523945825240107552419724039002310923255808",
      "timestamp": 1671888278929,
      "message": "2022-12-24 08:24:38,929 DEBUG [example-0] com.kdgregory.log4j.aws.example.Main - value is 52\n"
    },
    {
      "id": "37284354508805523945825240107552419724039002310923255809",
      "timestamp": 1671888278929,
      "message": "2022-12-24 08:24:38,929 DEBUG [example-1] com.kdgregory.log4j.aws.example.Main - value is 52\n"
    }
  ]
}

As you can see, CloudWatch combines multiple log events into a single message. However, most destinations — such as Elasticsearch — want individual events, not a nested array. So you need something that pulls apart the original message. For example, the AWS-provided OpenSearch integration subscribes a Lambda to the log group; that Lambda decomposes the messages and writes the individual events to the cluster. My previous implementation performed a similar transformation, but wrote the log messages to a Kinesis stream.

EventBridge Pipes

EventBridge Pipes is modeled around a four-step process (image sourced from docs):

EventBridge Pipes data flow

For simple pipelines, EventBridge gives you a “no code” solution that eliminates the middle two steps: you define an output format using path references to fields in the input messages. Unfortunately, this isn't usable for CloudWatch Logs messages: it can't (currently) handle Gzipped data, and can't extract the elements in an array. So we need to add an “enrichment” step that performs this operation.

Implementation

You can implement the enrichment step in multiple ways, including an HTTP(S) API, but the easiest is a Lambda. Pipes invokes this Lambda with an event that contains one or more source records, and it returns zero or more result records.

The source event looks like this:

[
  {
    "eventSource": "aws:kinesis",
    "eventVersion": "1.0",
    "eventID": "shardId-000000000000:49636526735590005452916707570422271019264002608211165186",
    "eventName": "aws:kinesis:record",
    "invokeIdentityArn": "arn:aws:iam::123456789012:role/CloudWatchTransformPipeline-PipeRole-us-east-1",
    "awsRegion": "us-east-1",
    "eventSourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/log-subscription",
    "kinesisSchemaVersion": "1.0",
    "partitionKey": "87362ce96c65eb1b3125bf0129b99bee",
    "sequenceNumber": "49636526735590005452916707570422271019264002608211165186",
    "data": "...",
    "approximateArrivalTimestamp": 1672490859.613
  }
]

The first thing to call out is that this event is an array. EventBridge Pipes can batch messages together, so you must loop over the entire event and combine the results from each message (i.e., a flat-map):

def lambda_handler(event, context):
    # the Pipes event is a list of source records; flat-map the transformed
    # events from each record into a single result list
    result = []
    for message in event:
        result += transform_messages(message)
    return result

The data element is the GZipped and Base64-encoded message that I showed above. You process it with code that looks like this:

import base64, gzip, json   # module-level imports

# the "data" field is Base64-encoded, GZipped JSON
data = base64.b64decode(message['data'])
data = gzip.decompress(data)
payload = json.loads(data)

The last thing to be aware of is that there are two types of messages that you'll get from a CloudWatch Logs subscription, identified by the messageType field. When the subscription first starts, CloudWatch sends a control message to the stream to verify that it can write messages. After that, it sends data messages. Your Lambda should ignore everything but the latter.

if payload['messageType'] == 'DATA_MESSAGE':
    # extract the log events; transform_log_event() stands in for your own
    # per-event logic (see the complete example for what mine does)
    return [transform_log_event(e) for e in payload['logEvents']]
else:
    # control messages (sent when the subscription starts) produce no output
    return []

Conclusion

This post has shown you the basics of an EventBridge Pipes transform. You can find my complete example, which performs several transforms on the log events, here.

While I don't think that EventBridge Pipes quite lives up to the hype of Werner Vogels' keynote (and that it has nothing in common with EventBridge other than a name), it does allow you to focus on your logic rather than the quirks of your source and destination. As such, it's a useful addition to your data transformation toolbox.