Friday, June 14, 2024

CloudTrail + GenAI = Disappointment

AWS just announced a natural language query generator for CloudTrail Lake. Using it, you can write prompts like “What are my top errors in the past month?” and it will spit out the SQL to get the answer. On the surface, that seems great: CloudTrail events can be very complex: they hold the request and response data for thousands of unrelated API operations. In my experience, extracting the information you need can take multiple minutes with a search engine, and longer if you use a tool such as Athena.

Unfortunately, as the title of this post tells you, it doesn't quite live up to expectations. In this post I highlight four queries that I think are simple and likely to be frequently used, but which show what's lost in translation from natural language.

Note 1: I ran the queries in this post on June 12, 2024, the same week that the query generator was announced in preview. I expect it to go through continuous training, so you might not get the same output that I did. I don't plan to make this an evergreen post, and will not update it if the queries change. It calls out the state of the world at one partiular point in time.

Note 2: I made minor changes to identifiers in the generated queries, in case there's information about my account embedded within. I did not change the queries in any other way, including formatting.

Query #1: who started ec2 instance i-0fe20093cf243bca1

Sometimes you find an instance running, and have no idea why it exists. To answer that question, you need to talk to the person who started the instance.

SELECT userIdentity.arn,
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND element_at(responseElements, 'instancesSet') LIKE '%i-0fe20093cf243bca1%'
    AND eventTime <= '2024-06-12 17:00:00'

This query does what it's supposed to, and is not necessarily something that's easy to write off the top of your head: you need to know how to find the instance ID in the event's responseElements sub-object. Fortunately, the query generator does. But there are a few quirks, that I think should be called out. The first is that this query exhibits “do what I mean, not what I say” behavior.

There are two EC2 API calls that look very similar: StartInstances and RunInstances. The former restarts an existing, stopped instance; the latter starts a new instance. In maybe 99% of the cases, you (or at least I) actually want the latter API call when we use the word “start”. So it's great that the query generator picks it … until it doesn't, perhaps because the model has been retrained. A person experienced with CloudTrail will see that it's looking for the wrong API call; a novice won't, and will assume that the instance was started before the available data.

The second quirk is the condition on eventTime. It adds nothing to the query, and in fact can make it miss results: I generated this query at 17:44 UTC, so if the instance was started in the previous 45 minutes it wouldn't appear in the results.

I also found the ORDER BY and LIMIT clauses strange: they only make sense if instance IDs can be reused (which as far as I know, they can not).

Query #2: who started ec2 instances today

This query is less common, but might be useful to identify whether an person who shouldn't be authorized to start instances is, in fact, able to start them.

SELECT userIdentity.arn,
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventTime >= timestamp '2024-06-12 00:00:00'
    AND eventTime <= timestamp '2024-06-12 23:59:59'

This isn't quite what I asked: it's listing all of the events, not just the list of identities. To give me what I want, it should use distinct userIdentity.arn, and get rid of all of the other fields in the select-list. I suspect, however, that the training data was focused on retrieving individual records, so that's what the query generator “thought” people would want.

A more subtle issue is the date range. If you're familiar with the details of CLoudTrail events, you know that event timestamps are UTC. But if you're not familiar — in other words, the type of person who would find a query generator useful — you might think incorrectly that the timestamps are local time. I tried appending “local time” to the query, but it had no effect; all date ranges in generated queries are UTF

Query #3: how many EC2 instances were started today

I could imagine this query being used by someone investigating the cost of their EC2 fleet. Unfortunately, the generated query makes a rather large mistake.

SELECT COUNT(*) AS instances_started
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
    AND eventsource = ''

This query actually counts the number of requests to start instances. In many cases, one request means one instance, but that's certainly not true in all cases. There are plenty of times when you (or your automated scripts) might spin up a fleet of instances in a single request.

To get the actual count, you must know that the responseElements object in the raw event contains an instancesSet sub-object, which in turn has an items element that's an array of information about the actual instances created. And you can use that information to write a SELECT clause like this:

SELECT sum(json_array_length(json_extract(responseElements['instancesSet'], '$.items'))) AS instances_started

This is a case where a query generator should be great, because its model could incorporate information about the contents of each event type. Unfortunately, large language models only “knows” what they have seen in their training data. The first query I showed above indicates that this model has some information about the deep event structures, but not enough to go past the basics.

Query #4: who started ec2 instance with tag "foobar"

This query seems as simple as my first. But the output has entered the realm of hallucination.

SELECT useridentity.arn,
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventname = 'RunInstances'
    AND resources [0].type = 'EC2'
    AND resources [0].tags ['foobar'] IS NOT NULL
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'

First off, I have no idea where that date range came from; it's not in my prompt. I suspect that it was sent to the query generator as historical context, based on my previous queries. But it shouldn't be.

The conditions on resources are also hallucinations. Multiple hallucinations.

The first, and biggest, is that resources isn't a valid field for this query. There are some events that have a resources element: AssumeRole, for example. But RunInstances does not have this element; the resources that you see listed in the CloudTrail Console are all extracted from the event data programmatically. When I first saw this query I thought that the CloudTrail Lake transformation might be doing this extracton. But when I queried the data I found that the value of that column was always null for RunInstances, so clearly it isn't.

The second hallucination is that the zero element of the resources array contains information about the instance (and assumes that there's only one instance). Like I said, the resource list in the CloudTrail Console is built programmatically, so can't really be used as a guide, but it showed the instance as the fourth element of the array in the event I was viewing.

Third, the type field, for those events that have resources, is a qualified name such as AWS::IAM::Role. The CloudTrail Console uses AWS::EC2::Instance for the instance entry, not EC2, so that's what I would expect to see (if the RunInstances event did, in fact, have resources).

And lastly, I don't believe the tags element exists. I don't see it in any of the events that do have resources, and it doesn't appear in the (programatically generated) resource list in the CloudTrail Console. But this condition is at least plausible; when I tried this same prompt on June 11, it came back with a condition involving the resource ARN.

All of these hallucinations mean that this query will never return results, whether or not there's an actual instance tagged “foobar”

The one good thing about this query is that it eliminates the extra fields in the select-list. And it adds useridentity.username, which is useful even if it's only populated when an actual IAM user is responsible for the operation (which, in a world moving away from IAM users and to Identity Center, will be infrequent).

Wrapping Up

I spun up my first EC2 instance in 2008. And since then, I've usually been delighted by new AWS services, even if they had rough edges. I haven't feel that way since AWS decided to focus on gen-AI.

I don't have anything against generative AI per se (well, other than copyright infringement and the possibility to leak confidential information). I think it's great when helping an inexperienced writer organize and polish their thoughts (but hopefully they learn from the experience). And I have a friend who uses it as a “better search” when he's working with something new (but he always backs that up by reading the documentation, never blindly taking the output of the model).

But when it comes to programming, gen-AI always seems to miss the mark. I've written elsewhere about my experiences with Amazon CodeGuru reviewer. A friend recently told me that his company abandoned using Amazon CodeWhisperer to help them write CDK scripts; you'd think that would be a place where it would excel, having a large corpus of training data. And I quickly got bored with trying to use ChatGPT to generate a non-trivial progamming construct.

I think that the reason is that large language models are very good at representing syntax, but that's only a small part of programming. Programming is much more about data: its structure and how it changes over time. And also about the causal and temporal relationships between pieces of code. Syntax is essential, but it's not sufficient.

So my job is probably safe until AI actually learns how to reason, if it ever does. But what about the CloudTrail query generator?

To be honest, I don't see the market. People who use CloudTrail all the time probably use a homegrown search engine or a commercial SIEM. So that leaves novices, people who might need an occasional query as part of their workload. But given the quality of the output, I think they're likely to get frustrated very quickly and walk away (or worse, decide that the query output is correct). Perhaps the model will get better over time, but as the saying goes, you only get one chance to make a first impression.

Tuesday, May 28, 2024

My Tests Are Good At Finding AWS Bugs

I recently made a minor change to my logging library. And, minor though it was, I run a full suite of integration tests before any release. I wasn't expecting any problems, so was surprised when the first test against Kinesis failed. More puzzling: the primary assertions in the test all passed; the records were properly written to the stream. What failed was an assertion that there weren't any internally-logged errors.

Digging in, I discovered that the failure was a ResourceNotFoundException: Kinesis was claiming that the stream didn't exist! This particular test created the stream, waited for it to become active, and then immediately tried to write a batch of messages to it. I looked through the code to see if there was any way that it could be writing the batch before the stream was active, and didn't see one. I then put debugging messages into the code to be sure. The stream was definitely active when the batch was written.

I then extracted the code into a stand-alone program,** and discovered that PutRecords claimed that a newly-created stream didn't exist, even after a dozen or more calls to DescribeStreamSummary said that yes, it did. And a second call to PutRecords, invoked immediately afterward, succeeded. It seems that their internal name->ARN lookup service has a race condition.

Since my company has business-tier support with AWS, I submitted a ticket that included the demo program. And 24 hours later received a message asking for additional information, including “a detailed system architecture.” I'm tempted to rant about AWS support, but I'm sure they're overwhelmed. At least they don't use Q to handle first-tier response — yet. I also pinged a friend who works for AWS, but haven't heard back via that channel either.

I also did some digging and discovered that I could work around the problem by (1) passing the stream ARN to PutRecords instead of the name, or (2) inserting a one-second sleep between receiving the “active” status and trying to send the batch.

As far as the log-writer itself is concerned, the error isn't a problem: it retries the send. That's why the primary assertions succeed: all of the messages are in fact written. So my decision is whether to wait for AWS to fix the problem or change my tests to work-around it. Or change the library to use ARN, on the assumption that that's the “preferred” argument for the PutRecords call.

* In case you're wondering about the title, you can read about my last experience here.

** A Python program that displays the behavior is here.

Monday, March 4, 2024

Ordinary Users Don't Stand a Chance

I have a “tax PC”: an old HP running Windows 10, that spends ten months of the year in a closet, and two months hooked up to a segment of my home network where it's unlikely to be attacked by the various network-attached devices living in my house. Each February I spend a couple of days applying the year's accumulated Windows updates, then I install Turbo-Tax so that I can do my taxes.

This year the Turbo-Tax installation failed:

Turbo-Tax installation error

I Google for the missing DLL, and end up on a Microsoft-hosted discussion board where the first answer says that I need to update my Visual C++ runtime, and provides a link to some third-party software site.

I look in my computer's list of applications and see that I have the “2010” runtime. Apparently that isn't something that Windows Update considers important. So I go to a Microsoft-hosted page, download the x64 version of the runtime (because the Windows “About” page tells me that I have an x64 computer with 64-bit OS installed), and … get the same error.

Perhaps Turbo-Tax isn't using the latest Visual C++ (never mind that the runtime advertises itself as being for 2015 onward). So I downloaded and installed the x64 versions the other runtimes, rebooted, and tried again. And the installer failed again.

Intuit provides a page titled “Troubleshoot installation and running TurboTax for Windows.” This page is the first result if you go to their Support page and search for the DLL name. It tells you to do things like “Check your TurboTax CD and computer's CD drive” and “Clean up and defragment your hard drive.”

I kept digging through search results, and eventually I found a discussion board posting in which someone says that they had to install the x86 version of the Visual C++ runtime. OK, that makes sense: lowest-common denominator, X86 libraries work on an x64 processor. So I did that, installed, rebooted, and the installer ran successfuly. According to a comment on that thread, the runtime should have been provided on the Turbo-Tax CD but wasn't.

I've been a professional software developer for 40 years. I don't use Windows, but I know what a DLL is, and I'm comfortable with searching the web for answers. More important, I can identify a sketchy answer, and don't download from random links on the Internet. But what about people that don't have that knowledge?

Maybe there aren't any other people who run Turbo-Tax locally, installing from physical media. Or maybe everybody else installs on their brand-new PC running Windows 11, with all the latest runtimes installed. But I have to believe that I'm not unique. There must be other people that have run into this problem.

What do they do?