blog.kdgregory.com

Friday, September 6, 2024

My Thoughts on CodeCommit Deprecation

AWS's recent fait accompli deprecation of CodeCommit and six other services was a shock to me.*

Not because CodeCommit was a particularly good product — it wasn't — but because it could have been one with some investment. While there are many Git repository services in the world, CodeCommit was part of the AWS ecosystem, giving it some unique capabilities. Capabilities that AWS never exploited.

CodeCommit was released at re:Invent 2014, and became generally available the following summer. At some point after that, I moved a couple of repositories that I'd been self-hosting. At the time, GitHub didn't offer free private repositories, and I was happy to pay AWS a few cents a month to ensure that they were available anywhere and safely backed up.

The first thing I noticed about CodeCommit was that it was painfully slow. To put some numbers on that: an HTTPS clone of my appenders library (~10,000 objects, 2.6 MiB) from GitHub takes 0.917 seconds. From the identical repository on CodeCommit, the time is 7.407 seconds. My website, which has fewer objects but more bytes, takes 22 seconds. While you might not be frequently cloning repositories, pulls and pushes are also much slower than with GitHub.

I was also annoyed that I had to set up an association between my SSH key and an IAM user: I have several machines, each with its own key. Making this more annoying, the key associations don't provide any indication of which key they represent; it would have been nice to see the associated public key, or even a key fingerprint.

But as I learned later, I didn't need to make those associations: the AWS CLI provides a Git credential helper for CodeCommit that generates temporary credentials based on your IAM identity. It's not terribly well documented, and an experienced Git user probably wouldn't look at the documentation anyway (at least, I didn't), especially since the “helpful hints” shown after creating a repository make no mention of it.

Highlighting this one feature would have made CodeCommit a far more useful tool than it was. For example, a CI/CD pipeline running on AWS infrastructure wouldn't need to store credentials to access the source repository. But there's no suggestion that users could do that, even in AWS's own CodeBuild user guide (which gives examples for GitHub, BitBucket, and GitLab).

Or, as an example closer to my own needs: five years ago I submitted an issue to the “CloudFormation roadmap”, asking for the ability to retrieve templates from arbitrary HTTPS URLs (and, based on the stars and comments, this seems to be a common wish). While I can understand reluctance to allow generic public repository URLs as a security risk (somebody, somewhere, would check in something they shouldn't), CodeCommit and role-based credentials would allow CloudFormation to easily support private templates. But with CodeCommit's deprecation, that's a lost cause.

Historically, AWS has made a big deal of “dog-fooding” their services; I don't think this happened with CodeCommit. If you want to download SDK source, or any other AWS-provided open source, you go to GitHub. Although, given how slow CodeCommit is, maybe we should be thankful; something like the Java SDK already takes an enormously long time to download. But without AWS using CodeCommit internally, there was no pressure to improve. And no serendipitous “hey, we can do X!” moments.

So what? Anybody who signed up for CodeCommit can still use it, right? And perhaps it was just hubris on AWS's part to compete with companies that already produced developer tooling; perhaps the “Code” suite should never have been released in the first place.

But for me, it's yet another sign that AWS has moved away from “now go build.” And that was what attracted me to AWS in the first place.


* “Fait accompli” because they did not announce their plans in advance, instead using blog posts like this one, which was released after they'd already blocked new users.

Friday, June 14, 2024

CloudTrail + GenAI = Disappointment

AWS just announced a natural language query generator for CloudTrail Lake. Using it, you can write prompts like “What are my top errors in the past month?” and it will spit out the SQL to get the answer. On the surface, that seems great: CloudTrail events can be very complex: they hold the request and response data for thousands of unrelated API operations. In my experience, extracting the information you need can take multiple minutes with a search engine, and longer if you use a tool such as Athena.

Unfortunately, as the title of this post tells you, it doesn't quite live up to expectations. In this post I highlight four queries that I think are simple and likely to be frequently used, but which show what's lost in translation from natural language.

Note 1: I ran the queries in this post on June 12, 2024, the same week that the query generator was announced in preview. I expect it to go through continuous training, so you might not get the same output that I did. I don't plan to make this an evergreen post, and will not update it if the queries change. It calls out the state of the world at one particular point in time.

Note 2: I made minor changes to identifiers in the generated queries, in case there's information about my account embedded within. I did not change the queries in any other way, including formatting.

Query #1: who started ec2 instance i-0fe20093cf243bca1

Sometimes you find an instance running, and have no idea why it exists. To answer that question, you need to talk to the person who started the instance.

SELECT userIdentity.arn,
    eventName,
    eventTime,
    awsRegion,
    responseElements
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND element_at(responseElements, 'instancesSet') LIKE '%i-0fe20093cf243bca1%'
    AND eventTime <= '2024-06-12 17:00:00'
ORDER BY eventTime DESC
LIMIT 1;

This query does what it's supposed to, and is not necessarily something that's easy to write off the top of your head: you need to know how to find the instance ID in the event's responseElements sub-object. Fortunately, the query generator does. But there are a few quirks that I think should be called out. The first is that this query exhibits “do what I mean, not what I say” behavior.

There are two EC2 API calls that look very similar: StartInstances and RunInstances. The former restarts an existing, stopped instance; the latter starts a new instance. In maybe 99% of cases, you (or at least I) actually mean the latter API call when using the word “start”. So it's great that the query generator picks it … until it doesn't, perhaps because the model has been retrained. A person experienced with CloudTrail will see that it's looking for the wrong API call; a novice won't, and will assume that the instance was started before the period covered by the available data.

The second quirk is the condition on eventTime. It adds nothing to the query, and in fact can make it miss results: I generated this query at 17:44 UTC, so if the instance was started in the previous 45 minutes it wouldn't appear in the results.

I also found the ORDER BY and LIMIT clauses strange: they only make sense if instance IDs can be reused (which as far as I know, they can not).
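
For comparison, here's roughly what I would write by hand: the same query, minus the time cutoff and the ORDER BY / LIMIT clauses (a sketch, reusing the event data store ID from the generated query):

SELECT userIdentity.arn,
    eventName,
    eventTime,
    awsRegion,
    responseElements
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND element_at(responseElements, 'instancesSet') LIKE '%i-0fe20093cf243bca1%'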

Query #2: who started ec2 instances today

This query is less common, but might be useful to identify whether a person who shouldn't be authorized to start instances is, in fact, able to start them.

SELECT userIdentity.arn,
    eventName,
    eventTime,
    recipientAccountId,
    awsRegion,
    responseElements
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventTime >= timestamp '2024-06-12 00:00:00'
    AND eventTime <= timestamp '2024-06-12 23:59:59'

This isn't quite what I asked: it's listing all of the events, not just the list of identities. To give me what I want, it should use distinct userIdentity.arn, and get rid of all of the other fields in the select-list. I suspect, however, that the training data was focused on retrieving individual records, so that's what the query generator “thought” people would want.

A more subtle issue is the date range. If you're familiar with the details of CloudTrail events, you know that event timestamps are UTC. But if you're not familiar — in other words, the type of person who would find a query generator useful — you might think incorrectly that the timestamps are local time. I tried appending “local time” to the query, but it had no effect; all date ranges in generated queries are UTC.
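
Putting those two observations together, the query I actually wanted looks something like this (again a sketch; the date range is still UTC, just now a deliberate choice rather than an accident):

SELECT DISTINCT userIdentity.arn
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventTime >= timestamp '2024-06-12 00:00:00'
    AND eventTime <= timestamp '2024-06-12 23:59:59'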

Query #3: how many EC2 instances were started today

I could imagine this query being used by someone investigating the cost of their EC2 fleet. Unfortunately, the generated query makes a rather large mistake.

SELECT COUNT(*) AS instances_started
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
    AND eventsource = 'ec2.amazonaws.com'

This query actually counts the number of requests to start instances. In many cases, one request means one instance, but that's certainly not true in all cases. There are plenty of times when you (or your automated scripts) might spin up a fleet of instances in a single request.

To get the actual count, you must know that the responseElements object in the raw event contains an instancesSet sub-object, which in turn has an items element that's an array of information about the actual instances created. And you can use that information to write a SELECT clause like this:

SELECT sum(json_array_length(json_extract(responseElements['instancesSet'], '$.items'))) AS instances_started
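
Dropped into the rest of the generated query, the whole thing would look something like this (a sketch, not something I've verified against a multi-instance launch):

SELECT sum(json_array_length(json_extract(responseElements['instancesSet'], '$.items'))) AS instances_started
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventTime >= '2024-06-12 00:00:00'
    AND eventTime <= '2024-06-12 23:59:59'
    AND eventSource = 'ec2.amazonaws.com'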

This is a case where a query generator should be great, because its model could incorporate information about the contents of each event type. Unfortunately, large language models only “know” what they have seen in their training data. The first query I showed above indicates that this model has some information about the deep event structures, but not enough to go past the basics.

Query #4: who started ec2 instance with tag "foobar"

This query seems as simple as my first. But the output has entered the realm of hallucination.

SELECT useridentity.arn,
    useridentity.username
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventname = 'RunInstances'
    AND resources [0].type = 'EC2'
    AND resources [0].tags ['foobar'] IS NOT NULL
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
GROUP BY 1,
    2

First off, I have no idea where that date range came from; it's not in my prompt. I suspect that it was sent to the query generator as historical context, based on my previous queries. But it shouldn't be.

The conditions on resources are also hallucinations. Multiple hallucinations.

The first, and biggest, is that resources isn't a valid field for this query. There are some events that have a resources element: AssumeRole, for example. But RunInstances does not have this element; the resources that you see listed in the CloudTrail Console are all extracted from the event data programmatically. When I first saw this query I thought that the CloudTrail Lake transformation might be doing this extraction. But when I queried the data I found that the value of that column was always null for RunInstances, so clearly it isn't.

The second hallucination is that element zero of the resources array contains information about the instance (and assumes that there's only one instance). Like I said, the resource list in the CloudTrail Console is built programmatically, so it can't really be used as a guide, but it showed the instance as the fourth element of the array in the event I was viewing.

Third, the type field, for those events that have resources, is a qualified name such as AWS::IAM::Role. The CloudTrail Console uses AWS::EC2::Instance for the instance entry, not EC2, so that's what I would expect to see (if the RunInstances event did, in fact, have resources).

And lastly, I don't believe the tags element exists. I don't see it in any of the events that do have resources, and it doesn't appear in the (programmatically generated) resource list in the CloudTrail Console. But this condition is at least plausible; when I tried this same prompt on June 11, it came back with a condition involving the resource ARN.

All of these hallucinations mean that this query will never return results, whether or not there's an actual instance tagged “foobar”.

The one good thing about this query is that it eliminates the extra fields in the select-list. And it adds useridentity.username, which is useful even if it's only populated when an actual IAM user is responsible for the operation (which, in a world moving away from IAM users and to Identity Center, will be infrequent).
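
If I actually needed to answer this question, I'd probably fall back on the same blunt instrument that worked for the first query: search the raw responseElements for the tag value. A sketch, with two caveats: it assumes the tag was applied at launch via TagSpecifications (tags added later with CreateTags won't appear in the RunInstances event at all), and it will match “foobar” anywhere in the instance data, not just in a tag.

SELECT DISTINCT userIdentity.arn
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND element_at(responseElements, 'instancesSet') LIKE '%foobar%'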

Wrapping Up

I spun up my first EC2 instance in 2008. And since then, I've usually been delighted by new AWS services, even if they had rough edges. I haven't felt that way since AWS decided to focus on gen-AI.

I don't have anything against generative AI per se (well, other than copyright infringement and the possibility of leaking confidential information). I think it's great when helping an inexperienced writer organize and polish their thoughts (but hopefully they learn from the experience). And I have a friend who uses it as a “better search” when he's working with something new (but he always backs that up by reading the documentation, never blindly taking the output of the model).

But when it comes to programming, gen-AI always seems to miss the mark. I've written elsewhere about my experiences with Amazon CodeGuru Reviewer. A friend recently told me that his company abandoned using Amazon CodeWhisperer to help them write CDK scripts; you'd think that would be a place where it would excel, having a large corpus of training data. And I quickly got bored with trying to use ChatGPT to generate a non-trivial programming construct.

I think that the reason is that large language models are very good at representing syntax, but that's only a small part of programming. Programming is much more about data: its structure and how it changes over time. And also about the causal and temporal relationships between pieces of code. Syntax is essential, but it's not sufficient.

So my job is probably safe until AI actually learns how to reason, if it ever does. But what about the CloudTrail query generator?

To be honest, I don't see the market. People who use CloudTrail all the time probably use a homegrown search engine or a commercial SIEM. So that leaves novices, people who might need an occasional query as part of their workload. But given the quality of the output, I think they're likely to get frustrated very quickly and walk away (or worse, decide that the query output is correct). Perhaps the model will get better over time, but as the saying goes, you only get one chance to make a first impression.

Tuesday, May 28, 2024

My Tests Are Good At Finding AWS Bugs

I recently made a minor change to my logging library. And, minor though it was, I run a full suite of integration tests before any release. I wasn't expecting any problems, so was surprised when the first test against Kinesis failed. More puzzling: the primary assertions in the test all passed; the records were properly written to the stream. What failed was an assertion that there weren't any internally-logged errors.

Digging in, I discovered that the failure was a ResourceNotFoundException: Kinesis was claiming that the stream didn't exist! This particular test created the stream, waited for it to become active, and then immediately tried to write a batch of messages to it. I looked through the code to see if there was any way that it could be writing the batch before the stream was active, and didn't see one. I then put debugging messages into the code to be sure. The stream was definitely active when the batch was written.

I then extracted the code into a stand-alone program,** and discovered that PutRecords claimed that a newly-created stream didn't exist, even after a dozen or more calls to DescribeStreamSummary said that yes, it did. And a second call to PutRecords, invoked immediately afterward, succeeded. It seems that their internal name->ARN lookup service has a race condition.

Since my company has business-tier support with AWS, I submitted a ticket that included the demo program. And 24 hours later received a message asking for additional information, including “a detailed system architecture.” I'm tempted to rant about AWS support, but I'm sure they're overwhelmed. At least they don't use Q to handle first-tier response — yet. I also pinged a friend who works for AWS, but haven't heard back via that channel either.

I also did some digging and discovered that I could work around the problem by (1) passing the stream ARN to PutRecords instead of the name, or (2) inserting a one-second sleep between receiving the “active” status and trying to send the batch.

As far as the log-writer itself is concerned, the error isn't a problem: it retries the send. That's why the primary assertions succeed: all of the messages are in fact written. So my decision is whether to wait for AWS to fix the problem or change my tests to work around it. Or change the library to use the ARN, on the assumption that that's the “preferred” argument for the PutRecords call.


* In case you're wondering about the title, you can read about my last experience here.

** A Python program that displays the behavior is here.