blog.kdgregory.com

Monday, December 28, 2020

Diving Into First-Run Times for Lambdas Written in Java

A few weeks ago I answered a Stack Overflow question about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.

The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.

Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to initialize the Lambda runtime, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.

OK, on to Teh Codez:

// assumed imports: the v1 CloudWatch Logs SDK and the Lambda runtime interface
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;

// the class name is arbitrary; it just has to match the Lambda's handler configuration
public class Handler
{
    public void handler(Object ignored, Context context)
    {
        long start = System.currentTimeMillis();

        // first timed operation: create the SDK client
        AWSLogs client = AWSLogsClientBuilder.defaultClient();

        long clientCreated = System.currentTimeMillis();

        // second timed operation: make an API call with that client
        client.describeLogGroups();

        long apiInvoked = System.currentTimeMillis();

        System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
        System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));
    }
}

This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before and after each of these actions I retrieve the current system time. How long can it take?

Setting a Baseline

People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.

If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you then think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.

But observations are always better than guesses. Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes (to get these numbers I created a main() that just invokes the handler function):

time to create SDK client =    744
time to make API call     =    469

That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:

time to create SDK client =    702
time to make API call     =    522
time to create SDK client =     21
time to make API call     =    151

Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux time command:

> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar 
time to create SDK client =    576
time to make API call     =    623

real    0m1.320s
user    0m2.130s
sys     0m0.113s

The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; here it's more than 60% higher than the real time, which indicates that the CPU is doing a lot of work, on more than one core. If the program were spending all of its time making network calls, the user time would be far less than the real time.

This information alone is enough to make a first pass at optimization.
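
For reference, the laptop numbers above come from a trivial harness that invokes the handler directly; a minimal sketch (the harness class is my own, and it assumes AWS credentials and region are available from the environment):

// hypothetical harness for running the handler outside of Lambda
public class TimingMain
{
    public static void main(String[] argv)
    {
        Handler handler = new Handler();
        handler.handler(null, null);    // the handler ignores both arguments, so nulls are fine
        handler.handler(null, null);    // second call shows the effect of an already-initialized JVM and SDK
    }
}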

Memory Allocation

When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a not-always-deserved reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.

However, maximum memory consumption only tells part of the story. A much — much — more important fact is that Lambda provides CPU in proportion to allocated memory. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you get for a given memory size, but assuming a linear relationship, 512 MB works out to a bit less than a third of a single vCPU (itself of unspecified performance).

The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.

                   512 MB   1024 MB   2048 MB   4096 MB
Create client        5298      2493      1272      1019
Invoke API call      3844      2023      1061       613
Billed duration      9213      4555      2349      1648

If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 was accounted for by my timestamps).

These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say, Python (where initialization takes around 150 milliseconds).

Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:

The most important thing that you can do to improve Lambda performance is increase memory allocation.

This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.

However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.
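
The tracking code itself is nothing fancy; a sketch of the approach, using the standard garbage-collector MXBeans (not necessarily what I ran verbatim):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats
{
    // cumulative time spent in garbage collection, in milliseconds, across all collectors
    public static long totalGcMillis()
    {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans())
        {
            long millis = bean.getCollectionTime();
            if (millis > 0) total += millis;    // a collector may report -1 if it doesn't track time
        }
        return total;
    }
}

Call it at the start and end of the handler, and log the difference.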

Classloading

So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.

The JVM loads classes on-demand. It won't load the AWSLogsClientBuilder class until it executes the line that calls defaultClient(). Loading and initializing a class requires loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.
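
If you want to see the number for your own program, the JVM will tell you; a sketch using the class-loading MXBean (running with the -verbose:class flag and counting lines works too):

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassloadStats
{
    public static void report()
    {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        // classes currently loaded, and the total loaded since JVM start
        System.err.format("currently loaded classes = %6d\n", bean.getLoadedClassCount());
        System.err.format("total loaded classes     = %6d\n", bean.getTotalLoadedClassCount());
    }
}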

Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The tricks available to stand-alone Java programs aren't available in Lambda.

However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.

A Different SDK

Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as a “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the SDK docs themselves, via a link that doesn't go anywhere).

But, as I said before, there's no substitute for observation. Here are the numbers using version 2.15.53:

                   512 MB   1024 MB   2048 MB   4096 MB
Create client        4965      2278      1141       959
Invoke API call      4235      2062      1047       661
Billed duration      9237      4357      2204      1637

No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”
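
For reference, here's what swapping the HTTP implementation looks like; a sketch using the URLConnection-based client, which I haven't benchmarked:

import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient;
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;

public class V2ClientFactory
{
    // requires the url-connection-client artifact on the classpath, and excluding the
    // default Apache client from the SDK dependency if you don't want it loaded at all
    public static CloudWatchLogsClient lightweightClient()
    {
        return CloudWatchLogsClient.builder()
                                   .httpClientBuilder(UrlConnectionHttpClient.builder())
                                   .build();
    }
}

The code change is trivial; the work is in the dependency management, and in measuring whether it actually helps.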

For that matter, switching to the v2 SDK requires a significant amount of tedious recoding to change package and method names. And as of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.

Packaging

Earlier this year I wrote an article about the different ways to package a Java Lambda. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't record the performance gained by switching to the Assembly plugin.

Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:

                   512 MB   4096 MB
Shade Plugin         9213      1648
Assembly Plugin      8138      1358

So, a decent speedup with a small memory allocation (which I hope you wouldn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this requires little effort.

Provisioned Concurrency

On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, you'll see no change in first-run execution time.

The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.

To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, don't go there). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).
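
Here's a sketch of what that looks like for the example Lambda (the limit(1) request is simply a cheap, read-only way to pull the SDK's request/response machinery through the classloader):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.DescribeLogGroupsRequest;

public class WarmedHandler
{
    private final AWSLogs client;

    public WarmedHandler()
    {
        // runs during initialization, which provisioned concurrency performs ahead of time
        client = AWSLogsClientBuilder.defaultClient();

        // a harmless read-only call that forces classloading and first-use initialization;
        // pick something that can't change your production data
        client.describeLogGroups(new DescribeLogGroupsRequest().withLimit(1));
    }

    public void handler(Object ignored, Context context)
    {
        client.describeLogGroups();
    }
}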

Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.

Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month in addition to the per-request cost. It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.

Wrapping Up

I first implemented a Lambda web-app with Java nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with an embedded Jetty server.

That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the Hotspot engine will optimize performance.

The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.

Monday, November 9, 2020

Building Modular Inline Policies with Terraform

Modules are the primary mechanism for reusability in Terraform. They're especially useful in cases where you need to stamp out a bunch of related resources multiple times. For example, an SQS queue along with its dead-letter queue and “least privilege” access policies.

The problem with creating a separate managed policy for each module invocation is that role/policy attachments are a limited resource: by default, you're limited to ten. You can increase that by opening a support ticket, but the limit is still low (20 unless you can show special circumstances), and it has to be done in every account where you'll create the role.

Alternatively, you can construct a combined policy in the invoking module as long as the child module outputs the queue ARNs. However, this eliminates much of the benefit of reusability, as you have to explicitly build the combined policy everywhere that you use the queue module. It also means that the person writing the combined policy has to understand the IAM permissions that are appropriate to a queue consumer or producer. At that point, why not just explicitly create your resources?
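
To make that concrete, the combined policy would look something like this (a sketch, assuming the queue module exports a queue_arn output):

# hypothetical: the invoking module builds the policy itself, so it has to know
# the correct SQS actions for every queue that the application writes to
data "aws_iam_policy_document" "application_producer_policy" {
  statement {
    actions = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ]
    resources = [
      module.notifications_queue.queue_arn,
      module.reports_queue.queue_arn
    ]
  }
}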

In this post, I look at a different alternative: the queue module exports the JSON for the producer and consumer statements. The invoking module then combines the statements for each of its queues into a single application policy.

Wait, doesn't Terraform allow you to combine policies?

Not in the way that I want.

The aws_iam_policy_document data source allows you to “layer” policies using its source_json and override_json arguments, but these only accept a single document. Which means that you can build a complex policy document step by step, but not as a single data source. In a real-world configuration, I think that would quickly turn into an unmaintainable mess.

The policy documents guide doesn't give any other alternatives: it recommends the data source or a multi-line “heredoc”. Surprisingly, it doesn't mention using jsonencode() at all.

Creating an SQS queue module

A modern web-application is a complex beast: it often consists of a (relatively) simple API server, with background processes to handle long-running tasks such as sending notifications, rendering images, or generating reports. And linking these together, at least on AWS, are SQS queues.

An SQS queue, although it seems simple, actually has a lot of hidden complexity:

  • Each queue should have a companion dead-letter queue, along with a "redrive policy" that determines how many times a consumer can fail to process a message before it goes to the dead-letter queue.
  • How long does a consumer have to process a message before it's retried (the visibility timeout).
  • How long can a message stay in the queue without being picked up (the retention period, used to prevent queues from growing without bound when there are no/insufficient consumers).
  • You should also have “least-privilege” policies for the queue's producers and consumers.

This is a perfect use for Terraform modules: rather than repeat all of the resource definitions (as you have to do with CloudFormation), you simply create a module and reference it from your main configuration:

module "notifications_queue" {
  source = "./modules/sqs"
  queue_name = "Notifications"
}

module "reports_queue" {
  source = "./modules/sqs"
  queue_name = "ReportGeneration"
  visibility_timeout = 600
  receive_count = 1
}

At the core of this module are the following two resource definitions, for the primary queue and its companion dead-letter queue (every primary should have a DLQ; trust me, you'll want it sooner or later):

resource "aws_sqs_queue" "base_queue" {
  name                        = var.queue_name
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
  redrive_policy              = jsonencode({
                                    "deadLetterTargetArn" = aws_sqs_queue.deadletter_queue.arn,
                                    "maxReceiveCount" = var.retry_count
                                })
}

resource "aws_sqs_queue" "deadletter_queue" {
  name                        = "${var.queue_name}-DLQ"
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
}

As you can see, all configuration is via module variables. Rather than show the entire variables.tf file, I'll use just two to give you a sense of my philosophy. The first, of course, is queue_name: this is the name of the queue, and can't be defaulted. The second, retention_period, is something that most consumers don't care about, but some might. So I picked a valid default (in my experience most messages have no value after 24 hours), and the invoking module can override if needed.

variable "queue_name" {
  description = "The name of the queue. Used as a prefix for related resource names."
  type = string
}

variable "retention_period" {
  description = "Time (in seconds) that messages will remain in queue before being purged"
  type = number
  default = 86400
}

So far, nothing unusual. But what I'm not showing (yet) are the policy documents that let your applications read and write the queue.

If you were just building a proof-of-concept, you might consider using the pre-existing AmazonSQSFullAccess managed policy. However, that policy grants all operations on all queues. Not just read and write, but the ability to create and delete them as well. For a real-world application, you want a “least privilege” policy that only allows actions on the queue that you've just created. And really, you want separate policies for the programs that write to the queue and those that read from the queue. To support this, my module also creates two managed policies; here's the producer policy:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer_policy"
  description = "Attach this policy to producers for ${var.queue_name} SQS queue"
  policy      = data.aws_iam_policy_document.producer_policy.json
}

data "aws_iam_policy_document" "producer_policy" {
  statement {
    actions = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}
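
The consumer policy is the mirror image; here's a sketch of its statement (the exact action list is my choice of typical read-side permissions), wrapped in an aws_iam_policy resource just like the producer version:

data "aws_iam_policy_document" "consumer_policy" {
  statement {
    actions = [
      "sqs:ChangeMessageVisibility",
      "sqs:DeleteMessage",
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:ReceiveMessage"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}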

Simple, but as I said above, there's a limit on the number of policies that you can attach to a single role. In a real-world micro-service application, you might have a half-dozen queues. Or more.

Extracting the Policy Statement

I originally created this example for a user group talk, so while it was based on real-world experience it wasn't actually used in the real world. I recently had to implement an SQS module for the real world, and decided to tackle the “too many policies” problem (ironically, I don't think it will be a problem for that specific use case).

After reviewing the various documents linked above, I realized that none of them mentioned the jsonencode function. This seemed like a rather strange omission, especially given the inclusion of approaches based on string literals (although not nearly as bad as XML, JSON still has its share of gotchas for those who wish to write it using strings).

As it turns out, creating policy documents with jsonencode() instead of using a data source works quite well:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [{
                    "Effect" = "Allow",
                    "Action" = [
                      "sqs:GetQueueAttributes",
                      "sqs:GetQueueUrl",
                      "sqs:SendMessage",
                      "sqs:SendMessageBatch"
                    ],
                    "Resource" = [
                      aws_sqs_queue.base_queue.arn
                    ]
                  }]
                })
}

That was a start, but it wasn't really where I wanted to go. The next step was to move the JSON into a local:

locals {
  producer_policy_statement = {
    "Effect" = "Allow",
    "Action" = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ],
    "Resource" = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}

Now I could make the policy reference the local, rather than either inline JSON or a data source:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [local.producer_policy_statement]
                })
}

Constructing an application role

That's all well and good, but how does it provide any benefit over the former self-contained module? The answer is that you can reference the JSON in outputs.tf:

output "producer_policy_statement" {
  value = local.producer_policy_statement
}

And that enables the following change in the invoking module: rather than a bunch of role/policy attachments, the application role combines these snippets of JSON into an inline policy.

resource "aws_iam_role" "application_role" {
  name = "ApplicationRole"

  assume_role_policy = # you saw this above, no need to repeat
}

resource "aws_iam_role_policy" "application_role_queue_policy" {
  name = "application_role_queue_policy"
  role = aws_iam_role.application_role.id

  policy = jsonencode({
    "Version"   = "2012-10-17",
    "Statement" = [
        module.notifications_queue.producer_policy_statement,
        module.rendering_queue.producer_policy_statement,
        module.reports_queue.producer_policy_statement
    ]
  })
}

Now, rather than counting against your quota of policy attachments, these statements count against your quota for inline policy size. Which, for role policies, is 10,240 characters. If that's too low, you can create managed policies composed of individual statements, for a total of nearly 70k per role (10 policies at 6k each, and one inline policy). If your application needs a role that's more complex than that, you might want to think about breaking it into micro-services.

But in my opinion, the best feature of this implementation is that it keeps the module as the arbiter of its resource's permissions. Which means there's less of a risk that a poorly crafted policy will open your system to an attacker.

If you'd like to try this out, I've saved both the original and the modular versions as GitHub Gists. One thing to note: Gists don't have any conception of sub-directories, and don't allow slashes in filenames, so I've used dashes to indicate directory boundaries (this should be clear once you look at the files).

Saturday, July 4, 2020

Packaging Java for AWS Lambda

When I want to package a Java application in a single JAR with all of its dependencies, I normally turn to Maven’s Shade plugin. This plugin builds an “UberJAR” by unpacking all of the files from the project's dependencies and repackaging them into a single JAR.

The Shade plugin is also used by the Lambda Developer's Guide. I was surprised, then, to see this later in the same guide:

Reduce the time it takes Lambda to unpack deployment packages authored in Java by putting your dependency .jar files in a separate /lib directory. This is faster than putting all your function’s code in a single jar with a large number of .class files.

I was surprised because a single JAR containing class files should be the most efficient way to deploy a project. The JVM memory-maps the JAR files on its classpath, which means that it can access arbitrary parts of those files without an explicit call to the OS kernel. And an UberJAR means that the JVM only has to examine a single directory structure.

After some investigation, I discovered the reason: Lambda unpacks the contents of the deployment package into its filesystem. Which means that an UberJAR turns into a lot of little files, each of which must be read using several kernel calls.

OK, so if you’re a Maven user, how do you package your Lambdas in the preferred way? The answer is that instead of using the Shade plugin, you use the Assembly plugin. The stated goal of this plugin is to allow developers to package their project artifact with related information such as a website. But it's a very flexible plugin, using a “deployment descriptor” to specify which files should be included in the final artifact, and where they should go.

This is the descriptor that I used for my Java Lambda example. It's duplicated for each of the Lambdas, and is found at src/assembly/deployment.xml:

<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
    <id>deployment</id>
    <includeBaseDirectory>false</includeBaseDirectory>
    <formats>
        <format>zip</format>
    </formats>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/lib</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>false</unpack>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>

The descriptor is simple: it says that every JAR used by the Lambda — including the one built by the project itself — should be stored in the /lib directory of the final artifact. To use it, I replace the Shade configuration in my POM with this (with the plugin.assembly.version property defined as 3.2.0):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>${plugin.assembly.version}</version>
    <configuration>
        <descriptors>
            <descriptor>src/assembly/deployment.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

And that's it: when I run mvn package I get a file in my target directory with the suffix -deployment.zip.