
Saturday, December 14, 2024

Once more into the breach: Amazon EFS performance for software builds

This is the third time that I'm writing about this topic. The first time was in 2018, the second in 2021. In the interim, AWS has announced a steady stream of improvements, most recently (October) increasing read throughput to 60 MB/sec.

I wasn't planning to revisit this topic. However, I read Tim Bray's post on the Bonnie disk benchmark, and it had the comment “it’d be fun to run Bonnie on a sample of EC2 instance types with files on various EBS and EFS and so on configurations.” And after a few exchanges with him, I learned that the Bonnie++ benchmark measured file creation and deletion in addition to IO speed. So here I am.

EFS for Builds

Here's the test environment (my previous posts provide more information):

  • All tests run on an m5d.xlarge instance (4 vCPU, 16 GB RAM), running Amazon Linux 2023 (AMI ami-0453ec754f44f9a4a).
  • I created three users: one using the attached instance store, one using EBS (separate from the root filesystem), and one using EFS. Each user's home directory was on the filesystem in question, so all build-specific IO should be confined to that filesystem type, but they shared the root filesystem for executables and /tmp.
  • The local and EBS filesystems were formatted as ext4.
  • The EBS filesystem used a GP3 volume (so a baseline 3000 IOPS).
  • The EFS filesystem used Console defaults: general purpose, elastic throughput. I mounted it using the AWS recommended settings.
  • As a small project, my AWS appenders library, current (3.2.1) release.
  • As a large project, the AWS Java SDK (v1), tag 1.11.394 (the same that I used for previous posts).
  • The build command: mvn clean compile.
  • For each project/user, I did a pre-build to ensure that the local Maven repository was populated with all necessary dependencies.
  • Between builds I flushed and cleared the filesystem cache; see previous posts for details.
  • I used the time command to get timings; all are formatted minutes:seconds, rounded to the nearest second. “Real” time is the elapsed time of the build; if you're waiting for a build to complete, it's the most important number for you. “User” time is CPU time aggregated across threads; it should be independent of disk technology. And “System” time is that spent in the kernel; I consider it a proxy for how complex the IO implementation is (given that the absolute number of requests should be consistent between filesystems).

And here are the results:

                    Appenders                  AWS SDK
                    Real    User    System     Real    User    System
  Instance Store    00:06   00:16   00:01      01:19   02:12   00:09
  EBS               00:07   00:16   00:01      01:45   02:19   00:09
  EFS               00:18   00:20   00:01      15:59   02:24   00:17

These numbers are almost identical to the numbers from three years ago. EFS has not improved its performance when it comes to software build tasks.

What does Bonnie say?

As I mentioned above, one of the things that prompted me to revisit the topic was learning about Bonnie, specifically, Bonnie++, which performs file-level tests. I want to be clear that I'm not a disk benchmarking expert. If you are, and I've made a mistake in interpreting these results, please let me know.

I spun up a new EC2 instance to run these tests. Bonnie++ is distributed as a source tarball; you have to compile it yourself. Unfortunately, I was getting compiler errors (or maybe warnings) when building on Amazon Linux. Since I no longer have enough C++ knowledge to debug such things, I switched to Ubuntu 24.04 (ami-0e2c8caa4b6378d8c), which has Bonnie++ as a supported package. I kept the same instance type (m5d.xlarge).

I ran with the following parameters:

  • -c 1, which uses a single thread. I also ran with -c 4 and -c 16 but the numbers were not significantly different.
  • -s 32768, to use 32 GB for the IO tests. This is twice the size of the VM's RAM, so the test should measure actual filesystem performance rather than the benefit of the buffer cache.
  • -n 16, to create/read/delete 16,384 small files in the second phase.

Here are the results, with the command-lines that invoked them:

  • Local Instance Store: time bonnie++ -d /mnt/local/ -c 1 -s 32768 -n 16
    Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    ip-172-30-1-84  32G  867k  99  128m  13  126m  11 1367k  99  238m  13  4303 121
    Latency              9330us   16707us   38347us    6074us    1302us     935us
    Version 2.00a       ------Sequential Create------ --------Random Create--------
    ip-172-30-1-84      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++     0  99 +++++ +++ +++++ +++
    Latency               146us     298us     998us    1857us      18us     811us
    1.98,2.00a,ip-172-30-1-84,1,1733699509,32G,,8192,5,867,99,130642,13,128610,11,1367,99,244132,13,4303,121,16,,,,,+++++,+++,+++++,+++,+++++,+++,4416,99,+++++,+++,+++++,+++,9330us,16707us,38347us,6074us,1302us,935us,146us,298us,998us,1857us,18us,811us
    
    real	11m10.129s
    user	0m11.579s
    sys	1m24.294s
         
  • EBS: time bonnie++ -d /mnt/ebs/ -c 1 -s 32768 -n 16
    Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    ip-172-30-1-84  32G 1131k  99  125m   8 65.4m   5 1387k  99  138m   7  3111  91
    Latency              7118us   62128us   80278us   12380us   16517us    6303us
    Version 2.00a       ------Sequential Create------ --------Random Create--------
    ip-172-30-1-84      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency               218us     303us     743us      69us      15us    1047us
    1.98,2.00a,ip-172-30-1-84,1,1733695252,32G,,8192,5,1131,99,128096,8,66973,5,1387,99,140828,7,3111,91,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7118us,62128us,80278us,12380us,16517us,6303us,218us,303us,743us,69us,15us,1047us
    
    real	16m52.893s
    user	0m12.507s
    sys	1m4.045s
         
  • EFS: time bonnie++ -d /mnt/efs/ -c 1 -s 32768 -n 16
    Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    ip-172-30-1-84  32G  928k  98  397m  27 60.6m   6  730k  99 63.9m   4  1578  16
    Latency              8633us   14621us   50626us    1893ms   59327us   34059us
    Version 2.00a       ------Sequential Create------ --------Random Create--------
    ip-172-30-1-84      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16     0   0 +++++ +++     0   0     0   0     0   1     0   0
    Latency             22516us      18us     367ms   24473us    6247us    1992ms
    1.98,2.00a,ip-172-30-1-84,1,1733688528,32G,,8192,5,928,98,406639,27,62097,6,730,99,65441,4,1578,16,16,,,,,218,0,+++++,+++,285,0,217,0,944,1,280,0,8633us,14621us,50626us,1893ms,59327us,34059us,22516us,18us,367ms,24473us,6247us,1992ms
    
    real	23m56.715s
    user	0m11.690s
    sys	1m18.469s
         

For the first part, block IO against a large file, I'm going to focus on the “Rewrite” statistic: the program reads a block from the already-created file, makes a change, and writes it back out. For this test, local instance store managed 126 MB/sec, EBS was 65.4 MB/sec, and EFS was 60.6 MB/sec. Nothing surprising there: EFS achieved its recently-announced throughput, and a locally-attached SSD was faster than EBS (although much slower than the 443 MB/sec from my five-year-old laptop, a reminder that EC2 provides fractional access to physical hardware).

The second section was what I was interested in, and unfortunately, the results don't give much insight. Somewhere in the documentation I read that "+++++" in the output signifies that the results aren't statistically relevant (I can't find that link now). Perhaps that's because Bonnie++ dates to the days of single mechanical disks, and modern storage systems are simply too fast for it?

But one number that jumped out at me was “Latency” for file creates: 146us for instance store, 218us for EBS, but a whopping 22516us for EFS. I couldn't find documentation for this value anywhere; reading the code, it appears to measure the longest time for a single operation. Which means that EFS could have 99% of requests completing in under 100ms but a few outliers, or it could mean generally high numbers, of which the one stated here is merely the worst. I suspect it's the latter.

I think, however, that the output from the Linux time command tells the story: each of the runs uses 11-12 seconds of “user” time, and a minute plus of “system” time. But they vary from 11 minutes of “real” time for instance store, up to nearly 24 minutes for EFS. That says to me that EFS has much poorer performance, and since the block IO numbers are consistent, it must be accounted for by the file operations (timestamps on the operation logs would make this a certainty).

Conclusion

So should you avoid EFS for your build systems? Mu.

When I first looked into EFS performance, in 2018, I was driven by my experience setting up a build server. But I haven't done that since then, and can't imagine that too many other people have either. Instead, the development teams that I work with typically use “Build as a Service” tools such as GitHub Actions (or, in some cases, Amazon CodeBuild). Running a self-hosted build server is, in my opinion, a waste of time and money for all but the most esoteric needs.

So where does that leave EFS?

I think that EFS is valuable for sharing files — especially large files — when you want or need filesystem semantics rather than the web-service semantics of S3. To put this into concrete terms: you can read a section of an object from S3, but it's much easier codewise to lseek or mmap a file (to be fair, I haven't looked at how well Mountpoint for Amazon S3 handles those operations). And if you need the ability to modify portions of a file, then EFS is the only real choice: to do that with S3 you'd have to rewrite the entire file.
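
To make the difference concrete, here's a minimal sketch; the paths, bucket name, and key are hypothetical, and the byte range is arbitrary:

import mmap

import boto3

# On EFS you get plain filesystem semantics: seek/read or mmap just works
with open("/mnt/efs/big-file.dat", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        chunk = mm[1_000_000:2_000_000]

# The equivalent partial read from S3 is a ranged GET: a separate HTTP
# request for every region of the object that you want to look at
s3 = boto3.client("s3")
resp = s3.get_object(Bucket="example-bucket", Key="big-file.dat",
                     Range="bytes=1000000-1999999")
chunk = resp["Body"].read()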

For myself, I haven't found that many use cases where EFS is the clear winner over alternatives. And given that, and the fact that I don't plan to set up another self-hosted build server, this is the last posting that I plan to make on the topic.

Friday, June 14, 2024

CloudTrail + GenAI = Disappointment

AWS just announced a natural language query generator for CloudTrail Lake. Using it, you can write prompts like “What are my top errors in the past month?” and it will spit out the SQL to get the answer. On the surface, that seems great: CloudTrail events can be very complex: they hold the request and response data for thousands of unrelated API operations. In my experience, extracting the information you need can take multiple minutes with a search engine, and longer if you use a tool such as Athena.

Unfortunately, as the title of this post tells you, it doesn't quite live up to expectations. In this post I highlight four queries that I think are simple and likely to be frequently used, but which show what's lost in translation from natural language.

Note 1: I ran the queries in this post on June 12, 2024, the same week that the query generator was announced in preview. I expect it to go through continuous training, so you might not get the same output that I did. I don't plan to make this an evergreen post, and will not update it if the queries change. It calls out the state of the world at one particular point in time.

Note 2: I made minor changes to identifiers in the generated queries, in case there's information about my account embedded within. I did not change the queries in any other way, including formatting.

Query #1: who started ec2 instance i-0fe20093cf243bca1

Sometimes you find an instance running, and have no idea why it exists. To answer that question, you need to talk to the person who started the instance.

SELECT userIdentity.arn,
    eventName,
    eventTime,
    awsRegion,
    responseElements
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND element_at(responseElements, 'instancesSet') LIKE '%i-0fe20093cf243bca1%'
    AND eventTime <= '2024-06-12 17:00:00'
ORDER BY eventTime DESC
LIMIT 1;

This query does what it's supposed to, and is not necessarily something that's easy to write off the top of your head: you need to know how to find the instance ID in the event's responseElements sub-object. Fortunately, the query generator does. But there are a few quirks that I think should be called out. The first is that this query exhibits “do what I mean, not what I say” behavior.

There are two EC2 API calls that look very similar: StartInstances and RunInstances. The former restarts an existing, stopped instance; the latter starts a new instance. In maybe 99% of cases, you (or at least I) actually mean the latter API call when using the word “start”. So it's great that the query generator picks it … until it doesn't, perhaps because the model has been retrained. A person experienced with CloudTrail will see that it's looking for the wrong API call; a novice won't, and will assume that the instance was started before the available data.

The second quirk is the condition on eventTime. It adds nothing to the query, and in fact can make it miss results: I generated this query at 17:44 UTC, so if the instance was started in the previous 45 minutes it wouldn't appear in the results.

I also found the ORDER BY and LIMIT clauses strange: they only make sense if instance IDs can be reused (which as far as I know, they can not).

Query #2: who started ec2 instances today

This query is less common, but might be useful to identify whether a person who shouldn't be authorized to start instances is, in fact, able to start them.

SELECT userIdentity.arn,
    eventName,
    eventTime,
    recipientAccountId,
    awsRegion,
    responseElements
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventTime >= timestamp '2024-06-12 00:00:00'
    AND eventTime <= timestamp '2024-06-12 23:59:59'

This isn't quite what I asked: it's listing all of the events, not just the list of identities. To give me what I want, it should use distinct userIdentity.arn, and get rid of all of the other fields in the select-list. I suspect, however, that the training data was focused on retrieving individual records, so that's what the query generator “thought” people would want.

A more subtle issue is the date range. If you're familiar with the details of CloudTrail events, you know that event timestamps are UTC. But if you're not familiar — in other words, the type of person who would find a query generator useful — you might think incorrectly that the timestamps are local time. I tried appending “local time” to the query, but it had no effect; all date ranges in generated queries are UTC.
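
If you do need local-time boundaries, you have to convert them yourself before pasting them into the query. Here's a quick sketch, using America/New_York as an example zone:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# "today" in local time, converted to the UTC strings that the generated
# queries compare eventTime against
local = ZoneInfo("America/New_York")
start_local = datetime(2024, 6, 12, 0, 0, 0, tzinfo=local)
end_local = datetime(2024, 6, 12, 23, 59, 59, tzinfo=local)

print(start_local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))  # 2024-06-12 04:00:00
print(end_local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))    # 2024-06-13 03:59:59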

Query #3: how many EC2 instances were started today

I could imagine this query being used by someone investigating the cost of their EC2 fleet. Unfortunately, the generated query makes a rather large mistake.

SELECT COUNT(*) AS instances_started
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
    AND eventsource = 'ec2.amazonaws.com'

This query actually counts the number of requests to start instances. In many cases, one request means one instance, but that's certainly not true in all cases. There are plenty of times when you (or your automated scripts) might spin up a fleet of instances in a single request.

To get the actual count, you must know that the responseElements object in the raw event contains an instancesSet sub-object, which in turn has an items element that's an array of information about the actual instances created. And you can use that information to write a SELECT clause like this:

SELECT sum(json_array_length(json_extract(responseElements['instancesSet'], '$.items'))) AS instances_started

This is a case where a query generator should be great, because its model could incorporate information about the contents of each event type. Unfortunately, large language models only “know” what they have seen in their training data. The first query I showed above indicates that this model has some information about the deep event structures, but not enough to go past the basics.

Query #4: who started ec2 instance with tag "foobar"

This query seems as simple as my first. But the output has entered the realm of hallucination.

SELECT useridentity.arn,
    useridentity.username
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventname = 'RunInstances'
    AND resources [0].type = 'EC2'
    AND resources [0].tags ['foobar'] IS NOT NULL
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
GROUP BY 1,
    2

First off, I have no idea where that date range came from; it's not in my prompt. I suspect that it was sent to the query generator as historical context, based on my previous queries. But it shouldn't be.

The conditions on resources are also hallucinations. Multiple hallucinations.

The first, and biggest, is that resources isn't a valid field for this query. There are some events that have a resources element: AssumeRole, for example. But RunInstances does not have this element; the resources that you see listed in the CloudTrail Console are all extracted from the event data programmatically. When I first saw this query I thought that the CloudTrail Lake transformation might be doing this extraction. But when I queried the data I found that the value of that column was always null for RunInstances, so clearly it isn't.

The second hallucination is that the zero element of the resources array contains information about the instance (and assumes that there's only one instance). Like I said, the resource list in the CloudTrail Console is built programmatically, so can't really be used as a guide, but it showed the instance as the fourth element of the array in the event I was viewing.

Third, the type field, for those events that have resources, is a qualified name such as AWS::IAM::Role. The CloudTrail Console uses AWS::EC2::Instance for the instance entry, not EC2, so that's what I would expect to see (if the RunInstances event did, in fact, have resources).

And lastly, I don't believe the tags element exists. I don't see it in any of the events that do have resources, and it doesn't appear in the (programmatically generated) resource list in the CloudTrail Console. But this condition is at least plausible; when I tried this same prompt on June 11, it came back with a condition involving the resource ARN.

All of these hallucinations mean that this query will never return results, whether or not there's an actual instance tagged “foobar”.

The one good thing about this query is that it eliminates the extra fields in the select-list. And it adds useridentity.username, which is useful even if it's only populated when an actual IAM user is responsible for the operation (which, in a world moving away from IAM users and to Identity Center, will be infrequent).

Wrapping Up

I spun up my first EC2 instance in 2008. And since then, I've usually been delighted by new AWS services, even if they had rough edges. I haven't felt that way since AWS decided to focus on gen-AI.

I don't have anything against generative AI per se (well, other than copyright infringement and the possibility to leak confidential information). I think it's great when helping an inexperienced writer organize and polish their thoughts (but hopefully they learn from the experience). And I have a friend who uses it as a “better search” when he's working with something new (but he always backs that up by reading the documentation, never blindly taking the output of the model).

But when it comes to programming, gen-AI always seems to miss the mark. I've written elsewhere about my experiences with Amazon CodeGuru Reviewer. A friend recently told me that his company abandoned using Amazon CodeWhisperer to help them write CDK scripts; you'd think that would be a place where it would excel, having a large corpus of training data. And I quickly got bored with trying to use ChatGPT to generate a non-trivial programming construct.

I think that the reason is that large language models are very good at representing syntax, but that's only a small part of programming. Programming is much more about data: its structure and how it changes over time. And also about the causal and temporal relationships between pieces of code. Syntax is essential, but it's not sufficient.

So my job is probably safe until AI actually learns how to reason, if it ever does. But what about the CloudTrail query generator?

To be honest, I don't see the market. People who use CloudTrail all the time probably use a homegrown search engine or a commercial SIEM. So that leaves novices, people who might need an occasional query as part of their workload. But given the quality of the output, I think they're likely to get frustrated very quickly and walk away (or worse, decide that the query output is correct). Perhaps the model will get better over time, but as the saying goes, you only get one chance to make a first impression.

Tuesday, May 28, 2024

My Tests Are Good At Finding AWS Bugs *

I recently made a minor change to my logging library. And, minor though it was, I run a full suite of integration tests before any release. I wasn't expecting any problems, so was surprised when the first test against Kinesis failed. More puzzling: the primary assertions in the test all passed; the records were properly written to the stream. What failed was an assertion that there weren't any internally-logged errors.

Digging in, I discovered that the failure was a ResourceNotFoundException: Kinesis was claiming that the stream didn't exist! This particular test created the stream, waited for it to become active, and then immediately tried to write a batch of messages to it. I looked through the code to see if there was any way that it could be writing the batch before the stream was active, and didn't see one. I then put debugging messages into the code to be sure. The stream was definitely active when the batch was written.

I then extracted the code into a stand-alone program,** and discovered that PutRecords claimed that a newly-created stream didn't exist, even after a dozen or more calls to DescribeStreamSummary said that yes, it did. And a second call to PutRecords, invoked immediately afterward, succeeded. It seems that their internal name->ARN lookup service has a race condition.

Since my company has business-tier support with AWS, I submitted a ticket that included the demo program. And 24 hours later received a message asking for additional information, including “a detailed system architecture.” I'm tempted to rant about AWS support, but I'm sure they're overwhelmed. At least they don't use Q to handle first-tier response — yet. I also pinged a friend who works for AWS, but haven't heard back via that channel either.

I also did some digging and discovered that I could work around the problem by (1) passing the stream ARN to PutRecords instead of the name, or (2) inserting a one-second sleep between receiving the “active” status and trying to send the batch.
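
For reference, here's a stripped-down sketch of the behavior (not the actual demo program; the stream name is made up):

import time

import boto3

kinesis = boto3.client("kinesis")
stream_name = "sdk-race-demo"   # made-up name

kinesis.create_stream(StreamName=stream_name, ShardCount=1)

# poll until the stream reports ACTIVE
while True:
    summary = kinesis.describe_stream_summary(StreamName=stream_name)
    if summary["StreamDescriptionSummary"]["StreamStatus"] == "ACTIVE":
        break
    time.sleep(1)

records = [{"Data": b"test", "PartitionKey": "0"}]

# In my testing, this first call could fail with ResourceNotFoundException
# even though the stream was just reported as ACTIVE ...
try:
    kinesis.put_records(StreamName=stream_name, Records=records)
except kinesis.exceptions.ResourceNotFoundException:
    # ... while an immediate retry succeeded. Passing StreamARN instead of
    # StreamName, or sleeping a second after ACTIVE, avoided the error.
    kinesis.put_records(StreamName=stream_name, Records=records)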

As far as the log-writer itself is concerned, the error isn't a problem: it retries the send. That's why the primary assertions succeed: all of the messages are in fact written. So my decision is whether to wait for AWS to fix the problem or change my tests to work around it. Or change the library to use the ARN, on the assumption that that's the “preferred” argument for the PutRecords call.


* In case you're wondering about the title, you can read about my last experience here.

** A Python program that displays the behavior is here.

Thursday, July 13, 2023

Buggy Behavior From The CloudWatch Logs API

Tip 26: “select” isn't broken.
The OS is probably not broken. And the database is probably just fine.

The Pragmatic Programmer

Words to live by — although I did once find a problem with Oracle's index management, which manifested in select. In my opinion, it's the maxim that defines a senior developer. Junior developers always look outside themselves whenever they run into a problem. Senior developers do everything they can to find where they've made a mistake. After all, there are a lot of people doing exactly the same things that you are; surely they would have found the bug first.

But sometimes, it really is a bug in the system. This post is a story of how I determined the bug wasn't in my code. It's part catharsis, part “here's what you should do if you think select is broken,” and part calling out the issue for posterity in case someone else trips over it.

In January of this year (2023), AWS made a change in the CloudWatch Logs API: you no longer needed to provide a sequence token when calling PutLogEvents. Sequence tokens were always, I think, a “leaky abstraction”: a part of the internal implementation of CloudWatch Logs that found its way into the API. To summarize: you had to retrieve a sequence token before you could write events. When you wrote a batch of events, you received a sequence token back, which you could use to write the next batch — unless another process was writing to the same stream, in which case your sequence token would become invalid. Which meant that you had to retrieve a new token, using an API that had a relatively low limit on number of requests per second. With enough concurrent writers, hilarity would ensue as each would retrieve a token, try to use it, and fail because some other writer got there first.
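
For those who never had to deal with it, the dance looked roughly like this (a boto3 sketch, using made-up group and stream names):

import time

import boto3

logs = boto3.client("logs")
group, stream = "example-group", "example-stream"   # made-up names

def put_batch(events):
    # 1. look up the current sequence token for the stream
    streams = logs.describe_log_streams(
        logGroupName=group, logStreamNamePrefix=stream)
    token = streams["logStreams"][0].get("uploadSequenceToken")

    kwargs = dict(logGroupName=group, logStreamName=stream, logEvents=events)
    if token:
        kwargs["sequenceToken"] = token
    try:
        # 2. write the batch; the response carries the token for the next batch
        return logs.put_log_events(**kwargs)["nextSequenceToken"]
    except logs.exceptions.InvalidSequenceTokenException:
        # 3. another writer got there first: fetch a fresh token and try again
        return put_batch(events)

put_batch([{"timestamp": int(time.time() * 1000), "message": "hello"}])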

This change directly impacted my logging library, but initially I wasn't planning to adopt it: the library worked, the old behavior was supported, and to be honest, releasing the library involves a lot of toil. But then someone found a bug with my sequence-token retrieval code if the logging library happened to re-initialize itself (Log4J2 FTW!). I fixed the immediate bug, and released a patch, but then decided it was a sign from the universe that I should remove the sequence token entirely. Plus, it always feels good to delete code!

However, once I changed the code, one of my integration tests — the one that pits concurrent writers against each other — started failing. When it tried to verify the events that were written, it would often come up short.

Failing integration tests are nothing new to this library: CloudWatch is an “eventually consistent” system, and while “eventually” often means seconds, it sometimes takes longer for the service to become consistent. As a result, failing tests don't clean up after themselves, so that I can look at what happened. I could then use my log reader utility, and verify that all of the events were eventually available.
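
The reader is essentially a loop around GetLogEvents; a rough sketch of that pattern (not the actual utility):

import boto3

def read_all_events(group, stream):
    """Page through GetLogEvents until the forward token stops changing."""
    logs = boto3.client("logs")
    events, token = [], None
    while True:
        kwargs = dict(logGroupName=group, logStreamName=stream,
                      startFromHead=True)
        if token:
            kwargs["nextToken"] = token
        resp = logs.get_log_events(**kwargs)
        events += resp["events"]
        # GetLogEvents signals "no more data" by returning the same token
        if resp["nextForwardToken"] == token:
            return events
        token = resp["nextForwardToken"]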

This time, I didn't get all of the events back. Of course, I assumed that my reader was incorrect, and dove back into the documentation for the API calls it was using. I experimented with different arguments to the GetLogEvents call, and also tried FilterLogEvents. But nothing I did had an effect: I was still missing log entries.

Next step was to dig into the logging library. I already had it fairly well instrumented: if you enable debug mode in the logging framework it will report everything that it does, including (optionally) every batch that it sends. Everything looked good, but there was a corner case that worried me: CloudWatch can reject events, and unlike Kinesis, it doesn't tell you which events it rejects. The appender pre-processed batches to remove invalid events, but didn't log failures (because there shouldn't be any). I changed that code, but no failures appeared.

At this point I didn't think the bug was in my appender, but there was always the possibility that I was doing something incorrectly that I didn't log. My next step for debugging the appender would have been to turn up the logging level within the AWS SDK, to get a detailed report of what it was doing. However, this can become a case of drowning in detail, so first I looked at alternative methods of verifying the events.

In addition to the GetLogEvents and FilterLogEvents API calls, CloudWatch provides a query API. This API requires you to submit a query and then loop until it's done, so I didn't have a utility program that used it. But it's what the Console uses in the Insights page.
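
The polling loop itself is simple enough; a sketch with boto3 (group and stream names are made up, and the query just counts events):

import time

import boto3

logs = boto3.client("logs")

query_id = logs.start_query(
    logGroupName="AppenderExample",
    startTime=0,
    endTime=int(time.time()),
    queryString="filter @logStream = 'concurrency-test' | stats count(*)",
)["queryId"]

# Insights queries are asynchronous: poll until the status leaves "Running"
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(1)

print(result["results"])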

So I wrote a simple query to retrieve events from the stream; I was actually more interested in the reported count, rather than looking at all 20,000 expected events. But Insights told me that there weren't any events in the stream.

At this point, I opened a ticket with AWS Support (fortunately, I have access to an account with a Business support contract, since “Basic” support no longer allows technical questions). I received an answer that “as designed,” Insights does not return any messages with timestamps prior to the log group's creation time. WTF?!?

The back-story is that my Log4J 1.x appender lazily creates the log group and stream when it first writes a batch of messages, because Log4J 1.x does not have an explicit initialization phase. This is definitely something to be aware of if you use my library with Log4J 1.x (and, overwhelmingly, that's what my download counts indicate people are doing). The solution is to pre-create your log groups, so I modified the test to do so. This is not an issue with Log4J 2.x and Logback, which do have an initialization phase.

But the test continued to fail. However, I could view the logs in Insights, and saw that it was able to retrieve all of the events (yet another approach, which also succeeded, is export to S3). And more interesting, it was the initial few log messages that weren't being retrieved by GetLogEvents. I also called the DescribeLogStreams API, and it returned a later “first event timestamp” than expected, by a few milliseconds.
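
That last check is a one-liner with boto3 (names are made up; firstEventTimestamp is epoch milliseconds):

import boto3

logs = boto3.client("logs")
streams = logs.describe_log_streams(
    logGroupName="AppenderExample",
    logStreamNamePrefix="concurrency-test")
print(streams["logStreams"][0]["firstEventTimestamp"])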

This seemed worth another Support ticket: if I could retrieve all log events using one technique, and not using another, then clearly it was a problem with CloudWatch and not me.

And now comes the most important lesson from this entire adventure: do not assume that the AWS Support rep can see what you're describing, and more important, do not expect them to tell you that they can't.

I had created a log group/stream that demonstrated the problem, and referenced it in the ticket. I also attached the S3 export, the list of events from my log reader, and the initial events from an Insights query. The support rep seemed to agree that there was a problem. Then for a week they wrote nightly updates about how they were trying to re-upload the events and did not see a problem. They seemed to be trying random things, so I recommended building my library and running the integration test — at this point, my assumption was that AWS accepted that there was a problem based on my example group/stream, and were simply trying to re-create it at will.

It turned out this was an incorrect assumption, which I learned when the original rep went dark and a new rep picked up the case. I had independently created a program that triggered the issue, so that they wouldn't have to run the integration tests. The new rep claimed to have run the program, did not see the issue, and pointed out that AWS did not have permission to view the original log group and stream that I had identified in the ticket!

Seeing that the “web” interactions were not having any effect, I decided to try a chat session with a screenshare. I showed the rep the original group and stream, and how different retrieval methods returned different results. At the time it didn't seem to have much effect: he made sounds of acknowledgement, but was unable to assume the role I created to let him read events, and would go silent for minutes at a time while he was “try[ing] something.” All-in-all, not a great experience from my side.

But it did seem to have an effect: the next morning I got a message from someone who claimed to be on the CloudWatch team, saying that they were able to re-create the problem. And that's where it stands today. For my part, the updates to my library are on hold until I hear back: while this may be superstition on my part, it seems to work better if I provide a sequence token (I suspect it's actually the delay from retrieving the token).

And if you run into a problem with GetLogEvents not returning all of your events, feel free to reference case #12882575541 if you file a Support ticket.

Update, 2023-08-22

AWS has resolved the problem: "We investigated and found that there was an issue with the service which in rare situations can incorrectly set the "firstEventTimestamp" on a log-stream [...] Our team has now fixed the issue with GetLogEvents not returning some events due to incorrectly set "firstEventTimestamp"." I have verified that my integration tests run (repeatedly) without a failure. It took just under three months (05/23 to 08/18) to resolve the case, two months (06/15) from the point that someone took it seriously. For a company the size of AWS, and a service with the reach of CloudWatch Logs, that seems pretty fast; I've had issues drag on far longer with in-house IT teams.

Tuesday, January 3, 2023

Using EventBridge Pipes to route CloudWatch Logs events to Kinesis

Several years ago I wrote about connecting CloudWatch Logs into a logging pipeline built on Kinesis Streams and Kinesis Firehose. To make that work, I wrote a Lambda.

AWS announced EventBridge Pipes at re:Invent 2022, as an easy way to connect messaging services. Connecting CloudWatch Logs to Kinesis seemed like a good way to kick the tires on this service. It still requires a custom Lambda, but that Lambda is now much simpler, because it doesn't need to concern itself with the quirks of PutRecords.

The Problem

When you subscribe a Kinesis stream to a log group, CloudWatch writes messages that look like this:

H4sIAAAAAAAAA62QTU7DMBCF95zC8jqO7LHd2NkFNVRCdNWyggqF1gqG/MlxW6qqd2cK7QWA5bw382a+OdLWjWNVu+VhcDSn02JZvMzLxaKYlTSh/b5zAWUBUulJZiwXgHLT17PQbwd0imFw3caF8rNqh8b9mIsYXNWi+9DX6l6wi8mAAwgAxXx88413HZNaWDHBqXH7Oq6DH6LvuzvfRBdGmj/Ra+zqO7fcuS6e9SP1G4yXGRgltdLcGK41SKu0AQ2KC55hrYTNsJCWc5CCW5CgteEG90WP4BHDaS4mmTDGAOKBTa4PwfjzuUwAA0W4yUHl0iTYQqbl7eOMPLkLFV+Rdd+mH5s6uLoPhxQvVe9ptR/TS0s6r3xHGNlVzdYRPxINzx09JX/DsP+LIX6LsTrdfAHRwAP2RwIAAA==

That's a GZipped and Base64-encoded JSON object. When decoded, unzipped, and pretty-printed, it looks like this:

{
  "messageType": "DATA_MESSAGE",
  "owner": "123456789012",
  "logGroup": "AppenderExample",
  "logStream": "Log4J1-Example-20221224-ithilien-351916",
  "subscriptionFilters": [
    "Example"
  ],
  "logEvents": [
    {
      "id": "37284354508805523945825240107552419724039002310923255808",
      "timestamp": 1671888278929,
      "message": "2022-12-24 08:24:38,929 DEBUG [example-0] com.kdgregory.log4j.aws.example.Main - value is 52\n"
    },
    {
      "id": "37284354508805523945825240107552419724039002310923255809",
      "timestamp": 1671888278929,
      "message": "2022-12-24 08:24:38,929 DEBUG [example-1] com.kdgregory.log4j.aws.example.Main - value is 52\n"
    }
  ]
}

As you can see, CloudWatch combines multiple log events into a single message. However, most destinations — such as Elasticsearch — want individual events, not a nested array. So you need something that pulls apart the original event. For example, the AWS-provided OpenSearch integration, which subscribes a Lambda to the log group, decomposes the messages, and writes them to the cluster. My previous implementation did a similar transformation, but wrote the log messages to a Kinesis stream.

EventBridge Pipes

EventBridge Pipes is modeled around a four-step process (image sourced from docs):

[Image: EventBridge Pipes data flow]

For simple pipelines, EventBridge gives you a “no code” solution that eliminates the middle two steps: you define an output format using path references to fields in the input messages. Unfortunately, this isn't usable for CloudWatch Logs messages: it can't (currently) handle Gzipped data, and can't extract the elements in an array. So we need to add an “enrichment” step that performs this operation.

Implementation

You can implement the enrichment step in multiple ways, including an HTTP(S) API, but the easiest is a Lambda. Pipes invokes this Lambda with an event that contains one or more source records, and it returns zero or more result records.

The source event looks like this:

[
  {
    "eventSource": "aws:kinesis",
    "eventVersion": "1.0",
    "eventID": "shardId-000000000000:49636526735590005452916707570422271019264002608211165186",
    "eventName": "aws:kinesis:record",
    "invokeIdentityArn": "arn:aws:iam::123456789012:role/CloudWatchTransformPipeline-PipeRole-us-east-1",
    "awsRegion": "us-east-1",
    "eventSourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/log-subscription",
    "kinesisSchemaVersion": "1.0",
    "partitionKey": "87362ce96c65eb1b3125bf0129b99bee",
    "sequenceNumber": "49636526735590005452916707570422271019264002608211165186",
    "data": "...",
    "approximateArrivalTimestamp": 1672490859.613
  }
]

The first thing to call out is that this event is an array. EventBridge Pipes can batch messages together, so you must loop over the entire event and combine the results from each message (ie, a flat-map):

def lambda_handler(event, context):
    result = []
    for message in event:
        result += transform_messages(message)
    return result

The data element is the GZipped and Base64-encoded message that I showed above. You process it with code that looks like this:

data = base64.b64decode(message['data'])
data = gzip.decompress(data)
payload = json.loads(data)

The last thing to be aware of is that there are two types of messages that you'll get from a CloudWatch Logs subscription, identified by the messageType field. When the subscription first starts, CloudWatch sends a control message to the stream to verify that it can write messages. After that, it sends data messages. Your Lambda should ignore everything but the latter.

if payload['messageType'] == 'DATA_MESSAGE':
    # extract log events
else:
    return []
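
Putting the pieces together, the extraction itself looks something like the following. This is a minimal sketch, not the code from my example repository, and the output field names are my own choice:

import base64
import gzip
import json

def transform_messages(message):
    data = base64.b64decode(message['data'])
    data = gzip.decompress(data)
    payload = json.loads(data)

    if payload['messageType'] != 'DATA_MESSAGE':
        return []

    # one output record per log event; strip() removes the trailing newline
    # that the logging framework appends to each message
    return [
        {
            'logGroup': payload['logGroup'],
            'logStream': payload['logStream'],
            'id': event['id'],
            'timestamp': event['timestamp'],
            'message': event['message'].strip(),
        }
        for event in payload['logEvents']
    ]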

Conclusion

This post has shown you the basics of an EventBridge Pipes transform. You can find my complete example, which performs several transforms on the log events, here.

While I don't think that EventBridge Pipes quite lives up to the hype of Werner Vogels' keynote (and that it has nothing in common with EventBridge other than a name), it does allow you to focus on your logic rather than the quirks of your source and destination. As such, it's a useful addition to your data transformation toolbox.

Sunday, January 1, 2023

Does that i4i.metal instance really help your build times?

Here in Philadelphia, the last day of 2022 was rainy and foggy; a day to stay inside. That, combined with $100 in AWS credits that were due to expire with the new year, made me decide to add another entry to my series of posts evaluating build performance. This time, I'm comparing my desktop PC to the current AWS king of the hill, an i4i.metal instance.

If you don't make a habit of reading EC2 instance specs, the i4i.metal has 128 virtual CPUs (32 actual cores), 1 terabyte of RAM, 30 terabytes of locally-attached SSD storage spread over 8 drives, and 75,000 megabits of network bandwidth. It costs $10.982 per hour ($96,268.212 per year) on-demand in us-east-1. There are more expensive EC2 instance types; the p3dn.24xlarge costs $31.212 per hour, but has fewer vCPUs, less memory, and less disk.

To see if the local SSD offered a performance improvement over EBS, I configured a 1 terabyte gp3 drive as the machine's root volume, and formatted one of the attached SSDs as ext4. I created a separate user, whose home directory was on the SSD, so all of the reads and writes for the builds should have stayed there.

As a comparison, I have my ten-year-old desktop: an i7-3770k (not overclocked) with 32 GB of RAM and a 512 GB Samsung 860 Pro SSD. It cost around $750 to build (no graphics card). I also spun up an m5.large EC2 instance, for consistency with previous tests.

Both of the AWS instances were running Amazon Linux 2; my devbox runs Xubuntu 20.04. All machines ran OpenJDK 1.8, although the specific builds varied (it's whatever the package manager installed). And I manually installed Maven 3.5.4 on all of the machines (it's what I'm running at home).

For a test workload, I built the AWS V2 Java SDK (commit 0f7779ef, corresponding to version 2.19.9-SNAPSHOT). This is a much bigger build than most people will ever deal with: almost 400 sub-projects, and over 100,000 Java files (including tests and generated code). I skipped tests, because it was clear that some of the tests involved timing delays; I wanted to see pure CPU/disk performance.

I ran three builds with each configuration. The first was a “clean” build, forcing Maven to download all dependencies. The second was with empty buffer cache (echo 3 > /proc/sys/vm/drop_caches), and the third, run immediately afterward, should have benefited from any files that were in the cache. As it turned out, the numbers were so close that it didn't make sense to report them separately; all numbers in the table below are from the “empty cache” run (ie, no downloads involved). As with previous tests, I show the output from the Linux time command; the “real” column is the elapsed time of the build, and is what developers care about.

                   Real          User          Sys
  My DevBox        27m 24.406s   80m 3.772s    1m 16.792s
  m5.large         54m 40.495s   94m 1.207s    1m 46.047s
  i4i.metal EBS    17m 49.500s   97m 30.694s   2m 19.089s
  i4i.metal SSD    17m 44.165s   97m 6.427s    2m 11.365s

I have to be honest, I was surprised: I expected a small improvement in performance, but not 35%! My guess at the reason for the difference, based on the higher “user” time, is that it's able to make more effective use of those 32 cores. The 55 MB CPU cache (versus 8MB on my PC) may have helped as well. I think the only way to summarize this test is with the words of Ferris Bueller:

It is so choice. If you have the means, I highly recommend picking one up.

Wednesday, December 14, 2022

Deprecating log4j-aws-appenders

I just made what I hope is the last release of my log4j-aws-appenders library. It started almost six years ago as a "weekend project" to produce a Log4J 1.x appender that wrote output to CloudWatch Logs. It has expanded over the years: I added Kinesis Streams and SNS as destinations, added support for Log4J 2.x and Logback, added support for the AWS 2.x Java SDK, and maintained the code as something I could be proud of.

And other people found it useful: it's received a handful of GitHub stars, and sees around 30,000 downloads a month from Maven Central (overwhelmingly for Log4J 1.x!).

But I think it's outlived its usefulness, and there are better options.

I'll start with an explanation of why I wrote it. At the time, I was working on a project that had around a hundred EC2 virtual machines running pre-baked AMIs, and using the AWS-provided logging agent to write to CloudWatch. The pre-baked AMI meant that everything went to the same log stream: we didn't configure separate streams for separate instances. The agent wasn't (at that time) very good at recognizing the boundaries between log messages, and the multi-threaded nature of our servers meant that messages would be interleaved anyway. Throw in CloudWatch's limited ability to filter messages, and the result was a mess.

As a result, I had two explicit goals for the first release: ensure that a Java LogEvent was written as a single CloudWatch Logs event (including stack traces), and allow different instances to write to different log streams. And this was better, but not great, because it still meant interleaved messages from different threads. So I implemented the Kinesis appender, to send log messages to Elasticsearch. I don't think my boss liked the monthly cost of a six-node Elasticsearch cluster, but the other developers liked the ability to quickly isolate a single web request and all of its related operations.

So why deprecate it? The answer is that the AWS world has moved forward in the past six years. CloudWatch is the default logging destination for Lambda and ECS, and the CloudWatch agent for EC2 has gotten much better. And Insights has made searching CloudWatch much easier, especially if you write log messages in JSON.

So rather than swim against the flow, I recommend that you use CloudWatch Logs rather than my library. For Lambda (and ECS or EKS with the awslogs driver), just write to Standard Error; it will be picked up by the runtime and sent to CloudWatch. For EC2, use the logging agent, configured to parse individual events. If you want to use Elasticsearch, you can use a subscription to get the data out of CloudWatch Logs (something that's become easier with EventBridge Pipes, the topic of an upcoming post). Or you can use an HTTP appender (available in all the major logging frameworks), and wire up your own pipeline using API Gateway's Kinesis integration (something else that wasn't available six years ago).

With all of these options, my library is superfluous. And while I think it's mature, stable, and bug-free, using it exposes your project to the risk of having a single volunteer maintainer responsible for a key part of your infrastructure.

Wednesday, July 14, 2021

My take on "How to re:Invent"

It's back. AWS is taking over Las Vegas for a week filled with information, sales pitches, and corporate-friendly activities. And while COVID and the possibility of a “fourth wave” hang over the conference, I decided to sign up. Having been to re:Invent once, I now consider myself an expert on how to survive the week. Here are five of my suggestions.

#1: Wear comfortable shoes.
OK, everybody says this, but it bears repeating: you're going to do a lot of walking. It might take ten minutes to navigate from the front door of a hotel to the meeting rooms, following a labyrinthine path through the casino. To give you some numbers: my watch recorded 87,000 steps, or 44.6 miles, over the five days of the conference. That may be higher than average: I often walked between venues rather than find my way to the shuttle buses. But even if you “only” walk 30 miles, you'll still be thankful for doing it in a pair of running shoes.
#2: Stay in a “venue” hotel.
These are the hotels that host sessions and other sponsored content, as opposed to the “sleeping” hotels that just have rooms for attendees. There are several reasons to stay at a venue hotel, but in my opinion the most important is that it cuts down on the amount of walking that you have to do. Of my 87,000 steps, I estimate that 10,000 or more were taken up in walking from my room at the Park MGM to the Aria so that I could pick up a shuttle bus.
#3: Attend workshops, not sessions.
There are some great sessions at re:Invent, conducted by people who are intimately familiar with the service. If you have specific questions it's worth attending one of the “deep dives” and then walking up to the speaker afterward to ask those questions.

But, all of these sessions will be recorded, and you can watch them at your leisure. So if you don't have specific questions there's no reason to attend in-person. What you can't do after December 3rd is learn with an AWS instructor by your side (well, not for free anyway). Unfortunately, space for these workshops is limited, so sign up early for the ones you want (that said, scheduling at re:Invent is extremely fluid; at least half the sessions I attended were marked as full but then had spots open up an hour before they started).

#4: Fly out on Saturday.
If you're not from the United States, you may not realize that re:Invent takes place during the week after the Thanksgiving holiday. Thanksgiving is a time when people return home to visit family and friends, and then all of them get on a plane the following Sunday to return home. It's historically the busiest travel day of the year, and US airports are crowded and frantic with people who only fly on that weekend. Plus, airlines charge the highest rates of the year, because they know people will pay. Even if you have TSA/Pre, it's not fun.

If you're willing to fly out a day early, you avoid the crowds. Plus, you can save significantly on the airfare (right now, it would save me over $300, or nearly 50%). Against that, you'll be paying for an extra night in Vegas. For me, with the conference rate for the hotel, the numbers worked.

#5: Get out of Vegas.
For some people, Vegas is a destination: they love the lights, the noise, and the constant activity. For me, it's overwhelming. Fortunately, you can find thousands of square miles of absolute desolation just outside city limits.

Last time, I rented a motorcycle for a day and explored nearby attractions: Valley of Fire, Hoover Dam, and Red Rock Canyon. This year, I'm planning to take three days and explore southern Utah and northern Arizona. If you're not a motorcyclist, Vegas also has plenty of rental cars, including exotics. And at the far end of the scale, you can spend a day in a high-performance driving class at the Las Vegas Motor Speedway.

Well, that's it. Now it's time to cross my fingers and hope the US COVID situation remains under control.

Tuesday, June 8, 2021

An open letter to the AWS Training organization

You don't have a Feedback link on your site, but it seems that Amazon keeps close tabs on the Blogosphere, so hopefully this reaches you.

I don't know whether you're an actual sub-division of Amazon, but the website URL https://www.aws.training certainly didn't give me a warm fuzzy feeling when it came up in Google. In fact, my first thought was that it was some unaffiliated company that had better SEO.

So, since it was asking me for login credentials, I did what any reasonably cautious technologist would do, and ran whois. And this is what I got back:

Domain Name: aws.training
Registry Domain ID: 8d519b3def254d2f980a08f62416a5b9-DONUTS
Registrar WHOIS Server: whois.comlaude.com
Registrar URL: http://www.comlaude.com
Updated Date: 2019-05-19T19:54:24Z
Creation Date: 2014-03-19T00:32:11Z
Registry Expiry Date: 2024-03-19T00:32:11Z
Registrar: Nom-iq Ltd. dba COM LAUDE
Registrar IANA ID: 470
Registrar Abuse Contact Email: abuse@comlaude.com
Registrar Abuse Contact Phone: +44.2074218250
Registrant Name: REDACTED FOR PRIVACY
Registrant Organization: Amazon Technologies, Inc.
Registrant Street: REDACTED FOR PRIVACY
Registrant City: REDACTED FOR PRIVACY
Registrant State/Province: NV
Registrant Postal Code: REDACTED FOR PRIVACY
Registrant Country: US
Registrant Phone: REDACTED FOR PRIVACY
Registrant Phone Ext: REDACTED FOR PRIVACY
Registrant Fax: REDACTED FOR PRIVACY
Registrant Fax Ext: REDACTED FOR PRIVACY
Registrant Email: Please query the RDDS service of the Registrar of Record identified in this output for information on how to contact the Registrant, Admin, or Tech contact of the queried domain name.

That's the sort of whois entry that you get for an individual using a shared hosting service. In fact, it provides less information than you'll see with my domain, which runs on a shared hosting service, and I pay extra for privacy.

By comparison, the whois entry for Amazon itself looks like this (and note that it's a different registrar, another red flag):

Domain Name: amazon.com
Registry Domain ID: 281209_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.markmonitor.com
Registrar URL: http://www.markmonitor.com
Updated Date: 2019-08-26T12:19:56-0700
Creation Date: 1994-10-31T21:00:00-0800
Registrar Registration Expiration Date: 2024-10-30T00:00:00-0700
Registrar: MarkMonitor, Inc.
Registrar IANA ID: 292
Registrar Abuse Contact Email: abusecomplaints@markmonitor.com
Registrar Abuse Contact Phone: +1.2083895770
Registrant Name: Hostmaster, Amazon Legal Dept.
Registrant Organization: Amazon Technologies, Inc.
Registrant Street: P.O. Box 8102
Registrant City: Reno
Registrant State/Province: NV
Registrant Postal Code: 89507
Registrant Country: US
Registrant Phone: +1.2062664064
Registrant Phone Ext: 
Registrant Fax: +1.2062667010
Registrant Fax Ext: 
Registrant Email: hostmaster@amazon.com

While I'm a little surprised by the Reno address, rather than Seattle, this at least looks like the sort of registration information used by a business rather than somebody who pays $10/month for hosting.

I ended up getting to the training site via a link on the AWS Console, so was able to achieve my goal.

But I think there's a general lesson: don't forsake your brand without good reason.

And at the very least, ask your network administrators to update your whois data.

Tuesday, February 16, 2021

EFS Build Performance Revisited

A few years ago I wrote a post calling out the poor performance of Amazon's Elastic File System (EFS) when used as the working directory for a software build. Since then, EFS has seen many performance improvements. Is it now viable for purposes such as developer home directories or build machines?

TL;DR: no.

As before, I'm using an m5d.xlarge EC2 instance running AWS Linux 2 as my testbed (for the record, ami-03c5cc3d1425c6d34 — you'll see later why I want to remember this). It provides four virtual CPUs and 16GB of RAM, so hardware should not be an issue. My test builds are the AWS SDK and my logging appenders project (releasing the latter is why I spun up the instance in the first place). The appenders project is larger than it was last time, but is still a reasonable “small” project. For consistency, I'm using the same tag (1.11.394) for the AWS SDK; it's grown dramatically in the interim.

I've configured the build machine with three users, each of which has their home directory in one of the tested storage types (instance store, EBS, EFS). The EBS test uses a 100 GB gp2 volume that is dedicated to the build user. For the EFS test I created two volumes, to compare the different EFS performance modes.

For each build I took the following steps:

  1. Copy the project from a "reference" user. This user has project directories without the .git directory, along with a fully-populated Maven local repository.
  2. Perform a test build. This is intended to ensure that all dependencies have been downloaded, and that there is nothing that would cause the build to fail.
  3. Run mvn clean in the test directory.
  4. Flush the disk cache (sync).
  5. For instance store, run TRIM (fstrim -v /) to avoid the penalty of SSD write amplification.
  6. Clear the in-memory buffer cache (echo 3 > /proc/sys/vm/drop_caches)
  7. Run the timed build (time mvn compile).

And here's the results. As before, I show the output from the time command: the first number is the "real" time (the wall-clock time it took to build). The second is "user" (CPU) time, while the third is "system" (kernel operation) time. All times are minutes:seconds, and are rounded to the nearest second.

                         Appenders                  AWS SDK
                         Real    User    System     Real    User    System
  Instance Store         00:05   00:13   00:00      01:14   02:16   00:06
  EBS                    00:06   00:15   00:00      01:26   02:13   00:06
  EFS General Purpose    00:23   00:20   00:01      15:29   02:22   00:15
  EFS Max IO             00:55   00:18   00:01      36:24   02:28   00:15

Comparing these timings to my previous run, the first thing that jumped out at me was how wildly different reported “user” time is. In fact, they are so different that my first step was to fire up an EC2 instance using the same AMI as the previous test (thankfully, AWS doesn't delete anything), and confirm the numbers (and yes, they were consistent). Intuitively, it should take the same amount of CPU time to compile a project, regardless of the performance of the disk storage, so I'm not sure why I didn't do more digging when I saw the original numbers. Regardless, “real” time tells the story.

And that story is that EFS still takes significantly longer than other options.

There have been definite performance improvements: the “general purpose” EFS volume takes 15 minutes, versus the 30+ required by the earlier test (the close correspondence between the earlier test and the “Max IO” volume type makes me think that it might be the same implementation).

But if you're speccing a build machine — or anything else that needs to work with large numbers of relatively small files — EFS remains a poor choice.

Monday, February 1, 2021

Why I Won't Use The AWS V2 Java SDK

Update, 2024-01-24

Well, I was wrong: AWS is end-of-lifing the v1 SDK. They will stop adding new features in July of this year, and stop providing updates in December 2025. They just announced this plan, so I'm still waiting to see if they'll back-pedal under pressure from large customers. But until that happens, ignore my advice below and start converting your code.

That said, feel free to read my venting from three years ago ...

Amazon's "version 2" SDK for Java was first released three and a half years ago. As of this writing, the current version is 2.15.73. So it's time to upgrade your Java AWS projects, right?

To which my answer is a resounding “NO!” In fact, if you are starting a new Java AWS project, I recommend that you stick with the version 1 SDK unless you have a compelling reason to change. Especially if you already have code that uses version 1.

I base these statements on my experience updating my logging library. While this library doesn't use a lot of AWS services, it goes deep on the ones that it does use, especially client configuration. For additional context, I've been using the version 1 Java SDK since it came out, I currently use the Python SDK very heavily, and I've also spent some time with the JavaScript and Ruby SDKs. Plus, I implemented my own S3 SDK before the Java SDK was available. In other words, I'm not just someone who spent an hour trying to work with the library, got fed up, and is now venting.

And before I do start venting, I want to call out the things that I do like about the new library:

Consistent Naming
This is big. It's not just a matter of getting rid of the superfluous “Amazon” or “AWS” prefix, but of consistent categorization. For example, CloudWatch Logs: that's its name, you find its documentation next to other things named CloudWatch, and you look at your logs by going to the CloudWatch page in the console. Yet in the v1 SDK it's called AWSLogs, and the JAR is aws-java-sdk-logs. A small thing, but small things are what enhance or detract from developer productivity.
Paginators
Paginated requests are a Good Thing: not only do they reduce the load on Amazon's servers, they avoid blowing up client programs (e.g., downloading a gigabyte of data due to a poorly configured FilterLogEvents call). But dealing with pagination is a developer tax: the same dozen or so lines of code every time you make a paginated request (in fact, that was one of the primary drivers of my been-on-hold-because-I-mostly-work-with-Python-these-days AWS utility library). It's a lot easier to request a paginator and then loop through the results or process them with a Java 8 stream, as in the sketch below. I especially like that Java paginators don't force you to deal with pages at all, unlike the Python paginators.
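
A minimal example, using the CloudWatch Logs client (class and method names are from my reading of the v2 docs, so verify them before copying):

CloudWatchLogsClient client = CloudWatchLogsClient.create();

// the paginator issues follow-up requests behind the scenes as you iterate
client.describeLogGroupsPaginator()
      .logGroups()
      .forEach(group -> System.out.println(group.logGroupName()));

Compare that to the explicit loop over nextToken that you'd write with the v1 client.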

And … that's about it. Now onto the things that I don't like:

Wholesale name changes of getters and setters
I understand why they did it: everybody hates Java's “bean” naming conventions. The get, set, and with prefixes are just visual clutter. All of the other SDKs use “property” naming, so why not make the Java SDK consistent? But the result is that upgrading an existing project means hours — perhaps days — of tedious manual effort to get rid of those prefixes. And at the same time, you'll be converting all of your code to use builders rather than constructors.
Just when you get used to the new conventions, you find a client that doesn't follow them
In my case, it was discovering that KinesisClient didn't support paginators, so I found myself re-implementing that dozen lines of code to iterate the list of streams. I also discovered that IamClient doesn't provide create(), unlike the other clients, but that brings up a bigger issue.
The documentation is incomplete and at times misleading
IamClient doesn't implement the create() method. That seems like an oversight, until you use IamClient.builder().build() and get an UnknownHostException with a long and confusing chain of causes. To get a usable client, you must add .region(Region.AWS_GLOBAL) to the builder invocation.

There is no hint of this in the IamClient documentation. The only AWS documentation that has anything to say on the topic is the Migration Guide, in which you'll see “Some AWS services don't have Region specific endpoints.” But no list of the classes affected, or examples of client configuration code. Oh, and sorry to anybody in China who might want to use my library: you've got a different region name.
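
For reference, this is the client construction that finally worked for me; a minimal sketch, with the global pseudo-region being the non-obvious part:

// IAM has no regional endpoints, so the client must be pointed at the AWS_GLOBAL pseudo-region
IamClient iam = IamClient.builder()
                         .region(Region.AWS_GLOBAL)
                         .build();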

That covers “incomplete.” For “misleading,” take a look at StsAssumeRoleCredentialsProvider. The version 1 variant of this class requires an STS client, which it uses to periodically refresh the credentials. For version 2, there's no indication that you can provide one — moreover, the JavaDoc claims that the builder class extends Object, which should mean that the documented methods are all that exist. You have to look at the source code to see that it actually extends a BaseBuilder class (and has since the developer preview was released).
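
Once you've read the source, the builder usage looks roughly like the sketch below (method names taken from that BaseBuilder class, and the role ARN is a placeholder):

StsClient stsClient = StsClient.create();

AssumeRoleRequest request = AssumeRoleRequest.builder()
                                             .roleArn("arn:aws:iam::123456789012:role/Example")  // placeholder ARN
                                             .roleSessionName("example-session")
                                             .build();

// the provider uses the STS client to periodically re-assume the role and refresh credentials
StsAssumeRoleCredentialsProvider provider = StsAssumeRoleCredentialsProvider.builder()
                                                                            .stsClient(stsClient)
                                                                            .refreshRequest(request)
                                                                            .build();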

I think that the core problem in both cases is that perhaps 99.9% of everything in the SDK is generated from one central API specification. This is really the only way that you can manage multiple SDKs for over 100 services, each of which exposes dozens of operations. However, that 0.1% (or less) is what gives each SDK its personality, and it seems that the documentation generators for the v2 Java SDK aren't keeping up with the custom pieces. Fortunately, there's Google and Stack Overflow.

There are still missing capabilities
According to the migration guide, “higher level” libraries like the S3 Transfer Manager aren't available; if you need them, you'll have to use both the V1 and V2 SDKs. I haven't verified this myself, but in the past I've used the Transfer Manager very heavily and have recommended it to others.

So what would be a compelling reason to switch? The only one I can think of is the ability to use HTTP/2 and true asynchronous clients. Both can improve performance when you're making large numbers of concurrent API calls. Of course, there are often better alternatives to making large numbers of API calls, especially calls that run the risk of being throttled (and a bonus complaint about AWS APIs in general: they have many different ways to report throttling).
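
For completeness, here's roughly what an async call looks like; a sketch only, since I haven't measured whether it actually helps for your workload:

CloudWatchLogsAsyncClient client = CloudWatchLogsAsyncClient.create();

// the call returns immediately; the future completes when the response arrives
CompletableFuture<DescribeLogGroupsResponse> future = client.describeLogGroups();

future.thenAccept(response -> response.logGroups()
                                      .forEach(group -> System.out.println(group.logGroupName())));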

One thing that is not compelling is the fear that one day AWS will decide to discontinue the version 1 SDK. First, because AWS never seems to deprecate/remove anything (does anyone still use SimpleDB? how about the Simple Workflow Service?). But more important is the existing code that uses version 1. I don't know how many millions of lines there are, but I suspect that it's large. And I also suspect that most of it belongs to companies that pay Amazon lots of money and would seriously reconsider cloud providers if required to update their codebases.

Monday, December 28, 2020

Diving Into First-Run Times for Lambdas Written in Java

A few weeks ago I answered a Stack Overflow question about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.

The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.

Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to initialize the Lambda runtime, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.

OK, on to Teh Codez:

public void handler(Object ignored, Context context)
{
    long start = System.currentTimeMillis();
    
    AWSLogs client = AWSLogsClientBuilder.defaultClient();
    
    long clientCreated = System.currentTimeMillis();
    
    client.describeLogGroups();
    
    long apiInvoked = System.currentTimeMillis();
    
    System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
    System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));
}

This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before each of these actions I retrieve the current system time. How long can it take?

Setting a Baseline

People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.

If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you then think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.

But observations are always better than guesses. Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes (to get these numbers I created a main() that just invokes the handler function):

time to create SDK client =    744
time to make API call     =    469

That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:

time to create SDK client =    702
time to make API call     =    522
time to create SDK client =     21
time to make API call     =    151

Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux time command:

> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar 
time to create SDK client =    576
time to make API call     =    623

real    0m1.320s
user    0m2.130s
sys     0m0.113s

The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; it's nearly twice the real time, which indicates that the CPU is doing a lot of work. If the program was spending all of its time making network calls, that user time should be less than the real time.

This information alone is enough to make a first pass at optimization.

Memory Allocation

When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a not-always-deserved reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.

However, maximum memory consumption only tells part of the story. A much — much — more important fact is that Lambda provides CPU in proportion to allocated memory. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you'd get for a given memory size, but assuming a linear relationship, 512 MB would be a third of a CPU — a virtual CPU of unspecified performance.

The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.

  512 MB 1024 MB 2048 MB 4096 MB
Create client 5298 2493 1272 1019
Invoke API call 3844 2023 1061 613
Billed duration 9213 4555 2349 1648

If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 was accounted for by my timestamps).

These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say, Python (where initialization takes around 150 milliseconds).

Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:

The most important thing that you can do to improve Lambda performance is increase memory allocation.

This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.

However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.
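
The measurement itself is trivial. Here's a sketch of what I mean by “some code”; the JVM exposes cumulative collection time through its management beans:

// sum the cumulative collection time (milliseconds) across all of the JVM's collectors
long totalGcMillis = 0;
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans())
{
    totalGcMillis += gcBean.getCollectionTime();
}
System.err.format("total GC time             = %6d\n", totalGcMillis);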

Classloading

So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.

The JVM loads classes on-demand. It won't load the AWSLogsClientBuilder class until it executes the line that calls defaultClient(). Loading and initializing a class requires loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.
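
If you want to see this number for your own code (my 3,329 may not match yours), one option is to ask the JVM directly, for example just before the handler returns; a sketch:

ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
System.err.format("classes loaded            = %6d\n", classLoading.getTotalLoadedClassCount());

Running with the -verbose:class JVM option gives the same information, one class per line.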

Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The tricks available to stand-alone Java programs aren't available in Lambda.

However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.

A Different SDK

Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as a “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the SDK docs themselves, via a link that doesn't go anywhere).
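
To test that recommendation, I needed a v2 version of the handler. It's a near-mechanical translation of the code above; the sketch below is a reconstruction rather than a copy of my actual test code:

public void handler(Object ignored, Context context)
{
    long start = System.currentTimeMillis();

    // v2 equivalent of AWSLogsClientBuilder.defaultClient()
    CloudWatchLogsClient client = CloudWatchLogsClient.create();

    long clientCreated = System.currentTimeMillis();

    client.describeLogGroups();

    long apiInvoked = System.currentTimeMillis();

    System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
    System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));
}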

But, as I said before, there's no substitution for observation. Here are the numbers using version 2.15.53:

  512 MB 1024 MB 2048 MB 4096 MB
Create client 4965 2278 1141 959
Invoke API call 4235 2062 1047 661
Billed duration 9237 4357 2204 1637

No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”
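
If you want to run that experiment yourself, the swap looks something like the sketch below, which uses the URLConnection-based client (and requires adding the SDK's url-connection-client artifact to your build):

// replace the default Apache HTTP client with the lighter URLConnection-based one
CloudWatchLogsClient client = CloudWatchLogsClient.builder()
                                                  .httpClient(UrlConnectionHttpClient.builder().build())
                                                  .build();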

For that matter, switching to the v2 SDK requires a significant amount of tedious recoding to change package and method names. And as of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.

Packaging

Earlier this year I wrote an article about the different ways to package a Java Lambda. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't record the performance gained by switching to the Assembly plugin.

Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:

  512 MB 4096 MB
Shade Plugin 9213 1648
Assembly Plugin 8138 1358

So, a decent speedup with a small memory allocation (which I hope you wouldn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this requires little effort.

Provisioned Concurrency

On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, you'll see no change in first-run execution time.

The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.

To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, don't go there). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).
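
Here's what that restructuring might look like for the example Lambda. It's a sketch: the class name is hypothetical, and the “harmless” warm-up call (a describe with a limit of 1) is my choice, not a prescription.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.DescribeLogGroupsRequest;

public class LogGroupLister       // hypothetical class name
{
    private final AWSLogs client;

    public LogGroupLister()
    {
        // runs during initialization, which is what Provisioned Concurrency pays for
        client = AWSLogsClientBuilder.defaultClient();

        // exercise the SDK's request path without changing any data
        client.describeLogGroups(new DescribeLogGroupsRequest().withLimit(1));
    }

    public void handler(Object ignored, Context context)
    {
        client.describeLogGroups();
    }
}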

Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.

Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month in addition to the per-request cost. It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.

Wrapping Up

I first implemented a Lambda web-app with Java nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with an embedded Jetty server.

That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the Hotspot engine will optimize performance.

The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.

Monday, November 9, 2020

Building Modular Inline Policies with Terraform

Modules are the primary mechanism for reusability in Terraform. They're especially useful in cases where you need to stamp out a bunch of related resources multiple times. For example, an SQS queue along with its dead-letter queue and “least privilege” access policies.

The problem with creating a separate managed policy for each module invocation is that role/policy attachments are a limited resource: by default, you're limited to ten. You can increase that by opening a support ticket, but the limit is still low (20 unless you can show special circumstances), and it has to be done in every account where you'll create the role.

Alternatively, you can construct a combined policy in the invoking module as long as the child module outputs the queue ARNs. However, this eliminates much of the benefit of reusability, as you have to explicitly build the combined policy everywhere that you use the queue module. It also means that the person writing the combined policy has to understand the IAM permissions that are appropriate to a queue consumer or producer. At that point, why not just explicitly create your resources?

In this post, I look at a different alternative: the queue module exports the JSON for the producer and consumer statements. The invoking module then combines the statements for each of its queues into a single application policy.

Wait, doesn't Terraform allow you to combine policies?

Not in the way that I want.

The aws_iam_policy_document data source allows you to “layer” policies using its source_json and override_json arguments, but these only accept a single document. Which means that you can build a complex policy document step by step, but not as a single data source. In a real-world configuration, I think that would quickly turn into an unmaintainable mess.

The policy documents guide doesn't give any other alternatives: it recommends the data source or a multi-line “heredoc”. Surprisingly, it doesn't mention using jsonencode() at all.

Creating an SQS queue module

A modern web-application is a complex beast: it often consists of a (relatively) simple API server, with background processes to handle long-running tasks such as sending notifications, rendering images, or generating reports. And linking these together, at least on AWS, are SQS queues.

An SQS queue, although it seems simple, actually has a lot of hidden complexity:

  • Each queue should have a companion dead-letter queue, along with a "redrive policy" that determines how many times a consumer can fail to process a message before it goes to the dead-letter queue.
  • How long does a consumer have to process a message before it's retried (the visibility timeout).
  • How long can a message stay in the queue without being picked up (the retention period, used to prevent queues from growing without bound when there are no/insufficient consumers).
  • You should also have “least-privilege” policies for the queue's producers and consumers.

This is a perfect use for Terraform modules: rather than repeat all of the resource definitions (as you have to do with CloudFormation), you simply create a module and reference it from your main configuration:

module "notifications_queue" {
  source = "./modules/sqs"
  queue_name = "Notifications"
}

module "reports_queue" {
  source = "./modules/sqs"
  queue_name = "ReportGeneration"
  visibility_timeout = 600
  receive_count = 1
}

At the core of this module are the following two resource definitions, for the primary queue and its companion dead-letter queue (every primary should have a DLQ; trust me, you'll want it sooner or later):

resource "aws_sqs_queue" "base_queue" {
  name                        = var.queue_name
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
  redrive_policy              = jsonencode({
                                    "deadLetterTargetArn" = aws_sqs_queue.deadletter_queue.arn,
                                    "maxReceiveCount" = var.retry_count
                                })
}

resource "aws_sqs_queue" "deadletter_queue" {
  name                        = "${var.queue_name}-DLQ"
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
}

As you can see, all configuration is via module variables. Rather than show the entire variables.tf file, I'll use just two to give you a sense of my philosophy. The first, of course, is queue_name: this is the name of the queue, and can't be defaulted. The second, retention_period, is something that most consumers don't care about, but some might. So I picked a valid default (in my experience most messages have no value after 24 hours), and the invoking module can override if needed.

variable "queue_name" {
  description = "The name of the queue. Used as a prefix for related resource names."
  type = string
}

variable "retention_period" {
  description = "Time (in seconds) that messages will remain in queue before being purged"
  type = number
  default = 86400
}

So far, nothing unusual. But what I'm not showing (yet) are the policy documents that let your applications read and write the queue.

If you were just building a proof-of-concept, you might consider using the pre-existing AmazonSQSFullAccess managed policy. However, that policy grants all operations on all queues. Not just read and write, but the ability to create and delete them as well. For a real-world application, you want a “least privilege” policy that only allows actions on the queue that you've just created. And really, you want separate policies for the programs that write to the queue and those that read from the queue. To support this, my module also creates two managed policies; here's the producer policy:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-consumer_policy"
  description = "Attach this policy to producers for ${var.queue_name} SQS queue"
  policy      = data.aws_iam_policy_document.producer_policy.json
}

data "aws_iam_policy_document" "producer_policy" {
  statement {
    actions = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}

Simple, but as I said above, there's a limit on the number of policies that you can attach to a single role. In a real-world micro-service application, you might have a half-dozen queues. Or more.

Extracting the Policy Statement

I originally created this example for a user group talk, so while it was based on real-world experience it wasn't actually used in the real world. I recently had to implement an SQS module for the real world, and decided to tackle the “too many policies” problem (ironically, I don't think it will be a problem for that specific use case).

After reviewing the various documents linked above, I realized that none of them mentioned the jsonencode function. This seemed like a rather strange omission, especially given the inclusion of approaches based on string literals (although not nearly as bad as XML, JSON still has its share of gotchas for those who wish to write it using strings).

As it turns out, creating policy documents with jsonencode() instead of using a data source works quite well:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [{
                    "Effect" = "Allow",
                    "Action" = [
                      "sqs:GetQueueAttributes",
                      "sqs:GetQueueUrl",
                      "sqs:SendMessage",
                      "sqs:SendMessageBatch"
                    ],
                    "Resource" = [
                      aws_sqs_queue.base_queue.arn
                    ]
                  }]
                })
}

That was a start, but it wasn't really where I wanted to go. The next step was to move the JSON into a local:

locals {
  producer_policy_statement = {
    "Effect" = "Allow",
    "Action" = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ],
    "Resource" = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}

Now I could make the policy reference the local, rather than either inline JSON or a data source:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [local.producer_policy_statement]
                })
}

Constructing an application role

That's all well and good, but how does it provide any benefit over the former self-contained module? The answer is that you can reference the JSON in outputs.tf:

output "producer_policy_statement" {
  value = local.producer_policy_statement
}

And that enables the following change in the invoking module: rather than a bunch of role/policy attachments, the application role combines these snippets of JSON into an inline policy.

resource "aws_iam_role" "application_role" {
  name = "ApplicationRole"

  assume_role_policy = # you saw this above, no need to repeat
}

resource "aws_iam_role_policy" "application_role_queue_policy" {
  name = "application_role_queue_policy"
  role = aws_iam_role.application_role.id

  policy = jsonencode({
    "Version"   = "2012-10-17",
    "Statement" = [
        module.notifications_queue.producer_policy_statement,
        module.rendering_queue.producer_policy_statement,
        module.reports_queue.producer_policy_statement
    ]
  })
}

Now, rather than counting against your quota of policy attachments, these statements count against your quota for inline policy size. Which, for role policies, is 10,240 characters. If that's too low, you can create managed policies composed of individual statements, for a total of nearly 70k per role (10 policies at 6k each, and one inline policy). If your application needs a role that's more complex than that, you might want to think about breaking it into micro-services.

But in my opinion, the best feature of this implementation is that it keeps the module as the arbiter of its resource's permissions. Which means there's less of a risk that a poorly crafted policy will open your system to an attacker.

If you'd like to try this out, I've saved both the original and the modular versions as GitHub Gists. One thing to note: Gists don't have any conception of sub-directories, and don't allow slashes in filenames, so I've used dashes to indicate directory boundaries (this should be clear once you look at the files).