
Wednesday, December 14, 2022

Deprecating log4j-aws-appenders

I just made what I hope is the last release of my log4j-aws-appenders library. It started almost six years ago as a "weekend project" to produce a Log4J 1.x appender that wrote output to CloudWatch Logs. It has expanded over the years: I added Kinesis Streams and SNS as destinations, added support for Log4J 2.x and Logback, added support for the AWS 2.x Java SDK, and maintained the code as something I could be proud of.

And other people found it useful: it's received a handful of GitHub stars, and sees around 30,000 downloads a month from Maven Central (overwhelmingly for Log4J 1.x!).

But I think it's outlived its usefulness, and there are better options.

I'll start with an explanation of why I wrote it. At the time, I was working on a project that had around a hundred EC2 virtual machines running pre-baked AMIs, and using the AWS-provided logging agent to write to CloudWatch. The pre-baked AMI meant that everything went to the same log stream: we didn't configure separate streams for separate instances. The agent wasn't (at that time) very good at recognizing the boundaries between log messages, and the multi-threaded nature of our servers meant that messages would be interleaved anyway. Throw in CloudWatch's limited ability to filter messages, and the result was a mess.

As a result, I had two explicit goals for the first release: ensure that a Java LogEvent was written as a single CloudWatch Logs event (including stack traces), and allow different instances to write to different log streams. And this was better, but not great, because it still meant interleaved messages from different threads. So I implemented the Kinesis appender, to send log messages to Elasticsearch. I don't think my boss liked the monthly cost of a six-node Elasticsearch cluster, but the other developers liked the ability to quickly isolate a single web request and all of its related operations.

So why deprecate it? The answer is that the AWS world has moved forward in the past six years. CloudWatch is the default logging destination for Lambda and ECS, and the CloudWatch agent for EC2 has gotten much better. And Insights has made searching CloudWatch much easier, especially if you write log messages in JSON.

So rather than swim against the flow, I recommend that you use CloudWatch Logs rather than my library. For Lambda (and ECS or EKS with the awslogs driver), just write to Standard Error; it will be picked up by the runtime and sent to CloudWatch. For EC2, use the logging agent, configured to parse individual events. If you want to use Elasticsearch, you can use a subscription to get the data out of CloudWatch Logs (something that's become easier with EventBridge Pipes, the topic of an upcoming post). Or you can use an HTTP appender (available in all the major logging frameworks), and wire up your own pipeline using API Gateway's Kinesis integration (something else that wasn't available six years ago).

With all of these options, my library is superfluous. And while I think it's mature, stable, and bug-free, using it exposes your project to the risk of having a single volunteer maintainer responsible for a key part of your infrastructure.

Thursday, February 24, 2022

Leveraging try-with-resources to manage the mapped diagnostic context

The mapped diagnostic context is an under-appreciated feature of logging libraries. It's essentially a thread-local Map, to which you add key-value pairs that provide context about what your program is doing. Here's a trivial example:

logger.info("program starting");
File file = new File(argv[0]);

MDC.put("outputFile", file.getAbsolutePath());

try (FileOutputStream fos = new FileOutputStream(file);
        OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
        BufferedWriter out = new BufferedWriter(osr))
{
    logger.info("starting to write");
    out.write("hello, world");
    out.newLine();
    logger.info("finished writing");
}

MDC.remove("outputFile");
logger.info("program finished");

When you run this program, using a logging config that outputs the MDC, you'll see something like this:

2022-02-23 07:28:01,673 INFO  [main] c.kdgregory.example.Main -  - program starting
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - starting to write
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main - outputFile=/tmp/test.txt - finished writing
2022-02-23 07:28:01,677 INFO  [main] c.kdgregory.example.Main -  - program finished

Web-apps can benefit enormously from the MDC: before processing the request, you extract things like the invoking user, request path, session token, or a unique request ID. This makes it easier to track down problems, because you can filter the log by any of those parameters.
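
Here's a sketch of how a web-app might do that, using a servlet filter (the key names are illustrative, and I'm assuming the SLF4J MDC):

import java.io.IOException;
import java.util.UUID;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import org.slf4j.MDC;

public class RequestContextFilter implements Filter
{
    @Override
    public void init(FilterConfig config) throws ServletException { /* nothing to do */ }

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
    throws IOException, ServletException
    {
        HttpServletRequest request = (HttpServletRequest)req;
        try
        {
            MDC.put("requestId", UUID.randomUUID().toString());
            MDC.put("requestPath", request.getRequestURI());
            chain.doFilter(req, resp);
        }
        finally
        {
            // clear everything, so the container's pooled thread starts fresh
            MDC.clear();
        }
    }

    @Override
    public void destroy() { /* nothing to do */ }
}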

But I've recently been doing a lot of data engineering work: reading files or database tables, performing some transformation on them, and sending them to some destination. These tasks tend to be complex, with multiple steps, and they tend to be performed for multiple files or tables at a time. As with a web-app, it's really nice to add contextual information such as table name and selection criteria to the MDC, and let the logging framework write it with every message.

The problem with the MDC is that every put() has to be paired with a remove(). If you forget to remove a key it will “pollute” every subsequent log message from that thread, giving you potentially misleading information.

However, I noticed that I tend to set the MDC just before a try-with-resources block, or inside a loop whose function calls could be wrapped in such a block. This led me to the AutoMDC class:

logger.info("program starting");
File file = new File(argv[0]);

try (AutoMDC mdc = new AutoMDC("outputFile", file.getAbsolutePath());
        FileOutputStream fos = new FileOutputStream(file);
        OutputStreamWriter osr = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
        BufferedWriter out = new BufferedWriter(osr))
{
    logger.info("starting to write");
    out.write("hello, world");
    out.newLine();
    logger.info("finished writing");
}

logger.info("program finished");

Try-with-resources works with classes that implement java.lang.AutoCloseable: an interface that declares the close() method. My AutoMDC class implements this interface, and uses a Set<String> to keep track of the keys. It provides two methods (and the “convenience” constructor shown above):

public AutoMDC put(String key, Object value)
{
    keys.add(key);
    MDC.put(key, String.valueOf(value));
    return this;
}

@Override
public void close() throws Exception
{
    for (String key : keys)
    {
        MDC.remove(key);
    }
}

That's it: you can add items to the MDC in the try-with-resources initializer, or anywhere inside the try block itself, and they'll be removed at the end. One downside is that the close() method is called before any catch or finally blocks; if you need to log contextual information in those places, you'll need to do it manually.
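
For reference, here's a minimal sketch of the complete class, assuming the SLF4J MDC; the varargs constructor is my reading of the “convenience” constructor shown above:

import java.util.HashSet;
import java.util.Set;

import org.slf4j.MDC;

public class AutoMDC implements AutoCloseable
{
    private Set<String> keys = new HashSet<String>();

    // convenience constructor: accepts alternating keys and values
    public AutoMDC(Object... keysAndValues)
    {
        for (int ii = 0; ii < keysAndValues.length; ii += 2)
        {
            put(String.valueOf(keysAndValues[ii]), keysAndValues[ii + 1]);
        }
    }

    public AutoMDC put(String key, Object value)
    {
        keys.add(key);
        MDC.put(key, String.valueOf(value));
        return this;
    }

    @Override
    public void close() throws Exception
    {
        for (String key : keys)
        {
            MDC.remove(key);
        }
    }
}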

If you'd like to use this in your own programs, I've provided an implementation for Logback as a GitHub gist; the same basic code can be adapted to whichever logging framework you use.

I think the idea of leveraging try-with-resources is a useful one for other situations, especially those where you might otherwise turn to aspect-oriented programming. Plus, it encourages the good habit of wrapping your code in try/catch blocks.

Friday, February 26, 2021

Java8 Lambda Startup Times

A few months ago I wrote a post about startup times of AWS Lambdas written in Java. This post has a similar title, but a different topic: it looks at the first-run time for lambdas (lowercase) in a Java program, and has nothing to do with AWS. Although I did discover this issue while writing code for AWS. Confused yet?

Lambda expressions were added to the Java language with the release of Java 8 in 2014. By now I'm assuming every Java programmer has used them, if only as arguments to higher-order functions in the java.util.stream package:

List<String> uppercasedNames = names.stream()
                               .map(s -> s.toUpperCase())
                               .collect(Collectors.toList());

You can implement your own higher-order functions, with parameter types from the java.util.function package (or, if you have more complex needs, your own functional interfaces). So, for example, you might have a function with this signature:

public static String retryLambda(long interval, long timeout, Supplier<String> lambda) throws Exception

This can be called with any lambda expression that doesn't take arguments and returns a string. For example:

retryLambda(50, 100, () -> Instant.now().toString());

As you might have guessed from the signature, this function retries some operation. But before I dig into the implementation, here's some background about why you'd implement such a function. Most of my recent posts have referenced my AWS logging library, and this one's no different. When working with AWS, you need to be prepared to retry operations: either because AWS throttled the request (returning an error that indicates you should retry after a short delay), or because operations are eventually-consistent (there's a delay between creating something and being able to use it). As a result, AWS code can include a lot of retry loops:*

long timeoutAt = System.currentTimeMillis() + timeout;
while (System.currentTimeMillis() < timeoutAt)
{
    String value = doSomething();
    if (value != null)
        return value;
    Thread.sleep(interval);
}
throw new RuntimeException("timeout expired");

That's seven lines of boilerplate wrapping one line that actually does something. Functional programming is all about getting rid of boilerplate, so I implemented a function that would accept a lambda:**

public static String retryLambda(long interval, long timeout, Supplier<String> lambda) throws Exception
{   
    long timeoutAt = System.currentTimeMillis() + timeout;
    while (System.currentTimeMillis() < timeoutAt)
    {
        String value = lambda.get();
        if (value != null)
            return value;
        Thread.sleep(interval);
    }
    
    throw new RuntimeException("timeout expired");
}

The hardcoded loops can now be replaced with a call to this function:

retryLambda(50, 250, () -> doSomething());

All well and good, and it reduced the size of the code, but then my tests started failing.

When you're actually talking to AWS, you might need a timeout of 30 seconds or more. But you definitely don't want such a long timeout in a unit test. To solve that problem, I replaced the interval and timeout arguments with much shorter values: 50 and 200 milliseconds. My tests would then assert the number of times the function was called: based on those values, the operation should be attempted four times before timing out. However, I was seeing only two or three attempts.
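
To make the failure concrete, here's a sketch of the sort of test involved (JUnit 4, with illustrative names; it assumes the retryLambda() function shown above is in scope, for example via static import):

import java.util.concurrent.atomic.AtomicInteger;

import org.junit.Test;
import static org.junit.Assert.*;

public class RetryLambdaTest
{
    @Test
    public void testRetryUntilTimeout() throws Exception
    {
        AtomicInteger attempts = new AtomicInteger();
        try
        {
            // a supplier that never succeeds, so every call should retry
            retryLambda(50, 200, () -> {
                attempts.incrementAndGet();
                return null;
            });
            fail("expected a timeout");
        }
        catch (RuntimeException expected)
        {
            // 200 ms timeout with a 50 ms interval: attempts at 0, 50, 100, and 150 ms
            assertEquals(4, attempts.get());
        }
    }
}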

When I dug into the problem, what I discovered is that the first execution of a lambda takes 40 to 50 milliseconds on my Core i7-3770K running Oracle Java 1.8.0_271. I knew there was a lot happening behind the scenes to make lambdas work, but wow, that's nearly infinity!

I also ran on an EC2 m5a.xlarge instance running AWS Linux 2, and saw that it took over 70 milliseconds with OpenJDK 1.8.0_272, but only 18 milliseconds running Corretto 11.0.10.9.1. I have to assume that the performance improvement is similar across Java11 implementations, but haven't tested. If you'd like to try it out yourself, I've created a GitHub Gist with the test program.

One thing that I do not want you to take from this post is the idea that Java lambdas are bad, or are poorly implemented. I didn't delve too deeply into what happens during that first invocation, but suspect that the JVM is loading something from disk (much like the initial JVM startup time). And in my experiments, invoking additional, different lambdas did not add to the execution time. So, like anything Java, lambdas are best used in a long-running program.

However, if you are in a similar situation, testing timing-dependent code that utilizes lambdas, you need to be prepared. When I ran into the problem, I simply wanted to move on with my life and relaxed the assertions (the primary assertion was elapsed time, which didn't change; it was simply the number of invocations). Now, after thinking about the problem and writing the example program for this post, I think I'd use a @BeforeClass function to “warm up” the lambda mechanism.
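
Something like this (JUnit 4), at the top of the test class:

import java.util.function.Supplier;

import org.junit.BeforeClass;

public class TimingSensitiveTest
{
    @BeforeClass
    public static void warmUpLambdas()
    {
        // invoking any lambda forces the JVM to bootstrap its lambda support,
        // so the timing-sensitive tests don't pay that first-run cost
        Supplier<String> warmup = () -> "ignored";
        warmup.get();
    }

    // ... timing-sensitive tests here ...
}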


* Not all AWS code needs to have retry loops. But, for example, if you create a Kinesis stream you will need to wait until it becomes active before writing to it. I've seen some mock implementations of AWS services that don't accurately reflect these delays, leading to code that fails in the real world.

** Actually, I implemented a class, which was far easier to replace during testing. For an example of its use, look here.

Monday, February 1, 2021

Why I Won't Use The AWS V2 Java SDK

Update, 2024-01-24

Well, I was wrong: AWS is end-of-lifing the v1 SDK. They will stop adding new features in July of this year, and stop providing updates in December 2025. They just announced this plan, so I'm still waiting to see if they'll back-pedal under pressure from large customers. But until that happens, ignore my advice below and start converting your code.

That said, feel free to read my venting from three years ago ...

Amazon's "version 2" SDK for Java was first released three and a half years ago. As of this writing, the current version is 2.15.73. So it's time to upgrade your Java AWS projects, right?

To which my answer is a resounding “NO!” In fact, if you are starting a new Java AWS project, I recommend that you stick with the version 1 SDK unless you have a compelling reason to change. Especially if you already have code that uses version 1.

I base these statements on my experience updating my logging library. While this library doesn't use a lot of AWS services, it goes deep on the ones that it does use, especially client configuration. For additional context, I've been using the version 1 Java SDK since it came out, I currently use the Python SDK very heavily, and I've also spent some time with the JavaScript and Ruby SDKs. Plus, I implemented my own S3 SDK before the Java SDK was available. In other words, I'm not just someone who spent an hour trying to work with the library, got fed up, and is now venting.

And before I do start venting, I want to call out the things that I do like about the new library:

Consistent Naming
This is big. It's not just a matter of getting rid of the superfluous “Amazon” or “AWS” prefix, but of consistent categorization. For example, CloudWatch Logs: that's its name, you find its documentation next to other things named CloudWatch, and you look at your logs by going to the CloudWatch page in the console. Yet in the v1 SDK it's called AWSLogs, and the JAR is aws-java-sdk-logs. A small thing, but small things are what enhance or detract from developer productivity.
Paginators
Paginated requests are a Good Thing: not only do they reduce the load on Amazon's servers, they avoid blowing up client programs (eg, downloading a gigabyte of data due to a poorly configured FilterLogEvents call). But dealing with pagination is a developer tax: the same dozen or so lines of code every time you make a paginated request (in fact, that was one of the primary drivers of my been-on-hold-because-I-mostly-work-with-Python-these-days AWS utility library). It's a lot easier to request a paginator and then loop through results or process them with a Java8 stream. I especially like that Java paginators don't force you to deal with pages at all, unlike the Python paginators.
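
As an example, here's a sketch of listing CloudWatch log group names with a v2 paginator; the paginator handles the NextToken bookkeeping internally:

import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;

CloudWatchLogsClient client = CloudWatchLogsClient.create();

// iterates every log group, transparently fetching pages as needed
client.describeLogGroupsPaginator()
      .logGroups()
      .forEach(group -> System.out.println(group.logGroupName()));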

And … that's about it. Now onto the things that I don't like:

Wholesale name changes of getters and setters
I understand why they did it: everybody hates Java's “bean” naming conventions. The get, set, and with prefixes are just visual clutter. All of the other SDKs use “property” naming, so why not make the Java SDK consistent? But the result, if you're planning to upgrade an existing project, means hours — perhaps days — of tedious manual effort to get rid of those prefixes. And at the same time, you'll be converting all of your code to use builders rather than constructors.
Just when you get used to the new conventions, you find a client that doesn't follow them
In my case, it was discovering that KinesisClient didn't support paginators, so I found myself re-implementing that dozen lines of code to iterate the list of streams. I also discovered that IamClient doesn't provide create(), unlike the other clients, but that brings up a bigger issue.
The documentation is incomplete and at times misleading
IamClient doesn't implement the create() method. That seems like an oversight, until you use IamClient.builder().build() and get an UnknownHostException with a long and confusing chain of causes. To get a usable client, you must add .region(Region.AWS_GLOBAL) to the builder invocation.
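
In code:

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.iam.IamClient;

// IAM has no region-specific endpoints, so the client must be
// built with the AWS_GLOBAL pseudo-region
IamClient iam = IamClient.builder()
                         .region(Region.AWS_GLOBAL)
                         .build();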

There is no hint of this in the IamClient documentation. The only AWS documentation that has anything to say on the topic is the Migration Guide, in which you'll see “Some AWS services don't have Region specific endpoints.” But no list of the classes affected, or examples of client configuration code. Oh, and sorry to anybody in China who might want to use my library: you've got a different region name.

That covers “incomplete.” For “misleading,” take a look at StsAssumeRoleCredentialsProvider. The version 1 variant of this class requires an STS client, which it uses to periodically refresh the credentials. For version 2, there's no indication that the equivalent builder method exists — moreover, the JavaDoc claims that the builder class extends Object, which should mean that the documented methods are all that exist. You have to look at the source code to see that it actually extends a BaseBuilder class (and has since the developer preview was released).

I think that the core problem in both cases is that perhaps 99.9% of everything in the SDK is generated from one central API specification. This is really the only way that you can manage multiple SDKs for over 100 services, each of which exposes dozens of operations. However, that 0.1% (or less) is what gives each SDK its personality, and it seems that the documentation generators for the v2 Java SDK aren't keeping up with the custom pieces. Fortunately, there's Google and Stack Overflow.

There are still missing capabilities
According to the migration guide, “higher level” libraries like the S3 Transfer Manager aren't available; if you need them you'll need to use both the V1 and V2 SDKs. I haven't verified this myself, but in the past I've used the Transfer Manager very heavily and have recommended it to others.

So what would be a compelling reason to switch? The only one I can think of is the ability to use HTTP/2 and true asynchronous clients. Both can improve performance when you're making large numbers of concurrent API calls. Of course, there are often better alternatives to making large numbers of API calls, especially calls that run the risk of being throttled (and a bonus complaint about AWS APIs in general: they have many different ways to report throttling).

One thing that is not compelling is the fear that one day AWS will decide to discontinue the version 1 SDK. First, because AWS never seems to deprecate/remove anything (does anyone still use SimpleDB? how about the Simple Workflow Service?). But more important is the existing code that uses version 1. I don't know how many millions of lines there are, but I suspect that it's large. And I also suspect that most of it belongs to companies that pay Amazon lots of money and would seriously reconsider cloud providers if required to update their codebases.

Monday, December 28, 2020

Diving Into First-Run Times for Lambdas Written in Java

A few weeks ago I answered a Stack Overflow question about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.

The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.

Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to initialize the Lambda runtime, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.

OK, on to Teh Codez:

public void handler(Object ignored, Context context)
{
    long start = System.currentTimeMillis();
    
    AWSLogs client = AWSLogsClientBuilder.defaultClient();
    
    long clientCreated = System.currentTimeMillis();
    
    client.describeLogGroups();
    
    long apiInvoked = System.currentTimeMillis();
    
    System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
    System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));
}

This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before each of these actions I retrieve the current system time. How long can it take?

Setting a Baseline

People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.

If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.

But observations are always better than guesses. Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes (to get these numbers I created a main() that just invokes the handler function):

time to create SDK client =    744
time to make API call     =    469

That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:

time to create SDK client =    702
time to make API call     =    522
time to create SDK client =     21
time to make API call     =    151

Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux time command:

> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar 
time to create SDK client =    576
time to make API call     =    623

real    0m1.320s
user    0m2.130s
sys     0m0.113s

The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; it's nearly twice the real time, which indicates that the CPU is doing a lot of work. If the program was spending all of its time making network calls, that user time should be less than the real time.

This information alone is enough to make a first pass at optimization.

Memory Allocation

When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a not always deserved reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.

However, maximum memory consumption only tells part of the story. A much — much — more important fact is that Lambda provides CPU in proportion to allocated memory. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you'd get for a given memory size, but assuming a linear relationship, 512 MB would be a third of a CPU — a virtual CPU of unspecified performance.

The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.

                  512 MB   1024 MB   2048 MB   4096 MB
Create client       5298      2493      1272      1019
Invoke API call     3844      2023      1061       613
Billed duration     9213      4555      2349      1648

If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 was accounted for by my timestamps).

These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say, Python (where initialization takes around 150 milliseconds).

Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:

The most important thing that you can do to improve Lambda performance is increase memory allocation.

This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.

However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.
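
If you want to measure this yourself, here's a minimal sketch using the JVM's standard management beans (not necessarily the exact code I used):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// sums the time spent in all collectors since JVM startup, in milliseconds
long totalGcMillis = 0;
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans())
{
    totalGcMillis += gcBean.getCollectionTime();
}
System.err.format("total GC time = %6d\n", totalGcMillis);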

Classloading

So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.
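
You can see the count for your own program via another management bean:

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
System.err.format("classes loaded = %6d\n", classLoading.getTotalLoadedClassCount());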

The JVM loads classes on-demand. It won't load the AWSLogsClientBuilder class until it executes the line that calls defaultClient(). Loading and initializing a class requires loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.

Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The tricks available to stand-alone Java programs aren't available in Lambda.

However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.

A Different SDK

Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the SDK docs themselves, via a link that doesn't go anywhere).

But, as I said before, there's no substitution for observation. Here are the numbers using version 2.15.53:

                  512 MB   1024 MB   2048 MB   4096 MB
Create client       4965      2278      1141       959
Invoke API call     4235      2062      1047       661
Billed duration     9237      4357      2204      1637

No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”

For that matter, switching to the v2 SDK requires a significant amount of tedious recoding to change package and method names. And as of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.

Packaging

Earlier this year I wrote an article about the different ways to package a Java Lambda. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't record the performance gained by switching to the Assembly plugin.

Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:

                  512 MB   4096 MB
Shade Plugin        9213      1648
Assembly Plugin     8138      1358

So, a decent speedup with a small memory allocation (which I hope you wouldn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this requires little effort.

Provisioned Concurrency

On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, you'll see no change in first-run execution time.

The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.

To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, don't go there). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).
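
Here's a sketch of that restructuring, applied to my earlier example; the read-only describeLogGroups() call exercises the client without changing anything:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.DescribeLogGroupsRequest;

public class ExampleHandler
{
    private final AWSLogs client;

    public ExampleHandler()
    {
        // runs during initialization, which provisioned concurrency performs
        // ahead of time: create the SDK client and force it to load its classes
        client = AWSLogsClientBuilder.defaultClient();
        client.describeLogGroups(new DescribeLogGroupsRequest().withLimit(1));
    }

    public void handler(Object ignored, Context context)
    {
        client.describeLogGroups();
    }
}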

Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.

Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month in addition to the per-request cost. It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.

Wrapping Up

I first implemented a Lambda web-app with Java nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with embedded Jetty server.

That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the Hotspot engine will optimize performance.

The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.

Saturday, July 4, 2020

Packaging Java for AWS Lambda

When I want to package a Java application in a single JAR with all of its dependencies, I normally turn to Maven’s Shade plugin. This plugin builds an “UberJAR” by unpacking all of the files from the project's dependencies and repackaging them into a single JAR.

The Shade plugin is also used by the Lambda Developer's Guide. I was surprised, then, to see this later in the same guide:

Reduce the time it takes Lambda to unpack deployment packages authored in Java by putting your dependency .jar files in a separate /lib directory. This is faster than putting all your function’s code in a single jar with a large number of .class files.

I was surprised because a single JAR containing class files should be the most efficient way to deploy a project. The JVM memory-maps the JAR files on its classpath, which means that it can access arbitrary parts of those files without an explicit call to the OS kernel. And an UberJAR means that the JVM only has to examine a single directory structure.

After some investigation, I discovered the reason: Lambda unpacks the contents of the deployment package into its filesystem. Which means that an UberJAR turns into a lot of little files, each of which must be read using several kernel calls.

OK, so if you’re a Maven user, how do you package your Lambdas in the preferred way? The answer is that instead of using the Shade plugin, you use the Assembly plugin. The stated goal of this plugin is to allow developers to package their project artifact with related information such as a website. But it's a very flexible plugin, using an “assembly descriptor” to specify which files should be included in the final artifact, and where they should go.

This is the descriptor that I used for my Java Lambda example. It's duplicated for each of the Lambdas, and is found at src/assembly/deployment.xml:

<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
    <id>deployment</id>
    <includeBaseDirectory>false</includeBaseDirectory>
    <formats>
        <format>zip</format>
    </formats>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/lib</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>false</unpack>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>

The descriptor is simple: it says that every JAR used by the Lambda — including the one built by the project itself — should be stored in the /lib directory of the final artifact. To use it, I replace the Shade configuration in my POM with this (with the plugin.assembly.version property defined as 3.2.0):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>${plugin.assembly.version}</version>
    <configuration>
        <descriptors>
            <descriptor>src/assembly/deployment.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

And that's it: when I run mvn package I get a file in my target directory with the suffix -deployment.zip.

Friday, July 19, 2019

Broken Backwards Compatibility: The Bane of the Library Developer

Backwards compatibility has long been a hallmark of Java: there are many classes and methods still in the standard library even though they've been deprecated since the 1990s. Imagine my surprise, then, when I saw this exception:

java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
    at net.sf.kdgcommons.util.Counters.keySet(Counters.java:179)
    ...

To understand why, you need some back-story. I maintain a couple of open-source libraries that I claim are compatible with Java 1.6. To ensure that I don't accidentally use methods that didn't exist in 1.6, I have a 1.6 JVM installed on my development machine, and tell Eclipse to use it to compile the project. Ditto for the library that was running this test.

However, I don't use the real 1.6 JDK for producing the released versions of these libraries. I use Maven for releases, and the newer versions of Maven won't run on anything older than JDK 1.8 (and there are enough backwards-incompatibilities between older versions of Maven and its plugins that trying to use one is an exercise in frustration). This means that, even though I'm setting my target property to 1.6, it's compiling against the 1.8 standard library.

And in 1.8, the Java maintainers changed the return type of ConcurrentHashMap.keySet(): previously, it returned a Set, as defined by the contract of java.util.Map. In 1.8, it returns a ConcurrentHashMap.KeySetView. This concrete class implements Set, so it doesn't break the contract at the source level.

However, things are different at the bytecode level: the actual return type is retained in the method's descriptor, and the invokevirtual operation attempts to find a defined method that matches that descriptor. It will accept a method definition that returns a subtype of the specified return type, but not one that returns a supertype. So, when presented with a version of the standard library in which keySet() returns a Set, the JVM throws NoSuchMethodError.

There is a work-around: cast the map reference to java.util.Map. This causes the compiler to use the method definition from the interface, which fortunately has not changed since it was released in 1.2.

ConcurrentHashMap map = new ConcurrentHashMap();

// the cast makes the compiler emit a call to Map.keySet(), whose
// signature has never changed
Set keys = ((Map)map).keySet();

There is a bug report for this behavior, which was closed with “Won't Fix”. Unfortunately, that's the correct resolution: reverting the method signature would not only break any code that relies on it, but also any deployed applications compiled with the old signature but running on a JVM with the reverted signature.

I can only hope that, in reviewing the bug report, the JVM maintainers realized that changing the public API of a class — any class — is a mistake. Unfortunately, some of the changes happening with newer releases of Java indicate that this will become a more frequent occurrence. Which is probably why 1.8 remains (in my experience) the primary deployment platform for Java apps, even though we're now at Java 12.

Saturday, August 4, 2018

The KFEK Stack: A Logging Pipeline for AWS

This post is a short introduction to building a logging pipeline based on managed services provided by AWS. I've covered this topic in a presentation for the Philadelphia Java User's Group, and in a much longer article on my website. I've also built such a pipeline, aggregating log messages from (when I left the company) a 200 machine production environment that generated 30 GB of log messages per day.

I'll start with a quick review of why log aggregation is a Good Thing, especially in a cloud deployment:

  • You may have many machines writing logs.

    To inspect those logs you either need to go to the machines themselves, pull the logs to your workstation, or push them to a centralized location. And your system operations group may have very strong feelings about some of these options.

  • You may have many machines running the same application.

    To track down a problem, you will have to examine multiple logfiles, and be able to identify which machine they came from.

  • You may need to correlate log messages from different applications.

    This is a particular problem in a micro-service architecture: your web requests may fail due to a remote service, meaning that you have to correlate logs from different machines.

  • Machines regularly shut down.

    The promise of cloud deployments is that you can scale-up and scale-down in response to actual load. However, scaling down means terminating the instance, and that means losing all files that haven't been retrieved or pushed to a central location.

There are many ways to approach centralized logging. One of the simplest — and for a small deployment, arguably best — is to ship your logs off to a third-party provider, such as Loggly or SumoLogic. They make it easy to start with centralized logging, providing useful tools at a very reasonable price. The downside is that, as your logging volumes increase, you may move into their “call us” pricing plans.

The standard self-managed solution is the “ELK” stack: Elasticsearch, Logstash, and Kibana. All are products of Elastic, which provides these three products in open source versions and makes its money from paid enhancements and consulting.

In my eyes, there are two primary drawbacks to deploying the ELK stack. The main one is that you have to deploy it — and possibly wake up in the middle of the night when one of the nodes in your Elasticsearch cluster goes down. This drawback is answered by Amazon Elasticsearch Service, a managed implementation of Elasticsearch and Kibana. It allows you to bring up or reconfigure an Elasticsearch cluster with a few mouse clicks in the console, and the AWS ops team will fix any failed hardware for you. In exchange, you pay slightly more than a self-hosted solution, and give up some flexibility. For me that's a good deal.

The second drawback to the ELK stack is the “L”: Logstash, and its companion, Filebeat. To make the ELK stack work, you need to install an agent on each machine: that agent looks at local logfiles, parses them, and ships a JSON representation off to Elasticsearch. This isn't too onerous, as long as you format your logfiles to match one of the out-of-the-box formats, but if you don't you need to write regexes. And you need to manage the on-host logfiles, using a tool like logrotate.

My solution is to have applications write JSON-formatted messages to a Kinesis Data Stream, where they are picked up and sent to Elasticsearch via Kinesis Firehose.

The chief benefit of this architecture — aside from not installing log agents or being woken up in the middle of the night — is scalability. A Kinesis Stream is built from shards, and each shard can accept 1,000 messages or 1 MB per second. As your logging volume increases, you can add shards as needed. Firehose will accommodate the increased throughput by writing to Elasticsearch more frequently. And if you start pushing the limits of your Elasticsearch cluster, you can expand it from the AWS Console.

There is one question remaining: how do the applications write directly to the Kinesis stream? It's actually a rather challenging problem, as writers need to use bulk APIs for performance, but be prepared to resend individual messages. My solution, since I come from a background of Java and Log4J, was to write a Log4J appender library (which also supports other AWS destinations). But I'm not the only person who has had this idea; Googling turned up implementations for Python, .Net, and other languages (I haven't tried them, so am not linking to them).
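
To give a flavor of the problem, here's a minimal sketch of the resend loop, using the v1 SDK's PutRecords call (the stream name and partition key are illustrative, and real code needs backoff and a retry limit):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordsRequest;
import com.amazonaws.services.kinesis.model.PutRecordsRequestEntry;
import com.amazonaws.services.kinesis.model.PutRecordsResult;
import com.amazonaws.services.kinesis.model.PutRecordsResultEntry;

AmazonKinesis client = AmazonKinesisClientBuilder.defaultClient();

List<PutRecordsRequestEntry> batch = new ArrayList<>();
batch.add(new PutRecordsRequestEntry()
          .withPartitionKey("example-app")
          .withData(ByteBuffer.wrap("{\"message\":\"hello\"}".getBytes(StandardCharsets.UTF_8))));

while (!batch.isEmpty())
{
    PutRecordsResult result = client.putRecords(
            new PutRecordsRequest().withStreamName("logging-stream").withRecords(batch));

    // the bulk call can partially fail: each failed record is flagged in the
    // result, in the same order as the request
    List<PutRecordsRequestEntry> failed = new ArrayList<>();
    List<PutRecordsResultEntry> results = result.getRecords();
    for (int ii = 0 ; ii < results.size() ; ii++)
    {
        if (results.get(ii).getErrorCode() != null)
            failed.add(batch.get(ii));
    }
    batch = failed;  // real code would sleep with backoff before retrying
}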

To recap: if you have a few dozen machines, you'll probably find third-party log aggregators a cost-effective, easy solution. But if your logging volumes are high enough to warrant the cost, this pipeline is an effective alternative to a self-managed ELK stack.

Monday, November 23, 2015

Java Object Serialization and Untrusted Code Execution

This past week we had a mini-fire-drill at work, in response to a CERT vulnerability note titled “Apache Commons Collections Java library insecurely deserializes data.” We dutifully updated our commons-collections dependencies, but something didn't quite smell right to me, so I did some digging. And after a few hours of research and experimentation, I realized that commons-collection was only a convenient tool to exploit the real weakness.

Yes, if you use commons-collection, you should update it to the latest version. Today. Now.

But more important, you need to review your codebase for any code that might deserialize untrusted data. This includes references to ObjectInputStream, but it also includes any RMI servers that you expose to the outside world. And don't stop at your code, but also look at your dependencies to make sure that they don't deserialize untrusted data. Once you've done that, read on for an explanation.

Actually, there are two things you should read first. The first is the slide deck from “Marshalling Pickles,” which describes how this exploit works (for multiple languages, not just Java). The second is my article on Java serialization, especially the section on readObject().

Now that you've read all that, I'm going to summarize the exploit:

  • The Java serialization protocol is public and easy to read. Anybody can craft an arbitrary object graph and write it to a file.
  • To deserialize an object, that object's classfile must be on the classpath. This means that an attacker can't simply write a malicious class, compile it, and expect your system to execute it (unless they can also get the bytecode into your system, in which case you've already lost). This, however, is a false sense of security.
  • Apache Commons Collections provides a number of functor objects that perform invocation of Java methods using reflection. This means that, if Commons-Collections is on your classpath, the attacker can execute arbitrary code, by providing it as serialized data. And because it's so useful, Commons-Collections is on many classpaths, including many app-servers.
  • Commons-Collection also provides a LazyMap class that invokes user-defined functors via its get(). This means that, if you can get the target JVM to accept and access such a map, you can execute arbitrary code.
  • The JRE internal class sun.reflect.annotation.AnnotationInvocationHandler has a member variable declared as a Map (rather than, say, a HashMap). Moreover, its readObject() method accesses this map as part of restoring the object's state.

Putting all these pieces together, the exploit is a serialized AnnotationInvocationHandler that contains a LazyMap that invokes functors that execute arbitrary code. When this graph is deserialized, the functors are executed, and you're pwned.

There are a few points to take away from this description. The first, of course, is that uncontrolled deserialization is a Bad Idea. That doesn't mean that serialization is itself a bad idea; feel free to serialize and deserialize your internal data. Just be sure that you're not deserializing something that came from outside your application.

The second take-away is that code-as-data can be dangerous. As I said above, there's no way to put arbitrary code on your classpath, but the exploit doesn't rely on that: it relies on passing arbitrary data into your application. Clojure fans might not like the implications.

For that matter, pay attention to what goes onto your classpath! Apache Commons is an incredibly useful collection of utility classes; if you don't use them, it's likely that one of your dependencies does. Or you have a dependency that offers features that are just waiting to be exploited. My KDGCommons library, for example, has a DefaultMap; all it's missing is a reflection-based functor (and a much wider usage).

The final — and most important — take-away is to pay attention to the variables in your serialized classes. The attack vector here is not commons-collections, it's AnnotationInvocationHandler and its Map. If, instead, this member was defined as a concrete class such as HashMap, the exploit would not have worked: attempting to deserialize a LazyMap when a HashMap is expected would cause an exception. I know, using a concrete class goes against everything that we're taught about choosing the least-specific type for a variable, but security takes precedence over purity (and you don't need to expose the concrete type via the object's API).

If you're not scared enough already, I'll leave you with this: in the src.zip for JDK 1.7.0_79, there are a little over 600 classes that implement Serializable. And there are a bunch of these that define a private Map. Any of those could be a vector for the same attack.

Friday, July 17, 2015

The Achilles Heel of Try-With-Resources

Here's some code that you might see since Java 7 appeared. Can you spot the bug?

public List<String> extractData(File file, String encoding)
throws IOException
{
    List<String> results = new ArrayList<String>();
    try (BufferedReader rdr = new BufferedReader(new InputStreamReader(new FileInputStream(file), encoding)))
    {
        // read each line and extract data
    }
    return results;
}

OK, here's a hint: what if encoding is invalid?

The answer is that the InputStreamReader constructor will throw UnsupportedEncodingException (which is a subclass of IOException).

This exception happens after the FileInputStream has been created, but before the body of the try-with-resources statement. Which means that the implicit finally in that statement won't get executed, and the file will remain open until the stream's finalizer executes.

To solve this problem, you need to break apart the constructor chain, to ensure that the actual resource is assigned to its own variable in the resource clause, and will therefore be closed. There are two ways to approach this: either use multiple resource clauses, or push non-resource constructors into the body of the statement.

In this case, the InputStreamReader and BufferedReader are simple decorator objects: if you neglect to close them, it doesn't change the behavior of the program. So we can construct them inside the body of the statement:

try (InputStream in = new FileInputStream(file))
{
    BufferedReader rdr = new BufferedReader(new InputStreamReader(in, encoding));
    // ...

Bug solved: regardless of what happens when constructing rdr, the underlying stream will be closed; we don't have to worry about a “too many open files” exception.

However, if we try to do something similar with output streams, we introduce a new bug:

try (OutputStream out = new FileOutputStream(file))
{
    // DON'T DO THIS!
    BufferedWriter wtr = new BufferedWriter(new OutputStreamWriter(out, encoding));
    // ...

The problem with this code is that decorators are important when writing. Unless you close the BufferedWriter, the data in its buffer won't be written to the file. On the positive side, you'll see this bug the first time you try to write a file. To solve it and still use the try-with-resources construct, you need to use multiple resource clauses:

try (OutputStream out = new FileOutputStream(file) ;
        OutputStreamWriter osw = new OutputStreamWriter(out, encoding) ;
        BufferedWriter wtr = new BufferedWriter(osw))
{
    // ...

There you have it, the only safe way to use try-with-resources. If you currently have code with nested constructors, I strongly recommend that you change it. And please change any public documentation that has nested constructors; explaining the bug to someone that is “just following the docs” wastes time that could be used for better things.

Monday, June 16, 2014

Finalizers

I recently saw a comment on my reference objects article that said, in effect, “he mentioned finalizers but didn't say DON'T USE THEM EVER, so I stopped reading.” The capitalized words are taken exactly from the original comment; the rest is a paraphrase to tone down the rhetoric. The comment didn't bother me — if you stop reading so easily, you probably won't gain much from my articles anyway — but the attitude did. Because finalizers do indeed have a role to play.

They compensate for sloppy programming.

Good Java programmers know that non-memory resource allocation has to take place in a try / catch / finally construct. Just like good C programmers know that every malloc() must be matched by a free(). Sloppy programmers either don't know or don't care. And good programmers sometimes forget.

The question then becomes: what should the program do?

One alternative is to just explode. And this is not necessarily a bad thing: not closing your resources is a bug, and it's better to expose bugs early in the development process. In this view, forgetting to close your resource is no different than dereferencing a null pointer.

The problem with this view is that null pointer exceptions tend to show up early, the first time through the affected code. By comparison, leaked resources tend to hide until production, because you never generate sufficient load during development. Leaked file handles are a great example: a typical developer environment allows 4,096 open files. It will take a long time to run out of them, especially if you constantly restart the app. A typical server environment might allow twice or four times that much, but if you never shut down the app you'll eventually run out. Probably at your time of highest load.

And that leads to an alternate solution: protect the programmers from themselves, by checking for open resources at the time the owning object is garbage collected. This isn't perfect: in a server with lots of memory, the garbage collection interval might exceed the time taken to exhaust whatever resource you're leaking.*

Once you accept the need for some alternate way to clean up non-memory resources, the only remaining question is how. Java provides two mechanisms: finalizers or (since 1.2) phantom references.

So which should you use? Before answering that question, I want to point out something to the DON'T USE THEM EVER crowd: finalizers and phantom references are invoked under exactly the same conditions: when the garbage collector decides that an object is eligible for collection. The difference is what happens afterward. With phantom references the object gets collected right away, and the reference is put on a queue for later processing by the application. With finalizers, the object is passed to a JVM-managed thread that runs the finalize() method; actual collection happens once the finalizer runs.

Given the choice between a finalizer and a phantom reference, I will pick the finalizer 95% of the time. Phantom references require a lot of work to implement correctly. In the normal case, the only real benefit they provide is to avoid out-of-memory errors from large objects with long-running finalizers.**
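
To illustrate the work involved, here's a minimal sketch of phantom-reference cleanup; note that the application must manage the queue and perform the cleanup itself:

import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class PhantomCleanupExample
{
    public static void main(String[] argv) throws InterruptedException
    {
        ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
        Object owner = new Object();
        PhantomReference<Object> ref = new PhantomReference<Object>(owner, queue);

        owner = null;   // make the object eligible for collection
        System.gc();    // request collection (not guaranteed by the JLS)

        Reference<?> collected = queue.remove(5000);   // wait up to five seconds
        if (collected == ref)
        {
            // this is where you'd release the associated non-memory resource
            System.out.println("object collected; cleanup can run");
        }
    }
}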

A better answer is to avoid both. Far too many people think of finalizers as equivalent to C++ destructors, a place to put important cleanup code such as transaction commit/rollback. This is simply wrong. But it's just as wrong to invoke that code in a phantom reference.

Rather than dogmatically insist “no finalizers!”, I think the better approach is to adopt some development practices to prevent their invocation. One approach is to identify resources that are allocated but not released, using an analysis tool like FindBugs.

But if you want to recover from mistakes in the field, and can do so quickly and cleanly (so that you don't block the finalization thread), don't be afraid of finalizers.


* This is unlikely: if you're running through resources quickly, the associated objects will almost certainly be collected while still in the young generation. At this point, a pedantic pinhead will say “but the JLS doesn't guarantee that the garbage collector will ever run.” My best response to that: “well, then you've got bigger problems.”

** A bigger benefit, and this is the 5% where I would use them, is to provide leaked-object tracking. For example, in a database connection pool. But I don't consider that the “normal” case.

Friday, July 19, 2013

What is Compelling?

For all that Java got right, I'm looking for something new. I can't claim that my motivation is a good one: I'm starting to feel that Java is déclassé, occupying a “mainstream niche” in which it provides database access to web applications. Yes, there are some interesting things happening with Big Data and Java, but most of the interesting projects that I've seen look to alternative languages even if they keep the JVM.

However, that raises the question “What is compelling about other languages?” Is there anything in the non-Java space that has a story similar to the one I saw from Java in 1999? There has been a lot of language development in the last decade or so, with new languages appearing and older languages gaining in popularity. Most of these languages have features that Java doesn't. Do any of those features present a compelling case to leave Java behind?

Before continuing, I want to state that I do not consider one language to be more “powerful” than another. That term gets a lot of use and abuse, but it's meaningless — at least if both languages are Turing-complete. Instead, I look at a language based on how expressive it is for a particular purpose. For example, APL is an extremely expressive language for an extremely narrow purpose.

My purpose is best described as “applications that transform quantum data, possibly in a distributed fashion.” This is almost a synonym for “general purpose” computing, but it highlights a few things that I find important:

  • Applications have multiple parts, and will run for an extended period of time. This implies that I want easy modularization, and that performance is important.
  • Quantum data is a fancy way of saying that I want to work with individual pieces of information, on both the input and output side. So statistical analysis features are not terribly important to me, but data manipulation features are. As a side note, I don't accept the LISPer's view that data-is-code-is-data.
  • Transformation has similar implications.
  • My view of “distributed computing” has more to do with how work is partitioned, rather than where those pieces run. Features that support partitioning are important, particularly partitioning of independent concurrent tasks. Once you partition your processing, you can move those pieces anywhere.

So, given this mindset, what are some features that Java doesn't have and that I find interesting?

Interpreted versus Compiled
One of the big things that drew me to Java was its elimination of the traditional build cycle: I didn't have to take a nap while building the entire project. This point was driven home to me last year, talking with a neighbor who is doing iOS development. He was touting how fast the latest high-end Macs could build his projects; I pointed out how the IDE on my nine-year-old desktop PC made that irrelevant.

Build cycles haven't completely gone away with Java: deploying a large Spring web-app, for example, takes a noticeable amount of time. But deployment takes time, even with an “interpreted” framework like Rails. For that matter, client-side JavaScript requires a browser refresh. But saving a few seconds isn't really important.

The REPL (read-eval-print loop), on the other hand, is a big win for interpreted languages. I came of age as a programmer when interactive debugging was ascendant, and I like having my entire environment available to examine and modify.

Duck Typing
Another of the benefits of so-called “scripting” languages is that objects are malleable; they are not rigidly defined by class. If you invoke a method on an object, and that object supports the method, great. If not, what happens depends on the language; in many, an “unimplemented method handler” is called, allowing functionality to be added on the fly. I like that last feature; it was one of the things that drew me to Rails (although I didn't stay there). And I think that function dispatch in a duck-typed language is closer to the “objects respond to messages” ideal of OOP purists.

On a practical basis, I find myself wishing for duck-typing every time I have to repeat a long parameterized type specification in Java. Especially when I have to repeat that specification both to declare and assign a variable. I've often wished for something like typedef, to reduce typing and provide names for generic structures.
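Here's the sort of repetition I mean (the nested type is arbitrary):

Map<String,List<Map<String,Object>>> lookup = new HashMap<String,List<Map<String,Object>>>();

Java 7's diamond operator trims the right side to new HashMap<>(), but the declaration is still a mouthful, and there's no way to give that shape a name of its own.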

But … there's a certain level of comfort in knowing that my compiler will catch my misspellings; I make a lot of them. And at the extreme, the flexibility of this model is no different than the idea of a single method named doSomething that takes a map and returns a map.

Lambdas, Closures, Higher-order Functions
At some point in the last dozen years, function pointers became mainstream. I use that term intentionally: the idea of a function as something that can be passed around by your program is fundamentally based on the ability to hold a pointer to that function. It's a technique that every C programmer knows, but few use (and if you've ever seen the syntax of a non-trivial function pointer, you know why). It's a feature that is present in Java, albeit with the boilerplate of a class definition and the need to rigidly (and repeatedly) define your function's signature. I think that, as a practical thing, duck typing makes lambdas easier: a function is just another malleable object.
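Here's what that Java boilerplate looks like in practice, using a sort as the example (Employee is a hypothetical bean):

// an anonymous class wraps one line of useful code in several lines of
// ceremony, with the signature spelled out in full
Collections.sort(employees, new Comparator<Employee>()
{
    public int compare(Employee e1, Employee e2)
    {
        return e1.getLastName().compareTo(e2.getLastName());
    }
});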

That boilerplate is a shame, because the idea of a lambda is extremely useful. For example, reading rows from a JDBC query requires about a dozen lines of boilerplate code, which hides the code that actually processes the data. Ditto for file processing, or list processing in general.
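For reference, here's roughly what that JDBC boilerplate looks like; everything except the marked line is ceremony that a helper accepting a lambda could absorb (url and processRow() are hypothetical):

Connection cxt = DriverManager.getConnection(url);
try
{
    PreparedStatement stmt = cxt.prepareStatement("select * from employees");
    try
    {
        ResultSet rs = stmt.executeQuery();
        while (rs.next())
        {
            processRow(rs);     // the code that actually matters
        }
    }
    finally
    {
        stmt.close();           // also closes the ResultSet
    }
}
finally
{
    cxt.close();
}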

While I like lambdas, closures quite frankly scare me. Especially if used with threads. But that's a topic for another post.

Lazy Evaluation
I don't believe that “functional” languages are solely about writing your code as a series of higher-order functions. To me, a critical part of the functional approach is the idea that functions will lazily evaluate their arguments. The example du jour calculates the squares of a limited set of integers. It appears to take the entire set of integers as an argument, but lazy evaluation means that those values are only created when needed.

As people have pointed out, lazy evaluation can be implemented using an iterator in Java. Something to think about is that it also resembles a process feeding a message queue.
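Here's a sketch of that iterator-based approach: a conceptually infinite sequence of squares, computed only as the caller asks for them:

Iterator<Integer> squares = new Iterator<Integer>()
{
    private int next = 1;

    public boolean hasNext() { return true; }       // never runs out

    public Integer next() { int n = next++; return n * n; }

    public void remove() { throw new UnsupportedOperationException(); }
};

// only ten squares are ever computed, no matter how large the "set"
for (int i = 0 ; i < 10 ; i++)
    System.out.println(squares.next());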

Structured Data Abstraction
One of the things I like about Groovy is that everything looks like an object, even XML. Combined with the null-safe navigation operator, you can replace a dozen lines of code (or an XPath) with a very natural foo?.bar?.baz. If your primary goal is data manipulation, features like this are extremely valuable. Unfortunately, the list of languages that support them is very short.

A different concurrency paradigm
Concurrent programming isn't easy. Programming itself isn't easy: for any non-trivial application you have to maintain a mental model of how that application's internal state changes over time, and which parts get to change what state. A concurrent application raises the bar by introducing non-determinism: two (or more) free-running threads of execution that update the same state at what seem to be arbitrary times. Java provided a model for synchronizing these threads of execution, at least within a single process, but Java synchronization brings with it the specter of contention: threads sitting idle waiting for a mutex.

Concurrency has been a hidden issue for years: even a simple web-app may have concurrent requests for the same user, accessing the same session data. With the rise of multi-core processors, the issue will become more visible, especially as programmers try to exploit performance gains via parallel operations. While Java's synchronization tools work well for “accidentally concurrent” programs, I don't believe they're sufficient for full-on concurrent applications.

New languages have the opportunity to create new approaches to managing concurrency, and two of the leading approaches are transactional memory and actors. Of the two, I far prefer the latter.

Those are the features that interest me; other people will have different ones. But given those desired features, along with my overall goals for what I'm doing with computers, I've surveyed the landscape of languages and thought about which could replace Java for me. That's the topic of my next post.

Wednesday, July 17, 2013

Things that Java Got Right

Java turns 18 this year. I've been using it for 14 of those years, and as I've said before, I like it, both as a language and as a platform. Recently, I've been thinking more deeply about why I made the switch from C++ and haven't looked back.

One well-known if biased commentator said that Java's popularity was due to “the most intense marketing campaign ever mounted for a programming language,” but I don't think that's the whole story. I think there were some compelling reasons to use Java in 1995 (or, for me, 1999). And I'm wondering whether these reasons remain compelling in 2013.

Before going further, I should acknowledge that this post mixes Java the language with Java the platform. For the purposes of this post, I don't think they're different: Java-the-platform was created to support Java-the-language, and Java programs reflect this fact. Superficially, Java-the-language is quite similar to C or C++. Idiomatically, it is very different, and that is due to the platform.

Dynamic Dispatch
Both Java and C++ support polymorphism via dynamic dispatch: your code is written in terms of a base class, while the actual object instances are of different descendant classes. The language provides the mechanism by which the correct class's method is invoked, depending on the actual instance.

However, this is not the default mechanism for C++; static dispatch is. The compiler creates a mangled name for each method, combining class name, method name, and parameters, and writes a hard reference to that name into the generated code. If you want dynamic dispatch, you must explicitly mark a method as virtual. For those methods alone, the compiler invokes the method via a reference stored in the object's “V-table.”

This made sense given the performance goals of C++: a V-table reference adds (a tiny amount of) memory to each object instance, and the indirect method lookup adds (a tiny amount of) time to each invocation. As a result, while C++ supported polymorphic objects, it was a rare C++ application that actually relied on polymorphism (at least in my experience in the 1990s). It wasn't until I saw dozens of Java classes created by a parser generator that I truly understood the benefits of class-based polymorphism.
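For anyone who hasn't seen it spelled out, here's the Java behavior in miniature: no virtual keyword, and the caller is written entirely against the base class:

abstract class Node
{
    public abstract String render();
}

class TextNode extends Node
{
    public String render() { return "some text"; }
}

// dynamic dispatch: the variable's type is Node, but TextNode.render() runs
Node node = new TextNode();
System.out.println(node.render());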

Today, the prevalence of “interpreted” languages means that dynamic dispatch is the norm — and in a far more flexible form than Java provides.

Late Binding
At the time Java appeared, most applications were statically linked: the compiler produced an object file filled with symbolic references, then the linker replaced these symbolic references with physical addresses when it produced the executable. Shared libraries existed, and were used heavily in the Windows world (qv “DLL hell”), but their primary purpose was to conserve precious RAM (because they could be shared between processes).

In addition to saving RAM, shared libraries had another benefit: since you linked them into the program at runtime, you could change your program's behavior simply by changing the libraries that you used. For example, if you had to support multiple databases you could write a configuration file that selected a specific data access library. At the time, I was working on an application that had to do just that, but we found it far easier to just build separate executables and statically link the libraries. Talking with other people, I found that this was a common opinion. The Apache web server was an exception, although in its case the goal was again to save RAM by not linking modules that you didn't want; you also had an option to rebuild with statically-linked libraries.

Java, by comparison, always loads classes on an as-needed basis when the program runs. Changing the libraries that you use is simply a matter of changing your classpath. If you want, you can change a single classfile.

This has two effects: first, your build times are dramatically reduced. If you change one file, you can recompile just that file. When working on large C and C++ codebases, I've spent a lot of time optimizing builds, trying to minimize time spent staring at a scrolling build output.

The second effect is that the entire meaning of a “release” changes. With my last C++ product, bug fixes got rolled into the six-month release cycle; unless you were a large customer who paid a lot in support, you had to wait. With my first Java project, we would ship bugfixes to the affected customers as soon as the bug was fixed. There was no effort expended to “cut a release,” it was simply a matter of emailing classfiles.

Classloaders
Late binding was present in multiple other languages at the time Java appeared, but to the best of my knowledge, the concept of a classloader — a mechanism for isolating different applications within a single process — was unique to Java. It represented late binding on steroids: with a little classloading magic, and some discipline in how you wrote your applications, you could update your applications while they were running. Or, as in the case of applets, applications could be loaded, run, and then be discarded and their resources reclaimed.
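A minimal sketch of that classloading magic, using the JDK's URLClassLoader (the jar path and class name are hypothetical):

// load a class by name, at runtime, from a location chosen at runtime;
// discard the loader, and its classes become eligible for collection
URL[] path = new URL[] { new URL("file:///opt/app/plugins/plugin.jar") };
URLClassLoader loader = new URLClassLoader(path);
Class<?> klass = loader.loadClass("com.example.Plugin");
Object plugin = klass.newInstance();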

Classloaders do more than simply isolate an application: the classloader hierarchy controls the interaction of separately-loaded applications (or, at least, groups of classes). This is a hard problem, and Java classloaders are an imperfect solution. OSGi tries to do a better job, but adds complexity to an application's deployment.

Threads
Threads are another thing that predated Java but didn't see a lot of use. For one thing, “threads” gives the impression of a uniform interface, which wasn't the case in the mid-1990s. I've worked with Posix threads, Solaris threads, and one other whose name I can't remember. They all had the same basic idea, but subtly different APIs and implementations. And even if you knew the APIs, you'd find that vendor-supplied code wasn't threadsafe. Faced with these obstacles, most Unix-centric programmers turned to the tried-and-true multi-process model.

But that decision led to design compromises. I think the InConcert server was typical in that it relied on the database to manage consistency between multiple processes. We could have gotten a performance boost by creating a cache in shared memory — but we'd have to implement our own allocator and coherence mechanism to make that work. One of the things that drove me to prototype a replacement in Java was its ability to use threads and a simple front-end cache.

I could contemplate this implementation because Java provided threads as a core part of the language, along with synchronization primitives for coordinating them. My cache was a simple synchronized Map: retrieval operations could probe the map without worrying that another thread was updating it. Updates would clear the cache on their way out, meaning that the next read would reload it.
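That cache needed only a few lines; this sketch captures the idea (Document and loadFromDatabase() are hypothetical):

Map<String,Document> cache =
    Collections.synchronizedMap(new HashMap<String,Document>());

// readers probe the cache, loading on a miss
Document doc = cache.get(key);
if (doc == null)
{
    doc = loadFromDatabase(key);
    cache.put(key, doc);
}

// writers invalidate wholesale after committing their change
cache.clear();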

That said, today I've come to the belief that most code should be thread-agnostic, written as if it were the only thing running, and not sharing resources with other threads. Concurrent programming is hard, even when the language provides primitives for coordination, and in a massively-parallel world contention is the enemy. A shared-nothing mentality — at least for mutable state — and a queue-based communication model makes for far simpler programs and (I believe) higher overall performance.
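A sketch of that model, using the java.util.concurrent primitives that arrived in Java 5 (Request, payload, and handle() are hypothetical):

BlockingQueue<Request> inbox = new LinkedBlockingQueue<Request>();

// producer thread: hand off an immutable message and move on
inbox.put(new Request(payload));

// consumer thread: owns its state outright, so no locks are needed
// (InterruptedException handling omitted for brevity)
while (true)
{
    Request req = inbox.take();     // blocks until a message arrives
    handle(req);
}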

Library Support
All of the above were neat, but I think the truly compelling reason to use Java in the 1990s was that it came with a huge standard library. You wanted basic data structures? They were in there. Database access? In there. A way to make HTTP requests? In there. Object-oriented client-server communication? Yep. A GUI? Ugly, but there.

This was a time when the STL wasn't yet ubiquitous; indeed, some C++ compiler vendors were just figuring out how to implement templates (causing no end of problems for cross-platform applications that tried to use them). The RogueWave library was popular, but you had to work for a company willing to buy it. I think every C++ programmer of the 1990s had his or her own string class, with varying levels of functionality and correctness. It was a rite of passage, the first thing you did once you figured out how classes worked.

Java's large — and ever-growing — library has been both a blessing and a curse. It's nice to have one standard way to do most tasks, from parsing XML to computing a cryptographic hash. On the other hand, in JDK 1.6 there are 17,484 classes in 754 packages, many imported whole from third-party libraries. This is bloat for those who don't need the features. Worse, it creates friction and delay for updates: if you find a bug in the XML processor, should you file it with Oracle or Apache? And will it ever get fixed?

Those were the things that I found compelling about Java. Other people had different lists, but I think that everyone who adopted Java in the late 1990s and early 2000s did so for practical reasons. It wasn't simply Sun's marketing efforts; we actually believed that Java offered benefits that weren't available elsewhere.

The 2000s have been a time of change on the language scene: new languages have appeared, and some older languages have become more popular. What would make a compelling case for leaving Java behind?

Monday, July 15, 2013

Signs that Java is Fading

My website is losing traffic. It never had terribly high traffic, other than the days it was mentioned on Reddit or Hacker News, but in the last year that traffic has been cut in half. There are no doubt many reasons for this, but I'm going to take a dramatic interpretation: it's another indication that the time of Java is passing.

Bold words for someone who only gets a few hundred hits a day, but the character of those hits is changing. The two most-hit pages on my site have consistently been those on bytebuffers and reference objects: topics that are of interest to people doing “interesting” things, and not something that you'd use in a typical corporate web-app. Recently, these two have been eclipsed by articles on parsing XML and debugging out-of-memory errors.

This sense that Java isn't being used for “interesting” projects comes from other sources as well. Colleagues and friends are looking into other languages, some on the JVM and some not. One of the best recruiters I know, a long-time specialist in Java (and founder of the Philadelphia Java Users Group) is now emphasizing other languages on his home page.

Of course, Java isn't going to disappear any time soon. There are billions of dollars of sunk costs in Java software, and it needs to be maintained. And with hundreds of thousands of Java programmers in the job market, employers know that they can always staff their projects, old and new. Corporate inertia is all but impossible to overcome, as evidenced by the fact that you still see plenty of COBOL positions open.

Java is occasionally called “the COBOL of the 21st century,” and I've never found the comparison apt. But the position of Java today seems to be similar to that of COBOL in the 1980s: a huge installed base, huge demand for developers, but slotted into a specific application area. If you wanted to venture out of the accounting realm, you looked for something else.

Yesterday was my birthday. This upcoming January will mark 30 years that I've received a regular paycheck for making computers do what other people want. In that time, I've switched areas of expertise several times. I figure that I'm going to be in this business for another 15 to 20 years, and I want to be doing something interesting for those years. So it's time to re-evaluate Java, and think about what my next area of expertise should be.

Tuesday, June 18, 2013

Perils of Reflection: Public Methods on Private Classes

Here's a puzzle: what does the following code print?

Iterator itx = new ArrayList().iterator();
Method method = itx.getClass().getMethod("hasNext");
System.out.println(method.invoke(itx));

You might expect it to print false: you're invoking the hasNext() method on an iterator over an empty list. Instead, you get a rather confusing exception:

Exception in thread "main" java.lang.IllegalAccessException: Class com.example.ReflectionFailure can not access a member of class java.util.AbstractList$Itr with modifiers "public"
    at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:65)
    at java.lang.reflect.Method.invoke(Method.java:588)
    at com.example.ReflectionFailure.main(ReflectionFailure.java:16)

Huh? IllegalAccessException indicates that the program doesn't have access to the method. But hasNext() is a public method; how can I not have access to it? The following code compiles and runs without a problem; aren't I doing the same thing with reflection?

Iterator itx = new ArrayList().iterator();
System.out.println(itx.hasNext());

Well, not exactly. If you look at the bytecode of the explicit example, you'll see that you're actually invoking Iterator.hasNext(), not the method defined by the implementation class:

15:  invokeinterface #32,  1; //InterfaceMethod java/util/Iterator.hasNext:()Z

And if you change the reflective code so that you get hasNext() from the Iterator class, it runs without a problem:

Iterator itx = new ArrayList().iterator();
Method method = Iterator.class.getMethod("hasNext");
System.out.println(method.invoke(itx));

But the first piece of code isn't doing that. Instead, it's trying to do something like the following:

AbstractList.Itr itx = (AbstractList.Itr)new ArrayList().iterator();

That code won't compile, because AbstractList.Itr is a private class; the compiler rightly complains that it isn't visible.

I'm not the first person to stumble over this. There are a number of bugs entered about this problem, although they seem to focus on the use of an inner class. The problem still manifests even if you have a top-level package-private class in another package. There's also a posting on the Java Specialists blog that covers the problem and its workaround:

Iterator itx = new ArrayList().iterator();
Method method = itx.getClass().getMethod("hasNext");
method.setAccessible(true);
System.out.println(method.invoke(itx));

Method.setAccessible() tells the JVM to allow the calling program to bypass some of the normal security measures; its brother, Field.setAccessible(), is a mainstay of injection frameworks and bean introspectors. There's only one problem: if you're running in a sandbox you aren't allowed to tell the JVM to ignore its safety features; you'll get a SecurityException instead.

The only always-usable solution is to figure out where the method is actually defined, and use the Method corresponding to that definition. This is not a trivial task: you have to walk the class hierarchy, looking at all of the concrete or abstract (but still accessible) classes in the hierarchy, as well as all of the interfaces that they implement.
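Here's a sketch of that walk; it covers the common cases but ignores superinterfaces, which a production version would also have to chase:

public static Method findInvocableMethod(Class<?> klass, String name, Class<?>... args)
throws NoSuchMethodException
{
    for (Class<?> cur = klass ; cur != null ; cur = cur.getSuperclass())
    {
        // a method is only invocable if it's declared by a public class ...
        try
        {
            Method m = cur.getMethod(name, args);
            if (Modifier.isPublic(m.getDeclaringClass().getModifiers()))
                return m;
        }
        catch (NoSuchMethodException ignored) { /* keep looking */ }

        // ... or by a public interface
        for (Class<?> ifc : cur.getInterfaces())
        {
            if (!Modifier.isPublic(ifc.getModifiers()))
                continue;
            try
            {
                return ifc.getMethod(name, args);
            }
            catch (NoSuchMethodException ignored) { /* keep looking */ }
        }
    }
    throw new NoSuchMethodException(name);
}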

So is this a bug in the JVM? If you couldn't use reflection to accomplish something that you could in explicit code, I'd say yes. In this case, however, you aren't mimicking actual code, you just think you are. So, while inconvenient, in my opinion not a bug.


Update: I've been researching how to determine whether or not a Method can be invoked without an access error. I assumed that isAccessible() would work: if it returned true, I was good to go. I quickly learned that it always returns false, unless you've specifically called setAccessible(true). Not, in my opinion, terribly useful.

I've combed through the documentation several times, but can't find any method that will tell me if I can invoke a Method. I suppose this is reasonable: accessibility depends on the object doing the invoking. This means that any “find the accessible method” utility would have to take the invoking class as an argument, and there would be no guarantee that the returned method could be invoked by an instance of another class. Not worth pursuing.

Monday, October 8, 2012

Scriptlets Are Not (Inherently) Evil

JSP scriptlets have a bad reputation; everyone can point to their favorite example of scriptlet abuse. In my case, it was a 12,000 line (yes, three zeros) monstrosity that held a “mere” 1,000 lines or so of markup. The rest was a big if-else construct with plenty of embedded logic. But that example, and ones like it, are examples of bad programming, not indictments of scriptlets as a programming medium. They're no different than Java classes that have a single method with thousands of lines.

Monstrosities aside, I don't think there's a valid argument against scriptlets. Actually, when I Googled for “why are scriptlets bad,” most of the top-ranked pages defended them. As far as I can tell, the main arguments are that scriptlets encourage bad programmers to put business logic in the page, that they make the page untestable, that they limit reuse, and that web developers won't understand pages with embedded Java code. All of which seem to me like red, rotted herrings.

Don't get me wrong, I believe in the separation of concerns that underlies the MVC model. Actually, I believe in the separation of concerns found in what I call the “CSV” model: a lightweight Controller that interacts with business logic via a Service layer, and passes any returned data to a View for rendering. But after working with several alternative technologies and languages, I'm convinced that scriptlets are the best way to implement any programmatic constructs involved in view rendering.

And some amount of programmatic rendering resides in almost every view. One common example is populating a <select> element. On the surface this is an easy task: iterate over a list of values, and emit <option> elements for each. In the real world, it's more complex: the option value will come from a different field than the option text, you probably have a (possibly null) value that should be selected, and maybe you'll decorate different options with different classes or IDs. To handle this, you need a language that, if not Turing-complete, is very close.

I'm going to work through just such an example, comparing scriptlets and JSTL. Both examples use the same Spring controller, which stores two values in the request context: a list of employees in allEmployees, and the desired selection in selectedEmployee (which may be null). If you want to walk through the code, it's available here, and builds with Maven.

First up is the JSTL version. Nothing complex here: a forEach over the list of employees, and some EL to extract the bean values. Perhaps the worst thing about this code is that the HTML it generates looks ugly: line breaks carry through from the source JSP, so the <option> element gets split over three lines. You can always tell a JSP-generated page by the amount of incongruous whitespace it contains (if you're security-conscious, that might be an issue).

<select name="employeeSelector">
<c:forEach var="emp" items="${allEmployees}">
    <option value="${emp.id}"
        <c:if test="${!(empty selectedEmployee) and (selectedEmployee eq emp)}"> selected </c:if>
    >${emp}</option>
</c:forEach>
</select>

Now for the scriptlet version. It's a few lines longer than the JSTL version, partly because I have to retrieve the two variables from the request context (something EL does for me automatically). I also chose to create a temporary variable to hold the “selected” flag; I could have used an inline “if” that matched the JSTL version, or I could have changed the JSTL version to use a temporary variable; I find that I have different programming styles depending on language.

<select name="employeeSelector">
    <%
    List<Employee> allEmployees = (List<Employee>)request.getAttribute(Constants.PARAM_ALL_EMPLOYEES);
    Employee selectedEmployee  = (Employee)request.getAttribute(Constants.PARAM_SELECTED_EMPLOYEE);
    for (Employee employee : allEmployees)
    {
        String selected = ObjectUtils.equals(employee, selectedEmployee) ? "selected" : "";
        %>
        <option value="<%=employee.getId()%>" <%=selected%>> <%=employee%> </option>
        <%
    }
    %>
</select>

Even though the scriptlet version has more lines of code, it seems cleaner to me: the code and markup are clearly delineated. Perhaps there's a web developer who will be confused by a Java “for” statement in the middle of the page, but is the JSTL “forEach” any better? Assuming that your company separates the roles of web developer and Java developer, it seems like a better approach to say “anything with angle brackets is yours, everything else is mine.”

Putting subjective measures aside, I think there are a few objective reasons that the scriptlet is better than JSTL. The first of these is how I retrieve data from the request context: I use constants to identify the objects, the same constants that I use in the controller to store them. While I have no inherent opposition to duck typing, there's a certain comfort level from knowing that my page won't compile if I misspell a parameter. Sometimes those misspellings are obvious, but if you happen to write your JSTL with “selectedEmploye” it will take rigorous testing (either manual or automated) to find the problem.

Another benefit to using scriptlets is that you can transparently call out to Java. Here, for example, I use Jakarta Commons to do a null-safe equality check.

A better example would be formatting: here I rely on Employee.toString() to produce something reasonable (and if you look at the example you may question my choice). But let's say the users want to see a different format, like “Last, First”; or they want different formats in different places. In JSTL, you'll be required to manually extract and format the fields, and that code will be copy/pasted everywhere that you want to display the employee name. Then you'll have to track down all of that copy/paste code when the users inevitably change their mind about what they want to see. Or you could add different formatting functions to the Employee object itself, breaking the separation of model and view. With a scriptlet, you can call a method like EmployeeFormattingUtils.selectionName().

OK, before I get comments, I should note that there are easier ways to implement the JSTL version, using 3rd-party tag libraries. The Spring Form Tags, for example, would reduce my code to a single line, assuming that my controller followed the Spring “command” pattern. But those libraries limit you to common cases, restricting your ability to customize your HTML. And although you can write your own taglibs, I haven't seen many projects that do that (in fact, the only ones I've seen are ones where I did it).

I want to finish with a punchline: everything I've just written is moot, because the web-app world is transitioning from server-side view rendering to client-side, using technologies such as Backbone and Spine. That change will come with its own perils, and history tends to repeat itself. I worry that five years from now we'll be dealing with multi-thousand-line JavaScript monstrosities, mixing business logic with view rendering.