Saturday, December 14, 2024

Once more into the breach: Amazon EFS performance for software builds

This is the third time that I'm writing about this topic. The first time was in 2018, the second in 2021. In the interim, AWS has announced a steady stream of improvements, most recently (October) increasing read throughput to 60 MB/sec.

I wasn't planning to revisit this topic. However, I read Tim Bray's post on the Bonnie disk benchmark, and it had the comment “it’d be fun to run Bonnie on a sample of EC2 instance types with files on various EBS and EFS and so on configurations.” And after a few exchanges with him, I learned that the Bonnie++ benchmark measured file creation and deletion in addition to IO speed. So here I am.

EFS for Builds

Here's the test environment (my previous posts provide more information):

  • All tests run on an m5d.xlarge instance (4 vCPU, 16 GB RAM), running Amazon Linux 2023 (AMI ami-0453ec754f44f9a4a).
  • I created three users: one using the attached instance store, one using EBS (separate from the root filesystem), and one using EFS. Each user's home directory was on the filesystem in question, so all build-specific IO should be confined to that filesystem type, but they shared the root filesystem for executables and /tmp.
  • The local and EBS filesystems were formatted as ext4.
  • The EBS filesystem used a GP3 volume (so a baseline 3000 IOPS).
  • The EFS filesystem used Console defaults: general purpose, elastic throughput. I mounted it using the AWS recommended settings.
  • As a small project, my AWS appenders library, current (3.2.1) release.
  • As a large project, the AWS Java SDK (v1), tag 1.11.394 (the same that I used for previous posts).
  • The build command: mvn clean compile.
  • For each project/user, I did a pre-build to ensure that the local Maven repository was populated with all necessary dependencies.
  • Between builds I flushed and cleared the filesystem cache; see previous posts for details (there's also a setup sketch just after this list).
  • I used the time command to get timings; all are formatted minutes:seconds, rounded to the nearest second. “Real” time is the elapsed time of the build; if you're waiting for a build to complete, it's the most important number for you. “User” time is CPU time aggregated across threads; it should be independent of disk technology. And “System” time is that spent in the kernel; I consider it a proxy for how complex the IO implementation is (given that the absolute number of requests should be consistent between filesystems).
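
For concreteness, here's a sketch of the per-filesystem setup and a single timed build run. The device names, mount points, EFS DNS name, and user name are placeholders, and the cache flush shown is the usual sync-plus-drop_caches approach described in the earlier posts.

    # format and mount the instance store and the dedicated EBS volume as ext4
    sudo mkfs.ext4 /dev/nvme1n1        # instance store (device name varies by instance)
    sudo mkfs.ext4 /dev/nvme2n1        # dedicated EBS gp3 volume
    sudo mkdir -p /mnt/local /mnt/ebs /mnt/efs
    sudo mount /dev/nvme1n1 /mnt/local
    sudo mount /dev/nvme2n1 /mnt/ebs

    # mount EFS with the NFS options that AWS recommends (the EFS mount helper is an alternative)
    sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
        fs-XXXXXXXX.efs.us-east-1.amazonaws.com:/ /mnt/efs

    # one build user per filesystem, home directory on that filesystem (EFS user shown)
    sudo useradd --home-dir /mnt/efs/efsuser --create-home efsuser

    # between builds: flush dirty pages, then drop the page/dentry/inode caches
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

    # timed build, run as the user whose home directory is on the filesystem under test
    time mvn clean compile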

And here are the results:

                   Appenders                  AWS SDK
                   Real    User    System     Real    User    System
  Instance Store   00:06   00:16   00:01      01:19   02:12   00:09
  EBS              00:07   00:16   00:01      01:45   02:19   00:09
  EFS              00:18   00:20   00:01      15:59   02:24   00:17

These numbers are almost identical to the numbers from three years ago. EFS has not improved its performance when it comes to software build tasks.

What does Bonnie say?

As I mentioned above, one of the things that prompted me to revisit the topic was learning about Bonnie, specifically, Bonnie++, which performs file-level tests. I want to be clear that I'm not a disk benchmarking expert. If you are, and I've made a mistake in interpreting these results, please let me know.

I spun up a new EC2 instance to run these tests. Bonnie++ is distributed as a source tarball; you have to compile it yourself. Unfortunately, I was getting compiler errors (or maybe warnings) when building on Amazon Linux. Since I no longer have enough C++ knowledge to debug such things, I switched to Ubuntu 24.04 (ami-0e2c8caa4b6378d8c), which has Bonnie++ as a supported package. I kept the same instance type (m5d.xlarge).
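
For reference, getting Bonnie++ onto the Ubuntu instance is just a package install, with no compiling required; the rest of the instance setup is omitted here.

    # Bonnie++ is in the Ubuntu package archive
    sudo apt-get update
    sudo apt-get install -y bonnie++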

I ran with the following parameters:

  • -c 1, which uses a single thread. I also ran with -c 4 and -c 16 but the numbers were not significantly different.
  • -s 32768, to use 32 GB for the IO tests. This is twice the size of the VM's RAM, so the test should measure actual filesystem performance rather than the benefit of the buffer cache.
  • -n 16, to create/read/delete 16,384 small files in the second phase.

Here are the results, with the command-lines that invoked them:

  • Local Instance Store: time bonnie++ -d /mnt/local/ -c 1 -s 32768 -n 16
    Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    ip-172-30-1-84  32G  867k  99  128m  13  126m  11 1367k  99  238m  13  4303 121
    Latency              9330us   16707us   38347us    6074us    1302us     935us
    Version 2.00a       ------Sequential Create------ --------Random Create--------
    ip-172-30-1-84      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++     0  99 +++++ +++ +++++ +++
    Latency               146us     298us     998us    1857us      18us     811us
    1.98,2.00a,ip-172-30-1-84,1,1733699509,32G,,8192,5,867,99,130642,13,128610,11,1367,99,244132,13,4303,121,16,,,,,+++++,+++,+++++,+++,+++++,+++,4416,99,+++++,+++,+++++,+++,9330us,16707us,38347us,6074us,1302us,935us,146us,298us,998us,1857us,18us,811us
    
    real	11m10.129s
    user	0m11.579s
    sys	1m24.294s
         
  • EBS: time bonnie++ -d /mnt/ebs/ -c 1 -s 32768 -n 16
    Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    ip-172-30-1-84  32G 1131k  99  125m   8 65.4m   5 1387k  99  138m   7  3111  91
    Latency              7118us   62128us   80278us   12380us   16517us    6303us
    Version 2.00a       ------Sequential Create------ --------Random Create--------
    ip-172-30-1-84      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
    Latency               218us     303us     743us      69us      15us    1047us
    1.98,2.00a,ip-172-30-1-84,1,1733695252,32G,,8192,5,1131,99,128096,8,66973,5,1387,99,140828,7,3111,91,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,7118us,62128us,80278us,12380us,16517us,6303us,218us,303us,743us,69us,15us,1047us
    
    real	16m52.893s
    user	0m12.507s
    sys	1m4.045s
         
  • EFS: time bonnie++ -d /mnt/efs/ -c 1 -s 32768 -n 16
    Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    ip-172-30-1-84  32G  928k  98  397m  27 60.6m   6  730k  99 63.9m   4  1578  16
    Latency              8633us   14621us   50626us    1893ms   59327us   34059us
    Version 2.00a       ------Sequential Create------ --------Random Create--------
    ip-172-30-1-84      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                     16     0   0 +++++ +++     0   0     0   0     0   1     0   0
    Latency             22516us      18us     367ms   24473us    6247us    1992ms
    1.98,2.00a,ip-172-30-1-84,1,1733688528,32G,,8192,5,928,98,406639,27,62097,6,730,99,65441,4,1578,16,16,,,,,218,0,+++++,+++,285,0,217,0,944,1,280,0,8633us,14621us,50626us,1893ms,59327us,34059us,22516us,18us,367ms,24473us,6247us,1992ms
    
    real	23m56.715s
    user	0m11.690s
    sys	1m18.469s
         

For the first part, block IO against a single large file, I'm going to focus on the “Rewrite” statistic: the program reads a block from the already-created file, makes a change, and writes it back out. For this test, local instance store managed 126 MB/sec, EBS was 65.4 MB/sec, and EFS was 60.6 MB/sec. Nothing surprising there: EFS achieved its recently-announced throughput, and a locally-attached SSD was faster than EBS (although much slower than the 443 MB/sec from my five-year-old laptop, a reminder that EC2 provides fractional access to physical hardware).
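
To make that rewrite pattern concrete, here's a rough shell equivalent of a single iteration; the file name, offset, and 8 KB block size are purely illustrative (Bonnie++ does this in C, block by block, across the whole 32 GB file).

    # read one block at an offset, tweak it, and write it back in place
    dd if=/mnt/efs/Bonnie.12345 of=/tmp/block bs=8k skip=1000 count=1 status=none
    tr 'a' 'b' < /tmp/block > /tmp/block.modified
    dd if=/tmp/block.modified of=/mnt/efs/Bonnie.12345 bs=8k seek=1000 count=1 conv=notrunc status=none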

The second section was what I was really interested in, and unfortunately, the results don't give much insight. In some documentation I read that "+++++" in the output signifies that the results aren't statistically relevant (I can't find that link now). Perhaps that's because Bonnie++ dates to the days of single mechanical disks, and modern storage systems are simply too fast for it?

But one number that jumped out at me was the “Latency” for file creates: 146us for instance store, 218us for EBS, but a whopping 22516us for EFS. I couldn't find documentation for this value anywhere; reading the code, it appears to measure the longest time taken by a single operation. That means EFS could be completing 99% of requests in under 100ms with a few outliers, or its times could be generally high, with the number reported here merely the worst. I suspect it's the latter.

I think, however, that the output from the Linux time command tells the story: each of the runs uses 11-12 seconds of “user” time, and a minute plus of “system” time. But they vary from 11 minutes of “real” time for instance store, up to nearly 24 minutes for EFS. That says to me that EFS has much poorer performance, and since the block IO numbers are consistent, it must be accounted for by the file operations (timestamps on the operation logs would make this a certainty).
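
I didn't capture those timestamps, but one way to get them, if I were to rerun just the file-operation phase, would be to trace the file-related syscalls with their per-call durations. Something like the following sketch (the output path is made up, -s 0 skips the large-file IO tests, and strace adds its own overhead, so the absolute numbers would need a grain of salt):

    # log every file-related syscall along with the time spent in it (-T)
    strace -f -T -e trace=%file -o /tmp/bonnie-efs.strace \
        bonnie++ -d /mnt/efs/ -c 1 -s 0 -n 16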

Conclusion

So should you avoid EFS for your build systems? Mu.

When I first looked into EFS performance, in 2018, I was driven by my experience setting up a build server. But I haven't done that since then, and can't imagine that too many other people have either. Instead, the development teams that I work with typically use “Build as a Service” tools such as GitHub Actions (or, in some cases, Amazon CodeBuild). Running a self-hosted build server is, in my opinion, a waste of time and money for all but the most esoteric needs.

So where does that leave EFS?

I think that EFS is valuable for sharing files — especially large files — when you want or need filesystem semantics rather than the web-service semantics of S3. To put this into concrete terms: you can read a section of an object from S3, but it's much easier codewise to lseek or mmap a file (to be fair, I haven't looked at how well Mountpoint for Amazon S3 handles those operations). And if you need the ability to modify portions of a file, then EFS is the only real choice: to do that with S3 you'd have to rewrite the entire file.
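
To put some commands behind that comparison (bucket, key, and file names are made up): reading a 1 MB slice is possible with both, but the in-place update at the end only works on the filesystem side.

    # S3: ranged GET of bytes 1048576-2097151 from an object
    aws s3api get-object --bucket my-bucket --key data/big.bin \
        --range bytes=1048576-2097151 /tmp/slice.bin

    # EFS (or any POSIX filesystem): seek to the same offset and read 1 MB
    dd if=/mnt/efs/big.bin of=/tmp/slice.bin bs=1M skip=1 count=1 status=none

    # EFS: overwrite just that 1 MB in place; with S3 you'd have to re-upload the whole object
    dd if=/tmp/new-slice.bin of=/mnt/efs/big.bin bs=1M seek=1 count=1 conv=notrunc status=none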

For myself, I haven't found that many use cases where EFS is the clear winner over alternatives. And given that, and the fact that I don't plan to set up another self-hosted build server, this is the last posting that I plan to make on the topic.
