Monday, March 30, 2020

S3 Troubleshooting: when 403 is really 404

It's generally a bad idea to click on links on strange websites, but this one is key to this post. If you were to click this link, you'd see a response like the following and, more importantly, the HTTP response code would be 403 (Forbidden).

<Error>
  <Code>AccessDenied</Code>
  <Message>Access Denied</Message>
  <RequestId>CC60CD5C69BC271F</RequestId>
  <HostId>gBmQHQNr/7CFHLCbiYjqzm3iT2m5WQCobTfxFiXIj9K0448YrtvJKbOimEXHwAxwILw0oDcS9TI=</HostId>
</Error>

However, that bucket is open to the world: AWS reminds me of this by flagging it as “Public” in bucket listings, and putting a little orange icon on the “Permissions” tab when I look at its properties. And if you click this similar link you'll get a perfectly normal test file.

The difference between the two links, of course, is that the former file doesn't exist in my bucket. But why isn't S3 giving me a 404 (Not Found) error? When I look at the list of S3 error responses I see that there's a NoSuchKey response — indeed, it's the example at the head of that page.

As it turns out, the reason for the 403 is called out in the S3 GetObject API documentation:

  • If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error.
  • If you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error.
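As a rough mental model of that rule, here is a toy sketch. It is not an AWS API: the function name and parameters are hypothetical, and it assumes the caller already has s3:GetObject on the object.

```python
def status_for_get(object_exists: bool, can_list_bucket: bool) -> int:
    """Toy model of the HTTP status for GetObject, per the rule quoted above.

    Assumes the caller has s3:GetObject; real requests involve more factors.
    """
    if object_exists:
        return 200
    # Missing key: S3 only admits the key is absent to callers
    # that hold s3:ListBucket on the bucket.
    return 404 if can_list_bucket else 403
```

With only s3:GetObject, a missing key and a forbidden key are indistinguishable from the caller's side.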

While my bucket allows public read access to its objects via an access policy, that policy follows the principle of least privilege and only grants s3:GetObject, as do almost all of the IAM policies that I write for things like Lambdas.
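A minimal sketch of what such a policy might look like, built as a Python dictionary and serialized to JSON. The bucket name example-bucket and the Sid are placeholders, not the actual values from my bucket.

```python
import json

# Hypothetical public-read bucket policy: grants s3:GetObject on objects only.
# "example-bucket" is a placeholder name.
public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            # Object-level resource; s3:ListBucket would instead apply to
            # the bucket ARN itself and is deliberately not granted here.
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

policy_json = json.dumps(public_read_policy, indent=2)
```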

Which brings me to the reason for writing this post. Friday afternoon I was puzzling over an error with one of my Lambdas: I was creating dummy files to load-test an S3-triggered Lambda, and it was consistently failing with a 403 error. The files were in the source bucket, and the Lambda had s3:GetObject permission.

I had to literally sleep on the error before realizing what happened. I was generating filenames using the Linux date command, which would produce output like 2020-03-27T15:43:31-04:00. However, S3 notifications URL-encode the object key, so the key in my events looked like this: 2020-03-27T15%3A43%3A31-04%3A00-c. When passed back to S3 unchanged, that encoded key didn't refer to any object in the bucket. But because my Lambda didn't have s3:ListBucket, I was getting a 403 rather than a 404.
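The mismatch is easy to reproduce locally. In this small sketch, quote_plus() from the standard library stands in for the encoding that S3 applies to keys in notification events; the variable names are only for illustration.

```python
from urllib.parse import quote_plus, unquote_plus

# A key like the ones generated with the date command.
original_key = "2020-03-27T15:43:31-04:00"

# Stand-in for how the key appears in the notification: ':' becomes '%3A'.
encoded_key = quote_plus(original_key)

# Decoding recovers the key that actually exists in the bucket.
decoded_key = unquote_plus(encoded_key)

print(encoded_key)
print(decoded_key == original_key)
```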

So, to summarize the lessons from my experience:

  1. Always URL-decode the keys in an S3 notification event. How you do this depends on your language; for Python use the unquote_plus() function from the urllib.parse module.
  2. If you see a 403 error, check your permissions first, but also look to see if the file you're trying to retrieve actually exists.
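For the first lesson, a minimal sketch of decoding keys at the top of a Python handler might look like this. The helper name extract_objects and the sample event are hypothetical (the event is trimmed to the fields used here), and the bucket name is a placeholder.

```python
from urllib.parse import unquote_plus


def extract_objects(event):
    """Return (bucket, decoded_key) pairs from an S3 notification event."""
    return [
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
    ]


# Trimmed-down sample event; real notifications carry many more fields.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "2020-03-27T15%3A43%3A31-04%3A00"},
            }
        }
    ]
}

for bucket, key in extract_objects(sample_event):
    print(bucket, key)
```

Decoding once, where the event enters the code, means every later S3 call sees the real key.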

You'll note that I didn't say “grant s3:ListBucket in your IAM policies.” The principle of least privilege still applies.

Monday, January 6, 2020

The Future of Open Source

The world of open source software seems to be going through a period of soul-searching. On the one hand, individual maintainers have retracted packages, causing disruption for the communities that depended on those packages. On the other, software-as-a-service providers are making more money from some applications than their creators.

This is all happening in a world where businesses depend on open-source to operate. It doesn't matter whether you're an individual launching a startup with PHP and MySQL, or a multi-national replacing your mainframe with a fleet of Linux boxes running Java. Your business depends on the work of people who have their own motivations, and those motivations may not align with yours. I think this is an untenable situation, one that will eventually resolve by changing the nature of open-source.

Before looking at how I think it will resolve, I want to give some historical perspective. This is one person's view; you may not agree with it.

I date the beginning of “professional” open source as March 1985: that was the month that Dr. Dobb's Journal published an article by Richard Stallman, an article that would turn into the GNU Manifesto. There was plenty of freely available software published prior to that time; my experience was with the Digital Equipment Corporation User Society (DECUS), which published an annual catalog of programs ranging in complexity from fast Fourier transform routines to complete language implementations. These came with source code and no copyright attached (or, at least, no registered copyright, which was an important distinction in the 1970s and early 1980s).

What was different about the GNU Manifesto, and why I refer to it as the start of “professional” open source, was that Stallman set out a vision of how programmers could make money when they gave away their software. In his view, companies would get software for free but then hire programmers to maintain and enhance it.

In 1989, Stallman backed up the ideas of the GNU Manifesto with the GNU General Public License (GPL), which was applied to the software produced by the GNU project. This license introduced the idea of “copyleft”: a requirement that any “derivative works” also be licensed using the GPL, meaning that software developers could not restrict access to their code. That requirement was relaxed in 1991 with the “library” (now “lesser”) license, meaning that you could use the GNU C compiler to compile your programs without them becoming open source by that act. Even so, the GPL scared most corporations away from any use of the GNU tools: as late as 1999, I was met with a look of shock when I suggested that the GNU C compiler could make our multi-platform application easier to manage.

In my opinion, it was the Apache web server, introduced in 1995, that made open-source palatable (or at least acceptable) to the corporate world. In large part, this was due to the Apache license, which essentially said “do what you want, but don't blame us if anything goes wrong.” But also, I think it was because the corporate world was completely unprepared for the web. To give a sense of how quickly things moved: in 1989 I helped set up the DNS infrastructure for a major division of one of the world's largest corporations; I had only a few years of experience with TCP/IP networking, but it was more than the IT team. NCSA Mosaic appeared four years later, and within a year or two after that companies were scrambling to create a web presence. Much like the introduction of PCs ten years earlier, this happened outside of corporate IT; while there were commercial web-servers (including Microsoft and Netscape), “free as in beer” was a strong incentive.

Linux, of course, was a thing in the late 1990s, but in my experience wasn't used outside of a hobbyist community; corporations that wanted UNIX used a commercial distribution. In my view, Linux became popular due to two things. First, Eric Raymond published The Cathedral and the Bazaar in 1997, which made the case that open source was actually better than commercial systems: it has to be good to survive. Second, after the dot-com crash, “free as in beer” became a selling point, especially to the startups that would create “Web 2.0.”

Jumping forward 20 years, open-source software is firmly embedded in the corporate world. While I'm an oddity for running Linux on the desktop, all of the companies I've worked with in the last ten or more years used it for their production deployments. And not just Linux; the most popular database systems are open source, as are the tools to provision and manage servers, and even productivity tools such as LibreOffice. And for most of the users of these tools, “free as in beer” is an important consideration.

But stability is (or should be) another important consideration, and I think that many open-source consumers have been lulled into a false sense of stability. The large projects, such as GNU and Apache, have their own repositories and aren't going anywhere. And the early “public” repositories, such as SourceForge and Maven Central, adopted a policy that “once you publish something here, it will never go away.” But newer repositories don't have such a policy, and as we saw with left-pad in 2016 and chef-sugar in 2019, authors are willing and able to pull their work down.

At the same time, companies such as MongoDB and Elastic N.V. found that releasing their core products as open-source might not have been such a great idea. Software-as-a-service companies such as AWS are able to take those products and host them as a paid service, often making more money from the hosting than the original companies do from the services they offer around the product. And in response, the product companies have changed the license on their software, attempting to cut off that usage (or at least capture a share of it).

Looking at both behaviors, I can't help but think that one of the core tenets of the GNU Manifesto has been forgotten: that the developers of open-source software do not have the right to control its use. Indeed, the Manifesto is quite clear on this point: “[programmers] deserve to be punished if they restrict the use of these programs.”

You may or may not agree with that idea. I personally believe that creators have the right to decide how their work is used. But I also believe that releasing your work under an open-source license is a commitment, one that can't be retracted.

Regardless of any philosophical view on the matter, I think there are two practical outcomes.

The first is that companies — or development teams — that depend on open-source software need to ensure their continued access to that software. Nearly eight years ago I wrote about using a local repository server when working with Maven and Java. At the time I was focused on coordination between different development teams at the same company. If I were to rewrite the post today, it would focus on using the local server to ensure that you always have access to your dependencies.

A second, and less happy, change is that I think open-source creators will lose the ability to control their work. One way this will happen is for companies whose products are dependent on open-source to provide their own public repositories; indeed, I'm rather amazed that Chef doesn't offer such a repository (although perhaps they're gun-shy after the reaction to their ham-fisted attempt to redistribute chef-sugar).

The other way this will happen is for service-provider companies to fork open-source projects and maintain their own versions. Amazon has already done this, for Elasticsearch and also OpenJDK; I don't expect them to be the only company to do so. While these actions may damage the companies' reputations within a small community of open-source enthusiasts, the much larger community of their clients will applaud those actions. I can't imagine there are many development teams that will say “we're going to self-host Elasticsearch as an act of solidarity”; convenience will always win out.

If you're like me, a person responsible for a few niche open-source projects, this probably won't matter: nobody's going to care about your library (although note that both left-pad and chef-sugar at least started out as single-maintainer niche projects). But if you're a company that is planning to release your core product as open-source, you should think long and hard about why you want to do this, and whether your plan to make money is viable. And remember these words from the GNU Manifesto: “programming will not be as lucrative on the new basis as it is now.”