Saturday, December 8, 2018

Using CloudWatch Insights With JSON Log Messages

I've never been a fan of CloudWatch Logs. On the positive side, it's available by default, is relatively cheap, and there are a variety of ways to send data to it (including my own log4j-aws-appenders). But the features it supported were extremely limited: you could view log messages by date/time range or do a simple text search. As a result, I felt that “serious” log analysis required a tool like Kibana, fed by a Kinesis-based pipeline. Or, if your volumes were low enough, a service provider such as Loggly or DataDog.

That opinion changed with the introduction of CloudWatch Logs Insights. This feature adds a query tab to the CloudWatch Logs console, allowing you to filter and summarize messages. For example, this query will show you the number of messages that contain the word “ERROR” (from your selected time range, which is specified outside of the query):

filter @message like /ERROR/
| stats count()

To actually see those errors, you'd replace the stats directive with fields, along with a sort to put them in reverse chronological order:

filter @message like /ERROR/
| fields @@timestamp, @message
| sort @timestamp desc

That's nice, but it's still full-text search: if you're using Log4J it will pick up the messages that were logged at the ERROR level, but it will also pick up any other entries that happen to contain the word “ERROR” anywhere in the message.

But Insights isn't limited to simple text searches on the entire message: it will parse JSON messages and let you reference individual fields from those messages. And, as it happens, my appenders library provides a JSON layout class (originally intended for the Kinesis to Kibana pipeline linked above). To configure the CloudWatch appender with the JSON layout, you'd add these lines to your file (changing logGroup, logStream, and tags as appropriate):

As I said, you can use individual fields from the JSON message in your queries. So to see only the Log4J errors, you would use the following (note: this uses the parsed “message” field, which is different from the unparsed “@message” field):

filter level == "ERROR"
| fields @timestamp, logger, message
| sort @timestamp desc
| limit 25

You can also aggregate the data. For example, this query counts the number of errors for each 15 minute period within your time range:

filter level == "ERROR"
| stats count() by bin(15m)

There are, of course, some caveats to this feature. While the documentation doesn't go into detail, it appears to work by a brute-force scan all of your log messages (unlike Kibana, which uses a search engine to index the messages). This means that queries take longer to run, and that time increases with the amount of data that you're examining.

You'll also pay (as of the time of writing) $0.005 per gigabyte of data scanned. This is cheap if you are careful to restrict your time periods and/or don't have a lot of logging output. But I suspect it will add up quickly.

With those caveats, however, I think Insights is a great solution for organizations that don't have large logging volumes, don't want to ship their logs offsite, and don't want to pay for an Elasticsearch/Kibana cluster.

No comments: