AWS Cloudwatch Logs

In a world dominated by containers & microservices over servers & monoliths, it’s important to get your application logs sorted. Here’s a brief overview of iterations of application log stacks in production-level applications at Car Throttle.

Since we’re going to be discussing application logs, it’ll help to visualise one. Here’s an example HTTP request, response & application log from when I ran a GET request against a locally running Yoem instance to fetch embed data for a YouTube video:
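The original embedded example isn’t reproduced here, but a log line in that shape might look something like the following (a purely illustrative sketch — the field names loosely follow bunyan-style JSON logging and are assumptions, not the actual Yoem output):

```json
{
  "name": "yoem",
  "level": "info",
  "msg": "GET /embed",
  "req": { "method": "GET", "url": "/embed?url=https://www.youtube.com/watch?v=dQw4w9WgXcQ" },
  "res": { "statusCode": 200, "duration_ms": 45 },
  "time": "2017-01-01T12:00:00.000Z"
}
```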

Application to Fluentd

For the longest time at Car Throttle we’ve used a combination of different technologies to ship application logs. At first the main production stack used a combination of bunyan and monkey-patching its main functions to send log data to an always-running fluentd instance using fluent-logger.

From there the fluentd instance takes logs in via in_tcp and outputs to out_s3 & fluent-plugin-elasticsearch. S3 provides a perfect archive of log data, and Elasticsearch stores the last 30 days of data for (near) real-time analysis & trends.
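A td-agent.conf for this pattern might look roughly like the following sketch (bucket, host & tag names are illustrative assumptions, and buffer tuning is omitted):

```conf
# Receive JSON log events over TCP (from fluent-logger clients)
<source>
  @type tcp
  port 5170
  tag app.logs
  <parse>
    @type json
  </parse>
</source>

# Fan each event out to both S3 (archive) and Elasticsearch (last 30 days)
<match app.**>
  @type copy
  <store>
    @type s3
    s3_bucket my-log-archive    # hypothetical bucket name
    s3_region eu-west-1
    path logs/
  </store>
  <store>
    @type elasticsearch
    host elasticsearch.internal # hypothetical host
    port 9200
    logstash_format true        # daily logstash-YYYY.MM.DD indices, Kibana-friendly
  </store>
</match>
```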

This pattern worked well, to a degree. Logs found their way into the Kibana dashboard quickly, although being based on Elasticsearch there were a few teething issues (e.g. entire records occasionally being lost due to slightly differing (“malformed”) data), which often knocked the entire ES cluster into a “yellow” state and caused plenty of devops headaches.

Docker to Fluentd

When we started deploying services with Docker, we noticed Docker supports fluentd as a logging driver, so suddenly our services didn’t need to use up additional processing power sending logs directly to fluentd; they just needed to output logs in a specific format to stdout:
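That format wasn’t preserved above, so here’s a minimal sketch of the idea: one JSON object per line on stdout. This is an illustration of the pattern, not chill-logger’s actual API, and the field names are assumptions:

```javascript
// Minimal structured logger: write one JSON object per line to stdout.
// Docker's log driver picks each line up and forwards it on.
function log(level, msg, extra) {
  const entry = Object.assign({
    level,
    msg,
    time: new Date().toISOString(),
  }, extra);
  process.stdout.write(JSON.stringify(entry) + '\n');
}

// Example: log a request summary (hypothetical fields)
log('info', 'GET /embed', { statusCode: 200, duration_ms: 12 });
```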

To ease this, I wrote chill-logger, which allows our services to send JSON-stringified logs to stdout, which ECS can direct to fluentd. This requires two additions to a standard fluentd instance and an update to your container configuration.

Firstly, a custom fluentd filter to take logs output by Docker and transform them into native fluentd events.

Enable the filter in your td-agent.conf file. Docker sends events with the tag docker.${INSTANCE_ID}, which realistically won’t be known ahead of time, so the custom filter captures all Docker events and re-emits them as the native events that fluentd prefers.
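The custom filter itself isn’t reproduced here, but a similar effect can be sketched with fluentd’s stock parser filter, assuming the record shape Docker’s fluentd driver produces (where the raw stdout line sits in a `log` field):

```conf
# Docker's fluentd driver wraps each stdout line in a record like:
#   { "log": "<raw line>", "container_id": "...", "container_name": "...", "source": "stdout" }
# This parses the JSON string in "log" back out into native record fields,
# matching any docker.* tag regardless of the instance ID suffix.
<filter docker.**>
  @type parser
  key_name log
  reserve_data true   # keep container_id/container_name alongside the parsed fields
  <parse>
    @type json
  </parse>
</filter>
```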

Finally, update your container options (either in the ECS task definition or via the docker CLI) to send the logs to the fluentd instance.
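Via the docker CLI, that configuration might look like this (the address and image name are illustrative, and this obviously needs a reachable fluentd instance to run against):

```shell
# Send the container's stdout to fluentd via Docker's fluentd log driver.
docker run \
  --log-driver=fluentd \
  --log-opt fluentd-address=fluentd.internal:24224 \
  --log-opt tag="docker.{{.ID}}" \
  my-service:latest
```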

But again, this would then ship logs to fluentd and back out into S3 & Elasticsearch, with the same dashboard that we know and love, but also the same constraints that we know and love 🤦 This was a good middle-ground for application logs from these new microservices, but more was required.

Docker to AWS Cloudwatch Logs

After a little digging into logging platforms I noticed a platform hiding right under my nose: Cloudwatch Logs. Having written autoscaling rules for EC2 instances & Lambda scripts, I knew Cloudwatch Logs existed, and indeed it was tracking & reacting to server/service metrics I had set up in the past. What took me a while to notice was Docker’s support for the awslogs logging driver, so after a little testing (with Yoem, mentioned above) it turns out Cloudwatch Logs is exactly what I was looking for!

Cloudwatch Logs takes those JSON logs (looking exactly like that log data at the top of this page) and inserts them into a specified log-group, then the data can be queried in a similar fashion to ES. By “similar fashion” I mean a similar query syntax, but if the data isn’t present, or a record’s value doesn’t match the type the query implies, the record simply doesn’t match and isn’t returned by the query.
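As an illustration, Cloudwatch Logs filter patterns over JSON events look like this (the field names here are hypothetical, matching the sketch of a log line earlier):

```
# Match events where "level" is the string "error":
{ $.level = "error" }

# Match events where a numeric statusCode is 500 or above.
# Events where res.statusCode isn't a number simply don't match:
{ $.res.statusCode >= 500 }
```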

The only configuration that’s required for this is creating the log-group in Cloudwatch Logs, and then updating your container options to send logs via the awslogs driver. And that’s literally it. Suddenly you have all log data going into Cloudwatch Logs, complete with search options; the only thing you’re missing is those lovely Kibana dashboards & aggregations, which I’m slowly getting used to living without. I also set the log-group retention policy to 3 months, so Cloudwatch Logs will automagically remove logs older than that, which realistically is plenty of time to store application logs for 🙌
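That setup can be sketched as follows (log-group, region & image names are illustrative, and the aws CLI calls need live AWS credentials):

```shell
# Create the log-group and set a ~3 month retention policy
aws logs create-log-group --log-group-name my-service
aws logs put-retention-policy --log-group-name my-service --retention-in-days 90

# Point the container's logs at it via the awslogs driver
docker run \
  --log-driver=awslogs \
  --log-opt awslogs-region=eu-west-1 \
  --log-opt awslogs-group=my-service \
  my-service:latest
```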

Here’s where Cloudwatch Logs starts to get a little more interesting. The service supports subscriptions, as discussed in Real-time processing of log data with Subscriptions.

This effectively means you can ship important data from Cloudwatch Logs to other AWS services. Amazon even streamline the process for one particular use case: streaming Cloudwatch Logs to an AWS ES instance. So if you love your Kibana dashboards & aggregations more than I do, then you can still use Elasticsearch, but instead of fluentd & S3 you can use Cloudwatch Logs as the permanent store (or semi-permanent, if you have a retention policy enabled) and still use ES & Kibana. I don’t know how likely I am to use that, given that Cloudwatch Logs supports (excellent) filtering & pattern searching, but it’s nice to have options open to me for when my logging use case changes!
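For the more general case, wiring up a subscription might look like this sketch (the log-group name, filter pattern and destination ARN are all hypothetical — here a Lambda that forwards events on to Elasticsearch):

```shell
# Ship matching events from a log-group to another AWS service
aws logs put-subscription-filter \
  --log-group-name my-service \
  --filter-name errors-to-es \
  --filter-pattern '{ $.level = "error" }' \
  --destination-arn arn:aws:lambda:eu-west-1:123456789012:function:LogsToElasticsearch
```

(The destination Lambda also needs a permission allowing Cloudwatch Logs to invoke it.)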

So, this post has gone over the different forms of application logging used in production-level applications here at Car Throttle. The first focused on emitting logs in the application, and the second two focused on using the underlying architecture to ship logs, so the application doesn’t have to focus on logging. I much prefer this way of thinking - it means I can focus on building services that solve their problem. Obviously on a per-project basis I have to create log-groups in Cloudwatch Logs or update my td-agent.conf to support new services - no solution is without setup - but the fact that Cloudwatch Logs handles records individually rather than as part of a collection is much more appealing to me.

The next step is to add fluent-plugin-cloudwatch-logs to the existing fluentd instance and slowly start moving logs from ES to Cloudwatch Logs. Services backed by Docker & ECS can be moved over any time - they require a simple configuration change & deployment - but older services still using fluentd have higher volumes of traffic than these microservices, so I’ll be migrating them slowly!
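That fluentd change could be sketched as a match section like this (group/stream names are illustrative):

```conf
# Route existing fluentd traffic to Cloudwatch Logs
# via fluent-plugin-cloudwatch-logs
<match app.**>
  @type cloudwatch_logs
  log_group_name my-service
  log_stream_name production
  auto_create_stream true
</match>
```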

Interesting thought whilst writing this: Elasticsearch isn’t truly “schema-less”. It is, in the sense that new properties can be added on the fly, but what they don’t openly tell you is that every subsequent value for a property has to be of the same type, otherwise it causes massive issues. Cloudwatch Logs, on the other hand, is truly schema-less - matching just depends on the query you’re searching with. Key equals a string? Search where that key is a string. Key is less than a number? Search where that key is a number.