Efficient and Durable Logging for Apache Airflow in AWS

Viewing run logs during development is vital to confirming that your tasks execute as you expect. However, this can be harder than it should be in Apache Airflow, and the reasons are not always clear to you, the developer. We will look at one scenario where this happens and how you can work around it while improving your logging setup.

Apache Airflow makes an execution log available in the web interface. While this appears to hold the logging details, you can only select your dag_id and then click through to the logs. This dag_id link is tied to your current graph and assumes the graph will remain static. If you modify your graph, e.g. by adding or removing nodes, the dag_id link in the execution log can no longer retrieve logs previously generated from the same graph. Not being able to compare logs from one run to the next is a problem, but there is a workaround.

Configure remote logging to CloudWatch Logs then transition to S3

CloudWatch Logs is available as a target for Airflow logs. It offers a good search experience and includes the newer CloudWatch Logs Insights feature for running queries. To enable it, you first need to set up a CloudWatch connection in Airflow. If you are running Airflow in EKS, you can create a new IAM policy for your execution role with write access to a new Airflow log group. You will then need to update the custom environment variables in the YAML files as described in the documentation:

# Airflow can store logs remotely in AWS CloudWatch. Users must supply a log group
# ARN (starting with 'cloudwatch://...') and an Airflow connection
# id that provides write and read access to the log location.
remote_logging = True
remote_base_log_folder = cloudwatch://arn:aws:logs:<region name>:<account id>:log-group:<group name>
remote_log_conn_id = MyCloudWatchConn
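
The IAM policy mentioned above could be generated with a small helper like the following. This is a minimal sketch, not an official policy: the function name `airflow_log_policy`, the example region, account id and group name are hypothetical, and the list of actions is an assumption about what a log writer and the web interface need, so check it against your own setup.

```python
import json


def airflow_log_policy(region: str, account_id: str, group_name: str) -> dict:
    """Build an IAM policy document scoped to a single CloudWatch log group."""
    group_arn = f"arn:aws:logs:{region}:{account_id}:log-group:{group_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed action set: create streams and write events,
                # plus read access so the logs can be displayed again.
                "Action": [
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "logs:GetLogEvents",
                    "logs:DescribeLogStreams",
                ],
                # The ":*" suffix covers the log streams inside the group.
                "Resource": [group_arn, f"{group_arn}:*"],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = airflow_log_policy("us-east-1", "123456789012", "airflow-logs")
    print(json.dumps(policy, indent=2))
```

Scoping the resource to one log group keeps the execution role from writing to unrelated log groups in the account.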

Once the log stream is enabled, change the retention setting for the log group in the AWS CloudWatch console so that logs expire after one week. This keeps the volume of data in CloudWatch low, which limits the cost of the service. If you need more durable logs, you can set up a recurring task with CloudWatch Events as the scheduled trigger and Lambda to run your script. The boto3 CloudWatch Logs client has a create_export_task function that starts an asynchronous job to write log data to an S3 location. You can separately configure the S3 location to transition to a lower-cost storage class if you need to keep the logs long term.
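
The retention and export steps described above can be sketched in Python as follows. This is an illustrative outline under several assumptions: the helper names, the `LOG_GROUP` and `EXPORT_BUCKET` environment variables, the `airflow` prefix and the once-a-day schedule are all hypothetical, and the destination bucket is assumed to already allow CloudWatch Logs to write to it. The client is passed in as an argument so the logic can be exercised without AWS access.

```python
import os
import time

DAY_MS = 24 * 60 * 60 * 1000


def set_one_week_retention(logs_client, log_group: str) -> None:
    """Expire events in the log group after 7 days (same as the console setting)."""
    logs_client.put_retention_policy(logGroupName=log_group, retentionInDays=7)


def export_previous_day(logs_client, log_group, bucket, prefix, now_ms=None):
    """Start an asynchronous export of the previous full UTC day to S3."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    end_ms = now_ms - (now_ms % DAY_MS)  # most recent UTC midnight
    start_ms = end_ms - DAY_MS
    response = logs_client.create_export_task(
        taskName=f"airflow-logs-{start_ms}",
        logGroupName=log_group,
        fromTime=start_ms,
        to=end_ms,
        destination=bucket,
        destinationPrefix=prefix,
    )
    return response["taskId"]


def lambda_handler(event, context):
    """Entry point for a daily scheduled trigger (hypothetical configuration)."""
    import boto3  # imported here so the helpers above stay testable offline

    logs_client = boto3.client("logs")
    task_id = export_previous_day(
        logs_client,
        os.environ["LOG_GROUP"],      # hypothetical variable name
        os.environ["EXPORT_BUCKET"],  # hypothetical variable name
        "airflow",
    )
    return {"taskId": task_id}
```

Because the export runs asynchronously, the Lambda returns as soon as the task is accepted; the one-week retention leaves a comfortable margin for a daily export to complete before the source events expire.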

Antipattern: Send remote logs to S3 then push to CloudWatch

If you are already using S3 as an Airflow logging target, you cannot add CloudWatch as a second target: remote_base_log_folder holds a single location, so its S3 value has to be replaced. S3 Select does let you view the logs of any run, even if your graph changes and gets a new dag_id, but it is time-consuming because of all the clicks involved. Forwarding logs to CloudWatch once they land in S3 is not efficient either. A Lambda would need to be triggered by each PutObject event as log files land, and it would have to parse each log file into the JSON format that put_log_events in boto3 expects before sending it. In addition to the extra API calls, the Lambda's own logs would incur additional charges unless you limit their retention too.
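
To give a sense of the parsing work this antipattern adds, here is a rough sketch of the conversion step alone. It assumes log lines that start with a bracketed timestamp such as `[2020-01-01 12:00:00,123]` and that those timestamps are in UTC; both are assumptions about your log format, and the name `to_log_events` is hypothetical. The output is the list of `timestamp` and `message` entries that put_log_events takes, in chronological order.

```python
import re
from datetime import datetime, timezone

# Assumed line layout: "[YYYY-MM-DD HH:MM:SS,mmm] rest of the message"
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\] (.*)$")


def to_log_events(text: str) -> list:
    """Convert raw log text into a time-ordered list of log events."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: UTC logs
            millis = int(stamp.timestamp()) * 1000 + int(match.group(2))
            events.append({"timestamp": millis, "message": match.group(3)})
        elif events:
            # Lines without a timestamp (e.g. tracebacks) extend the previous event.
            events[-1]["message"] += "\n" + line
    # Sort by timestamp so the batch is in chronological order.
    return sorted(events, key=lambda event: event["timestamp"])
```

Even this simplified version has to handle multi-line messages and ordering, and a real Lambda would still need to fetch the object from S3, batch the events and manage log streams on top of it.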

Hopefully this helps you set up an efficient and durable logging pattern for your Apache Airflow development work, and avoid unnecessary pain with the web interface or S3.