Handling timeouts in Ansible AWS modules


Timeouts in AWS?

Yup, it's completely normal that the connection between your laptop and the AWS API is sometimes less stable than you'd like. We hit random timeouts and have to re-run our playbooks.

Timeouts in Ansible?

Looking for a "timeout" configuration directive in ansible.cfg, I found one, but it doesn't actually help. It only lets us set a longer timeout for SSH connections, whereas the AWS modules connect to the API using the boto library.

So what's the problem once again?

Let's say that we have the following playbook:

- hosts: localhost
  connection: local
  tasks:
  - name: Create security group
    ec2_group:
      name: "test_sg"
      state: "present"
      vpc_id: "some_vpc"
      purge_rules: False
      purge_rules_egress: False
    register: ec2_group

  - name: Tag security group
    ec2_tag:
      resource: "{{ ec2_group.group_id }}"
      state: "present"
      tags: "{{ some_defined_tags }}"
    register: ec2_group_tags

  - name: Apply permissions on security group
    ec2_group:
      name: "test_sg"
      state: "present"
      vpc_id: "some_vpc"
      rules: "{{ some_defined_rules }}"

  - name: Provision a set of instances
    ec2:
      group_id: "{{ ec2_group.group_id }}"
      instance_type: "t2.large"
      image: "some_ami"
      vpc_subnet_id: "some_vpc"
      count_tag: "some_tag"
      exact_count: "3"
      wait: true
      instance_tags: "{{ some_defined_tags }}"
      zone: "some_zone"
    register: ec2

Now let's say that it fails randomly because of timeouts that happen over time, and that those timeouts hit random tasks.

We could simply re-run the job; thanks to idempotency, the second run would just make sure that everything gets finished.

We could also retry using the retry file that Ansible suggests after a failed run, which re-runs the playbook against the hosts that failed:

PLAY RECAP ********************************************************************
           to retry, use: --limit @/home/user/playbook.retry

Anything more civilized?

Actually there's another method. Not sure if it is more civilized, but it works and makes the Ansible playbook finish successfully more frequently.

It's a simple do-until loop, documented in the Ansible documentation on loops.

Using this approach, we can add error handling and a retry policy to the playbook above, so we get something like the following:

- hosts: localhost
  connection: local
  tasks:
  - name: Create security group
    ec2_group:
      name: "test_sg"
      state: "present"
      vpc_id: "some_vpc"
      purge_rules: False
      purge_rules_egress: False
    register: ec2_group

    until: ec2_group.failed is not defined or ec2_group.failed == false
    retries: "3"
    delay: "30"

  - name: Tag security group
    ec2_tag:
      resource: "{{ ec2_group.group_id }}"
      state: "present"
      tags: "{{ some_defined_tags }}"
    register: ec2_group_tags

    until: ec2_group_tags.failed is not defined or ec2_group_tags.failed == false
    retries: "3"
    delay: "30"

  - name: Apply permissions on security group
    ec2_group:
      name: "test_sg"
      state: "present"
      vpc_id: "some_vpc"
      rules: "{{ some_defined_rules }}"
    register: ec2_group_perms

    until: ec2_group_perms.failed is not defined or ec2_group_perms.failed == false
    retries: "3"
    delay: "30"

  - name: Provision a set of instances
    ec2:
      group_id: "{{ ec2_group.group_id }}"
      instance_type: "t2.large"
      image: "some_ami"
      vpc_subnet_id: "some_vpc"
      count_tag: "some_tag"
      exact_count: "3"
      wait: true
      instance_tags: "{{ some_defined_tags }}"
      zone: "some_zone"
    register: ec2

    until: ec2.failed is not defined or ec2.failed == false
    retries: "3"
    delay: "30"

It's very simple and works just fine. Of course, the retry and delay values are good candidates to be parametrized and put into variables.
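As a minimal sketch of that parametrization, the retry count and delay could be pulled into play-level variables. The names `aws_retries` and `aws_delay` are made up for this example, and only the first task is shown; the other tasks would follow the same pattern.

```yaml
- hosts: localhost
  connection: local
  vars:
    # Hypothetical variable names, not part of any Ansible module
    aws_retries: 3   # how many times to retry a failed task
    aws_delay: 30    # seconds to wait between attempts
  tasks:
  - name: Create security group
    ec2_group:
      name: "test_sg"
      state: "present"
      vpc_id: "some_vpc"
      purge_rules: False
      purge_rules_egress: False
    register: ec2_group
    # Same retry condition as above, with the values taken from variables
    until: ec2_group.failed is not defined or ec2_group.failed == false
    retries: "{{ aws_retries }}"
    delay: "{{ aws_delay }}"
```

Defined this way, the values can also be overridden per run, for example with `-e "aws_retries=5 aws_delay=10"` on the ansible-playbook command line.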
