Last update: Sep 22, 2021

AWX Clustering/HA Overview

Prior to 3.1, the Ansible Tower HA solution was not a true high-availability system. The system was entirely rewritten in 3.1 with a focus on a proper, highly available clustered system. This was extended further in 3.2 to allow grouping of clustered instances into different pools/queues.

It's important to point out a few existing things:

Ansible Tower 3.3 adds support for container-based clusters using OpenShift or Kubernetes.

Important Changes

Concepts and Configuration

Installation and the Inventory File

The current standalone instance configuration doesn't change for a 3.1+ deployment. The inventory file does change in some important ways:

[tower]
hostA
hostB
hostC

[instance_group_east]
hostB
hostC

[instance_group_west]
hostC
hostD
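
To see which groups each host ends up in, the sample inventory above can be read with a few lines of Python. This is only a sketch: it relies on the standard-library `configparser`, which handles this simple sample but is not the parser Ansible itself uses, and the helper name `groups_by_host` is hypothetical.

```python
import configparser

# The sample inventory from the text, reproduced verbatim.
INVENTORY = """
[tower]
hostA
hostB
hostC

[instance_group_east]
hostB
hostC

[instance_group_west]
hostC
hostD
"""


def groups_by_host(inventory_text):
    """Map each host name to the list of inventory groups that mention it."""
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.optionxform = str  # keep host names case-sensitive
    parser.read_string(inventory_text)
    membership = {}
    for group in parser.sections():
        for host in parser[group]:
            membership.setdefault(host, []).append(group)
    return membership


membership = groups_by_host(INVENTORY)
for host, groups in sorted(membership.items()):
    print(host, groups)
```

In this sample, hostC is listed in all three groups, so it would listen on every corresponding queue.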

The database group remains in order to specify an external Postgres. If the database host is provisioned separately, this group should be empty.

[tower]
hostA
hostB
hostC

[database]
hostDB

Recommendations and constraints:

Provisioning and Deprovisioning Instances and Groups

$ awx-manage deprovision_instance --hostname=<hostname>
$ awx-manage unregister_queue --queuename=<name>

Configuring Instances and Instance Groups from the API

Instance Groups can be created by posting to /api/v2/instance_groups as a System Administrator.

Once created, Instances can be associated with an Instance Group with:

HTTP POST /api/v2/instance_groups/x/instances/ {"id": y}

An Instance that is added to an InstanceGroup will automatically reconfigure itself to listen on the group's work queue. See the following section Instance Group Policies for more details.
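
The association call can be sketched with Python's standard library. The base URL, the token, and the use of a Bearer `Authorization` header are assumptions made for illustration; only the endpoint path and the `{"id": y}` body come from the text. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://awx.example.com"  # hypothetical AWX host
TOKEN = "REPLACE_ME"  # hypothetical credential; auth scheme is an assumption


def associate_instance_request(group_id, instance_id):
    """Build the POST that associates an Instance with an Instance Group."""
    body = json.dumps({"id": instance_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v2/instance_groups/{group_id}/instances/",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {TOKEN}",
        },
    )


req = associate_instance_request(group_id=3, instance_id=7)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```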

Instance Group Policies

AWX Instances can be configured to automatically join Instance Groups when they come online by defining a policy. These policies are evaluated for every new Instance that comes online.

Instance Group Policies are controlled by three optional fields on an Instance Group:

NOTES

Manually Pinning Instances to Specific Groups

If you have a special Instance which needs to be exclusively assigned to a specific Instance Group but don't want it to automatically join other groups via "percentage" or "minimum" policies:

  1. Add the Instance to one or more Instance Groups' policy_instance_list.
  2. Update the Instance's managed_by_policy property to be False.

This will prevent the Instance from being automatically added to other groups based on percentage and minimum policy; it will only belong to the groups you've manually assigned it to:

HTTP PATCH /api/v2/instance_groups/N/
{
    "policy_instance_list": ["special-instance"]
}

HTTP PATCH /api/v2/instances/X/
{
    "managed_by_policy": false
}
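
When these two request bodies are generated from Python, the standard `json` module takes care of the boolean literal: Python's `False` is serialized as JSON `false`. A minimal sketch, with the hypothetical helper name `pinning_payloads`:

```python
import json


def pinning_payloads(hostname):
    """Return the JSON bodies for the two PATCH requests described above."""
    # Body for PATCH /api/v2/instance_groups/N/
    group_patch = {"policy_instance_list": [hostname]}
    # Body for PATCH /api/v2/instances/X/
    instance_patch = {"managed_by_policy": False}
    return json.dumps(group_patch), json.dumps(instance_patch)


group_body, instance_body = pinning_payloads("special-instance")
print(group_body)     # {"policy_instance_list": ["special-instance"]}
print(instance_body)  # {"managed_by_policy": false}
```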

Status and Monitoring

AWX itself reports as much status as it can via the API at /api/v2/ping in order to provide validation of the health of the Cluster. This includes:

A more detailed view of Instances and Instance Groups, including running jobs and membership information, can be seen at /api/v2/instances/ and /api/v2/instance_groups/.
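
A health check built on the ping endpoint might look like the sketch below. The response field names used here (`instances`, `node`, `capacity`, `instance_groups`) are assumptions about the shape of the ping body and should be checked against the output of your own deployment; the sample data is made up.

```python
def summarize_ping(ping):
    """Return (total_capacity, nodes_reporting_zero_capacity) for a parsed ping body."""
    instances = ping.get("instances", [])
    total = sum(item.get("capacity", 0) for item in instances)
    zero = [item.get("node") for item in instances if item.get("capacity", 0) == 0]
    return total, zero


# Hypothetical, hand-written sample in the assumed shape.
sample = {
    "instances": [
        {"node": "hostA", "capacity": 100},
        {"node": "hostB", "capacity": 0},
    ],
    "instance_groups": [
        {"name": "tower", "capacity": 100, "instances": ["hostA", "hostB"]},
    ],
}

total, zero = summarize_ping(sample)
print(total, zero)
```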

Instance Services and Failure Behavior

Each AWX instance is made up of several different services working collaboratively:

AWX is configured in such a way that if any of these services or their components fail, then all services are restarted. If they fail sufficiently often within a short span of time, the entire instance is automatically placed offline so that it can be remediated without causing unexpected behavior.
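
The "restart everything, but go offline after repeated failures" behavior can be illustrated with a sliding-window counter. This is a conceptual sketch, not how AWX implements it: the class name, the threshold of 3 failures, and the 60-second window are all assumptions chosen for the example.

```python
from collections import deque


class RestartTracker:
    """Toy model: restart on each failure, go offline after too many in a window."""

    def __init__(self, max_failures=3, window_seconds=60.0):
        self.max_failures = max_failures      # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self.failures = deque()
        self.offline = False

    def record_failure(self, now):
        """Record a failure at time `now` (in seconds) and return the action taken."""
        if self.offline:
            return "offline"
        self.failures.append(now)
        # Forget failures that are older than the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.offline = True
            return "offline"
        return "restart_all_services"


tracker = RestartTracker()
actions = [tracker.record_failure(t) for t in (0, 10, 200, 210, 220)]
print(actions)
```

In this run the two early failures age out of the window, so only the three failures between t=200 and t=220 push the instance offline.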

Job Runtime Behavior

Ideally, a regular user of AWX should not notice any semantic difference in the way jobs are run and reported. Behind the scenes, however, the system behaves differently in ways that are worth pointing out.

When a job is submitted through the API, it gets pushed into the dispatcher queue via PostgreSQL notify/listen (https://www.postgresql.org/docs/10/sql-notify.html), and the task is handled by the dispatcher process running on that specific AWX node. If an instance fails while executing jobs, then the work is marked as permanently failed.

If a cluster is divided into separate Instance Groups, then the behavior is similar to the cluster as a whole. If two instances are assigned to a group then either one is just as likely to receive a job as any other in the same group.

As AWX instances are brought online, they effectively expand the work capacity of the AWX system. If those instances are also placed into Instance Groups, then they also expand those groups' capacity. If an instance is performing work and it is a member of multiple groups, then capacity will be reduced in all groups of which it is a member. De-provisioning an instance will remove capacity from the cluster wherever that instance was assigned.

It's important to note that not all instances are required to be provisioned with an equal capacity.
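
The capacity rules above can be made concrete with a small model. It is a sketch under stated assumptions: the class names, capacity numbers, and the job "impact" of 30 are invented for illustration and do not reflect how AWX computes capacity internally.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    hostname: str
    capacity: int      # instances need not have equal capacity
    consumed: int = 0  # capacity currently used by running work

    @property
    def remaining(self):
        return self.capacity - self.consumed


@dataclass
class InstanceGroup:
    name: str
    instances: list = field(default_factory=list)

    @property
    def remaining(self):
        # A group's capacity is the sum over its member instances.
        return sum(instance.remaining for instance in self.instances)


host_b = Instance("hostB", capacity=50)
host_c = Instance("hostC", capacity=100)
host_d = Instance("hostD", capacity=80)
east = InstanceGroup("east", [host_b, host_c])
west = InstanceGroup("west", [host_c, host_d])

# hostC belongs to both groups, so work on it reduces capacity in both.
host_c.consumed += 30
east_left, west_left = east.remaining, west.remaining

# De-provisioning hostD removes its capacity from every group it was in.
west.instances.remove(host_d)
west_after = west.remaining
print(east_left, west_left, west_after)
```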

If an Instance Group is configured but all instances in that group are offline or unavailable, any jobs that are launched targeting only that group will be stuck in a waiting state until instances become available. Fallback or backup resources should be provisioned to handle any work that might encounter this scenario.

Project Synchronization Behavior

It is important that project updates run on the instance which prepares the ansible-runner private data directory. This is accomplished by a project sync which is done by the dispatcher control / launch process. The sync will update the source tree to the correct version on the instance immediately prior to transmitting the job. If the needed revision is already locally checked out and Galaxy or Collections updates are not needed, then a sync may not be performed.

When the sync happens, it is recorded in the database as a project update with a launch_type of "sync" and a job_type of "run". Project syncs will not change the status or version of the project; instead, they will update the source tree only on the instance where they run. The only exception to this behavior is when the project is in the "never updated" state (meaning that no project updates of any type have been run), in which case a sync should fill in the project's initial revision and status, and subsequent syncs should not make such changes.

All project updates run with container isolation (like jobs) and volume mount to the persistent projects folder.

Controlling Where a Particular Job Runs

By default, a job will be submitted to the default queue (formerly the tower queue). To see the name of the queue, view the setting DEFAULT_EXECUTION_QUEUE_NAME. Administrative actions, like project updates, will run in the control plane queue. The name of the control plane queue is surfaced in the setting DEFAULT_CONTROL_PLANE_QUEUE_NAME.

How to Restrict the Instances a Job Will Run On

If the Job Template, Inventory, or Organization has instance groups associated with it, a job run from that Job Template will not be eligible for the default behavior. This means that if all of the instances associated with these three resources are out of capacity, the job will remain in the pending state until capacity frees up.

How to Set Up a Preferred Instance Group

The order of preference in determining which instance group the job gets submitted to is as follows:

  1. Job Template
  2. Inventory
  3. Organization (by way of Inventory)

To expand further: If instance groups are associated with the Job Template and all of them are at capacity, then the job will be submitted to instance groups specified on Inventory, and then Organization.
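
The preference order can be sketched as a simple lookup. The function name, the `remaining` capacity map, and the numeric `needed` value are hypothetical; the sketch only illustrates the Job Template, then Inventory, then Organization ordering and is not the actual task manager logic.

```python
def pick_instance_group(job_template_groups, inventory_groups,
                        organization_groups, remaining, needed):
    """Return the first group, in preference order, with enough remaining capacity.

    Returns None when every candidate is out of capacity, in which case the
    job would stay pending until capacity frees up.
    """
    for candidates in (job_template_groups, inventory_groups, organization_groups):
        for name in candidates:
            if remaining.get(name, 0) >= needed:
                return name
    return None


remaining = {"east": 0, "west": 40, "tower": 200}
first = pick_instance_group(["east"], ["west"], ["tower"], remaining, needed=10)
second = pick_instance_group(["east"], ["west"], ["tower"], remaining, needed=50)
third = pick_instance_group(["east"], ["west"], [], remaining, needed=50)
print(first, second, third)
```

Here the Job Template's group is full, so a small job falls through to the Inventory's group, a larger one falls through to the Organization's group, and a job with no viable group is left pending.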

The global tower group can still be associated with a resource, just like any of the custom instance groups defined in the playbook. This can be used to specify a preferred instance group on the job template or inventory, but still allow the job to be submitted to any instance if those are out of capacity.

Instance Enable / Disable

In order to support temporarily taking an Instance offline, there is a boolean property enabled defined on each instance.

When this property is set to false, no new jobs will be assigned to that Instance. Jobs that are already running on it will finish.
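
The effect of the flag can be illustrated with a small filter. The data layout and the helper name `eligible_for_new_work` are hypothetical; the point is only that disabling an Instance removes it from consideration for new work without touching what it is already running.

```python
# Hypothetical snapshot of two instances and their running jobs.
instances = [
    {"hostname": "hostA", "enabled": True, "running_jobs": ["job-1"]},
    {"hostname": "hostB", "enabled": False, "running_jobs": ["job-2"]},
]


def eligible_for_new_work(instances):
    """Return hostnames of instances that may be assigned new jobs."""
    return [item["hostname"] for item in instances if item["enabled"]]


eligible = eligible_for_new_work(instances)
print(eligible)                      # only the enabled instance
print(instances[1]["running_jobs"])  # existing work on hostB is left alone
```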

Acceptance Criteria

When verifying acceptance, we should ensure that the following statements are true:

Testing Considerations

Performance Testing

Performance testing should be twofold:

These should also be benchmarked against the same playbooks using the 3.0.X Tower release and a stable Ansible version. For a large volume playbook (e.g., against 100+ hosts), something like the following is recommended:

https://gist.github.com/michelleperz/fe3a0eb4eda888221229730e34b28b89