MIQ(ResourceAction#deliver_to_automate_from_dialog) \ Queuing <ResourceAction:1000000000066> for <CustomButton:1000000000001>
Distributed Automation Processing
As we start using Automate to expand our workflows into the wider Enterprise, we may find that we need to add further CloudForms or ManageIQ appliances to spread the workload in our Region.
Automate has been designed to be scalable by supporting distributed worker appliances, each running the Automation Engine role and polling the VMDB database appliance for work. In this chapter we’ll take an in-depth look at how the automation operations are distributed between appliances via a central queue. This is another background information chapter, so feel free to skip on first read or bookmark for later reference.
Nondistributed Automation Operations
Not all automation operations need to have a distributed capability. Some automation operations interact with a user through the WebUI, and these require an instance/method to be run directly on the WebUI node to which the user is logged-in. Such operations include:
-
Running an Automate instance from simulation
-
Automate instances that are run to populate dynamic dialog elements
Some other automation operations need to be executed synchronously and in a specific order, and these are also run on a single appliance to guarantee execution order. An example of this is running a control policy synchronous action type of Invoke a Custom Automation.
The Automation Engine role does not need to be enabled for these nondistributed automation operations to run.
Distributed Automation Operations
Most automation operations benefit from being scalable and distributed, and capable of running on any appliance in our zone with the Automation Engine role set. These include:
-
Running an Automate instance from a custom button
-
A control policy asynchronous action type of Invoke a Custom Automation
-
Any automation operations that involve separated requests and tasks
Distributed automation tasks are passed to the Automation Engine using the standard message passing mechanism by which all workers communicate. This is via a queue, modelled in the database as the miq_queue
table. Generic workers running on appliances with the Automation Engine role set, monitor this queue for messages with a queue_name
field of generic and a role field of automate. If such a message is found, it is dequeued and processed.
Tracing Queueing/Dequeueing Operations
We can examine evm.log to see the interworker message queueing/dequeueing activity when a custom button is clicked that launches an automation task (here the lines have been wrapped for clarity). The first activity that we see is the ResourceAction message (button activities are run as Resource Actions):
This is immediately followed by the insertion of a new message (#1000000158789) into the queue, containing the task details. The Role: [automate] parameter signifies that the message is intended for the Automation Engine.
MIQ(MiqQueue.put) Message id: [1000000158789], \ id: [], \ Zone: [default], \ Role: [automate], \ Server: [], \ Ident: [generic], \ Target id: [], \ Instance id: [], \ Task id: [resource_action_1000000000066], \ Command: [MiqAeEngine.deliver], \ Timeout: [3600], \ Priority: [20], \ State: [ready], \ Deliver On: [], \ Data: [], \ Args: [{:namespace=>"SYSTEM", \ :class_name=>"PROCESS", \ :instance_name=>"Request", \ :automate_message=>nil, \ :attrs=>{"class"=>"methods", \ "instance"=>"objectwalker", \ "namespace"=>"stuff", \ "request"=>"call_instance", \ "dialog_walk_association_whitelist"=>""}, \ :object_type=>"VmOrTemplate", \ :object_id=>1000000000024, \ :user_id=>1000000000001, \ :miq_group_id=>1000000000002, \ :tenant_id=>1000000000001}] \
The next log line that mentions Message id: [1000000158789]
shows it
being dequeued by an MiqPriorityWorker thread:
MIQ(MiqPriorityWorker::Runner#get_message_via_drb) Message id: [1000000158789], \ MiqWorker id: [1000000000504], \ Zone: [default], \ Role: [automate], \ Server: [], \ Ident: [generic], \ Target id: [], \ Instance id: [], \ Task id: [resource_action_1000000000066], \ Command: [MiqAeEngine.deliver], \ Timeout: [3600], Priority: [20], \ State: [dequeue], \ Deliver On: [], \ Data: [], \ Args: [{:namespace=>"SYSTEM", \ :class_name=>"PROCESS", \ :instance_name=>"Request", \ :automate_message=>nil, \ :attrs=>{"class"=>"methods", \ "instance"=>"objectwalker", \ "namespace"=>"stuff", \ "request"=>"call_instance", \ "dialog_walk_association_whitelist"=>""}, \ :object_type=>"VmOrTemplate", \ :object_id=>1000000000024, \ :user_id=>1000000000001, \ :miq_group_id=>1000000000002, \ :tenant_id=>1000000000001}], \ Dequeued in: [3.494673879] seconds
From here we see the message payload being delivered to the Automation Engine. Notice that in the logfile the task action is now prefixed by Q-task_id, followed by the Task id in the message:
Q-task_id([resource_action_1000000000066]) MIQ(MiqQueue#deliver) \ Message id: [1000000158789], Delivering... Q-task_id([resource_action_1000000000066]) MIQ(MiqAeEngine.deliver) Delivering \ {"class"=>"methods", \ "instance"=>"objectwalker", \ "namespace"=>"stuff", \ "request"=>"call_instance", \ "dialog_walk_association_whitelist"=>""} \ for object [VmOrTemplate.1000000000024] with state [] to Automate
We see the string Q-task_id many times in evm.log. This is an indication that the log line was generated by a task that was created as a result of a dequeued message, and that the message contained a valid Task id.
Finally the target instance itself is run by the Automation Engine:
Q-task_id([resource_action_1000000000066]) \ <AutomationEngine> Instantiating [/SYSTEM/PROCESS/Request? \ MiqServer%3A%3Amiq_server=1000000000001& \ User%3A%3Auser=1000000000001& \ VmOrTemplate%3A%3Avm=1000000000024& \ class=methods& \ dialog_walk_association_whitelist=& \ instance=objectwalker& \ namespace=stuff& \ object_name=Request& \ request=call_instance& \ vmdb_object_type=vm]
Detailed Queue Analysis
At any time, the miq_queue
table in the PostgreSQL database contains several messages:
vmdb_production=# select id,priority,method_name,state,queue_name,class_name, vmdb_production=# zone,role,msg_timeout from miq_queue; id | priority | method_name | state | queue_name | ---------------+----------+-------------- +---------+-----------------------| ... 1000000160668 | 100 | perf_rollup | ready | ems_metrics_processor | ... 1000000160710 | 20 | deliver | ready | generic | ... 1000000160673 | 100 | perf_rollup | ready | ems_metrics_processor | ... 1000000126295 | 100 | refresh | ready | ems_1000000000004 | ... 1000000160711 | 20 | deliver | ready | generic | ... 1000000153572 | 100 | perf_rollup | ready | ems_metrics_processor | ... 1000000154220 | 100 | perf_rollup | ready | ems_metrics_processor | ... ...
Each worker type queries the miq_queue
table to see if there is any work to be done for its respective role. The workers search for messages with a specific queue_name
field; for automation-related messages this is generic
.
When work is claimed by a worker, the message status is changed from “ready” to “dequeue” and the worker starts processing the message.
Monitoring the Queue During an Automation Operation
We can monitor the miq_queue
table during an automation operation initiated from a RESTful call. The following SQL query enables us to see the relevant messages:
vmdb_production=# select id,priority,method_name,state,queue_name, vmdb_production-# class_name,zone,role,msg_timeout from miq_queue where vmdb_production-# class_name like '%Automation%' or class_name like '%MiqAe%';
Searching for specific class_name
fields in this way enables us to also see automate_event messages, which aren’t handled by the Automation Engine, but are still relevant to an automation operation.
We see several messages created and dispatched over a short time period:
id | pri | method_name | state | queue | class_name | ... ---------+-----+----------------------+-------+---------+-------------------+---- ...1068 | 100 | call_automate_event | ready | generic | AutomationRequest | ... ...1069 | 100 | call_automate_event | ready | generic | AutomationRequest | ... ...1070 | 100 | create_request_tasks | ready | generic | AutomationRequest | ... (3 rows)
id | pri | method_name | state | queue | class_name ... ---------+-----+----------------------+---------+---------+------------------+... ...1071 | 20 | deliver | ready | generic | MiqAeEngine ... ...1070 | 100 | create_request_tasks | ready | generic | AutomationRequest ... ...1069 | 100 | call_automate_event | dequeue | generic | AutomationRequest ... (3 rows)
id | pri | method_name | state | queue | class_name ... ---------+-----+----------------------+---------+---------+-------------------... ...1071 | 20 | deliver | ready | generic | MiqAeEngine ... ...1072 | 20 | deliver | ready | generic | MiqAeEngine ... ...1070 | 100 | create_request_tasks | dequeue | generic | AutomationRequest ... (3 rows)
id | pri | method_ | state | queue | class_name | zone | role ---------+-----+---------+-------+---------+----------------+---------+-------... ...1071 | 20 | deliver | ready | generic | MiqAeEngine | default | automa... ...1072 | 20 | deliver | ready | generic | MiqAeEngine | default | automa... ...1073 | 100 | execute | ready | generic | AutomationTask | default | automa... (3 rows)
id | pri | method_ | state | queue | class_name | zone | ro... ---------+-----+---------+---------+---------+----------------+---------+-----... ...1071 | 20 | deliver | dequeue | generic | MiqAeEngine | default | auto... ...1073 | 100 | execute | dequeue | generic | AutomationTask | default | auto... (2 rows)
id | pri | method_ | state | queue | class_name | zone | ro... ---------+-----+---------+---------+---------+----------------+---------+-----... ...1073 | 100 | execute | dequeue | generic | AutomationTask | default | auto... (1 row)
id | pri | method_name | state | queue_name | class_name | zone | role | msg_... ----+-----+-------------+-------+------------+------------+------+------+-----... (0 rows)
We can search for any of these message IDs in evm.log and expand them to examine the message content. For example searching for message id: 1000000161070 reveals:
MIQ(MiqQueue.put) Message id: [1000000161070], \ id: [], \ Zone: [default], \ Role: [automate], \ Server: [], \ Ident: [generic], \ Target id: [], \ Instance id: [1000000000016], \ Task id: [automation_request_1000000000016], \ Command: [AutomationRequest.create_request_tasks], \ Timeout: [3600], \ Priority: [100], \ State: [ready], \ Deliver On: [], \ Data: [], \ Args: []
MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [1000000161070], \ MiqWorker id: [1000000000503], \ Zone: [default], \ Role: [automate], \ Server: [], \ Ident: [generic], \ Target id: [], \ Instance id: [1000000000016], \ Task id: [automation_request_1000000000016], \ Command: [AutomationRequest.create_request_tasks], \ Timeout: [3600], \ Priority: [100], \ State: [dequeue], \ Deliver On: [], \ Data: [], \ Args: [], \ Dequeued in: [5.622553094] seconds
Q-task_id([automation_request_1000000000016]) MIQ(MiqQueue#deliver) \ Message id: [1000000161070], Delivering...
Q-task_id([automation_request_1000000000016]) MIQ(MiqQueue#delivered) \ Message id: [1000000161070], State: [ok], Delivered in [1.866825831] seconds
This corresponds to the message queueing activity generated by the execute
method in the backend Rails module vmdb/app/models/miq_request.rb.
def execute
task_check_on_execute
deliver_on = nil
if get_option(:schedule_type) == "schedule"
deliver_on = get_option(:schedule_time).utc rescue nil
end
# self.create_request_tasks
MiqQueue.put(
:class_name => self.class.name,
:instance_id => id,
:method_name => "create_request_tasks",
:zone => options.fetch(:miq_zone, my_zone),
:role => my_role,
:task_id => "#{self.class.name.underscore}_#{id}",
:msg_timeout => 3600,
:deliver_on => deliver_on
)
end
If we search the sources for MiqQueue.put we see the extent to which the distributed nature of ManageIQ is used.
Troubleshooting
As (by design) queued automation operations can be dequeued and run by any appliance in a zone with the Automation Engine role set, we cannot necessarily predict which appliance will run our code. This can make troubleshooting $evm.log
output more challenging, as we may need to search automation.log on several appliances to find our method’s log output. When tracing message passing, the enqueue MiqQueue.put
and corresponding dequeue Worker::Runner#get_message_via_drb
calls might even be on different appliances as well.
If Automate tasks are not being run in a distributed ManageIQ installation, it is often worth examining the contents of the miq_queue
table to see whether Automate messages are accumulating, and which zone the messages are targeted for (the Zone: [] field). If messages are not being dequeued as expected, then check that the Automation Engine role is set on at least one appliance in the zone.
We often see this when separating appliances into various role-specific zones, such as a WebUI zone and a Worker Appliance zone. Automation calls made using the RESTful API to an appliance in the WebUI zone will fail to run if the Automation Engine role is not enabled on any of the WebUI zone appliances, or the RESTful call does not specify an alternative zone to run in.