In the most general terms, automation refers to the use of technology to help operators work faster and with greater efficiency. With respect to service automation, much of the focus to date has been on scripts for initial deployment and configuration of services. These scripts are typically written in a scripting language such as Python or use frameworks such as Ansible, Puppet, or Chef. They combine a number of manual steps into a single executable with the goal of saving time and (more importantly) avoiding costly mistakes resulting from human operator error. More recently, these scripting systems have broadened in scope to include service orchestration. Whereas scripts tend to focus on individual components, orchestration systems coordinate the deployment of multiple components associated with a service. Orchestrators may also provide mechanisms (e.g. through policies) to attempt to align service deployments with high-level business goals.
While these technologies are obviously valuable, they address only a small portion of the service management cost incurred by service providers. For most service providers, the majority of Operational Expenses (OPEX) results from making sure that deployed services meet performance, reliability, and security objectives in the face of changing requirements or changes to the operational environment. For example, the introduction of new features may require components to be added, replaced, or upgraded; increased loads may force service providers to provision additional capacity; security infrastructure may need to be added or reconfigured in response to new security threats; failure of individual components may require temporary reconfigurations to work around malfunctioning components.
Most if not all of these tasks still require human intervention today. While excellent tools exist to alert operators to security, performance, or reliability events, handling an event typically still requires a human operator to troubleshoot what’s going on, diagnose the root cause, and decide on the appropriate corrective action.
This brings us to the real opportunity for service automation: to automate the process of responding to performance, reliability, and security events such that these no longer require human intervention. In other words, rather than building automation tools that simplify the life of the human operator, can we eliminate the human operator altogether? Shouldn’t our ultimate goal be one of service autonomy? And if autonomy is the ultimate goal, how do we get there?
Fortunately, a lot of research has been done in this area that can help make service autonomy a reality. IBM originally coined the term Autonomic Computing System to describe systems that can adapt to unpredictable changes while hiding intrinsic complexity from operators and users. Such autonomic systems typically implement a number of self-management functions such as self-configuration, self-healing, self-optimization, and self-protection.
Architecturally, autonomic systems are constructed as a collection of autonomic elements, each of which manages its internal behavior and its relationships with other autonomic elements. Each autonomic element consists of the following components:
- The managed resource, which is the entity whose behavior is autonomically adapted.
- An autonomic manager that implements the “control loop” that is responsible for autonomic behavior.
- A number of sensors that are used by the autonomic manager to obtain data from the managed resource.
- A number of effectors that are used by the autonomic manager to perform operations on the managed resource.
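As a rough illustration, the four components listed above might be sketched in Python as follows. Everything here is hypothetical and chosen only for illustration: the `ManagedResource` fields, the sensor and effector names (`load`, `replicas`, `scale`), and the simple capacity rule inside `step` are assumptions, not part of any particular autonomic framework.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ManagedResource:
    """Hypothetical managed resource: a service with replicas and a load level."""
    replicas: int = 1
    load: float = 0.0  # illustrative unit: requests per second


class AutonomicManager:
    """Implements one pass of a control loop using only sensors and effectors."""

    def __init__(self,
                 sensors: Dict[str, Callable[[], float]],
                 effectors: Dict[str, Callable[[int], None]],
                 max_load_per_replica: float):
        self.sensors = sensors
        self.effectors = effectors
        self.max_load_per_replica = max_load_per_replica

    def step(self) -> int:
        # Monitor: read the current state through the sensors.
        load = self.sensors["load"]()
        replicas = int(self.sensors["replicas"]())
        # Decide: an assumed rule, enough replicas to carry the load, at least one.
        needed = max(1, math.ceil(load / self.max_load_per_replica))
        # Act: change the resource only through an effector.
        if needed != replicas:
            self.effectors["scale"](needed)
        return needed


resource = ManagedResource(replicas=1, load=250.0)
manager = AutonomicManager(
    sensors={"load": lambda: resource.load,
             "replicas": lambda: resource.replicas},
    effectors={"scale": lambda n: setattr(resource, "replicas", n)},
    max_load_per_replica=100.0,
)
manager.step()
print(resource.replicas)  # 3
```

Note that the manager never touches `resource` directly; it only sees the callables it was handed, which is what makes the sensors and effectors the element's real interface.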
When this architecture is applied to autonomic service management, the following common features seem to emerge:
Decoupled Control
While autonomic elements can be designed to incorporate autonomic functionality directly into the resources that are being managed, it is preferred to logically separate the autonomic manager from the managed resource (very much the same way SDN logically separates network forwarding decisions from packet forwarding functionality in the data plane). This approach results in a consistent architecture for both new autonomic elements and legacy components that were not designed with autonomic functionality in mind. More importantly, this logical separation provides the flexibility required to change autonomic control functionality independently from the functionality provided by the managed resource.
Model-Based
In order to provide autonomic functionality, an autonomic element must have sufficient knowledge about itself and its own state. Such self-knowledge is typically implemented using a model-based approach where autonomic managers maintain a run-time model that reflects the configuration and state of the managed resource at all times (using synchronization techniques that leverage the sensors of the autonomic element). This allows the autonomic manager to make control decisions based solely on the state reflected in the model, without ever having to interact with the managed resource directly. Similarly, any changes resulting from control operations are made to the model first, and then propagated back into the managed resource.
The obvious benefit of this approach is that control logic can be designed based on standardized models without having to worry about technology and device-specific differences in the interfaces provided by the managed resources.
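To make the model-first flow concrete, here is a minimal sketch under stated assumptions. The `Device` class stands in for a managed resource with a device-specific interface, and `RuntimeModel`, `enforce_jumbo_frames`, and the `mtu` setting are all hypothetical names invented for this example.

```python
from typing import Any, Dict


class Device:
    """Hypothetical managed resource with its own device-specific interface."""

    def __init__(self) -> None:
        self._cfg: Dict[str, Any] = {"mtu": 1500}

    def read_cfg(self) -> Dict[str, Any]:
        return dict(self._cfg)

    def write_cfg(self, key: str, value: Any) -> None:
        self._cfg[key] = value


class RuntimeModel:
    """Run-time model that mirrors the device and buffers pending changes."""

    def __init__(self, device: Device) -> None:
        self._device = device
        self.state: Dict[str, Any] = {}
        self._pending: Dict[str, Any] = {}

    def sync_from_resource(self) -> None:
        # Sensor path: refresh the model from the managed resource.
        self.state = self._device.read_cfg()

    def set(self, key: str, value: Any) -> None:
        # Control decisions change the model first.
        self.state[key] = value
        self._pending[key] = value

    def propagate(self) -> int:
        # Effector path: push buffered changes back into the resource.
        count = len(self._pending)
        for key, value in self._pending.items():
            self._device.write_cfg(key, value)
        self._pending.clear()
        return count


def enforce_jumbo_frames(model: RuntimeModel) -> None:
    """Control logic written against the model only, never the device."""
    if model.state.get("mtu", 0) < 9000:
        model.set("mtu", 9000)


device = Device()
model = RuntimeModel(device)
model.sync_from_resource()
enforce_jumbo_frames(model)
print(device.read_cfg()["mtu"])  # 1500: the device is untouched so far
model.propagate()
print(device.read_cfg()["mtu"])  # 9000
```

Because `enforce_jumbo_frames` only ever sees `RuntimeModel`, swapping in a different device type would mean changing the two methods that talk to the device, not the control logic.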
Hierarchical
The autonomic architecture as described earlier structures autonomic systems as collections of autonomic elements and their relationships. It is typical to use a hierarchical design pattern for structuring autonomic systems where each autonomic element can in turn be structured as an autonomic system in its own right that consists of finer-grained autonomic elements. This supports the recursive decomposition approach that I’ve described in an earlier blog post.
The benefit of this type of recursive structure is that it allows for hierarchical control. In a hierarchical control system, each autonomic system controls the behavior of all the autonomic elements it contains directly. This creates an “outer control loop” for the autonomic elements. However, each of these autonomic elements can in turn implement a separate “inner control loop” for its own autonomic elements. Coordination between the inner and outer control loops is done using the sensors and effectors exposed by the autonomic elements.
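The nesting of outer and inner loops can be sketched as a recursive structure. This is only an illustration under assumptions of my own: the `Element` class, its capacity sensor, its target effector, and the greedy way demand is split among children are all hypothetical, and a real system would use a far richer control strategy.

```python
from typing import List, Optional


class Element:
    """Hypothetical autonomic element that may contain child elements."""

    def __init__(self, name: str,
                 children: Optional[List["Element"]] = None,
                 capacity: int = 0) -> None:
        self.name = name
        self.children = children or []
        self.capacity = capacity  # meaningful for leaf elements only
        self.target = 0

    def sense_capacity(self) -> int:
        # Sensor exposed to the parent: total capacity of this element.
        if not self.children:
            return self.capacity
        return sum(child.sense_capacity() for child in self.children)

    def set_target(self, target: int) -> None:
        # Effector exposed to the parent: how much demand to serve.
        self.target = target

    def control_step(self) -> int:
        """Split this element's target among its direct children (greedy).

        Returns the demand that could not be placed on any child.
        """
        if not self.children:
            return 0
        remaining = self.target
        for child in self.children:
            share = min(remaining, child.sense_capacity())
            child.set_target(share)   # outer loop acts via the child's effector
            remaining -= share
            child.control_step()      # the child runs its own inner loop
        return remaining


a1, a2 = Element("a1", capacity=2), Element("a2", capacity=3)
b1 = Element("b1", capacity=4)
service = Element("service", children=[
    Element("cluster-a", children=[a1, a2]),
    Element("cluster-b", children=[b1]),
])
service.set_target(7)
unmet = service.control_step()
print(a1.target, a2.target, b1.target, unmet)  # 2 3 2 0
```

The top-level element only addresses its two clusters; how each cluster spreads its share over its own nodes is decided one level down.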
Policy-Based
While the control loop provided by the autonomic manager can be structured based on control theory concepts, it is more common to adopt policy-based control systems that consist of sets of rules with which the system needs to comply. Since policies can exist at each level of the hierarchy, the system must automatically reconcile and/or translate policies specified at different levels using the Policy Continuum concept I described in an earlier post.
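A policy-based control loop can be approximated as an ordered set of condition/action rules evaluated against observed state. The sketch below is a simplified assumption rather than any specific policy engine: the `Rule` type, the `apply_policies` function, and the two sample rules are hypothetical, and it does not attempt the cross-level translation that the Policy Continuum addresses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical policy rule: if the condition holds, apply the action."""
    name: str
    condition: Callable[[State], bool]
    action: Callable[[State], State]


def apply_policies(state: State, rules: List[Rule]) -> Tuple[State, List[str]]:
    """Evaluate rules in order; return the resulting state and fired rule names."""
    current = dict(state)
    fired: List[str] = []
    for rule in rules:
        if rule.condition(current):
            current = rule.action(current)
            fired.append(rule.name)
    return current, fired


rules = [
    # Operational policy: add a replica when CPU utilization is high.
    Rule("scale-out",
         lambda s: s["cpu"] > 0.8,
         lambda s: {**s, "replicas": s["replicas"] + 1}),
    # Higher-level constraint: never run more than five replicas.
    Rule("cap-replicas",
         lambda s: s["replicas"] > 5,
         lambda s: {**s, "replicas": 5}),
]

new_state, fired = apply_policies({"cpu": 0.9, "replicas": 5}, rules)
print(new_state["replicas"], fired)  # 5 ['scale-out', 'cap-replicas']
```

Ordering the cap after the scale-out rule is one crude way to let a higher-level policy override a lower-level one; a real system would need an explicit reconciliation step.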
What I find interesting is that these characteristics are similar to those of the more advanced service orchestration systems (such as TOSCA-based orchestrators). Does this mean that it might be possible to evolve these orchestration systems into full-fledged autonomic service management systems? I hope to explore this question in a future post.