“Automate or die” could be the slogan for building a scalable cloud UC platform. Cloud UC platforms that combine voice, video, mobility, web, contact center, and analytics functionality with data and telco networks can get pretty complicated. Trying to manage a multi-tenant platform spread over multiple data centers and hundreds or thousands of virtual machines is simply not possible without the heavy use of automation. Running a cloud service at scale means you can’t have sys admins running around doing things by hand. Not only will they not be able to get the job done in any reasonable amount of time, but the end product will be rife with manual errors, discrepancies, and supportability issues.
The good news is that there are very good automation tools available these days, much better than existed even five years ago. Some of these tools are open source with strong communities around them. In this post and one that will follow, I will focus on server automation. Virtual servers are the most numerous infrastructure elements in our platform and they are the most complicated in terms of their configuration, so it follows that they are the most important elements to wrap with good automation. I generally stay away from the Chef vs. Puppet vs. CFEngine religious wars people get wrapped up in. To me, it matters much less which tool you choose (we chose Chef) than that the tool you choose supports four key requirements.
1. A pull- vs. push-based architecture. You want the servers in your environment to pull their configs on a regular basis from a set of distribution servers, instead of trying to push configs from a central point out to your servers. Previously, I worked extensively with a product that had a push-based architecture and I found it much harder to scale that type of architecture as you grow the number of servers under management. You also run into all sorts of issues with simple things like a host being down or unreachable when you go to push. Do you then remember to push to the failed host again later? A pull model is much more resilient to these types of failures and naturally scales: you can put more distribution servers in place and load balance inbound requests across them as you grow.
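In Chef, the pull cadence is set on each node in its client config. A minimal sketch (the URL, node name, and timing values below are hypothetical, not our production settings):

```ruby
# /etc/chef/client.rb -- illustrative values only
chef_server_url 'https://chef.example.com/organizations/myorg'  # hypothetical distribution point
node_name       'app01.dc1.example.com'                          # hypothetical node

# Run chef-client as a daemon that pulls on a fixed interval,
# with random splay so a large fleet doesn't converge on the
# server at the same instant.
interval 1800   # seconds between pull runs
splay    300    # add 0-300s of random jitter to each interval
```

If a node is down at pull time, it simply catches up on its next run when it comes back, which is exactly the resilience the pull model buys you.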
2. A fine-grained configuration vocabulary that can be parameterized. You need to be able to express any type of configuration in the tool, from an entire service configuration down to an individual change within a text-based configuration file. Image-based approaches may be tempting in their simplicity, but will really only work well for initial server provisioning, not for ongoing management of that server, which often requires fine-grained change. Further, you need to be able to support variables or parameters that will vary from server to server. Otherwise you won't be able to automate many of the interesting bits of your configuration.
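In Chef, this kind of fine-grained, parameterized change is typically a `template` resource that renders one text config file from node attributes. A sketch, with hypothetical file paths, template name, and attribute keys:

```ruby
# Hypothetical recipe fragment: render a single config file,
# filling in per-server parameters from node attributes.
template '/etc/myapp/myapp.conf' do
  source 'myapp.conf.erb'          # hypothetical template in the cookbook
  owner  'myapp'
  group  'myapp'
  mode   '0640'
  variables(
    port:      node['myapp']['port'],      # varies per server or role
    db_host:   node['myapp']['db_host'],   # varies per data center
    log_level: node['myapp']['log_level']
  )
  notifies :restart, 'service[myapp]', :delayed
end
```

The same recipe produces a different rendered file on every server, driven entirely by each node's attributes.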
3. Declarative vs. procedural blueprints of your configurations. A lot of people focus on the initial provisioning of a server using, say, a bash script. But then that script and all the knowledge in it can’t be used to manage the server over time, and your servers start drifting from their ideal state as soon as they are provisioned. A declarative approach states what the end goal configuration state should look like on the server, and it’s up to the automation tool to make the changes necessary to achieve that state. It’s a subtle distinction, but it means that you can use a developed blueprint both for initial provisioning of the server, and also on an ongoing basis to keep the configuration from drifting, and to effect change and maintain that server over time.
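A Chef recipe is exactly this kind of declarative blueprint: you state the end state, and chef-client converges the node to it on every run. A minimal sketch (the template name is hypothetical):

```ruby
# Declarative blueprint: "ntp is installed, configured, enabled,
# and running." chef-client only makes changes where the node's
# actual state differs from this description.
package 'ntp'

template '/etc/ntp.conf' do
  source 'ntp.conf.erb'            # hypothetical template name
  notifies :restart, 'service[ntp]'
end

service 'ntp' do
  action [:enable, :start]         # start at boot, and be running now
end
```

Because the recipe describes state rather than steps, the very same blueprint provisions a fresh server and corrects drift on an existing one.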
4. Ability to group your servers in flexible ways and attach blueprints to those groups. Grouping servers according to environment, data center, function, type, version, discovered attributes, etc. is key, as those groups will serve as targets for automation actions. Attaching a set of blueprints to a group of servers should trigger the system to automatically make sure the configuration state described in the blueprints is realized on the target servers. This is how you move from dealing with one server at a time to entire groups or classes of servers. If the group contains five servers or 5,000 servers, the human labor is similar.
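In Chef, roles are one way to express this grouping: every node assigned the role gets the same run list of blueprints. A sketch, with hypothetical role and cookbook names:

```ruby
# Hypothetical role file (roles/web.rb): attach a set of blueprints
# to every server in the "web" group via a shared run list.
name 'web'
description 'Front-end web tier'
run_list 'recipe[base]', 'recipe[myapp::web]'   # hypothetical cookbooks
default_attributes(
  'myapp' => { 'log_level' => 'info' }
)
```

These groups can also be queried at converge time, e.g. `search(:node, 'role:web AND chef_environment:production')`, so one recipe can discover all of its peers, whether the group holds five nodes or 5,000.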
The obvious benefits of our adoption of a tool like Chef are that our teams are able to manage many more servers than we could before, our environments are more consistent and stable, and our platform is much more responsive: we can bring new capacity online without a lot of manual labor. Even building out entirely new data centers is vastly easier if you already have blueprints developed for your platform components.
The more subtle benefit is that our internal alignment between development and operations has improved. You will sometimes see this combination of dev and ops around an automation tool like Chef referred to as “devops.” In adopting a devops approach, we have found that the rate at which we are able to release features has dramatically improved. I’ve always maintained that running an effective cloud service means that your dev and ops teams have to be in close alignment. Our use of Chef has been immensely helpful in achieving this goal, and I would recommend it to anyone trying to run a cloud service at scale.