Chef Delivery at Ooyala

Ooyala is a Telstra subsidiary and a leading innovator in premium video publishing, analytics and monetization. For the past year and a half, Ooyala has been using Chef to configure and provision its deployment framework, Atlantis. Atlantis is an open-source platform as a service (PaaS) for HTTP applications. It is built on Docker and written in Go.

Ooyala is beginning to use Chef Delivery to deploy the Atlantis framework itself. The application platform team, which developed Atlantis, is driving the project. In this Q&A, we talk to Cædman Oakley, DevOps Evangelist, about the initial phases of the transition from a largely manual release process to an automated one based on Delivery.

Chef: Could you give us an update on where you are in the project?

Cædman: We have one Atlantis component that is currently ready to be deployed by Delivery. We've not yet deployed all the way out to production due to some integration issues with our current Chef infrastructure.

The goal is to have at least some of the components of the Atlantis platform use Delivery by the end of this year. We also want to have templates written by the end of the year to deploy applications that sit on top of Atlantis. Those templates will help other teams get up to speed with Delivery more quickly.

Come Q1, we will start moving some teams onto Delivery. These are teams that use Atlantis apps or write Atlantis apps. We'll be getting the entirety of Atlantis onto Delivery during Q1, as well.

Chef: What is the difference between an app and a component?

Cædman: When I talk about components, I'm talking about the services that actually run Atlantis, that make up the platform. When I talk about the apps, I'm talking about the services that use Atlantis as their deployment and Docker management platform.

Here's an example. We have a thing called a supervisor. A supervisor looks after all the Docker containers that are running on a particular machine. We have another micro-service that runs in a Docker container on that machine. So, the micro-service would be the app. The supervisor would be the component.

Chef: Can you walk us through what happens as something moves through the Delivery pipeline?

Cædman: Every component and every app is in its own repo. Each component and each app has one Delivery pipeline. At the moment we don't have multiple pipelines for any app or component. We might introduce multiple pipelines later.

We begin with local development and use short-lived feature branches. We're using Go, so our test suite is basically make test. Unit tests and Verify tests are the same (that is, the tests in the Verify stage are the same as those used for local development).
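As a sketch of how that identity between local and Verify tests might be wired up, a unit-phase recipe in a Delivery build cookbook can simply shell out to the same Makefile target the developers run. The workspace attribute is standard Delivery convention; the rest is illustrative:

```ruby
# .delivery/build-cookbook/recipes/unit.rb -- a minimal sketch.
# Run the same `make test` the team uses locally, so the Verify
# stage and local development exercise identical tests.
repo_dir = node['delivery']['workspace']['repo'] # standard Delivery attribute

execute 'run-unit-tests' do
  command 'make test'
  cwd repo_dir
end
```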

When we submit the change with delivery review, it's pushed out to Stash, which is where we do our code reviews. We have a very flat structure, so anyone on the team can review the changes. We're doing the Stash integration right now with the plug-in we just got from Chef.

The Build stage kicks off a hook in Delivery that builds with our Atlantis builder. The result is a Docker container with a unique ID. We publish into the Acceptance stage and into our package store. Build also verifies that the package was correctly created.
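A rough sketch of what a publish-phase recipe along these lines might look like follows. The atlantis-builder flags, the registry host, and the pull-back verification step are illustrative assumptions, not Ooyala's actual interface:

```ruby
# .delivery/build-cookbook/recipes/publish.rb -- illustrative sketch only.
# The atlantis-builder flags and registry host are assumptions.
repo_dir  = node['delivery']['workspace']['repo']   # standard Delivery attribute
change_id = node['delivery']['change']['sha']       # unique ID for this change

# Build the Docker container for this change with the Atlantis builder.
execute 'atlantis-build' do
  command "atlantis-builder --source #{repo_dir} --tag #{change_id}"
  cwd repo_dir
end

# Publish the result to the package store.
execute 'publish-container' do
  command "docker push registry.example.com/atlantis/supervisor:#{change_id}"
end

# Verify the package was correctly created by pulling it back.
execute 'verify-package' do
  command "docker pull registry.example.com/atlantis/supervisor:#{change_id}"
end
```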

For Acceptance, we've got some skeletal tests for the supervisors. Once it passes Acceptance, we hit the Deliver button and it goes through the final stages. At the moment we don't deliver to production, so this allows us to really work on the robustness of our test suites, including code-coverage metrics.

Any dependencies in Acceptance are drawn from Union, the shared stage that follows Acceptance in the Delivery pipeline. If one app is dependent on another app, we hook our Acceptance environment to Union for its dependencies, in order to make sure that we don't run into cross-testing. The same thing is true for components.
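In Chef terms, one plausible way to wire that up is to resolve a dependency's endpoints from the Union environment while a change is being tested in Acceptance. The stage attribute below is standard Delivery metadata, but the search query, the role name, and the attribute paths are illustrative assumptions:

```ruby
# Sketch: in Acceptance, resolve an upstream dependency from Union so
# two in-flight changes are never tested against each other.
# The role name 'billing-api' is a hypothetical example.
if node['delivery']['change']['stage'] == 'acceptance'
  upstream = search(:node, 'chef_environment:union AND role:billing-api')
  node.run_state['billing_api_hosts'] = upstream.map { |n| n['ipaddress'] }
end
```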

Chef: What release process will Delivery replace?

Cædman: We have a standard, lightweight release process that has very little auto-promotion. There are some teams that use auto-promotion, but the majority run a more manual process. The auto-promotion is done using Jenkins. The branch is pulled using the GitHub Pull Request Builder, which kicks off a set of test jobs. If all of these pass, Jenkins deploys to the next environment and kicks off the downstream set of jobs. We deliberately don't go to production with this, in order to allow manual testing and rigorous QA processes to happen.

In the manual case, we have lots of scripts, such as Capistrano scripts, and we have many Atlantis deploys. We have strong auditing, so we can allow people to deploy from their own laptops, but we much prefer to deploy from an actual deployment box. Our legacy code has its own support as well: the availability and site reliability engineering (SRE) teams help out with deploying it.

We did put a change management process in place to improve things and, to be fair, we've gone from a rollback every 23 days, on average, to, I think, four in the last three years. But the process is not particularly scalable. Everybody has to jump through the same set of hoops. It's about as little friction as you could get from a manual process, but it's still more friction than you would want.

Now, we're replacing that release process with Delivery and telling people, "OK, you have the review process, you know that once you go into Delivery and hit 'Deliver' it's going out to production." It's an auto-promotion system. It's dependent on the team leads and business owners themselves to make sure everything is reviewed correctly.

It's a real mind shift going from a manual process where everyone can do their own deployments to Delivery, where anyone can deploy anything but it has to go through this automated pipeline.

Chef: Have you thought about some approaches to continuous delivery (CD) that you either want to take or want to avoid?

Cædman: Typically, when you adopt CD, there are large portions of time that are lost to preparing the code. You end up having to write all these tests that you didn't have, from scratch. That's great from a technical point of view, but it doesn't take into account any of the business drivers. You can't lose three months on a product to get all the testing (smoke, functional, unit, acceptance, compliance, security, lint, and syntax) in place. We need to find some balance.

We do have a requirement to have some amount of code coverage that, at least eventually, will equal the coverage we had before Delivery. One of the things we've been talking about is how to expose the coverage stats to Delivery.
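As a concrete sketch of what exposing coverage to the pipeline could look like for a Go codebase, the recipe below runs go test -cover in a build-cookbook phase and fails the run under a floor. The 80% threshold, the resource name, and the output parsing are illustrative assumptions:

```ruby
# Sketch: surface Go coverage in a build-cookbook phase and enforce a floor.
require 'mixlib/shellout'

repo_dir = node['delivery']['workspace']['repo']

ruby_block 'enforce-coverage-floor' do
  block do
    cover = Mixlib::ShellOut.new('go test -cover ./...', cwd: repo_dir)
    cover.run_command
    cover.error! # fail the phase outright if the tests themselves fail

    # `go test -cover` prints lines like:
    #   ok   example.com/pkg   0.012s   coverage: 87.5% of statements
    results = cover.stdout.scan(/coverage: ([\d.]+)%/).flatten.map(&:to_f)
    weak = results.select { |pct| pct < 80.0 } # illustrative threshold
    raise "coverage below 80%: #{weak.inspect}" unless weak.empty?
  end
end
```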

Also, during the transition, we'll run the manual process in parallel with Delivery so that we're still able to release as people learn the new system.

Chef: What have been some of your pain points?

Cædman: We're fairly special in that we need to be able to pass IDs of containers around between nodes, no matter which of the stages you're on. Most people only need to pass configuration data between the phases of a particular stage, not back and forth between the stages themselves. The Delivery team did some work to make that possible.

We needed this feature because, at the moment, we have Chef servers that are in a different subnet and availability zone than our Delivery system. Our Delivery system exists in EC2, which is perfectly fine, but our canonical repos exist in our own data centers. We've had to build out bridges between the data centers in order to make Delivery talk and play nicely. There are workarounds we could have used, such as SSH forwarding, to get around some of this, but they're not the cleanest approach.

Security is very important here because you're talking about information that's strictly internal, including full copies of your source code. You want to make sure that nobody who's just standard VPN-ing in has access to Delivery. Yes, there's authentication, but if you have Delivery being managed by Chef, and that Chef server has any security hole at all, such as allowing you to SSH to any machine once you're inside the VPN, then you're broken.

Chef: Are there any pain points in terms of the actual implementation?

Cædman: There are two. One, and I think this will be true for other companies as well, is that the workflow is going to change. We're not delivering to Stash, approving in Stash, and having Stash web hooks push to Delivery. We're actually doing a Delivery review and having Stash be a mirror of the repo in Delivery. While Stash does become the canonical repo, this is still a slightly different workflow than most people are expecting.

The second point is that every repo and every product is written differently so it's incumbent upon the teams who are pushing out to Chef Delivery to write libraries and templates that can be reused. Our intent here is to make these sub-modules that other teams can include. That way other teams don't have to copy/paste recipes and cookbooks. When you do a copy/paste you're going to get divergent recipes.

The templates need to work more like an import rather than having people say, "Oh, I cut it from here and I paste it here," because then everybody works in their own repo and the point of those libraries disappears.
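To make the "import, don't paste" idea concrete, here is one shape such a shared template could take as a Chef custom resource. The cookbook name atlantis-delivery, the resource interface, and the builder flags are all hypothetical:

```ruby
# Shared cookbook 'atlantis-delivery' (hypothetical), resources/publish.rb.
# Wraps the common build-and-publish steps once so teams import them
# instead of copy/pasting recipes.
resource_name :atlantis_publish

property :app_name, String, name_property: true
property :tag, String, required: true

action :run do
  execute "atlantis-build-#{new_resource.app_name}" do
    command "atlantis-builder --app #{new_resource.app_name} " \
            "--tag #{new_resource.tag}"
  end
end
```

A consuming team would then declare depends 'atlantis-delivery' in its build cookbook's metadata.rb and import the behavior rather than pasting it:

```ruby
# A product team's .delivery/build-cookbook/recipes/publish.rb (sketch)
atlantis_publish 'my-microservice' do
  tag node['delivery']['change']['sha']
end
```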

Chef: Given your title of DevOps Evangelist, what sort of evangelizing have you done throughout Ooyala?

Cædman: On the Delivery front, we've done two engineering-wide demos. We did one demo before we started to adopt Delivery that showed what was going to happen. We did a demo about four weeks ago to show where we currently were. We also ran a proof-of-concept project with Chef during the pre-sales process and presented it to upper management and the technical leads within engineering.

Once we finish this quarter, when we've got the Stash integration and we're actually pushing things through the pipeline, we'll be doing a lot more demos and evangelizing.

Right now, I'm working at the management level and team lead level. I'm sending out weekly emails and making sure people are aware this is going on. We're doing a lot of socialization and prepping people, and we're pushing this into our quarterly planning as well, to make sure people know that Delivery is coming and that they need plans in place for automating unit tests and getting code coverage to whatever percentage we decide on.

Chef: Do people come to you and say, "How?" or "I have no idea how to do any of this?"

Cædman: Yes, they do. Most of the time, my response is that the application platform team is writing a bunch of templates that teams can simply include in their source code. We will also do skill shares and more demos starting in January.

Chef: It sounds like a lot of education is going to have to happen.

Cædman: Yes, but I think that's true of any new system you put in place.

Chef: It may be premature, but do you have any advice you'd like to share, given what you've learned so far?

Cædman: I think the biggest thing is that the team that builds out the Delivery cluster should be a development team that will actually use Delivery to develop and deliver their applications.

You learn far more lessons with this approach. We were initially going to give this project to a systems team that doesn't develop applications, but we figured out that every organization has many common practices, even if they're just manual. You can automate those practices and probably cover 80% of your organization.

Building out a library for Docker deployment, or for a standard Capistrano deployment, is the key to this coverage. It's a huge gotcha if the libraries don't already exist, but you should build them anyway. If you can eat that pain up front in a very lean and DevOps way, longer term it's going to be a lot easier.

I see libraries and templates as a self-service tool. They don't have a nice little UI but you can use them the way you need to. That means you have to have a certain amount of what I call nous, or internal expertise. If you have that, you can actually build anything the company needs. If you don't have that, then you end up in a bottleneck where Delivery is hampered by people's time being consumed by repetitive tasks, as opposed to letting them get code to the customer. I want to get the code to the customer.

Another lesson is that your Chef Delivery system should be in the same places as your source control and your standard Chef service. Delivery is not, for example, just another repo server that you can throw somewhere. There needs to be some network design that goes into this, upfront.

Chef: Do you think that, within a team, you need a specialist to write the Delivery build cookbooks?

Cædman: I see it as a team effort and so does the team. In fact, that's why I'm saying we should create libraries and templates. Then that repo, with those libraries and templates in it, can be imported, and whatever build cookbook a given product needs can be built by that product's team using the templates and libraries.

It is non-scalable to have a single point of failure. That means it's non-scalable to have a single person or a single team member or a specialist who writes the cookbooks. Do we think that we need a Chef specialist? Yes, we do. That person should be a technical lead, not someone who we make solely responsible for writing cookbooks. The whole point of the DevOps model is to make sure that a team can do everything, from start to finish.

Chef: How is the application platform team feeling about the work, so far?

Cædman: They lurch back and forth from, "Hey, this is going to be awesome," to "I can't believe we have to do all of this." It is a little harder than we were hoping, but that's because we're doing things like integrating a third-party repository server.

Also, we're trying to do all of this underlying work so that the other engineering teams don't have to. They will just be able to use Delivery.

When we're done, we'll have a pipeline that provides a uniform process across all the engineering teams. Even the standard Chef stuff, like configuration management, will be going through that pipeline. Applications, configurations and components will all be going through the same pipeline. Any changes to Chef itself will go through the same pipeline.

The overall feeling of the team is that Delivery will improve engineering efforts across the entire Ooyala engineering organization.

Next in this series: Chef at Criteo

Learn how Criteo uses Chef to manage its infrastructure.
