What Is DevOps: How To Build An Efficient DevOps Team

For many years, I’ve been helping teams adopt and implement DevOps principles. The word DevOps gets thrown around a lot these days. I’m a little guilty of overusing the word myself — take a look at my Docker and Kubernetes courses — but I’m a huge fan of the DevOps movement (coined by Patrick Debois), and I’ve seen it change so many organizations and lives for the better.

Rather than define DevOps from a theoretical point of view, I’m writing from what I know and see happens in the teams I work with. In this article, I’m focusing on areas that should help you immediately advance your DevOps knowledge. First, I’ll list and detail what DevOps is and isn’t. Then, I’ll go through my DevOps evaluation to help you assess how “DevOps-y” you and your team are.

What DevOps is

At its core, DevOps is a culture of cross-team collaboration and measured improvement in the software delivery process. It started as a shared responsibility model for developers and operators. But now, it often includes security (DevSecOps) and business teams (BizDevOps). In DevOps, a team works together to remove silos, improve collaboration, and create shared ownership over the continuous improvement of the software they build and support.

1. DevOps centers software creation

DevOps requires the process of creating software to become as important as the software itself. Rather than exclusively focusing on “new customer-facing features,” the staff should also prioritize improving the system. DevOps-specific metrics highlight problems in those systems, so leadership can see when they need to prioritize “non-feature work.”

2. DevOps is driven by speed (velocity)

This velocity allows you to deliver a better product to your customer and continually improve that product through feedback and automation. From product idea to product release, then repeat. This is where the ♾ infinity logo for DevOps comes from — it’s a never-ending cycle of improvement and increasing your release velocity.

3. DevOps implementation is a combination of activities

This includes new processes and organizing people in new ways. DevOps expects you to create a new culture and implement new technology to help automate and highlight the important aspects of the application lifecycle. It covers a lot of areas and affects way more than engineers.

4. DevOps is driven by data and metrics

A DevOps culture measures its success with metrics so it can make educated decisions about what works and what to do next. Four common metrics are Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service. These are explained below.

What DevOps is not

It’s not a series of checkboxes for “tools to implement.” Implementing Docker won’t make you DevOps any more than implementing Jenkins will. Sure, technology can help achieve a higher level of DevOps maturity, but I’ve also seen these DevOps tools misused, making systems difficult to use and fragile to change. Tools must be implemented in a way that improves your DevOps metrics.

1. You’re never “done” with DevOps

It isn’t a destination — it’s a journey of improving your mindset, the organization’s processes, and technology automation, all while seeking feedback more often.

2. DevOps is not just a better project plan

If we’ve learned anything in 50 years of making software, it’s that the more you try to detail the plan before you start, the harder it is to stay agile in ever-changing requirements. DevOps should help you find a middle ground where you can be flexible to change while planning for the work required. A big part of DevOps is creating smaller updates (more often) to reduce risk and acquiring feedback early (and often) to limit re-work.

3. DevOps isn’t afraid of failure. It embraces it

We’re human and naturally inclined to avoid failure. That tendency leads to a lack of skills and processes to recover from failure. Rather than making failure avoidance the top priority, DevOps prioritizes quick recovery through a combination of improved detection and recovery automation. DevOps also prioritizes metrics for measuring the mean time to detect a problem (MTTD) and the mean time to repair it (MTTR).

4. It’s not “vanity metrics”

Speaking of metrics, DevOps is a big fan. We should measure our success with metrics. DevOps is not, however, a fan of all metrics. Avoid vanity metrics that encourage bad habits or make us feel good without showing if our DevOps abilities are maturing. Vanity metrics include ones like “number of tickets closed” or “number of features built.” These metrics track that people are doing work, sure, but is it the right work? Actionable metrics like “deployment frequency” (DF) and “mean lead time for changes” (MLT) highlight the speed at which the org is improving its processes and automation.

5. DevOps is not a job title

This one is tricky because my generic consulting title is often DevOps <something> (as seen in my last article, Kubernetes vs. Docker). Oops! That title can be as confusing as the title of “Agile Engineer,” which makes no sense. DevOps, like Agile Software Development, isn’t a job role. Everyone in the software organization needs to be on board with the DevOps objectives for it to work. One (or a few) people labeled as “the DevOps Engineer” usually means that everyone else is ignoring the required collaboration and changes they need to make as well. Better titles might be “QA Engineer,” “Reliability Engineer (SRE),” or “Build Engineer.” These job roles (and many more) control a piece of the software lifecycle that affects your DevOps results.

Advantages of DevOps

I’ve helped dozens of teams make progress in their DevOps maturity. I teach all my courses like Docker Mastery and Kubernetes Mastery with that mindset. Here’s my list of big benefits that I see happen once teams put in months of effort to get started:

  • It naturally avoids process rigidity. Mature DevOps allows your team to take detours and u-turns at any point in the release lifecycle.
  • It prevents any one person from being the single point of failure. Self-service systems are encouraged, and people are more empowered to move work forward because automation tools and testing are integrated into many processes.
  • It raises the visibility of your work to everyone who wants to know. Automation logs the changes for everyone to see. Versioning systems track nearly everything. The Pull Request workflow becomes a culture for you to design a change and then seek review and approval, in everything from code, to infrastructure and more.
  • It promotes shared knowledge and cross-training. Information silos are discouraged, and sharing your knowledge is highly valued by teammates and management. People are valued by how much they share and teach, rather than just what they know.
  • It iterates small changes very often. An org with a high level of DevOps-ness will be able to ship products and changes more often with a higher degree of success. Continual improvement of automation and testing allows the pace of change to trend upward while also reducing risk and downtime.
  • It enables faster recovery. DevOps designs for quick failure recovery as part of the workflow. Rollbacks and repairs are short and painless. Automation can provide auto-rollback when failed changes are detected.
  • It incorporates continual feedback and improvement. Feedback becomes part of every process, both automated (testing) and human. Confidence grows, creating better team cohesion and improved agility to ever-changing conditions.
  • It reduces toil and increases team (and customer) happiness. “Toil” is manual work that is draining, time-consuming, and often repetitive. It’s likely related to testing, debugging, admin commands, and other manual activities that we now have ways to automate or turn into self-service tools for others. For example, having developers run tests by hand rather than automating the process is toil. Read more about toil and how to eliminate it over at the Google SRE book.

Bret’s DevOps maturity evaluation

I’m a fan of assessing your organization’s DevOps practices. Assessment helps you determine where you are and which areas need the most attention. It’s hard to know where you need to go unless you know where you’re starting from.

DevOps casts such a big net over various parts of your organization that it’s difficult to look from the outside in and know where the organization stands in its maturity. You’ll need some sort of evaluation to help.

There are a lot of results when you search “DevOps maturity model,” and there are two basic types of assessment available. It’s either so generic that it isn’t really valuable for taking action — or it’s so detailed that it will take hours or days just to finish the evaluation.

I’ve created a mini-evaluation for you to do right now. I’m calling it Bret’s DevOps Maturity Evaluation. B-DOME for short, hah! For each of the three topic areas, there are five questions with up to 75 points across the evaluation. For each question, I give you three basic choices:

  • We have no DevOps-ness for this yet: 0 points
  • We’re doing some things right: 3 points
  • We’ve DevOps’ed all the things: 5 points

DevOps evaluation part 1: Culture

1. Production outage handling

This is a key indicator of having a growth culture that is central to DevOps maturity.

  • Outage postmortems are rare, and blame is common: 0 points
  • Outage postmortems are regular, and blame is never assigned: 3 points
  • Postmortems include feedback from other teams. Blame is never assigned. They commonly lead to better automation and improved team performance: 5 points

2. Reducing toil

Automating and removing daily human interaction with healthy systems.

  • Humans do many repetitive tasks as part of normal development and IT operations. Humans aren’t held accountable for automating manual operational tasks: 0 points
  • Automation is encouraged, but toil is unmeasured. User features are often prioritized over operational improvements: 3 points
  • Toil is measured and makes up less than 50% of operations work. Management respects and encourages efforts to automate work as much as user features: 5 points

3. Empowering others with knowledge sharing

How much is sharing knowledge and cross-training encouraged? Is “tribal knowledge” actively resisted?

  • Only basic documentation exists (wiki, READMEs, etc.), and it isn’t consistently updated. It’s hard for new team members to get up to speed on their own: 0 points
  • Training is multimodal and routine (recorded tutorials, lunch-and-learns, code pairing). New team members learn a lot from these: 3 points
  • Training is multimodal, routine, expected, and supported from the top as a priority activity. Evaluations reflect your commitment to sharing and people go out of their way to reduce tribal knowledge. A formal process exists for new team members to learn from existing records and provide feedback on what’s lacking: 5 points

4. Self-service tool creation

How much are staff enabled by other staff to help themselves? Are teams encouraged to create tools for other teams that remove their need for involvement in normal workflows? Is it common for staff to “protect their turf”? This is usually linked to the answer to the previous question.

  • Information and control are siloed, and we often need someone else to configure, deploy, or provision stuff during normal operations: 0 points
  • We have a few self-service tools, but they lack rigor and top-down support in becoming quality products: 3 points
  • Self-service tools are common and given similar priority as products for external customers. The outcomes of reducing toil commonly create new self-service tools and they are regularly improved based on feedback: 5 points

5. Who’s on-call?

  • On-call staff are not usually developers, and app maintainers are not usually on-call for the app’s production systems: 0 points
  • App maintainers are part of the on-call process. They respond to outages of their own products: 3 points
  • SRE culture is alive. App maintainers are on-call. Operations engineers are developers and commonly commit code to improve app stability and performance: 5 points

DevOps evaluation part 2: Process

1. Release cadence

The rate of change that you’re delivering to customers, also called “deployment frequency.” I’m assuming that your software is in active development. A healthy DevOps practice is working to release smaller updates more often.

  • A month or longer: 0 points
  • Days to weeks: 3 points
  • Multiple times a day: 5 points
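
If you want to put a number on this, here’s a minimal Python sketch that turns a list of production deploy timestamps into the 0/3/5 score above. The function name and the exact cutoffs (an average gap under 12 hours counts as “multiple times a day,” under 30 days counts as “days to weeks”) are my assumptions for illustration, not part of the evaluation itself.

```python
from datetime import datetime, timedelta


def release_cadence_points(deploy_times: list[datetime]) -> int:
    """Map the average gap between production deploys to 0, 3, or 5 points."""
    if len(deploy_times) < 2:
        # Assumption: with fewer than two deploys there is no cadence to measure.
        return 0
    ordered = sorted(deploy_times)
    average_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    if average_gap < timedelta(hours=12):  # assumed cutoff for "multiple times a day"
        return 5
    if average_gap < timedelta(days=30):   # assumed cutoff for "days to weeks"
        return 3
    return 0                               # a month or longer


weekly = [datetime(2024, 1, 1), datetime(2024, 1, 8), datetime(2024, 1, 15)]
print(release_cadence_points(weekly))
```

Pull the timestamps from wherever your deploys are recorded (CI logs, a deploy table, chat notifications) and adjust the cutoffs to whatever your team agrees on.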

2. Change failure rate

How often do production changes require a fix? It’s also called the “defect escape rate.” DevOps works to lower the failure rate of changes while also increasing the rate of change. This is also related to MTTD and MTTR.

  • Fails over 15% of the time, or unknown: 0 points
  • Fails under 15% of the time: 3 points
  • Fails under 15% and improving MTTR is valued as much as improving the failure rate: 5 points
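
As a rough illustration, the failure rate is just failed changes divided by total changes. The sketch below assumes you can count both for a given period. The function names are hypothetical, and treating a rate of exactly 15% as a 0 is my assumption, since the choices above don’t say which side it falls on.

```python
from typing import Optional


def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Fraction of production changes that needed a fix."""
    if total_changes <= 0:
        raise ValueError("need at least one change to compute a rate")
    return failed_changes / total_changes


def change_failure_points(rate: Optional[float], mttr_valued_equally: bool) -> int:
    """Map a failure rate (None means unknown) to 0, 3, or 5 points."""
    if rate is None or rate >= 0.15:  # assumption: exactly 15% scores 0
        return 0
    return 5 if mttr_valued_equally else 3


rate = change_failure_rate(total_changes=40, failed_changes=4)
print(rate, change_failure_points(rate, mttr_valued_equally=False))
```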

3. Delivery lead time

The time it takes from committing a change (code or config) to it being available in production for the customer. This will relate to other questions but is a solid indicator of how automated and established your deployment workflow is.

  • Weeks or more: 0 points
  • Days to a week: 3 points
  • Hours to days: 5 points
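
Here’s a minimal sketch of measuring this, assuming you can pair each change’s commit time with the time it reached production. The function name and the data shape (a list of commit/deploy timestamp pairs) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def mean_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time from commit to production across (committed, deployed) pairs."""
    if not changes:
        raise ValueError("need at least one change to compute lead time")
    lead_times = [deployed - committed for committed, deployed in changes]
    # sum() needs a timedelta start value because the default start is the int 0.
    return sum(lead_times, timedelta()) / len(lead_times)


changes = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),   # 6 hours
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 19)),   # 10 hours
]
print(mean_lead_time(changes))
```

An average hides outliers, so it can be worth looking at the slowest changes too when you review the number.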

4. Time to recovery

The time it takes to get a major fix into the customer’s hands once it’s reported. It’s measured as Mean Time To Recover (MTTR) and is closely affected by Mean Time To Detect (MTTD). Both are standard DevOps metrics.

  • Days to weeks or more with many hands involved and manual effort beyond the typical release development process: 0 points
  • Hours to days, with a similar workflow to a standard release that’s CI tested and mostly automated: 3 points
  • Minutes to hours, with rollback and self-healing declarative options in many cases. MTTD and MTTR are tracked and given time for improvement: 5 points
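
To make the two metrics concrete, here’s a small sketch that computes them from incident records. The `Incident` fields are hypothetical, and measuring MTTR from detection to resolution is an assumption on my part. Some teams measure it from the moment the failure started instead, so pick one definition and stick with it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when someone (or something) noticed
    resolved: datetime  # when the fix reached customers


def _mean(durations: list[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from a problem starting to it being detected."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from detection to resolution (assumed definition)."""
    return _mean([i.resolved - i.detected for i in incidents])


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 10)),
    Incident(datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 30), datetime(2024, 1, 5, 15, 0)),
]
print(mttd(incidents), mttr(incidents))
```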

5. Secure supply chain

Increasing the confidence that the code and infrastructure you designed are the actual things delivered. There are lots of knobs to turn here, so I’ll focus on the most common parts of engineering control.

  • Many people can change something directly in the production environment without change management, versioning, or auditing: 0 points
  • Commits are signed. Pull request style testing and approval. Declarative systems are common. Artifacts are the standard deployment object (Docker/OCI images): 3 points
  • Commits are signed. Pull request style testing and approval. Declarative systems are everywhere. Artifacts are the standard deployment object (Docker/OCI images) and require promotion. People auth with 2FA/MFA. Machines mutually auth with each other. Each step of change workflow is auditable and cryptographically proven: 5 points

DevOps evaluation part 3: Technology

1. Decouple local dev environment

For better developer onboarding and increasing self-service, it’s best that your local environment requirements be as automated, streamlined, and generic as possible. For standards that are set (linting, testing, code quality), multiple popular options exist for developers to meet requirements.

  • We don’t support all major OSs (Win/Mac/Linux) or platforms (x86_64/arm64), have a complex setup process, and require a specialist who supports the dev environment tooling and data center: 0 points
  • We support all major OSs and platforms. The entire dev team can quickly set up and maintain the dev environment and its tooling: 3 points
  • Works in all major OSs, platforms, and editors, including web IDEs (e.g., GitHub Codespaces). Uses containers for local infrastructure. Daily workflows don’t require complex commands or custom scripts: 5 points

2. Rich set of “shift left” local dev tools

“Shift left” for security scanning, testing, and linting is important for empowering developers (and operators) to provide self-service to anyone committing a change to version control. The more opportunity someone has to evaluate a change before it’s pushed to a code repository, the fewer delays and rework there will be.

  • Local testing is very limited. CI pipelines, linting, and security scanning are created and/or controlled by only a select few people and executed after commits are pushed: 0 points
  • Local unit and functional testing is possible and encouraged before commit. Devs have limited access to edit CI pipelines. Some limited local linting or security scanning: 3 points
  • Every dev can run full CI, linters, and security scanning locally before commit push. CI pipelines are stored with code and editable by all. Pipeline workflows are updated as part of a source code change when required: 5 points

3. Infrastructure change tracking

Often, the maturity of infrastructure change management lags behind developer source code workflows. The devs will be rapidly releasing updates through CI and ready for continuous deployment, but the sysadmins haven’t adapted and are unable to keep things stable with an ever-increasing speed of change. If DevOps metrics are used to track change rate, stability, and recovery speed, then infrastructure tooling and ops change management processes have to keep pace. Much of that means adopting the workflow patterns of developers inside the infrastructure teams.

  • Infrastructure changes are not tracked automatically or in real-time and aren’t tested by CI before the change: 0 points
  • Most infrastructure is defined as code (YAML, TOML, JSON, etc.), versioned in Git, and CI tested: 3 points
  • Nearly all infrastructure changes are code, versioned, CI tested, peer-reviewed (pull request), and often reversible with minimal effort: 5 points

4. Deploy artifacts, not source code

Container images are the de facto Cloud Native deployment object. Once images are adopted as the common deployment artifact (including serverless), several major sources of Dev-to-Ops pain get significantly reduced. Environment drift becomes far less of an issue, deployment times shrink, and it becomes possible to invest in ephemeral systems for higher levels of stability and agility.

  • Little to no software is deployed as a container image: 0 points
  • Most software is deployed as a container image: 3 points
  • Container images are the standard for everything your teams do. Image tags are never reused for production workloads to ensure old images are not reused: 5 points
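
One way to check the “tags are never reused” rule is to look for any tag that has pointed at more than one image digest. The sketch below assumes you can export a deploy history as (tag, digest) pairs. The function name, the data shape, and the sample values are all made up for illustration.

```python
def reused_tags(history: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return every tag that was deployed with more than one image digest."""
    digests_by_tag: dict[str, set[str]] = {}
    for tag, digest in history:
        digests_by_tag.setdefault(tag, set()).add(digest)
    return {tag: digests for tag, digests in digests_by_tag.items() if len(digests) > 1}


# Hypothetical deploy history: the "latest" tag was moved to a new image.
history = [
    ("api:1.4.0", "sha256:aaa"),
    ("api:1.4.1", "sha256:bbb"),
    ("api:latest", "sha256:aaa"),
    ("api:latest", "sha256:bbb"),
]
print(reused_tags(history))
```

An empty result means every tag in the history mapped to exactly one image, which is what you want for production workloads.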

5. Infrastructure change control

Learn about imperative vs. declarative infrastructure. Also, get educated about GitOps.

  • Mostly done manually (imperatively) by a human on the actual systems: 0 points
  • Mostly done in a configuration management tool by a human (some imperative, some declarative) against systems: 3 points
  • Mostly GitOps. Changed in code repositories by humans. Tested by CI, approved by humans. Infrastructure (declaratively) updates itself: 5 points
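
If the declarative idea is new to you, this toy sketch may help. You describe the state you want, and a tool compares it to what’s actually running and works out the changes. Everything here is a simplification I made up for illustration (service names mapped to replica counts, plain strings for actions). Real GitOps tooling does far more than this.

```python
def plan_changes(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Compare desired and actual replica counts and list the actions needed."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name)
        have = actual.get(name)
        if have is None:
            actions.append(f"create {name} x{want}")
        elif want is None:
            actions.append(f"delete {name}")
        elif want != have:
            actions.append(f"scale {name} {have} -> {want}")
    return actions


desired = {"web": 3, "worker": 2}  # what the code repository says should exist
actual = {"web": 2, "cron": 1}     # what is running right now
print(plan_changes(desired, actual))
```

The imperative equivalent would be a human typing each of those create, scale, and delete commands by hand against the live systems.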

Add up your points. I’ve assigned titles to different levels of achievement as a way to track your progress. Share your results with me on Twitter @BretFisher:

10 – 27 Points: DevOps Newborns! You’re off to a good start. Expect the journey toward maturity to take months or years.

28 – 45 Points: DevOps Graduates! You’ve got a solid foundation with many DevOps improvements producing measurable results. Don’t stop now!

46+ Points: DevOps Masters! You’re a well-rounded DevOps organization. Continuing the push for growth will lead to continued benefits and DevOps maturity.
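
If you’d rather not add it up by hand, here’s a minimal Python sketch of the tally. The function names and the dictionary layout are assumptions for illustration. It also makes two edge-case choices of its own: a score of exactly 27 lands in the first band, and a score below 10 gets no title.

```python
from typing import Optional

ALLOWED_POINTS = {0, 3, 5}
QUESTIONS_PER_AREA = 5


def total_score(answers: dict[str, list[int]]) -> int:
    """Validate and sum the answers for each topic area (max 75 for three areas)."""
    total = 0
    for area, points in answers.items():
        if len(points) != QUESTIONS_PER_AREA:
            raise ValueError(f"{area}: expected {QUESTIONS_PER_AREA} answers")
        if any(p not in ALLOWED_POINTS for p in points):
            raise ValueError(f"{area}: each answer must be 0, 3, or 5")
        total += sum(points)
    return total


def title_for(score: int) -> Optional[str]:
    """Map a total score to a title, or None below the lowest band."""
    if score >= 46:
        return "DevOps Masters"
    if score >= 28:  # assumption: exactly 27 belongs to the band below
        return "DevOps Graduates"
    if score >= 10:
        return "DevOps Newborns"
    return None


answers = {
    "culture": [3, 3, 5, 0, 3],
    "process": [3, 3, 3, 5, 3],
    "technology": [5, 3, 3, 3, 0],
}
score = total_score(answers)
print(score, title_for(score))
```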

I hope you realize this is hardly comprehensive and is highly opinionated based on my experience with teams. My goal is that this gives you some insight into what you’re doing well and what you need to work on without the work of a formal and detailed evaluation. These 15 topics hit the big issues I see with teams while they are growing their DevOps proficiency.

I wish you good luck in your DevOps journey, and may all your deployments be successful!

I originally posted this article on Udemy’s Blog as What Is DevOps: How To Build An Efficient DevOps Team