Book Notes: The DevOps Handbook (Parts I & II)

Part 1 - The Three Ways

  • The foundation of DevOps can be traced back to Lean, The Theory of Constraints, and Toyota Kata. It has its roots in manufacturing process management
  • DevOps is an extension of the Agile movement paired with the Continuous Delivery movement
  • Value stream performance is measured in lead time and percent complete and accurate (%C/A)

The First Way - The Principles of Flow

  • We must make work visible
    • Unlike physical processes, technology work is largely invisible. It is hard to see where flow is impeded or where work is piling up
    • We should use tools to visualize how our work flows from left to right, like a kanban board
  • Limit work in progress (WIP)
    • Daily work is dominated by priority du jour, or urgent work
    • Disruptions are highly visible in physical processes but almost invisible for tech workers
      • An engineer who context switches must re-establish cognitive rules and goals, which results in slower and more error-prone work
  • Reduce batch sizes
    • Reduce the changeover cost between tasks, so that the team does not feel forced to complete work in specific operations
    • Allows work defects to be discovered early, so that problems can be fixed before other items flow through
    • Large batch releases cause sudden large amounts of WIP and less parallelization
  • Reduce the number of handoffs
    • Each handoff incurs a loss of knowledge
    • Each handoff is a potential queue where work will wait
    • Each handoff requires various sorts of communication, signaling, prioritization, testing, scheduling, etc.
  • Continually Identify and Elevate Our Constraints
    • Every system has a constraint. Any optimization that is not on the main constraint is an illusion
      • Work will either queue up in front of the constraint if the optimization happens before it, or the optimized step will be starved of work if it comes after the constraint
      • Ideally the constraint sits with the product owner or development, not with operations, testing, deployment, etc.
  • Eliminate hardships and waste in the value stream
    • Waste is anything that causes delay for the customer that could be bypassed without affecting the result
    • Waste includes
      • Partially done work that is blocked
      • Extra processes
      • Extra features outside of requirements
      • Task switching
      • Motion (effort spent moving information or materials between work centers)
      • Defects
      • Nonstandard or manual work
      • Heroics
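
A minimal sketch of the constraint point above, assuming a hypothetical four-stage serial value stream with made-up daily capacities. It treats steady-state throughput as the capacity of the slowest stage, so speeding up any other stage leaves end-to-end throughput unchanged. The stage names, numbers, and helper functions are illustrative assumptions, not taken from the book.

```python
# Hypothetical stage capacities (work items per day); illustrative values only.
stages = {"dev": 10, "test": 4, "deploy": 8, "ops": 12}


def throughput(capacities):
    """Steady-state throughput of a serial pipeline: the slowest stage wins."""
    return min(capacities.values())


def constraint(capacities):
    """Name of the stage with the lowest capacity (the constraint)."""
    return min(capacities, key=capacities.get)


def with_improvement(capacities, stage, extra):
    """Return a copy of the capacities with one stage sped up."""
    improved = dict(capacities)
    improved[stage] += extra
    return improved


baseline = throughput(stages)
# Optimizing a non-constraint stage changes nothing end to end.
faster_dev = throughput(with_improvement(stages, "dev", 5))
# Optimizing the constraint raises throughput until the next constraint binds.
faster_test = throughput(with_improvement(stages, "test", 5))

print(constraint(stages), baseline, faster_dev, faster_test)
# prints: test 4 4 8
```

In this toy model, adding capacity to "dev" only grows the queue in front of "test", while adding capacity to "test" moves the constraint to "deploy".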

The Second Way - The Principles of Feedback

  • Working Safely within Complex Systems
    • Complex systems are defined as systems that defy a single person’s ability to see the system as a whole and understand how all the pieces fit together
    • Failure is inherent and inevitable
    • We must aim to work without fear because we are confident errors will be detected quickly before catastrophe occurs
    • To be safe within a complex system, we must meet the following conditions
      • Complex work is managed so that problems in design and operations are revealed
      • Problems are swarmed and solved, resulting in quick construction of new knowledge
      • New local knowledge is exploited globally throughout the organization
      • Leaders grow other leaders who continually grow these types of capabilities
  • See problems as they occur
    • We must tighten the feedback loops on the quality of our work within the system
    • When feedback is delayed and infrequent, it is too slow to prevent undesirable outcomes
    • Automated builds and testing allow us to identify when a change is introduced to the system that’s incompatible with expectations
    • This detects issues early on and also shows how they can be prevented in the future
    • Feedback allows us to steer
  • Swarm and solve problems to build new knowledge
    • Swarm problems to contain problems before they can spread and to diagnose and treat the problem so it doesn’t happen again
    • Andon cord, used in Toyota plants where every worker is trained to pull the cord when something goes wrong
      • This could mean a defective part, a required part is not available, or even that work is taking too long
    • When things are wrong or slow, the entire production line is stopped so that the problem can be fixed
      • This prevents the problem from continuing downstream
      • It prevents work centers from starting new work that will likely introduce more issues into the system
      • If the problem is not addressed, the work center could hit the same problem again and lose more work
    • Swarming seems contrary to common management practices, but it
      • Prevents loss of critical information due to fading memories or changing circumstances
      • Provides fast feedback into the system
      • Isolates the problem
      • Prevents further complicating factors
  • Keep pushing quality closer to the source
    • More inspection steps and approval processes introduce potential for more errors, since the distance between who does the work and the decision makers is larger
    • Ineffective quality controls involve manual processes, approvals from busy people, and large documentation
    • Peer reviews should be implemented
    • Automatic tests and other checks should be implemented and required before changes are checked into production
    • Quality is everyone’s responsibility
      • Developers are usually the furthest from the customer
      • Developers can’t learn when they’re punished for mistakes from months ago
  • Enable optimizing for downstream work centers
    • Lean defines two customers - internal and external
    • Our most important customer is the next step downstream
    • Operational non-functional requirements are prioritized as highly as user features
    • This creates quality at the source
    • Examples from manufacturing include asymmetrical materials that cannot be assembled backwards and screw fasteners that are impossible to overtighten
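
A minimal sketch of an automated gate in the spirit of the Andon cord notes above, assuming each check is a plain Python callable that returns True on success. The `run_gate` helper and the check names (`unit_tests`, `lint`, `security_scan`) are hypothetical; a real pipeline would delegate this to a CI system.

```python
def run_gate(checks):
    """Run checks in order and stop the line at the first failure.

    `checks` is a list of (name, callable) pairs; each callable returns
    True when the check passes. Returns (passed, names_run, failed_name).
    """
    names_run = []
    for name, check in checks:
        names_run.append(name)
        if not check():
            # Pull the cord: later checks do not run and the change
            # is not promoted.
            return False, names_run, name
    return True, names_run, None


# Hypothetical checks; the lint check is made to fail for illustration.
checks = [
    ("unit_tests", lambda: True),
    ("lint", lambda: False),
    ("security_scan", lambda: True),
]

passed, names_run, failed = run_gate(checks)
print(passed, names_run, failed)
# prints: False ['unit_tests', 'lint'] lint
```

Stopping at the first failure keeps the feedback fast and prevents the defect from flowing to later stages, at the cost of not reporting failures in the checks that were skipped.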

The Third Way - The Principles of Continual Learning and Experimentation

  • Enabling organizational learning and a safety culture
    • Never name, blame and shame the person who caused a problem. We are human and mistakes happen.
    • Our work is almost always performed within a complex system
      • How management chooses to react to failures and accidents may lead to a culture of fear which then makes it unlikely that problems and failure signals are ever reported
    • Conduct a blameless post-mortem after every incident to
      • gain the best understanding of how the incident occurred
      • agree on countermeasures to improve the system
  • Institutionalize the improvement of daily work
    • In the absence of improvements, processes don’t stay the same - due to chaos and entropy, processes actually degrade over time
    • We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of code
    • We schedule kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want
  • Transform local discoveries into global improvements
    • When new learnings are discovered locally, there must be a mechanism to enable the rest of the organization to benefit
    • e.g., post-mortems being searchable, source code repos being shared, etc.
  • Inject resilience patterns into daily work
    • Introduce tension into system to elevate performance
    • Seek to reduce deployment times
    • Reduce test execution times
    • Perform game day exercises, rehearsing large-scale failures, or inject faults with a tool like Netflix’s Chaos Monkey
  • Leaders reinforce a learning culture
    • Leaders must elevate the value of learning and disciplined problem-solving
    • Coaching kata
      • the scientific method of stating True North goals
      • Organization goals to individual, team-based, measurable goals
    • Conduct experiments, with the leader coaching the person running the experiment to continue iterating and learning
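
A minimal sketch of a game day rehearsal, assuming a toy `Service` that stays healthy as long as at least one replica is up. The `Service` class and `game_day` function are hypothetical and only illustrate the idea of deliberately injecting failures; this is not how Chaos Monkey itself is implemented.

```python
import random


class Service:
    """Toy service backed by named replicas; healthy while any replica is up."""

    def __init__(self, replicas):
        self.up = set(replicas)

    def kill(self, replica):
        self.up.discard(replica)

    def is_healthy(self):
        return len(self.up) > 0


def game_day(service, rng, kills=1):
    """Terminate `kills` randomly chosen replicas and report the outcome."""
    killed = []
    for _ in range(kills):
        if not service.up:
            break
        # Sort first so the choice depends only on the seed, not set ordering.
        victim = rng.choice(sorted(service.up))
        service.kill(victim)
        killed.append(victim)
    return killed, service.is_healthy()


rng = random.Random(0)  # seeded so the rehearsal is repeatable
service = Service(["a", "b", "c"])
killed, healthy = game_day(service, rng, kills=2)
print(killed, healthy)
```

With three replicas and two injected failures the toy service stays healthy; rerunning with `kills=3` shows the point at which it does not, which is the kind of limit a rehearsal is meant to surface.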

Part 2 - Where to Start

  • How do we practically implement a culture of DevOps into our organization?
  • How do we decide where to start?
  • How do we enable our teams to succeed?

Selecting Which Value Stream to Start With

  • Single product team rather than functional teams
    • Reduces handoffs
    • Aligns goals
    • Remove external team dependencies
  • Increasing team size not always best move
    • Improve the way work is done. Increase effectiveness
  • Greenfield vs. brownfield
    • Greenfield projects are new projects, where culture can be built in from the start
    • Brownfield projects may be more receptive because it’s clear current process is not working
      • DevOps has been used to successfully transform brownfield projects
  • Start with sympathetic and innovative groups
    • Much like Crossing the Chasm, look for early adopters
    • Don’t spend time trying to convert conservative groups. They must see a proven track record
  • Build critical mass and silent majority
    • Expand to more teams and value streams
    • Do not have to be most visible or influential groups, but expand the coalition
  • Identify the holdouts
    • Must have enough success to protect the initiative
  • Little fish learn to be big fish in little ponds

Understanding the Work in Our Value Stream, Making it Visible, and Expanding Across the Organization

  • Value stream mapping
    • Conduct a workshop with all the major stakeholders
    • First create high-level process blocks
    • Focus on places where
      • work must wait for weeks or months
      • waiting for processes
      • significant rework is generated or received
    • Measure each block in %C/A, lead time, and value add time
    • Identify metrics that need to be improved
    • Unexpected insights
    • See obvious areas of improvement
  • Identify teams supporting our value stream
    • No one person knows all the work that must be performed to create value for the customer
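
A minimal sketch of the per-block measurements from the value stream mapping notes above, assuming each block records lead time and value-add time in hours and %C/A as a fraction. The block names and numbers are hypothetical. Two conventions here are assumptions rather than something stated in these notes: the rolled-up %C/A is taken as the product of the per-block values, and "flow efficiency" is used as a name for value-add time divided by lead time.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Block:
    name: str
    lead_time_h: float   # total elapsed time, including waiting
    value_add_h: float   # time spent actually working on the item
    pct_ca: float        # fraction of output usable downstream as-is


# Hypothetical process blocks; the numbers are illustrative only.
blocks = [
    Block("design", lead_time_h=40, value_add_h=8, pct_ca=0.9),
    Block("develop", lead_time_h=80, value_add_h=40, pct_ca=0.8),
    Block("test", lead_time_h=120, value_add_h=12, pct_ca=0.5),
]

total_lead = sum(b.lead_time_h for b in blocks)
total_value_add = sum(b.value_add_h for b in blocks)
# Share of elapsed time in which value is actually being added.
flow_efficiency = total_value_add / total_lead
# Assumed convention: rolled %C/A is the product of the per-block values.
rolled_pct_ca = prod(b.pct_ca for b in blocks)
# The block with the most waiting is a natural place to look first.
worst_wait = max(blocks, key=lambda b: b.lead_time_h - b.value_add_h)

print(total_lead, total_value_add, round(flow_efficiency, 2),
      round(rolled_pct_ca, 2), worst_wait.name)
# prints: 240 60 0.25 0.36 test
```

In this made-up example the "test" block has both the longest wait and the lowest %C/A, so it would be the first candidate for improvement.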

Creating a dedicated transformation team

  • Initiatives like a DevOps transformation are inevitably in conflict with ongoing business operations
    • We are trying to improve business operations, but ultimately require disruptions to change how we work
    • Businesses are built to be resistant to change
      • Good for maintaining status quo, but this puts us at odds with groups who are responsible for daily operations
  • Organizations must create a dedicated transformation team
    • Must be able to operate outside the rest of the organization that is responsible for daily operations
    • Allows “performance engine” to continue to operate the business
  • Dedicated team is accountable for achieving a clearly defined, measurable, system-level result
    • Keep the team separate from the rest of the organization so as not to interrupt normal operations
    • Create a separate space to maximize communication flow within the team
    • Select team members who have long-standing and mutually respectful relationships with the rest of the organization
  • Agree on a shared goal
    • It should require considerable work but is not impossible
  • Limit the number of these types of initiatives so as not to overtax the organization’s change management capacity
  • Keep improvement planning horizons short
    • Allows flexibility to reprioritize
    • Quicker realization of improvements that make meaningful differences
    • Less risk that project is killed before demonstrable outcomes
    • Early wins are important
  • Reserve 20% of cycle for non-functional requirements and reducing technical debt
    • Organizations that need process improvements the most are those that have the least amount of time to spend
    • Organizations that do not pay down technical debt will soon be burdened with daily workarounds where no new work can be completed
    • If “tax” is not paid, technical debt will become large burden
  • Identify technical debt early and prioritize it in backlog
    • Ongoing incidents should halt further work

How to Design Our Organization and Architecture with Conway’s Law in Mind

  • Conway’s Law is inevitable
    • organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations
    • How we organize our teams has a powerful effect on the software we produce
    • If inevitable, we must use it to our advantage
  • Eliminate dependencies on other teams
  • Organization archetypes
    • Functional-oriented
      • Specialties are grouped together
    • Matrix-oriented
      • Combine functional and market. Usually causes confusing and complicated organizational structures
    • Market-oriented
      • Optimized for responding quickly to customer needs
      • Cross-functional
      • Potential for redundancies across organizations
  • Overly functional team issues
    • Long lead times
    • Work requires opening up tickets with multiple groups
    • Implementer often does not have context about why change is being implemented
  • Market-oriented teams
    • Don’t do top-down reorganization, as it’s very scary and disruptive
    • Implant engineers on existing service teams
  • How people act and react is very important, not just how the teams are organized
  • Align incentives to spur change or resilience
    • Developers should be on call
    • Implementers work on the front-lines to gain understanding
  • Encourage learning
    • Team must overcome learning anxiety
    • Hiring must see potential in skill set
  • Design team boundaries with Conway’s law in mind
  • Development should result in loosely coupled services with bounded contexts
    • Service-oriented architecture
  • Align teams with their products in a way that reduces handoffs, external communication, and cross-team dependencies

How to Get Great Outcomes by Integrating Operations into the Daily Work of Development

  • If operation resources are limited, use the Ops Liaison model
    • Dedicated release engineer for each team who becomes intimately familiar with the needs and executes the work
    • Business relationship manager who helps their product teams navigate the Operations landscape, prioritizes work, and streamlines requests
  • Create shared services to increase developer productivity
    • “Without self-service Operations platforms, the cloud is just Expensive Hosting 2.0”
    • Customers are not external customers but internal Dev teams
    • Includes pre-blessed security libraries, deployment pipeline, and tools
  • Embed Ops Engineers into Service Teams
    • Priorities are driven entirely by the goals of the product teams they’re embedded in
    • Efficient way to cross-train operations knowledge and expertise
    • Transform operations knowledge into automated code
  • Integrate Ops into Dev Rituals, and invite Ops to Dev stand-ups
  • Make ops work visible on shared Kanban boards
    • Only work that is relevant to product delivery
    • People may not be aware of necessary Operations work until it becomes an urgent crisis
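
A minimal sketch of a shared kanban board with work-in-progress limits, tying the shared-board note above to the WIP limits from the First Way. The `Board` class, the column names, and the card titles are hypothetical; real teams would normally use an existing kanban tool rather than code like this.

```python
class WipLimitExceeded(Exception):
    """Raised when a column is already at its work-in-progress limit."""


class Board:
    """Toy kanban board: ordered columns, each with an optional WIP limit."""

    def __init__(self, limits):
        # `limits` maps column name -> max cards (None means unlimited);
        # dict order defines the left-to-right flow.
        self.limits = dict(limits)
        self.columns = {name: [] for name in limits}

    def add(self, column, card):
        limit = self.limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column} is full ({limit})")
        self.columns[column].append(card)

    def move(self, card, src, dst):
        if card not in self.columns[src]:
            raise ValueError(f"{card} is not in {src}")
        self.add(dst, card)  # raises if dst is full; the card stays in src
        self.columns[src].remove(card)


# Dev and Ops cards share one board, so Ops work is visible to everyone.
board = Board({"todo": None, "doing": 2, "done": None})
for card in ["dev: login page", "ops: rotate certs", "ops: patch kernel"]:
    board.add("todo", card)

board.move("dev: login page", "todo", "doing")
board.move("ops: rotate certs", "todo", "doing")
try:
    board.move("ops: patch kernel", "todo", "doing")
    blocked = False
except WipLimitExceeded:
    blocked = True

print(blocked, board.columns["todo"], board.columns["doing"])
# prints: True ['ops: patch kernel'] ['dev: login page', 'ops: rotate certs']
```

The refused move is the useful signal: the third card waits visibly in "todo" instead of silently becoming more unfinished work.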