Book Notes: The DevOps Handbook (Parts I & II)

2018-06-02

Book: The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations

Part 1 - The Three Ways

The foundation of DevOps can be traced back to Lean, The Theory of Constraints, and Toyota Kata. It has its roots in manufacturing process management
DevOps is an extension of the Agile movement paired with the Continuous Delivery movement
Value stream performance is measured in lead time and percent complete & accurate (%C/A)

The First Way - The Principles of Flow

We must make work visible
- Unlike physical processes, technology work is largely invisible. It is hard to see where flow is impeded or where work is piling up
- We should use tools to visualize how our work flows from left to right, like a kanban board
Limit work in progress (WIP)
- Daily work is dominated by priority du jour, or urgent work
- Disruptions are highly visible in physical processes but almost invisible for tech workers
  - An engineer who context switches must re-establish cognitive rules and goals, which results in slower and more error-prone work
Reduce batch sizes
- Reduce changeover cost between tasks, so that the team does not feel forced to batch up work for one specific operation
- Allows work defects to be discovered early, so that problems can be fixed before other items flow through
- Large batch releases cause sudden large amounts of WIP and less parallelization
Reduce the number of handoffs
- Each handoff incurs a loss of knowledge
- Each handoff is a potential queue where work will wait
- Each handoff requires various sorts of communication, signaling, prioritization, testing, scheduling, etc.
Continually Identify and Elevate Our Constraints
- Every system has a constraint. Any optimization that is not on the main constraint is an illusion
  - Work will either queue up if the optimization happens before the constraint, or the optimized step will be starved of work if it comes after the main constraint
  - Ideally the constraint should end up being the product owner or development, not operations, testing, deployment, etc.
Eliminate hardships and waste in the value stream
- Waste is anything that causes delay for the customer that could be bypassed without affecting the result
- Waste includes
  - Partially done work that is blocked
  - Extra processes
  - Extra features outside of requirements
  - Task switching
  - Motion (effort spent moving information or materials between work centers)
  - Defects
  - Nonstandard or manual work
  - Heroics
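
The claim above that any optimization away from the main constraint is an illusion can be checked with a small sketch. It assumes a deliberately simplified model that is not from the book: each stage has a fixed capacity in work items per day, and steady-state throughput equals the capacity of the slowest stage. The stage names and numbers are hypothetical.

```python
# Hypothetical stage capacities in work items per day (illustrative numbers only).
capacities = {"development": 10, "testing": 4, "deployment": 8}


def throughput(stages: dict[str, int]) -> int:
    """Steady-state throughput is capped by the slowest stage (the constraint)."""
    return min(stages.values())


def constraint(stages: dict[str, int]) -> str:
    """Return the name of the stage with the lowest capacity."""
    return min(stages, key=stages.get)


baseline = throughput(capacities)
# Speeding up a stage before the constraint: work just queues in front of testing.
faster_dev = throughput({**capacities, "development": 20})
# Speeding up a stage after the constraint: deployment is starved of work.
faster_deploy = throughput({**capacities, "deployment": 16})
# Elevating the constraint itself is the only change that helps.
faster_test = throughput({**capacities, "testing": 6})

print(constraint(capacities), baseline, faster_dev, faster_deploy, faster_test)
# prints: testing 4 4 4 6
```

In this model only the change to testing moves the overall number, which is the point of identifying the constraint before investing in improvements.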

The Second Way - The Principles of Feedback

Working Safely within Complex Systems
- Complex systems are defined as systems that defy any single person's ability to see the system as a whole and understand how all the pieces fit together
- Failure is inherent and inevitable
- We must aim to work without fear because we are confident errors will be detected quickly before catastrophe occurs
- To be safe within a complex system, we must meet the following conditions
  - Complex work is managed so that problems in design and operations are revealed
  - Problems are swarmed and solved, resulting in quick construction of new knowledge
  - New local knowledge is exploited globally throughout the organization
  - Leaders grow other leaders who continually grow these types of capabilities
See problems as they occur
- We must tighten the feedback loops on the quality of our work within the system
- When feedback is delayed and infrequent, it is too slow to prevent undesirable outcomes
- Automated builds and testing allow us to identify when a change is introduced to the system that’s incompatible with expectations
- Detects issues early on, but also identifies how these can be prevented in the future
- Feedback allows us to steer
Swarm and solve problems to build new knowledge
- Swarm problems to contain problems before they can spread and to diagnose and treat the problem so it doesn’t happen again
- Andon cord, used in Toyota plants where every worker is trained to pull the cord when something goes wrong
  - This could mean a defective part, a required part is not available, or even that work is taking too long
- When things are wrong or slow, the entire production line is stopped so that the problem can be fixed
  - This prevents the problem from continuing downstream
  - It prevents work centers from starting new work that will likely introduce more issues into the system
  - If problem is not addressed, work center could potentially deal with the same problem and cause more work loss
- Swarming seems contrary to common management practices, but it
  - Prevents loss of critical information due to fading memories or changing circumstances
  - Provides fast feedback into the system
  - Isolates the problem
  - Prevents further complicating factors
Keep pushing quality closer to the source
- More inspection steps and approval processes introduce potential for more errors, since the distance between who does the work and the decision makers is larger
- Ineffective quality controls involve manual processes, approvals from busy people, and large documentation
- Peer reviews should be implemented
- Automatic tests and other checks should be implemented and required before changes are checked into production
- Quality is everyone’s responsibility
  - Developers are usually the furthest from the customer
  - Developers can’t learn when they’re punished for mistakes from months ago
Enable optimizing for downstream work centers
- Lean defines two customers - internal and external
- Our most important customer is the next step downstream
- Operational non-functional requirements are prioritized as highly as user features
- This creates quality at the source
- Examples from manufacturing include asymmetrical materials so they could not be assembled backwards or screw fasteners that were impossible to over tighten
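
The Andon cord idea above (stop the line at the first sign of trouble so the problem does not flow downstream) maps naturally onto an automated pipeline. The following is a minimal sketch, assuming a pipeline is just an ordered list of named checks; `run_pipeline` and the checks themselves are hypothetical stand-ins, not any real CI tool's API.

```python
from typing import Callable

# A check returns True when the change passes. These are hypothetical stand-ins.
Check = Callable[[str], bool]


def run_pipeline(change: str, checks: list[tuple[str, Check]]) -> tuple[bool, list[str]]:
    """Run checks in order and stop the line at the first failure."""
    executed = []
    for name, check in checks:
        executed.append(name)
        if not check(change):
            # "Pull the cord": later stages never see the defective change.
            return False, executed
    return True, executed


checks = [
    ("build", lambda change: True),
    ("unit tests", lambda change: "bug" not in change),
    ("deploy", lambda change: True),
]

good = run_pipeline("add feature", checks)
bad = run_pipeline("add bug", checks)
print(good)  # (True, ['build', 'unit tests', 'deploy'])
print(bad)   # (False, ['build', 'unit tests'])
```

The failing change never reaches the deploy stage, and the list of executed stages tells the team exactly where to swarm.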

The Third Way - The Principles of Continual Learning and Experimentation

Enabling organization learning and a safety culture
- Never name, blame and shame the person who caused a problem. We are human and mistakes happen.
- Our work is almost always performed within a complex system
  - How management chooses to react to failures and accidents may lead to a culture of fear which then makes it unlikely that problems and failure signals are ever reported
- Conduct a blameless post-mortem after every incident to
  - gain the best understanding of how the incident occurred
  - agree on countermeasures to improve the system
Institutionalize the improvement of daily work
- In the absence of improvements, processes don’t stay the same - due to chaos and entropy, processes actually degrade over time
- We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of code
- We schedule kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want
Transform local discoveries into global improvements
- When new learnings are discovered locally, there must be a mechanism to enable the rest of the organization to benefit
- e.g., making post-mortems searchable, sharing source code repos, etc.
Inject resilience patterns into daily work
- Introduce tension into system to elevate performance
- Seek to reduce deployment times
- Reduce test execution times
- Perform game day exercises that rehearse large-scale failures, or inject faults with tools like Netflix's Chaos Monkey
Leaders reinforce a learning culture
- Leaders must elevate the value of learning and disciplined problem-solving
- Coaching kata
  - the scientific method of stating True North goals
  - Organization goals to individual, team-based, measurable goals
- Conduct experiments, with the leader coaching the person running the experiment to continue iterating and learning
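
To make the resilience point above concrete, here is a minimal fault-injection sketch. It is far simpler than Netflix's Chaos Monkey, which terminates production instances; this version only wraps a function so that some calls fail, and then checks that the caller degrades gracefully. All names (`flaky`, `call_with_fallback`, `fetch_price`) are hypothetical.

```python
import random


class InjectedFailure(Exception):
    """Raised by the fault injector to simulate an outage."""


def flaky(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail, as in a game day drill."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(func.__name__)
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(primary, fallback, retries: int = 3):
    """Retry the primary call, then degrade to the fallback instead of crashing."""
    for _ in range(retries):
        try:
            return primary()
        except InjectedFailure:
            continue
    return fallback()


def fetch_price():
    return "live price"  # hypothetical dependency


rng = random.Random(42)  # seeded so the drill is repeatable
always_down = flaky(fetch_price, failure_rate=1.0, rng=rng)
never_down = flaky(fetch_price, failure_rate=0.0, rng=rng)

degraded = call_with_fallback(always_down, lambda: "cached price")
healthy = call_with_fallback(never_down, lambda: "cached price")
print(degraded)  # cached price
print(healthy)   # live price
```

Running a drill like this in daily work is one way to find out whether the fallback path actually works before a real outage does it for us.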

Part 2 - Where to Start

How do we practically implement a culture of DevOps into our organization?
How do we decide where to start?
How do we enable our teams to succeed?

Selecting Which Value Stream to Start With

Single product team rather than functional teams
- Reduces handoffs
- Aligns goals
- Remove external team dependencies
Increasing team size not always best move
- Improve the way work is done. Increase effectiveness
Greenfield vs. brownfield
- Greenfield are new projects, where culture can be built in from the start
- Brownfield projects may be more receptive because it’s clear current process is not working
  - DevOps has been used to successfully transform brownfield projects
Start with sympathetic and innovative groups
- Much like Crossing the Chasm, look for early adopters
- Don’t spend time trying to convert conservative groups. They must see a proven track record
Build critical mass and silent majority
- Expand to more teams and value streams
- Do not have to be most visible or influential groups, but expand the coalition
Identify the holdouts
- Must have enough success to protect the initiative
Little fish learn to be big fish in little ponds

Understanding the Work in Our Value Stream, Making it Visible, and Expanding Across the Organization

Value stream mapping
- Conduct a workshop with all the major stakeholders
- First create high-level process blocks
- Focus on places where
  - work must wait for weeks or months
  - waiting for processes
  - significant rework is generated or received
- Measure each block in %C/A, lead time, and value add time
- Identify metrics that need to be improved
- Unexpected insights
- See obvious areas of improvement
Identify teams supporting our value stream
- No one person knows all the work that must be performed to create value for the customer
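
The per-block measurements from the value stream mapping notes above (%C/A, lead time, and value add time) can be rolled up with a few lines of code. This is a sketch under stated assumptions: the process blocks and numbers are hypothetical, and the "activity ratio" and "rolled %C/A" summaries are common value stream mapping conventions rather than something taken from these notes.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Block:
    name: str
    lead_time_days: float         # elapsed time from request to completion
    value_add_days: float         # time someone is actually working on the item
    pct_complete_accurate: float  # fraction of output usable downstream as-is


# Hypothetical process blocks from a mapping workshop.
blocks = [
    Block("design", 10, 2, 0.90),
    Block("development", 15, 5, 0.80),
    Block("testing", 20, 2, 0.50),
    Block("deployment", 5, 1, 1.00),
]

total_lead_time = sum(b.lead_time_days for b in blocks)
total_value_add = sum(b.value_add_days for b in blocks)
# Share of the total lead time in which value is actually being added.
activity_ratio = total_value_add / total_lead_time
# Chance that an item passes every block without rework (assumes independence).
rolled_ca = prod(b.pct_complete_accurate for b in blocks)
# The block where work waits the longest is a natural place to look first.
longest_wait = max(blocks, key=lambda b: b.lead_time_days - b.value_add_days)

print(f"lead time {total_lead_time:.0f} days, value add {total_value_add:.0f} days")
print(f"activity ratio {activity_ratio:.0%}, rolled %C/A {rolled_ca:.0%}")
print(f"longest wait: {longest_wait.name}")
```

With these made-up numbers the stream takes 50 days to deliver 10 days of work, and only about a third of items get through without rework, which is the kind of unexpected insight a mapping workshop is meant to surface.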

Creating a dedicated transformation team

Initiatives like DevOps transformations are inevitably in conflict with ongoing business operations
- We are trying to improve business operations, but ultimately require disruptions to change how we work
- Businesses are built to be resilient to change
  - Good for maintaining status quo, but this puts us at odds with groups who are responsible for daily operations
Organizations must create a dedicated transformation team
- Must be able to operate outside the rest of the organization that is responsible for daily operations
- Allows “performance engine” to continue to operate the business
Dedicated team is accountable for achieving a clearly defined, measurable, system-level result
- Keep the team separate from the rest of the organization so as not to interrupt normal operations
- Create a separate space to maximize communication flow within the team
- Select team members who have long-standing and mutually respectful relationships with the rest of the organization
Agree on a shared goal
- It should require considerable work but is not impossible
Limit the number of these types of initiatives so as not to tax the organization's change management capacity
Keep improvement planning horizons short
- Allows flexibility to reprioritize
- Quicker realization of improvements that make meaningful differences
- Less risk that project is killed before demonstrable outcomes
- Early wins are important
Reserve 20% of cycle for non-functional requirements and reducing technical debt
- Organizations that need process improvements the most are those that have the least amount of time to spend
- Organizations that do not pay down technical debt will soon be burdened with daily workarounds where no new work can be completed
- If “tax” is not paid, technical debt will become large burden
Identify technical debt early and prioritize it in backlog
- Ongoing incidents should halt further work

How to Design Our Organization and Architecture with Conway’s Law in Mind

Conway’s Law is inevitable
- organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations
- How we organize our teams has a powerful effect on the software we produce
- If inevitable, we must use it to our advantage
Eliminate dependencies on other teams
Organization archetypes
- Functional-oriented
  - Specialties are grouped together
- Matrix-oriented
  - Combine functional and market. Usually causes confusing and complicated organizational structures
- Market-oriented
  - Optimized for responding quickly to customer needs
  - Cross-functional
  - Potential for redundancies across organizations
Overly functional team issues
- Long lead times
- Work requires opening up tickets with multiple groups
- Implementer often does not have context about why change is being implemented
Market-oriented teams
- Don’t do top-down reorganization, as it’s very scary and disruptive
- Implant engineers on existing service teams
How people act and react matters as much as how the teams are organized
Align incentives to spur change or resilience
- Developers should be on call
- Implementers work on the front-lines to gain understanding
Encourage learning
- Team must overcome learning anxiety
- Hiring must see potential in skill set
Design team boundaries with Conway’s law in mind
Development should result in loosely coupled services with bounded contexts
- Service-oriented architecture
Align teams with their products in a way that reduces handoffs, external communication, and cross-team dependencies
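
One rough way to see the effect of team boundaries described above is to count how many component dependencies cross a team boundary under different ownership models. The sketch below is illustrative only: the components, the dependency list, and the two ownership mappings are hypothetical, and a real system would have far messier data.

```python
# Hypothetical dependencies between components: (caller, callee).
dependencies = [
    ("checkout-ui", "checkout-api"),
    ("checkout-api", "checkout-db"),
    ("search-ui", "search-api"),
    ("search-api", "search-db"),
    ("checkout-api", "search-api"),
]
components = sorted({name for pair in dependencies for name in pair})

# Functional-oriented: one team per layer (ui, api, db).
functional = {c: c.split("-")[1] for c in components}
# Market-oriented: one team per product (checkout, search).
product = {c: c.split("-")[0] for c in components}


def cross_team(ownership: dict[str, str]) -> list[tuple[str, str]]:
    """Dependencies whose two ends are owned by different teams."""
    return [(a, b) for a, b in dependencies if ownership[a] != ownership[b]]


print(len(cross_team(functional)))  # 4
print(len(cross_team(product)))     # 1
```

In this toy example, organizing by layer turns almost every dependency into a handoff between teams, while organizing by product leaves only the one genuine cross-product dependency.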

How to Get Great Outcomes by Integrating Operations into the Daily Work of Development

If operation resources are limited, use the Ops Liaison model
- Dedicated release engineer for each team who becomes intimately familiar with the needs and executes the work
- Business relationship manager who helps their product teams navigate the Operations landscape, prioritizes work, and streamlines requests
Create shared services to increase developer productivity
- “Without self-service Operations platforms, the cloud is just Expensive Hosting 2.0”
- Customers are not external customers but internal Dev teams
- Includes pre-blessed security libraries, deployment pipeline, and tools
Embed Ops Engineers into Service Teams
- Priorities are driven entirely by the goals of the product teams they’re embedded in
- Efficient way to cross-train operations knowledge and expertise
- Transform operations knowledge into automated code
Integrate Ops into Dev Rituals, and invite Ops to Dev stand-ups
Make ops work visible on shared Kanban boards
- Only work that is relevant to product delivery
- People may not be aware of necessary Operations work until it becomes an urgent crisis
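
A shared board that holds both Dev and Ops cards, combined with the WIP limits from the First Way, can be sketched in a few lines. This is a toy model, not any real kanban tool's API: the `KanbanBoard` class, the column names, the limits, and the cards are all hypothetical.

```python
class WipLimitExceeded(Exception):
    """Raised when a column is already at its work-in-progress limit."""


class KanbanBoard:
    def __init__(self, wip_limits: dict[str, int]):
        self.wip_limits = wip_limits
        self.columns: dict[str, list[str]] = {name: [] for name in wip_limits}

    def add(self, column: str, card: str) -> None:
        if len(self.columns[column]) >= self.wip_limits[column]:
            raise WipLimitExceeded(f"{column} is full, finish something first")
        self.columns[column].append(card)

    def move(self, card: str, source: str, target: str) -> None:
        if card not in self.columns[source]:
            raise ValueError(f"{card!r} is not in {source}")
        # Add first, so a rejected move leaves the card where it was.
        self.add(target, card)
        self.columns[source].remove(card)


board = KanbanBoard({"todo": 10, "doing": 2, "done": 100})
for card in ["[dev] checkout page", "[ops] rotate certificates", "[ops] patch database"]:
    board.add("todo", card)

board.move("[dev] checkout page", "todo", "doing")
board.move("[ops] rotate certificates", "todo", "doing")
try:
    board.move("[ops] patch database", "todo", "doing")
    blocked = False
except WipLimitExceeded:
    blocked = True  # the limit forces a conversation before more work starts

print(board.columns["doing"], blocked)
```

Because Ops cards sit in the same columns as Dev cards, the pending database patch is visible to everyone while it waits, rather than surfacing later as an urgent crisis.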