2018-06-02
Part 1 - The Three Ways
- The foundation of DevOps can be traced back to Lean, The Theory of Constraints, and Toyota Kata. It has its roots in manufacturing process management
- DevOps is an extension of the Agile movement paired with the Continuous Delivery movement
- Value stream performance is measured in lead time and percent complete & accurate (%C/A)
The First Way - The Principles of Flow
- We must make work visible
- Unlike physical processes, technology work is largely invisible. It is hard to see where flow is impeded or where work is piling up
- We should use tools to visualize how our work flows from left to right, like a kanban board
- Limit work in progress (WIP)
- Daily work is dominated by priority du jour, or urgent work
- Disruptions are highly visible in physical processes but almost invisible for tech workers
- An engineer who context switches must re-establish cognitive rules and goals, which results in slower and more error-prone work
- Reduce batch sizes
- Reduce changeover cost between tasks, so that the team does not feel forced to complete work in large batches
- Allows work defects to be discovered early, so that problems can be fixed before other items flow through
- Large batch releases cause sudden large amounts of WIP and less parallelization
- Reduce the number of handoffs
- Each handoff incurs a loss of knowledge
- Each handoff is a potential queue where work will wait
- Each handoff requires various sorts of communication, signaling, prioritization, testing, scheduling, etc.
- Continually Identify and Elevate Our Constraints
- Every system has a constraint. Any optimization that is not on the main constraint is an illusion
- Work will either queue up at the constraint if the optimization happens before it, or the optimized step will be starved of work if it comes after the constraint
- Ideally the constraint should end up being the product owner or development, not operations, testing, deployment, etc.
- Eliminate hardships and waste in the value stream
- Waste is anything that causes delay for the customer that could be bypassed without affecting the result
- Waste includes
- Partially done work that is blocked
- Extra processes
- Extra features outside of requirements
- Task switching
- Motion (effort spent moving information or work between work centers)
- Defects
- Nonstandard or manual work
- Heroics
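A minimal sketch of the "make work visible" and "limit WIP" ideas above, assuming a toy in-memory kanban board. The `KanbanBoard` class, the column names, and the limit of 2 are all hypothetical and only illustrate how a WIP limit refuses new work instead of letting it pile up.

```python
class WipLimitExceeded(Exception):
    """Raised when a column is already at its WIP limit."""


class KanbanBoard:
    """Toy kanban board: columns run left to right, each with an optional WIP limit."""

    def __init__(self, wip_limits):
        # wip_limits maps column name -> max cards allowed (None means unlimited).
        self.wip_limits = dict(wip_limits)
        self.columns = {name: [] for name in self.wip_limits}

    def add(self, card, column):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column} is at its WIP limit of {limit}")
        self.columns[column].append(card)

    def move(self, card, source, target):
        if card not in self.columns[source]:
            raise ValueError(f"{card} is not in {source}")
        # Add to the target first: if it is full, the card stays where it was.
        self.add(card, target)
        self.columns[source].remove(card)


if __name__ == "__main__":
    board = KanbanBoard({"todo": None, "doing": 2, "done": None})
    for card in ("task-a", "task-b", "task-c"):
        board.add(card, "todo")
    board.move("task-a", "todo", "doing")
    board.move("task-b", "todo", "doing")
    try:
        board.move("task-c", "todo", "doing")
    except WipLimitExceeded as exc:
        # The limit forces us to finish something before starting more.
        print(exc)
```

The refused move is the point: the board makes the pile-up visible and pushes the team to finish work in "doing" before pulling more.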
The Second Way - The Principles of Feedback
- Working Safely within Complex Systems
- Complex systems are those that defy any single person’s ability to see the system as a whole and understand how all the pieces fit together
- Failure is inherent and inevitable
- We must aim to work without fear because we are confident errors will be detected quickly before catastrophe occurs
- To be safe within a complex system, we must meet the following conditions
- Complex work is managed so that problems in design and operations are revealed
- Problems are swarmed and solved, resulting in quick construction of new knowledge
- New local knowledge is exploited globally throughout the organization
- Leaders grow other leaders who continually grow these types of capabilities
- See problems as they occur
- We must tighten the feedback loops on the quality of our work within the system
- When feedback is delayed and infrequent, it is too slow to prevent undesirable outcomes
- Automated builds and testing allow us to identify when a change is introduced to the system that’s incompatible with expectations
- Detects issues early on, but also identifies how these can be prevented in the future
- Feedback allows us to steer
- Swarm and solve problems to build new knowledge
- Swarm problems to contain problems before they can spread and to diagnose and treat the problem so it doesn’t happen again
- Andon cord, used in Toyota plants where every worker is trained to pull the cord when something goes wrong
- This could mean a defective part, a required part is not available, or even that work is taking too long
- When things are wrong or slow, the entire production line is stopped so that the problem can be fixed
- This prevents the problem from continuing downstream
- It prevents work centers from starting new work that will likely introduce more issues into the system
- If the problem is not addressed, the work center could hit the same problem again and lose more work
- Swarming seems contrary to common management practices, but it
- Prevents loss of critical information due to fading memories or changing circumstances
- Provides fast feedback into the system
- Isolates the problem
- Prevents further complicating factors
- Keep pushing quality closer to the source
- More inspection steps and approval processes introduce potential for more errors, since the distance between who does the work and the decision makers is larger
- Ineffective quality controls involve manual processes, approvals from busy people, and large documentation
- Peer reviews should be implemented
- Automatic tests and other checks should be implemented and required before changes are checked into production
- Quality is everyone’s responsibility
- Developers are usually the furthest from the customer
- Developers can’t learn when they’re punished for mistakes from months ago
- Enable optimizing for downstream work centers
- Lean defines two customers - internal and external
- Our most important customer is the next step downstream
- Operational non-functional requirements are prioritized as highly as user features
- This creates quality at the source
- Examples from manufacturing include asymmetrical materials so they could not be assembled backwards or screw fasteners that were impossible to over tighten
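A rough sketch of the fast-feedback and Andon cord ideas above, assuming a change is checked by a sequence of automated stages. The `run_pipeline` function, the stage names, and the fields on the change are hypothetical. The first failing stage stops the line, so the problem does not flow downstream.

```python
def run_pipeline(stages, change):
    """Run checks in order and stop at the first failure (Andon cord style).

    stages is a list of (name, check) pairs, where check(change) returns True on success.
    """
    passed = []
    for name, check in stages:
        if not check(change):
            # Stop the line: later stages never see the defective change.
            return {"ok": False, "failed_stage": name, "passed": passed}
        passed.append(name)
    return {"ok": True, "failed_stage": None, "passed": passed}


# Hypothetical stages; real ones would invoke a build tool, test runner, etc.
STAGES = [
    ("build", lambda change: change["compiles"]),
    ("unit_tests", lambda change: change["tests_pass"]),
    ("peer_review", lambda change: change["reviewed"]),
]

if __name__ == "__main__":
    change = {"compiles": True, "tests_pass": False, "reviewed": True}
    print(run_pipeline(STAGES, change))
```

Reporting the failing stage by name keeps the feedback close to the source, so the person who made the change can act on it right away.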
The Third Way - The Principles of Continual Learning and Experimentation
- Enabling organizational learning and a safety culture
- Never name, blame and shame the person who caused a problem. We are human and mistakes happen.
- Our work is almost always performed within a complex system
- How management chooses to react to failures and accidents may lead to a culture of fear which then makes it unlikely that problems and failure signals are ever reported
- Conduct a blameless post-mortem after every incident to
- gain the best understanding of how the incident occurred
- agree on countermeasures to improve the system
- Institutionalize the improvement of daily work
- In the absence of improvements, processes don’t stay the same - due to chaos and entropy, processes actually degrade over time
- We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of code
- We schedule kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want
- Transform local discoveries into global improvements
- When new learnings are discovered locally, there must be a mechanism to enable the rest of the organization to benefit
- e.g., post-mortems being searchable, source code repos being shared, etc.
- Inject resilience patterns into daily work
- Introduce tension into system to elevate performance
- Seek to reduce deployment times
- Reduce test execution times
- Perform game day exercises, rehearsing large-scale failures, or inject faults with Chaos Monkey like Netflix
- Leaders reinforce a learning culture
- Leaders must elevate the value of learning and disciplined problem-solving
- Coaching kata
- the scientific method of stating True North goals
- Organization goals to individual, team-based, measurable goals
- Conduct experiments, with the leader coaching the person running the experiment to continue iterating and learning
Part 2 - Where to Start
- How do we practically implement a culture of DevOps into our organization?
- How do we decide where to start?
- How do we enable our teams to succeed?
Selecting Which Value Stream to Start With
- Single product team rather than functional teams
- Reduces handoffs
- aligns goals
- Remove external team dependencies
- Increasing team size not always best move
- Improve the way work is done. Increase effectiveness
- Greenfield vs. brownfield
- Greenfield are new projects, where culture can be built in from the start
- Brownfield projects may be more receptive because it’s clear current process is not working
- DevOps has been used to successfully transform brownfield projects
- Start with sympathetic and innovative groups
- Much like Crossing the Chasm, look for early adopters
- Don’t spend time trying to convert conservative groups. They must first see a proven track record
- Build critical mass and silent majority
- Expand to more teams and value streams
- Do not have to be most visible or influential groups, but expand the coalition
- Identify the holdouts
- Must have enough success to protect the initiative
- Little fish learn to be big fish in little ponds
Understanding the Work in Our Value Stream, Making it Visible, and Expanding Across the Organization
- Value stream mapping
- Conduct a workshop with all the major stakeholders
- First create high-level process blocks
- Focus on places where
- work must wait for weeks or months
- waiting for processes
- significant rework is generated or received
- Measure each block in %C/A, lead time, and value add time
- Identify metrics that need to be improved
- Unexpected insights
- See obvious areas of improvement
- Identify teams supporting our value stream
- No one person knows all the work that must be performed to create value for the customer
- Initiatives like a DevOps transformation are inevitably in conflict with ongoing business operations
- We are trying to improve business operations, but ultimately require disruptions to change how we work
- Businesses are built to be resilient to change
- Good for maintaining status quo, but this puts us at odds with groups who are responsible for daily operations
- Organizations must create a dedicated transformation team
- Must be able to operate outside the rest of the organization that is responsible for daily operations
- Allows “performance engine” to continue to operate the business
- Dedicated team is accountable for achieving a clearly defined, measurable, system-level result
- Kept separate from the teams running daily operations so as not to interrupt normal operations
- Create a separate space to maximize communication flow within the team
- Select team members who have long-standing and mutually respectful relationships with the rest of the organization
- Agree on a shared goal
- It should require considerable work but is not impossible
- Limit the number of these types of initiatives so as not to tax the organizational change management capacity
- Keep improvement planning horizons short
- Allows flexibility to reprioritize
- Quicker realization of improvements that make meaningful differences
- Less risk that project is killed before demonstrable outcomes
- Early wins are important
- Reserve 20% of cycle for non-functional requirements and reducing technical debt
- Organizations that need process improvements the most are those that have the least amount of time to spend
- Organizations that do not pay down technical debt will soon be burdened with daily workarounds where no new work can be completed
- If “tax” is not paid, technical debt will become large burden
- Identify technical debt early and prioritize it in backlog
- Ongoing incidents should halt further work
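A small sketch of the value stream mapping measurements noted above (lead time, value add time, and %C/A per process block). The `ProcessBlock` type, the block names, and the numbers are made up for illustration. Multiplying the per-block %C/A values into a rolled figure and dividing value add time by lead time are common value-stream-mapping conventions, assumed here rather than taken from these notes.

```python
from dataclasses import dataclass


@dataclass
class ProcessBlock:
    name: str
    lead_time_days: float         # elapsed time, including waiting in queues
    value_add_days: float         # time someone is actually working on it
    pct_complete_accurate: float  # %C/A, 0-100: share of output usable downstream as-is


def summarize(blocks):
    """Aggregate per-block measurements into value-stream-level metrics."""
    total_lead = sum(b.lead_time_days for b in blocks)
    total_value_add = sum(b.value_add_days for b in blocks)
    rolled = 1.0
    for b in blocks:
        # Assumed convention: work must pass every block without rework.
        rolled *= b.pct_complete_accurate / 100
    return {
        "lead_time_days": total_lead,
        "value_add_days": total_value_add,
        "value_add_ratio": total_value_add / total_lead if total_lead else 0.0,
        "rolled_pct_ca": rolled * 100,
    }


if __name__ == "__main__":
    # Hypothetical numbers for a three-block value stream.
    stream = [
        ProcessBlock("design", lead_time_days=10, value_add_days=2, pct_complete_accurate=90),
        ProcessBlock("develop", lead_time_days=15, value_add_days=6, pct_complete_accurate=80),
        ProcessBlock("deploy", lead_time_days=20, value_add_days=1, pct_complete_accurate=50),
    ]
    for key, value in summarize(stream).items():
        print(f"{key}: {value:.2f}")
```

Blocks with a long lead time but little value add time, or a low %C/A, are the places where work waits or rework is generated, which is where the workshop should focus first.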
How to Design Our Organization and Architecture with Conway’s Law in Mind
- Conway’s Law is inevitable
- organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations
- How we organize our teams has a powerful effect on the software we produce
- If inevitable, we must use it to our advantage
- Eliminate dependencies on other teams
- Organization archetypes
- Functional-oriented
- Specialties are grouped together
- Matrix-oriented
- Combine functional and market. Usually causes confusing and complicated organizational structures
- Market-oriented
- Optimized for responding quickly to customer needs
- Cross-functional
- Potential for redundancies across organizations
- Overly functional team issues
- Long lead times
- Work requires opening up tickets with multiple groups
- Implementer often does not have context about why change is being implemented
- Market-oriented teams
- Don’t do top-down reorganization, as it’s very scary and disruptive
- Implant engineers on existing service teams
- How people act and react matters as much as how the teams are organized
- Align incentives to spur change or resilience
- Developers should be on call
- Implementers work on the front-lines to gain understanding
- Encourage learning
- Team must overcome learning anxiety
- Hiring must see potential in skill set
- Design team boundaries with Conway’s law in mind
- Development should result in loosely coupled services with bounded contexts
- Service-oriented architecture
- Align teams with their products in a way that reduces handoffs, external communication, and cross-team dependencies
How to Get Great Outcomes by Integrating Operations into the Daily Work of Development
- If Operations resources are limited, use the Ops Liaison model
- Dedicated release engineer for each team who becomes intimately familiar with the needs and executes the work
- Business relationship manager who helps their product teams navigate the Operations landscape, prioritizes work, and streamlines requests
- Create shared services to increase developer productivity
- “Without self-service Operations platforms, the cloud is just Expensive Hosting 2.0”
- Customers are not external customers but internal Dev teams
- Includes pre-blessed security libraries, deployment pipeline, and tools
- Embed Ops Engineers into Service Teams
- Priorities are driven entirely by the goals of the product teams they’re embedded in
- Efficient way to cross-train operations knowledge and expertise
- Transform operations knowledge into automated code
- Integrate Ops into Dev Rituals, and invite Ops to Dev stand-ups
- Make ops work visible on shared Kanban boards
- Only work that is relevant to product delivery
- People may not be aware of necessary Operations work until it becomes an urgent crisis