The Benefits for Site Reliability Engineering (SRE)

There’s been a lot of hype around SRE lately, and when discussing this topic I always try to draw the conversation back to the benefits of looking at new frameworks and tools: what will the organisation get out of SRE? What will individuals get out of SRE?

For the organisation

Enhanced stability and reliability of services – SRE is about reliability

Better understanding of how production services work – through observability and shared “wisdom of production”

A better balance between the investment in customer experience and technical reliability – SLOs are business objectives

A greater appreciation of the operational impact of services in development teams – shift left, SREs in teams, designing for operations

Improvements in staff morale and retention – HOW? LET’S CONSIDER THE INDIVIDUAL FIRST….

For the individual

A better work balance with ring-fenced time for improvement – the 50% split between operational and improvement work

Less stressful on-call experiences and a reduction in overall call-out volumes – automation of incident response, “chaos engineering”, blameless post mortems

Broader skills-based opportunities that leverage the latest in automation – a future-proof “toolbox”

An improvement in workplace culture – Westrum model

Opportunities for “shifting left” and helping to ensure development teams deliver more reliable services – getting the wisdom of production into teams

WHICH SHOULD LEAD TO – Improvements in morale and staff retention….
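To make the SLO point concrete: an SLO implies an error budget – the amount of unreliability the business is willing to tolerate. A minimal sketch in Python (the 99.9% target and 30-day period are illustrative, not from any particular service):

```python
def error_budget(slo_target: float, period_minutes: int) -> float:
    """Minutes of allowed unavailability for a given SLO over a period."""
    return period_minutes * (1 - slo_target)

# A 99.9% availability SLO over a 30-day period (43,200 minutes)
budget = error_budget(0.999, 30 * 24 * 60)
print(f"Allowed downtime: {budget:.1f} minutes")  # 43.2 minutes
```

When the budget is being burned too fast, the balance tips from feature work to reliability work – that is the business conversation SLOs enable.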

For me, SRE is part of the natural evolution that began with Agile and progressed through DevOps. It is for those organisations that want to balance the speed of delivery and value creation these approaches introduced with a relentless focus on reliability.

Posted in Agile, DevOps, SRE

Site Reliability Engineering (SRE) – back to the future?

I’ve walked the line between DevOps and SRE in organisations and witnessed first-hand both good and bad practices. While there are plenty of SRE success stories, I’d like to focus on the occasional use of SRE as a justification to rebuild the walls around operations/production that DevOps strived for so long to take down. While adopting SLOs and focusing effort on toil reduction are obviously beneficial, taking a defensive position over the production environment is not.

The trigger for this is often an issue impacting production – sometimes major, sometimes minor – or the perception that more audit rigour is needed in this new continuous delivery world. Whatever the trigger, SRE is sometimes proposed as the answer (the R stands for Reliability, right?).

SRE has to shift left and spread more of the wisdom of production to the teams delivering value-adding products and services to that production environment. Deploying value rapidly is crucial; doing so reliably is key. Both are valid reasons for introducing site reliability engineering – the future. However, keeping the wisdom of production in a separate (silo?) SRE team and withholding access to production is a backward step. A touch of back to the future, maybe…

Posted in DevOps, SRE

Heritage Reliability Engineering

I originally wrote this post while helping BPDTS and DWP Digital ->

Colleagues from BPDTS Ltd and DWP Digital are enabling the modernisation of DWP’s portfolio of services. Some people might think we are only developing new services but, in fact, we are applying the latest digital thinking to the large heritage services that still handle huge volumes of our customers’ pension and benefit payments. The department is investing heavily in site reliability engineering (SRE) to maintain these services.

In short, SRE is about applying software engineering principles and practices to the world of service delivery and operations. It’s about prioritising reliability over new features, making sure services are stable, secure and performant for when our users need them most.

For our new services, this is very much part of the launch process. Teams must prove services are reliable before they’ll be made available to the public. This blog-post – Gearing up for Site Reliability Engineering – explains more about our move to SRE.

What does SRE mean for our heritage services?

Firstly, we need to ensure that our existing services can stand the test of time – remembering that some of them are over 20 years old and we still need to operate them for a few years yet. So we’re using DevOps and continuous delivery practices to make our legacy services run on modern platforms and, where necessary, rebuilding applications using the latest tools. We’re also using the public cloud to host and run our services at optimal cost and efficiency.

We have proven that modern practices like agile and DevOps can apply to heritage services and, once we’ve remediated these services, we can then focus on reliability. We coined the term ‘heritage reliability engineering’ to add focus to the effort involved in making these existing services more reliable.

How we’re making heritage services more reliable

We’ve established service level objectives that are more relevant to heritage services – for example, the time it takes to process batch transactions, the payment success rate, or data quality.
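As an illustration of how a heritage SLO like the payment success rate might be checked (the figures and the 99.95% objective here are invented for the example):

```python
def payment_success_sli(payments_attempted: int, payments_failed: int) -> float:
    """Payment success rate as a percentage - a heritage-friendly SLI."""
    return 100 * (payments_attempted - payments_failed) / payments_attempted

# Illustrative figures: 250,000 batch payments, 75 failures
sli = payment_success_sli(250_000, 75)
slo = 99.95  # example objective
print(f"SLI {sli:.3f}% - SLO {'met' if sli >= slo else 'breached'}")
```

The point is that the objective is expressed in terms the business cares about (payments made), not server uptime.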

We’re looking to eliminate toil (non-value adding, repetitive work) by investing time in engineering – by increasing capability and training our colleagues in modern engineering approaches and by allowing colleagues to apply those modern engineering approaches to heritage services. A good example of this is the way we use infrastructure-as-code, which is covered in more detail in our How we’re using Ansible to improve our digital infrastructure blog-post. It talks about how we deploy and configure services and manage the underlying environments, including scaling those environments up or down when necessary.

We’ve also improved our telemetry (how we measure remotely) to make more service operation data available. By making these heritage services more observable, we can spot potential problems before they cause disruptions. This, coupled with the benefits of our cloud platform, is putting our heritage services on a path to be more reliable than ever.
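The kind of check this telemetry enables can be sketched very simply – for example, flagging a batch run whose duration drifts well outside the recent norm. This is a hypothetical sketch with invented durations; real monitoring platforms do far more:

```python
from statistics import mean, stdev

def looks_anomalous(history: list[float], latest: float, sigmas: float = 3) -> bool:
    """Flag a batch-run duration well outside the recent norm."""
    return abs(latest - mean(history)) > sigmas * stdev(history)

# Illustrative batch-run durations in minutes
history = [42.0, 40.5, 41.2, 43.1, 39.8, 41.7]
print(looks_anomalous(history, 58.0))  # True
```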

Posted in SRE

Yellow Brick Road Retrospective

My first post of 2020 will look back at a recent retrospective held at a client in the run-up to Christmas 2019. Those who know me will know that I like to get “creative” in agile delivery – sometimes it works, sometimes it falls flat, but either way I’ll keep trying. Agile is about inspection and adaptation, after all.

The “Yellow Brick Road” retrospective is based (yes, you guessed it) on the wonderful L. Frank Baum story (and subsequent film) “The Wizard of Oz”. To run it you will need some A4 paper (ideally yellow!) as well as the usual post-it notes (three different colours – red/green/blue are perfect), and some space for participants to move around in (you might struggle with this retro in a meeting room!).

This retro format is best suited for an end of year/phase/stage where there is a lot to look back on and time to change things for the better in the future.  A team re-organisation or a new product focus also works well.

Start by getting participants to write on the yellow A4 paper about how they currently feel about the work they are doing, or have been doing. This can include accomplishments/achievements but also challenges/problems that they have faced. Allow 5-10 minutes for this.

Find some space for the team to use, as we are going to create a “yellow brick road” from the A4 pieces of paper. Get each participant to read out what they have written and lay their sheet on the floor, one piece of A4 after the other, giving time for each participant to talk while constructing the road, ideally in a “meandering” way (if you have seen the film, think of the road Dorothy follows out of Munchkin Land).

If this retro is being run to cover quite a long time period, where a lot of things may have happened, then it can be broken into stages – “Scarecrow”, “Tin Man” and “Cowardly Lion”. Each stage could represent one time period (sprint/phase/etc.) or problem area (structure, scope, environment, etc.). Participants can write one piece of yellow A4 per stage if need be (when I ran this we created quite a large road…).

So far so good: from an individual perspective, everyone should have contributed and there should be some consensus around common areas. Next we will introduce the organisational angle – in a similar vein, what has the organisation done to help/facilitate improvement and what has it done to get in the way? (Feel free to use different terminology, but the idea here is to weave together personal feedback and organisational impact.)

We will introduce the organisational angle via post-it notes – one colour will be the “good witch of the North” (representing what the organisation has done to make the world a better place) and another colour will be the “wicked witch of the west” (representing the barriers and limitations the organisation has).

If most of this is not making sense, now might be a good time to watch the movie 🙂

Get everyone to fill in the post-it notes and then get participants to wander down the yellow brick road explaining their post-it notes and highlighting where, when, and whom the organisation has impacted – good and bad. Some things will impact many people (and affect several pieces of A4 paper) while some will impact individuals only. This worked well by placing one set of post-it notes (good witch) on one side (north?) of the yellow brick road and the other colour (wicked witch) on the other side.

Now for the inspect and adapt part. The way to play this out is to aim for the “Emerald City” – a place that (appears to be) the best place in the world! Using the third colour of post-it notes (emerald/green?), get the participants to outline what they want – in an ideal sense – to make their world a better place. Everything is on the table here, organisationally and individually. Encourage participants to aim high: if they could wave a magic wand, what would they change to improve delivery? Participants should share these, ideally on a wall, and construct the “Emerald City”.

Allow a little bit of time for reflection…

Now back to reality (“Kansas – there’s no place like home”). Participants should then take one post-it note and write an action plan of what they will do when they get “home”. Some will write things that involve changes in individual behaviours, others may require organisational changes, others team changes. The key thing here, as facilitator, is to make sure the actions are achievable.

Hopefully with this format “the yellow brick road” will lead you to the “Emerald City” and ultimately back home to “Kansas” with an improved team…

Posted in Agile

The “group verbosity alignment” technique

At a recent Scrum Coaching retreat our group was struggling to turn a lot of words generated in a group environment into something concise and meaningful.  Get many agile coaches together and you get a lot of competing and wordy viewpoints!

Step forward what I am now calling the “group verbosity alignment” technique.

For a topic, allow all present to articulate what it means to them and write this down on post-it notes.

Then get all present to pick their three favourite words from post-it notes that are NOT their own.

Draw up all of the highlighted words on a flip chart, then create a sentence that includes as many of the words as possible, adding grammar and punctuation as required.

A great technique to get everyone’s viewpoint together and focussed!

Posted in Agile, Coaching

The Team, the Whole Team and nothing but The Team.

I work with many clients trying to embrace agility by adopting one of the popular Agile frameworks (e.g. Scrum). Remember, there is a big difference between “doing” Agile and “being” Agile, and a lot of businesses think that because they are “doing” Agile (e.g. Scrum) they “are” Agile (when they are patently not). However, that is for another blog post.

In Scrum there are only three roles – the Scrum Master, the Product Owner and the Team – and while most clients I work with are happy with the first two, they have difficulty with the last: the Team.

The Scrum objective of a true cross-functional/multi-disciplined Team with all the capabilities required to deliver is a nice one, but something that takes time and massive culture change (excepting, maybe, those start-ups who embrace the concept from the start). Inevitably the Team is a collection of individuals brought together but who are on separate journeys through their differing professional lives. Some Team members embrace the concept of one Team and work, within their supporting organization, to achieve this. However, the majority of organizations, and sometimes the individuals within them, find this journey difficult to make. People within organizations are usually structured along clearly defined (and progression-incentivized) role hierarchies (Junior Developer, Developer, Senior Developer, Principal Developer, etc. – for Developer substitute Tester, BA, Architect, or whatever role title you prefer). To propose that Team members cross-skill across other roles is often met with bewilderment, concern and outright resistance. Individuals themselves are usually comfortable working within one delineated role, and feel uncomfortable spreading themselves further. “Not my area, see so-and-so” is a popular refrain.

While this is fine, and can be managed, as part of the transformation journey towards Agile, it should not detract from the One Team focus at the heart of Scrum. Getting people together as a Team is the first step to creating a Team. Once you have a Team there is the chance of the other Scrum Team requisites following. All too often I hear people say they are not partaking in a particular ceremony or Team discussion as it is “not for me”. Developers avoiding design and research sessions, BAs refusing to attend test sessions, Architects not attending Sprint Planning – these are all examples of where the gravity of the “home” role overrides the One Team ethic. The inevitable result is knowledge differences between Team members, internal Team strife and no velocity improvement across sprints. Insist that the whole Team is together for every session from the start.

It may take time to become a true multi-disciplined and cross-functional Team, but the first step is to act like a Team. All ceremonies in Scrum are for all Team members. Success depends on the success of the Team, not the individual. Make sure the Team, the whole Team, and nothing but the Team is always, always there.

Posted in Agile, Coaching

Velocity is an output, not an input

There are some Agile teams I have witnessed where a “Sprint Velocity” metric is used as an “input” to the Agile process. This is particularly manifest when third-party development teams are involved, where the velocity concept is used as some sort of “budget” against which a customer can “purchase” their Product Backlog items. This might give a clean “pound-per-point” contract position, but it goes against the Agile Manifesto (“customer collaboration over contract negotiation…”).

Other teams I’ve observed follow a similar concept and use velocity as some sort of capacity limit – filling the Sprint Backlog with stories (and story points) up to – but not over – the velocity figure. Even more surprisingly, I’ve also witnessed teams (few, it must be said) that estimate the stories as part of Sprint Planning and – surprise, surprise – the total number of story points for the Sprint matches the team velocity (there is a reason why tools like Jira do not allow estimates to be applied after a sprint has started).

How is velocity expected to increase in these kinds of environments? At best it will remain constant, but more often it will embark on a downward trend.

For me, velocity is an output from the Agile process and not an input to it. Velocity is not something that you can control, or indeed something to be used as a control. It simply gives a current reflection of the performance of the team. It will fluctuate and change over time, but it is key that it follows a generally upward trend.

Treat velocity with respect and use it wisely.

One of the key reasons I like Agile and Scrum is the fact-based approach to forward planning. The team who actually deliver the work estimate the work. The same team then deliver that work to the best of their ability – producing a (hopefully) increasing velocity figure. That velocity figure can then help with visualizing when the Product Backlog, in whole or in part, can be delivered based on current velocity. This is fact-based release planning and is more accurate than guesswork.
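That fact-based forecast can be sketched in a few lines (the backlog size and sprint velocities below are illustrative):

```python
import math

def sprints_to_deliver(backlog_points: int, recent_velocities: list[int]) -> int:
    """Forecast sprints remaining using the average of recent sprint velocities."""
    avg_velocity = sum(recent_velocities) / len(recent_velocities)
    return math.ceil(backlog_points / avg_velocity)

# Illustrative: 120 points left, last three sprints delivered 18, 22 and 20 points
print(sprints_to_deliver(120, [18, 22, 20]))  # 6
```

Note that velocity feeds the forecast after the fact; it is never fed in as a target.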

Key takeaways: Velocity is an output, not an input. Velocity should exist in an environment which encourages it to increase. Velocity is a fact that can be applied to more accurate forward planning.

Posted in Agile, Coaching

Watching the flower grow, DevOps transformation at Scale

DevOps does not mean employing “super admins” (despite what the recruitment agencies may say!). DevOps also does not mean building another silo (the “DevOps Team”) alongside other silos – although a DevOps team done right can be a catalyst for change.

DevOps isn’t about throwing Dev and Ops together (like throwing two atoms together the reaction is somewhat nuclear in nature!).

What DevOps does mean is cultural and organisational transformation, underpinned by appropriate automation. Yes – you may have heard this all before but how do you go about doing this if you are not a start-up? And how do you do it at scale?

Well, the catalyst for this is a good DevOps Coach – preferably someone Agile, but also with a strong business, as well as technical, skill-set. But then what do you do next?

Let me start by drawing a picture, which some of you may recognise, consisting of many Development teams reliant upon (surrounding?) an (in-house or outsourced) Ops (or WebOps) team:


This is the typical silo formation that the DevOps movement was formed to alleviate, and there is chapter and verse elsewhere on the web as to the problems with this.

A typical first step in the DevOps transformation journey is to make Ops available to the development teams (where were they before?). This involves tentative bridge-building between the functions (breaking down the “wall of confusion”), some relationships forming and, if things progress well, suggestions from the inside on how the Dev and Ops relationship can be improved. This is a recognised DevOps pattern – Ops feeding into the Development backlog and Development feeding into the Ops backlog (also known as “design for operations”). This is also why Agile/Scrum is a good fit here: a backlog of Dev and Ops items woven together, as well as the Scrum ceremony feedback loops to measure and improve. This is when the kindergarten-style flower begins to appear:


“But this won’t scale!” I often hear. The Ops team feel swamped by all of the requests from the Development “petals”. Ops may not have the capacity to support this model in the short term, but the worst thing you can do is remove the petals (“they love me, not!”). Instead the flower should be allowed to “grow” into something more cohesive, with previously disparate Development and Operations teams pulling together a structure to improve overall delivery flow. Natural selection is the order of the day here, with individuals from Development and Operations (and ALL functions of operations too – Security, DBA, Audit, etc.) forming cohesive teams based on their current point in the delivery lifecycle and with business-aligned goals. This is all augmented through a common sense of purpose and compliance with central standards (the double-underlined “stigma” in the flower below – quite an appropriate word!).

Our flower now begins to take shape:


But what is the glue that binds all this together? A cross-team Scrum-of-Scrums with a focus on DevOps is a good example. A DevOps community or “guild” is another. Even some of the Agile-at-scale concepts can be applied. Improved communication is the key, however – every touch-point in our flower is a new group of communicating individuals who have aligned to address the business’s objectives.

At real scale you will see multiple flowers starting to appear – all joined at the same “management” stem. A stem that channels new ideas, new business objectives and new constraints, but that also shares ideas, problems, opportunities and challenges. Our flower now starts to look like this: a strong (and growing) “plant” of DevOps capabilities joined at a co-ordinating management stem:


In my next blog I will write about some of the “bees” that flutter across these flowers within organisations – bees that can cross-pollinate new ideas and solutions but that sometimes can also cause the flowers to wither and die… watch this space…

Posted in Coaching, DevOps, Uncategorized

Musings on the latest ThoughtWorks Technology Radar

It’s been some time since I put down some feedback on the Tech Radar, of which I am an avid and enthusiastic supporter. However, the January 2015 edition has a few gems which I want to replay.


I’ve worked in the Agile + DevOps (= continuous delivery!) space almost exclusively since 2012, and so many organizations have tried to break down the silos between Development and Operations by establishing a separate DevOps team. Cue yet another silo! This is something that I’ve advised organizations against since the start, so it is interesting to see “Separate DevOps Team” as something the Tech Radar suggests should be a “hold”. At last!

I’ve also worked with multi-team, i.e. large-scale, Agile/Scrum teams, in both the public and the private sector, and the SAFe framework did not quite resonate. There are a few elements that are definitely worth adopting – a cross-team Scrum of Scrums, the concept of a release/deploy “train” and program-level (multi-epic?) collaboration. However, SAFe left me cold when it again suggested a separate DevOps function. No surprise to see that SAFe is also in the “hold” category.

On the flip side it is great to see MTTR (mean time to recovery) as a metric that businesses should adopt – and focus on. Once this has the management focus it deserves, businesses begin to propagate a culture of reliability, pulling investment in cloud, tooling and processes in areas like APM (application performance management) towards an MTTR of zero.
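The metric itself is simple to compute, which is part of its appeal. A minimal sketch with invented incident timestamps:

```python
from datetime import datetime

def mttr_minutes(incidents: list[tuple[str, str]]) -> float:
    """Mean time to recovery in minutes from (start, resolved) ISO timestamps."""
    total_seconds = sum(
        (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()
        for start, end in incidents
    )
    return total_seconds / len(incidents) / 60

incidents = [
    ("2015-01-10T09:00", "2015-01-10T09:45"),  # 45-minute outage
    ("2015-01-14T22:30", "2015-01-14T23:00"),  # 30-minute outage
]
print(mttr_minutes(incidents))  # 37.5
```

The hard part is not the arithmetic but the culture: agreeing on what counts as “recovered” and driving the number down.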

Organizational dynamics are also a rich vein of gold, and only recently I’ve tried to steer a major UK government agency away from the trap that is Conway’s Law. Like many organizations they talked the talk around Digital, Agile and new ways of working, but didn’t walk the walk. I didn’t realize there was a formal name for the work I’d been doing, but the “inverse Conway maneuver” encapsulates it well (a “trial” on the Tech Radar).


Hand-baking of cloud infrastructure seems to have a magnetic allure for a lot of people. Maybe it’s the thought of being at the “cutting edge”, but for me it is all about focus. Turnkey dev, test and other pre-production environments should be the norm, and Digital Ocean (“trial”) is a good example. Instant developer productivity, with the tools they are familiar with, using the same infrastructure-as-code capabilities that production-ready cloud providers use, provides that seamless, and more cost-effective, delivery/deployment train I described earlier. Heck, it even uses SSDs!


I’ll gloss over the pushing of Go, for obvious reasons, and focus only on Xamarin. Re-use of code across multiple platforms has been a dream for as long as I have worked in IT (remember 4GLs?). Add to this the “mobile first” ambitions of modern “Digital”, along with the gradual migration from Microsoft to other platforms (!), and you have a lot of businesses with a lot of C#… Cue Xamarin with its ability to use C# code across multiple, modern platforms – yes, even the iPhone. My old friend James Tuck is doing a sterling job pushing this, and it is reflected in the Tech Radar rating of “trial”.


I’ll focus on Django REST. I’ve loved REST since my days at Sage architecting (and promoting) the SData RESTful framework. Fast forward a few years to the modern API-first, microservices-led revolution – with better security (OAuth out of the box) – and you will soon be led to Django REST. You just wait and see.

Posted in Uncategorized

A bad work-person blames their tools

In less politically correct times the adage was “a bad workman blames his tools”. As a one-time apprentice mechanical engineer (failed!) I heard this a lot. In recent times, however, I’ve been wanting to use this put-down a lot when hearing people talk about their technology tools: “Ah, man, Jenkins is broke again”, “Damn it, the Puppet run failed”, “Holy cow, Git hasn’t got my source code”. If you are reading this and none of this resonates, then you are reading the wrong blog!

Seriously, we need to elevate the status of some of these tools so that they reflect the important job they do for us. I was thinking about this when reading an article about Kevin Jorgeson and Tommy Caldwell, who scaled El Capitan in Yosemite National Park. Whereas we, as software engineers and the like, use our preferred tools and blame them when things sometimes go wrong, the lives of these guys depended on their tools. Heck, these guys used to sleep by hanging themselves (in a tent) off a cliff face. I think we’d all take more interest in our tools if our lives depended on them, right? So let’s stop blaming the tools for the stuff we do. Take a leaf out of the book of Kevin and Tommy and, when you want a tool to do something (run a test, deploy some servers, build some code), think as if your life depended on that tool. Check, double check and even triple check that you have done things right. And if you don’t get the result you expect (a test failed, a server didn’t start, the build “broke”), at least your life did not depend on it. A bad work-person blames their tools!!!

Posted in Agile, Coaching, DevOps