Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites


For several years I managed the 3rd line site reliability operation for many of the world’s busiest gambling sites, working for a little-known company that built and ran the core backend online software for several businesses that each at peak could take tens of millions of pounds in revenue per hour. I left a few years ago, so it’s a good time to reflect on what I learned in the process.

In many ways, what we did was similar to what’s now called an SRE function (I’m going to call us SREs, but the acronym didn’t exist at the time). We were on call, had to respond to incidents, made recommendations for re-engineering, provided robust feedback to developers and customer teams, managed escalations and emergency situations, ran monitoring systems, and so on.

The team I joined was around 5 engineers (all former developers and technical leaders), which grew to around 50 of more mixed experience across multiple locations by the time I left.

I’m going to focus here on process and documentation, since I don’t think they’re talked about usefully enough where I do read about them.

If you want to read something far longer, Google’s SRE book is a great resource.

Process

Process is essential to running and scaling an SRE operation. It’s the core of everything we achieved. When I joined the team, habits were bad – there was a ticketing system, but one-journal resolutions weren’t unusual (‘Site down. Fixed, closing.’).

An SRE operation is basically a factory processing information and should act accordingly. You wouldn’t have a factory running without processes to take care of the movement of goods, and by the same token you shouldn’t have an information-intensive SRE operation running without processes to take care of the movement of information.

One frequent objection to process I heard is that it ‘stifles creativity’. In fact, effective process (bad process applied poorly can mess anything up!) clears your mind to allow creative thought.

A great book on this subject is ‘The Checklist Manifesto’, which inspired many of the changes we made, and was widely read within the team. It cites the example of the aviation industry’s approach to process, which enables remarkable creativity under stressful conditions by mental automation of routine operations. There’s even a film about one incident discussed, and the pilot himself cited checklists and routine as an enabler of his quick-thinking creativity and control in that stressful situation. In fact, we used a similar process ourselves: in emergency situations, an experienced engineer would dive into finding a solution, while a more junior one would follow the checklist.

Another critique of process is that it can inhibit effective working and collaboration. It absolutely can if process is treated as an entity justified by its own existence rather than as another living asset. The only thing that can guard against this is culture. More on that later.

Process – Tooling

The first thing to get right is the ticketing system. As with monitoring solutions, people obsess over which ticketing system is best. And they’re wrong to. You will generally end up preferring the ticketing system you use simply because of familiarity. The ticketing system is only bad if it drives or encourages bad processes. What a bad process is depends on the constraints of your business.

It’s far more important to have a ticketing system that functions reliably and supports your processes than the other way round.

Here’s an example. We moved from RT to JIRA during my tenure. JIRA offered many advantages over RT, and I would generally recommend JIRA as a collaborative tool. The biggest pain we had switching, however, was the loss of some functionality we’d built into RT, which was critical to us. RT allowed us to get real-time updates on tickets, which meant that collaboration on incidents was somewhere between chat and ticketing. This record was useful in post-incident review. RT also allowed us to hide entries from customers, which again was really hard to lose. We got over it, but these things were surprisingly important because they’d become embedded in our process and culture.

When choosing or changing your ticketing system, think about what’s really important to operations, not particular features that seem nice when on a checklist. What’s important to you can range from how nice it looks (seriously – your customers may take you more seriously, and your brand may be about good design) to whether the reporting tools are powerful.


After process, documentation is the most important thing, and the two are intimately connected.

There’s a book in documentation, because, again, people focus on the wrong things. The critical thing to understand is that documentation is an asset like any other. Like any business asset, documentation:

  • If well looked after, will return investment many times over
  • Requires investment to maintain (like the fabric of a factory)
  • If out of date, costs money just because it’s there (like out-of-date inventory)
  • If of poor quality, or not usable, is a liability, not an asset

But this is not controversial – few people disagree with the idea that good documentation is useful. The point is: what do you do about it?

Documentation – Where We Were

We were in a situation where documentation provided to us was not useful (eg from devs: ‘a network partition isn’t covered here because it is highly unlikely’. Well, guess what happened! And that was documentation they actually bothered to write…), or we simply relied on previously-journalled investigations (by this time we were writing things down) to work out what to do next time something similar happened.

This was frustrating for all of us, and we spent a long time complaining about the documentation fairy not visiting us before we took responsibility for it ourselves.

Documentation – What I Did

Here’s what I did.

  • I took two years’ worth of priority incidents (ie those that resulted in – or would have resulted in – an out of hours call), and listed them. There were over 1700 of them.
  • Then I categorised them by type of issue.
  • Then I went through each type of issue and summarised the steps needed to either resolve it, or get to a point where escalation was required.
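As a rough illustration only (this is not the tooling used at the time), the list-then-categorise pass might be sketched like this in Python. The record fields, ticket ids, and category names are all hypothetical; the article does not describe how the incident list was stored.

```python
from collections import Counter, defaultdict

# Hypothetical sample of exported priority incidents. The real list had
# over 1700 entries; these fields and values are invented for illustration.
incidents = [
    {"id": "INC-1", "category": "database disk full"},
    {"id": "INC-2", "category": "network partition"},
    {"id": "INC-3", "category": "database disk full"},
    {"id": "INC-4", "category": "payment not credited"},
]


def group_by_category(items):
    """Return a mapping of category -> list of incident ids, in input order."""
    groups = defaultdict(list)
    for item in items:
        groups[item["category"]].append(item["id"])
    return dict(groups)


def most_common_categories(items):
    """Return (category, count) pairs, most frequent category first."""
    return Counter(item["category"] for item in items).most_common()


groups = group_by_category(incidents)
ranking = most_common_categories(incidents)
```

Ranking the categories by frequency gives one plausible order in which to write up the steps for each type of issue, starting with the ones that page people most often.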

This took seven months of my full-time attention. I was a senior employee and I was costing my company a lot of money to sit there and write. And because I had a clueful boss, I never got questioned about whether this was a good use of time. I was trusted (culture, again!). I would say it took four months before any dividends at all were seen from this effort. I remember this four-month period as a nerve-wracking time, as my attention was taken away from operations to what could have been a complete waste of my time and my employer’s money and an embarrassing failure.

Why not give it to an underling to do? For a few reasons. This was so important, and we had not done it before, so I needed to know it was being done properly. I knew exactly what was needed, so I knew I could write it in such a way that it would be useful to me at the very least. I was also a relatively experienced writer (arts grad, former journalist), so I liked to think that that would help me write well.

We called these ‘Incident Models’ as per ITIL, but they could also be called ‘run books’, ‘crib sheets’, whatever. It doesn’t matter. What mattered was:

  • They were easy to find/search
  • It was easy to identify whether you had a match
  • They weren’t duplicated
  • They could be trusted

We put this documentation in plain text within the ticketing system, under a separate JIRA project.

The documentation team got wind of what we were up to and tried to pressure us to use an internal wiki for this. We flat-out refused, and that was critical: the documentation system’s colocation with the ticketing system meant that searching and updating the documentation had no impedance mismatch. Because it was plain text it was fast, easy to update, and uncluttered. We resisted process that jeopardised the utility of what we were doing.

Documentation and the Criticality of De-cluttering

When we started, we designed a schema for these Incident Models which was a thing of beauty, covering every scenario and situation that could crop up.

In the end it was almost a complete waste of time. What we ended up using was a really simple structure of:

  • Statement of problem
  • Steps 1-n of what to do
  • Further/deeper discussion, related articles

That was it. Attempts to structure it more thoroughly all failed, as the result was either confusing to newcomers, created too much administrative overhead, or didn’t cover enough. Some articles developed their own schema over time that was appropriate to the task, and new categories (eg the ‘jump-off’ article that told you which article to go to next) evolved over time. We couldn’t design for these things upfront because we didn’t know what would work or what would not.
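To make the three-part structure concrete, here is a minimal sketch in Python of such an article rendered as plain text. The class name, section labels, and the example content (including the `IM-12` reference) are assumptions made for illustration, not the format actually used in the JIRA project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentModel:
    """Three-part article: problem statement, numbered steps, further notes."""
    problem: str
    steps: List[str]
    discussion: str = ""
    related: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the article as uncluttered plain text."""
        lines = ["PROBLEM", self.problem, "", "STEPS"]
        lines += [f"{n}. {step}" for n, step in enumerate(self.steps, start=1)]
        if self.discussion or self.related:
            lines += ["", "FURTHER DISCUSSION"]
            if self.discussion:
                lines.append(self.discussion)
            lines += [f"See also: {ref}" for ref in self.related]
        return "\n".join(lines)


# Hypothetical example content, invented for this sketch.
model = IncidentModel(
    problem="Database has run out of disk space",
    steps=[
        "Confirm which volume is full",
        "Escalate to the on-call tech lead if space cannot be freed",
    ],
    discussion="Usually follows an unusually busy event.",
    related=["IM-12"],
)
text = model.render()
```

Keeping the structure this small is the point: anything beyond a problem, steps, and optional notes is left to evolve per article rather than being designed upfront.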

Call it ‘agile documentation’ if you like – agile’s what sells these days (it was ITIL back then). Again, what was critical was that simplicity and utility trumped everything else.

There Is No Documentation Fairy

Having spent all this time and effort, a few other things became clear when it comes to documentation.


First, we gave up accepting documentation from other teams. If they commented code, great; if there was something useful on the wiki for us to find, also great. But when it came to handing over projects we stopped ‘asking for documentation’. Instead we’d arrange sessions with experienced SREs where the design of the project would be discussed.

Invariably (assuming they had no ops experience), the developer would focus on the things they’d built and how they worked – and these things were generally the most thoroughly tested and least likely to fail.

In contrast, the SRE would focus on the weak points, the things that would go wrong. ‘What happens if the network gets partitioned? What if the database runs out of disk? Can we work out from the logs why the customer didn’t get paid?’

We’d then go away and write our own documentation and get the engineer to sign off on it – the reverse of the traditional flow! They’d often make useful comments and give us added insights in the process.

The second thing we noticed was that our engineers were still reluctant to update the docs that only they were using. There was still a sense that documentation should be given to them. The leadership had to constantly reinforce that this was their documentation, not tablets of stone handed down from on high, and that if they didn’t continually maintain the docs, they’d become useless.

This was a cultural problem and took a long time to undo. Undoing it also required the documentation changes to be reinforced by process.

In the end, I’d say about 10% of the ongoing working time was spent maintaining and writing documentation. After the initial 7-month burst, most of that 10% was spent on maintenance rather than producing new material.

Documentation – Benefits

After getting all this documentation done, we experienced benefits far in excess of the 10% ongoing cost. To call out a few:

Before this process began we were reluctant to take on less experienced staff. After, onboarding became a breeze. Among other things the training involved following incidents as they happened and shadowing more experienced staff. New staff were tasked with helping maintain docs, which helped them understand what gaps they had in their knowledge.

The docs gave us a resource that allowed us to identify training requirements. This ended up being a curriculum of tools and techniques that any engineer could aim to get a working knowledge of.

  • Less stress through more effective escalation

This was a big one. Before we had the step-by-step incident models, when to escalate was a stressful decision. Some engineers had a reputation for escalating early, and all were worried about whether they’d ‘missed something obvious’ before calling a responsible tech lead out of hours. SREs would also get called out for not escalating early enough as well!

The incident models removed that problem. Pretty quickly, the first question an escalated-to techie asked was ‘have you followed the incident model’? If so, and something obvious was missed, then gaps in it became clear and were quickly fixed. Soon, non-SREs were busy updating and maintaining the docs themselves for when they were escalated-to. It became a virtuous circle.

The obvious value of documentation to the team helped improve discipline in other respects. Interestingly, SREs previously had the reputation for being the ‘loudest’ team – there was often a lot of ‘lively’ debate, and the team was very social – which made sense, as we relied on each other as a team to cover a large technical area, dealt with often non-technical customer execs, and sharing knowledge and culture was critical.

As time progressed, the team became quieter and quieter – partly because of the introduction of chatrooms, increased remote working, and global teams, but also because so much of the work became routine: follow the incident model, and when you’re done, or don’t understand something, escalate to someone more senior.

Automating the investigations this way meant that the way was clear to further automate them with software.

Having metrics on which tickets were linked to which incident models meant that we knew where best to focus our effort. We wrote scripts to comb through log files in the background, make encoding problems quicker and easier to work out, automate responses to customers (‘Issue was caused by a change made by app admin user XXX’), and many more.
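A minimal sketch of that kind of metric, assuming each ticket carries an optional link to the documentation article that was followed. The field name `incident_model` and the ids are hypothetical, and this is not the script we ran; it only illustrates how a simple count can show where automation effort pays off first.

```python
from collections import Counter

# Hypothetical ticket export: each ticket optionally links to the
# documentation article that was followed to resolve it.
tickets = [
    {"id": "T-101", "incident_model": "IM-7"},
    {"id": "T-102", "incident_model": "IM-2"},
    {"id": "T-103", "incident_model": "IM-7"},
    {"id": "T-104", "incident_model": "IM-7"},
    {"id": "T-105", "incident_model": None},
]


def automation_candidates(items, top=3):
    """Return the most frequently used articles as (article id, ticket count)."""
    counts = Counter(t["incident_model"] for t in items if t["incident_model"])
    return counts.most_common(top)


def unlinked(items):
    """Return ids of tickets with no article linked: possible documentation gaps."""
    return [t["id"] for t in items if not t["incident_model"]]


candidates = automation_candidates(tickets)
gaps = unlinked(tickets)
```

The most-used articles are the obvious candidates for scripting, while tickets with no link at all point at missing or hard-to-find documentation.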

These automations inspired an automation tool we built for ourselves based on pexpect: http://ianmiell.github.io/shutit/ But that’s another story. Essentially, once we got going it was a virtuous circle of continuous improvement.

Back to Process

Given you have all these assets, how do you stop them from degrading in value over time? This is where process is critical.

Two processes were critical in ensuring everything continued smoothly: triage and post-incident review.

Process – Triage

5%-10% of time was spent on the triage process. Again, it took a long time to get the process right, but it resulted in big savings:

  • Reduce the steps to the minimum useful steps

It’s so tempting to put as much as possible into your triage process, but it’s vital to prioritise the value in the process over completeness. Any step that isn’t often useful tends to get forgotten and skipped by the triager.

  • Focus on saving cost in the process

Searching for duplicates, finding the relevant incident model, reverting quickly to the customer, and escalating early all reduced the cost per ticket massively. It also saved other engineers the context switch of being asked a question while they’re thinking about something else. It’s hard to measure the benefits of these things, but we were able to deal with greater volumes of incidents with fewer people and less stress. Senior management and customers noticed.

Recording the details of these efforts also saved time, as (for example) engineers given a triaged ticket could see that the triager had searched for previous incidents with a string that perhaps they could improve on. It also meant that more experienced staff could review the triage quality.
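One way to picture this is a triage record that keeps every search string tried alongside whatever it found. The sketch below is an assumption-laden illustration in Python: the class, the field names, and the sample tickets are hypothetical, and a real ticketing system would use its own search rather than a substring match.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriageRecord:
    """What the triager did, kept with the ticket for later review."""
    ticket_id: str
    searches: List[str] = field(default_factory=list)    # search strings tried
    duplicates: List[str] = field(default_factory=list)  # likely duplicate ids


def search_previous(record, query, previous_tickets):
    """Case-insensitive substring search that also logs the query used."""
    record.searches.append(query)
    needle = query.lower()
    hits = [t["id"] for t in previous_tickets if needle in t["summary"].lower()]
    for hit in hits:
        if hit not in record.duplicates:
            record.duplicates.append(hit)
    return hits


# Hypothetical earlier tickets, invented for this sketch.
previous = [
    {"id": "T-12", "summary": "Database disk full on reporting host"},
    {"id": "T-40", "summary": "Customer not paid after settlement"},
]

record = TriageRecord(ticket_id="T-200")
first_hits = search_previous(record, "Disk Full", previous)
second_hits = search_previous(record, "timeout", previous)
```

Because the failed search for "timeout" is recorded too, the next engineer can see what was already tried and pick a better string instead of repeating it.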

Experienced staff need to review the triage process regularly to make sure that it’s actually being applied effectively.

When I moved to another operations team (in a domain I knew far less about), I cut the incident queue in half in about three days, just by applying these techniques effectively. The triage process was there, but it wasn’t being followed with any thought or oversight, and was given to a junior member of staff who was not the most reliable. Big mistake. Triage should be done – or overseen – by someone with a lot of experience: while it looks routine and mechanical, it involves a lot of significant decisions that rest on experience in the field.

And yes, I was the new boss, and I chose to spend my first week doing the ‘lowly’ job of triage. That’s how important I thought it was.

No-one wants to do triage for long, so we rota’d it per week. This allowed some continuity and consistency, but stopped engineers from going crazy by spending too long doing the same job over and over.

Process – Post-Incident Review

The mirror image of triage was the ‘post-incident review’. Every ticket was reviewed by an experienced team member. Again, this was a process that took up about 5% of effort, but was also significant.

A standard form was filled out and any recommendations were added to a list of backlog ‘improvement’ tasks which could be prioritised. This gave us a figure for technical/process debt that we needed to look at.


I’ve mentioned culture a few times, and it’s what you always return to if you’re trying to achieve any kind of change at all, since culture is at root a set of conceptual frameworks that underlie all our actions.

I’ve also mentioned that people often focus on the ‘wrong thing’. Time and again I hear people focus on tools and technology rather than culture. Yes, tools and technology are important, but if you’re not using them effectively then they’re worse than useless. You can have the best golf clubs in the world, but if you don’t know how to swing and you’re playing baseball then they won’t help much.

Culture requires investment way more than technology does (I invested over half a year just writing documentation, remember). If the culture is right, people will look for the right tools and technology when they need to.

When given a choice about what to spend time and money on, always go for culture first. It cost me a lot of budget, but forcibly removing an ‘unhelpful’ team member was the best thing I did when I took over another team. The rest of the team flowered once he left, no longer stifled by his aggressive behaviour, and many things got done that didn’t before.

We also built a highly effective team with a budget so small that recruiters would phone me up to shout at me that what I was looking for was ‘impossible’, but by focussing on the right behaviours, investing time in the people we found, and having good processes in place, we got an extremely effective and loyal team who all went on to bigger and better things within and outside the company (but mostly within!).


A short note on politics. You’ve got to pick your battles. You’re not going to get the resources you need, so drop the stuff that won’t get done on the floor.

Yes, you need a monitoring solution, better documentation, better trained staff, more testing… you are not going to get all these things unless you have a money machine, so pick the most important and try to solve that first. If you try to improve all these things at once, you will likely fail.

After process and documentation, I tried to crack the ‘reproducible environment’ puzzle. That led me to Docker, and a complete change of career. I talk about these things a little here and here.

Any Questions?

Reach me on Twitter: @ianmiell

Or LinkedIn

My book Docker in Practice:

Get 39% off with the code: 39miell2

