Being part of an oncall rota is pretty much a certainty if you work in the systems or operations team at any IT orientated company. Stuff needs to be working 24 hours a day, 7 days a week and therefore you need someone available to apply duct tape and staples when things go wrong out of normal working hours. It continually amazes me, therefore, that so many companies get the management and organisation of their rota badly wrong, sometimes to the point of it being a significant factor in staff moving on.
As someone who’s participated in oncall rotas at all levels, here’s how I think one should be organised:
- Sysadmins should only participate in the rota after they have completed their probationary period. Think of the goal of the probation period as getting someone up to speed so they can participate.
- Each oncall shift should last one week, and rotate at 2.00pm on Tuesday. If you run a late shift, i.e. 11.00am to 8.00pm, make the person on that shift also be oncall. Get all the unsociable hours out of the way in one lump.
- The bare minimum gap between oncall shifts should be 5 weeks. If your rota is shorter than this, then you don’t have enough qualified staff to cope with holidays, illness, paternity/maternity leave etc anyway.
- The oncall shift should be mapped at least 3 months into the future. Staff should be free to swap oncall weeks providing that no-one is ever oncall two weeks in a row, and that all swaps are cleared with their line manager.
- Issue the oncaller with a good quality mobile phone and 3G capable laptop. The phone should not be a smartphone; the aim should be maximum battery life and talk time. The laptop should be more like a netbook than a desktop replacement, should come with a spare battery, and should dual boot to Windows and a useful Linux desktop distro. These days I’d shell out for a nice big SSD to stick in it too. Make sure there are no restrictions on international calling and no data caps for the phone and 3G card.
- Don’t give out the direct number of the oncall phone. Instead, use an answering service as a filter. During office hours, the answering service should redirect enquiries to the regular helpdesk/support number. Out of hours, the answering service should accept calls, take details and then pass these on to the oncall number. Additionally this service should text and email call details so there’s a record.
- Pay a fixed daily amount for being oncall. Pay 1.5 times this amount for being oncall on weekend days. Pay 2 times this amount for being oncall on a public holiday. Where someone is oncall on a public holiday, add one day to their holiday allowance.
- Pay a per-incident fee when oncall is used. Each oncall use should be tracked in your ticketing system. Make using oncall a business cost, thus giving the business a reason to make sure oncall is not used trivially, and a reason to make sure problems are fixed permanently and not just temporarily alleviated.
- When the oncall person has dealt with out of hours issues, don’t expect them in at the regular time the next day. Expect them to use their judgement to make sure they are suitably rested. You do not want an overly tired oncaller dealing with problems on production systems.
- The person oncall should never be taking on “out of hours” work. Want a database dumped and reloaded? Want a disk unmounted and fscked overnight when a server’s not busy? All those things can be done, but not by the oncall person. That person is there to respond to problems, not to perform routine or planned maintenance.
- Make it very clear that abuse of oncall is unacceptable. Oncall is there to fix customer or service affecting problems, not to help someone with Excel.
- Have a clear demarcation between production and testing/development/QA systems. The latter group are not oncall’s responsibility to fix.
- If you have offices around the world, have a “follow the sun” oncall system and get your offices to cover each other.
- Have realistic expectations of oncall response times. If you need to guarantee that problems are attended to within 20-30 minutes then you should be running an overnight shift, not oncall.
- Expect a daily report summarising the previous 24 hour oncall period, even if that report is “Nothing to report”. The weekend period could be lumped together on the Monday.
- Have a weekly oncall handover meeting between the outgoing and incoming oncall staff.
- During the day, have a junior or trainee sysadmin be oncall. It’s good practice for them.
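As a rough sketch of the scheduling rules above (weekly shifts that rotate at 2.00pm on Tuesday, a rota mapped about three months ahead, and swaps that never leave anyone oncall two weeks running), something like the following would do. The function names are made up for illustration, and reading "a gap of 5 weeks" as "at least six people on the rota" is my assumption. Line manager approval of swaps is not modelled.

```python
from datetime import datetime, timedelta

HANDOVER_WEEKDAY = 1  # Tuesday (Monday is 0)
HANDOVER_HOUR = 14    # 2.00pm
MIN_ROTA_SIZE = 6     # assumption: six people gives five weeks off between shifts


def next_handover(after):
    """Return the first Tuesday 14:00 at or after the given datetime."""
    candidate = after.replace(hour=HANDOVER_HOUR, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(HANDOVER_WEEKDAY - candidate.weekday()) % 7)
    if candidate < after:
        candidate += timedelta(days=7)
    return candidate


def build_rota(staff, start, weeks=13):
    """Assign one-week shifts round-robin, mapped `weeks` ahead (13 is about 3 months)."""
    if len(staff) < MIN_ROTA_SIZE:
        raise ValueError("too few people for a five-week gap between shifts")
    first = next_handover(start)
    return [(first + timedelta(weeks=i), staff[i % len(staff)]) for i in range(weeks)]


def swap_is_allowed(rota, i, j):
    """Check a proposed swap of weeks i and j: nobody may be oncall two weeks in a row."""
    names = [name for _, name in rota]
    names[i], names[j] = names[j], names[i]
    return all(a != b for a, b in zip(names, names[1:]))
```

Calling `build_rota(["A", "B", "C", "D", "E", "F"], datetime.now())` gives a list of (handover time, name) pairs, and `swap_is_allowed(rota, 0, 5)` rejects a swap that would put the same person on weeks 5 and 6.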
Doing all of the above shows that you take oncall seriously, and you appreciate the impact being oncall has on someone’s life. Your oncall staff are the people who salvage the business’s reputation when the midden hits the windmill at some unworldly hour of the night. Keeping them happy and making them feel valued and respected can only be to the business’s benefit.
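To put numbers on the pay rules in the list above (a fixed daily amount, 1.5 times for weekend days, 2 times for public holidays, plus a day added to holiday allowance for each public holiday worked), here is a minimal sketch. The function name and the daily rate are invented for illustration. It assumes the multipliers don't stack, so a public holiday that falls on a weekend pays 2 times, and it leaves out per-incident fees.

```python
from datetime import date, timedelta

WEEKEND_MULTIPLIER = 1.5
HOLIDAY_MULTIPLIER = 2.0


def oncall_pay(days, daily_rate, public_holidays=frozenset()):
    """Return (total pay, extra holiday days owed) for the given oncall dates."""
    total = 0.0
    days_in_lieu = 0
    for day in days:
        if day in public_holidays:
            # Assumption: the holiday rate replaces the weekend rate, it doesn't stack.
            total += daily_rate * HOLIDAY_MULTIPLIER
            days_in_lieu += 1
        elif day.weekday() >= 5:  # Saturday or Sunday
            total += daily_rate * WEEKEND_MULTIPLIER
        else:
            total += daily_rate
    return total, days_in_lieu


# A week starting Tuesday 19 December 2023, with Christmas Day as a public holiday.
week = [date(2023, 12, 19) + timedelta(days=n) for n in range(7)]
pay, extra_leave = oncall_pay(week, daily_rate=40, public_holidays={date(2023, 12, 25)})
```

With a made-up rate of 40 a day, that week is four normal days, two weekend days and one public holiday, so it pays 360 and earns one extra day of leave.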
The bare minimum gap between oncall shifts should be 5 weeks. If your rota is shorter than this, then you don’t have enough qualified staff to cope with holidays, illness, paternity/maternity leave etc anyway.
Yes, but what if the job doesn’t require more than one or two people in the team? When I was the e-commerce person at $ISP, that was that – I was the only person. When I was on holiday they got the developer to deal with emergencies, I think. You can’t argue that a company should employ several people for a job that only needs one person.
The rule about not being able to be on call for two consecutive weeks also sucks IF the people doing the job are happy with it (i.e. they want to swap so that it works out that way). I am assuming that systems run well enough that you’re not expecting a call out every night, so there’s no worry about sleep deprivation for that person. What if someone actively wants to do the on call – perhaps they need the money, live near work, whatever – and the other team members don’t mind?
Finally, why 2pm on Tuesdays? Are you trying to say “Shift changeover should not happen at 9am on Mondays” (in which case I agree) or do you have a thing about Tuesdays?
I agree with a lot of your points though – planning a long way in advance, having a shift report, etc – but some of them do seem very arbitrary.
A pretty thorough set of recommendations, not too much to add, but here are my thoughts (having done oncall for five years):
To answer the previous commenter, Mondays are more likely to be public holidays than Tuesdays. Tuesdays are a good day for that reason.
I don’t recommend paying per-incident callout fees, as it could encourage sloppy behaviour to generate extra fees. I do agree with sharing the pain of oncall with the rest of the business to ensure that the root cause is addressed.
Oncall is a point of escalation in a time of emergency. Therefore if serious issues occur that require time to deal with, that time should be repaid in time off.
I’ve been on three week and four week rotations before, and they’ve not been too bad. I have worked two weeks in a row once or twice; again, it was bearable. But it shouldn’t be the norm.
Flash, I suggest Tuesdays because they won’t be bank holidays. I suggest 2pm as it’s just past lunchtime, the incoming oncaller has time in the morning to clear their deck, and time in the afternoon to do things like checking their VPN account still works, making sure the oncall laptop and phone are fully charged and so on.
I agree there’s some scope for dishonesty with per-incident payments. Hence my suggestion that an incident is tied to a ticket. That combined with the daily reports, and the trends for oncall incidents over time, should make faking things difficult in the first place, and fairly obvious when it happens. If disks always fill up when Martin’s oncall, and never do for anyone else, there’s a problem that needs to be addressed.
Just one point to make – you might like to restrict who has access to the on-call number. This will act as a muppet filter, preventing many of the “my start bar has moved…” type calls, if most of the minions need to get their line manager to call on their behalf. Most people don’t care about upsetting some IT lackey, but if they have to disturb their own boss, it’s usually for a good reason.
If you have a shift team *and* an on-call escalation point, the on-call report should probably include details of who escalated to the on-call support person, as often you’ll find some engineers will only contact the on-call bod if it’s a real emergency, whereas others will hit that speed-dial button faster than you can say “training requirements”.
People should be paid for working out of hours, it’s only fair. Alternatively, they should get time off in lieu at a good rate; in fact, this is preferable if you’re to avoid “burn out” in the best staff. If you suspect that you’re incentivising staff for outages that they’ll get paid for fixing, well, do you worry the fire brigade go around setting fires on the quiet?
Gary – It’s not entirely unheard of:
http://www.fireengineering.com/index/articles/display/1644404727/articles/fire-engineering/legal-matters/2010/07/murphy-firefighter-arsonists.html
I use this service coupled with an answering service (www.the365team.com). It is a simple and effective way of routing messages to the person on call. If someone needs to come off call they can easily organise with someone else to cover and receive the messages.
Immense 🙂
this all seems vaguely familiar, somewhat like something I posted once to the now defunct uk**t wiki.
It’s an entirely original piece of work, apart from all the bits I stole from other people because they looked good. You may or may not be one of those.
Does anybody have a useful link to an Excel format that I could use for a general management on call rota? I have 35 persons who would be on call for one day at a time. Weekends would need to be shared / balanced. That would be great as a starter, but I then need to factor in WHO DID BANK HOLIDAYS / CHRISTMAS previously so that the joy is handed to those who didn’t, but the starter would be good! Many thanks – look forward to any ideas / links (tried a few but none work for me)
Thanks – I was asking questions about how to set a rota up and, like magic, this appeared 🙂
@bigpinots with http://www.pagerduty.com/ you can automate most of it