Running an oncall rota

Being part of an oncall rota is pretty much a certainty if you work in the systems or operations team at any IT-orientated company. Stuff needs to be working 24 hours a day, 7 days a week, so you need someone available to apply duct tape and staples when things go wrong outside normal working hours. It continually amazes me, therefore, that so many companies get the management and organisation of their rota badly wrong, sometimes to the point of it being a significant factor in staff moving on.

As someone who’s participated in oncall rotas at all levels, here’s how I think one should be organised:

  • Sysadmins should only participate in the rota after they have completed their probationary period. Think of the goal of the probation period as being getting someone up to speed so they can participate.
  • Each oncall shift should last one week, and rotate at 2.00pm on Tuesday. If you run a late shift, e.g. 11.00am to 8.00pm, put the person on that shift on call as well. Get all the unsociable hours out of the way in one lump.
  • The bare minimum gap between oncall shifts should be 5 weeks. If your rota is shorter than this, then you don’t have enough qualified staff to cope with holidays, illness, paternity/maternity leave etc anyway.
  • The oncall rota should be mapped out at least 3 months into the future. Staff should be free to swap oncall weeks, provided that no-one is ever oncall two weeks in a row and that all swaps are cleared with their line manager.
  • Issue the oncaller with a good quality mobile phone and a 3G-capable laptop. The phone should not be a smartphone; the aim should be maximum battery life and talk time. The laptop should be more like a netbook than a desktop replacement, should come with a spare battery, and should dual boot to Windows and a useful Linux desktop distro. These days I’d shell out for a nice big SSD to stick in it too. Make sure there are no restrictions on international calling and no data caps for the phone and 3G card.
  • Don’t give out the direct number of the oncall phone. Instead, use an answering service as a filter. During office hours, the answering service should redirect enquiries to the regular helpdesk/support number. Out of hours, the answering service should accept calls, take details and then pass these on to the oncall number. Additionally, this service should text and email call details so there’s a record.
  • Pay a fixed daily amount for being oncall. Pay 1.5 times this amount for being oncall on weekend days. Pay 2 times this amount for being oncall on a public holiday. Where someone is oncall on a public holiday, add one day to their holiday allowance.
  • Pay a per-incident fee when oncall is used. Each oncall use should be tracked in your ticketing system. Make using oncall a business cost, thus giving the business a reason to make sure oncall is not used trivially, and a reason to make sure problems are fixed permanently and not just temporarily alleviated.
  • When the oncall person has dealt with out-of-hours issues, don’t expect them in at the regular time the next day. Expect them to use their judgement to make sure they are suitably rested. You do not want an overly tired oncaller dealing with problems on production systems.
  • The person oncall should never be taking on “out of hours” work. Want a database dumped and reloaded? Want a disk unmounted and fscked overnight when a server’s not busy? All those things can be done, but not by the oncall person. That person is there to respond to problems, not to perform routine or planned maintenance.
  • Make it very clear that abuse of oncall is unacceptable. Oncall is there to fix customer-affecting or service-affecting problems, not to help someone with Excel.
  • Have a clear demarcation between production and testing/development/QA systems. The latter group are not oncall’s responsibility to fix.
  • If you have offices around the world, have a “follow the sun” oncall system: get your offices to cover each other.
  • Have realistic expectations of oncall response times. If you need to guarantee that problems are attended to within 20-30 minutes then you should be running an overnight shift, not oncall.
  • Expect a daily report summarising the previous 24 hour oncall period, even if that report is “Nothing to report”. The weekend period could be lumped together on the Monday.
  • Have a weekly oncall handover meeting between the outgoing and incoming oncall staff.
  • During the day, have a junior or trainee sysadmin be oncall. It’s good practice for them.
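
The rota and pay rules above are mechanical enough to sketch in code. What follows is a rough Python illustration, not a finished tool. The names (plan_rota, rota_breaches, oncall_pay) and the two money figures are invented for the example. It also makes three assumptions of its own: the "gap" between shifts is counted as the difference in week numbers, the planned rota is a simple round-robin, and a public holiday that falls on a weekend is paid at the higher rate.

```python
from datetime import date, datetime, time, timedelta

# Hypothetical figures for illustration only; pick your own.
DAILY_RATE = 40.0     # fixed payment per oncall day
INCIDENT_FEE = 25.0   # payment per oncall incident handled


def plan_rota(staff: list[str], first_tuesday: date, weeks: int = 13):
    """Round-robin rota: one (handover time, person) pair per week.

    Handover is 2.00pm on Tuesday; 13 weeks is roughly 3 months.
    """
    if first_tuesday.weekday() != 1:  # Monday is 0, Tuesday is 1
        raise ValueError("shifts must start on a Tuesday")
    handover = datetime.combine(first_tuesday, time(14, 0))
    return [(handover + timedelta(weeks=i), staff[i % len(staff)])
            for i in range(weeks)]


def rota_breaches(names: list[str], min_gap_weeks: int):
    """Return (person, earlier_week, later_week) for every pair of shifts
    held by the same person that are closer than min_gap_weeks apart.

    Use min_gap_weeks=5 for the planned rota and min_gap_weeks=2 to check
    that a swap leaves nobody oncall two weeks in a row.
    """
    last_seen: dict[str, int] = {}
    breaches = []
    for week, person in enumerate(names):
        if person in last_seen and week - last_seen[person] < min_gap_weeks:
            breaches.append((person, last_seen[person], week))
        last_seen[person] = week
    return breaches


def day_multiplier(day: date, public_holidays: set[date]) -> float:
    """2x on public holidays, 1.5x on weekend days, 1x otherwise."""
    if day in public_holidays:
        return 2.0
    if day.weekday() >= 5:  # Saturday or Sunday
        return 1.5
    return 1.0


def oncall_pay(days: list[date], public_holidays: set[date],
               incidents: int = 0) -> float:
    """Daily payments plus the per-incident fee."""
    daily = sum(DAILY_RATE * day_multiplier(d, public_holidays) for d in days)
    return daily + INCIDENT_FEE * incidents


def extra_holiday_days(days: list[date], public_holidays: set[date]) -> int:
    """One day added to the holiday allowance per public holiday worked."""
    return sum(1 for d in days if d in public_holidays)
```

With five staff, rota_breaches([person for _, person in plan_rota(staff, date(2024, 1, 2))], 5) returns an empty list, since each person comes round exactly every fifth week. With four staff the same check reports breaches, which is the point of the five week minimum.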

Doing all of the above shows that you take oncall seriously, and that you appreciate the impact being oncall has on someone’s life. Your oncall staff are the people who salvage the business’s reputation when the midden hits the windmill at some unearthly hour of the night. Keeping them happy and making them feel valued and respected can only be to the business’s benefit.