Production Incident Response for an Odoo System

When production is down, the worst response is an unstructured scramble. This playbook gives a fixed order to work through, so the response stays calm even when the situation is not.

A production system will eventually have a bad day: it goes down, slows to the point of being unusable, or loses a core function. What decides the quality of the response is whether there is a procedure to follow or only panic. The five steps below are that procedure.

Step 1: Confirm and frame the incident

Before touching anything, establish what is actually wrong. Is the system down, or slow, or is one function broken while the rest works? Who is affected, and how badly? When did it start? A precise frame, such as "month-end posting has failed since the 14:00 cron, the rest of the system is fine", points at a cause. "Odoo is broken" points nowhere.
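One way to force a precise frame is to write it down as a small record before doing anything else. The sketch below is illustrative only: the class name, the field names, and the example date are assumptions, not part of any Odoo API.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Symptom(Enum):
    DOWN = "down"
    SLOW = "slow"
    PARTIAL = "one function broken"


@dataclass(frozen=True)
class IncidentFrame:
    """Hypothetical record answering the Step 1 questions."""
    symptom: Symptom       # down, slow, or one function broken
    scope: str             # what exactly is failing
    started_at: datetime   # when it started
    affected: str          # who is affected, and how badly
    still_working: str     # what is confirmed to be fine

    def summary(self) -> str:
        return (
            f"[{self.symptom.value}] {self.scope} since "
            f"{self.started_at:%Y-%m-%d %H:%M}; affected: {self.affected}; "
            f"unaffected: {self.still_working}"
        )


# Example values are invented for illustration.
frame = IncidentFrame(
    symptom=Symptom.PARTIAL,
    scope="month-end posting fails",
    started_at=datetime(2024, 1, 31, 14, 0),
    affected="finance team, blocked on closing",
    still_working="rest of the system",
)
print(frame.summary())
```

If any field cannot be filled in, that is the next thing to find out.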

Step 2: Decide whether to stabilize first

If the incident is causing ongoing damage (writing bad data, sending wrong documents, corrupting records), the first move is to stop the damage, even crudely: disable the offending job, take the affected function offline, or restrict access. A contained incident is no longer an emergency, and it can be diagnosed without the clock running. Damage control comes before diagnosis.
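As a sketch of "disable the offending job": scheduled actions in Odoo are records of the ir.cron model, and setting their active field to False stops them from running. The example assumes the XML-RPC external API is available on your version and that you already know the record ID. The URL, database name, credentials, and ID are placeholders, and the helper names are hypothetical.

```python
import xmlrpc.client


def connect(url, db, user, password):
    """Authenticate against Odoo's XML-RPC endpoints; returns (models proxy, uid)."""
    common = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/common")
    uid = common.authenticate(db, user, password, {})
    models = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/object")
    return models, uid


def disable_cron(models, db, uid, password, cron_id):
    """Deactivate one scheduled action and return its 'active' flag afterwards."""
    models.execute_kw(
        db, uid, password, "ir.cron", "write",
        [[cron_id], {"active": False}],
    )
    # Read the record back so the containment is confirmed, not assumed.
    rows = models.execute_kw(
        db, uid, password, "ir.cron", "read",
        [[cron_id]], {"fields": ["active"]},
    )
    return rows[0]["active"]
```

A call would look like models, uid = connect(url, db, user, password) followed by disable_cron(models, db, uid, password, cron_id). Record which job was disabled and when, so it is re-enabled during recovery rather than forgotten.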

Step 3: Work the triage order

Diagnose from the most recent and most likely cause outward:

  • What changed. Most incidents follow a change: a deploy, a configuration edit, a data import, an infrastructure change. Establish what changed most recently, and how close that was to the incident's start time. That is your first suspect, by a wide margin.
  • The logs. The server log around the start time usually names the failure directly. Read from the moment the incident began.
  • The resources. If nothing changed, check whether the system ran out of something: disk, memory, database connections.
  • The dependencies. If the system itself is healthy, check what it depends on: the database, an external integration, the network.
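The log and resource checks above can be sketched with the standard library alone. The log filter assumes the default Odoo log line layout (timestamp, process ID, level, database, logger name, then the message). If your deployment uses a custom log format, the parsing needs adjusting. Both function names are hypothetical.

```python
import shutil
from datetime import datetime

LEVELS = ("ERROR", "CRITICAL")


def errors_since(lines, start):
    """Yield ERROR/CRITICAL entries logged at or after `start`.

    Lines without a leading timestamp (typically traceback lines) are
    treated as continuations of the entry before them.
    """
    keep = False
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            if keep:
                yield line.rstrip("\n")
            continue
        fields = line.split(maxsplit=4)
        # Assumed layout: date, time, pid, level, rest of the line.
        level = fields[3] if len(fields) > 3 else ""
        keep = stamp >= start and level in LEVELS
        if keep:
            yield line.rstrip("\n")


def disk_free_fraction(path="/"):
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

Pass errors_since an open log file and the start time from the incident frame. Run disk_free_fraction against the filestore and database volumes as well as the root filesystem. Memory and database connection checks depend on the host and the PostgreSQL setup and are not covered here.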

Step 4: Choose fix-forward or roll back

Once the cause is known, there is one decision to make: fix forward, or roll back. Roll back when the incident followed a change and that change can be cleanly undone. Rolling back is then the fastest, safest route to a working system, and it is why the rollback plan is written at deploy time. Fix forward when there is nothing to roll back to, when the cause is data or environment rather than code, or when the rollback itself is risky. When in doubt, restore service first and perfect the fix afterward.
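The decision rule can be written out as a few lines of logic. This is only a restatement of the paragraph above. The class and field names are hypothetical, and real incidents will need judgment that a function cannot supply.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    followed_change: bool      # did the incident follow a deploy or edit?
    cleanly_reversible: bool   # can that change be undone cleanly?
    kind: str                  # "code", "data", or "environment"
    rollback_risky: bool = False


def choose(cause: Cause) -> str:
    """Return 'roll back' or 'fix forward' for a diagnosed cause."""
    if cause.kind in ("data", "environment"):
        return "fix forward"   # rolling back code will not repair these
    if not cause.followed_change or not cause.cleanly_reversible:
        return "fix forward"   # nothing to roll back to
    if cause.rollback_risky:
        return "fix forward"   # the rollback itself is the hazard
    return "roll back"


print(choose(Cause(followed_change=True, cleanly_reversible=True, kind="code")))
```

Every branch except the last leads to fix forward, which reflects how narrow the conditions for a clean rollback are.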

Step 5: Recover fully, then capture the post-mortem

After service is restored, confirm full recovery: run the smoke test, and check whether the incident left a backlog, queued jobs to drain, or records to repair. Then, while it is fresh, write the post-mortem: what happened, what the root cause was, what the timeline was, and what would prevent a repeat. The post-mortem is not blame. It is the only thing that converts an incident into something the system learns from.

The note for the file

Incident response is not about reacting faster. It is about reacting in order. A fixed sequence (confirm, contain, triage, decide, recover) is what keeps a stressful situation from becoming a chaotic one, and it is what makes the response repeatable instead of dependent on who happens to be online.