A System Rescue engagement starts with a client in pain and a strong, understandable wish: make it stop. The instinct — theirs and, if you are not disciplined, yours — is to start fixing on day one. Find the broken thing, fix the broken thing, feel productive.
We don't. The first two to four weeks of a rescue are a forensic audit, and the discipline of the playbook lies in not writing a single line of corrective code until the audit is done. This note is the playbook.
Why fixing first is the mistake
A failed implementation is rarely failing for the reason that is currently visible. The symptom that triggered the call — reports nobody trusts, a process that stalls, a month-end that takes two weeks — is downstream of causes that are not yet visible. Fix the symptom and one of two things happens: it comes back, because the cause is untouched; or it moves, because the symptom was load-bearing and something else now carries the stress.
Worse, fixing first commits you. The first change you make is the moment you take ownership of the system's behaviour. Make that change before you understand the system and you have signed up to own a machine you have not read the manual for.
The audit is the manual. You read it first.
What the audit examines
The forensic audit is structured around four investigations. They run in parallel; they inform each other.
1. Data integrity
The first question is whether the data is trustworthy. Not "is the data nice" but "is it structurally sound". Orphaned records. Duplicated masters: three customer records for one customer, two products for one SKU. Quantities that don't reconcile to movements. A general ledger that doesn't tie to its subledgers. Manually-edited values where the system should be the source. Data-integrity damage is the most expensive class of problem to find late, because every fix you make on top of bad data inherits the badness.
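As an illustration only, three of the checks above (orphaned records, duplicated masters, quantities that don't reconcile to movements) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the record shapes, the field names (`customer_id`, `sku`, `qty`) and the sample rows are hypothetical, and a real audit would run equivalent queries against the client's actual schema.

```python
from collections import defaultdict

def find_orphans(children, parents, fk="customer_id", pk="id"):
    """Return child rows whose foreign key matches no existing parent row."""
    parent_ids = {p[pk] for p in parents}
    return [c for c in children if c[fk] not in parent_ids]

def find_duplicate_masters(records, key="name"):
    """Group master records whose key collides after normalising case and spacing."""
    groups = defaultdict(list)
    for r in records:
        normalised = " ".join(r[key].lower().split())
        groups[normalised].append(r["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

def find_unreconciled_stock(on_hand, movements):
    """Return SKUs whose recorded quantity differs from the sum of their movements."""
    totals = defaultdict(int)
    for m in movements:
        totals[m["sku"]] += m["qty"]
    return {sku: (qty, totals[sku]) for sku, qty in on_hand.items() if qty != totals[sku]}

# Hypothetical sample data, invented for this sketch.
customers = [
    {"id": 1, "name": "Acme Ltd"},
    {"id": 2, "name": "acme  ltd"},
    {"id": 3, "name": "Borealis"},
]
orders = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 7}]
on_hand = {"A1": 5, "B2": 3}
movements = [
    {"sku": "A1", "qty": 10},
    {"sku": "A1", "qty": -5},
    {"sku": "B2", "qty": 4},
]

print(find_orphans(orders, customers))          # order 11 points at a missing customer
print(find_duplicate_masters(customers))        # records 1 and 2 collide on name
print(find_unreconciled_stock(on_hand, movements))  # B2: recorded 3, movements sum to 4
```

Exact name matching after normalisation is a deliberately crude duplicate test. It will miss near-duplicates such as abbreviations, so treat its output as a floor on the number of duplicates, not a count.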
2. Customisation debt
The audit reads the customisations — every module, every override, every automated action, every server action and scheduled job. The questions: what does each one do, does it still serve a purpose, does it conflict with another, does it have tests, and is it the reason the standard upgrade path is now blocked. Failed implementations almost always carry a layer of customisation that was added under deadline pressure, never documented, and is now both load-bearing and unexplained.
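One way to keep those questions answerable is to record each customisation in a small inventory and derive the flags from it. The sketch below is an assumption about how such an inventory might look, not a description of any particular platform: the `Customisation` fields, the target names and the sample entries are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Customisation:
    name: str
    overrides: tuple      # standard behaviours this item changes (hypothetical names)
    documented: bool
    has_tests: bool
    blocks_upgrade: bool

def find_conflicts(items):
    """Return targets that more than one customisation overrides."""
    by_target = defaultdict(list)
    for c in items:
        for target in c.overrides:
            by_target[target].append(c.name)
    return {t: names for t, names in by_target.items() if len(names) > 1}

def debt_flags(c):
    """List the debt markers a single customisation carries."""
    flags = []
    if not c.documented:
        flags.append("undocumented")
    if not c.has_tests:
        flags.append("untested")
    if c.blocks_upgrade:
        flags.append("blocks upgrade")
    return flags

# Hypothetical inventory entries, invented for this sketch.
inventory = [
    Customisation("auto_discount", ("order.confirm",), False, False, False),
    Customisation("credit_hold", ("order.confirm", "invoice.post"), True, False, True),
    Customisation("nightly_sync", ("stock.update",), True, True, False),
]

print(find_conflicts(inventory))
for item in inventory:
    print(item.name, debt_flags(item))
```

A shared target is only a prompt to read both pieces of code side by side. Whether the two overrides actually conflict still has to be judged by a person.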
3. Process drift
The system was configured for a process. The business now runs a different process. The gap between the two — process drift — is where the day-to-day pain lives. The audit maps what the system assumes against what the team actually does, and catalogues every place a human is working around the system rather than in it. Each workaround is a piece of evidence about a configuration decision that no longer fits.
4. Governance gaps
The last investigation is the one that explains the other three. Who owns the system? Who approves a change? Is there a test environment, and is it used? Does anyone validate that a report is correct before leadership reads it? Most failed implementations are not failed software. They are software that was implemented and then never operated — handed over with no owner, no change discipline, no validation loop. Governance gaps are the root cause behind most of what the other three investigations find.
What the audit produces
The audit ends with two deliverables.
A triage map. Every finding, sorted into three states: critical (actively causing damage — data corruption in progress, a financial misstatement, a compliance exposure), at-risk (not failing yet, but will — an unsupported customisation, a process drift widening each month), and stable (works, leave it). Triage is what turns an overwhelming list of problems into a sequence of decisions.
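The three states can be expressed as a simple classification over findings. The following is a sketch under stated assumptions: the `Finding` fields and the two yes/no questions used to classify are a simplification invented here, and real triage involves judgement that a pair of booleans cannot capture.

```python
from dataclasses import dataclass
from enum import IntEnum

class Triage(IntEnum):
    CRITICAL = 0   # actively causing damage
    AT_RISK = 1    # not failing yet, but will
    STABLE = 2     # works, leave it

@dataclass(frozen=True)
class Finding:
    title: str
    investigation: str        # which of the four investigations raised it
    causing_damage_now: bool
    will_fail_if_left: bool

def classify(finding):
    """Map a finding to one of the three triage states; active damage wins."""
    if finding.causing_damage_now:
        return Triage.CRITICAL
    if finding.will_fail_if_left:
        return Triage.AT_RISK
    return Triage.STABLE

def triage_map(findings):
    """Group finding titles by state, with the states in priority order."""
    result = {state: [] for state in Triage}
    for f in findings:
        result[classify(f)].append(f.title)
    return result

# Hypothetical findings, invented for this sketch.
findings = [
    Finding("Duplicate customer masters still being created", "data integrity", True, True),
    Finding("Unsupported pricing override", "customisation debt", False, True),
    Finding("Standard purchase approval flow", "process drift", False, False),
]

for state, titles in triage_map(findings).items():
    print(state.name, titles)
```

Checking for active damage first means a finding that is both damaging now and likely to worsen lands in critical, which matches the order in which the rescue has to act.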
A recommendation, with the honest version of the options. Stabilise and keep. Stabilise and migrate. Stabilise and rebuild. The audit's job is to make that recommendation on evidence — and to be honest when the evidence says the thing the client hoped to hear is not the thing the system needs.
The discipline: stabilise before you decide
One rule sits above the playbook: stabilise first, decide second. Even when the audit's recommendation is "this should be rebuilt," the rebuild is not the first move. The first move is stabilisation — stopping the active damage, closing the critical findings, getting the system to a state where it is not actively getting worse. A business cannot make a good decision about its operating system while that system is on fire. Put the fire out, then decide.
The forensic audit feels slow to a client in pain. It is the opposite. The slow rescue is the one that started fixing on day one, owned the wrong machine, and is now three months in with the original problem still unaddressed. The audit is how a rescue stays fast.