Root Cause Analysis Series: Digital project management runs on uncertainty. Use FMEA to rank risks before they become problems

FMEA (Failure Mode and Effects Analysis) is a risk analysis technique that came out of manufacturing engineering. It walks each step of a process, names every way that step could fail, scores the risk along three dimensions, and turns the result into a Risk Priority Number (RPN). The output tells the team where to act first. It fits best when the goal is to prevent failure in advance, rather than diagnose it after the fact.

What FMEA is and where it came from

FMEA was developed by the United States military in the 1940s and put to serious work inside NASA's Apollo program, where a system failure could cost lives. From there it spread into automotive, aerospace, and electronics engineering, and eventually into project management and digital product work. Today it is a common way to meet the risk-based thinking that ISO 9001 calls for, and the AIAG & VDA FMEA Handbook sets out how the automotive industry runs it.

In digital product management, FMEA earns its keep on the parts of the work that "break quietly": overlapping projects, thin staffing, broken cross-team communication, and third-party dependencies. It shows the team which points in the process tend to break, which break loudest, and which break in ways the team will not catch until too late.

 

How FMEA works (with a multi-project workload example)

Step 1: Define the process scope

Example: planning and resource allocation across overlapping projects.

Scope it explicitly so the analysis does not sprawl. State where the process starts and where it ends.

 

Step 2: Identify the process steps

The main steps in this workflow are taking the project brief, evaluating and prioritizing the new project against the current portfolio, allocating resources (people, time, budget), and tracking progress so the plan can be adjusted.

 

Step 3: Identify the failure modes for each step

A failure mode is anything that could go wrong at a given step. Examples for this workflow include unclear briefs, the absence of a clear prioritization system, and one person carrying multiple projects without backup. Brainstorm every plausible mode, not only the ones that have actually happened.

 

Step 4: Identify the effects of failure

The downstream effects include late delivery, lower quality, and a stressed team losing motivation. Effects compound. A late project pushes the next. The next ships with less polish. The team loses trust in the planning system.

 

Step 5: Score the risk with RPN (Risk Priority Number)

RPN is calculated with the formula RPN = Severity x Occurrence x Detection, where each dimension is scored from 1 to 10.

Severity (impact): 1 means almost no effect on the main work. 10 means severe impact, like a missed delivery or a damaged client relationship.

Occurrence (frequency): 1 means almost never. 10 means it happens almost every time the process runs.

Detection (how hard the failure is to catch before it becomes a problem): this dimension is the one teams misread most often. A high score means the failure is hard to detect, not easy. 1 means the team will definitely catch it before delivery. 10 means it is very hard to spot and usually surfaces only after damage is done.
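The formula is simple enough to sketch in a few lines of Python. The function name `rpn` and the input validation are illustrative assumptions, not part of any FMEA standard. The sketch only shows that three 1-to-10 scores multiply into a number between 1 and 1,000.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return Severity x Occurrence x Detection, each scored from 1 to 10."""
    scores = (
        ("severity", severity),
        ("occurrence", occurrence),
        ("detection", detection),
    )
    for name, score in scores:
        # Reject anything outside the 1-10 scale described above.
        if not isinstance(score, int) or not 1 <= score <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {score!r}")
    return severity * occurrence * detection


print(rpn(1, 1, 1))     # lowest possible RPN: 1
print(rpn(10, 10, 10))  # highest possible RPN: 1000
```

Because the three scores multiply, one high score can pull a row up the list even when the other two are moderate.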

 

Example comparison of three failure modes, ranked for action.

Unclear brief: Severity 7, Occurrence 8, Detection 6, RPN = 336. Very high risk, fix first.

One person owning multiple projects: Severity 9, Occurrence 8, Detection 4, RPN = 288. High risk, address right after.

No prioritization system: Severity 6, Occurrence 7, Detection 5, RPN = 210. Moderate risk, plan for the next phase.

 

Sorting RPN from high to low gives the team a clear sequence of work, instead of trying to solve every problem in parallel and spreading capacity until nothing improves.
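As a rough sketch, the ranking above can be reproduced in Python. The scores are the ones from the example. The dictionary layout and variable names are assumptions made for illustration.

```python
# Scores taken from the three-failure-mode example above.
failure_modes = [
    {"name": "No prioritization system", "severity": 6, "occurrence": 7, "detection": 5},
    {"name": "Unclear brief", "severity": 7, "occurrence": 8, "detection": 6},
    {"name": "One person owning multiple projects", "severity": 9, "occurrence": 8, "detection": 4},
]

# RPN = Severity x Occurrence x Detection for every row.
for mode in failure_modes:
    mode["rpn"] = mode["severity"] * mode["occurrence"] * mode["detection"]

# Highest RPN first gives the order of work.
ranked = sorted(failure_modes, key=lambda m: m["rpn"], reverse=True)

for position, mode in enumerate(ranked, start=1):
    print(f"{position}. {mode['name']}: RPN {mode['rpn']}")
```

The printed order matches the example: unclear brief (336), one person owning multiple projects (288), no prioritization system (210).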

 

Step 6: Recommended actions

For the highest RPN items, useful actions include building a shared resource calendar, switching to a tool that flags overlapping workload automatically, running daily standups so plans can adapt, and naming a central project manager with authority to rebalance work. Actions that lower Detection directly (adding checkpoints, alerts, or review gates) often pay back the most, because they shrink the window where a problem goes unnoticed.

 

Common pitfalls

Four mistakes show up in almost every first FMEA, and any one of them can invalidate the output.

Reading the Detection score backwards. The single most common error. Teams assume "high Detection" means "easy to catch", when it actually means "hard to catch". The most dangerous failure modes then get scored low and slip down the priority list. Spell the scale out on the worksheet, with 1 labelled "always caught" and 10 labelled "usually only found after damage".
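To see how much a flipped scale can matter, the sketch below reruns the three rows from the worked example with Detection read backwards. Modelling the mistake as `11 - d` is an assumption made for illustration, since real scorers will not invert the scale that neatly.

```python
# (name, severity, occurrence, detection) from the worked example.
rows = [
    ("Unclear brief", 7, 8, 6),
    ("One person owning multiple projects", 9, 8, 4),
    ("No prioritization system", 6, 7, 5),
]


def rank(rows, flip_detection=False):
    """Return failure-mode names ordered from highest to lowest RPN."""
    scored = []
    for name, severity, occurrence, detection in rows:
        # Assumption: a scorer reading the scale backwards records 11 - d.
        used = 11 - detection if flip_detection else detection
        scored.append((severity * occurrence * used, name))
    return [name for _, name in sorted(scored, reverse=True)]


print(rank(rows))                       # correct reading of the scale
print(rank(rows, flip_detection=True))  # Detection read backwards
```

With the scale flipped, the RPNs become 280, 504, and 252, so the top two rows trade places and the team starts on the wrong item.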

Inconsistent scoring across the team. Without a shared rubric, one person's 7 is another person's 4. The RPN numbers look precise but mean nothing. Agree on a scoring guide with anchor examples for each band (1-3, 4-6, 7-10) before anyone scores a single failure mode.
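One lightweight way to spot rubric drift is to compare individual scores before averaging them. The sketch below is a hypothetical check. The two-point tolerance and the sample scores are assumptions, not values from any standard.

```python
def needs_calibration(scores, max_spread=2):
    """Flag a dimension when scorers differ by more than max_spread points."""
    scores = list(scores)
    return max(scores) - min(scores) > max_spread


# Hypothetical Severity scores from three people for the same failure mode.
severity_scores = {"scorer_a": 7, "scorer_b": 4, "scorer_c": 6}

if needs_calibration(severity_scores.values()):
    print("Scores differ too much: revisit the anchor examples before scoring on.")
```

A flagged row is a prompt to talk through the anchor examples again, not a reason to average the numbers and move on.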

Treating FMEA as a one-and-done document. FMEA is supposed to be a living document. Every time the process changes, scores should be revisited. When a mitigation lands, rescore the row to confirm the RPN actually dropped.
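A rescore can be as small as this sketch. The starting scores are the "Unclear brief" row from the worked example. The drop in Detection from 6 to 3 is a hypothetical result of adding a brief-review checkpoint, not a measured figure.

```python
def rpn(severity, occurrence, detection):
    return severity * occurrence * detection


before = rpn(7, 8, 6)  # "Unclear brief" as originally scored
# Hypothetical: a review checkpoint makes the failure easier to catch,
# so Detection drops from 6 to 3 while Severity and Occurrence stay put.
after = rpn(7, 8, 3)

print(f"RPN before: {before}, after: {after}, reduction: {before - after}")
```

If the rescored RPN does not move, the mitigation has not changed any of the three scores and is worth a second look.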

Skipping the actions column. Some teams produce a scored sheet and stop there. The deliverable is the change. Every high-RPN row needs a named owner, a specific action, and a recheck date.
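The check below sketches one way to enforce that rule on a worksheet. The `FmeaRow` fields, the owner, the action text, and the date are all hypothetical, and the cutoff of 200 borrows the starting calibration given in the FAQ.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FmeaRow:
    failure_mode: str
    rpn: int
    owner: Optional[str] = None
    action: Optional[str] = None
    recheck_on: Optional[date] = None

    def is_actionable(self) -> bool:
        # A row is complete only when all three fields are filled in.
        return all((self.owner, self.action, self.recheck_on))


def incomplete_rows(rows, act_now_above=200):
    """Return high-RPN rows still missing an owner, action, or recheck date."""
    return [r for r in rows if r.rpn > act_now_above and not r.is_actionable()]


rows = [
    FmeaRow("Unclear brief", 336, owner="Project lead",
            action="Add a brief checklist", recheck_on=date(2025, 1, 15)),
    FmeaRow("One person owning multiple projects", 288),
]

for row in incomplete_rows(rows):
    print(f"Missing owner, action, or recheck date: {row.failure_mode}")
```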

 

Compared to other tools in the RCA Series

In SUFFIX's RCA toolkit, Problem-Analysis (the 4-axis framework) is the gateway, classifying the problem by clarity of symptom, scope, urgency, and ownership. Fact-Based Thinking, drawn from McKinsey practice, sharpens the problem statement before analysis starts.

5 Whys drills into a single causal chain layer by layer. Fishbone Diagram spreads the brainstorm across categories at once, useful for cross-functional disagreements. Fault Tree Analysis (FTA) uses AND/OR logic to map which factors must combine and which act alone.

Change Analysis fits when there is a clear before-and-after moment. Barrier Analysis maps which defensive layers held and which broke. Parent Cause and Management Oversight zoom out to the organizational layer, where ownership models shape why problems repeat.

All of those tools are reactive, run after a problem appears. FMEA is the proactive counterpart, run before failure to identify, score, and rank risks. A mature team uses both: FMEA at the start of a project or new process, and the reactive tools during incident reviews. The failure modes that fire most often during incidents are the ones to weight more heavily in the next FMEA cycle.

 

When NOT to use FMEA

FMEA is the wrong shape in three situations.

When the problem has already happened and the team needs to diagnose it, FMEA is not the tool. Use 5 Whys, Fishbone, Fault Tree, or Barrier Analysis to trace the actual chain. FMEA ranks failures that might happen, not ones that already have.

When the problem is small and the decision needs to be fast, FMEA's setup cost (mapping steps, listing failure modes, agreeing on scales, scoring as a group) is too heavy. A quick 5 Whys gets to a useful answer in twenty minutes.

When the team has never done FMEA before and there is no facilitator, the first session usually produces inconsistent scores and a flipped Detection scale. Pair the team with someone experienced, or budget for a calibration pass on the first two or three rows.

 

Use case for digital product teams

For digital product teams, FMEA earns its keep on situations with several interacting risks: multi-project portfolios where capacity is shared, product launches with multiple third-party dependencies, regulated workflows where a missed step has compliance consequences, and migrations where rollback is expensive.

The SUFFIX way to run it is to scope tightly, score with a shared rubric, prioritize by RPN with a clear cutoff for "act now" versus "monitor", and revisit on a fixed cadence. Keep it small. A single FMEA on the top three processes beats a giant sheet that no one updates.

For executives, FMEA is the right tool when the question is "where are we most likely to break first, and which fix gives us the biggest reduction in risk per unit of effort?" The RPN ranking makes that trade-off explicit.

FAQ

What is FMEA and where did it come from?
FMEA (Failure Mode and Effects Analysis) is a risk analysis technique developed in the 1940s by the US military, then used in NASA's Apollo program and adopted across automotive, aerospace, and electronics engineering. It later moved into project management and digital product work. Today it is a common way to meet the risk-based thinking that ISO 9001 calls for, and the AIAG & VDA FMEA Handbook sets out how the automotive industry runs it. The structure is the same in every domain: identify each step, list how it could fail, score the risk, and rank the failures so the team works the most dangerous ones first.
How is RPN calculated and how should it be read?
RPN is calculated as Severity x Occurrence x Detection, where each dimension is scored from 1 to 10, so the result ranges from 1 to 1,000. The higher the RPN, the more urgent the failure mode. A common starting calibration is that RPN above 200 means act immediately, 100 to 200 means plan to address next phase, below 100 means monitor. When several rows land in the same band, as all three do in the worked example, work them in RPN order rather than in parallel. The exact thresholds should be recalibrated after one or two FMEA cycles. What matters is that scores are comparable across the team, not that they match any external benchmark.
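The bands in this answer can be written as a small lookup, sketched here in Python. The function name and labels are illustrative, and treating exactly 200 and exactly 100 as "plan" follows a literal reading of "above 200" and "below 100".

```python
def triage(rpn: int) -> str:
    """Map an RPN to the starting-calibration bands described above."""
    if rpn > 200:
        return "act now"
    if rpn >= 100:
        return "plan for next phase"
    return "monitor"


for value in (336, 150, 60):
    print(value, "->", triage(value))
```

Once the team recalibrates after a cycle or two, only the two cutoff numbers need to change.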
What does a high Detection score actually mean?
A high Detection score (close to 10) means the failure is hard to detect, not easy. The team will most likely find out only after the failure has caused damage. This is the opposite of what many people assume on first read, and is the most common source of error when teams first run FMEA. Failure modes with high Detection scores deserve extra attention. The fix often involves adding instrumentation, alerts, or review checkpoints that lower the Detection score itself, which then lowers the RPN even if Severity and Occurrence stay the same.
How is FMEA different from the other techniques in this series?
FMEA is a proactive technique, run before any failure occurs to identify and rank risks in advance. 5 Whys, Fishbone Diagram, Fault Tree Analysis, Change Analysis, and Barrier Analysis are mostly reactive, used after a problem appears to find its cause. The two modes complement each other. A mature team uses both, with FMEA at the start of a project or quarter and the reactive techniques during incident reviews.

Writer
Director

Jate Saitthiti