Root Cause Analysis Series: Digital project management runs on uncertainty. Use FMEA to rank risks before they become problems
FMEA (Failure Mode and Effects Analysis) is a risk analysis technique that came out of manufacturing engineering. It walks each step of a process, names every way that step could fail, scores the risk along three dimensions, and turns the result into a Risk Priority Number (RPN). The output tells the team where to act first. It fits best when the goal is to prevent failure in advance, rather than diagnose it after the fact.
What FMEA is and where it came from
FMEA was developed by the United States military in the 1940s and put to serious work inside NASA's Apollo program, where a system failure could cost lives. From there it spread into automotive, aerospace, and electronics engineering, and eventually into project management and digital product work. Today the automotive quality standard IATF 16949 expects FMEA, with the AIAG & VDA FMEA Handbook as the common method reference. ISO 9001 does not name FMEA, but many teams use it to meet that standard's risk-based thinking requirement.
In digital product management, FMEA earns its keep on the parts of the work that "break quietly": overlapping projects, thin staffing, broken cross-team communication, and third-party dependencies. It shows the team which points in the process tend to break, which break loudest, and which break in ways the team will not catch until too late.
How FMEA works (with a multi-project workload example)
Step 1: Define the process scope
Example: planning and resource allocation across overlapping projects.
Scope it explicitly so the analysis does not sprawl. State where the process starts and where it ends.
Step 2: Identify the process steps
The main steps in this workflow are taking the project brief, evaluating and prioritizing the new project against the current portfolio, allocating resources (people, time, budget), and tracking progress so the plan can be adjusted.
Step 3: Identify the failure modes for each step
A failure mode is anything that could go wrong at a given step. Examples for this workflow include unclear briefs, the absence of a clear prioritization system, and one person carrying multiple projects without backup. Brainstorm every plausible mode, not only the ones that have actually happened.
Step 4: Identify the effects of failure
The downstream effects include late delivery, lower quality, and a stressed team losing motivation. Effects compound. A late project pushes the next. The next ships with less polish. The team loses trust in the planning system.
Step 5: Score the risk with RPN (Risk Priority Number)
RPN is calculated with the formula RPN = Severity x Occurrence x Detection, where each dimension is scored from 1 to 10.
Severity (impact): 1 means almost no effect on the main work. 10 means severe impact, like a missed delivery or a damaged client relationship.
Occurrence (frequency): 1 means almost never. 10 means it happens almost every time the process runs.
Detection (how hard the failure is to catch before it becomes a problem): this dimension is the one teams misread most often. A high score means the failure is hard to detect, not easy. 1 means the team will definitely catch it before delivery. 10 means it is very hard to spot and usually surfaces only after damage is done.
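The formula and the three 1-to-10 scales can be sketched in a few lines of Python. The function name `rpn` and the range check are illustrative assumptions, not part of any FMEA standard.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return Severity x Occurrence x Detection for one failure mode.

    Each score must be an integer from 1 to 10. For detection, a higher
    score means the failure is HARDER to catch.
    """
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not isinstance(score, int) or not 1 <= score <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {score!r}")
    return severity * occurrence * detection


print(rpn(7, 8, 6))  # 7 * 8 * 6 = 336
```

Rejecting out-of-range scores early keeps a typo such as 0 or 11 from quietly distorting the ranking.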
Example comparison of three failure modes, ranked for action.
Unclear brief: Severity 7, Occurrence 8, Detection 6, RPN = 336. Very high risk, fix first.
One person owning multiple projects: Severity 9, Occurrence 8, Detection 4, RPN = 288. High risk, address right after.
No prioritization system: Severity 6, Occurrence 7, Detection 5, RPN = 210. Moderate risk, plan for the next phase.
Sorting RPN from high to low gives the team a clear sequence of work, instead of trying to solve every problem in parallel and spreading capacity until nothing improves.
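The ranking above can be reproduced with a short Python sketch. The `FailureMode` class is a hypothetical structure for illustration. The scores are the ones from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int
    occurrence: int
    detection: int  # higher = harder to detect

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection


# Deliberately listed out of order to show that sorting does the ranking.
modes = [
    FailureMode("No prioritization system", 6, 7, 5),
    FailureMode("Unclear brief", 7, 8, 6),
    FailureMode("One person owning multiple projects", 9, 8, 4),
]

ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for position, mode in enumerate(ranked, start=1):
    print(f"{position}. {mode.name}: RPN {mode.rpn}")
```

Note that the mode with the highest Severity (9) lands second, because its Detection score is lower. That is the kind of trade-off the product of three scores is meant to surface.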
Step 6: Recommended actions
For the highest RPN items, useful actions include building a shared resource calendar, switching to a tool that flags overlapping workload automatically, running daily standups so plans can adapt, and naming a central project manager with authority to rebalance work. Actions that lower Detection directly (adding checkpoints, alerts, or review gates) often pay back the most, because they shrink the window where a problem goes unnoticed.
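As a rough illustration of why Detection-focused actions pay back, the sketch below rescores the "Unclear brief" row from the example. The rescored Detection value of 3 is an assumption made up for this illustration. It stands in for whatever the team agrees on after adding a review checkpoint.

```python
# Scores for "Unclear brief" from the example comparison.
severity, occurrence, detection = 7, 8, 6
before = severity * occurrence * detection

# Assumption: a brief-review checkpoint makes the failure easier to catch,
# so the team rescores Detection from 6 to 3. Severity and Occurrence stay.
detection_after = 3
after = severity * occurrence * detection_after

reduction = 1 - after / before
print(f"RPN {before} -> {after} ({reduction:.0%} lower)")
```

Because RPN is a product, halving any one score halves the RPN. Detection is often the cheapest of the three to move, since a checkpoint does not require changing how the work itself is done.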
Common pitfalls
Four mistakes show up in almost every first FMEA, and any one of them can invalidate the output.
Reading the Detection score backwards. The single most common error. Teams assume "high Detection" means "easy to catch", when it actually means "hard to catch". The most dangerous failure modes then get scored low and slip down the priority list. Spell the scale out on the worksheet, with 1 labelled "always caught" and 10 labelled "usually only found after damage".
Inconsistent scoring across the team. Without a shared rubric, one person's 7 is another person's 4. The RPN numbers look precise but mean nothing. Agree on a scoring guide with anchor examples for each band (1-3, 4-6, 7-10) before anyone scores a single failure mode.
Treating FMEA as a one-and-done document. FMEA is supposed to be a living document. Every time the process changes, scores should be revisited. When a mitigation lands, rescore the row to confirm the RPN actually dropped.
Skipping the actions column. Some teams produce a scored sheet and stop there. The deliverable is the change. Every high-RPN row needs a named owner, a specific action, and a recheck date.
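Two of these pitfalls, inconsistent scoring and an empty actions column, can be checked mechanically. The following is a minimal sketch under stated assumptions: the row layout, the field names, the sample owner and action, the recheck date, and the `MAX_SPREAD` tolerance are all hypothetical.

```python
from datetime import date

MAX_SPREAD = 2  # assumed tolerance between raters; agree on your own value

# Hypothetical worksheet rows with one Detection score per rater.
rows = [
    {
        "failure_mode": "Unclear brief",
        "detection_scores": [6, 7, 6],
        "owner": "PM lead",
        "action": "Add a brief checklist",
        "recheck": date(2030, 1, 15),
    },
    {
        "failure_mode": "One person owning multiple projects",
        "detection_scores": [4, 8, 3],
        "owner": None,
        "action": None,
        "recheck": None,
    },
]


def lint_row(row, max_spread=MAX_SPREAD):
    """Return a list of problems found in one worksheet row."""
    problems = []
    scores = row["detection_scores"]
    if max(scores) - min(scores) > max_spread:
        problems.append("raters disagree; calibrate before scoring")
    for field in ("owner", "action", "recheck"):
        if not row.get(field):
            problems.append(f"missing {field}")
    return problems


report = {row["failure_mode"]: lint_row(row) for row in rows}
for name, problems in report.items():
    print(name, "->", problems or "ok")
```

A check like this does not replace the calibration conversation. It only points at the rows where that conversation has not happened yet.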
Compared to other tools in the RCA Series
In SUFFIX's RCA toolkit, Problem-Analysis (the 4-axis framework) is the gateway, classifying the problem by clarity of symptom, scope, urgency, and ownership. Fact-Based Thinking, drawn from McKinsey practice, sharpens the problem statement before analysis starts.
5 Whys drills into a single causal chain layer by layer. Fishbone Diagram spreads the brainstorm across categories at once, useful for cross-functional disagreements. Fault Tree Analysis (FTA) uses AND/OR logic to map which factors must combine and which act alone.
Change Analysis fits when there is a clear before-and-after moment. Barrier Analysis maps which defensive layers held and which broke. Parent Cause and Management Oversight zoom out to the organizational layer, where ownership models shape why problems repeat.
All of those tools are reactive: they run after a problem appears. FMEA is the proactive counterpart, run before failure to identify, score, and rank risks. A mature team uses both: FMEA at the start of a project or new process, and the reactive tools during incident reviews. The failure modes that fire most often during incidents are the ones to weight more heavily in the next FMEA cycle.
When NOT to use FMEA
FMEA is the wrong shape in three situations.
When the problem has already happened and the team needs to diagnose it, FMEA is not the tool. Use 5 Whys, Fishbone, Fault Tree, or Barrier Analysis to trace the actual chain. FMEA ranks potential failures, not ones that have already occurred.
When the problem is small and the decision needs to be fast, FMEA's setup cost (mapping steps, listing failure modes, agreeing on scales, scoring as a group) is too heavy. A quick 5 Whys gets to a useful answer in twenty minutes.
When the team has never done FMEA before and there is no facilitator, the first session usually produces inconsistent scores and a flipped Detection scale. Pair the team with someone experienced, or budget for a calibration pass on the first two or three rows.
Use case for digital product teams
For digital product teams, FMEA earns its keep in situations with several interacting risks: multi-project portfolios where capacity is shared, product launches with multiple third-party dependencies, regulated workflows where a missed step has compliance consequences, and migrations where rollback is expensive.
The SUFFIX way to run it is to scope tightly, score with a shared rubric, prioritize by RPN with a clear cutoff for "act now" versus "monitor", and revisit on a fixed cadence. Keep it small. A single FMEA on the top three processes beats a giant sheet that no one updates.
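The "act now" versus "monitor" cutoff can be sketched as a simple split on RPN. The threshold of 250 is an assumption chosen so the result matches the three-row example earlier in this article. Any real cutoff should come from the team's own rubric.

```python
ACT_NOW_CUTOFF = 250  # assumed threshold, not a standard value

# RPN values from the three-row example.
rpn_by_mode = {
    "Unclear brief": 336,
    "One person owning multiple projects": 288,
    "No prioritization system": 210,
}


def triage(rpn_by_mode, cutoff=ACT_NOW_CUTOFF):
    """Split failure modes into 'act now' and 'monitor', highest RPN first."""
    buckets = {"act now": [], "monitor": []}
    ordered = sorted(rpn_by_mode.items(), key=lambda item: item[1], reverse=True)
    for name, value in ordered:
        buckets["act now" if value >= cutoff else "monitor"].append(name)
    return buckets


for bucket, names in triage(rpn_by_mode).items():
    print(f"{bucket}: {names}")
```

Writing the cutoff down as a single named value makes it easy to review on the same cadence as the scores themselves.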
For executives, FMEA is the right tool when the question is "where are we most likely to break first, and which fix gives us the biggest reduction in risk per unit of effort?" The RPN ranking makes that trade-off explicit.
Writer
Director
Jate Saitthiti