TL;DR: Most risk matrices die in slide decks. This guide turns Impact × Likelihood into SLAs, owners, and weekly progress—so your security and platform teams actually deliver fixes.

Why risk matrices fail (and how to fix it)

Risk matrices aren’t the problem. Ambiguity is. If you don’t define scales, map scores to non-negotiable SLAs, and track work like a backlog, the matrix becomes theater.

  • Clear 1–5 impact and likelihood definitions
  • A simple score → band → SLA mapping
  • A risk register that behaves like a product backlog
  • “Quadrant playbooks” so teams know the first move without a meeting

Step 1 — Standardize scoring (1–5 with concrete examples)

Impact (1–5)

  1. Negligible (1): <15-min blip; no sensitive data; <$1k.
  2. Low (2): Short outage (<2h); non-sensitive data; easy workaround.
  3. Moderate (3): 2–8h disruption; limited PHI/PII; <$50k.
  4. High (4): 8–24h outage; 100–10k PHI/PII; likely notifications.
  5. Severe (5): >24h outage; >10k PHI/PII; penalties; reputation hit.

Likelihood (1–5)

  1. Rare (1): Multiple preconditions; segmented; strong auth; not exposed.
  2. Unlikely (2): Local access or niche misconfig required.
  3. Possible (3): Public-facing with mitigations; moderate skill needed.
  4. Likely (4): Actively exploited; weak compensating controls.
  5. Almost Certain (5): Trivial/automatable; no auth; known vulnerable, internet-exposed.

Tip: Put these in your policy and ticket templates so scoring stays consistent across teams.
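One way to keep scoring consistent is to encode the 1–5 scales as lookup tables that both ticket templates and automation share. A minimal sketch (the dictionary and function names are assumptions, not a specific tool's schema):

```python
# Hypothetical sketch: the 1-5 scales from the policy as a single source
# of truth, with a validator to reject out-of-range scores at filing time.
IMPACT = {
    1: "Negligible: <15-min blip; no sensitive data; <$1k",
    2: "Low: short outage (<2h); non-sensitive data; easy workaround",
    3: "Moderate: 2-8h disruption; limited PHI/PII; <$50k",
    4: "High: 8-24h outage; 100-10k PHI/PII; likely notifications",
    5: "Severe: >24h outage; >10k PHI/PII; penalties; reputation hit",
}
LIKELIHOOD = {
    1: "Rare: multiple preconditions; segmented; strong auth; not exposed",
    2: "Unlikely: local access or niche misconfig required",
    3: "Possible: public-facing with mitigations; moderate skill needed",
    4: "Likely: actively exploited; weak compensating controls",
    5: "Almost Certain: trivial/automatable; no auth; internet-exposed",
}

def validate(impact: int, likelihood: int) -> None:
    """Raise if either score falls outside the agreed 1-5 range."""
    if impact not in IMPACT or likelihood not in LIKELIHOOD:
        raise ValueError("impact and likelihood must both be integers 1-5")
```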

Step 2 — Keep scoring simple: Impact × Likelihood (1–25)

  • Critical: 16–25
  • High: 12–15
  • Medium: 6–11
  • Low: 1–5

No weighted gymnastics. If you need sector-specific weighting (e.g., HIPAA), keep the formula simple and adjust impact definitions to reflect confidentiality and regulatory risk.
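The score → band mapping above is small enough to pin down in a few lines. A minimal sketch (function name is an assumption):

```python
# Map Impact x Likelihood (1-25) to the four bands defined above.
def band(impact: int, likelihood: int) -> str:
    score = impact * likelihood
    if score >= 16:
        return "Critical"   # 16-25
    if score >= 12:
        return "High"       # 12-15
    if score >= 6:
        return "Medium"     # 6-11
    return "Low"            # 1-5
```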

Step 3 — Make bands enforceable with SLAs + default actions

| Band | Score | SLA (days) | Default actions |
| --- | --- | --- | --- |
| Critical | 16–25 | 7 | Immediate containment, exec notify, 24×7 monitoring, emergency change, patch/disable. |
| High | 12–15 | 30 | Prioritize mitigation, schedule change this sprint, improve detective controls. |
| Medium | 6–11 | 90 | Plan remediation, schedule window, document residual risk, monthly review. |
| Low | 1–5 | 180 | Accept/backlog with monitoring; revisit on architecture changes. |

Rule of thumb: Score decides the SLA. No negotiating that in ad-hoc meetings.

Step 4 — Manage risks like a backlog (not a spreadsheet graveyard)

Track every risk as a ticket (Jira/Linear/ServiceNow) or a light spreadsheet with these fields:

  • Risk ID, Asset/Service, Owner, Description, Threat Scenario, Vuln/CVE/CWE
  • Impact (1–5), Likelihood (1–5), Risk Score, Band
  • Current Controls / Planned Controls, Decision (Mitigate/Transfer/Accept/Avoid)
  • Due Date (from SLA), Status, Evidence/Link
  • Exception Approved By, Exception Expiry, Review Cadence, Notes

Automation to keep: auto-calculate Score/Band; auto-set Due Date from Band SLA; board swimlanes by Band.
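That automation is small enough to sketch. Below is an illustrative risk-ticket record with auto-calculated score, band, and SLA due date; the field and class names are assumptions, not a specific Jira/ServiceNow schema:

```python
# Illustrative register row: score, band, and due date are derived, never
# hand-entered, so the SLA from Step 3 is applied mechanically.
from dataclasses import dataclass, field
from datetime import date, timedelta

SLA_DAYS = {"Critical": 7, "High": 30, "Medium": 90, "Low": 180}

@dataclass
class RiskTicket:
    risk_id: str
    asset: str
    owner: str
    impact: int       # 1-5
    likelihood: int   # 1-5
    opened: date = field(default_factory=date.today)

    @property
    def score(self) -> int:
        return self.impact * self.likelihood

    @property
    def band(self) -> str:
        s = self.score
        if s >= 16:
            return "Critical"
        if s >= 12:
            return "High"
        return "Medium" if s >= 6 else "Low"

    @property
    def due_date(self) -> date:
        # Auto-set from the band SLA, per Step 3.
        return self.opened + timedelta(days=SLA_DAYS[self.band])
```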

Step 5 — Use quadrant playbooks

What are quadrant playbooks? They’re pre-approved, first-move runbooks for each of the four matrix quadrants, so teams don’t debate what to do first; they execute.

High Impact × High Likelihood (HH) — “Stop the bleeding.”

  • Block/disable unsafe paths
  • Emergency change; WAF rules; rate limits; auth requirement
  • Tighten ingress/egress; isolate blast radius

High Impact × Low Likelihood (HL) — “Resilience.”

  • Backups, immutability, tested DR
  • Network segmentation; least privilege; break-glass runbooks
  • Consider transfer (contract/SLA, insurance)

Low Impact × High Likelihood (LH) — “Automate the sand.”

  • Throttling, quotas, circuit breakers
  • Input validation, spam/abuse filters
  • Logging/metrics to reduce alert fatigue

Low Impact × Low Likelihood (LL) — “Watch and wait.”

  • Accept with monitoring
  • Reassess when architecture or exposure changes

Step 6 — Exceptions that aren’t hand-wavy

If a High/Critical risk misses SLA, require:

  1. Named executive approver
  2. Compensating controls (e.g., monitoring, segmented access)
  3. Expiry date (no “forever” exceptions)
  4. Weekly review until closed

Step 7 — Run the program with simple KPIs/KRIs

  • KPI (Key Performance Indicator): Measures how well your process is performing. Think throughput, speed, SLA compliance.

  • KRI (Key Risk Indicator): Measures how risk is changing. Think exposure, likelihood, early warning signals.

KPIs prove you’re executing. KRIs warn you before something bites.

  • % risks within SLA (by band)
  • Mean/median days open (by band)
  • # of active exceptions and days past expiry
  • Top recurring threat scenarios (justify platform work)
  • Residual risk trend by asset tier
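The first metric, "% risks within SLA (by band)", can be computed directly from register rows. A minimal sketch under the Step 3 SLAs (the row shape and function name are assumptions):

```python
# Hedged sketch: fraction of open risks still inside their band SLA.
# Each row is (band, days_open); thresholds mirror the Step 3 table.
from collections import defaultdict

SLA_DAYS = {"Critical": 7, "High": 30, "Medium": 90, "Low": 180}

def sla_compliance(rows):
    """Return {band: fraction of risks within SLA} for dashboarding."""
    totals = defaultdict(int)
    within = defaultdict(int)
    for band, days_open in rows:
        totals[band] += 1
        if days_open <= SLA_DAYS[band]:
            within[band] += 1
    return {b: within[b] / totals[b] for b in totals}
```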

Real examples (you can reuse the patterns)

1) Public API DoS — Cloud Run Screenshot Endpoint

  • Scenario: Botnet hammers /api causing cost spikes and resource exhaustion.
  • Score: Impact 4 × Likelihood 4 = 16 (Critical)
  • Actions: API GW quotas, token bucket, per-UID/IP rate limits, CDN shielding, autoscaling caps, cost alerts; SLA 7 days.
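The token-bucket limiter in those actions is simple enough to sketch in-process (a gateway-level limit is usually preferable; the capacity and refill numbers here are illustrative assumptions):

```python
# Illustrative token bucket for per-UID/IP rate limiting: each request
# spends a token; tokens refill at a steady rate up to a burst capacity.
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```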

2) Marketing WP Plugin Outdated (internal-only site)

  • Score: 2 × 3 = 6 (Medium)
  • Actions: Schedule patch this quarter, enable dependency alerts; SLA 90 days.

3) Missing DMARC quarantine on primary domain

  • Score: 3 × 4 = 12 (High)
  • Actions: Move to p=quarantine with alignment, plan BIMI; SLA 30 days.
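For reference, a quarantine-mode DMARC record with strict alignment might look like the following (the domain and report address are placeholders, and `pct`/alignment values should match your rollout plan):

```
_dmarc.example.com.  IN  TXT  "v=DMARC1; p=quarantine; pct=100; rua=mailto:dmarc-reports@example.com; adkim=s; aspf=s"
```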

Implementation checklist

  • [ ] Publish your 1–5 impact & likelihood definitions
  • [ ] Adopt Impact × Likelihood → Band → SLA mapping
  • [ ] Create a risk register (or Jira issue type) with auto score/band/due date
  • [ ] Stand up weekly risk review (15–30 min)
  • [ ] Add quadrant playbooks to runbooks
  • [ ] Require exec-approved exceptions with expiry
  • [ ] Track SLA compliance and aging in a single dashboard

FAQ

Should we weight confidentiality more for PHI?

Yes—simplest path: keep the formula the same, but raise the impact definitions when PHI/PII/regulatory exposure is likely. That preserves clarity while reflecting HIPAA/sector requirements.

Where do vulnerability scans, pen tests, and bug bounty fit?

They’re feeders. Convert findings into the risk register with the same scoring so remediation competes fairly for engineering time.

Can we skip Low/LL items?

You can accept them—document the decision and monitor. Don’t hide them; visibility prevents surprise escalations.

