Deliverability Incident Response: A Playbook for the First 4 Hours

When email reputation collapses, the first 4 hours determine the recovery timeline. Triage, containment, communication: the operational playbook teams actually need.

TL;DR

When email reputation collapses suddenly (a campaign-driven complaint spike, a blocklist listing, an authentication failure, an ESP-side incident), the response in the first 4 hours determines whether recovery takes days or weeks. The four-hour playbook: stop sending immediately for non-essential streams (hour 0-1), diagnose the root cause through reputation dashboards and authentication logs (hour 1-2), implement targeted containment based on what you found (hour 2-3), and communicate to internal stakeholders with a concrete impact assessment (hour 3-4). The most common failure mode is not technical: it is organizational. Teams without a documented playbook waste the critical window debating whose problem it is. Teams with a playbook follow steps, even when the steps are uncomfortable. Print this. Pre-assign roles. Run a tabletop exercise before you need it.

What Counts as a Deliverability Incident

Not every degradation is an incident. The distinction matters because incident response is operationally expensive: it pulls people from other work, halts revenue-generating sends, and creates organizational urgency that should not be triggered casually.

A deliverability incident is a sudden, significant change in delivery posture that:

Materially affects the business (revenue, customer experience, security)
Cannot be ignored without compounding damage
Requires intervention beyond normal operations to address

Examples:

Yes, this is an incident:

Domain reputation drops from High to Bad in Postmaster Tools over 24-48 hours
Spam complaint rate spikes from 0.05% to 0.4% in a single day
Blocklist listing on Spamhaus or another high-impact list
DMARC enforcement causing legitimate mail to be rejected
ESP-side incident affecting authentication or delivery
Sudden bounce rate increase from a normally-stable source

No, this is not an incident:

Gradual decline in open rates over weeks
Single-digit fluctuations in complaint rate
One Postmaster Tools chart showing a transient dip
A campaign that performed below expectations

The line is sharp because incident response procedures are designed for the first category. Applying them to the second category is overkill and trains the organization to ignore the procedures when they actually matter.

For this playbook, "incident" means the first category. If you are not sure which you have, the early diagnostic steps will tell you within an hour.
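The distinction above can be turned into a rough first-pass check. The sketch below is illustrative only: the `Snapshot` fields, the 4x spike multiplier, and the function name are assumptions rather than anything a mailbox provider defines, and the 0.3% floor is borrowed from the complaint threshold referenced elsewhere in this series.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; tune them to your own baseline.
COMPLAINT_RATE_FLOOR = 0.003   # 0.3% complaint rate
SPIKE_MULTIPLIER = 4.0         # assumed: "sudden" means at least 4x the baseline


@dataclass
class Snapshot:
    complaint_rate: float             # fraction, e.g. 0.0005 for 0.05%
    on_major_blocklist: bool
    dmarc_rejecting_legit_mail: bool


def looks_like_incident(baseline: Snapshot, current: Snapshot) -> bool:
    """Return True when the current state resembles the 'yes' examples above."""
    # A high-impact listing or rejected legitimate mail is an incident on its own.
    if current.on_major_blocklist or current.dmarc_rejecting_legit_mail:
        return True
    # Otherwise require both a sharp jump and an absolute rate worth acting on.
    spiked = current.complaint_rate >= baseline.complaint_rate * SPIKE_MULTIPLIER
    return spiked and current.complaint_rate >= COMPLAINT_RATE_FLOOR
```

With these assumed thresholds, a jump from 0.05% to 0.4% qualifies, while a drift from 0.05% to 0.07% does not. A check like this only decides whether to open the playbook; it does not replace the diagnostic steps.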

Hour 0-1: Stop the Bleeding

The first hour is about containment. The diagnostic and remediation steps come after. The single highest-leverage action in the first hour is reducing the surface area of the problem.

Stop Non-Essential Sends

The instinct is often to keep sending while you investigate. This is wrong. Every additional send during an incident does at least one of the following:

Adds to the volume that is being filtered (worsening reputation)
Adds to the population that may complain (increasing complaint rate)
Generates new bounces or rejections (adding noise to diagnostics)

The right action: pause all non-transactional sending immediately. Marketing campaigns, drip sequences, scheduled newsletters: pause them. The lost sends can be re-queued after the incident is resolved. The reputation damage from sending into the storm cannot be reversed.

What to keep running:

Genuinely transactional mail (password resets, receipts, security notifications)
High-priority customer-triggered messages (support replies)
Mail required for legal or compliance obligations

What to pause:

All marketing and promotional sends
Drip sequences and automation
Re-engagement campaigns (these are highest-risk in any incident)
Notification mail that can be deferred (digests, summaries)

This decision needs to be made within 30 minutes of incident detection. Pre-authorization for the deliverability team to make this call without seeking marketing approval is the most important organizational structure to have in place before an incident.
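The keep/pause split is easier to execute in 30 minutes if it is written down as data rather than decided under pressure. A minimal sketch, assuming your sending platform lets you tag each stream with a category; the `Stream` names below are hypothetical labels for the categories listed above.

```python
from enum import Enum


class Stream(Enum):
    # Hypothetical stream labels mirroring the lists above.
    TRANSACTIONAL = "transactional"
    SUPPORT_REPLY = "support_reply"
    COMPLIANCE = "compliance"
    MARKETING = "marketing"
    DRIP = "drip"
    REENGAGEMENT = "reengagement"
    DIGEST = "digest"


# Streams that keep running during an incident; everything else is paused.
KEEP_RUNNING = {Stream.TRANSACTIONAL, Stream.SUPPORT_REPLY, Stream.COMPLIANCE}


def streams_to_pause(active_streams):
    """Return the active streams that should be paused, in a stable order."""
    to_pause = (s for s in active_streams if s not in KEEP_RUNNING)
    return sorted(to_pause, key=lambda s: s.value)
```

Defining the allow-list (what keeps running) rather than the deny-list means any new, unclassified stream is paused by default.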

Open the Diagnostic Channels

Before diving into root cause, ensure the data sources are open:

Postmaster Tools for Gmail reputation and complaint rate
SNDS for Microsoft IP reputation
ESP dashboards for delivery rates, bounce rates, deferral rates
DMARC monitoring tool (the aggregate report stack)
Blocklist lookup tools (MXToolbox, Hetrixtools)
Internal logging for sending pipeline status

Have a single person responsible for collecting data from each source. Time spent debating who should look at what is time the incident is compounding.
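Blocklist lookup tools work by issuing DNS queries, and the same check can be scripted for your own sending IPs. The sketch below follows the usual DNSBL convention (reverse the IPv4 octets, append the list's zone, and treat any returned address as "listed"). It is a simplification: it handles IPv4 only, it cannot tell a DNS failure from "not listed", and some lists return special codes when they refuse a query, so verify any result against the list's own lookup page.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query, e.g. 192.0.2.1 -> 1.2.0.192.<zone>."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL zone answers for this IP (requires network)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No answer: either not listed or the lookup itself failed.
        return False
```

For example, `is_listed("192.0.2.1", "zen.spamhaus.org")` would query `1.2.0.192.zen.spamhaus.org`. Note that some blocklists restrict queries coming through large public resolvers, so run this from infrastructure with its own resolver where possible.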

Establish a Communication Channel

Open a dedicated Slack channel, Teams thread, or chat for the incident. Pin it. Add the relevant people. The communication channel becomes the single source of truth for the duration of the incident. Decisions made elsewhere (DMs, side conversations, hallway discussions) do not exist for incident-response purposes.

Pre-defined attendee list:

Deliverability/email operations lead
Engineering or platform lead
Marketing operations representative
Communications / customer success representative
Executive sponsor (informed, not directing)

Hour 1-2: Diagnose the Root Cause

By hour 1, sending is contained, data sources are open, and people are coordinated. Now identify what actually went wrong.

The diagnostic question is not "what happened?" It is "what is the most likely cause given what we observe, and what would confirm or rule out that hypothesis?"

The Six Most Common Root Causes

In my experience auditing deliverability incidents, six causes account for ~85% of cases.

1. Marketing campaign produced unexpected complaint spike.

Diagnostic: Postmaster Tools shows complaint rate spike correlated with a specific recent campaign. Check campaign send timing against the spike onset.

Confirmation: Review the campaign's audience segment, content, and historical performance. If audience was atypical (re-engagement, dormant segment, recent acquisition), this is likely the cause.

2. List quality degradation.

Diagnostic: Bounce rate increase, complaint rate increase, possible spam-trap hit. May trace to a recent list import or acquisition campaign.

Confirmation: Review recent list additions in the last 30 days. Check for source data quality issues.

3. Authentication failure or misconfiguration.

Diagnostic: DMARC aggregate reports show sudden drop in pass rates. SPF, DKIM, or alignment failures appearing where they previously passed.

Confirmation: Test authentication on a sample message via Gmail "Show original." Compare against historical baseline.

4. Blocklist listing.

Diagnostic: SMTP rejection messages reference specific blocklists. Multi-list lookup tools confirm listing.

Confirmation: Check the listing reason at the source blocklist (Spamhaus, Spamcop, etc.). Identify which IPs are listed.

5. ESP-side incident.

Diagnostic: Multiple senders on the same ESP report similar issues. ESP status page shows incident. Shared IP pool reputation has shifted.

Confirmation: Check ESP status page. Reach out to ESP support. Compare with peers if industry channels exist.

6. Compromise or unauthorized sending.

Diagnostic: Volume spike from unexpected source. DMARC reports showing mail from unfamiliar IPs claiming your domain. Customer reports of unusual messages.

Confirmation: Review sending source IPs in DMARC aggregate reports. Audit recent access to ESP accounts and sending infrastructure.

The diagnostic time should be 30-60 minutes. If after an hour the cause is not at least narrowed to one of the six, the incident is unusual and may require external help (specialized deliverability consultant, ESP escalation).
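Causes 3 and 6 both lean on DMARC aggregate reports, and during an incident it helps to reduce a report to two numbers: the overall pass rate and the source IPs that are failing. The sketch below assumes the standard aggregate report layout (`record` / `row` / `policy_evaluated`) with no XML namespace; some reporters add a namespace, in which case the element lookups need adjusting. The sample report and IPs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample data in the aggregate report shape.
SAMPLE_REPORT = """<feedback>
  <record>
    <row>
      <source_ip>192.0.2.10</source_ip>
      <count>90</count>
      <policy_evaluated><disposition>none</disposition><dkim>pass</dkim><spf>pass</spf></policy_evaluated>
    </row>
  </record>
  <record>
    <row>
      <source_ip>198.51.100.7</source_ip>
      <count>10</count>
      <policy_evaluated><disposition>reject</disposition><dkim>fail</dkim><spf>fail</spf></policy_evaluated>
    </row>
  </record>
</feedback>"""


def summarize(report_xml: str):
    """Return (DMARC pass rate, Counter of failing message counts per source IP)."""
    root = ET.fromstring(report_xml)
    total = 0
    passed = 0
    failing_sources = Counter()
    for row in root.iter("row"):
        count = int(row.findtext("count", default="0"))
        dkim = row.findtext("policy_evaluated/dkim")
        spf = row.findtext("policy_evaluated/spf")
        total += count
        # DMARC passes when either the aligned DKIM or aligned SPF result passes.
        if dkim == "pass" or spf == "pass":
            passed += count
        else:
            failing_sources[row.findtext("source_ip")] += count
    pass_rate = passed / total if total else 0.0
    return pass_rate, failing_sources
```

A sudden drop in pass rate on IPs you recognize points toward misconfiguration; failures concentrated on IPs you do not recognize point toward unauthorized sending.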

The Diagnostic Mistakes to Avoid

Don't conclude based on a single data point. A complaint spike could be a campaign issue, a list issue, or an ESP issue. Triangulate before committing to a hypothesis.

Don't ignore correlations because they are inconvenient. If the incident started 4 hours after a campaign launch and the campaign had unusual targeting, the campaign is probably the cause. The instinct to defend the campaign is strong; resist it during diagnosis.

Don't waste time on speculative causes. Hypotheses such as "maybe Google changed their algorithm overnight" are usually guesses with no supporting evidence. Major mailbox provider changes are not the typical cause of single-sender incidents. Investigate the boring, common causes first.

Hour 2-3: Implement Targeted Containment

Diagnosis tells you what to fix. The fix in the third hour is not the full remediation; it is the specific containment action that stops the incident from worsening while you plan recovery.

Containment by Cause

Marketing campaign caused the spike.

Suppress the affected segment from all future sends
Audit similar segments and pause sends to them
Review campaign approval workflow for the gap that allowed this segment to be mailed

List quality issue.

Suppress recently imported addresses
Tighten list ingestion validation
Plan reputation recovery once cleanup is complete

Authentication failure.

Identify the broken authentication source (often a recent change)
Revert or re-deploy the configuration
Verify with test sends that authentication is restored

Blocklist listing.

Confirm the listing and reason
If shared IP, escalate to ESP immediately
If dedicated IP, address root cause and submit delisting request

ESP-side incident.

Engage ESP support
Consider failover to backup sending path if available
Communicate to internal stakeholders about expected duration

Compromise.

Rotate credentials immediately
Audit sending logs for unauthorized activity
Engage security team for full incident response (this becomes a security incident, not just a deliverability incident)

The containment action does not solve the problem permanently. It stops new damage. Recovery begins after containment is confirmed.

What "Confirmed Containment" Looks Like

You can move from containment to recovery planning when:

The metric that signaled the incident has stopped worsening
The root cause is understood with high confidence
A specific corrective action has been taken (not just identified)
New data confirms the action had the expected effect

If any of these is missing, containment is not confirmed. Continue to monitor and hold off on recovery announcement until the incident is stable.
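The four conditions work as a strict checklist: all must hold. A minimal sketch, assuming the incident metric is one where higher is worse (such as complaint rate) and that "stopped worsening" means the last three readings did not increase; the window size and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ContainmentStatus:
    metric_readings: Sequence[float]     # oldest first; higher means worse
    root_cause_confident: bool
    corrective_action_taken: bool
    effect_confirmed_by_new_data: bool


def metric_stopped_worsening(readings: Sequence[float], window: int = 3) -> bool:
    """True when the last `window` readings never increase."""
    recent = list(readings)[-window:]
    if len(recent) < window:
        return False  # not enough data to call it stable
    return all(later <= earlier for earlier, later in zip(recent, recent[1:]))


def containment_confirmed(status: ContainmentStatus) -> bool:
    """All four conditions must hold; any missing one means keep monitoring."""
    return (
        metric_stopped_worsening(status.metric_readings)
        and status.root_cause_confident
        and status.corrective_action_taken
        and status.effect_confirmed_by_new_data
    )
```

Treating too few readings as "not stable" is deliberate: it errs toward continued monitoring rather than a premature recovery announcement.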

Hour 3-4: Communicate to Stakeholders

By hour 4, internal stakeholders need a clear status. The communication should answer four questions:

1. What happened? Plain language description of the incident.

2. What is the impact? Concrete numbers where possible: percentage of mail affected, customer-facing impact, revenue impact estimate.

3. What are we doing? Current containment actions and recovery plan.

4. When will it be resolved? Honest estimate of resolution timeline.

The communication should not be defensive, should not minimize impact, and should not promise recovery faster than is realistic. Stakeholders generally tolerate bad news delivered honestly. They do not tolerate surprise updates that contradict earlier optimistic communications.

The Hour-4 Status Update Template

Subject: [Email Deliverability Incident] Status Update - [Date]

Summary:
At approximately [time], we identified [brief description of incident].
We have contained the immediate impact and are now working on recovery.

Current Impact:
- [X%] of marketing email is currently being throttled or routed to spam
- Estimated customer-visible impact: [description]
- Transactional mail (password resets, receipts) is [unaffected / affected with description]

Root Cause:
[Brief description of confirmed root cause]

Actions Taken:
- [Specific containment actions]
- [Communication to ESP / external parties if applicable]

Recovery Plan:
- [Phase 1: immediate next steps]
- [Phase 2: medium-term remediation]
- Estimated time to full recovery: [X days/weeks based on the type of incident]

Next Update:
[Time of next status update]

Incident Channel:
[Link to dedicated channel/thread]

This template can be adapted for different audiences (executive, customer-facing, technical team) but the four core questions stay the same.

What Comes Next: The 90-Day Recovery

The first 4 hours are the emergency response. Recovery is a multi-week project. The reputation damage from a significant incident takes 14-90 days to fully recover, depending on severity.

The recovery framework, to be covered in detail in "Sender Reputation Recovery: A 90-Day Plan After a Major Incident" once that article is published:

Days 1-7: Stabilization. Send only to most-engaged segment. Confirm reputation has stopped declining.

Days 7-30: Gradual ramp. Expand audience in segments based on engagement, similar to IP warming.

Days 30-90: Full recovery. Return to normal volume. Continue tighter monitoring. Document lessons learned.

The structure of recovery looks similar to IP warming because the underlying mechanism is similar: rebuilding reputation through positive engagement signals. The difference is that you are starting from negative reputation rather than no reputation.
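A warming-style ramp can be planned as a simple schedule of daily volumes. The sketch below is an assumption-heavy illustration: the 10% starting fraction and the 1.5x growth factor are placeholder values, not recommendations from any mailbox provider, and a real ramp should hold or step back whenever reputation signals worsen.

```python
def ramp_schedule(full_volume: int, start_fraction: float = 0.1,
                  growth: float = 1.5, max_steps: int = 30) -> list[int]:
    """Return daily send volumes that grow geometrically up to full_volume."""
    schedule = []
    volume = full_volume * start_fraction
    for _ in range(max_steps):
        if volume >= full_volume:
            break
        schedule.append(int(volume))
        volume *= growth
    # Final step: return to normal volume.
    schedule.append(full_volume)
    return schedule
```

For example, `ramp_schedule(100000, 0.1, 2.0)` starts at 10,000 messages and doubles each step until it reaches the full 100,000. Each step should go to the most engaged recipients not yet included, in line with the stabilization and gradual-ramp phases above.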

Pre-Incident Preparation

The single best predictor of incident recovery time is whether the team has prepared in advance. The work to do before an incident:

1. Write the playbook. A documented procedure that someone unfamiliar with the specific incident can follow. The version above is a starting point; adapt to your organization.

2. Pre-assign roles. Who pauses sending? Who diagnoses? Who communicates? Who decides on the ESP escalation? Names, not titles. The names should know they have the role.

3. Pre-authorize the hard decisions. Pausing marketing during an incident is a revenue-affecting decision. Pre-authorize the deliverability lead to make this call without marketing approval during incidents. Without pre-authorization, you waste the first hour debating authority.

4. Run tabletop exercises. Annually at minimum. Pose a hypothetical scenario, walk through the playbook, identify gaps. The first time you use the playbook should not be during a real incident.

5. Maintain monitoring. Postmaster Tools, SNDS, DMARC tooling, blocklist monitors. These should be checked daily, not just during incidents. The faster you detect, the more of the 4-hour window you actually have.

6. Document the inventory. Every sending source, IP, authentication configuration, ESP relationship. During an incident is not the time to figure out what infrastructure exists.

Most senders skip this preparation. Most senders also struggle when an incident happens. The correlation is not coincidence.

Frequently Asked Questions

How do I know an incident has actually started, vs. normal fluctuation?

The signal is sudden and significant. Postmaster Tools showing reputation drop within 24-48 hours from a previously stable baseline is incident-level. Gradual declines over weeks are operational issues, not incidents.

Should I tell customers during an incident?

For B2C senders, usually not unless impact is severe and prolonged. For B2B SaaS senders, often yes: customers should know if their notifications are being affected. The decision should be in your communication playbook, not made ad hoc during the incident.

What if the ESP is the cause?

Engage ESP support immediately. Escalate through your account manager if you have one. Consider activating any pre-arranged failover paths. This is one of the few scenarios where you cannot fully self-resolve.

How often do these incidents actually happen?

For mature senders with good operational hygiene: rare, perhaps once per year or less. For senders without proper hygiene: more often, sometimes every few months. The frequency is itself a signal of underlying program quality.

What if I do not have a deliverability team?

The work falls to whoever is closest to the email infrastructure. The playbook still applies. The role assignments may collapse to fewer people, but the steps remain the same.

Do I need to involve legal or compliance?

For incidents involving suspected compromise, customer data, or regulatory implications, yes. For typical reputation incidents, no. The playbook should specify which incident types trigger legal involvement.

What happens to deferred sends after the incident?

Resume gradually, not all at once. Re-queueing a week of paused sends in a single day produces a volume spike that can re-trigger filtering issues. Spread the re-send over several days, prioritizing the most time-sensitive content first.
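Spreading the backlog can be sketched as packing paused sends into days under a daily volume cap, most time-sensitive first. The `PausedSend` fields, the priority convention, and the cap are assumptions for illustration; this version does not split a single send across days, so a send larger than the cap simply gets a day to itself.

```python
from dataclasses import dataclass


@dataclass
class PausedSend:
    name: str
    recipients: int
    priority: int   # assumed convention: lower number = more time-sensitive


def spread_backlog(backlog, daily_cap: int):
    """Group paused sends into consecutive days without exceeding daily_cap."""
    if not backlog:
        return []
    days = [[]]
    used = 0
    for send in sorted(backlog, key=lambda s: s.priority):
        # Start a new day when this send would push the current day over the cap.
        if days[-1] and used + send.recipients > daily_cap:
            days.append([])
            used = 0
        days[-1].append(send.name)
        used += send.recipients
    return days
```

Set the cap from your normal daily volume (or lower, if you are still in the stabilization phase) rather than from the size of the backlog.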

Key Takeaways

Deliverability incidents are sudden, significant changes in delivery posture that materially affect the business. Not every degradation is an incident.
The first 4 hours determine whether recovery takes days or weeks. Pre-prepared teams handle them faster than ad-hoc responses.
Hour 0-1: Stop non-essential sends. Open diagnostic channels. Establish a communication channel.
Hour 1-2: Diagnose root cause. Six common causes (campaign-driven, list quality, authentication, blocklist, ESP-side, compromise) account for ~85% of incidents.
Hour 2-3: Implement targeted containment based on confirmed cause. Containment stops new damage; full remediation comes later.
Hour 3-4: Communicate to stakeholders with the four core questions: what happened, what is the impact, what are we doing, when will it be resolved.
Pre-incident preparation: written playbook, pre-assigned roles, pre-authorized decisions, tabletop exercises, monitoring, documented inventory.
Recovery is a 14-90 day project. The 4-hour playbook is the start, not the end.

If you are building incident response capability for the first time, the prerequisites in this series are Postmaster Tools vs. SNDS and The DMARC Aggregate Report Stack for the diagnostic infrastructure, Blocklist Remediation Playbook for one of the most common containment scenarios, and The 0.3% Spam Complaint Threshold for the metric that signals most incidents.

Most senders write the playbook after their first major incident. The senders who write it before are the ones who treat email infrastructure as production infrastructure, because that is what it is. The cost of preparation is low. The cost of the first uncoordinated incident response is significantly higher than the preparation would have been.
