When email reputation collapses, the first 4 hours determine the recovery timeline. Triage, containment, communication: the operational playbook teams actually need.
What Counts as a Deliverability Incident
Not every degradation is an incident. The distinction matters because incident response is operationally expensive: it pulls people from other work, halts revenue-generating sends, and creates organizational urgency that should not be triggered casually.
A deliverability incident is a sudden, significant change in delivery posture that:
- Materially affects the business (revenue, customer experience, security)
- Cannot be ignored without compounding damage
- Requires intervention beyond normal operations to address
Examples:
Yes, this is an incident:
- Domain reputation drops from High to Bad in Postmaster Tools over 24-48 hours
- Spam complaint rate spikes from 0.05% to 0.4% in a single day
- Blocklist listing on Spamhaus or another high-impact list
- DMARC enforcement causing legitimate mail to be rejected
- ESP-side incident affecting authentication or delivery
- Sudden bounce rate increase from a normally-stable source
No, this is not an incident:
- Gradual decline in open rates over weeks
- Single-digit fluctuations in complaint rate
- One Postmaster Tools chart showing a transient dip
- A campaign that performed below expectations
The line is sharp because incident response procedures are designed for the first category. Applying them to the second category is overkill and trains the organization to ignore the procedures when they actually matter.
For this playbook, "incident" means the first category. If you are not sure which you have, the early diagnostic steps will tell you within an hour.
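The distinction above can be sketched as a small classifier. This is a minimal illustration, not a calibrated detector: the `Snapshot` fields, the spike factor, and the two-level reputation drop are hypothetical choices, and the 0.3% floor is borrowed from the complaint threshold referenced elsewhere in this series. Tune all of them against your own baselines.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only.
COMPLAINT_RATE_FLOOR_PCT = 0.3   # complaint rate (percent) treated as serious
COMPLAINT_SPIKE_FACTOR = 4.0     # multiple of baseline treated as "sudden"
REPUTATION_DROP_LEVELS = 2       # e.g. High -> Low or worse in Postmaster Tools


@dataclass
class Snapshot:
    complaint_rate_pct: float
    baseline_complaint_rate_pct: float
    reputation_levels_dropped: int = 0
    blocklisted: bool = False
    legit_mail_rejected_by_dmarc: bool = False


def is_incident(s: Snapshot) -> bool:
    """Return True only for sudden, significant changes, not routine noise."""
    complaint_spike = (
        s.baseline_complaint_rate_pct > 0
        and s.complaint_rate_pct >= COMPLAINT_RATE_FLOOR_PCT
        and s.complaint_rate_pct / s.baseline_complaint_rate_pct
        >= COMPLAINT_SPIKE_FACTOR
    )
    return (
        complaint_spike
        or s.blocklisted
        or s.legit_mail_rejected_by_dmarc
        or s.reputation_levels_dropped >= REPUTATION_DROP_LEVELS
    )
```

With these assumed thresholds, a jump from 0.05% to 0.4% qualifies, while a drift from 0.05% to 0.07% does not.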
Hour 0-1: Stop the Bleeding
The first hour is about containment. The diagnostic and remediation steps come after. The single highest-leverage action in the first hour is reducing the surface area of the problem.
Stop Non-Essential Sends
The instinct is often to keep sending while you investigate. This is wrong. Every additional send during an incident either:
- Adds to the volume that is being filtered (worsening reputation)
- Adds to the population that may complain (increasing complaint rate)
- Generates new bounces or rejections (adding noise to diagnostics)
The right action: pause all non-transactional sending immediately. Marketing campaigns, drip sequences, scheduled newsletters: pause them. The lost sends can be re-queued after the incident is resolved. The reputation damage from sending into the storm cannot be reversed.
What to keep running:
- Genuinely transactional mail (password resets, receipts, security notifications)
- High-priority customer-triggered messages (support replies)
- Mail required for legal or compliance obligations
What to pause:
- All marketing and promotional sends
- Drip sequences and automation
- Re-engagement campaigns (these are highest-risk in any incident)
- Notification mail that can be deferred (digests, summaries)
This decision needs to be made within 30 minutes of incident detection. Pre-authorization for the deliverability team to make this call without seeking marketing approval is the most important organizational structure to have in place before an incident.
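One way to make the pause decision fast is to encode it ahead of time as an allow-list. The sketch below assumes each sending stream carries a type label; the label names are hypothetical and should be mapped to whatever your ESP or pipeline actually uses. Anything not explicitly allow-listed defaults to paused, so a stream nobody remembered to classify does not keep sending by accident.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    PAUSE = "pause"


# Hypothetical stream labels: transactional, customer-triggered, and
# legally required mail stays on during an incident.
KEEP_RUNNING = frozenset({
    "password_reset",
    "receipt",
    "security_notification",
    "support_reply",
    "legal_compliance",
})


def triage_stream(stream_type: str) -> Action:
    """Default to PAUSE for anything that is not explicitly allow-listed."""
    return Action.KEEP if stream_type in KEEP_RUNNING else Action.PAUSE


def pause_plan(stream_types):
    """Map every known stream to the action to take right now."""
    return {stream: triage_stream(stream) for stream in stream_types}
```

For example, `pause_plan(["receipt", "newsletter", "re_engagement"])` keeps receipts running and pauses the other two.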
Open the Diagnostic Channels
Before diving into root cause, ensure the data sources are open:
- Postmaster Tools for Gmail reputation and complaint rate
- SNDS for Microsoft IP reputation
- ESP dashboards for delivery rates, bounce rates, deferral rates
- DMARC monitoring tool (the aggregate report stack)
- Blocklist lookup tools (MxToolbox, HetrixTools)
- Internal logging for sending pipeline status
Have a single person responsible for collecting data from each source. Time spent debating who should look at what is time the incident is compounding.
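Blocklist lookup tools are built on plain DNS queries, so a check can be scripted when a web tool is slow or unavailable. A minimal sketch, assuming IPv4 and the standard DNSBL convention of reversing the octets and appending the list's zone (`zen.spamhaus.org` is used here only as an example zone). The function names are illustrative. Some list operators refuse or rate-limit queries from public resolvers and signal that with special return codes, so check the operator's documentation before trusting a result.

```python
import ipaddress
import socket
from typing import Optional


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets plus the list zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 addresses only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def dnsbl_lookup(ip: str, zone: str) -> Optional[str]:
    """Return the list's answer (a 127.x.x.x code) or None if no record exists.

    Caveat: a resolver failure also ends up as None here, so a None result
    is not proof that the IP is clean.
    """
    try:
        return socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return None
```

Calling `dnsbl_query_name("192.0.2.10", "zen.spamhaus.org")` yields `10.2.0.192.zen.spamhaus.org`, which is the name the lookup resolves.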
Establish a Communication Channel
Open a dedicated Slack channel, Teams thread, or chat for the incident. Pin it. Add the relevant people. The communication channel becomes the single source of truth for the duration of the incident. Decisions made elsewhere (DMs, side conversations, hallway discussions) do not exist for incident-response purposes.
Pre-defined attendee list:
- Deliverability/email operations lead
- Engineering or platform lead
- Marketing operations representative
- Communications / customer success representative
- Executive sponsor (informed, not directing)
Hour 1-2: Diagnose the Root Cause
By hour 1, sending is contained, data sources are open, and people are coordinated. Now identify what actually went wrong.
The diagnostic question is not "what happened?" It is "what is the most likely cause given what we observe, and what would confirm or rule out that hypothesis?"
The Six Most Common Root Causes
In my experience auditing deliverability incidents, six causes account for ~85% of cases.
1. Marketing campaign produced unexpected complaint spike.
Diagnostic: Postmaster Tools shows complaint rate spike correlated with a specific recent campaign. Check campaign send timing against the spike onset.
Confirmation: Review the campaign's audience segment, content, and historical performance. If audience was atypical (re-engagement, dormant segment, recent acquisition), this is likely the cause.
2. List quality degradation.
Diagnostic: Bounce rate increase, complaint rate increase, possible spam-trap hit. May trace to a recent list import or acquisition campaign.
Confirmation: Review recent list additions in the last 30 days. Check for source data quality issues.
3. Authentication failure or misconfiguration.
Diagnostic: DMARC aggregate reports show sudden drop in pass rates. SPF, DKIM, or alignment failures appearing where they previously passed.
Confirmation: Test authentication on a sample message via Gmail "Show original." Compare against historical baseline.
4. Blocklist listing.
Diagnostic: SMTP rejection messages reference specific blocklists. Multi-list lookup tools confirm listing.
Confirmation: Check the listing reason at the source blocklist (Spamhaus, SpamCop, etc.). Identify which IPs are listed.
5. ESP-side incident.
Diagnostic: Multiple senders on the same ESP report similar issues. ESP status page shows incident. Shared IP pool reputation has shifted.
Confirmation: Check ESP status page. Reach out to ESP support. Compare with peers if industry channels exist.
6. Compromise or unauthorized sending.
Diagnostic: Volume spike from unexpected source. DMARC reports showing mail from unfamiliar IPs claiming your domain. Customer reports of unusual messages.
Confirmation: Review sending source IPs in DMARC aggregate reports. Audit recent access to ESP accounts and sending infrastructure.
The diagnostic time should be 30-60 minutes. If after an hour the cause is not at least narrowed to one of the six, the incident is unusual and may require external help (specialized deliverability consultant, ESP escalation).
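For causes 3 and 6, the DMARC aggregate reports already contain the evidence: the pass rate and the list of source IPs. The sketch below assumes a report in the standard aggregate XML layout without an XML namespace (some reporters add one), and a hypothetical `known_ips` set taken from your sending inventory. It counts a row as passing when either the DKIM or the SPF result under `policy_evaluated` is `pass`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_dmarc(xml_text: str, known_ips: set):
    """Return (pass_rate, {unfamiliar_ip: message_count}) for one report."""
    root = ET.fromstring(xml_text)
    total = passed = 0
    unfamiliar = defaultdict(int)
    for row in root.iter("row"):
        ip = row.findtext("source_ip", default="").strip()
        count = int(row.findtext("count", default="0"))
        dkim = row.findtext("policy_evaluated/dkim", default="fail").strip()
        spf = row.findtext("policy_evaluated/spf", default="fail").strip()
        total += count
        if dkim == "pass" or spf == "pass":
            passed += count
        if ip not in known_ips:
            unfamiliar[ip] += count
    pass_rate = passed / total if total else 0.0
    return pass_rate, dict(unfamiliar)


# Made-up sample data using documentation-reserved IP ranges.
SAMPLE_REPORT = """<feedback>
  <record><row><source_ip>192.0.2.10</source_ip><count>90</count>
    <policy_evaluated><disposition>none</disposition>
      <dkim>pass</dkim><spf>pass</spf></policy_evaluated></row></record>
  <record><row><source_ip>203.0.113.7</source_ip><count>10</count>
    <policy_evaluated><disposition>reject</disposition>
      <dkim>fail</dkim><spf>fail</spf></policy_evaluated></row></record>
</feedback>"""

pass_rate, unfamiliar = summarize_dmarc(SAMPLE_REPORT, known_ips={"192.0.2.10"})
```

A sudden drop in `pass_rate` against your historical baseline points toward an authentication problem; a non-empty `unfamiliar` map points toward unauthorized sending or an undocumented sending source.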
The Diagnostic Mistakes to Avoid
Don't conclude based on a single data point. A complaint spike could be a campaign issue, a list issue, or an ESP issue. Triangulate before committing to a hypothesis.
Don't ignore correlations because they are inconvenient. If the incident started 4 hours after a campaign launch and the campaign had unusual targeting, the campaign is probably the cause. The instinct to defend the campaign is strong; resist it during diagnosis.
Don't waste time on speculative causes. Phantom hypotheses ("maybe Google changed their algorithm overnight") are guesses with no evidence behind them. Major mailbox provider changes are not the typical cause of single-sender incidents. Investigate the boring, common causes first.
Hour 2-3: Implement Targeted Containment
Diagnosis tells you what to fix. The fix in the third hour is not the full remediation; it is the specific containment action that stops the incident from worsening while you plan recovery.
Containment by Cause
Marketing campaign caused the spike.
- Suppress the affected segment from all future sends
- Audit similar segments and pause sends to them
- Review campaign approval workflow for the gap that allowed this segment to be mailed
List quality issue.
- Suppress recently imported addresses
- Tighten list ingestion validation
- Plan reputation recovery once cleanup is complete
Authentication failure.
- Identify the broken authentication source (often a recent change)
- Revert or re-deploy the configuration
- Verify with test sends that authentication is restored
Blocklist listing.
- Confirm the listing and reason
- If shared IP, escalate to ESP immediately
- If dedicated IP, address root cause and submit delisting request
ESP-side incident.
- Engage ESP support
- Consider failover to backup sending path if available
- Communicate to internal stakeholders about expected duration
Compromise.
- Rotate credentials immediately
- Audit sending logs for unauthorized activity
- Engage security team for full incident response (this becomes a security incident, not just a deliverability incident)
The containment action does not solve the problem permanently. It stops new damage. Recovery begins after containment is confirmed.
What "Confirmed Containment" Looks Like
You can move from containment to recovery planning when:
- The metric that signaled the incident has stopped worsening
- The root cause is understood with high confidence
- A specific corrective action has been taken (not just identified)
- New data confirms the action had the expected effect
If any of these is missing, containment is not confirmed. Continue to monitor and hold off on recovery announcement until the incident is stable.
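The four conditions above work as a strict checklist: all must hold, and the first one can be checked mechanically. A minimal sketch, assuming the incident metric is recorded as a series where higher is worse (complaint rate, bounce rate). The three-reading window is a hypothetical choice, not a recommendation from any mailbox provider.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ContainmentStatus:
    metric_readings: Sequence[float]   # oldest first; higher is worse
    root_cause_confident: bool
    corrective_action_taken: bool      # taken, not merely identified
    effect_confirmed_by_new_data: bool


def metric_stopped_worsening(readings: Sequence[float], window: int = 3) -> bool:
    """True when the last `window` readings never rise from one to the next."""
    recent = list(readings)[-window:]
    if len(recent) < window:
        return False  # not enough data to call it stable
    return all(later <= earlier for earlier, later in zip(recent, recent[1:]))


def containment_confirmed(status: ContainmentStatus) -> bool:
    """All four conditions must hold; any missing one means keep monitoring."""
    return (
        metric_stopped_worsening(status.metric_readings)
        and status.root_cause_confident
        and status.corrective_action_taken
        and status.effect_confirmed_by_new_data
    )
```

Requiring several consecutive non-rising readings guards against declaring containment on a single lucky data point.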
Hour 3-4: Communicate to Stakeholders
By hour 4, internal stakeholders need a clear status. The communication should answer four questions:
1. What happened? Plain language description of the incident.
2. What is the impact? Concrete numbers where possible: percentage of mail affected, customer-facing impact, revenue impact estimate.
3. What are we doing? Current containment actions and recovery plan.
4. When will it be resolved? Honest estimate of resolution timeline.
The communication should not be defensive, should not minimize impact, and should not promise recovery faster than is realistic. Stakeholders generally tolerate bad news delivered honestly. They do not tolerate surprise updates that contradict earlier optimistic communications.
The Hour-4 Status Update Template
Subject: [Email Deliverability Incident] Status Update - [Date]

Summary:
At approximately [time], we identified [brief description of incident].
We have contained the immediate impact and are now working on recovery.

Current Impact:
- [X%] of marketing email is currently being throttled or routed to spam
- Estimated customer-visible impact: [description]
- Transactional mail (password resets, receipts) is [unaffected / affected with description]

Root Cause:
[Brief description of confirmed root cause]

Actions Taken:
- [Specific containment actions]
- [Communication to ESP / external parties if applicable]

Recovery Plan:
- [Phase 1: immediate next steps]
- [Phase 2: medium-term remediation]
- Estimated time to full recovery: [X days/weeks based on the type of incident]

Next Update:
[Time of next status update]

Incident Channel:
[Link to dedicated channel/thread]

This template can be adapted for different audiences (executive, customer-facing, technical team), but the four core questions stay the same.
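If status updates are generated from a script or chat bot, it helps to refuse to send one that leaves any of the four core questions unanswered. The sketch below is a condensed, hypothetical variant of the template, not a replacement for it. The field names and the `render_status_update` helper are assumptions for illustration.

```python
# One field per core question; names are illustrative.
REQUIRED_FIELDS = ("what_happened", "impact", "actions", "resolution_estimate")

TEMPLATE = (
    "Subject: [Email Deliverability Incident] Status Update - {date}\n\n"
    "Summary:\n{what_happened}\n\n"
    "Current Impact:\n{impact}\n\n"
    "Actions Taken and Recovery Plan:\n{actions}\n\n"
    "Estimated time to full recovery: {resolution_estimate}\n"
)


def render_status_update(date: str, **answers: str) -> str:
    """Fill the template, failing loudly if a core question is unanswered."""
    missing = [
        field for field in REQUIRED_FIELDS
        if not str(answers.get(field, "")).strip()
    ]
    if missing:
        raise ValueError("unanswered questions: " + ", ".join(missing))
    return TEMPLATE.format(date=date, **answers)
```

Raising an error instead of sending a partial update keeps an incomplete message from becoming the optimistic communication that later has to be contradicted.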
What Comes Next: The 90-Day Recovery
The first 4 hours are the emergency response. Recovery is a multi-week project. The reputation damage from a significant incident takes 14-90 days to fully recover, depending on severity.
The recovery framework (to be covered in detail in Sender Reputation Recovery: A 90-Day Plan After a Major Incident, when published):
Days 1-7: Stabilization. Send only to most-engaged segment. Confirm reputation has stopped declining.
Days 7-30: Gradual ramp. Expand audience in segments based on engagement, similar to IP warming.
Days 30-90: Full recovery. Return to normal volume. Continue tighter monitoring. Document lessons learned.
The structure of recovery looks similar to IP warming because the underlying mechanism is similar: rebuilding reputation through positive engagement signals. The difference is that you are starting from negative reputation rather than no reputation.
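The three phases can be encoded so that sending tools know which audience rule applies on a given day. A minimal sketch: the phase ranges in the text share their boundary days (7 and 30), so this version assumes a boundary day belongs to the earlier phase. The function name and the audience descriptions are illustrative paraphrases of the framework, not fixed rules.

```python
def recovery_phase(day: int) -> str:
    """Map a day since containment (1-90) to its recovery phase."""
    if not 1 <= day <= 90:
        raise ValueError("day must be between 1 and 90")
    if day <= 7:
        return "stabilization"
    if day <= 30:
        return "gradual ramp"
    return "full recovery"


# Audience guidance per phase, paraphrased from the framework above.
AUDIENCE_BY_PHASE = {
    "stabilization": "most-engaged segment only",
    "gradual ramp": "expand by engagement segment, as in IP warming",
    "full recovery": "normal volume with tighter monitoring",
}
```

For example, `AUDIENCE_BY_PHASE[recovery_phase(10)]` returns the gradual-ramp guidance.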
Pre-Incident Preparation
The single best predictor of incident recovery time is whether the team has prepared in advance. The work to do before an incident:
1. Write the playbook. A documented procedure that someone unfamiliar with the specific incident can follow. The version above is a starting point; adapt to your organization.
2. Pre-assign roles. Who pauses sending? Who diagnoses? Who communicates? Who decides on the ESP escalation? Names, not titles. The names should know they have the role.
3. Pre-authorize the hard decisions. Pausing marketing during an incident is a revenue-affecting decision. Pre-authorize the deliverability lead to make this call without marketing approval during incidents. Without pre-authorization, you waste the first hour debating authority.
4. Run tabletop exercises. Annually at minimum. Pose a hypothetical scenario, walk through the playbook, identify gaps. The first time you use the playbook should not be during a real incident.
5. Maintain monitoring. Postmaster Tools, SNDS, DMARC tooling, blocklist monitors. These should be checked daily, not just during incidents. The faster you detect, the more of the 4-hour window you actually have.
6. Document the inventory. Every sending source, IP, authentication configuration, ESP relationship. During an incident is not the time to figure out what infrastructure exists.
Most senders skip this preparation. Most senders also struggle when an incident happens. The correlation is not coincidence.
Frequently Asked Questions
How do I know an incident has actually started, vs. normal fluctuation?
Should I tell customers during an incident?
What if the ESP is the cause?
How often do these incidents actually happen?
What if I do not have a deliverability team?
Do I need to involve legal or compliance?
What happens to deferred sends after the incident?
Key Takeaways
- Deliverability incidents are sudden, significant changes in delivery posture that materially affect the business. Not every degradation is an incident.
- The first 4 hours determine whether recovery takes days or weeks. Pre-prepared teams handle them faster than ad-hoc responses.
- Hour 0-1: Stop non-essential sends. Open diagnostic channels. Establish a communication channel.
- Hour 1-2: Diagnose root cause. Six common causes (campaign-driven, list quality, authentication, blocklist, ESP-side, compromise) account for ~85% of incidents.
- Hour 2-3: Implement targeted containment based on confirmed cause. Containment stops new damage; full remediation comes later.
- Hour 3-4: Communicate to stakeholders with the four core questions: what happened, what is the impact, what are we doing, when will it be resolved.
- Pre-incident preparation: written playbook, pre-assigned roles, pre-authorized decisions, tabletop exercises, monitoring, documented inventory.
- Recovery is a 14-90 day project. The 4-hour playbook is the start, not the end.
If you are building incident response capability for the first time, the prerequisites in this series are Postmaster Tools vs. SNDS and The DMARC Aggregate Report Stack for the diagnostic infrastructure, Blocklist Remediation Playbook for one of the most common containment scenarios, and The 0.3% Spam Complaint Threshold for the metric that signals most incidents.
Most senders write the playbook after their first major incident. The senders who write it before are the ones who treat email infrastructure as production infrastructure, because that is what it is. The cost of preparation is low. The cost of the first uncoordinated incident response is significantly higher than the preparation would have been.