Incident Response

Database down? Start here.

Contact Information

Email:    jeremy@intentsolutions.io
Response: Within 30 minutes (typically)
Critical: IMMEDIATE response - automated 24/7 alerts

I'll see your message. For P0 critical issues, my phone alerts automatically.

Is This an Emergency?

Use this page if you're experiencing an active issue right now. If you're just reading about our reliability practices, check out Reliability & Recovery instead.


Severity Levels

How we classify incidents:

P0 - CRITICAL: Service down or data at risk. Examples: database offline, data loss, security breach, backup failure causing data loss. Response time: 30 minutes.
P1 - HIGH: Severe degradation. Examples: major performance issues, backup failures (non-critical), security alerts. Response time: 2 hours.
P2 - MEDIUM: Moderate impact. Examples: minor performance issues, non-critical errors, monitoring alerts. Response time: 4 hours.
P3 - LOW: Minimal impact. Examples: questions, minor bugs, feature requests. Response time: next business day.

Note: Official SLA is 4-hour response during business hours (M-F 9am-6pm ET). Reality: I typically respond within 30 minutes, 7 days/week. Critical outages trigger immediate phone alerts 24/7.


What to Do Right Now

Scenario 1: Database is Completely Down (P0)

Symptoms: Can't connect. Application shows connection errors. All queries fail.

Immediate Actions
1. Check the status page (coming soon: status.costplusdb.com)
   For now: I'll email you if there's a known outage

2. Verify it's not DNS or network:
   ping your-host.costplusdb.com
   (A reply means the host is reachable; "unknown host" or "could not resolve" points to DNS; no reply at all points to a network or firewall issue)

3. Test the connection:
   psql "postgresql://user@host:5432/db?sslmode=require"

   If you get "Connection refused" → Database is down (P0)
   If you get "Timeout" → Network issue (check your firewall)
   If you get an authentication error → Database is up, wrong credentials

4. Email me immediately: jeremy@intentsolutions.io
   Subject: "P0: Database Down - [your company]"
   Include: Error messages, when it started, any recent changes

What I'm Doing
You don't need to tell me the database is down.

Automated alerts already notified me:
- Betterstack detected the outage within 1 minute
- My phone alerted me immediately
- I'm already investigating or working on it

What happens next:
1. I diagnose the issue (5-15 minutes) - see the sketch after this list
   - Check if PostgreSQL is running
   - Review system logs
   - Check disk space, memory, CPU

2. I fix or restore (15-90 minutes depending on cause)
   - Simple fix: Restart services, clear locks
   - Complex: Restore from backup, migrate to new VPS

3. I notify you when resolved
   - Email with incident summary
   - Explanation of what happened
   - Steps to prevent recurrence
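
For reference, the first diagnosis pass on the server looks roughly like this. It's a sketch only: the service name, log unit, and commands assume a typical systemd-managed PostgreSQL install and may differ from the actual setup.

# Is PostgreSQL running? (the unit name may differ, e.g. postgresql@16-main)
systemctl status postgresql

# Recent PostgreSQL service and system log entries
journalctl -u postgresql --since "30 minutes ago"

# Disk space, memory, and load
df -h
free -m
uptime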

Expected Timeline

Event                    Time
Detection (automated)    1-5 minutes
My response              Immediate (phone alert)
Diagnosis                5-15 minutes
Resolution               30-120 minutes

Data loss: Maximum 24 hours (last backup). Usually zero if it's a service issue, not data corruption.


Scenario 2: Slow Performance / Connection Issues (P1)

Symptoms: Queries are slow. Connection timeouts. Application feels sluggish.

Quick Diagnostics You Can Run

These queries help us diagnose faster:

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Look for long-running queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;

-- Check for locks
SELECT pid, mode, granted, query
FROM pg_locks
JOIN pg_stat_activity USING (pid)
WHERE NOT granted;

Send me the results when you email. It speeds up diagnosis significantly.
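
If it's easier, you can run a diagnostic query non-interactively and attach the output to your email. A sketch; the connection string and output file name are placeholders:

psql "postgresql://user@host.costplusdb.com:5432/dbname?sslmode=require" \
     -c "SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
         FROM pg_stat_activity
         WHERE state = 'active'
         ORDER BY duration DESC" \
     > diagnostics.txt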

Common Causes

Typical culprits: long-running or blocked queries, lock contention, too many open connections, and resource pressure (CPU, memory, or disk I/O). I'll check all of these when you contact me.

Contact Me
Email: jeremy@intentsolutions.io
Subject: "P1: Performance Issues - [your company]"

Include:
- When did it start?
- How much slower? (e.g., "queries taking 10x longer")
- Any recent changes? (new feature, traffic spike)
- Results of diagnostic queries above (if you ran them)

Scenario 3: Suspected Data Loss or Corruption (P0)

Symptoms: Missing records. Incorrect values. Data doesn't match expectations.

STOP - Do NOT Do These Things
- Do NOT run DELETE or UPDATE statements to "fix" it
- Do NOT attempt to restore backups yourself
- Do NOT restart the database
- Do NOT run VACUUM FULL

Why? You might overwrite the data we need to recover.

DO These Things Immediately
1. Document what you observe:
   - Which records are missing/wrong?
   - When did you notice it?
   - When was the data last correct? (approximate time)
   - What operations were running? (imports, updates, deletes)

2. If possible, export the current state:
   pg_dump -h host -U user -d db -F c -f emergency-$(date +%Y%m%d).dump

   This preserves what's there now, before we restore.

3. Email me IMMEDIATELY:
   Subject: "P0: DATA LOSS - [your company]"
   Mark as urgent.

What I'll Do
Recovery process:

1. Assess the scope (10-20 minutes)
   - What data is affected?
   - When did corruption occur?
   - What's the best recovery point?

2. Point-in-time recovery (30-90 minutes)
   - Restore from backup to specific timestamp
   - Replay WAL logs to the exact recovery point (see the sketch below)
   - Verify data integrity

3. Validation (15-30 minutes)
   - Confirm data is restored correctly
   - Run sanity checks
   - Test application connections

4. Cutover (5-10 minutes)
   - Switch your application to restored database
   - Verify everything works
   - Monitor for issues

Total time: 1-2.5 hours typically
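
For the curious: on PostgreSQL 12+, point-in-time recovery is driven by a few recovery settings plus a recovery.signal file in the data directory. A rough sketch only; the WAL archive path, data directory, and target timestamp below are illustrative, not the actual configuration:

# 1. Restore the base backup into a clean data directory.
# 2. Set recovery targets in postgresql.conf (example values):
#      restore_command = 'cp /var/backups/wal/%f "%p"'
#      recovery_target_time = '2025-06-01 14:30:00+00'
#      recovery_target_action = 'promote'
# 3. Create recovery.signal and start the server; PostgreSQL replays archived
#    WAL up to the target time, then promotes to normal operation.
touch /var/lib/postgresql/16/main/recovery.signal
systemctl start postgresql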

Scenario 4: Security Concern (P0)

Symptoms: Unauthorized access attempts. Suspicious activity. Unusual queries.

What to Do
1. Do NOT investigate further
   (Preserve evidence, don't tip off attacker)

2. Note the time and what you observed:
   - What suspicious activity did you see?
   - When did you notice it?
   - Were there any unusual access patterns?

3. Email me immediately:
   Subject: "P0: SECURITY INCIDENT - [your company]"

4. If you suspect credentials are compromised:
   - Don't change them yet (wait for my guidance)
   - I need to review logs first

My Response
Security incident protocol:

1. Immediate assessment (5-15 minutes) - see the sketch after this list
   - Review PostgreSQL logs
   - Check authentication logs (/var/log/auth.log)
   - Analyze fail2ban blocks
   - Check firewall logs (UFW)

2. Containment (15-30 minutes)
   - Block suspicious IPs
   - Rotate credentials if compromised
   - Update firewall rules
   - Enable additional logging

3. Investigation (30-60 minutes)
   - Determine attack vector
   - Assess data access
   - Check for data exfiltration
   - Review all recent connections

4. Remediation (varies)
   - Patch vulnerabilities
   - Update security policies
   - Implement additional monitoring
   - Notify you of findings
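
For reference, the first assessment pass looks roughly like this on the server. A sketch only: the log paths and fail2ban jail name are common Debian/Ubuntu defaults and may not match the actual setup.

# System-level authentication failures
grep 'Failed password' /var/log/auth.log | tail -n 50

# PostgreSQL authentication failures (log location varies by install)
grep 'password authentication failed' /var/log/postgresql/postgresql-16-main.log | tail -n 50

# Current fail2ban bans (jail name is an assumption)
fail2ban-client status sshd

# Firewall state and recently blocked packets
ufw status verbose
grep 'UFW BLOCK' /var/log/ufw.log | tail -n 50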

I take security very seriously. I'll give you a full incident report.

Scenario 5: Backup or Recovery Issues (P0 if data at risk)

Symptoms: Backup failed. Can't restore. Backup verification alert.

Important: Backup failures are P0 if they put data at risk. If your database is still running normally, it's P1 (I'll fix the backup system).

Contact Me
Email: jeremy@intentsolutions.io
Subject: "P0: Backup Issue - [your company]"

Tell me:
- Did you receive a backup failure alert?
- Do you need to restore something right now? (urgent)
- Is your database still running? (if yes, less urgent)

If you need an emergency backup while waiting:
pg_dump -h host -U user -d db -F c -f backup-$(date +%Y%m%d).dump

Self-Service Emergency Tools

Emergency Database Export

If you need an immediate backup while waiting for support:

# Compressed format (recommended)
pg_dump -h your-host.costplusdb.com \
        -U your-username \
        -d your-database \
        -F c \
        -f emergency-backup-$(date +%Y%m%d-%H%M%S).dump

# Plain SQL format (easier to inspect)
pg_dump -h your-host.costplusdb.com \
        -U your-username \
        -d your-database \
        -F p \
        -f emergency-backup-$(date +%Y%m%d-%H%M%S).sql

This creates a backup you can restore independently if needed.
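
If you ever need to load that dump yourself, the matching restore looks roughly like this. The target host, user, and database names are placeholders; restoring into a fresh, empty database is the safest path.

# Create an empty database and load the custom-format dump into it
createdb -h localhost -U your-username restored_db
pg_restore -h localhost -U your-username -d restored_db \
           --no-owner --no-privileges \
           emergency-backup-YYYYMMDD-HHMMSS.dump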

Connection Test

Test if your database is reachable:

psql "postgresql://user@host.costplusdb.com:5432/dbname?sslmode=require"

Success:  You see the psql prompt → Database is up
Failure:  "Connection refused" → Database is down (P0)
Timeout:  Network/DNS issue → Check your firewall, verify hostname
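
Another quick check is pg_isready, which ships with the PostgreSQL client tools (hostname, database, and user are placeholders):

pg_isready -h your-host.costplusdb.com -p 5432 -d your-database -U your-username

# Prints "accepting connections", "rejecting connections", or "no response"
# Exit codes: 0 = up, 1 = rejecting connections, 2 = no response, 3 = no attempt (bad parameters)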

Quick Health Check Queries

-- Check database size
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Check replication lag in seconds (run on a replica, if applicable)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));

-- Check for bloat
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;

Escalation Path

How to reach me if I'm not responding:

Step 1: Email jeremy@intentsolutions.io
        (I check email constantly)
        ↓
        Wait 30 minutes for P0/P1
        Wait 2 hours for P2
        ↓
Step 2: Slack channel (if enabled in your plan)
        Pro/Enterprise: Included
        Shared/Dedicated: +$29/month
        ↓
        Wait 1 hour for P0
        ↓
Step 3: Automated phone alert triggers
        (Happens automatically for P0 database down)
        My phone rings until I acknowledge
        ↓
        Wait 2 hours
        ↓
Step 4: Emergency Backup Operator system activates
        (See: /emergency.html#emergency-backup-operator)
        Automated access to credentials if I'm unreachable

Reality check: I've never needed Steps 3-4. I respond quickly. But it's there if needed.


After the Incident

What happens after we resolve the issue:

Incident Report

You'll get a written summary:

- What happened (root cause)
- When it happened (timeline)
- What I did to fix it
- Why it happened
- How we'll prevent it next time

For significant incidents (P0/P1), I publish detailed post-mortems in incident history.

Follow-Up Actions

If the incident revealed systemic issues, I'll fix them - typically better monitoring, configuration changes, or process updates.

I learn from every incident. Your database gets more reliable over time.


Not an Active Incident?

If you're not experiencing an issue right now: