Incident Response
Database down? Start here.
Email: jeremy@intentsolutions.io
Response: Within 30 minutes (typically)
Critical: Immediate response - automated 24/7 alerts
I'll see your message. For P0 critical issues, my phone alerts automatically.
Use this page if you're experiencing an active issue right now. If you're just reading about our reliability practices, check out Reliability & Recovery instead.
How we classify incidents:
Level | What It Means | Examples | Response Time |
---|---|---|---|
P0 - CRITICAL | Service down or data at risk | Database offline, data loss, security breach, backup failure causing data loss | 30 minutes |
P1 - HIGH | Severe degradation | Major performance issues, backup failures (non-critical), security alerts | 2 hours |
P2 - MEDIUM | Moderate impact | Minor performance issues, non-critical errors, monitoring alerts | 4 hours |
P3 - LOW | Minimal impact | Questions, minor bugs, feature requests | Next business day |
Note: Official SLA is 4-hour response during business hours (M-F 9am-6pm ET). Reality: I typically respond within 30 minutes, 7 days/week. Critical outages trigger immediate phone alerts 24/7.
Symptoms: Can't connect. Application shows connection errors. All queries fail.
1. Check the status page (coming soon: status.costplusdb.com). For now, I'll email you if there's a known outage.
2. Verify it's not DNS or network:
   ping your-host.costplusdb.com
   (You should get a response. If not, it's a DNS or network issue.)
3. Test the connection (a one-shot triage script follows this list):
   psql "postgresql://user@host:5432/db?sslmode=require"
   - "Connection refused" → database is down (P0)
   - Timeout → network issue (check your firewall)
   - Authentication error → database is up, wrong credentials
4. Email me immediately: jeremy@intentsolutions.io
   Subject: "P0: Database Down - [your company]"
   Include: error messages, when it started, any recent changes.
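Here's a minimal triage sketch that rolls steps 2-3 into one script. The hostname, username, and database name are placeholders for your own connection details, and psql may prompt for your password:

```bash
#!/usr/bin/env bash
# Quick triage sketch - replace host, user, and database with your own details.
HOST="your-host.costplusdb.com"
DB_URL="postgresql://your-username@${HOST}:5432/your-database?sslmode=require"
export PGCONNECT_TIMEOUT=10   # fail fast instead of hanging

# Step 2: DNS / network reachability
if ! ping -c 3 "$HOST" > /dev/null 2>&1; then
  echo "No ping response: likely a DNS or network issue (check firewall and hostname)"
fi

# Step 3: try a trivial query and classify the error text
ERR=$(psql "$DB_URL" -c "SELECT 1;" 2>&1 >/dev/null)
if [ -z "$ERR" ]; then
  echo "Database is up and accepting connections"
elif echo "$ERR" | grep -qi "connection refused"; then
  echo "Connection refused: database appears down (P0) - email jeremy@intentsolutions.io"
elif echo "$ERR" | grep -qiE "timed out|timeout"; then
  echo "Timeout: network issue - check your firewall"
elif echo "$ERR" | grep -qiE "password|authentication"; then
  echo "Authentication error: database is up, credentials are wrong"
else
  echo "Other error: $ERR"
fi
```

Paste the script's output into your email along with the error messages from your application.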
You don't need to tell me the database is down. Automated alerts already notified me:
- Betterstack detected the outage within 1 minute
- My phone alerted me immediately
- I'm already investigating or working on it

What happens next:
1. I diagnose the issue (5-15 minutes) - see the sketch below
   - Check if PostgreSQL is running
   - Review system logs
   - Check disk space, memory, CPU
2. I fix or restore (15-90 minutes depending on cause)
   - Simple fix: restart services, clear locks
   - Complex: restore from backup, migrate to a new VPS
3. I notify you when resolved
   - Email with an incident summary
   - Explanation of what happened
   - Steps to prevent recurrence
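For the curious, the diagnosis step boils down to checks like these. This is a rough sketch assuming a systemd-based Linux host (the service name can vary, e.g. postgresql@16-main); I run these on the server, so you don't need to:

```bash
# Is PostgreSQL running?
systemctl status postgresql --no-pager

# Recent PostgreSQL log output
journalctl -u postgresql --since "30 minutes ago" --no-pager | tail -n 100

# Disk space, memory, and load
df -h
free -m
uptime
```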
Event | Time |
---|---|
Detection (automated) | 1-5 minutes |
My response | Immediate (phone alert) |
Diagnosis | 5-15 minutes |
Resolution | 30-120 minutes |
Data loss: maximum 24 hours (back to the last backup). Usually zero if it's a service issue rather than data corruption.
Symptoms: Queries are slow. Connection timeouts. Application feels sluggish.
These queries help us diagnose faster:
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Look for long-running queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;

-- Check for locks
SELECT pid, mode, granted, query
FROM pg_locks
JOIN pg_stat_activity USING (pid)
WHERE NOT granted;
```
Send me the results when you email. It speeds up diagnosis significantly.
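If you'd like to gather all three results in one shot, here's a minimal sketch; the connection string is a placeholder for your own:

```bash
# Run the diagnostic queries and save the output to a file you can attach.
psql "postgresql://your-username@your-host.costplusdb.com:5432/your-database?sslmode=require" \
  -c "SELECT count(*) FROM pg_stat_activity;" \
  -c "SELECT pid, now() - query_start AS duration, state, query
      FROM pg_stat_activity
      WHERE state = 'active' AND now() - query_start > interval '1 minute'
      ORDER BY duration DESC;" \
  -c "SELECT pid, mode, granted, query
      FROM pg_locks JOIN pg_stat_activity USING (pid)
      WHERE NOT granted;" \
  > diagnostics-$(date +%Y%m%d-%H%M).txt
```

Attach the resulting diagnostics file to your email.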
If you can't run them, don't worry - I'll check all of these when you contact me.
Email: jeremy@intentsolutions.io
Subject: "P1: Performance Issues - [your company]"
Include:
- When did it start?
- How much slower? (e.g., "queries taking 10x longer")
- Any recent changes? (new feature, traffic spike)
- Results of the diagnostic queries above (if you ran them)
Symptoms: Missing records. Incorrect values. Data doesn't match expectations.
- Do NOT run DELETE or UPDATE statements to "fix" it
- Do NOT attempt to restore backups yourself
- Do NOT restart the database
- Do NOT run VACUUM FULL

Why? You might overwrite the data we need to recover.
1. Document what you observe:
   - Which records are missing or wrong?
   - When did you notice it?
   - When was the data last correct? (approximate time)
   - What operations were running? (imports, updates, deletes)
2. If possible, export the current state:
   pg_dump -h host -U user -d db -F c -f emergency-$(date +%Y%m%d).dump
   This preserves what's there now, before we restore.
3. Email me IMMEDIATELY:
   Subject: "P0: DATA LOSS - [your company]"
   Mark as urgent.
Recovery process:
1. Assess the scope (10-20 minutes)
   - What data is affected?
   - When did the corruption occur?
   - What's the best recovery point?
2. Point-in-time recovery (30-90 minutes) - illustrated after this list
   - Restore from backup to a specific timestamp
   - Replay WAL logs to the exact recovery point
   - Verify data integrity
3. Validation (15-30 minutes)
   - Confirm data is restored correctly
   - Run sanity checks
   - Test application connections
4. Cutover (5-10 minutes)
   - Switch your application to the restored database
   - Verify everything works
   - Monitor for issues

Total time: typically 1-2.5 hours
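For illustration only, here's roughly what step 2 looks like on a PostgreSQL 12+ server. The WAL archive path, data directory, and timestamp are placeholders, and this happens on my side; you never need to run it:

```bash
# Sketch of a point-in-time restore (server-side; paths and timestamp are placeholders).
# 1. Restore the most recent base backup into the data directory (not shown here).
# 2. Point recovery at the WAL archive and target time in postgresql.conf:
#      restore_command = 'cp /var/backups/wal/%f %p'
#      recovery_target_time = '2025-01-15 14:30:00 UTC'
#      recovery_target_action = 'promote'
# 3. Ask PostgreSQL to enter targeted recovery, then start it:
touch /var/lib/postgresql/16/main/recovery.signal
systemctl start postgresql
```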
Symptoms: Unauthorized access attempts. Suspicious activity. Unusual queries.
1. Do NOT investigate further (preserve evidence, don't tip off the attacker)
2. Note the time and what you observed:
   - What suspicious activity did you see?
   - When did you notice it?
   - Were there any unusual access patterns?
3. Email me immediately:
   Subject: "P0: SECURITY INCIDENT - [your company]"
4. If you suspect credentials are compromised:
   - Don't change them yet (wait for my guidance)
   - I need to review logs first
Security incident protocol:
1. Immediate assessment (5-15 minutes)
   - Review PostgreSQL logs
   - Check authentication logs (/var/log/auth.log)
   - Analyze fail2ban blocks
   - Check firewall logs (UFW)
2. Containment (15-30 minutes) - example commands after this list
   - Block suspicious IPs
   - Rotate credentials if compromised
   - Update firewall rules
   - Enable additional logging
3. Investigation (30-60 minutes)
   - Determine the attack vector
   - Assess what data was accessed
   - Check for data exfiltration
   - Review all recent connections
4. Remediation (varies)
   - Patch vulnerabilities
   - Update security policies
   - Implement additional monitoring
   - Notify you of findings

I take security very seriously. You'll get a full incident report.
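The containment step relies on standard tooling. Here's a rough sketch of the kinds of commands involved (the IP address and log paths are placeholders, and this is server-side work I handle):

```bash
# Block a suspicious IP at the firewall (placeholder address)
sudo ufw deny from 203.0.113.45

# See which IPs fail2ban has already banned
sudo fail2ban-client status sshd

# Look for repeated failed logins in the auth log
sudo grep "Failed password" /var/log/auth.log | tail -n 50

# Check recent connection attempts in the PostgreSQL log (path varies by install)
sudo tail -n 100 /var/log/postgresql/postgresql-16-main.log
```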
Symptoms: Backup failed. Can't restore. Backup verification alert.
Important: Backup failures are P0 if they put data at risk. If your database is still running normally, it's P1 (I'll fix the backup system).
Email: jeremy@intentsolutions.io
Subject: "P0: Backup Issue - [your company]"
Tell me:
- Did you receive a backup failure alert?
- Do you need to restore something right now? (urgent)
- Is your database still running? (if yes, less urgent)
If you need an immediate backup while waiting for support:
```bash
# Compressed format (recommended)
pg_dump -h your-host.costplusdb.com \
  -U your-username \
  -d your-database \
  -F c \
  -f emergency-backup-$(date +%Y%m%d-%H%M%S).dump

# Plain SQL format (easier to inspect)
pg_dump -h your-host.costplusdb.com \
  -U your-username \
  -d your-database \
  -F p \
  -f emergency-backup-$(date +%Y%m%d-%H%M%S).sql
```
This creates a backup you can restore independently if needed.
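If you ever need to restore that emergency dump yourself, for example into a scratch database, here's a minimal sketch; the hostname, database, and file names are placeholders:

```bash
# Restore the compressed (-F c) dump into an existing database.
# Caution: --clean drops existing objects before recreating them.
pg_restore -h your-host.costplusdb.com \
  -U your-username \
  -d your-scratch-database \
  --clean --if-exists \
  emergency-backup-20250115-143000.dump

# Restore the plain SQL (-F p) version
psql -h your-host.costplusdb.com \
  -U your-username \
  -d your-scratch-database \
  -f emergency-backup-20250115-143000.sql
```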
Test if your database is reachable:
psql "postgresql://user@host.costplusdb.com:5432/dbname?sslmode=require" Success: You see the psql prompt → Database is up Failure: "Connection refused" → Database is down (P0) Timeout: Network/DNS issue → Check your firewall, verify hostname
```sql
-- Check database size
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Check replication lag (if applicable)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));

-- Check for bloat (largest tables)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
```
How to reach me if I'm not responding:
Step 1: Email jeremy@intentsolutions.io (I check email constantly)
        ↓ Wait 30 minutes for P0/P1, 2 hours for P2
Step 2: Slack channel (if enabled in your plan)
        Pro/Enterprise: included. Shared/Dedicated: +$29/month
        ↓ Wait 1 hour for P0
Step 3: Automated phone alert triggers (happens automatically for P0 database down)
        My phone rings until I acknowledge
        ↓ Wait 2 hours
Step 4: Emergency Backup Operator system activates
        (See: /emergency.html#emergency-backup-operator)
        Automated access to credentials if I'm unreachable
Reality check: I've never needed Steps 3-4. I respond quickly. But it's there if needed.
What happens after we resolve the issue:
You'll get a written summary:
- What happened (root cause)
- When it happened (timeline)
- What I did to fix it
- Why it happened
- How we'll prevent it next time
For significant incidents (P0/P1), I publish detailed post-mortems in incident history.
If the incident revealed systemic issues, I'll address the underlying cause, not just the immediate symptom.
I learn from every incident. Your database gets more reliable over time.
If you're not experiencing an issue right now, check out Reliability & Recovery to read about our reliability practices instead.