Health Check Monitoring - Decision Tree
Decision tree for diagnosing and resolving problems detected by the health check system.
Alert Decision Flow
┌─────────────────────────────────────┐
│ UptimeRobot Alert Received │
└───────────────┬─────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Open /health-status/ │
└───────────────┬─────────────────────┘
│
▼
┌───────┴───────┐
│ Status? │
└───┬───────┬───┘
│ │
CRITICAL WARNING/OK
│ │
▼ ▼
┌───────────┐ ┌──────────────┐
│ URGENT │ │ False │
│ Action │ │ Positive? │
└───────────┘ └──────────────┘
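The first branch of the flow above can be sketched as a small routing function. This is only an illustration: the return labels `urgent_action` and `check_false_positive` are hypothetical names for the two branches, and the status strings are assumed to match those shown in the diagram.

```python
def next_step(status: str) -> str:
    """Map the overall health status to the first branch of the alert flow."""
    normalized = status.strip().upper()
    if normalized == "CRITICAL":
        return "urgent_action"          # left branch: act immediately
    if normalized in ("WARNING", "OK"):
        return "check_false_positive"   # right branch: verify the alert first
    raise ValueError(f"Unknown status: {status!r}")
```

Raising on an unknown status is a deliberate choice here, so that a new or misspelled status is noticed rather than silently treated as healthy.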
Critical Alert Flow
Status: CRITICAL
↓
┌───────────────────────────────────────────────┐
│ Check Alert Messages │
└───────┬───────────────────────────────────────┘
│
▼
┌────┴────┐
│ Typ? │
└─┬─────┬─┘
│ │
▼ ▼
Language Articles
Alert Alert
│ │
▼ ▼
┌─────────────────────────┐ ┌──────────────────────────┐
│ "Only X% of articles │ │ "Only X articles │
│ in Czech" │ │ in 24h" │
└────────┬────────────────┘ └────────┬─────────────────┘
│ │
▼ ▼
┌─────────────────────────┐ ┌──────────────────────────┐
│ LLM Translation Failed │ │ NewsAPI / Generation │
│ │ │ Failed │
│ Actions: │ │ │
│ 1. Check OPENROUTER_KEY │ │ Actions: │
│ 2. Check LLM API status │ │ 1. Check NEWS_API_KEY │
│ 3. Review GitHub logs │ │ 2. Check GitHub Actions │
│ 4. Manual re-run if OK │ │ 3. Manual trigger │
└─────────────────────────┘ └──────────────────────────┘
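Telling the two alert types apart can be automated with a simple text match. The sketch below assumes that a language alert always contains a percentage and that an article-count alert always mentions the 24h window, as in the messages quoted in the boxes above. The function name and return labels are hypothetical.

```python
import re

# A percentage such as "40%" or "40,5 %" marks a language-ratio alert.
PERCENT_RE = re.compile(r"\d+(?:[.,]\d+)?\s*%")
# A "24h" window marks an article-count alert.
WINDOW_RE = re.compile(r"\b24\s*h", re.IGNORECASE)


def classify_alert(message: str) -> str:
    """Return 'language', 'articles', or 'unknown' for one alert message."""
    if PERCENT_RE.search(message):
        return "language"
    if WINDOW_RE.search(message):
        return "articles"
    return "unknown"
```

Matching on `%` and `24h` rather than on full sentences keeps the check independent of the exact wording of the message.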
Warning Alert Flow
Status: WARNING
↓
┌──────────────────────────────────┐
│ Check Severity │
└────────┬─────────────────────────┘
│
┌────┴────┐
│ Trend? │
└─┬─────┬─┘
│ │
Worsening Stable/Improving
│ │
▼ ▼
┌──────────────┐ ┌─────────────────┐
│ Escalate │ │ Monitor only │
│ to CRITICAL │ │ Review weekly │
└──────────────┘ └─────────────────┘
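The trend check can be approximated by comparing the newest reading of a metric with the oldest one in a recent window. This is a minimal sketch under stated assumptions: the `tolerance` value, the function names, and the idea of keeping a list of recent readings are illustrative, not part of the health check itself.

```python
def classify_trend(values, tolerance=0.02, higher_is_better=True):
    """Classify a series of readings (oldest first) as worsening, stable, or improving."""
    if len(values) < 2:
        return "stable"  # not enough data to call a trend
    delta = values[-1] - values[0]
    if not higher_is_better:
        delta = -delta   # e.g. article age: a rising value is bad
    if delta < -tolerance:
        return "worsening"
    if delta > tolerance:
        return "improving"
    return "stable"


def warning_action(values, **kwargs):
    """Apply the WARNING branch: escalate only when the trend is worsening."""
    if classify_trend(values, **kwargs) == "worsening":
        return "escalate_to_critical"
    return "monitor_weekly"
```

For example, `warning_action([0.84, 0.80, 0.75])` escalates, while a ratio hovering around the same value stays in the weekly review.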
Language Quality Decision Tree
Czech Ratio < 85%?
│
▼
┌────────────────────────────────────────┐
│ Check English Articles Sample │
└───────┬────────────────────────────────┘
│
┌────┴────┐
│ Pattern?│
└─┬─────┬─┘
│ │
▼ ▼
All New Random Mix
Articles
│ │
▼ ▼
┌─────────────────────┐ ┌──────────────────────┐
│ Recent LLM Failure │ │ Ongoing Partial │
│ │ │ Failure │
│ Cause: │ │ │
│ - API outage │ │ Causes: │
│ - API key expired │ │ - Rate limiting │
│ - Quota exceeded │ │ - Cost limits │
│ │ │ - Model degradation │
│ Fix: │ │ │
│ 1. Check API status │ │ Fix: │
│ 2. Verify API key │ │ 1. Check quotas │
│ 3. Re-run workflow │ │ 2. Check rate limits │
│ │ │ 3. Review LLM config │
└─────────────────────┘ └──────────────────────┘
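The "All New" versus "Random Mix" distinction can be checked mechanically once each article has been labeled as English or not. The sketch below assumes a list of booleans ordered newest first; how those labels are produced, and the function name, are assumptions.

```python
def english_pattern(is_english_newest_first):
    """Return 'none', 'all_new', or 'random_mix' for a newest-first list of flags."""
    flags = list(is_english_newest_first)
    english_count = sum(flags)
    if english_count == 0:
        return "none"
    # If the N English articles are exactly the N newest ones, translation
    # most likely started failing recently.
    if all(flags[:english_count]):
        return "all_new"
    # Otherwise English articles are scattered among translated ones.
    return "random_mix"
```

An `all_new` result points to the "Recent LLM Failure" box, and `random_mix` to "Ongoing Partial Failure".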
Freshness Problem Decision Tree
Newest Article > 6h old?
│
▼
┌──────────────────────────────────┐
│ Check GitHub Actions │
└───────┬──────────────────────────┘
│
┌────┴────┐
│ Status? │
└─┬─────┬─┘
│ │
Success Failed
│ │
▼ ▼
┌────────────────────┐ ┌──────────────────────┐
│ Schedule Problem │ │ Workflow Failure │
│ │ │ │
│ Possible causes: │ │ Possible causes: │
│ - Cron disabled │ │ - API keys invalid │
│ - Repo archived │ │ - Dependencies error │
│ - Workflow paused │ │ - Syntax error │
│ │ │ - Network timeout │
│ Fix: │ │ │
│ 1. Check .yml │ │ Fix: │
│ 2. Manual trigger │ │ 1. Review logs │
│ 3. Re-enable cron │ │ 2. Fix error │
│ │ │ 3. Re-run │
└────────────────────┘ └──────────────────────┘
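The freshness tree can be expressed as a short function. This sketch assumes timezone-aware timestamps and takes the result of the last workflow run as a plain boolean; the names and return labels are hypothetical, and the 6-hour limit comes from the tree above.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=6)  # threshold from the decision tree


def is_stale(newest_published, now=None, max_age=MAX_AGE):
    """True when the newest article is older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - newest_published > max_age


def freshness_branch(newest_published, last_run_succeeded, now=None):
    """Pick the branch of the freshness tree for the given state."""
    if not is_stale(newest_published, now):
        return "fresh"
    # Stale content after a successful run suggests the schedule is not firing.
    return "schedule_problem" if last_run_succeeded else "workflow_failure"
```

Passing `now` explicitly makes the check easy to test with a fixed clock.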
Content Quality Decision Tree
Avg Content Length < 300?
│
▼
┌──────────────────────────────────┐
│ Check Sample Articles │
└───────┬──────────────────────────┘
│
┌────┴────┐
│ Pattern?│
└─┬─────┬─┘
│ │
▼ ▼
All Short Random
Content
│ │
▼ ▼
┌──────────────────────┐ ┌─────────────────────┐
│ LLM Prompt Issue │ │ Source Quality │
│ │ │ Problem │
│ Possible causes: │ │ │
│ - Prompt truncated │ │ Possible causes: │
│ - Max tokens low │ │ - NewsAPI sources │
│ - Model changed │ │ - Snippet-only │
│ │ │ - Paywall content │
│ Fix: │ │ │
│ 1. Review prompt │ │ Fix: │
│ 2. Check token limit │ │ 1. Review sources │
│ 3. Test LLM params │ │ 2. Adjust filters │
└──────────────────────┘ └─────────────────────┘
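A rough way to separate the two content-quality branches is to look at the per-article lengths behind the average. The sketch below assumes the lengths use the same unit as the 300 threshold (the source does not say whether that is characters or words); the function name and labels are illustrative.

```python
from statistics import mean

MIN_AVG_LENGTH = 300  # threshold from the decision tree; unit not specified in the source


def content_branch(lengths, min_avg=MIN_AVG_LENGTH):
    """Return the branch of the content-quality tree for a sample of article lengths."""
    if not lengths:
        return "no_data"
    if mean(lengths) >= min_avg:
        return "ok"
    if all(n < min_avg for n in lengths):
        return "llm_prompt_issue"        # every article is short
    return "source_quality_problem"      # only some articles are short
```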
Response Time Matrix
| Alert Level | Response Time | Escalation | Notes |
|---|---|---|---|
| CRITICAL | Immediate | 1 hour | Requires manual intervention |
| WARNING - Trend down | 4 hours | 24 hours | Monitor closely |
| WARNING - Stable | 24 hours | 1 week | Review at weekly check |
| INFO | Next weekly review | N/A | Informational only |
Diagnostic Command Cheat Sheet
# Quick Status Check
curl -s https://marigold.cz/health-check/ | jq '.status, .summary'
# Full Health Data
curl -s https://marigold.cz/health-check/ | jq '.'
# Check Specific Metric
curl -s https://marigold.cz/health-check/ | jq '.metrics.czech_ratio'
# List All Alerts
curl -s https://marigold.cz/health-check/ | jq '.alerts[]'
# GitHub Actions Status
gh run list --workflow=tech-news.yml --limit 5
# View Latest Run
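# (gh run view takes a run ID, not --workflow, so look the ID up first)
gh run view "$(gh run list --workflow=tech-news.yml --limit 1 --json databaseId --jq '.[0].databaseId')"
# View Logs of Latest Run
gh run view --log "$(gh run list --workflow=tech-news.yml --limit 1 --json databaseId --jq '.[0].databaseId')"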
# Manual Trigger Workflow
gh workflow run tech-news.yml
# Local Health Check
python3 scripts/tech_news_health_check.py
# Test Specific Article Language
python3 << 'EOF'
import sys
sys.path.insert(0, 'scripts')
from tech_news_health_check import TechNewsHealthCheck
checker = TechNewsHealthCheck()
text = open('_tech_news/2025-11-14-example.md', encoding='utf-8').read()
score = checker._detect_language(text)
print(f"Language score: {score} ({'Czech' if score > 0.5 else 'English'})")
EOF
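The same quick status check can be done from Python when `jq` is not available. This is a sketch only: it assumes the endpoint returns JSON with the `status`, `metrics.czech_ratio`, and `alerts` fields used by the `jq` commands above, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "https://marigold.cz/health-check/"


def fetch_health(url=HEALTH_URL, timeout=10):
    """Download and parse the health-check JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(health):
    """Reduce the health payload to the fields needed for a first look."""
    return {
        "status": health.get("status"),
        "czech_ratio": health.get("metrics", {}).get("czech_ratio"),
        "alert_count": len(health.get("alerts", [])),
    }

# Usage: print(summarize(fetch_health()))
```

Missing fields come back as `None` or `0` instead of raising, so the summary is still usable when the payload is incomplete.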
False Positive Handling
False Positive Alert
↓
┌──────────────────────────────────┐
│ Verify Against /health-status/ │
└───────┬──────────────────────────┘
│
┌────┴────┐
│ Real │
│ Problem?│
└─┬─────┬─┘
│ │
YES NO
│ │
▼ ▼
┌───────────────┐ ┌──────────────────────┐
│ Not False │ │ True False Positive │
│ Positive │ │ │
│ │ │ Possible causes: │
│ Follow │ │ - UptimeRobot cache │
│ normal │ │ - Stale data │
│ decision │ │ - Threshold too │
│ tree │ │ strict │
│ │ │ │
│ │ │ Actions: │
│ │ │ 1. Note in log │
│ │ │ 2. Consider adjust │
│ │ │ threshold │
│ │ │ 3. Wait for next │
│ │ │ check (5 min) │
└───────────────┘ └──────────────────────┘
Escalation Path
┌─────────────────────┐
│ CRITICAL Alert │
│ (Immediate) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Initial Response │
│ (< 15 min) │
│ - Acknowledge │
│ - Quick diagnosis │
└──────────┬──────────┘
│
▼
┌───────┴────────┐
│ Can Fix │
│ Quickly? │
└───┬────────┬───┘
│ │
YES NO
│ │
▼ ▼
┌──────────┐ ┌─────────────────┐
│ Fix & │ │ Escalate to │
│ Verify │ │ Team Lead │
│ │ │ │
│ < 1 hour │ │ If unresolved │
│ │ │ after 2 hours → │
│ │ │ Emergency │
│ │ │ manual run │
└──────────┘ └─────────────────┘
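The escalation path can be read as a function of elapsed time and of whether a quick fix looks feasible. The sketch below is one interpretation: it assumes that a "quick" fix still unfinished after 1 hour is escalated like any other, and the function name and labels are hypothetical.

```python
def escalation_step(minutes_elapsed, can_fix_quickly):
    """Return the current step on the escalation path for a CRITICAL alert."""
    if minutes_elapsed < 15:
        return "acknowledge_and_diagnose"   # initial response window
    if can_fix_quickly and minutes_elapsed < 60:
        return "fix_and_verify"             # fix is expected within 1 hour
    if minutes_elapsed >= 120:
        return "emergency_manual_run"       # unresolved after 2 hours
    return "escalate_to_team_lead"
```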
Priority Matrix
| Metric | Current | Threshold | Severity | Priority |
|---|---|---|---|---|
| Czech Ratio | < 50% | 85% | CRITICAL | P0 (Immediate) |
| Czech Ratio | 50-70% | 85% | CRITICAL | P1 (< 4h) |
| Czech Ratio | 70-85% | 85% | WARNING | P2 (< 24h) |
| Articles 24h | 0 | 10 | CRITICAL | P0 (Immediate) |
| Articles 24h | 1-5 | 10 | WARNING | P1 (< 4h) |
| Articles 24h | 5-10 | 10 | WARNING | P2 (< 24h) |
| Newest Age | > 24h | 6h | CRITICAL | P1 (< 4h) |
| Newest Age | 12-24h | 6h | WARNING | P2 (< 24h) |
| Newest Age | 6-12h | 6h | WARNING | P3 (Monitor) |
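The matrix can be encoded as three small lookup functions. The table leaves some boundaries open (for example, 5 articles appears in two rows), so this sketch makes explicit assumptions: the lower bound of each Czech Ratio range is inclusive, 5 articles counts as P1, exactly 10 articles is healthy, and age ranges include their upper bound. The ratio is assumed to be given in percent. All names are illustrative.

```python
def czech_ratio_priority(ratio_percent):
    """Severity and priority for the share of Czech articles, in percent."""
    if ratio_percent < 50:
        return ("CRITICAL", "P0")
    if ratio_percent < 70:
        return ("CRITICAL", "P1")
    if ratio_percent < 85:
        return ("WARNING", "P2")
    return ("OK", None)


def articles_24h_priority(count):
    """Severity and priority for the number of articles in the last 24 hours."""
    if count == 0:
        return ("CRITICAL", "P0")
    if count <= 5:   # assumption: 5 belongs to the 1-5 row
        return ("WARNING", "P1")
    if count < 10:   # assumption: 10 meets the threshold
        return ("WARNING", "P2")
    return ("OK", None)


def newest_age_priority(age_hours):
    """Severity and priority for the age of the newest article, in hours."""
    if age_hours > 24:
        return ("CRITICAL", "P1")
    if age_hours > 12:
        return ("WARNING", "P2")
    if age_hours > 6:
        return ("WARNING", "P3")
    return ("OK", None)
```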
Preventive Actions
Daily (Automated)
- ✅ UptimeRobot monitoring running
- ✅ GitHub Actions scheduled runs
- ✅ Auto-generation of health data
Weekly (15 min manual)
- Review health trends in /health-status/
- Check false positive/negative rate
- Verify alert email delivery
- Quick test of manual workflow trigger
Monthly (1 hour manual)
- Deep analysis of all metrics
- Review and adjust thresholds
- Update documentation
- Post-mortem any incidents
- Test disaster recovery procedure
Quarterly (2 hours manual)
- Full system audit
- Test all failure scenarios
- Review and update alerting strategy
- Team training on procedures
- Backup and disaster recovery drill
Tips:
- 🔍 Always verify against /health-status/ before acting
- 📊 Check trends, not just single data points
- ⏱️ Consider the day and time (weekends and holidays have lower traffic)
- 📝 Document all incidents for future threshold tuning
- 🧪 Test fixes in local environment before deploying
Quick Access:
- 📊 Dashboard: https://marigold.cz/health-status/
- 🔧 UptimeRobot: https://uptimerobot.com/dashboard
- 🤖 GitHub Actions: https://github.com/username/marigold-page/actions