Health Check Monitoring - Decision Tree

Decision tree for diagnosing and resolving problems detected by the health check system.

Alert Decision Flow

┌─────────────────────────────────────┐
│  UptimeRobot alert received         │
└───────────────┬─────────────────────┘
                │
                ▼
┌─────────────────────────────────────┐
│  Open /health-status/               │
└───────────────┬─────────────────────┘
                │
                ▼
        ┌───────┴───────┐
        │  Status?      │
        └───┬───────┬───┘
            │       │
     CRITICAL    WARNING/OK
            │       │
            ▼       ▼
    ┌───────────┐  ┌──────────────┐
    │  URGENT   │  │  False       │
    │  Action   │  │  Positive?   │
    └───────────┘  └──────────────┘
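
The first branch can be sketched in Python. This is a minimal sketch, assuming the health data exposes a `status` field with the values shown in the diagram; the function name `triage` and the returned labels are hypothetical.

```python
def triage(status: str) -> str:
    """Map a health status to the next step in the alert decision flow."""
    normalized = status.strip().upper()
    if normalized == "CRITICAL":
        return "urgent_action"
    if normalized in ("WARNING", "OK"):
        # The alert fired but the dashboard is not critical:
        # check whether it is a false positive.
        return "check_false_positive"
    raise ValueError(f"unknown status: {status!r}")
```

Raising on an unknown status is a deliberate choice here, so that a changed payload format does not silently fall through to "no action".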

Critical Alert Flow

Status: CRITICAL
    ↓
┌───────────────────────────────────────────────┐
│  Check alert messages                         │
└───────┬───────────────────────────────────────┘
        │
        ▼
   ┌────┴────┐
   │  Typ?   │
   └─┬─────┬─┘
     │     │
     ▼     ▼
  Language  Articles
   Alert    Alert
     │        │
     ▼        ▼
┌─────────────────────────┐  ┌──────────────────────────┐
│ "Only X% of articles    │  │ "Only X articles         │
│  in Czech"              │  │  in 24h"                 │
└────────┬────────────────┘  └────────┬─────────────────┘
         │                             │
         ▼                             ▼
┌─────────────────────────┐  ┌──────────────────────────┐
│ LLM Translation Failed  │  │ NewsAPI / Generation     │
│                         │  │ Failed                   │
│ Actions:                │  │                          │
│ 1. Check OPENROUTER_KEY │  │ Actions:                 │
│ 2. Check LLM API status │  │ 1. Check NEWS_API_KEY    │
│ 3. Review GitHub logs   │  │ 2. Check GitHub Actions  │
│ 4. Manual re-run if OK  │  │ 3. Manual trigger        │
└─────────────────────────┘  └──────────────────────────┘
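
The two branches above can be sketched as a small lookup. This is only an illustration: the exact alert wording is not specified here, so the sketch assumes that a language alert contains a percent sign and an article-count alert contains `24h`. The names `classify_alert` and `actions_for` are hypothetical.

```python
# Action lists copied from the two boxes in the diagram.
ACTIONS = {
    "language": [
        "Check OPENROUTER_KEY",
        "Check LLM API status",
        "Review GitHub logs",
        "Manual re-run if OK",
    ],
    "articles": [
        "Check NEWS_API_KEY",
        "Check GitHub Actions",
        "Manual trigger",
    ],
}


def classify_alert(message: str) -> str:
    """Guess the alert type from its text (assumed markers: '%' and '24h')."""
    if "%" in message:
        return "language"
    if "24h" in message:
        return "articles"
    return "unknown"


def actions_for(message: str) -> list[str]:
    """Return the checklist for an alert, or an empty list if unrecognized."""
    return ACTIONS.get(classify_alert(message), [])
```

If the real alerts carry a structured type field, matching on that field would be more robust than matching on text.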

Warning Alert Flow

Status: WARNING
    ↓
┌──────────────────────────────────┐
│  Check severity                  │
└────────┬─────────────────────────┘
         │
    ┌────┴────┐
    │  Trend? │
    └─┬─────┬─┘
      │     │
  Worsening  Stable/Improving
      │     │
      ▼     ▼
┌──────────────┐  ┌─────────────────┐
│ Escalate     │  │ Monitor only    │
│ to CRITICAL  │  │ Review weekly   │
└──────────────┘  └─────────────────┘
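
How "trend" is measured is not defined above. One simple option, sketched here under that caveat, compares the newest value of a higher-is-better metric (for example Czech ratio) with the oldest value in a short history. The function names and the `tolerance` parameter are hypothetical.

```python
def trend(values: list[float], tolerance: float = 0.01) -> str:
    """Classify a higher-is-better metric history (oldest value first)."""
    if len(values) < 2:
        return "stable"  # not enough data to call a trend
    delta = values[-1] - values[0]
    if delta < -tolerance:
        return "worsening"
    if delta > tolerance:
        return "improving"
    return "stable"


def warning_action(values: list[float]) -> str:
    """Apply the WARNING branch: escalate only when the metric is worsening."""
    if trend(values) == "worsening":
        return "escalate_to_critical"
    return "monitor_weekly"
```

For a lower-is-better metric such as newest article age, negate the values or flip the comparison.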

Language Quality Decision Tree

Czech Ratio < 85%?
    │
    ▼
┌────────────────────────────────────────┐
│  Check sample of English articles      │
└───────┬────────────────────────────────┘
        │
   ┌────┴────┐
   │ Pattern?│
   └─┬─────┬─┘
     │     │
     ▼     ▼
  All New  Random Mix
  Articles
     │        │
     ▼        ▼
┌─────────────────────┐  ┌──────────────────────┐
│ Recent LLM Failure  │  │ Ongoing Partial      │
│                     │  │ Failure              │
│ Cause:              │  │                      │
│ - API outage        │  │ Causes:              │
│ - API key expired   │  │ - Rate limiting      │
│ - Quota exceeded    │  │ - Cost limits        │
│                     │  │ - Model degradation  │
│ Fix:                │  │                      │
│ 1. Check API status │  │ Fix:                 │
│ 2. Verify API key   │  │ 1. Check quotas      │
│ 3. Re-run workflow  │  │ 2. Check rate limits │
│                     │  │ 3. Review LLM config │
└─────────────────────┘  └──────────────────────┘
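
The "All New Articles" versus "Random Mix" distinction can be checked mechanically. The sketch below assumes a list of per-article flags ordered newest first (`True` means Czech); that input shape and the name `english_pattern` are assumptions made for illustration.

```python
def english_pattern(is_czech_newest_first: list[bool]) -> str:
    """Decide whether untranslated articles cluster at the newest end."""
    english_positions = [
        i for i, is_czech in enumerate(is_czech_newest_first) if not is_czech
    ]
    if not english_positions:
        return "no_english_articles"
    # English articles occupy exactly the newest slots: translation
    # probably started failing recently.
    if english_positions == list(range(len(english_positions))):
        return "recent_llm_failure"
    # Otherwise English articles are scattered through the history.
    return "ongoing_partial_failure"
```

A single old English article is enough to tip the result to `ongoing_partial_failure`, so a real check may want to allow a small number of outliers.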

Freshness Problem Decision Tree

Newest Article > 6h old?
    │
    ▼
┌──────────────────────────────────┐
│  Check GitHub Actions            │
└───────┬──────────────────────────┘
        │
   ┌────┴────┐
   │ Status? │
   └─┬─────┬─┘
     │     │
  Success  Failed
     │        │
     ▼        ▼
┌────────────────────┐  ┌──────────────────────┐
│ Schedule Problem   │  │ Workflow Failure     │
│                    │  │                      │
│ Possible causes:   │  │ Possible causes:     │
│ - Cron disabled    │  │ - API keys invalid   │
│ - Repo archived    │  │ - Dependencies error │
│ - Workflow paused  │  │ - Syntax error       │
│                    │  │ - Network timeout    │
│ Fix:               │  │                      │
│ 1. Check .yml      │  │ Fix:                 │
│ 2. Manual trigger  │  │ 1. Review logs       │
│ 3. Re-enable cron  │  │ 2. Fix error         │
│                    │  │ 3. Re-run            │
└────────────────────┘  └──────────────────────┘

Content Quality Decision Tree

Avg Content Length < 300?
    │
    ▼
┌──────────────────────────────────┐
│  Check sample articles           │
└───────┬──────────────────────────┘
        │
   ┌────┴────┐
   │ Pattern?│
   └─┬─────┬─┘
     │     │
     ▼     ▼
  All Short  Random
  Content
     │        │
     ▼        ▼
┌──────────────────────┐  ┌─────────────────────┐
│ LLM Prompt Issue     │  │ Source Quality      │
│                      │  │ Problem             │
│ Possible causes:     │  │                     │
│ - Prompt truncated   │  │ Possible causes:    │
│ - Max tokens low     │  │ - NewsAPI sources   │
│ - Model changed      │  │ - Snippet-only      │
│                      │  │ - Paywall content   │
│ Fix:                 │  │                     │
│ 1. Review prompt     │  │ Fix:                │
│ 2. Check token limit │  │ 1. Review sources   │
│ 3. Test LLM params   │  │ 2. Adjust filters   │
└──────────────────────┘  └─────────────────────┘
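
The same split can be sketched for content length. The unit of the 300 threshold is not stated above, so the sketch simply treats lengths as numbers in whatever unit the health check reports; the function name and return labels are hypothetical.

```python
from statistics import mean

MIN_AVG_LENGTH = 300  # threshold from the tree: "Avg Content Length < 300?"


def content_diagnosis(lengths: list[int]) -> str:
    """Classify a sample of article lengths against the tree above."""
    if not lengths:
        return "no_data"
    if mean(lengths) >= MIN_AVG_LENGTH:
        return "ok"
    # Every article is short: the generation step is the likely cause.
    if all(length < MIN_AVG_LENGTH for length in lengths):
        return "llm_prompt_issue"
    # Only some articles are short: suspect the upstream sources.
    return "source_quality_problem"
```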

Response Time Matrix

Alert Level	Response Time	Escalate After	Notes
CRITICAL	Immediate	1 hour	Requires manual intervention
WARNING - Trend down	4 hours	24 hours	Monitor closely
WARNING - Stable	24 hours	1 week	Review at weekly check
INFO	Next weekly review	N/A	Informational only

Diagnostic Command Cheat Sheet

# Quick Status Check
curl -s https://marigold.cz/health-check/ | jq '.status, .summary'

# Full Health Data
curl -s https://marigold.cz/health-check/ | jq '.'

# Check Specific Metric
curl -s https://marigold.cz/health-check/ | jq '.metrics.czech_ratio'

# List All Alerts
curl -s https://marigold.cz/health-check/ | jq '.alerts[]'

# GitHub Actions Status
gh run list --workflow=tech-news.yml --limit 5

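# View Latest Run (gh run view takes a run ID, not --workflow)
RUN_ID=$(gh run list --workflow=tech-news.yml --limit 1 --json databaseId --jq '.[0].databaseId')
gh run view "$RUN_ID"

# View Logs of Latest Run (uses RUN_ID from above)
gh run view "$RUN_ID" --log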

# Manual Trigger Workflow
gh workflow run tech-news.yml

# Local Health Check
python3 scripts/tech_news_health_check.py

# Test Specific Article Language
python3 << 'EOF'
import sys
sys.path.insert(0, 'scripts')
from tech_news_health_check import TechNewsHealthCheck
checker = TechNewsHealthCheck()
with open('_tech_news/2025-11-14-example.md', encoding='utf-8') as f:
    text = f.read()
score = checker._detect_language(text)
print(f"Language score: {score} ({'Czech' if score > 0.5 else 'English'})")
EOF
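
For scripted checks, the same fields the `jq` commands read can be pulled out in Python. This is a sketch under the assumption that the payload has top-level `status` and `alerts` keys and a `metrics.czech_ratio` value, as the commands above imply; the full schema is not documented here and `summarize` is a hypothetical helper.

```python
import json


def summarize(payload_text: str) -> dict:
    """Extract the fields used in this cheat sheet from a health JSON payload."""
    data = json.loads(payload_text)
    return {
        "status": data.get("status"),
        "czech_ratio": data.get("metrics", {}).get("czech_ratio"),
        "alert_count": len(data.get("alerts", [])),
    }
```

Feed it the body returned by the health check URL, for example the output of the `curl -s` command saved to a file.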

False Positive Handling

False Positive Alert
    ↓
┌──────────────────────────────────┐
│  Verify Against /health-status/  │
└───────┬──────────────────────────┘
        │
   ┌────┴────┐
   │ Real    │
   │ Problem?│
   └─┬─────┬─┘
     │     │
    YES   NO
     │     │
     ▼     ▼
┌───────────────┐  ┌──────────────────────┐
│ Not False     │  │ True False Positive  │
│ Positive      │  │                      │
│               │  │ Possible causes:     │
│ Follow        │  │ - UptimeRobot cache  │
│ normal        │  │ - Stale data         │
│ decision      │  │ - Threshold too      │
│ tree          │  │   strict             │
│               │  │                      │
│               │  │ Actions:             │
│               │  │ 1. Note in log       │
│               │  │ 2. Consider tuning   │
│               │  │    threshold         │
│               │  │ 3. Wait for next     │
│               │  │    check (5 min)     │
└───────────────┘  └──────────────────────┘

Escalation Path

┌─────────────────────┐
│  CRITICAL Alert     │
│  (Immediate)        │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Initial Response   │
│  (< 15 min)         │
│  - Acknowledge      │
│  - Quick diagnosis  │
└──────────┬──────────┘
           │
           ▼
   ┌───────┴────────┐
   │  Can Fix       │
   │  Quickly?      │
   └───┬────────┬───┘
       │        │
      YES      NO
       │        │
       ▼        ▼
┌──────────┐  ┌─────────────────┐
│ Fix &    │  │ Escalate to     │
│ Verify   │  │ Team Lead       │
│          │  │                 │
│ < 1 hour │  │ If unresolved   │
│          │  │ after 2 hours → │
│          │  │ Emergency       │
│          │  │ manual run      │
└──────────┘  └─────────────────┘
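
The timing rules in the escalation path can be written down as one function. The sketch assumes all times are measured in minutes since the alert fired, which the diagram does not state explicitly; the function name and return labels are hypothetical.

```python
def escalation_step(
    minutes_since_alert: float,
    can_fix_quickly: bool,
    resolved: bool = False,
) -> str:
    """Return the current step on the CRITICAL escalation path."""
    if resolved:
        return "done"
    if minutes_since_alert < 15:
        # Initial response window: acknowledge and run a quick diagnosis.
        return "acknowledge_and_diagnose"
    if can_fix_quickly:
        return "fix_and_verify"  # target: done within 1 hour
    if minutes_since_alert >= 120:
        # Still unresolved after 2 hours on the escalated branch.
        return "emergency_manual_run"
    return "escalate_to_team_lead"
```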

Priority Matrix

Metric	Observed Value	Threshold	Severity	Priority
Czech Ratio	< 50%	85%	CRITICAL	P0 (Immediate)
Czech Ratio	50-70%	85%	CRITICAL	P1 (< 4h)
Czech Ratio	70-85%	85%	WARNING	P2 (< 24h)
Articles 24h	0	10	CRITICAL	P0 (Immediate)
Articles 24h	1-5	10	WARNING	P1 (< 4h)
Articles 24h	5-10	10	WARNING	P2 (< 24h)
Newest Age	> 24h	6h	CRITICAL	P1 (< 4h)
Newest Age	12-24h	6h	WARNING	P2 (< 24h)
Newest Age	6-12h	6h	WARNING	P3 (Monitor)

Preventive Actions

Daily (Automated)

✅ UptimeRobot monitoring running
✅ GitHub Actions scheduled runs
✅ Auto-generation of health data

Weekly (15 min manual)

Review health trends in /health-status/
Check false positive/negative rate
Verify alert email delivery
Quick test of manual workflow trigger

Monthly (1 hour manual)

Deep analysis of all metrics
Review and adjust thresholds
Update documentation
Run a post-mortem for any incidents
Test disaster recovery procedure

Quarterly (2 hours manual)

Full system audit
Test all failure scenarios
Review and update alerting strategy
Team training on procedures
Backup and disaster recovery drill

Tips:

🔍 Always verify against /health-status/ before acting
📊 Check trends, not just single data points
⏱️ Consider timing: weekends and holidays have lower traffic
📝 Document all incidents for future threshold tuning
🧪 Test fixes in local environment before deploying

Quick Access:

📊 Dashboard: https://marigold.cz/health-status/
🔧 UptimeRobot: https://uptimerobot.com/dashboard
🤖 GitHub Actions: https://github.com/username/marigold-page/actions