Getting Started with ServiceMon: Installation, Dashboards, and Best Practices

1. Quick overview

ServiceMon is a server and service monitoring tool that collects metrics, sends alerts, and offers dashboards for observability. The rest of this guide assumes a default single-server setup and an on-prem or cloud VM.

2. Installation (single-node Linux)

  1. Prerequisites

    • Linux (Ubuntu 20.04+ or CentOS 7+)
    • Docker 20.10+ and Docker Compose 1.29+ (recommended)
    • 2 GB RAM, 2 CPU, 10 GB disk free
  2. Install via Docker Compose

    • Create a directory and a docker-compose.yml with required services: service-mon, database (Postgres), and optionally alerting (Prometheus Alertmanager) and reverse proxy (nginx).
    • Example compose snippet:

      Code

      version: '3.7'
      services:
        servicemon:
          image: servicemon:latest
          ports:
            - "8080:8080"
          environment:
            - DB_HOST=db
            - DB_USER=sm
            - DB_PASS=changeme
          depends_on:
            - db
        db:
          image: postgres:13
          environment:
            - POSTGRES_USER=sm
            - POSTGRES_PASSWORD=changeme
            - POSTGRES_DB=servicemon

    • Start: docker compose up -d
  3. Alternative: Package manager

    • Use the provided .deb/.rpm if preferred: sudo dpkg -i servicemon_X.Y.Z.deb or sudo rpm -i servicemon-X.Y.Z.rpm.
  4. Initial setup

    • Visit http://<your-server>:8080 and complete the web installer: create the admin user, configure the DB (if not using the default), set the base URL, and enable SMTP for alerts.
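After starting the stack, it is worth confirming that the containers came up and the web UI is reachable before running the installer. A minimal check, assuming the default port mapping from the compose snippet above:

```shell
# List the compose services and their status; both servicemon and db
# should show as "running" (or "Up")
docker compose ps

# Confirm the web UI answers on the mapped port (adjust host if you
# are checking from another machine)
curl -fsSI http://localhost:8080/
```

If `curl` fails, `docker compose logs servicemon` is the usual next step for diagnosing startup or database-connection errors.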

3. Dashboards

  • Default dashboards
    • Overview: system health, active incidents, alert status.
    • Host detail: CPU, memory, disk, network, and process metrics.
    • Service dependency: shows upstream/downstream relationships and impact.
  • Custom dashboards
    • Use the built-in query editor to add panels for specific metrics (e.g., request latency p95, error rate).
    • Recommended panels: CPU load (1m/5m), memory used (%), disk I/O, network throughput, request latency histograms, and error rate trend.
  • Best practices for dashboards
    • Keep overview dashboards high-level (no more than 8–10 panels).
    • Use one dashboard per team or service domain.
    • Use color thresholds for immediate signal (green/yellow/red).
    • Annotate incidents and deploys to correlate spikes with events.
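As a concrete example of a custom panel, a p95 request-latency query in a Prometheus-style query language could look like the sketch below. The metric name `http_request_duration_seconds_bucket` is an assumption about how your services are instrumented, not a ServiceMon built-in:

```
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```

Grouping by `le` is required for `histogram_quantile` to work; adding `service` yields one p95 series per service, which fits the one-dashboard-per-service-domain practice above.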

4. Alerting & Notification

  • Alert rules
    • Define rules for key signals: high CPU, memory pressure, disk near full, service unresponsive, error rate spike, latency SLO breaches.
    • Use short evaluation windows for fast failures (e.g., 1–5m) and longer windows for noisy metrics (e.g., 15m).
  • Notification channels
    • Configure email, Slack, PagerDuty, and webhook integrations.
    • Use escalation policies and grouped alerts to avoid paging for transient issues.
  • Reduce noise
    • Use alert grouping, mute windows (maintenance), and deduplication.
    • Add suppressions for known deploy-related alerts.
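If you run the optional Prometheus Alertmanager integration mentioned in the installation section, a rule for sustained high CPU might look like the following Prometheus-style sketch. The threshold, window, and metric names are illustrative assumptions, not ServiceMon defaults:

```yaml
groups:
  - name: host-alerts
    rules:
      - alert: HighCPU
        # Compare 5-minute load against the host's CPU count; adjust the
        # expression to whatever CPU metric your agents actually expose
        expr: node_load5 > 0.9 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
        for: 5m          # longer window to avoid paging on brief spikes
        labels:
          severity: warning
        annotations:
          summary: "Sustained high CPU on {{ $labels.instance }}"
```

The `for: 5m` clause implements the advice above: fast-failure signals get short windows, while noisy signals like CPU load get a sustained-duration requirement before paging.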

5. Best Practices

  • Instrument services properly
    • Expose metrics in Prometheus-compatible format or use the provided agent.
    • Tag metrics with service, environment, and instance identifiers.
  • SLO-driven alerting
    • Define SLOs and create alerts tied to error budget burn rather than raw thresholds.
  • Secure the deployment
    • Enable HTTPS (e.g., Let's Encrypt certificates behind nginx), use strong admin passwords, rotate API keys, and restrict access via VPN or IP allowlist.
    • Encrypt database credentials and enable DB backups.
  • Scalability
    • For production, separate components: dedicated DB, horizontal collectors, and replicated frontends behind a load balancer.
    • Use sharding/partitioning for long-term metrics retention and configure retention policies.
  • Operational hygiene
    • Regularly test alerting channels and maintain runbooks for common incidents.
    • Keep the system and agents up to date.
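The metric-tagging advice above (service, environment, and instance identifiers) can be illustrated with a Prometheus-exposition-format sketch; the metric name and label values are placeholders:

```
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{service="checkout",environment="prod",instance="web-01",status="200"} 10423
http_requests_total{service="checkout",environment="prod",instance="web-01",status="500"} 17
```

Consistent labels like these are what make the SLO-driven alerts and per-service dashboards described earlier possible, since queries can slice and aggregate by `service` and `environment`.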
