Version: 9.3

Configure Monitoring and Logging

As a cloud operations engineer, you can use Qrvey's logging system to enable monitoring, configure notifications, and set up automatic log cleanup.

Overview

Qrvey v9.3 introduces a centralized logging system to standardize error logging across all modules and projects. The system includes the following features:

  • Centralized error logging with standardized log formats.
  • Dashboard monitoring using Prometheus and Grafana.
  • Configurable log levels (Fatal, Error, Warning, Info, Debug, Trace, Off).
  • Automatic log cleanup to manage storage costs.

The implementation deploys a monitoring stack into your Elastic Kubernetes Service (EKS) cluster.

Loki runs continuously, even when monitoring is disabled. Your services use the OpenTelemetry Logs Protocol (OTLP) to push structured logs to Loki. When you enable monitoring, you can also view logs and dashboards in Grafana.

Monitored Services

Qrvey monitors the following service categories:

Qrvey Application Services

Application services include all deployments in the qrveyapps namespace:

  • Dataset
  • Widget
  • Automation
  • Reporting
  • Worker services

Metrics include replica count, CPU/memory, restarts, and OOMKills (Out of Memory Killed status messages). Logs capture structured output from each service (using OTLP when enable_otlp_log = true).

Infrastructure Services

The following table lists monitored infrastructure services.

| Service | Namespace | Monitors |
| --- | --- | --- |
| RabbitMQ | rabbitmq | Pod health, queue depths, connections, message rates, restart events |
| Redis | redis | Pod health, memory usage, restart events |
| Elasticsearch | elastic-system | Cluster health, shard count, JVM memory, pod readiness |
| Kong API Gateway | kong / kong-private | HTTP request rates, 5xx error rates, P99 latency, bandwidth |

Cluster Infrastructure

The following table lists monitored cluster infrastructure services. Prometheus scrapes all of the above continuously, and Qrvey services push structured logs to Loki using OTLP.

| Component | Monitors |
| --- | --- |
| Kubernetes nodes | CPU, memory, disk I/O, network I/O |
| All pods (all namespaces) | Pending state, CrashLoopBackOff, OOMKills, replica mismatches |
| Persistent volumes | Storage utilization (alert fires when a PVC exceeds 80% full) |
| CoreDNS | DNS query rates, cache hit ratio, latency |
| AWS VPC CNI | IP address allocation and ENI usage per node |

For services that do not use OTLP, such as Kong, RabbitMQ, and Elasticsearch, you can enable Promtail to capture raw stdout/stderr from the namespaces listed in monitoring_log_namespaces.
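For example, a config.json fragment that turns on Promtail collection for those services might look like the following (the namespace list is illustrative; adjust it to your deployment):

```json
{
  "variables": {
    "enable_log_collection": true,
    "monitoring_log_namespaces": ["qrveyapps", "kong", "rabbitmq", "elastic-system"]
  }
}
```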

Before You Begin

Before deploying a monitoring configuration to the Qrvey platform, verify the following permissions and configurations.

  • Access to the config.json file for your deployment.
  • kubectl access to your EKS cluster for advanced configuration.
  • AWS SNS configured for email notifications.

Setup

The ./qrvey alias used in this section executes the following Docker command:

docker run --platform=linux/amd64 -v $(pwd)/config.json:/app/qrvey/config.json -it qrvey.azurecr.io/qrvey-terraform:9.3
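One way to create the alias is a small wrapper script in your working directory. This is a sketch: it assumes the container entrypoint accepts the subcommand (apply, output) as arguments, and the image tag should match your deployed version.

```shell
# Create a ./qrvey wrapper script that forwards its arguments
# (for example "apply" or "output") to the qrvey-terraform container.
cat > qrvey <<'EOF'
#!/bin/sh
exec docker run --platform=linux/amd64 \
  -v "$(pwd)/config.json:/app/qrvey/config.json" \
  -it qrvey.azurecr.io/qrvey-terraform:9.3 "$@"
EOF
chmod +x qrvey
```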

  1. Add the following variables to your config.json file:

    {
      "variables": {
        "enable_monitoring": true,
        "enable_otlp_log": true
      }
    }
  2. From the command console (terminal), run the ./qrvey apply script:

    ./qrvey apply

After the script completes, the output prints the following:

Monitoring:               Enabled
Grafana URL: http://<load-balancer-hostname>:80
Grafana User: admin
Grafana Password: <auto-generated>

To review the output, run the following command:

./qrvey output

JSON Configuration Options

The following table describes options you can use in the config.json file.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| enable_monitoring | boolean | false | Deploys Grafana, dashboards, alert rules, and exporters. |
| enable_otlp_log | boolean | true | Directs services to push logs to Loki using OTLP. Sets ENABLE_OTLP_LOG=true in the service environment. |
| log_level | string | "info" | Controls service log verbosity. Valid values: fatal, error, warn, info, debug, trace. |
| enable_log_collection | boolean | false | Deploys Promtail to scrape container stdout/stderr (separate from OTLP). |
| monitoring_grafana_password | string | (auto) | Custom Grafana admin password. Omit this variable to generate a password automatically. |
| monitoring_grafana_service_type | string | "LoadBalancer" | "LoadBalancer" for external access, "ClusterIP" for port-forward only. |
| monitoring_metrics_retention_days | number | 15 | Days to keep Prometheus metrics. |
| monitoring_logs_retention_days | number | 7 | Days to keep Loki logs. |
| monitoring_log_namespaces | list | ["qrveyapps", ...] | Namespaces Promtail collects from (only when enable_log_collection = true). |
| slack_webhook_url | string | "" | Slack Incoming Webhook URL. When set, all alerts are sent to Slack automatically. Leave empty to disable. |
| slack_alert_channel | string | "#alerts" | Slack channel that receives alert notifications (for example, "#ops-alerts"). |

Examples

The following code block enables all monitoring functions.

{
  "variables": {
    "enable_monitoring": true,
    "enable_otlp_log": true,
    "log_level": "debug",
    "monitoring_metrics_retention_days": 30,
    "monitoring_logs_retention_days": 14,
    "slack_webhook_url": "https://hooks.slack.com/services/T.../B.../xxx",
    "slack_alert_channel": "#ops-alerts"
  }
}

The following code block enables logs only (no dashboards).

{
  "variables": {
    "enable_monitoring": false,
    "enable_otlp_log": true,
    "log_level": "info"
  }
}

Loki still receives logs from the services; only the Grafana interface is not deployed.
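Without Grafana, you can still query Loki's HTTP API directly through a port-forward. This is a sketch: the monitoring namespace and the loki service name are assumptions based on the defaults in this guide.

```shell
# Forward the Loki port locally, then query recent logs with LogQL.
kubectl port-forward -n monitoring svc/loki 3100:3100 &
sleep 2
curl -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={service_namespace="qrveyapps"}' \
  --data-urlencode 'limit=20'
```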

Data Retention

Use the following configuration variables to define how long the logs and metrics are kept.

| Config variable | Data | Default |
| --- | --- | --- |
| monitoring_logs_retention_days | Logs (Loki) | 7 days |
| monitoring_metrics_retention_days | Metrics (Prometheus) | 15 days |

When enable_monitoring = false, Loki runs in minimal mode, using a 3-day retention limit independent of these variables.

To change retention, update the config.json file and re-apply the configuration:

{
  "variables": {
    "monitoring_logs_retention_days": 30,
    "monitoring_metrics_retention_days": 60
  }
}

Loki and Prometheus are reconfigured in place, and future automated purges use the new intervals. No existing data is lost.

Note: Higher log verbosity (log_level: debug) produces significantly more log volume per day. If you change the log_level, you should also change the monitoring_logs_retention_days to keep storage usage predictable.
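For example, while troubleshooting you might pair the higher verbosity with a shorter retention window (the values here are illustrative):

```json
{
  "variables": {
    "log_level": "debug",
    "monitoring_logs_retention_days": 3
  }
}
```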

Enable or Disable Logging After Deployment

You can change logging settings by updating the config.json file and re-applying. No cluster downtime is required.

You can prevent in-flight jobs from being stopped by adding the following flags when running the ./qrvey apply script.

  • --skip-workflows-hooks - Skips stopping workflows that are running during the upgrade.
  • --skip-data-sync-jobs - Skips stopping data sync jobs that are running during the upgrade.
  • --skip-cleanup-orphans - Skips stopping orphan cleanup jobs that are running during the upgrade.

Example:

./qrvey apply --skip-workflows-hooks --skip-data-sync-jobs --skip-cleanup-orphans

Changes can take up to an hour to apply. During this time, the environment might not be available. Plan to apply changes during off hours.

OTLP Log Shipping

The following JSON variable controls whether Qrvey services ship logs to Loki. To apply changes, re-apply the configuration.

Enable

{ "variables": { "enable_otlp_log": true } }

During the next rolling restart, services pick up ENABLE_OTLP_LOG=true from the updated general-configmap and begin pushing logs to Loki.

Disable

{ "variables": { "enable_otlp_log": false } }

Services stop pushing logs. Logs already stored in Loki remain until their retention period expires.

Deployed Components

The following components are deployed even when enable_monitoring = false.

| Component | Purpose |
| --- | --- |
| Prometheus | Metrics collection (required by the Elasticsearch adapter and job-downscaler). |
| Loki | Log storage; receives OTLP logs from services. |

The following components are deployed when enable_log_collection = true.

| Component | Purpose |
| --- | --- |
| Promtail | Scrapes container stdout/stderr automatically. |

The following components are deployed when enable_monitoring = true.

| Component | Purpose |
| --- | --- |
| Grafana | Dashboards and log viewer UI. |
| Alertmanager | Routes alert notifications (Slack when slack_webhook_url is set). |
| PrometheusRule | Alert rules (disk, containers, Kong, Slack priority). |
| kube-state-metrics | Exposes Kubernetes object metrics. |
| node-exporter | Host-level CPU, memory, and disk metrics. |
| RabbitMQ Exporter | RabbitMQ queue and connection metrics. |
| Kong Prometheus Plugin | Kong HTTP request metrics. |
| Dedicated NodePool | Karpenter ARM64 nodes for monitoring workloads. |

Automatic Log Collection

The following variable controls whether Promtail collects container logs and forwards them to Loki.

Enable

{ "variables": { "enable_log_collection": true } }

When applied, the configuration script deploys Promtail as a DaemonSet, which begins collecting container stdout/stderr immediately.

Disable

{ "variables": { "enable_log_collection": false } }

When applied, the configuration script removes Promtail. Logs already in Loki remain until retention expires.

Note: enable_monitoring (Grafana, dashboards, alerts) and enable_otlp_log (log shipping) are independent switches. You can have logs without the Grafana UI, or the Grafana UI without OTLP log shipping.

Change the Log Level

All Qrvey services read their log verbosity from the LOG_LEVEL environment variable, which is injected into each service through the general-configmap. The value comes directly from the log_level in the config.json file:

config.json  >  log_level  >  general-configmap  >  LOG_LEVEL env var  >  all Qrvey services

To change the log level, edit the config.json file and re-apply the configuration:

{ "variables": { "log_level": "debug" } }

The general-configmap is updated immediately. Each service pod picks up the new level on its next rolling restart without a service outage.
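If you do not want to wait for the next natural restart, you can trigger a rolling restart yourself. This is a sketch: it assumes kubectl access to the cluster and that your services run in the qrveyapps namespace described earlier.

```shell
# Restart all deployments in the qrveyapps namespace so their pods
# re-read LOG_LEVEL from the updated general-configmap.
kubectl rollout restart deployment -n qrveyapps
```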

Log Levels

Log level hierarchy: FATAL (highest) > ERROR > WARN > INFO > DEBUG > TRACE.

| Level | Logs | Description |
| --- | --- | --- |
| fatal | logger.fatal | Only critical errors are logged. This generates the fewest logs. |
| error | logger.error, logger.fatal | Any errors in the application, including FATAL. |
| warn | logger.warn, logger.error, logger.fatal | Includes FATAL and ERROR in addition to any warnings. |
| info | logger.info, logger.warn, logger.error, logger.fatal | Default log level. Includes INFO, WARN, ERROR, and FATAL messages. |
| debug | logger.debug, logger.info, logger.warn, logger.error, logger.fatal | Includes everything in INFO as well as debugging messages, producing high log volume. Long-term use quickly exhausts your log retention budget and increases storage costs. Switch back to your preferred log level after troubleshooting. |
| trace | logger.trace, logger.debug, logger.info, logger.warn, logger.error, logger.fatal | Includes everything in DEBUG as well as detailed TRACE information; the same volume and storage cautions as debug apply. |

OTLP Environment Variables

When enable_otlp_log is true, the following environment variables are injected into each Qrvey service using general-configmap:

| Variable | Value | Description |
| --- | --- | --- |
| LOG_LEVEL | info (configurable) | Controls which log levels are emitted. |
| ENABLE_OTLP_LOG | true | Flag services check to enable OTLP export. |
| OTLP_LOG_URL | http://loki.monitoring.svc.cluster.local:3100/otlp/v1/logs | POST location for service logs. |

Logs are pushed in the standard OTLP format used by OpenTelemetry SDKs and collectors.
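As an illustration, a single OTLP/HTTP JSON log record has the following shape. This is a minimal sketch: the service name and message body are placeholders, and the endpoint is the OTLP_LOG_URL value shown above, reachable only from inside the cluster.

```shell
# A minimal OTLP/HTTP JSON payload carrying one log record.
PAYLOAD='{"resourceLogs":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"otlp-smoke-test"}}]},"scopeLogs":[{"logRecords":[{"severityText":"INFO","body":{"stringValue":"hello from OTLP"}}]}]}]}'
echo "$PAYLOAD"

# To send it from a pod inside the cluster:
# curl -s -X POST "http://loki.monitoring.svc.cluster.local:3100/otlp/v1/logs" \
#      -H 'Content-Type: application/json' -d "$PAYLOAD"
```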

To view debug-level logs from all services, set "log_level": "debug" in the config.json file.

Grafana Usage

After deployment, open the Grafana URL printed in the output. Log in with admin and the password shown.

View Logs

  1. Select Explore (compass icon in the sidebar).

  2. Select Loki from the data source dropdown.

  3. Use the label browser to filter:

    • service_name — the service that emitted the log.
    • service_namespace — Kubernetes namespace.
    • level or detected_level — log severity.

    OR

    Type a LogQL query directly:

    {service_name="dp-dataset-service"} |= "error"
    {service_namespace="qrveyapps", detected_level="error"}

You can also go to Dashboards > Logs > Logs Explorer to display a pre-built view with dropdowns.

View Dashboards

The Dashboards sidebar displays dashboards organized in folders:

| Folder | Dashboards | Display |
| --- | --- | --- |
| Cluster | Cluster Overview, Node Exporter Full, K8s Views (Global, Namespaces, Pods) | CPU, memory, disk, network across the cluster |
| Qrvey Applications | Qrvey Applications Health | Replica counts, OOMKills, restarts, HPA status for Qrvey services |
| Dependencies | Dependencies Health | Redis, RabbitMQ, Kong, Elasticsearch pod status |
| Kong | Kong Official, Request Throttling & Kong Performance | HTTP request rates, latency, bandwidth, 429s |
| Kube System | CoreDNS | DNS queries, cache hits, latency |
| Exporters | Elasticsearch Overview, RabbitMQ Overview | ES cluster health, shards, queue depths |
| Networking | AWS VPC CNI | IP allocation, ENI usage |
| Alerts & Incidents | Container Downtime & Recovery, Disk Space Monitoring | Crash tracking, restart trends, disk usage gauges |
| Logs | Logs Explorer | Live log viewer with namespace, app, pod, and level filters |

View Alerts

Go to Alerting > Alert rules in the sidebar. The following pre-configured rules are organized in groups:

  • Disk Space — DiskSpaceCritical (>80%), DiskSpaceWarning (>70%), PVCSpaceCritical (>80%)
  • Container Downtime — CrashLoopBackOff, OOMKilled, PodNotReady, HighContainerRestarts, DeploymentReplicasMismatch
  • Request Throttling — KongHighHTTP429Rate, CPUThrottlingHigh, KongHighLatencyP99, KongHighErrorRate
  • Slack Priority — PodPendingTooLong, CriticalServicePodDown, CriticalServicePodRestarting, NodeMemoryHighUsage

Alerts show as Normal (green), Pending (yellow), or Firing (red).

Set Up Email Notifications

You can set up email notifications in Grafana. This requires AWS SNS to be configured for email notifications, as noted in Before You Begin.

Set Up Slack Notifications

Alerts are routed to Slack automatically through Alertmanager; no manual Grafana UI configuration is needed.

  1. Create a Slack incoming webhook.

    a. Go to your Slack workspace > Apps and search for Incoming WebHooks.

    b. Select Add to Slack, choose the channel that receives alerts, and then select Add Incoming WebHooks Integration.

    c. Copy the Webhook URL (looks like https://hooks.slack.com/services/T.../B.../xxx).

  2. Add the webhook to the config.json file.

    {
      "variables": {
        "enable_monitoring": true,
        "slack_webhook_url": "https://hooks.slack.com/services/T.../B.../xxx",
        "slack_alert_channel": "#ops-alerts"
      }
    }
  3. Apply the configuration.

    ./qrvey apply

Alertmanager is reconfigured immediately without needing to restart Grafana.

Slack Notifications

When an alert fires, it sends a message like the following:

[FIRING:1] CriticalServicePodDown
• *Critical service pod rabbitmq/rabbitmq-0 is not ready*
Pod rabbitmq-0 in rabbitmq (RabbitMQ/Redis/Elasticsearch) has been not ready for 2+ minutes.
Severity: `critical` | Namespace: `rabbitmq`

When the condition clears, Slack generates a green RESOLVED message.

Test the Pipeline

Run the test script to inject synthetic alerts and verify they reach your Slack channel without touching production workloads:

./scripts/test-alerts.sh

Alerts Sent to Slack

All alert rules route through Alertmanager. The Slack Priority group is specifically tuned for operational urgency:

| Alert | Severity | Trigger |
| --- | --- | --- |
| PodPendingTooLong | warning | Any pod stuck Pending for five or more minutes. |
| CriticalServicePodDown | critical | RabbitMQ, Redis, or Elasticsearch pod not ready for two or more minutes. |
| CriticalServicePodRestarting | critical | Any restart in those namespaces (immediate). |
| NodeMemoryHighUsage | critical | Node memory greater than 90% for five or more minutes. |

Disable Slack Notifications

Remove slack_webhook_url from the config.json file (or set it to ""), then re-apply the configuration. Alertmanager switches to the null receiver and silently discards notifications.

Disable Monitoring

Add the following to the config.json file, then re-apply the configuration.

{
  "variables": {
    "enable_monitoring": false
  }
}

This removes Grafana, dashboards, exporters, and the monitoring node pool. Loki remains running in minimal mode (10Gi storage, 3-day retention) so services can continue pushing logs using OTLP.

Troubleshooting

Grafana URL Shows Pending Message

The load balancer takes about three minutes to provision. Wait, then run ./qrvey output again.

Alerts Not Displayed in Grafana

Go to Alerting > Alert rules. If empty, the PrometheusRule might not have been created. Re-apply the configuration.

Slack Alerts Not Arriving

Run ./scripts/test-alerts.sh to send synthetic alerts directly to Alertmanager and confirm routing. Verify that slack_webhook_url in the config.json file is correct and ./qrvey apply was run after adding the URL. You can also verify through the Alertmanager UI:

  1. Connect using kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
  2. Open http://localhost:9093.
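With the port-forward active, you can also query the Alertmanager v2 API to confirm that alerts are reaching Alertmanager at all:

```shell
# List currently active alerts as JSON.
curl -s http://localhost:9093/api/v2/alerts
```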

Slack Alerts Firing Too Often

The default repeat_interval is four hours, so the same alert notification is generated every four hours while firing.

Logs Not Appearing

Verify that enable_otlp_log is true in the config.json file and services have been restarted to pick up the new configuration map values.

Promtail Not Collecting

Verify that enable_log_collection is true in the config.json file. Run ./qrvey output to confirm.

Retrieve Grafana Password

Run ./qrvey output. The password is printed in the output.