Configure Monitoring and Logging
As a cloud operations engineer, you can use Qrvey's logging system to enable monitoring, configure notifications, and set up automatic log cleanup.
Overview
Qrvey v9.3 introduces a centralized logging system to standardize error logging across all modules and projects. The system includes the following features:
- Centralized error logging with standardized log formats.
- Dashboard monitoring using Prometheus and Grafana.
- Configurable log levels (Fatal, Error, Warning, Info, Debug, Trace, Off).
- Automatic log cleanup to manage storage costs.
The implementation deploys a monitoring stack into your Elastic Kubernetes Service (EKS) cluster:
- Grafana (dashboards)
- Prometheus (metrics)
- Loki (logs)
- Promtail (optional automatic log scraping)
Loki runs continuously, even when monitoring is disabled. Qrvey services push structured logs to Loki using the OpenTelemetry Protocol (OTLP). When you enable monitoring, you can view the logs and dashboards in Grafana.
Monitored Services
Qrvey monitors the following service categories:
Qrvey Application Services
Application services include all deployments in the qrveyapps namespace:
- Dataset
- Widget
- Automation
- Reporting
- Worker services
Metrics include replica count, CPU/memory, restarts, and OOMKills (Out of Memory Killed status messages). Logs capture structured output from each service (using OTLP when enable_otlp_log = true).
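The restart and OOMKill figures come from kube-state-metrics, which is deployed when monitoring is enabled. As a rough illustration only (the bundled dashboards may use different expressions), PromQL queries along these lines surface restarts and OOMKills for the qrveyapps namespace:
# Container restarts per pod in the qrveyapps namespace
sum by (pod) (kube_pod_container_status_restarts_total{namespace="qrveyapps"})
# Containers whose last termination reason was OOMKilled
kube_pod_container_status_last_terminated_reason{namespace="qrveyapps", reason="OOMKilled"} == 1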
Infrastructure Services
The following table lists monitored infrastructure services.
| Service | Namespace | Monitors |
|---|---|---|
| RabbitMQ | rabbitmq | Pod health, queue depths, connections, message rates, restart events |
| Redis | redis | Pod health, memory usage, restart events |
| Elasticsearch | elastic-system | Cluster health, shard count, JVM memory, pod readiness |
| Kong API Gateway | kong / kong-private | HTTP request rates, 5xx error rates, P99 latency, bandwidth |
Cluster Infrastructure
The following table lists monitored cluster infrastructure components. Prometheus scrapes all of these targets continuously, and Qrvey services push structured logs to Loki using OTLP.
| Component | Monitors |
|---|---|
| Kubernetes nodes | CPU, memory, disk I/O, network I/O. |
| All pods (all namespaces) | Pending state, CrashLoopBackOff, OOMKills, replica mismatches. |
| Persistent volumes | Storage utilization (alert fires when PVC exceeds 80% full). |
| CoreDNS | DNS query rates, cache hit ratio, latency. |
| AWS VPC CNI | IP address allocation and ENI usage per node. |
For services that do not use OTLP, such as Kong, RabbitMQ, and Elasticsearch, you can enable Promtail to capture raw stdout/stderr from the namespaces listed in monitoring_log_namespaces.
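For example, a config.json fragment along these lines (the namespace list is illustrative) deploys Promtail and limits collection to the namespaces that do not ship logs over OTLP:
{
  "variables": {
    "enable_log_collection": true,
    "monitoring_log_namespaces": ["kong", "kong-private", "rabbitmq", "elastic-system"]
  }
}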
Before You Begin
Before deploying a monitoring configuration to the Qrvey platform, verify the following permissions and configurations.
- Access to the config.json file for your deployment.
- kubectl access to your EKS cluster for advanced configuration.
- AWS SNS configured for email notifications.
Setup
The ./qrvey alias used in this section executes the following Docker command:
docker run --platform=linux/amd64 -v $(pwd)/config.json:/app/qrvey/config.json -it qrvey.azurecr.io/qrvey-terraform:9.3
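If your deployment directory does not already contain the wrapper, a minimal sketch of a qrvey wrapper script that runs this command (assuming the image tag shown above) looks like the following; save it as ./qrvey and make it executable:
#!/usr/bin/env bash
# Thin wrapper around the Qrvey Terraform image; expects config.json in the current directory
exec docker run --platform=linux/amd64 \
  -v "$(pwd)/config.json:/app/qrvey/config.json" \
  -it qrvey.azurecr.io/qrvey-terraform:9.3 "$@"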
1. Add the following variables to your config.json file:
{
  "variables": {
    "enable_monitoring": true,
    "enable_otlp_log": true
  }
}
2. From the command console (terminal), run the ./qrvey apply script:
./qrvey apply
After the script completes, the output prints the following:
Monitoring: Enabled
Grafana URL: http://<load-balancer-hostname>:80
Grafana User: admin
Grafana Password: <auto-generated>
To review the output, run the following command:
./qrvey output
JSON Configuration Options
The following table describes options you can use in the config.json file.
| Variable | Type | Default | Description |
|---|---|---|---|
| enable_monitoring | boolean | false | Deploys Grafana, dashboards, alert rules, and exporters. |
| enable_otlp_log | boolean | true | Directs services to push logs to Loki using OTLP. Sets ENABLE_OTLP_LOG=true in the service environment. |
| log_level | string | "info" | Controls service log verbosity. Valid values: fatal, error, warn, info, debug, trace. |
| enable_log_collection | boolean | false | Deploys Promtail to scrape container stdout/stderr (separate from OTLP). |
| monitoring_grafana_password | string | (auto) | Custom Grafana admin password. Omit to auto-generate a password. |
| monitoring_grafana_service_type | string | "LoadBalancer" | "LoadBalancer" for external access, "ClusterIP" for port-forward only. |
| monitoring_metrics_retention_days | number | 15 | Days to keep Prometheus metrics. |
| monitoring_logs_retention_days | number | 7 | Days to keep Loki logs. |
| monitoring_log_namespaces | list | ["qrveyapps", ...] | Namespaces Promtail collects from (only when enable_log_collection = true). |
| slack_webhook_url | string | "" | Slack Incoming Webhook URL. When set, all alerts are sent to Slack automatically. Leave empty to disable. |
| slack_alert_channel | string | "#alerts" | Slack channel to receive alert notifications (for example, "#ops-alerts"). |
Examples
The following code block enables all monitoring functions.
{
"variables": {
"enable_monitoring": true,
"enable_otlp_log": true,
"log_level": "debug",
"monitoring_metrics_retention_days": 30,
"monitoring_logs_retention_days": 14,
"slack_webhook_url": "<https://hooks.slack.com/services/T.../B.../xxx>",
"slack_alert_channel": "#ops-alerts"
}
}
The following code block enables logs only (no dashboards).
{
"variables": {
"enable_monitoring": false,
"enable_otlp_log": true,
"log_level": "info"
}
}
With this configuration, Loki still receives logs from the services; only the Grafana interface is not deployed.
Data Retention
Use the following configuration variables to define how long the logs and metrics are kept.
| Config variable | Data | Default |
|---|---|---|
| monitoring_logs_retention_days | Logs (Loki) | 7 days |
| monitoring_metrics_retention_days | Metrics (Prometheus) | 15 days |
When enable_monitoring = false, Loki runs in minimal mode, using a 3-day retention limit independent of these variables.
To change retention, update the config.json file and re-apply the configuration:
{
"variables": {
"monitoring_logs_retention_days": 30,
"monitoring_metrics_retention_days": 60
}
}
Loki and Prometheus are reconfigured in place, changing the interval for future automated purges. No data is lost.
Note: Higher log verbosity (log_level: debug) produces significantly more log volume per day. If you change the log_level, you should also change the monitoring_logs_retention_days to keep storage usage predictable.
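For example, a temporary troubleshooting configuration (the values are illustrative) might pair debug verbosity with a shorter log retention window:
{
  "variables": {
    "log_level": "debug",
    "monitoring_logs_retention_days": 3
  }
}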
Enable or Disable Logging After Deployment
You can change logging settings by updating the config.json file and re-applying. No cluster downtime is required.
You can prevent running jobs from being stopped by adding the following flags when running the ./qrvey apply script.
- --skip-workflows-hooks - Skips stoppage of workflows running during an upgrade.
- --skip-data-sync-jobs - Skips stoppage of data sync jobs running during an upgrade.
- --skip-cleanup-orphans - Skips stoppage of orphan cleanup jobs running during an upgrade.
Example:
./qrvey apply --skip-workflows-hooks --skip-data-sync-jobs --skip-cleanup-orphans
Changes can take up to an hour to apply. During this time, the environment might not be available. Plan to apply changes during off hours.
OTLP Log Shipping
The following configuration controls whether Qrvey services ship logs to Loki over OTLP. To apply changes, re-apply the configuration.
Enable
{ "variables": { "enable_otlp_log": true } }
During the next rolling restart, services pick up ENABLE_OTLP_LOG=true from the updated general-configmap and begin pushing logs to Loki.
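To confirm the flag has landed in the configmap, or to roll a service immediately rather than wait for the next natural restart, commands along these lines can help (the configmap and namespace names follow the conventions described in this guide; the deployment name is a placeholder):
# Confirm the flag is present in the updated configmap
kubectl -n qrveyapps get configmap general-configmap -o yaml | grep ENABLE_OTLP_LOG
# Optionally trigger a rolling restart so a service picks it up right away
kubectl -n qrveyapps rollout restart deployment <deployment-name>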
Disable
{ "variables": { "enable_otlp_log": false } }
Services stop pushing logs. Logs already stored in Loki remain until their retention period expires.
Deployed Components
The following components are deployed even when enable_monitoring = false.
| Component | Purpose |
|---|---|
| Prometheus | Metrics collection (required by Elasticsearch adapter and job-downscaler). |
| Loki | Log storage — receives OTLP logs from services. |
The following components are deployed when enable_log_collection = true.
| Component | Purpose |
|---|---|
| Promtail | Scrapes container stdout/stderr automatically. |
The following components are deployed when enable_monitoring = true.
| Component | Purpose |
|---|---|
| Grafana | Dashboards and log viewer UI. |
| Alertmanager | Routes alert notifications (Slack when slack_webhook_url is set). |
| PrometheusRule | Alert rules (disk, containers, Kong, Slack priority). |
| kube-state-metrics | Exposes Kubernetes object metrics. |
| node-exporter | Host-level CPU, memory, and disk metrics. |
| RabbitMQ Exporter | RabbitMQ queue and connection metrics. |
| Kong Prometheus Plugin | Kong HTTP request metrics. |
| Dedicated NodePool | Karpenter ARM64 nodes for monitoring workloads. |
Automatic Log Collection
The following configuration controls whether Promtail collects container logs and ships them to Loki.
Enable
{ "variables": { "enable_log_collection": true } }
When applied, the configuration script deploys Promtail as a DaemonSet and begins collecting container stdout/stderr immediately.
Disable
{ "variables": { "enable_log_collection": false } }
When applied, the configuration script removes Promtail. Logs already in Loki remain until retention expires.
Note: enable_monitoring (Grafana, dashboards, alerts) and enable_otlp_log (log shipping) are independent switches. You can have logs without the Grafana UI, or the Grafana UI without OTLP log shipping.
Change the Log Level
All Qrvey services read their log verbosity from the LOG_LEVEL environment variable, which is injected into each service through the general-configmap. The value comes directly from the log_level in the config.json file:
config.json > log_level > general-configmap > LOG_LEVEL env var > all Qrvey services
To change the log level, edit the config.json file and re-apply the configuration:
{ "variables": { "log_level": "debug" } }
The general-configmap is updated immediately. Each service pod picks up the new level on its next rolling restart without a service outage.
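To check which level a running pod is currently using, or to force the rollout rather than wait, something like the following can be used (the deployment name is a placeholder):
# Check the level a running pod currently has
kubectl -n qrveyapps exec deploy/<deployment-name> -- printenv LOG_LEVEL
# Force a rolling restart so the new level takes effect immediately
kubectl -n qrveyapps rollout restart deployment <deployment-name>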
Log Levels
Log level hierarchy: FATAL (highest) > ERROR > WARN > INFO > DEBUG > TRACE.
| Level | Logs | Description |
|---|---|---|
| info | logger.info, logger.warn, logger.error, logger.fatal | Default log level that includes INFO, WARN, ERROR, and FATAL messages. |
| fatal | logger.fatal | Only critical errors are logged. This generates the fewest logs. |
| error | logger.error, logger.fatal | Any errors in the application, including FATAL. |
| warn | logger.warn, logger.error, logger.fatal | Includes FATAL and ERROR in addition to any warnings. |
| debug | logger.debug, logger.info, logger.warn, logger.error, logger.fatal | Includes everything in INFO as well as debugging messages, producing high log volumes. Long-term usage quickly exhausts your log retention budget and increases storage costs. Switch back to your preferred log level after troubleshooting. |
| trace | logger.trace, logger.debug, logger.info, logger.warn, logger.error, logger.fatal | Includes everything in DEBUG as well as detailed TRACE information, producing high log volumes. Long-term usage quickly exhausts your log retention budget and increases storage costs. Switch back to your preferred log level after troubleshooting. |
OTLP Environment Variables
When enable_otlp_log is true, the following environment variables are injected into each Qrvey service using general-configmap:
| Variable | Value | Description |
|---|---|---|
| LOG_LEVEL | info (configurable) | Controls which log levels are emitted. |
| ENABLE_OTLP_LOG | true | Flag services check to enable OTLP export. |
| OTLP_LOG_URL | http://loki.monitoring.svc.cluster.local:3100/otlp/v1/logs | POST location for service logs. |
Logs are pushed in the standard OTLP format used by OpenTelemetry SDKs and collectors.
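As a quick connectivity check, you can post a minimal OTLP/HTTP JSON log record to OTLP_LOG_URL from a pod inside the cluster (the address is cluster-internal). The payload below is a hand-written illustration, not output from an actual Qrvey service, and the timestamp assumes GNU date:
# Run from inside the cluster (for example, from a debug pod)
curl -sS -X POST "http://loki.monitoring.svc.cluster.local:3100/otlp/v1/logs" \
  -H "Content-Type: application/json" \
  -d '{
    "resourceLogs": [{
      "resource": { "attributes": [
        { "key": "service.name", "value": { "stringValue": "otlp-smoke-test" } }
      ]},
      "scopeLogs": [{ "logRecords": [{
        "timeUnixNano": "'"$(date +%s%N)"'",
        "severityText": "INFO",
        "body": { "stringValue": "OTLP connectivity test" }
      }]}]
    }]
  }'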
To view debug-level logs from all services, set "log_level": "debug" in the config.json file.
Grafana Usage
After deployment, open the Grafana URL printed in the output. Log in with admin and the password shown.
View Logs
1. Select Explore (compass icon in the sidebar).
2. Select Loki from the data source dropdown.
3. Use the label browser to filter:
- service_name - the service that emitted the log.
- service_namespace - Kubernetes namespace.
- level or detected_level - log severity.
Or type a LogQL query directly:
{service_name="dp-dataset-service"} |= "error"
{service_namespace="qrveyapps", detected_level="error"}
You can also go to Dashboards > Logs > Logs Explorer to display a pre-built view with dropdowns.
View Dashboards
The Dashboards sidebar displays dashboards organized in folders:
| Folder | Dashboards | Display |
|---|---|---|
| Cluster | Cluster Overview, Node Exporter Full, K8s Views (Global, Namespaces, Pods) | CPU, memory, disk, network across the cluster |
| Qrvey Applications | Qrvey Applications Health | Replica counts, OOMKills, restarts, HPA status for Qrvey services |
| Dependencies | Dependencies Health | Redis, RabbitMQ, Kong, Elasticsearch pod status |
| Kong | Kong Official, Request Throttling & Kong Performance | HTTP request rates, latency, bandwidth, 429s |
| Kube System | CoreDNS | DNS queries, cache hits, latency |
| Exporters | Elasticsearch Overview, RabbitMQ Overview | ES cluster health, shards, queue depths |
| Networking | AWS VPC CNI | IP allocation, ENI usage |
| Alerts & Incidents | Container Downtime & Recovery, Disk Space Monitoring | Crash tracking, restart trends, disk usage gauges |
| Logs | Logs Explorer | Live log viewer with namespace, app, pod, and level filters |
View Alerts
Go to Alerting > Alert rules in the sidebar. The following pre-configured rules are organized in groups:
- Disk Space — DiskSpaceCritical (>80%), DiskSpaceWarning (>70%), PVCSpaceCritical (>80%)
- Container Downtime — CrashLoopBackOff, OOMKilled, PodNotReady, HighContainerRestarts, DeploymentReplicasMismatch
- Request Throttling — KongHighHTTP429Rate, CPUThrottlingHigh, KongHighLatencyP99, KongHighErrorRate
- Slack Priority — PodPendingTooLong, CriticalServicePodDown, CriticalServicePodRestarting, NodeMemoryHighUsage
Alerts show as Normal (green), Pending (yellow), or Firing (red).
Set Up Email Notifications
You can set up email notifications in Grafana.
Set Up Slack Notifications
Alerts are routed to Slack automatically through Alertmanager — no manual Grafana UI configuration needed.
1. Create a Slack incoming webhook.
a. Go to your Slack workspace > Apps and search for Incoming WebHooks.
b. Select Add to Slack > choose the channel that receives alerts > click Add Incoming WebHooks Integration.
c. Copy the Webhook URL (looks like https://hooks.slack.com/services/T.../B.../xxx).
2. Add the webhook to the config.json file.
{
  "variables": {
    "enable_monitoring": true,
    "slack_webhook_url": "https://hooks.slack.com/services/T.../B.../xxx",
    "slack_alert_channel": "#ops-alerts"
  }
}
3. Apply the configuration.
./qrvey apply
Alertmanager is reconfigured immediately without needing to restart Grafana.
Slack Notifications
When an alert fires, Alertmanager sends a Slack message like the following:
[FIRING:1] CriticalServicePodDown
• *Critical service pod rabbitmq/rabbitmq-0 is not ready*
Pod rabbitmq-0 in rabbitmq (RabbitMQ/Redis/Elasticsearch) has been not ready for 2+ minutes.
Severity: `critical` | Namespace: `rabbitmq`
When the condition clears, Slack generates a green RESOLVED message.
Test the Pipeline
Run the test script to inject synthetic alerts and verify they reach your Slack channel without touching production workloads:
./scripts/test-alerts.sh
Alerts Sent to Slack
All alert rules route through Alertmanager. The Slack Priority group is specifically tuned for operational urgency:
| Alert | Severity | Trigger |
|---|---|---|
| PodPendingTooLong | warning | Any pod stuck Pending for five or more minutes. |
| CriticalServicePodDown | critical | RabbitMQ, Redis, or Elasticsearch pod not ready for two or more minutes. |
| CriticalServicePodRestarting | critical | Any restart in those namespaces (immediate). |
| NodeMemoryHighUsage | critical | Node memory greater than 90% for five or more minutes. |
Disable Slack Notifications
Remove slack_webhook_url from the config.json file (or set it to ""), then re-apply the configuration. Alertmanager switches to the null receiver and discards notifications silently.
Disable Monitoring
Add the following to the config.json file, then re-apply the configuration.
{
"variables": {
"enable_monitoring": false
}
}
This removes Grafana, dashboards, exporters, and the monitoring node pool. Loki remains running in minimal mode (10Gi storage, 3-day retention) so services can continue pushing logs using OTLP.
Troubleshooting
Grafana URL Shows Pending Message
The load balancer takes about three minutes to provision. Wait, then run ./qrvey output again.
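To watch the load balancer come up directly, a command like the following (the monitoring namespace matches the one used elsewhere in this guide) shows when the external hostname is assigned:
# Watch for the EXTERNAL-IP / hostname to move from <pending> to a value
kubectl -n monitoring get svc | grep -i grafana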
Alerts Not Displayed in Grafana
Go to Alerting > Alert rules. If empty, the PrometheusRule might not have been created. Re-apply the configuration.
Slack Alerts Not Arriving
Run ./scripts/test-alerts.sh to send synthetic alerts directly to Alertmanager and confirm routing. Verify that slack_webhook_url in the config.json file is correct and ./qrvey apply was run after adding the URL. You can also verify through the Alertmanager UI:
- Connect using kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093.
- Open http://localhost:9093.
Slack Alerts Firing Too Often
The default repeat_interval is four hours, so the same alert notification is generated every four hours while firing.
Logs Not Appearing
Verify that enable_otlp_log is true in the config.json file and services have been restarted to pick up the new configuration map values.
Promtail Not Collecting
Verify that enable_log_collection is true in the config.json file. Run ./qrvey output to confirm.
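You can also check that the Promtail DaemonSet is running; the DaemonSet and namespace names below are assumptions based on common Promtail chart defaults, so adjust them to your deployment:
kubectl -n monitoring get daemonset | grep -i promtail
kubectl -n monitoring logs daemonset/promtail --tail=20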
Retrieve Grafana Password
Run ./qrvey output. The password is printed in the output.
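If you prefer to set a known password instead, add monitoring_grafana_password to the config.json file and re-apply the configuration:
{
  "variables": {
    "monitoring_grafana_password": "<your-new-password>"
  }
}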