Configure Monitoring and Logging
As a cloud operations engineer, you can use Qrvey's logging system to enable monitoring, configure notifications, and set up automatic log cleanup.
Overview
Qrvey v9.3 introduces a centralized logging system to standardize error logging across all modules and projects. The system includes the following features:
- Centralized error logging with standardized log formats.
- Dashboard monitoring using Prometheus and Grafana.
- Configurable log levels (Fatal, Error, Warning, Info, Debug, Trace, Off).
- Automatic log cleanup to manage storage costs.
The implementation deploys a monitoring stack into your Elastic Kubernetes Service (EKS) cluster:
- Grafana (dashboards)
- Prometheus (metrics)
- Loki (logs)
- Promtail (optional automatic log scraping)
Loki runs continuously, even when monitoring is disabled. Qrvey services push structured logs to Loki using the OpenTelemetry Protocol (OTLP). When you enable monitoring, you can view the logs and dashboards in Grafana.
Monitored Services
Qrvey monitors the following service categories:
Qrvey Application Services
Application services include all deployments in the qrveyapps namespace:
- Dataset
- Widget
- Automation
- Reporting
- Worker services
Metrics include replica count, CPU/memory, restarts, and OOMKills (Out of Memory Killed status messages). Logs capture structured output from each service (using OTLP when enable_otlp_log = true).
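The restart and OOMKill figures come from kube-state-metrics, which is deployed when monitoring is enabled. As a rough illustration only (the bundled dashboards may use different expressions), PromQL queries along these lines surface restarts and OOMKills for the qrveyapps namespace:
# Container restarts per pod in the qrveyapps namespace
sum by (pod) (kube_pod_container_status_restarts_total{namespace="qrveyapps"})
# Containers whose last termination reason was OOMKilled
kube_pod_container_status_last_terminated_reason{namespace="qrveyapps", reason="OOMKilled"} == 1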
Infrastructure Services
The following table lists monitored infrastructure services.
| Service | Namespace | Monitors |
|---|---|---|
| RabbitMQ | rabbitmq | Pod health, queue depths, connections, message rates, restart events |
| Redis | redis | Pod health, memory usage, restart events |
| Elasticsearch | elastic-system | Cluster health, shard count, JVM memory, pod readiness |
| Kong API Gateway | kong / kong-private | HTTP request rates, 5xx error rates, P99 latency, bandwidth |
Cluster Infrastructure
The following table lists monitored cluster infrastructure components. Prometheus scrapes all of these targets continuously, and Qrvey services push structured logs to Loki using OTLP.
| Component | Monitors |
|---|---|
| Kubernetes nodes | CPU, memory, disk I/O, network I/O. |
| All pods (all namespaces) | Pending state, CrashLoopBackOff, OOMKills, replica mismatches. |
| Persistent volumes | Storage utilization (alert fires when PVC exceeds 80% full). |
| CoreDNS | DNS query rates, cache hit ratio, latency. |
| AWS VPC CNI | IP address allocation and ENI usage per node. |
For services that do not use OTLP, such as Kong, RabbitMQ, and Elasticsearch, you can enable Promtail to capture raw stdout/stderr from the namespaces listed in monitoring_log_namespaces.
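For example, a config.json fragment along these lines (the namespace list is illustrative) deploys Promtail and limits collection to the namespaces that do not ship logs over OTLP:
{
  "variables": {
    "enable_log_collection": true,
    "monitoring_log_namespaces": ["kong", "kong-private", "rabbitmq", "elastic-system"]
  }
}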
Before You Begin
Before deploying a monitoring configuration to the Qrvey platform, verify the following permissions and configurations.
- Access to the config.json file for your deployment.
- kubectl access to your EKS cluster for advanced configuration.
- AWS SNS configured for email notifications.
Setup
The ./qrvey alias used in this section executes the following Docker command:
docker run --platform=linux/amd64 -v $(pwd)/config.json:/app/qrvey/config.json -it qrvey.azurecr.io/qrvey-terraform:9.3
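If your deployment directory does not already contain the wrapper, a minimal sketch of a qrvey wrapper script that runs this command (assuming the image tag shown above) looks like the following; save it as ./qrvey and make it executable:
#!/usr/bin/env bash
# Thin wrapper around the Qrvey Terraform image; expects config.json in the current directory
exec docker run --platform=linux/amd64 \
  -v "$(pwd)/config.json:/app/qrvey/config.json" \
  -it qrvey.azurecr.io/qrvey-terraform:9.3 "$@"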
1. Add the following variables to your config.json file:
{
  "variables": {
    "enable_monitoring": true,
    "enable_otlp_log": true
  }
}
2. From the command console (terminal), run the ./qrvey apply script:
./qrvey apply
After the script completes, the output prints the following:
Monitoring: Enabled
Grafana URL: http://<load-balancer-hostname>:80
Grafana User: admin
Grafana Password: <auto-generated>
To review the output, run the following command:
./qrvey output
JSON Configuration Options
The following table describes options you can use in the config.json file.
| Variable | Type | Default | Description |
|---|---|---|---|
| enable_monitoring | boolean | false | Deploys Grafana, dashboards, alert rules, and exporters. |
| enable_otlp_log | boolean | true | Directs services to push logs to Loki using OTLP. Sets ENABLE_OTLP_LOG=true in the service environment. |
| log_level | string | "info" | Controls service log verbosity. Valid values: fatal, error, warn, info, debug, trace. |
| enable_log_collection | boolean | false | Deploys Promtail to scrape container stdout/stderr (separate from OTLP). |
| monitoring_grafana_password | string | (auto) | Custom Grafana admin password. Omit to auto-generate a password. |
| monitoring_grafana_service_type | string | "LoadBalancer" | "LoadBalancer" for external access, "ClusterIP" for port-forward only. |
| monitoring_metrics_retention_days | number | 15 | Days to keep Prometheus metrics. |
| monitoring_logs_retention_days | number | 7 | Days to keep Loki logs. |
| monitoring_log_namespaces | list | ["qrveyapps", ...] | Namespaces Promtail collects from (only when enable_log_collection = true). |
| slack_webhook_url | string | "" | Slack Incoming Webhook URL. When set, all alerts are sent to Slack automatically. Leave empty to disable. |
| slack_alert_channel | string | "#alerts" | Slack channel to receive alert notifications (for example, "#ops-alerts"). |
Examples
The following code block enables all monitoring functions.
{
"variables": {
"enable_monitoring": true,
"enable_otlp_log": true,
"log_level": "debug",
"monitoring_metrics_retention_days": 30,
"monitoring_logs_retention_days": 14,
"slack_webhook_url": "<https://hooks.slack.com/services/T.../B.../xxx>",
"slack_alert_channel": "#ops-alerts"
}
}
The following code block enables logs only (no dashboards).
{
"variables": {
"enable_monitoring": false,
"enable_otlp_log": true,
"log_level": "info"
}
}
With this configuration, Loki still receives logs from the services; only the Grafana interface is not deployed.
Data Retention
Use the following configuration variables to define how long the logs and metrics are kept.
| Config variable | Data | Default |
|---|---|---|
| monitoring_logs_retention_days | Logs (Loki) | 7 days |
| monitoring_metrics_retention_days | Metrics (Prometheus) | 15 days |
When enable_monitoring = false, Loki runs in minimal mode, using a 3-day retention limit independent of these variables.
To change retention, update the config.json file and re-apply the configuration:
{
"variables": {
"monitoring_logs_retention_days": 30,
"monitoring_metrics_retention_days": 60
}
}
Loki and Prometheus are reconfigured in place, changing the interval for future automated purges. No data is lost.
Note: Higher log verbosity (log_level: debug) produces significantly more log volume per day. If you change the log_level, you should also change the monitoring_logs_retention_days to keep storage usage predictable.
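For example, a temporary troubleshooting configuration (the values are illustrative) might pair debug verbosity with a shorter log retention window:
{
  "variables": {
    "log_level": "debug",
    "monitoring_logs_retention_days": 3
  }
}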
Enable or Disable Logging After Deployment
You can change logging settings by updating the config.json file and re-applying. No cluster downtime is required.
You can prevent running jobs from being stopped by adding the following flags when running the ./qrvey apply script.
- --skip-workflows-hooks - Skips stoppage of workflows running during an upgrade.
- --skip-data-sync-jobs - Skips stoppage of data sync jobs running during an upgrade.
- --skip-cleanup-orphans - Skips stoppage of orphan cleanup jobs running during an upgrade.
Example:
./qrvey apply --skip-workflows-hooks --skip-data-sync-jobs --skip-cleanup-orphans
Changes can take up to an hour to apply. During this time, the environment might not be available. Plan to apply changes during off hours.
OTLP Log Shipping
The following configuration controls whether Qrvey services ship logs to Loki over OTLP. To apply changes, re-apply the configuration.
Enable
{ "variables": { "enable_otlp_log": true } }
During the next rolling restart, services pick up ENABLE_OTLP_LOG=true from the updated general-configmap and begin pushing logs to Loki.
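To confirm the flag has landed in the configmap, or to roll a service immediately rather than wait for the next natural restart, commands along these lines can help (the configmap and namespace names follow the conventions described in this guide; the deployment name is a placeholder):
# Confirm the flag is present in the updated configmap
kubectl -n qrveyapps get configmap general-configmap -o yaml | grep ENABLE_OTLP_LOG
# Optionally trigger a rolling restart so a service picks it up right away
kubectl -n qrveyapps rollout restart deployment <deployment-name>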
Disable
{ "variables": { "enable_otlp_log": false } }
Services stop pushing logs. Logs already stored in Loki remain until their retention period expires.
Deployed Components
The following components are deployed even when enable_monitoring = false.
| Component | Purpose |
|---|---|
| Prometheus | Metrics collection (required by Elasticsearch adapter and job-downscaler). |
| Loki | Log storage — receives OTLP logs from services. |
The following components are deployed when enable_log_collection = true.
| Component | Purpose |
|---|---|
| Promtail | Scrapes container stdout/stderr automatically. |
The following components are deployed when enable_monitoring = true.
| Component | Purpose |
|---|---|
| Grafana | Dashboards and log viewer UI. |
| Alertmanager | Routes alert notifications (Slack when slack_webhook_url is set). |
| PrometheusRule | Alert rules (disk, containers, Kong, Slack priority). |
| kube-state-metrics | Exposes Kubernetes object metrics. |
| node-exporter | Host-level CPU, memory, and disk metrics. |
| RabbitMQ Exporter | RabbitMQ queue and connection metrics. |
| Kong Prometheus Plugin | Kong HTTP request metrics. |
| Dedicated NodePool | Karpenter ARM64 nodes for monitoring workloads. |
Automatic Log Collection
The following configuration controls whether Promtail collects container logs and ships them to Loki.
Enable
{ "variables": { "enable_log_collection": true } }
When applied, the configuration script deploys Promtail as a DaemonSet and begins collecting container stdout/stderr immediately.
Disable
{ "variables": { "enable_log_collection": false } }
When applied, the configuration script removes Promtail. Logs already in Loki remain until retention expires.
Note: enable_monitoring (Grafana, dashboards, alerts) and enable_otlp_log (log shipping) are independent switches. You can have logs without the Grafana UI, or the Grafana UI without OTLP log shipping.
Change the Log Level
All Qrvey services read their log verbosity from the LOG_LEVEL environment variable, which is injected into each service through the general-configmap. The value comes directly from the log_level in the config.json file:
config.json > log_level > general-configmap > LOG_LEVEL env var > all Qrvey services
To change the log level, edit the config.json file and re-apply the configuration:
{ "variables": { "log_level": "debug" } }
The general-configmap is updated immediately. Each service pod picks up the new level on its next rolling restart without a service outage.
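To check which level a running pod is currently using, or to force the rollout rather than wait, something like the following can be used (the deployment name is a placeholder):
# Check the level a running pod currently has
kubectl -n qrveyapps exec deploy/<deployment-name> -- printenv LOG_LEVEL
# Force a rolling restart so the new level takes effect immediately
kubectl -n qrveyapps rollout restart deployment <deployment-name>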
Log Levels
Log level hierarchy: FATAL (highest) > ERROR > WARN > INFO > DEBUG > TRACE.
| Level | Logs | Description |
|---|---|---|
| info | logger.info, logger.warn, logger.error, logger.fatal | Default log level that includes INFO, WARN, ERROR, and FATAL messages. |
| fatal | logger.fatal | Only critical errors are logged. This generates the fewest logs. |
| error | logger.error, logger.fatal | Any errors in the application, including FATAL. |
| warn | logger.warn, logger.error, logger.fatal | Includes FATAL and ERROR in addition to any warnings. |
| debug | logger.debug, logger.info, logger.warn, logger.error, logger.fatal | Includes everything in INFO as well as debugging messages, producing high log volumes. Long-term usage quickly exhausts your log retention budget and increases storage costs. Switch back to your preferred log level after troubleshooting. |
| trace | logger.trace, logger.debug, logger.info, logger.warn, logger.error, logger.fatal | Includes everything in DEBUG as well as detailed TRACE information, producing high log volumes. Long-term usage quickly exhausts your log retention budget and increases storage costs. Switch back to your preferred log level after troubleshooting. |
OTLP Environment Variables
When enable_otlp_log is true, the following environment variables are injected into each Qrvey service using general-configmap:
| Variable | Value | Description |
|---|---|---|
| LOG_LEVEL | info (configurable) | Controls which log levels are emitted. |
| ENABLE_OTLP_LOG | true | Flag services check to enable OTLP export. |
| OTLP_LOG_URL | http://loki.monitoring.svc.cluster.local:3100/otlp/v1/logs | POST location for service logs. |
Logs are pushed in the standard OTLP format used by OpenTelemetry SDKs and collectors.
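As a quick connectivity check, you can post a minimal OTLP/HTTP JSON log record to OTLP_LOG_URL from a pod inside the cluster (the address is cluster-internal). The payload below is a hand-written illustration, not output from an actual Qrvey service, and the timestamp assumes GNU date:
# Run from inside the cluster (for example, from a debug pod)
curl -sS -X POST "http://loki.monitoring.svc.cluster.local:3100/otlp/v1/logs" \
  -H "Content-Type: application/json" \
  -d '{
    "resourceLogs": [{
      "resource": { "attributes": [
        { "key": "service.name", "value": { "stringValue": "otlp-smoke-test" } }
      ]},
      "scopeLogs": [{ "logRecords": [{
        "timeUnixNano": "'"$(date +%s%N)"'",
        "severityText": "INFO",
        "body": { "stringValue": "OTLP connectivity test" }
      }]}]
    }]
  }'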
To view debug-level logs from all services, set "log_level": "debug" in the config.json file.
Grafana Usage
After deployment, open the Grafana URL printed in the output. Log in with admin and the password shown.
View Logs
1. Select Explore (compass icon in the sidebar).
2. Select Loki from the data source dropdown.
3. Use the label browser to filter:
- service_name - the service that emitted the log.
- service_namespace - Kubernetes namespace.
- level or detected_level - log severity.
Or type a LogQL query directly:
{service_name="dp-dataset-service"} |= "error"
{service_namespace="qrveyapps", detected_level="error"}
You can also go to Dashboards > Logs > Logs Explorer to display a pre-built view with dropdowns.
View Dashboards
The Dashboards sidebar displays dashboards organized in folders:
| Folder | Dashboards | Display |
|---|---|---|
| Cluster | Cluster Overview, Node Exporter Full, K8s Views (Global, Namespaces, Pods) | CPU, memory, disk, network across the cluster |
| Qrvey Applications | Qrvey Applications Health | Replica counts, OOMKills, restarts, HPA status for Qrvey services |
| Dependencies | Dependencies Health | Redis, RabbitMQ, Kong, Elasticsearch pod status |
| Kong | Kong Official, Request Throttling & Kong Performance | HTTP request rates, latency, bandwidth, 429s |
| Kube System | CoreDNS | DNS queries, cache hits, latency |
| Exporters | Elasticsearch Overview, RabbitMQ Overview | ES cluster health, shards, queue depths |
| Networking | AWS VPC CNI | IP allocation, ENI usage |
| Alerts & Incidents | Container Downtime & Recovery, Disk Space Monitoring | Crash tracking, restart trends, disk usage gauges |
| Logs | Logs Explorer | Live log viewer with namespace, app, pod, and level filters |
View Alerts
Go to Alerting > Alert rules in the sidebar. The following pre-configured rules are organized in groups:
- Disk Space — DiskSpaceCritical (>80%), DiskSpaceWarning (>70%), PVCSpaceCritical (>80%)
- Container Downtime — CrashLoopBackOff, OOMKilled, PodNotReady, HighContainerRestarts, DeploymentReplicasMismatch
- Request Throttling — KongHighHTTP429Rate, CPUThrottlingHigh, KongHighLatencyP99, KongHighErrorRate
- Slack Priority — PodPendingTooLong, CriticalServicePodDown, CriticalServicePodRestarting, NodeMemoryHighUsage
Alerts show as Normal (green), Pending (yellow), or Firing (red).
Set Up Email Notifications
You can set up email notifications in Grafana.
Set Up Slack Notifications
Alerts are routed to Slack automatically through Alertmanager — no manual Grafana UI configuration needed.
1. Create a Slack incoming webhook.
a. Go to your Slack workspace > Apps and search for Incoming WebHooks.
b. Select Add to Slack > choose the channel that receives alerts > click Add Incoming WebHooks Integration.
c. Copy the Webhook URL (looks like https://hooks.slack.com/services/T.../B.../xxx).
2. Add the webhook to the config.json file.
{
  "variables": {
    "enable_monitoring": true,
    "slack_webhook_url": "https://hooks.slack.com/services/T.../B.../xxx",
    "slack_alert_channel": "#ops-alerts"
  }
}
3. Apply the configuration.
./qrvey apply
Alertmanager is reconfigured immediately without needing to restart Grafana.
Slack Notifications
When an alert fires, Alertmanager sends a Slack message like the following:
[FIRING:1] CriticalServicePodDown
• *Critical service pod rabbitmq/rabbitmq-0 is not ready*
Pod rabbitmq-0 in rabbitmq (RabbitMQ/Redis/Elasticsearch) has been not ready for 2+ minutes.
Severity: `critical` | Namespace: `rabbitmq`
When the condition clears, Slack generates a green RESOLVED message.
Test the Pipeline
Run the test script to inject synthetic alerts and verify they reach your Slack channel without touching production workloads:
./scripts/test-alerts.sh
Alerts Sent to Slack
All alert rules route through Alertmanager. The Slack Priority group is specifically tuned for operational urgency:
| Alert | Severity | Trigger |
|---|---|---|
| PodPendingTooLong | warning | Any pod stuck Pending for five or more minutes. |
| CriticalServicePodDown | critical | RabbitMQ, Redis, or Elasticsearch pod not ready for two or more minutes. |
| CriticalServicePodRestarting | critical | Any restart in those namespaces (immediate). |
| NodeMemoryHighUsage | critical | Node memory greater than 90% for five or more minutes. |
Disable Slack Notifications
Remove slack_webhook_url from the config.json file (or set it to ""), then re-apply the configuration. Alertmanager switches to the null receiver and discards notifications silently.
Disable Monitoring
Add the following to the config.json file, then re-apply the configuration.
{
"variables": {
"enable_monitoring": false
}
}
This removes Grafana, dashboards, exporters, and the monitoring node pool. Loki remains running in minimal mode (10Gi storage, 3-day retention) so services can continue pushing logs using OTLP.
Troubleshooting
Grafana URL Shows Pending Message
The load balancer takes about three minutes to provision. Wait, then run ./qrvey output again.
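To watch the load balancer come up directly, a command like the following (the monitoring namespace matches the one used elsewhere in this guide) shows when the external hostname is assigned:
# Watch for the EXTERNAL-IP / hostname to move from <pending> to a value
kubectl -n monitoring get svc | grep -i grafana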
Alerts Not Displayed in Grafana
Go to Alerting > Alert rules. If empty, the PrometheusRule might not have been created. Re-apply the configuration.
Slack Alerts Not Arriving
Run ./scripts/test-alerts.sh to send synthetic alerts directly to Alertmanager and confirm routing. Verify that slack_webhook_url in the config.json file is correct and ./qrvey apply was run after adding the URL. You can also verify through the Alertmanager UI:
- Connect using kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093.
- Open http://localhost:9093.
Slack Alerts Firing Too Often
The default repeat_interval is four hours, so the same alert notification is generated every four hours while firing.
Logs Not Appearing
Verify that enable_otlp_log is true in the config.json file and services have been restarted to pick up the new configuration map values.
Promtail Not Collecting
Verify that enable_log_collection is true in the config.json file. Run ./qrvey output to confirm.
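You can also check that the Promtail DaemonSet is running; the DaemonSet and namespace names below are assumptions based on common Promtail chart defaults, so adjust them to your deployment:
kubectl -n monitoring get daemonset | grep -i promtail
kubectl -n monitoring logs daemonset/promtail --tail=20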
Retrieve Grafana Password
Run ./qrvey output. The password is printed in the output.
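If you prefer to set a known password instead, add monitoring_grafana_password to the config.json file and re-apply the configuration:
{
  "variables": {
    "monitoring_grafana_password": "<your-new-password>"
  }
}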