Production Configuration for AWS
This page describes recommended configuration settings for running a production-grade Qrvey environment on AWS. The default installation settings are designed for getting started, but a production environment requires adjustments to ensure high availability, performance, and reliability under load.
OpenSearch Cluster Configuration
Qrvey recommends the following minimum OpenSearch cluster configuration for production:
| Setting | Recommended Value |
|---|---|
| Instance Type | r6g.large.search |
| Dedicated Master Node Count | 3 |
| Data Node Count | 3 |
| Disk Space | 100 GB per node |
Alternatively, you can select newer instance types such as r7g.large.search and r8g.large.search. For very large datasets and frequent data syncs, consider upgrading to xlarge, 2xlarge, or 4xlarge sizes. You can adjust the instance size in the Infrastructure section of the Qrvey Admin Center.
The default installation uses two data nodes and no dedicated master nodes. Although this configuration can fail over if one data node goes down, it is not recommended for high-availability environments, where the cluster must recover cleanly from an outage.
- Three data nodes are required to avoid the 'split-brain' problem, where a cluster cannot form a quorum if one node becomes unreachable.
- Three dedicated master nodes ensure the cluster elects a new master quickly and correctly after any node failure.
Note: As your OpenSearch cluster scales, you do not need more than three dedicated master nodes regardless of the number of data nodes.
Configuration Example
To apply the production OpenSearch configuration, update the opensearch_config variable in your config.json file. To review a full list of opensearch_config properties, see AWS Deployment Input Variables.
"opensearch_config": {
  "enabled": true,
  "instance_type": "r6g.large.search",
  "instance_count": 3,
  "volume_size": 100,
  "dedicated_master_enabled": true,
  "dedicated_master_type": "r6g.large.search",
  "dedicated_master_count": 3,
  "zone_awareness_enabled": true,
  "encrypt_at_rest": true,
  "node_to_node_encryption": true
}
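Before deploying, it can help to check a config.json against the recommended minimums from the table above. The following is a minimal sketch, not part of the Qrvey tooling: the function name `check_opensearch_config` and the sample values are hypothetical, and only the property names come from the example above.

```python
import json

# Recommended production minimums, taken from the table on this page.
RECOMMENDED_MIN = {
    "instance_count": 3,
    "dedicated_master_count": 3,
    "volume_size": 100,
}


def check_opensearch_config(config: dict) -> list:
    """Return a list of warnings; an empty list means the minimums are met."""
    section = config.get("opensearch_config", {})
    warnings = []
    if not section.get("dedicated_master_enabled", False):
        warnings.append("dedicated_master_enabled should be true")
    for key, minimum in RECOMMENDED_MIN.items():
        value = section.get(key, 0)  # a missing property is treated as 0
        if value < minimum:
            warnings.append(f"{key} is {value}, recommended minimum is {minimum}")
    return warnings


if __name__ == "__main__":
    # Hypothetical fragment: two data nodes and no dedicated master nodes.
    sample = json.loads("""
    {"opensearch_config": {"instance_count": 2, "volume_size": 100,
                           "dedicated_master_enabled": false}}
    """)
    for warning in check_opensearch_config(sample):
        print(warning)
```

To check a real file, load it with `json.load(open("config.json"))` and pass the result to the function. The check covers only the node counts and disk size; instance types are left to your own sizing decisions.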
OpenSearch Snapshots
Qrvey's data load process creates a new Elasticsearch index. When the data is loaded, the old index is removed. However, the hourly AWS OpenSearch Snapshot process can block index deletion. This can lead to orphan index accumulation during frequent data syncs, data syncs with full reloads, large dataset loads, or large Content Deployment jobs. Qrvey's background job automatically deletes orphan indexes. If the accumulation persists, consider temporarily disabling the Snapshot process in the AWS OpenSearch cluster console.
PostgreSQL RDS Configuration
The default PostgreSQL instance class is db.t3.medium, which belongs to the burstable performance (T-family) instance class. T-family instances accumulate CPU credits during idle periods and consume those credits under load.
This model works well for workloads with predictable peaks and valleys but is not suitable for production environments where data syncs run continuously. Constant syncs deplete the CPU credit balance, and the instance is then throttled to its baseline CPU performance, which can significantly reduce data load throughput and slow query response times.
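The credit arithmetic can be sketched as follows. This is a simplified model, not an AWS or Qrvey tool: it assumes the instance throttles once the balance reaches zero, as described above, and the constants (2 vCPUs, 24 credits earned per hour, a maximum balance of 576) are assumed figures for a t3.medium-class instance that you should confirm against the AWS documentation for your instance class. The function name `hours_until_throttled` is hypothetical.

```python
# Assumed figures for a t3.medium-class instance; verify them for your
# instance class before relying on the result.
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 24
MAX_CREDIT_BALANCE = 576


def hours_until_throttled(avg_cpu_utilization: float,
                          starting_balance: float = MAX_CREDIT_BALANCE) -> float:
    """Estimate hours of sustained load before the credit balance reaches zero.

    One CPU credit equals one vCPU running at 100% for one minute.
    Returns float("inf") when the load does not drain the balance.
    """
    spent_per_hour = VCPUS * avg_cpu_utilization * 60
    net_drain = spent_per_hour - CREDITS_EARNED_PER_HOUR
    if net_drain <= 0:
        return float("inf")
    return starting_balance / net_drain


if __name__ == "__main__":
    for utilization in (0.10, 0.40, 0.60, 0.90):
        hours = hours_until_throttled(utilization)
        print(f"{utilization:.0%} average CPU: {hours:.1f} hours")
```

Under these assumed figures, a sustained 60% average CPU load drains a full credit balance in roughly 12 hours, which is why continuous sync activity eventually runs at baseline performance.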
For production workloads with continuous or high-concurrency sync activity, use an instance class from a non-burstable family such as r6g or r7g. These instance types provide consistent CPU performance without relying on credit accumulation.
A recommended starting point for production is db.r6g.large. Adjust the instance class based on the number of concurrent users and the volume and frequency of data syncs in your environment.
To update the PostgreSQL instance class, set the postgresql_config variable in your config.json file:
"postgresql_config": {
  "instance_class": "db.r6g.large",
  "version": "16.11"
}
For the full list of postgresql_config properties, see AWS Deployment Input Variables.
VPC and Private Subnet Sizing
Each data sync operation can spin up a new Kubernetes pod to handle that sync process. In environments with many concurrent syncs running in parallel, the number of pods can grow rapidly and exhaust the available IP addresses in your private subnets.
The default private subnet CIDR blocks (/24) provide 251 usable IP addresses for each subnet, which might not be sufficient for high-concurrency production environments.
Qrvey recommends using /20 CIDR blocks for all private subnets in production. A /20 block provides 4,091 usable IP addresses per subnet, giving the Kubernetes cluster enough capacity to scale pods without running out of IP addresses.
To configure larger subnets, set the private_subnets_cidrs variable in your config.json file. You can use your own CIDR ranges as long as they do not conflict with any other subnets in your VPC. This is especially important if you are using VPC Peering, where overlapping CIDRs between peered VPCs cause routing failures.
"private_subnets_cidrs": [
  "10.110.0.0/20",
  "10.110.16.0/20",
  "10.110.32.0/20",
  "10.110.48.0/20"
]
The VPC CIDR must be large enough to accommodate all subnets. The default VPC CIDR of 10.110.0.0/16 supports the example configuration above.
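The subnet arithmetic and the fit within the VPC CIDR can be checked with Python's standard `ipaddress` module. This is a minimal sketch: the helper names `usable_addresses` and `check_subnets` are hypothetical, and the count subtracts the five addresses that AWS reserves in every subnet.

```python
import ipaddress
from itertools import combinations

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet


def usable_addresses(cidr: str) -> int:
    """Number of addresses left for nodes and pods in an AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def check_subnets(vpc_cidr: str, subnet_cidrs: list) -> list:
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    # Every subnet must sit inside the VPC range.
    problems = [f"{s} is outside {vpc}" for s in subnets if not s.subnet_of(vpc)]
    # No two subnets may share addresses.
    problems += [f"{a} overlaps {b}"
                 for a, b in combinations(subnets, 2) if a.overlaps(b)]
    return problems


if __name__ == "__main__":
    print(usable_addresses("10.110.0.0/24"))  # 251
    print(usable_addresses("10.110.0.0/20"))  # 4091
    example = ["10.110.0.0/20", "10.110.16.0/20",
               "10.110.32.0/20", "10.110.48.0/20"]
    print(check_subnets("10.110.0.0/16", example))  # []
```

The same check can be run against the CIDR ranges of a peered VPC by adding them to the subnet list, since any overlap is reported as a problem.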
Note: Subnet sizes can only be set at initial deployment time. If you need to resize subnets in an existing deployment, you must migrate to a new VPC. Plan your subnet sizes before deploying to production.
CoreDNS Scaling
In environments with a high number of concurrent sync operations, Kubernetes relies heavily on DNS resolution to route traffic between services. The default CoreDNS deployment (typically two replicas) can become a bottleneck under heavy load. This can cause slow or failed DNS lookups that affect sync performance and reliability. To help determine if CoreDNS pods need to be scaled, you can set up monitoring and logging for your CoreDNS pods.
If you observe DNS resolution issues or degraded performance during periods of high concurrency, scale up the number of CoreDNS replicas.
- Update your kubeconfig to connect to your EKS cluster, replacing <region> and <deploymentid> with your values:
  aws eks update-kubeconfig --region <region> --name qrvey-eks-<deploymentid>
- Scale the CoreDNS deployment to the desired number of replicas. Five replicas is a common starting point for high-concurrency environments:
  kubectl scale deployment coredns -n kube-system --replicas=5
- Verify the pods are running:
  kubectl get pods -n kube-system -l k8s-app=kube-dns
Adjust the replica count based on the level of concurrent activity in your environment.
Dataload Scaling
Datasets with Transformations
For large datasets with many transformations, Qrvey recommends making two changes together: increase the number of messages processed by each replica from the default of 1 to 20, and increase the maximum number of replicas from the default of 5 to 10. This improves the performance and stability of the dataload. You can apply both changes with the following kubectl commands:
kubectl set env deployment/qrvey-dp-dr-transformation TRANSFORM_CHUNK_QUEUE_HP_PREFETCH=20 TRANSFORM_CHUNK_QUEUE_LP_PREFETCH=20 -n qrveyapps
kubectl patch scaledobject qrvey-dp-dr-transformation-scaledobject --type merge -p '{"spec":{"minReplicaCount":1,"maxReplicaCount":10}}' -n qrveyapps
Loading Data from S3
When loading a large dataset from S3, Qrvey recommends increasing the maximum number of replicas from 2 to 4 to improve performance. This increases the number of S3 files processed at the same time. To increase the number of replicas, use the following kubectl command:
kubectl patch scaledobject qrvey-dp-dr-file-datasource-pump-service-scaledobject --type merge -p '{"spec":{"minReplicaCount":1,"maxReplicaCount":4}}' -n qrveyapps
Infrastructure Cost
Changing the platform's default configuration and increasing resources can affect the overall cost of the platform. To estimate the cost of configuration changes, use the AWS Pricing Calculator.
Additional Resources
- Configure Monitoring and Logging - Enable Prometheus, Grafana, and Loki for observability in your production environment.
- Secure Database Connection (AWS) - Configure VPC peering and secure access for your RDS instances.
- AWS Deployment Input Variables - Tune dataload_config to control autoscaling and resource limits for dataset loading microservices.
- Set Concurrent Kubernetes Data Sync Job Limits - Manage the number of data sync jobs that run concurrently.
- Set Concurrent Kubernetes Lake Garbage Collector Job Limits - Manage the number of Lake Garbage Collector jobs that run concurrently.
- Troubleshooting the Qrvey Instance - Set up kubectl access.