Architect Your Data Pipeline
Once you've installed the Qrvey Platform, you will want to architect your data pipeline. This article highlights best practices regarding how your data gets imported into Qrvey, optimized for visualizations, and refreshed over time.
Connections
Key Point: Qrvey connects to any data source, structured or unstructured. Connections tell Qrvey where to fetch data from.
To establish Connections in Qrvey, you have two options: use Qrvey Composer or use the API. The choice depends on your team's preferences and business requirements. Composer has a limited number of connection types. If your data source is not listed, then the alternatives are using a REST API connection or using Qrvey's Data Router APIs.
In addition, when connecting your data to Qrvey, you must decide whether to use the actual source database or set up an intermediate layer. The process of bringing data into Qrvey is resource-intensive, which becomes an issue when importing directly from an OLTP data source (i.e., a production database handling a high rate of transactions). If this is a business concern, we recommend using an intermediate layer.
What are some example scenarios where having an intermediate layer is recommended?
- In a production database, the huge volume of read requests required by Qrvey may cause interference.
- If connecting to the source is too slow to meet business needs.
- Importing vast amounts of data directly from one cloud into another may be impractical; staging the data in an intermediate layer makes the transfer manageable.
The two most common types of intermediate layers are:
- Mirrored or Cloned Databases — Connecting to a mirrored database (e.g., a read replica) to offload queries is recommended when possible, as it is easier to maintain.
- Data Lakes or Cloud-Based Repositories — If you already have this type of infrastructure set up (Snowflake, Databricks, etc.), then feel free to use it. If not, you can utilize cloud storage to configure your data lake, as sketched below.
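If you do not already have a data lake, one lightweight approach is to periodically stage snapshots of source tables as files in cloud storage and point Qrvey at that staging area instead of the production database. The sketch below is illustrative only; it assumes Python with pandas, SQLAlchemy, pyarrow, and s3fs, a Postgres source, and a hypothetical bucket named "my-analytics-staging".

```python
# Illustrative sketch of building a simple intermediate layer in cloud storage.
# Assumptions: a Postgres source, pandas + SQLAlchemy + pyarrow + s3fs installed,
# and a hypothetical S3 bucket "my-analytics-staging". Adapt names to your setup.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://readonly_user:***@prod-db:5432/sales")

# Read from the OLTP source in chunks to keep the load on production low,
# then write Parquet files that Qrvey (or any other consumer) can import.
for i, chunk in enumerate(
    pd.read_sql_query("SELECT * FROM orders", engine, chunksize=100_000)
):
    chunk.to_parquet(f"s3://my-analytics-staging/orders/part-{i:05d}.parquet")
```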
You can establish as many connections as you need.
Datasets
Key Point: Design Datasets to be performant in the visualizations they support, as well as easy to use for the team.
After you've established connection(s) to your data source(s), you use them to create Datasets, which are used to make data visualizations.
Qrvey Composer provides three types of Datasets:
- Live-Connect Datasets — Data is retrieved by directly querying a supported data source (e.g. Postgres, Redshift, or Snowflake).
- Managed Datasets — Data is retrieved from Qrvey's data engine.
- Dataset Views — A generated view that points to a Managed Dataset.
The functional difference with Dataset Views is that you cannot transform data in a Dataset View. However, they do provide logic to show/hide columns.
Why would you create a Dataset View? If you have multiple tenants and are using a commingled dataset architecture, you would not want to duplicate any of the commingled datasets. Instead, you can create one Dataset, create a separate Dataset View for each tenant, and then hide rows and columns in each view as needed to tailor what that tenant's end users can access.
When creating Datasets, keep the following details in mind:
- Each visualization may be sourced from only one Dataset.
- Datasets may contain data from multiple connections.
- The Dataset itself can be modified as needed at any time.
- You can modify or altogether switch the connection that was used to create a Dataset, though if the schema is different it may break the visualizations that use it.
- Removal or modification of existing columns will break any visualizations that depend on those columns, but adding new columns will not impact them.
Every Managed Dataset in Qrvey maps to exactly one index within the dedicated Elasticsearch cluster. In effect, these datasets are materialized views: they are persisted in Elasticsearch storage and therefore must be periodically resynchronized with their connection to keep the data up to date.
Most of the time, you will need to augment your data from the source in order to build visualizations. You will likely need to clean it, add or remove rows and columns, apply joins or unions, transform or calculate its values, and so on.
Dataset Structure
You could theoretically make one massive Dataset to support every visualization type you have. However, this might have hundreds of columns and millions of rows that are never used. That would not be user-friendly and could slow down performance during aggregations. Alternatively, you could make a separate Dataset for each type of visualization or metric you have. But that means you have tons of Datasets to track and parse through, and in many cases, visualizations use the same or similar data, so you'd be copying data and taking up extra storage.
Therefore, you want to find an "optimum" point, where data can be reused across visualizations intelligently. A good strategy is to create one Dataset per dashboard, asset class, or business goal. For example, a Dataset for a "sales" dashboard, a separate dataset for "marketing", etc. Your goal here is to keep things clean and performant. Before you make a single dataset or establish connections and intermediate layers, identify the analytics and visualizations required, then make an inventory of all the features needed to support them (columns, aggregates, geo-location, trends, addresses, etc.) and design the dataset accordingly.
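As a planning aid, the inventory can be as simple as a structured checklist per dashboard. The snippet below is purely hypothetical; the dashboard name, columns, and aggregations are placeholders, not a Qrvey API.

```python
# Hypothetical requirements inventory for a planned "sales" dashboard.
# Nothing here is a Qrvey API -- it is just a checklist to drive dataset design.
sales_dashboard_requirements = {
    "visualizations": ["revenue_trend", "orders_by_region", "top_products"],
    "columns": ["order_id", "order_date", "region", "product", "quantity", "unit_price"],
    "calculated_columns": {"revenue": "quantity * unit_price"},
    "aggregations": ["sum(revenue) by month", "count(order_id) by region"],
    "geo_fields": ["region"],        # needed if map charts are planned
    "refresh_cadence": "hourly",     # informs the synchronization strategy later
}
```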
This video describes how to design datasets to support data visualizations, and how to create datasets using Qrvey Composer. It also discusses some advanced topics including dataset joins, column data types, shared datasets, and dataset views.
Joins & Unions
Key Points: Joins must be planned carefully. You can join data in Qrvey or in your source database. Joins can never occur in real-time when building visualizations.
Remember, a visualization is sourced from a single Dataset. So if it requires data from multiple sources, you may need to join or union multiple connections into a single dataset. If all data resides in the same source database, you can perform Joins and Unions in Qrvey or in the source database. If data comes from multiple sources, Qrvey must perform the operations.
Source Database Joins
- You can create a view directly in your source database, or you can source a dataset from a SQL query that performs the joins (see the example after this list).
- If using a SQL query within Qrvey Composer, the source database does the join, but the workload then shifts to Qrvey to write the results into Elasticsearch.
- This offloads computation to the source database, potentially reducing Qrvey’s processing load.
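As a concrete illustration of the first option, the following sketch creates a view in a Postgres source so the join work stays in the database. The table and column names are hypothetical.

```python
# Sketch: perform the join in the source database by defining a view there.
# Assumes a Postgres source with hypothetical "orders" and "customers" tables.
import psycopg2

conn = psycopg2.connect("dbname=sales user=admin password=*** host=prod-db")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE OR REPLACE VIEW order_analytics AS
        SELECT o.id, o.order_date, o.total,
               c.name AS customer_name, c.region
        FROM orders o
        JOIN customers c ON c.id = o.customer_id;
    """)
# A Qrvey dataset can then be sourced from the "order_analytics" view,
# or from the same SELECT statement entered as a SQL query in Composer.
```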
Qrvey Joins
- Composer can also perform joins and unions, then query and write the results into Elasticsearch.
- Qrvey creates and maintains its own data lake in cloud storage, uses Athena/Trino to query it, and writes the results back to Elasticsearch.
- Joining in Qrvey will be required if you're joining multiple repository types.
In the end, where should joins happen? It's a trade-off that depends on your needs and infrastructure. For real-time updates or if you have a data source optimized for joins, handle the joins in the source. If you need to combine multiple sources/snapshots or wish to keep the load off your data source, use Qrvey.
Flattening Data
Flattening data in Qrvey is the process of transforming hierarchically nested arrays of unstructured data into a denormalized view that is easier to process. After applying a flattening transformation in Qrvey, each individual property from the hierarchically nested array or document becomes an individual column in the dataset. As a result, arrays and documents with deeply nested hierarchies become very wide datasets.
Normalized and deeply nested hierarchical data structures, while excellent for data integrity and minimizing redundancy, can introduce performance bottlenecks when used for analytics and visualizations. This is because:
- Querying across tables or relationships requires complex joins, which are compute-intensive and slow down performance.
- Deeply nested JSON structures need recursive processing, making aggregation operations more resource-heavy.
- Denormalized (flattened) data takes up more storage space, but significantly improves query performance and reduces computational overhead.
Example: Flattening Nested JSON Data
Consider the following nested JSON record:
{
  "users": [
    {
      "id": 1,
      "name": "Alice",
      "email": "alice@email.com",
      "purchases": [101, 102]
    },
    {
      "id": 2,
      "name": "Bob",
      "email": "bob@email.com",
      "purchases": [103]
    }
  ],
  "orders": [
    {
      "id": 101,
      "product": "Laptop",
      "price": 1200,
      "details": {
        "model": "Bell Inspiretron 100",
        "ram": "256 MB"
      }
    },
    {
      "id": 102,
      "product": "Mouse",
      "price": 30,
      "details": {
        "brand": "Big Game Hunter"
      }
    },
    {
      "id": 103,
      "product": "Laptop",
      "price": 1200,
      "details": {
        "model": "Hackintosh 2008",
        "ram": "512 MB"
      }
    }
  ]
}
This type of data requires complex queries to extract order details. Flattening it into a dataset makes it easier to work with for analytics and visualizations:
users.id | users.name | users.email | orders.id | orders.product | orders.price | orders.details.model | orders.details.ram | orders.details.brand |
---|---|---|---|---|---|---|---|---|
1 | Alice | alice@email.com | 101 | Laptop | 1200 | Bell Inspiretron 100 | 256 MB | null |
1 | Alice | alice@email.com | 102 | Mouse | 30 | null | null | Big Game Hunter |
2 | Bob | bob@email.com | 103 | Laptop | 1200 | Hackintosh 2008 | 512 MB | null |
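Qrvey performs this flattening for you when you configure the transformation, but as a mental model, the following sketch (using pandas purely for illustration; it is not Qrvey's implementation) produces roughly the table above from the nested record:

```python
# Illustration only: conceptually reproduce the flattened table with pandas.
# Qrvey performs this transformation internally; pandas is not required.
import pandas as pd

data = {
    "users": [
        {"id": 1, "name": "Alice", "email": "alice@email.com", "purchases": [101, 102]},
        {"id": 2, "name": "Bob", "email": "bob@email.com", "purchases": [103]},
    ],
    "orders": [
        {"id": 101, "product": "Laptop", "price": 1200,
         "details": {"model": "Bell Inspiretron 100", "ram": "256 MB"}},
        {"id": 102, "product": "Mouse", "price": 30,
         "details": {"brand": "Big Game Hunter"}},
        {"id": 103, "product": "Laptop", "price": 1200,
         "details": {"model": "Hackintosh 2008", "ram": "512 MB"}},
    ],
}

# One row per (user, purchase); nested "details" objects become dotted columns.
users = pd.json_normalize(data["users"]).explode("purchases")
users["purchases"] = users["purchases"].astype("int64")
orders = pd.json_normalize(data["orders"])

flat = users.merge(
    orders, left_on="purchases", right_on="id", suffixes=("_user", "_order")
).drop(columns=["purchases"])
print(flat)
```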
Data Preparation
Once you have created your dataset, you will most likely need to clean it, transform values, run calculations, flatten any nested hierarchical data (e.g., from NoSQL sources or JSON files), and otherwise prepare it for final use in visualizations. There are three places where you can prepare your data: in the source, in Qrvey, or in middleware (e.g., with a Lambda function).
Qrvey Composer provides a comprehensive suite of data transformation tools that can handle most common tasks.
There are also a number of reasons to prepare data in the source connection or with a Lambda function: you may prefer working in your own environment, want to automate the transformation logic, keep the workload off Qrvey, or use specialized analytics or AI libraries that are not directly supported within Qrvey Composer.
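For example, a middleware step might look like the following minimal AWS Lambda sketch. The incoming event shape, field names, and downstream push step are assumptions you would adapt to your own trigger and to Qrvey's Post Data API.

```python
# Minimal sketch of middleware-style data preparation in an AWS Lambda function.
# The event shape and field names below are hypothetical.
import json

def handler(event, context):
    records = event.get("records", [])  # hypothetical input payload
    prepared = []
    for r in records:
        prepared.append({
            "order_id": r["id"],
            "customer": r.get("customer", "").strip().title(),     # clean text values
            "revenue": round(r["quantity"] * r["unit_price"], 2),  # calculated column
            "order_date": r["order_date"][:10],                    # normalize to YYYY-MM-DD
        })
    # From here, the prepared records would be pushed to Qrvey
    # (see Data Synchronization below) or written to an intermediate layer.
    return {"statusCode": 200, "body": json.dumps({"prepared": len(prepared)})}
```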
This video provides an overview of how to use dataset transformations to prepare your data for analysis. It describes how to create new columns of data computed from existing columns, how to standardize data across multiple columns, and how to use formulas and other techniques to prepare data for analysis.
Data Synchronization
Key Point: Data synchronization is required to keep data in visualizations up to date in Managed Datasets.
Unique identifier columns, timestamps, and similar fields are essential for proper synchronization, joins, and other operations. For details, please see configure a data sync.
Once you've created your datasets, you will need to Pull or Push the data into Qrvey's Elasticsearch instance to keep it up to date with your source:
- Pull: Qrvey Composer fetches and synchronizes data on a set schedule.
- Push: You decide when data is synchronized with Qrvey by calling a REST API with the data payload included.
The advantage of the Push model is that you can synchronize data according to any business logic you need: you manage data ingestion, which provides flexibility and scalability. During a sync operation, the transformations configured for a dataset are executed over any new and updated records that are retrieved. If cloud storage is the source connection, Qrvey can simply use the file timestamps when performing a sync operation (i.e., a datetime column in your source is not required).
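A push-style sync, in its simplest form, is a scheduled or event-driven script that posts new and updated records to Qrvey. The host, endpoint path, headers, and payload shape in the sketch below are placeholders; see How to Post Data to a Dataset for the actual API contract.

```python
# Sketch of the Push model: your code decides when to send new/updated records.
# The host, endpoint path, headers, and payload shape are placeholders only.
import requests

QRVEY_HOST = "https://your-qrvey-instance.example.com"  # hypothetical
API_KEY = "..."                                          # your platform API key
DATASET_ID = "..."                                       # target dataset

new_records = [
    {"order_id": 104, "product": "Keyboard", "price": 80,
     "updated_at": "2024-05-01T10:00:00Z"},
]

resp = requests.post(
    f"{QRVEY_HOST}/placeholder/datasets/{DATASET_ID}/data",  # placeholder path
    headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
    json=new_records,
    timeout=30,
)
resp.raise_for_status()
```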
This video describes how to keep your datasets up-to-date and in sync with your data sources.
Final Tips
Once you've established your connections, designed your datasets, cleaned and prepared the data for analysis, and set up synchronization, your data pipeline is complete!
Additional Points of Interest:
- Security and access control best practices for managing multi-tenant datasets.
- Optimizing real-time data streaming and event-driven updates for high-frequency data changes.
- How to Post Data to a Dataset