The Data Engine is a breakthrough analytics database designed to overcome the limitations of existing databases and data silos and to truly support the process of visual analysis. It is designed to reflect the capabilities of the latest hardware and the complete memory hierarchy, from disk to L1 cache.
Tableau’s Data Engine shifts the curve between big data and fast analysis.
The Data Engine: analysis of massive data
The evolution of large data
Databases have evolved substantially over the last several years. Legacy databases are focused on disk-resident data and pre-computation. While that allowed for more computation power than before, it had the disadvantage of being slow and requiring users to know what questions they would want to answer (their query workload) before building the database.
More recent databases have found performance benefits by using only the top levels of the memory hierarchy and requiring all data to be memory resident. These “in-memory” solutions made computation much faster, but at the expense of limiting the data size to the size of the available memory.
Goals of the Tableau Data Engine
We designed the Data Engine to:
- Fully utilize current generation hardware to achieve instant query response on hundreds of millions of rows of data on commodity hardware such as a corporate laptop
- Support true ad hoc query by having predictable and consistent query performance for all queries and no requirement for known query workloads or precomputation of aggregates or summaries
- Integrate seamlessly with existing corporate data warehouses and infrastructure
- Not require an entire data set to be memory resident to achieve its performance goals
- Provide very fast load and connection times for data sources
The core Data Engine structure is a column-based representation using compression that supports query execution without decompression. Leveraging novel approaches from computer graphics, its algorithms were carefully designed to fully utilize modern processors, with near-optimal usage of the L1 and L2 caches, minimal intermediate results, and breakthrough techniques for streaming data from disk without loss of throughput. These techniques let the Data Engine avoid the common limitation of requiring a data set to be completely loaded into memory before analysis can begin.
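To illustrate the idea of executing queries without decompression, here is a minimal sketch using run-length encoding (RLE), one common column-store compression scheme. The function names and encoding choice are illustrative assumptions, not the Data Engine’s actual internals.

```python
# Hypothetical sketch: aggregating a run-length-encoded (RLE) column
# without materializing the decompressed values.

def rle_encode(values):
    """Compress a column into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def sum_compressed(runs):
    """SUM computed directly on the runs: cost is O(#runs), not O(#rows)."""
    return sum(value * length for value, length in runs)

column = [5, 5, 5, 3, 3, 9]           # sorted/clustered data compresses well
runs = rle_encode(column)             # [[5, 3], [3, 2], [9, 1]]
print(sum_compressed(runs))           # 30, same as sum(column)
```

Because the aggregate operates on runs rather than rows, sorted or low-cardinality columns can be scanned far faster than their uncompressed size would suggest.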
Data Engine to live connection—and back
The Data Engine is designed to integrate directly with Tableau’s existing “live connection” technology, allowing users to toggle with a single click between a direct connection to the corporate database (issuing highly tuned, platform-specific SQL queries) and querying an extract of that data loaded into the Data Engine, and back, with careful matching of calculation and collation semantics. This integration allows companies to do analysis on samples of data (gigabytes), then redirect the final analysis (or reports) to a massively parallel warehouse such as Teradata running against petabytes of data.
True ad hoc queries
The Data Engine was built around a query language and query optimizer designed to support the queries typical of on-the-fly business analytics. When working with data at the speed of thought, it is common to run complex queries such as very large multi-dimensional filters or complex co-occurrence queries. Existing databases generally perform poorly on these types of queries, whereas the Data Engine processes them instantly.
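For readers unfamiliar with the term, a co-occurrence query asks which entities appear together across rows. The sketch below shows the shape of such a query over an assumed (customer, item) schema; the schema and function are illustrative, not part of the Data Engine.

```python
# Illustrative co-occurrence query over an assumed (customer, item) table:
# which customers bought both item1 and item2?

def cooccurrence(orders, item1, item2):
    bought = {}
    for customer, item in orders:
        bought.setdefault(customer, set()).add(item)
    # keep customers whose item set contains both target items
    return sorted(c for c, items in bought.items() if {item1, item2} <= items)

orders = [("alice", "A"), ("bob", "A"), ("alice", "B"), ("carol", "B")]
print(cooccurrence(orders, "A", "B"))  # ['alice']
```

In SQL this typically becomes a self-join or a grouped HAVING clause, which is exactly the kind of query that strains databases tuned only for simple scans.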
Flexible data model
One of the key differences between the Data Engine and other in-memory solutions is that it can operate on data directly as it is represented in the database on disk. No data modeling or scripting is required to use the Data Engine.
Another powerful aspect of the Data Engine is that, just as with any other relational database, you can define new calculated columns at any time. You might think of this as a form of ad hoc data modeling.
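A calculated column can be sketched as an expression registered against an existing column store and evaluated on demand, with no rewrite of the stored data. The class and method names below are assumptions made for illustration, not the Data Engine’s API.

```python
# Illustrative sketch: a calculated column defined over a column store,
# evaluated lazily when the column is requested.

class ColumnStore:
    def __init__(self, columns):
        self.columns = dict(columns)   # column name -> list of stored values
        self.calculated = {}           # column name -> function of the store

    def add_calculated(self, name, fn):
        """Register a calculated column; no stored data is rewritten."""
        self.calculated[name] = fn

    def column(self, name):
        if name in self.columns:
            return self.columns[name]
        return self.calculated[name](self)   # evaluate on demand

store = ColumnStore({"price": [10.0, 20.0], "qty": [3, 1]})
store.add_calculated(
    "revenue",
    lambda s: [p * q for p, q in zip(s.column("price"), s.column("qty"))])
print(store.column("revenue"))  # [30.0, 20.0]
```

Because the calculation is just an expression over existing columns, it can be defined, changed, or dropped at any point in an analysis session.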
Instant load and connection time
The Data Engine is unique in that once your data is loaded, it has a very fast start-up time: it only needs to read the portion of the data that your queries actually touch. You might have a lot of data in the database that is not relevant to a particular analysis; you will never wait for the Data Engine to read that data.
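The behavior described above can be sketched as lazy column loading: a query pulls in only the columns it references, so untouched columns never incur disk reads. Everything here (the class, the on-disk layout, the tracking list) is an assumption for illustration only.

```python
# Hypothetical sketch of lazy column loading: only the columns a query
# touches are read from "disk" (simulated by a reader callable).

class LazyColumnStore:
    def __init__(self, reader):
        self.reader = reader   # callable: column name -> values from disk
        self.cache = {}
        self.reads = []        # records which columns were actually read

    def column(self, name):
        if name not in self.cache:
            self.reads.append(name)          # a real disk read happens here
            self.cache[name] = self.reader(name)
        return self.cache[name]

    def filter_sum(self, value_col, filter_col, predicate):
        """SUM(value_col) WHERE predicate(filter_col): touches two columns."""
        keep = [predicate(v) for v in self.column(filter_col)]
        return sum(v for v, k in zip(self.column(value_col), keep) if k)

on_disk = {"sales": [100, 200, 50],
           "region": ["W", "E", "W"],
           "notes": ["...", "...", "..."]}
store = LazyColumnStore(on_disk.__getitem__)
total = store.filter_sum("sales", "region", lambda r: r == "W")
print(total)          # 150
print(store.reads)    # ['region', 'sales'] -- "notes" is never read
```

This is why start-up cost tracks the working set of the current analysis rather than the total size of the extract.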