The Apache open source contributions to Hadoop are numerous and cover a broad portion of a reference architecture. It has been some time since we considered foundational low-cost storage and in-place query capabilities, and as we saw in the “Data Lakes” blog posting, many organizations adopted this foundational offering.
Low-cost commodity hardware could be scaled to match a growing appetite for data and the correspondingly larger workloads. In-place queries dramatically reduced the manual programming traditionally required to extract, transform and load data into relational data models. Queries could be constructed using similar tools and techniques, albeit not identical ones and often with some degree of retraining, in a simple (okay, simpler) manner compared with traditional methods. There were even attempts to remap existing business queries to Hadoop equivalents. Some actually worked. But the query experience fell short of the business community's expectations: queries occasionally took hours or even days to complete when their relational equivalents finished in seconds or minutes.
Although the reduced cost of ownership for Hadoop was quite favorable, queries that took several hours to complete were a tremendous setback. The hope of using a centralized data store for operational reporting, and potentially even operational analytics, appeared quite doubtful. Many attempts to combine hybrid components and arrange them in differing orders to address performance concerns failed, some quite miserably. An in-memory approach was surely needed. And one was created.
Apache Spark is a massively scalable, distributed, in-memory parallel query engine that extended the Hadoop footprint by providing in-memory query capability with exceptionally fast response times. In some benchmarks, Apache Spark substantially outperformed relational counterparts in tests of similar size or volume. So we add Spark to our architectural illustration below. The Data Lake concept now has commodity storage, in-place query that still avoids expensive manual processes, and an in-memory query processing component called Spark.
|Key Aspect||Response||NexJ DAI|
|Hadoop EcoSystem||HDFS Low Cost Commodity Storage; Hive In-Place Query||NexJ DAI Integrates with Hadoop Data Lakes as a potential source system|
|Hadoop EcoSystem||Spark In-Memory Query||NexJ DAI Provisions Semantic View Data consumable through a Spark Adapter|
How does your organization use a Data Lake? What 360-degree data views power your analytics? We welcome your thoughts, value your insights and act on your feedback: share below!