Many organizations have established, or are presently establishing, data lakes as a cost-effective means of provisioning operational intelligence query and analytics capabilities directly to the field personnel who need them most, understand the data best, and are most capable of acting on the insights gleaned.  Sounds like an ideal arrangement.  The quality of those insights, however, will be determined by the sources ingested into the lake and by how this ingestion occurs.

User query results from the data lake are often validated against the source operational systems.  The problem is that operational systems change throughout the day, at a volume that depends on workload, while the data lake may only be refreshed nightly or in batches.  Sometimes, due to complexity or service fees, only key portions are refreshed regularly while others are only accessed on demand.  Understanding how your data lake is populated can save you time when validating results.
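
As a rough illustration of that point, the sketch below flags source records that changed after the lake's last refresh, which are the records most likely to disagree during validation. The function name, record keys and timestamps are hypothetical; it assumes you can obtain a last-modified time per source record and a refresh time for the lake.

```python
from datetime import datetime


def stale_keys(lake_refreshed_at, source_last_modified):
    """Return keys of source records modified after the lake's last refresh."""
    return sorted(
        key
        for key, modified_at in source_last_modified.items()
        if modified_at > lake_refreshed_at
    )


# Hypothetical data: a nightly refresh at 02:00 and three source records.
lake_refreshed_at = datetime(2024, 1, 15, 2, 0)
source_last_modified = {
    "cust-001": datetime(2024, 1, 14, 16, 30),  # changed before the refresh
    "cust-002": datetime(2024, 1, 15, 9, 45),   # changed after the refresh
    "cust-003": datetime(2024, 1, 15, 11, 5),   # changed after the refresh
}

changed = stale_keys(lake_refreshed_at, source_last_modified)
print(changed)  # ['cust-002', 'cust-003']
```

A mismatch on one of the flagged keys probably reflects refresh timing rather than a defect in the lake.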

Analytics processes usually begin with data provisioning, and the data at rest stored in the lake is an excellent place to start.  Augmenting this data with warehouse content improves accuracy by leveraging the organization’s efforts to date in cleansing, standardizing and storing historical changes to key entities.  Data warehouse content is often extracted nightly to the data lake for this reason.  Uniting data in motion, available from the enterprise bus or from streaming feeds, with data lake contents usually requires additional components that sit beside the lake.

Whenever data from multiple source systems or from multiple departments is combined with the intent of providing a consolidated view, data quality issues are to be expected.  Traditionally, organizations were empowered to identify perceived errors in the data they consumed and sought to remedy these errors by correcting the source when possible or the transformation process when feasible.  This is not to suggest that a data quality program is required to sustain operational reporting in a data lake.  The problem is, however, further aggravated when uniting data in motion available from the enterprise bus or from streaming feeds.

Unexpected query results from the unified data are a common complaint.  Tracing the origin and identifying the root cause can be a complicated undertaking.  Organizations can simplify this process by ensuring alignment behind a common information model rooted at the attribute level, a common dictionary if you will.  By focusing efforts on improving the alignment of each individual attribute, consensus can be gained between the contributing systems and technology delivery channels.  Further agility can be expected by avoiding structural dependencies that may create multiple editions or versions of each attribute.
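
To make the idea of a common dictionary concrete, here is a minimal sketch. The source system names, field names and sample records are all hypothetical; the point is only that each canonical attribute is mapped once, per source, and that disagreements can then be reported attribute by attribute.

```python
# Hypothetical common dictionary: canonical attribute -> field name per source.
ATTRIBUTE_DICTIONARY = {
    "client_id": {"crm": "CustomerNo", "billing": "acct_holder_id"},
    "client_name": {"crm": "FullName", "billing": "holder_name"},
}


def to_canonical(source, record):
    """Rename a source record's fields to canonical attribute names.

    Fields with no entry in the dictionary are dropped.
    """
    result = {}
    for attribute, fields in ATTRIBUTE_DICTIONARY.items():
        field = fields.get(source)
        if field is not None and field in record:
            result[attribute] = record[field]
    return result


def conflicts(canonical_by_source):
    """Return attributes whose values differ between sources."""
    found = {}
    for attribute in ATTRIBUTE_DICTIONARY:
        values = {
            source: record[attribute]
            for source, record in canonical_by_source.items()
            if attribute in record
        }
        if len(set(values.values())) > 1:
            found[attribute] = values
    return found


crm = to_canonical("crm", {"CustomerNo": "C-17", "FullName": "Ana Diaz"})
billing = to_canonical(
    "billing", {"acct_holder_id": "C-17", "holder_name": "A. Diaz"}
)
mismatches = conflicts({"crm": crm, "billing": billing})
print(mismatches)
```

Here the two systems agree on `client_id` and disagree on `client_name`, so only the name is reported, together with the value each source supplied.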

 

NexJ DAi provides an attribute model that natively unites data at rest and data in motion and presents results using consistent terminology, helping firms to better provision results for both user query and analytic efforts.  Integration occurs at the attribute level and, once resolved, an attribute can be assigned to many views.  Attributes can integrate content from streaming web services and messages or from data at rest such as databases or files.  Attributes can also define a computation, which centralizes calculations for review and lets them be easily modified and tailored to specific needs, all leveraging the attributes already defined.  The NexJ DAi engine natively publishes attributes to their associated views, so changes in operational systems, updates from streaming services, or the latest messages are reflected.
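
The general pattern can be illustrated with a small, generic sketch. This is not the NexJ DAi API; the class, method, attribute and view names are invented for illustration. Each attribute is defined once, may read from data at rest or from the latest message, may be computed from other attributes, and can be assigned to several views.

```python
class AttributeModel:
    """Toy attribute registry: define each attribute once, reuse it in views."""

    def __init__(self):
        self._resolvers = {}
        self._views = {}

    def define(self, name, resolve):
        # resolve is a function that takes a context dict and returns a value
        self._resolvers[name] = resolve

    def define_view(self, view, attribute_names):
        self._views[view] = list(attribute_names)

    def value(self, name, context):
        return self._resolvers[name](context)

    def render(self, view, context):
        # Values are resolved at read time, so a view reflects the latest data.
        return {name: self.value(name, context) for name in self._views[view]}


model = AttributeModel()
# Data at rest: a row previously loaded from a database (hypothetical shape).
model.define("balance", lambda ctx: ctx["db_row"]["balance"])
# Data in motion: the most recent message from a stream (hypothetical shape).
model.define("pending", lambda ctx: ctx["latest_message"]["pending"])
# Computed attribute built from attributes already defined.
model.define(
    "available",
    lambda ctx: model.value("balance", ctx) - model.value("pending", ctx),
)
model.define_view("teller", ["balance", "available"])
model.define_view("advisor", ["available"])

context = {"db_row": {"balance": 1000}, "latest_message": {"pending": 150}}
teller = model.render("teller", context)

context["latest_message"] = {"pending": 400}  # a newer message arrives
advisor = model.render("advisor", context)
print(teller, advisor)
```

Because `available` is defined in one place, both views pick up the same calculation, and the second render reflects the newer message without any change to the view definitions.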

Takeaways To Date

| Key Aspect | Response | NexJ DAi |
|---|---|---|
| Hadoop Ecosystem | HDFS: low-cost commodity storage; Hive: in-place query | NexJ DAi integrates with Hadoop data lakes as a potential source system |
| Hadoop Ecosystem | Spark: in-memory query | NexJ DAi provisions semantic view data consumable through a Spark Adapter |
| Hadoop Data Lake | Batch load | NexJ DAi provisions the most up-to-date data |
| Hadoop Ecosystem | Separate technology for streaming | NexJ DAi addresses both data at rest and in motion |

 

How does your organization use a Data Lake?  What 360-degree data views power your analytics?  We welcome your thoughts, value your insights and action your feedback: share below!