Many organizations quickly recognized the flexibility and cost-effectiveness of Apache Hadoop as an effective delivery vehicle for empowering business users with operational self-service query and analytic capabilities. Many have established, or are presently establishing, data lakes to deliver operational intelligence query and analytics capabilities to the field personnel who need them most, best understand the data, and are most capable of acting on the insights gleaned. However, aligning the usable, credible, and high-quality data elements available in the lake with other streaming feeds poses data quality challenges when using traditional techniques.

Provisioning data for analytics processes usually begins with the data lake contents. Combining, integrating, or merging these results with data in motion, whether from a message bus, a streaming feed, or an intermittent-latency service, often creates data quality challenges. Different and often disparate technologies need to unite content in a single result, which makes it very difficult for the organization to maintain a cohesive understanding of the data and apply it correctly.

Let’s expand on the Customer Lifetime Value blog example and model the results, component by component. Revenue is a great component to begin with. It will undoubtedly start with balances composed of results batched into a data lake, which are then combined with messages from the bus or with streaming micro-service feeds sourced internally, depending on the specific transaction underway. Consider the computed revenue at this point as a resource that can be shared with other analytic projects. The top percentile of this high-revenue segment is interesting to analyze from a trend perspective to determine which traffic patterns created or fostered this result. It’s also interesting from a monitoring perspective to identify when a customer is transacting more or less than usual.
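To make the monitoring idea concrete, here is a minimal Python sketch that compares a customer's latest revenue figure with that customer's own history. The function name `activity_flag`, the z-score approach, and the threshold of 2.0 are all assumptions chosen for illustration; they are not taken from the Customer Lifetime Value example or from any product.

```python
from statistics import mean, stdev
from typing import Sequence


def activity_flag(history: Sequence[float], latest: float, threshold: float = 2.0) -> str:
    """Classify `latest` against a customer's own history using a z-score.

    `history` holds past per-period revenue figures (hypothetical input);
    `threshold` is an assumed cut-off in standard deviations.
    """
    if len(history) < 2:
        return "insufficient history"  # stdev needs at least two points
    centre = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any deviation at all is unusual.
        if latest == centre:
            return "usual"
        return "more than usual" if latest > centre else "less than usual"
    z = (latest - centre) / spread
    if z > threshold:
        return "more than usual"
    if z < -threshold:
        return "less than usual"
    return "usual"
```

A production monitor would likely account for seasonality and trend; the point here is only that the alert is defined against the revenue attribute, not against any particular source system.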

From a semantic model perspective, the customer domain contains an attribute, specifically a computed measure called revenue. Revenue can be considered to be initialized from a data-at-rest element housed in the data lake. Revenue can also be mapped to messages pulled from the enterprise bus; these messages may describe specific transactions that occurred since the last refresh of the lake. Revenue can additionally be mapped to an internal micro-service that contains transactions not available on the bus, usually third-party in nature. Another computed attribute, called the revenue percentile, is calculated from the domain of all mapped customers and their revenue. These attributes can be computed within time-specified ranges to form meaningful variables such as year-to-date revenue, revenue since inception, or the top percentile of customers ranked by revenue this year.
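The revenue attribute and its percentile can be sketched in a few lines of Python. This is an illustration under stated assumptions, not NexJ DAi's implementation: the `Transaction` type, the `revenue` and `revenue_percentile` functions, and the `last_refresh` cut-off used to avoid double counting are all hypothetical names introduced here.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable


@dataclass(frozen=True)
class Transaction:
    """One revenue event; every field name here is illustrative."""
    customer_id: str
    amount: float
    booked_on: date


def revenue(
    lake_balances: Dict[str, float],
    bus_messages: Iterable[Transaction],
    service_transactions: Iterable[Transaction],
    last_refresh: date,
) -> Dict[str, float]:
    """Revenue per customer: lake balance plus transactions after the refresh."""
    totals = dict(lake_balances)  # initialise from data at rest
    for source in (bus_messages, service_transactions):  # data in motion
        for tx in source:
            # Assumption: anything booked on or before the refresh date
            # is already reflected in the lake balance.
            if tx.booked_on > last_refresh:
                totals[tx.customer_id] = totals.get(tx.customer_id, 0.0) + tx.amount
    return totals


def revenue_percentile(totals: Dict[str, float]) -> Dict[str, float]:
    """Percentage of customers whose revenue is at or below each customer's."""
    ordered = sorted(totals.values())
    n = len(ordered)
    return {
        cid: 100.0 * bisect_right(ordered, value) / n
        for cid, value in totals.items()
    }
```

Consumers of `revenue` never see which of the three sources contributed a given amount; they work only with the attribute, which is the property the semantic model is meant to provide.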

With a semantic attribute model, emphasis is now placed on understanding the attribute and its context. Notice the complete lack of structural dependency and technical nomenclature (save the mapping), and the ability to apply the results in a meaningful setting. The attribute can power historical analytics as well as alerting mechanisms, and it can serve multiple consumers without the peril of additional scripting, loading, processing, or technical mapping. To learn more about semantic modelling, check out this online course and materials from the University of Florida, or follow me at NexJ Systems.

NexJ DAi provides an attribute model that natively unites data at rest and data in motion and presents results using consistent terminology, helping firms better provision results for both user query and analytic efforts. With NexJ DAi, integration occurs at the attribute level and, once resolved, an attribute can be assigned to many views. Attributes can integrate content from streaming web services and messages or from data at rest such as databases or files. Attributes can also define a computation, which centralizes calculations for review and lets them be easily modified and tailored to specific needs, all leveraging the attributes already defined. The NexJ DAi engine natively publishes attributes to their associated views, so changes in operational systems, updates from streaming services, and the latest messages are reflected.
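The general pattern described above (resolve an attribute once, assign it to many views, and republish on change) can be sketched in plain Python. To be clear, this is not the NexJ DAi API: the `View`, `Attribute`, and `Computed` classes and all of their methods are hypothetical names invented for this illustration.

```python
from typing import Callable, Dict, List


class View:
    """A consumer-facing view holding the latest value of each attribute."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.values: Dict[str, float] = {}

    def receive(self, attribute: str, value: float) -> None:
        self.values[attribute] = value


class Attribute:
    """A value resolved once and published to every view it is assigned to."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0
        self._views: List[View] = []
        self._dependents: List["Computed"] = []

    def assign_to(self, view: View) -> None:
        self._views.append(view)
        view.receive(self.name, self.value)

    def update(self, value: float) -> None:
        """Called when a source (lake, bus, or service) delivers a new value."""
        self.value = value
        self._publish()

    def _publish(self) -> None:
        for view in self._views:
            view.receive(self.name, self.value)
        for dependent in self._dependents:
            dependent.recompute()


class Computed(Attribute):
    """An attribute defined as a calculation over existing attributes."""

    def __init__(self, name: str, inputs: List[Attribute],
                 formula: Callable[..., float]) -> None:
        super().__init__(name)
        self._inputs = inputs
        self._formula = formula
        for attribute in inputs:
            attribute._dependents.append(self)
        self.recompute()

    def recompute(self) -> None:
        self.value = self._formula(*(a.value for a in self._inputs))
        self._publish()
```

In this sketch, a `Computed` attribute such as net revenue built from `revenue` and `fees` is defined in one place, and every view it is assigned to sees the new value as soon as either input is updated.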

Takeaways To Date

- Hadoop EcoSystem
- Multiple modelling techniques
- NexJ DAI provides a semantic model

| Key Aspect | Response | NexJ DAI |
| --- | --- | --- |
| Hadoop EcoSystem | HDFS Low Cost Commodity Storage; Hive In-Place Query | NexJ DAI integrates with Hadoop data lakes as a potential source system |
| Hadoop EcoSystem | Spark In-Memory Query | NexJ DAI provisions semantic view data consumable through a Spark Adapter |
| Hadoop Data Lake | Batch Load | NexJ DAI provisions the most up to date data |
| Hadoop EcoSystem | Separate technology for streaming | NexJ DAI addresses both data at rest and in motion |

How does your organization use a data lake? What 360-degree data views power your analytics? We welcome your thoughts, value your insights, and act on your feedback: share below!