
Using Semantic Technologies to Manage a Data Lake:
Data Catalog, Provenance and Access Control
Henrik Dibowski1, Stefan Schmid1, Yulia Svetashova1,4, Cory Henson2,
and Tuan Tran3
1Robert Bosch GmbH, Corporate Research, 71272 Renningen, Germany
[email protected]
[email protected]
2 Bosch Research and Technology Center, PA 15222 Pittsburgh, USA
[email protected]
3 Robert Bosch GmbH, Chassis Systems Control, 74232 Abstatt, Germany
[email protected]
4 Karlsruhe Institute of Technology, 76133 Karlsruhe, Germany
Abstract. Data lake architectures enable the storage and retrieval of large
amounts of data across an enterprise. At Robert Bosch GmbH, we have deployed
a data lake for this express purpose, focused on managing automotive sensor data.
Simply centralizing and storing data in a data lake, however, does not magically
solve critical data management challenges such as data findability, accessibility,
interoperability, and re-use. In this paper, we discuss how semantic technologies
can help to resolve such challenges. More specifically, we demonstrate the use of
ontologies and knowledge graphs to provide vital data lake functions including the
cataloging of data, tracking provenance, access control, and semantic search. Of
particular importance is the development of the DCPAC Ontology (Data Catalog,
Provenance, and Access Control) along with its deployment and use within a large
enterprise setting to manage the huge volume and variety of data generated by
current and future vehicles.

Keywords: Ontology, Knowledge Graph, Semantic Data Lake, Semantic
Search, Semantic Layer, Provenance, Access Control

1 Introduction
Robert Bosch GmbH is a large enterprise company that designs and manufactures
automotive components, ensuring the agility, comfort, function and safety of vehicles
and driver assistance systems. Such components range from classical safety products
including airbags and electronic stability control to next generation automated driving
systems. Both the volume and variety of data generated by these systems have been
growing dramatically in the past few years. More specifically, the types of data range
from sensor data – including video, RADAR, LIDAR, and CAN bus signals – to textual
data and metadata about the various projects collecting and using the data within
the company. To handle this complexity, we have developed a holistic architecture for
managing our data within the enterprise – the Bosch Automotive Data Lake.

Copyright 2020 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).

Simply centralizing and storing data in a data lake, however, does not immediately
solve all data management challenges. Specifically, issues of findability, accessibility,
interoperability, and re-use – the four principles of FAIR data¹ – remain unresolved.

To facilitate these principles of FAIR data, we have extended our data lake
architecture with a semantic layer. This semantic layer consists of an ontology and
knowledge graph (KG) that provides a meaningful, semantic description of all resources
in the data lake. The resources include a heterogeneous assortment of documents,
datasets, and databases. The semantic description of these resources, represented as a
knowledge graph, includes information about the content of the resources, their
provenance, and access control permissions. The ability to perform semantic search of
all data in the data lake provides enhanced findability, access, interoperability, and
re-use.

The three primary contributions of this paper include the creation of a DCPAC
Ontology (Data Catalog, Provenance, and Access Control), the development of the
Semantic Data Lake Catalog KG that is conformant to DCPAC, and the application of
the ontology and KG for semantic search and retrieval. In Section 2, we discuss related
work and then introduce the development and structure of the DCPAC Ontology in
Section 3. The creation of a conformant KG and its use within an enterprise setting is
explained in Section 4. Finally, in Section 5 we conclude with an overall summary
and directions for the future.

2 Related Work
In the era of big data, data catalogs emerged as the standard for metadata
management. In the last few years, however, new application areas have appeared and the
volume and richness of metadata required has grown significantly. Data lakes constitute
one such important new application for data catalogs, besides warehouses, master
data repositories, etc. According to Gartner, a data catalog “… maintains an inventory
of data assets through the discovery, description, and organization of datasets². The
catalog provides context to enable data analysts, data scientists, data stewards, and
other data consumers to find and understand a relevant dataset for the purpose of
extracting business value.” [1]

Current vendors offer a wide range of commercial data catalog software. A sample
of such products includes Alation Data Catalog, Atlan Enterprise Data Catalog, Talend
Data Catalog, Collibra Data Catalog, Informatica Enterprise Data Catalog, Microsoft
Azure Data Catalog, and Oracle Cloud Infrastructure Data Catalog; even Google is
joining the market with its Google Data Catalog. To our knowledge, however, none
of these data catalogs uses or supports standard semantic technologies, nor do they
allow for using existing ontology vocabularies. Rather, they are closed, proprietary
systems with their own metadata languages and glossaries.

1 https://www.go-fair.org/fair-principles/
2 Datasets are the files, tables, graphs etc. that applications or data engineers need to find and
access

Anzo by Cambridge Semantics³ is one of a few exceptions, as it is built on the open
data standards OWL, RDF and SPARQL, which makes it simple to leverage rapidly
evolving vocabularies in multiple industries. Anzo has a built-in smart data catalog
functionality that is able to automatically extract the schemas of databases in a data
lake and support the mapping of the schemas to ontology terms. However, the integration
of this data catalog functionality with existing ETL pipelines, as well as the
extensibility of the built-in data catalog ontology based on domain-specific needs, is
limited.

Adding a semantic layer to a data lake is a common approach to developing a
semantic data lake, which has been described in the literature. The use of data catalogs
in this context, however, is still rare. In [2], a data lake using semantic technologies
is presented that can manage datasets produced by sensors or simulation programs in the
manufacturing domain. It comprises a data catalog that provides inventory services
and also implements security mechanisms. Different from our approach, however, this
data catalog is not built using standard semantic technologies, but rather as a simple
file system.

A semantic data lake architecture for autonomous fault management in
software-defined networking environments, with clear similarities to ours (Section 4),
is described in [3]. Another comparable semantic data lake architecture called “Squerall”
is proposed in [4]. This solution proposes distributed query execution techniques and
strategies for querying heterogeneous big data. Both approaches, however, lack a data
catalog and other means of handling provenance or access control.

Our solution differs from existing solutions by proposing a semantic data lake
architecture that incorporates a semantic data catalog, built with standard semantic
technologies, and that addresses provenance and access control for resources in the
data lake. This solution is described in detail in the following sections.

3 Semantic Data Catalog, Provenance and Access Control Layer for Data Lakes
As one of the three primary contributions of this paper, this section describes the
DCPAC ontology (Data Catalog, Provenance, and Access Control). The DCPAC
ontology can be applied to add a semantic layer to a data lake, which provides a
semantic description of the content, provenance, and access control permissions of the
resources in the data lake. This ontology was created by combining several common,
(predominantly) standardized ontology vocabularies and by aligning and extending
them where necessary.

3.1 Ontology Layer Architecture
Fig. 1 shows a layer architecture diagram of the DCPAC ontology, including the ontology
vocabularies used and their import relationships. The DCPAC ontology is shown at
the bottom, and recursively imports all other ontologies. Additionally, it defines
SHACL constraints for validating instance data (ABox).

3 https://www.cambridgesemantics.com/product/data-cataloging/

In the following subsections, the primary ontologies utilized by DCPAC are
described.

[Figure 1: layer diagram connected by "Imports" arrows. Boxes shown, from top to
bottom: SKOS Ontology, DCMI Metadata Terms Ontologies; Data Catalog Ontology (DCAT),
Provenance Ontology (PROV-O); SKOS Tags Ontology (STO), ODRL Ontology, DCAT – PROV-O
Alignment Ontology (DPA), FOAF Ontology; Data Catalog – Provenance – Access Control
Ontology for Data Lakes (DCPAC) with its SHACL Shapes.]
Fig. 1. Layer architecture of the data catalog, provenance and access control (DCPAC)
ontology for data lakes.

Data Catalog (DCAT) Ontology [Prefix: dcat]. The Data Catalog (DCAT) ontology
“… is an RDF vocabulary designed to facilitate interoperability between data catalogs
published on the Web. … DCAT enables a publisher to describe datasets and data
services in a catalog using a standard model and vocabulary that facilitates the
consumption and aggregation of metadata from multiple catalogs. This can increase the
discoverability of datasets and data services. It also makes it possible to have a
decentralized approach to publishing data catalogs and makes federated search for
datasets across catalogs in multiple sites possible using the same query mechanism and
structure.” [5]. DCAT is standardized as a W3C recommendation, with the latest version
from February 2020, and is being developed further by an active community.

The DCAT ontology imports and uses the widely recognized SKOS [6] and DCMI
Metadata Terms [7] ontologies. Its primary purpose in the context of the DCPAC
ontology is the semantic description of the content of resources in a data lake.

Provenance Ontology (PROV-O) [Prefix: prov]. The Provenance Ontology
(PROV-O) “… expresses the PROV Data Model using the OWL2 Web Ontology
Language. It provides a set of classes, properties, and restrictions that can be used to
represent and interchange provenance information generated in different systems and
under different contexts. It can also be specialized to create new classes and
properties to model provenance information for different applications and domains.” [8]

PROV-O is a W3C recommendation from April 2013. Its purpose in the context of
DCPAC is to describe the provenance of the data lake resources. Such provenance
information may include the ownership of resources, how they were created, by which
activity and agent, and from what data they were derived.

Open Digital Rights Language (ODRL) Ontology [Prefix: odrl]. The Open Digital
Rights Language (ODRL) ontology “… is a policy expression language that provides
a flexible and interoperable information model, vocabulary, and encoding mechanisms
for representing statements about the usage of content and services. The ODRL
Vocabulary and Expression describes the terms used in ODRL policies and how to
encode them.” [9]. The latest version 2.2 was published by the W3C in September
2017. In our data lake scenario, ODRL is applied to define access control permissions
for the data lake resources, including who can access a resource and which actions
are permitted, e.g. display, read, modify, delete.

DCAT – PROV-O Alignment (DPA) Ontology [Prefix: dpa]. The DCAT – PROV-O
Alignment (DPA) ontology [10] was created by the W3C Dataset Exchange Working
Group (DXWG) and contains alignment axioms between the DCAT ontology and
PROV-O. Thereby, it enhances the DCAT ontology with the ability to use PROV-O
for expressing advanced provenance information.

The most relevant alignments defined in the DPA ontology are shown in Fig. 2. It
aligns the DCAT ontology classes dcat:CatalogRecord, dcat:Resource
and dcat:Distribution as subclasses of the PROV-O class prov:Entity by
adding corresponding rdfs:subClassOf statements. Thus, all instances of these
classes and their subclasses become instances of prov:Entity, which allows the
usage of all associated PROV-O object properties and classes for modeling provenance
information. This makes the provenance and authorship of data, along with its
evolution over time, trackable in fine detail.
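
The effect of these alignment axioms can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation; a triple store with RDFS reasoning would derive the same result natively. The three subclass statements towards prov:Entity are taken from the text; the remaining ones follow the published DCAT 2 class hierarchy. The dictionary layout and the helper names `superclasses` and `is_instance_of` are assumptions made for illustration.

```python
# Direct rdfs:subClassOf statements.
# The first three are the DPA alignments named in the text; the others
# come from the DCAT 2 vocabulary itself.
SUBCLASS_OF = {
    "dcat:CatalogRecord": {"prov:Entity"},
    "dcat:Resource": {"prov:Entity"},
    "dcat:Distribution": {"prov:Entity"},
    "dcat:Dataset": {"dcat:Resource"},
    "dcat:DataService": {"dcat:Resource"},
    "dcat:Catalog": {"dcat:Dataset"},
}


def superclasses(cls):
    """Return all direct and inferred superclasses of cls (transitive closure)."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_instance_of(instance_type, cls):
    """True if an individual typed as instance_type is also an instance of cls."""
    return instance_type == cls or cls in superclasses(instance_type)


if __name__ == "__main__":
    print(sorted(superclasses("dcat:Catalog")))
    print(is_instance_of("dcat:Catalog", "prov:Entity"))
```

Because a catalog is reachable from prov:Entity through the subclass chain, PROV-O properties such as prov:wasAttributedTo can be attached to it without any further declaration.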

SKOS Tags Ontology (STO). The Simple Knowledge Organization System (SKOS)
is “a common data model for sharing and linking knowledge organization systems”
[6]. We design separate SKOS vocabularies for different domains and use them to
specify the semantics of resources in a data lake, depending on their subject. In
particular, we assign each dataset a set of skos:Concept instances as tags that
semantically describe the content of the data lake resource.

The SKOS vocabularies are domain specific. While defining these vocabularies,
we often reuse terms from existing or newly developed domain ontologies. From the
domain ontologies, we select subsets of classes and individuals that are relevant for
the tasks of retrieval, and define them as instances of skos:Concept. Domain-specific
SKOS vocabularies are iteratively added to the SKOS Tags ontology (STO),
which serves as a generic component of the domain-agnostic architecture of the
DCPAC ontology (see Fig. 1), bridging it with domain-specific ontologies. In the
following sub-section, we describe one such domain ontology (ASO), developed for
the Bosch Automotive Data Lake, and show its relationship to the STO.

[Figure 2: class hierarchy diagram. Classes shown: prov:Entity, dcat:CatalogRecord,
dcat:Resource, dcat:Distribution, dcat:Dataset, dcat:DataService, dcat:Catalog,
dcat:DataDistributionService.]
Fig. 2. DPA ontology: Alignment of DCAT ontology with PROV-O.

Automotive Signal Ontology (ASO) [Prefix: aso]. The primary goal of the Automotive
Signal Ontology [11] is to represent manifold signal types in automotive datasets
and to enable non-trivial queries spanning datasets of different types, formats
and modalities (including radar signals, onboard diagnostics and video data). The use
of this ontology allows non-domain experts to understand and query the data, as well
as to automate the integration of signals from different sources in support of a wide
range of applications and use cases of interest to the automotive industry.

The ASO is an OWL 2 ontology. It borrows concepts from several standard ontologies
and vocabularies, namely the W3C Semantic Sensor Network Ontology (SSN)
[12], the Quantities, Units, Dimensions, and Data Types Ontologies (QUDT) [13] and
the Vehicle Signal and Attribute Ontology (VSSo) [14], generated from the automotive
standard VSS [15]. The ASO conceptualizes a signal by defining several meaningful
relations, including the signal type (e.g. aso:WindowPosition as a subclass
of aso:ObservableSignal), the associated vehicle component (e.g.
aso:Window), the sensor(s) and actuator(s) involved in generating signal data, as
well as the measured physical quantities and units of measure. It also provides terms
to describe the specific details of automotive data collection, e.g. CAN bus data, CAN
frames, messages and signals.

The ASO also defines an associated SKOS vocabulary, where all signals are
defined as instances of skos:Concept. This vocabulary is a part of STO.

Consequently, the ASO has a dual role in our Automotive Data Lake. The typing
of ASO signals as skos:Concept instances provides the means to tag resources in the
data lake in a consistent way and enriches the semantic search capabilities provided by
our DCPAC ontology. In addition, the formal semantics of the ASO itself enables
expressive queries, which go beyond the hierarchical SKOS tag search and make the data
lake truly semantic. An example of such a query is: find all datasets that are tagged
with signals of a certain type (e.g. aso:ObservableSignal) and that are associated with
a specific vehicle component (e.g. aso:Window).

3.2 Data Catalog – Provenance – Access Control (DCPAC) Ontology [Prefix: dcpac]
The DCPAC ontology is our primary contribution to the ontology layer architecture
shown in Fig. 1. It combines, aligns and extends the ontology vocabularies described
in the previous section. The ontology directly imports the ODRL ontology, the DPA
ontology, the FOAF (“Friend of a Friend”) ontology [16] and optionally one or more
STO ontologies, and recursively imports all other shown ontologies. We chose to
re-use properties defined by the FOAF ontology – such as foaf:givenName,
foaf:name, and foaf:mbox – to extend the existing definitions of prov:Agent
and odrl:PartyCollection.

Alignments and Extensions to the Upper Layer Ontologies. The DCPAC ontology
aligns the DCAT ontology with the ODRL ontology by declaring the classes
dcat:Distribution and dcat:Resource to be subclasses of odrl:Asset,
as can be seen in the upper part of Fig. 3. With odrl:Asset representing a resource
or a collection of resources that are the subject of access authorization rules, this
enables the definition of access control permissions for these DCAT classes and
subclasses with the ODRL vocabulary. Furthermore, the DCPAC ontology extends the
DCAT ontology by defining various types of dcat:Dataset subclasses (see Fig. 3),
which allows for distinguishing different types of datasets in a data lake, such as
raw data files, tabular data files, relational database and graph database resources.

Another contribution is the alignment of PROV-O with the ODRL ontology, as
shown in Fig. 4. The PROV-O class prov:Agent is declared as a subclass of
odrl:Party, hence enabling all instances of prov:Agent to undertake roles in
access control permissions. Additionally, the DCPAC ontology defines new subclasses
of prov:Activity, which allow for distinguishing different types of activities
that created (dcpac:GenerationActivity) or modified
(dcpac:ModificationActivity) a data lake resource.

SHACL Constraints. The DCPAC ontology is associated with a SHACL shapes
definition file that defines a comprehensive set of SHACL constraints of type SHACL
Shapes (Node Shapes, Property Shapes) and SPARQL-based constraints [17].

SHACL shapes define cardinalities and type restrictions on properties, and regular
expressions on the allowed values of string datatype properties. One such SHACL
shape, for example, validates that each dcat:Dataset instance has to have exactly
one value of type string defined for the property dct:identifier, and the string
must match the regular expression “^[a-z0-9][a-z0-9_\-]{2,59}$”.
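
To make the identifier rule concrete, the following sketch re-implements the two conditions of that shape (exactly one string value, matching the quoted pattern) in plain Python. It only illustrates the rule; in the described system the check is performed by a SHACL engine, and the function name `is_valid_identifier` and the sample identifiers are hypothetical.

```python
import re

# Regular expression quoted in the text for dct:identifier values:
# a lowercase letter or digit, followed by 2 to 59 lowercase letters,
# digits, underscores or hyphens (3 to 60 characters in total).
IDENTIFIER_PATTERN = re.compile(r"^[a-z0-9][a-z0-9_\-]{2,59}$")


def is_valid_identifier(values):
    """Check both conditions of the shape for one dataset.

    values is the list of dct:identifier values attached to the dataset.
    """
    if len(values) != 1:  # cardinality: exactly one value
        return False
    value = values[0]
    if not isinstance(value, str):  # datatype: string
        return False
    return IDENTIFIER_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    for candidate in (["can_bus_2020-01"], ["ab"], ["Radar_Run"], []):
        print(candidate, is_valid_identifier(candidate))
```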

SPARQL-based constraints have a higher expressivity and can capture complex
dependencies as graph patterns. For the class dcat:Dataset, for example, we defined
a constraint that validates that each instance must have at least one semantic tag
(skos:Concept) attached, and the tags must be members of a
skos:ConceptScheme that is associated with (i.e. enabled for) the catalog the
dataset belongs to (see also next section and Fig. 5).
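
The logic of this tag constraint can be sketched without a triple store. The Python function below is an illustrative stand-in for the SPARQL-based constraint, not its actual definition: the plain dictionaries replace graph patterns, and the function name as well as all `ex:` identifiers are hypothetical.

```python
def validate_dataset_tags(dataset_tags, scheme_of, enabled_schemes):
    """Return a list of violation messages; an empty list means the dataset conforms.

    dataset_tags:    set of skos:Concept identifiers attached to the dataset
    scheme_of:       mapping from concept identifier to its skos:ConceptScheme
    enabled_schemes: set of concept schemes enabled for the dataset's catalog
    """
    if not dataset_tags:
        return ["dataset has no semantic tag"]
    problems = []
    for tag in sorted(dataset_tags):
        # Tags with an unknown scheme are treated as violations as well.
        if scheme_of.get(tag) not in enabled_schemes:
            problems.append(
                f"tag {tag} is not in a concept scheme enabled for the catalog"
            )
    return problems


if __name__ == "__main__":
    scheme_of = {
        "aso:WindowPosition": "ex:AutomotiveSignals",  # hypothetical scheme name
        "ex:Invoice": "ex:FinanceTerms",               # hypothetical concept
    }
    enabled = {"ex:AutomotiveSignals"}
    print(validate_dataset_tags({"aso:WindowPosition"}, scheme_of, enabled))
    print(validate_dataset_tags({"ex:Invoice"}, scheme_of, enabled))
    print(validate_dataset_tags(set(), scheme_of, enabled))
```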

A SHACL engine can process the constraints and validate the consistency of the
KG (ABox)⁴. This improves the integrity and quality of the KG and prevents issues.

[Figure 3: class hierarchy diagram. Classes shown: odrl:Asset, odrl:AssetCollection,
dcat:Resource, dcat:Distribution, dcat:Dataset, dcat:DataService, dcat:Catalog,
dcat:DataDistributionService, dcpac:DatabaseDataset, dcpac:GraphDatabaseDataset,
dcpac:RelationalDatabaseDataset, dcpac:FileBasedDataset, dcpac:FolderBasedDataset,
dcpac:RawDataFile, dcpac:TabularDataFile.]
Fig. 3. DCPAC ontology: Refinement and alignment of DCAT ontology with ODRL ontology.

[Figure 4: class hierarchy diagram. Classes shown: odrl:Party, prov:Agent, prov:Person,
prov:Organization, prov:SoftwareAgent, prov:Activity, dcpac:GenerationActivity,
dcpac:ModificationActivity, odrl:Asset, prov:Entity, dcat:Resource, dcat:Distribution,
dcat:Dataset, dcat:DataService.]
Fig. 4. DCPAC ontology: Refinement and alignment of PROV-O with ODRL ontology.

4 We use Stardog as highly scalable triple store for our Bosch data lake. Stardog supports
SHACL and has an inbuilt SHACL engine. https://www.stardog.com/platform/
3.3 The Core Vocabulary

This section provides an overview and explanation of the core vocabulary of the
DCPAC ontology and the primary imported vocabularies, which were explained in the
previous sections. For the explanation, we refer to Fig. 5, which shows the main
ontology classes as well as the most important object properties and datatype
properties. The stereotypes shown for some of the classes in Fig. 5 contain their
superclasses and hence their alignment to the other ontologies described in the
previous sections. We abstain from showing and explaining additional classes and
properties that are specific to the Bosch Automotive Data Lake in order to maintain
comprehensibility and domain-independence.

DCAT Entities. Let us start with the DCAT ontology classes shown in the center and
bottom left of Fig. 5. The overall data catalog of the data lake is represented by one
instance of class dcat:Catalog. It can contain many dcat:Dataset instances,
one per resource in the data lake, e.g. raw data files, HBase or Hive tables, or
RDF-based knowledge graphs. An instance of class dcat:Distribution models a
specific representation of a dataset, comprising a specific serialization or schematic
arrangement. Different distributions can exist for the same dataset, and are accessible
via a URL (dcat:downloadURL). The data catalog and the datasets can each have
several data distribution services (dcat:DataDistributionService), which
are end-points that provide access. They are accessible via an endpoint URL
(dcat:endpointURL).

PROV-O Entities. The PROV-O classes and properties shown in the top right part of
Fig. 5 are used for modeling the provenance of the data catalog and its datasets (both
declared as subclasses of prov:Entity, see Fig. 2), and for defining the agents (e.g.
person, software agent) they are attributed to (prov:wasAttributedTo) or that
were involved in the activity of creating the dataset. Activities (prov:Activity)
are initiated by agents (prov:wasAssociatedWith), create new datasets
(prov:wasGeneratedBy), have a start and end time, and can use other datasets
as input (prov:used).
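
Chaining prov:wasGeneratedBy and prov:used links yields the lineage of a dataset. The sketch below walks these links over two plain dictionaries; it is a simplified illustration under the assumption that each dataset has at most one generating activity, and the function name and all `ex:` identifiers are hypothetical. In the described system such a traversal would be expressed as a query over the knowledge graph.

```python
def upstream_datasets(dataset, was_generated_by, used):
    """Collect all datasets that `dataset` was directly or indirectly derived from.

    was_generated_by: mapping dataset -> activity (prov:wasGeneratedBy)
    used:             mapping activity -> set of input datasets (prov:used)
    """
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        activity = was_generated_by.get(current)
        # Datasets without a recorded generating activity have no inputs.
        for source in used.get(activity, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    # Hypothetical provenance records, for illustration only.
    was_generated_by = {
        "ex:features": "ex:extractFeatures",
        "ex:cleaned": "ex:cleanSignals",
    }
    used = {
        "ex:extractFeatures": {"ex:cleaned"},
        "ex:cleanSignals": {"ex:rawRecording"},
    }
    print(sorted(upstream_datasets("ex:features", was_generated_by, used)))
```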

ODRL Entities. Access control is defined by classes and properties from the ODRL
ontology. An odrl:Permission can define an access rule for groups of agents
(odrl:PartyCollection) to datasets, their distributions and/or data distribution
services (odrl:target). The allowed actions (odrl:Action), such as display,
read, modify, delete, are defined as skos:Concept instances and attached via the
odrl:action object property.
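
How such permissions could be evaluated is sketched below. The `Permission` class is a simplified stand-in for odrl:Permission, with `assignee` standing for the odrl:PartyCollection the rule applies to. The paper does not describe its evaluation logic, so the function `is_allowed`, the group-membership mapping and all `ex:` identifiers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    """Simplified stand-in for an odrl:Permission."""
    assignee: str        # an odrl:PartyCollection (group of agents)
    target: str          # dataset, distribution or data distribution service
    actions: frozenset   # allowed actions, e.g. {"display", "read"}


def is_allowed(agent, action, resource, permissions, memberships):
    """True if any permission grants `action` on `resource` to a group containing `agent`."""
    groups = memberships.get(agent, set())
    return any(
        p.target == resource and action in p.actions and p.assignee in groups
        for p in permissions
    )


if __name__ == "__main__":
    # Hypothetical policy: one group may display and read one dataset.
    permissions = [
        Permission("ex:RadarTeam", "ex:radarDataset", frozenset({"display", "read"})),
    ]
    memberships = {"ex:alice": {"ex:RadarTeam"}}
    print(is_allowed("ex:alice", "read", "ex:radarDataset", permissions, memberships))
    print(is_allowed("ex:alice", "delete", "ex:radarDataset", permissions, memberships))
    print(is_allowed("ex:bob", "read", "ex:radarDataset", permissions, memberships))
```

The sketch denies by default: an agent without a matching permission gets no access, which is one possible policy and not necessarily the one used in the deployed system.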

Fig. 5. The main classes and properties of the DCPAC ontology (TBox)

SKOS Entities. Finally, SKOS is applied for defining the semantics of the content of a
dataset. To this end, the catalog refers to one or more sets of SKOS concepts
(skos:ConceptScheme) that can be used for semantically tagging datasets. The
defined SKOS tags can be either directly linked to a dataset (dcat:theme), or they
can be bundled and linked as a collection (skos:Collection), which enables the
definition and reuse of (large) sets of SKOS tags. For the Bosch Automotive Data
Lake we use the ASO ontology, as described in Section 3.1.

4 Semantic Data Lake Catalog
At Bosch, we have built an Automotive Data Lake as a centralized platform for the
engineering and testing of our autonomous driving applications [11]. To handle and
manage the complexity and enormous volume of data from all our test drives, we
have developed a holistic architecture, which is shown in Fig. 6 and explained in this
section. Resources collected and stored in the Bosch Automotive Data Lake include a
heterogeneous assortment of documents, datasets and databases. We have created a
semantic layer, the Semantic Data Lake Catalog, which provides a meaningful semantic
description of resources in the data lake and enables semantic search. The
Semantic Data Lake Catalog comprises a knowledge graph that is built with the
vocabulary defined in the DCPAC ontology (see Section 3). The semantic description of
the resources includes information about their content, provenance and access control
permissions. The ability to perform semantic search of all data in the data lake
provides enhanced findability, access, interoperability, and re-use.

[Figure 6: architecture diagram. Components shown: Data Stream, External Databases,
External Files, Data Ingestion Process, Data Processing Tasks, Applications (AI,
Business Applications, Data Analytics, External Applications), Data Lake Catalog
Population Service, Data Lake Catalog Search User Interface, Data Access Engine, and
the Data Lake with its Data Stores, Model Store, Semantic Data Lake Catalog and
Scene KG.]
Fig. 6. Data lake architecture and role of Semantic Data Lake Catalog.

In the sub-sections below, we explain the other components of the Automotive Data
Lake shown in Fig. 6, and clarify the process by which the Semantic Data Lake Catalog
KG is populated and how it is used to query, find and access data assets.

Data Ingestion Process. As illustrated in Fig. 6, external data from different sources
(test fleet vehicles, test benches, data warehouse, etc.) are ingested into the data lake,
either continuously in streams or driven by users and applications. The ingestion
process is responsible for extracting, transforming and loading new data assets into the
data stores. The implementation of this ingestion process was containerized using our
in-house DevOps tool in order to allow scaling out based on the load. It is important
to note that this tool not only provides a mechanism to deploy the ingestion
processes on our on-premises infrastructure, but also wraps the ingestion with a list of
standard operators that are automatically called to report the process information as
well as input and output data to a Kafka⁵ cluster. These reports, published as
standardized Kafka messages, are consumed by the Data Lake Catalog Population Service.

Data Lake Catalog Population Service. Triggered by Kafka messages, the Data
Lake Catalog Population Service reads the available metadata on the ingested data
assets and constructs the relevant semantic data as input for our Semantic Data Lake
Catalog. The Data Lake Catalog Population Service aligns, annotates and enriches the
input data from the Data Ingestion Process with DCPAC concepts before populating
the Semantic Data Lake Catalog⁶. Besides dictionary-based mappings (i.e. input data
schema terms or tags are mapped to dedicated SKOS concepts of our Semantic Data
Lake Catalog taxonomies), the population service also links signal name strings to
relevant automotive signal concepts from the Automotive Signal Ontology (based on
Levenshtein distance). This is a critical part of the knowledge construction process, as
it enables us to search, integrate and process the various data assets based on a shared
conceptualization. The Data Lake Catalog Population Service also records relevant
provenance information, e.g. the activity that has created or modified a data asset,
including information about the source asset as well as begin and end time.
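
The Levenshtein-based linking step can be sketched as follows. The paper names only the distance measure, so everything else here is an assumption for illustration: the normalization (lowercasing, dropping underscores), the threshold `max_distance`, the function names, and the concept aso:VehicleSpeed. Only aso:WindowPosition is taken from the text.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def link_signal(signal_name, concept_labels, max_distance=3):
    """Return the closest concept for signal_name, or None if nothing is close enough."""
    normalized = signal_name.lower().replace("_", "")
    best_concept, best_distance = None, max_distance + 1
    for concept, label in concept_labels.items():
        distance = levenshtein(normalized, label.lower())
        if distance < best_distance:
            best_concept, best_distance = concept, distance
    return best_concept


if __name__ == "__main__":
    concept_labels = {
        "aso:WindowPosition": "WindowPosition",
        "aso:VehicleSpeed": "VehicleSpeed",  # hypothetical concept
    }
    print(link_signal("window_positon", concept_labels))  # misspelled signal name
    print(link_signal("EngineTorque", concept_labels))    # no close concept
```

A threshold keeps unrelated signal names from being linked to the nearest concept by accident; a suitable value depends on the naming conventions of the ingested data.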

Data Access Engine. The Data Access Engine (DAE) provides applications with a
uniform query interface and access to resources (e.g. files, tables, knowledge graphs)
in the data lake based on a common HTTP-based API and endpoint. At this stage, the
DAE supports data-type queries (i.e. in HBase⁷ or Hive⁸/Impala⁹ tables), knowledge
queries (i.e. SPARQL queries of knowledge graphs) and task requests (i.e. Oozie¹⁰
jobs in the Hadoop¹¹ cluster). The DAE secures and hides the details of the underlying
5 Apache Kafka: A distributed streaming platform, https://kafka.apache.org/
6 We use Stardog for storing and processing the semantic layer as a knowledge graph.
7 Apache HBase: Distributed big data store that runs on Hadoop, https://hbase.apache.org/
8 Apache Hive: Data warehouse software for large distributed datasets, https://hive.apache.org/
9 Apache Impala: Native analytic database for Apache Hadoop, https://impala.apache.org/
10 Apache Oozie: Workflow scheduler for Hadoop, https://oozie.apache.org/
11 Apache Hadoop: Scalable, distributed computing software, https://hadoop.apache.org/
