File Name: SSWS2020_paper5.pdf
Using Semantic Technologies to Manage a Data Lake: Data Catalog, Provenance and Access Control

Henrik Dibowski1, Stefan Schmid1, Yulia Svetashova1,4, Cory Henson2, and Tuan Tran3

1 Robert Bosch GmbH, Corporate Research, 71272 Renningen, Germany [email protected] [email protected]
2 Bosch Research and Technology Center, PA 15222 Pittsburgh, USA [email protected]
3 Robert Bosch GmbH, Chassis Systems Control, 74232 Abstatt, Germany [email protected]
4 Karlsruhe Institute of Technology, 76133 Karlsruhe, Germany

Abstract. Data lake architectures enable the storage and retrieval of large amounts of data across an enterprise. At Robert Bosch GmbH, we have deployed a data lake for this express purpose, focused on managing automotive sensor data. Simply centralizing and storing data in a data lake, however, does not magically solve critical data management challenges such as data findability, accessibility, interoperability, and re-use. In this paper, we discuss how semantic technologies can help to resolve such challenges. More specifically, we demonstrate the use of ontologies and knowledge graphs to provide vital data lake functions, including the cataloging of data, tracking provenance, access control, and of course semantic search. Of particular importance is the development of the DCPAC Ontology (Data Catalog, Provenance, and Access Control) along with its deployment and use within a large enterprise setting to manage the huge volume and variety of data generated by current and future vehicles.
Keywords: Ontology, Knowledge Graph, Semantic Data Lake, Semantic Search, Semantic Layer, Provenance, Access Control
1 Introduction

Robert Bosch GmbH is a large enterprise company that designs and manufactures automotive components, ensuring the agility, comfort, function and safety of vehicles and driver assistance systems. Such components range from classical safety products, including airbags and electronic stability control, to next-generation automated driving systems. Both the volume and variety of data generated by these systems have been growing dramatically in the past few years. More specifically, the types of data range from sensor data – including video, RADAR, LIDAR, and CAN bus signals – to textual data and metadata about the various projects collecting and using the data within the company. To handle this complexity, we have developed a holistic architecture for managing our data within the enterprise – the Bosch Automotive Data Lake.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Simply centralizing and storing data in a data lake, however, does not immediately solve all data management challenges. Specifically, issues of findability, accessibility, interoperability, and re-use – the four principles of FAIR data1 – remain unresolved.
To facilitate these principles of FAIR data, we have extended our data lake architecture with a semantic layer. This semantic layer consists of an ontology and knowledge graph (KG) that provides a meaningful, semantic description of all resources in the data lake. The resources include a heterogeneous assortment of documents, datasets, and databases. The semantic description of these resources, represented as a knowledge graph, includes information about the content of the resources, their provenance, and access control permissions. The ability to perform semantic search of all data in the data lake provides enhanced findability, access, interoperability, and re-use.
The three primary contributions of this paper include the creation of the DCPAC Ontology (Data Catalog, Provenance, and Access Control), the development of the Semantic Data Lake Catalog KG that is conformant to DCPAC, and the application of the ontology and KG for semantic search and retrieval. In Section 2, we discuss related work and then introduce the development and structure of the DCPAC Ontology in Section 3. The creation of a conformant KG and its use within an enterprise setting is explained in Section 4. Finally, in Section 5 we conclude with an overall summary and directions for the future.
2 Related Work

In the era of big data, data catalogs emerged as the standard for metadata management. In the last few years, however, new application areas have appeared and the volume and richness of metadata required has grown significantly. Data lakes constitute one such important new application for data catalogs, besides warehouses, master data repositories, etc. According to Gartner, a data catalog "… maintains an inventory of data assets through the discovery, description, and organization of datasets2. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value." [1]
Current vendors offer a wide range of commercial data catalog software. A sample of such vendors includes Alation Data Catalog, Atlan Enterprise Data Catalog, Talend Data Catalog, Collibra Data Catalog, Informatica Enterprise Data Catalog, Microsoft Azure Data Catalog, and Oracle Cloud Infrastructure Data Catalog; even Google is joining the market with its Google Data Catalog. To our knowledge, however, none of these data catalogs uses or supports standard semantic technologies, nor do they allow for using existing ontology vocabularies. Rather, they are closed, proprietary systems with their own metadata languages and glossaries.
1 https://www.go-fair.org/fair-principles/
2 Datasets are the files, tables, graphs etc. that applications or data engineers need to find and access.
Anzo by Cambridge Semantics3 is one of a few exceptions, as it is built from the open data standards OWL, RDF and SPARQL, which makes it simple to leverage rapidly evolving vocabularies in multiple industries. Anzo has a built-in smart data catalog functionality that is able to automatically extract the schemas of databases in a data lake and support the mapping of the schemas to ontology terms. But the integration of this data catalog functionality with existing ETL pipelines, as well as the extensibility of the built-in data catalog ontology based on domain-specific needs, is limited.
Adding a semantic layer to a data lake is a common approach to developing a semantic data lake, which has been described in the literature. The use of data catalogs in this context, however, is still rare. In [2], a data lake using semantic technologies is presented that can manage datasets produced by sensors or simulation programs in the manufacturing domain. It comprises a data catalog that provides inventory services and also implements security mechanisms. Different from our approach, however, this data catalog is not built using standard semantic technologies, but rather as a simple file system.
A semantic data lake architecture for autonomous fault management in software-defined networking environments, with clear similarities to ours (Section 4), is described in [3]. Another comparable semantic data lake architecture called "Squerall" is proposed in [4]. This solution proposes distributed query execution techniques and strategies for querying heterogeneous big data. Both approaches, however, lack a data catalog and other means of handling provenance or access control.
Our solution differs from existing solutions by proposing a semantic data lake architecture that incorporates a semantic data catalog, built with standard semantic technologies, and that addresses provenance and access control for resources in the data lake. This solution is described in detail in the following sections.
3 Semantic Data Catalog, Provenance and Access Control Layer for Data Lakes

As one of the three primary contributions of this paper, this section describes the DCPAC ontology (Data Catalog, Provenance, and Access Control). The DCPAC ontology can be applied for adding a semantic layer to a data lake, which provides semantic description of the content, provenance, and access control permissions of the resources in a data lake. This ontology was created by combining several common, (predominantly) standardized ontology vocabularies and by aligning and extending them where necessary.
3.1 Ontology Layer Architecture

Fig. 1 shows a layer architecture diagram of the DCPAC ontology, including the ontology vocabularies used and their import relationships. The DCPAC ontology is shown at the bottom, and recursively imports all other ontologies. Additionally, it defines SHACL constraints for validating instance data (ABox).

3 https://www.cambridgesemantics.com/product/data-cataloging/
In the following subsections, the primary ontologies utilized by DCPAC are described.
Fig. 1. Layer architecture of the data catalog, provenance and access control (DCPAC) ontology for data lakes. (The diagram shows the import hierarchy: DCAT imports SKOS and the DCMI Metadata Terms ontologies; the DPA ontology imports DCAT and PROV-O; DCPAC imports the SKOS Tags Ontology (STO), ODRL, DPA and FOAF, and is accompanied by SHACL shapes.)
Data Catalog (DCAT) Ontology [Prefix: dcat]. The Data Catalog (DCAT) ontology "… is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. … DCAT enables a publisher to describe datasets and data services in a catalog using a standard model and vocabulary that facilitates the consumption and aggregation of metadata from multiple catalogs. This can increase the discoverability of datasets and data services. It also makes it possible to have a decentralized approach to publishing data catalogs and makes federated search for datasets across catalogs in multiple sites possible using the same query mechanism and structure." [5]. DCAT is standardized as a W3C recommendation, with the latest version from February 2020, and is being developed further by an active community.
The DCAT ontology imports and uses the widely recognized SKOS [6] and DCMI Metadata Terms [7] ontologies. Its primary purpose in the context of the DCPAC ontology is the semantic description of the content of resources in a data lake.
Provenance Ontology (PROV-O) [Prefix: prov]. The Provenance Ontology (PROV-O) "… expresses the PROV Data Model using the OWL2 Web Ontology Language. It provides a set of classes, properties, and restrictions that can be used to represent and interchange provenance information generated in different systems and under different contexts. It can also be specialized to create new classes and properties to model provenance information for different applications and domains." [8]
PROV-O is a W3C recommendation from April 2013. Its purpose in the context of DCPAC is to describe the provenance of the data lake resources. Such provenance information may include the ownership of resources, how they were created, by which activity and agent, and from what data they were derived.
Open Digital Rights Language (ODRL) Ontology [Prefix: odrl]. The Open Digital Rights Language (ODRL) ontology "… is a policy expression language that provides a flexible and interoperable information model, vocabulary, and encoding mechanisms for representing statements about the usage of content and services. The ODRL Vocabulary and Expression describes the terms used in ODRL policies and how to encode them." [9]. The latest version 2.2 was published by the W3C in September 2017. In our data lake scenario, ODRL is applied to define access control permissions for the data lake resources, including who can access a resource and which actions are permitted, i.e. display, read, modify, delete.
DCAT – PROV-O Alignment (DPA) Ontology [Prefix: dpa]. The DCAT – PROV-O Alignment (DPA) ontology [10] was created by the W3C Dataset Exchange Working Group (DXWG) and contains alignment axioms between the DCAT ontology and PROV-O. Thereby, it enhances the DCAT ontology with the ability to use PROV-O for expressing advanced provenance information.
The most relevant alignments defined in the DPA ontology are shown in Fig. 2. It aligns the DCAT ontology classes dcat:CatalogRecord, dcat:Resource and dcat:Distribution as subclasses of the PROV-O class prov:Entity by adding corresponding rdfs:subClassOf statements. Thus, all instances of these classes and their subclasses become instances of prov:Entity, which allows the usage of all associated PROV-O object properties and classes for modeling provenance information. This makes the provenance and authorship of data, along with its evolution over time, trackable in every detail.
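Expressed in Turtle, the core DPA alignment amounts to three rdfs:subClassOf statements (a sketch of the axioms described above):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# DPA: every catalog record, resource and distribution is a PROV entity
dcat:CatalogRecord rdfs:subClassOf prov:Entity .
dcat:Resource      rdfs:subClassOf prov:Entity .
dcat:Distribution  rdfs:subClassOf prov:Entity .
```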
SKOS Tags Ontology (STO). The Simple Knowledge Organization System (SKOS) is "a common data model for sharing and linking knowledge organization systems" [6]. We design separate SKOS vocabularies for different domains and use them to specify the semantics of resources in a data lake, depending on their subject. In particular, we assign each dataset a set of skos:Concepts as tags that provide a semantic description of the content of a data lake resource.
The SKOS vocabularies are domain-specific. While defining these vocabularies, we often reuse terms from existing or newly developed domain ontologies. From the domain ontologies, we select subsets of classes and individuals that are relevant for the tasks of retrieval, and define them as instances of skos:Concept. Domain-specific SKOS vocabularies are iteratively added to the SKOS Tags Ontology (STO), which serves as a generic component of the domain-agnostic architecture of the DCPAC ontology (see Fig. 1), bridging it with domain-specific ontologies. In the following sub-section, we describe one such domain ontology (ASO), developed for the Bosch Automotive Data Lake, and show its relationship to the STO.
Fig. 2. DPA ontology: Alignment of the DCAT ontology with PROV-O. (The diagram shows dcat:CatalogRecord, dcat:Resource and dcat:Distribution as subclasses of prov:Entity, with dcat:Dataset, dcat:DataService and their subclasses below dcat:Resource.)
Automotive Signal Ontology (ASO) [Prefix: aso]. The primary goal of the Automotive Signal Ontology [11] is to represent manifold signal types in automotive datasets and to enable non-trivial queries spanning over datasets of different types, formats and modalities (including radar signals, onboard diagnostics and video data). The use of this ontology allows non-domain experts to understand and query the data, as well as to automate the integration of signals from different sources in support of a wide range of applications and use-cases of interest to the automotive industry.
The ASO is an OWL 2 ontology. It borrows concepts from several standard ontologies and vocabularies, namely the W3C Semantic Sensor Network Ontology (SSN) [12], the Quantities, Units, Dimensions, and Data Types Ontologies (QUDT) [13] and the Vehicle Signal and Attribute Ontology (VSSo) [14], generated from the automotive standard VSS [15]. The ASO conceptualizes a signal by defining several meaningful relations, including the signal type (e.g. aso:WindowPosition as a subclass of aso:ObservableSignal), the associated vehicle component (e.g. aso:Window), the sensor(s) and actuator(s) involved in generating signal data, as well as the measured physical quantities and units-of-measure. It also provides terms to describe the specific details of automotive data collection, e.g. CAN bus data, CAN frames, messages and signals.
The ASO also defines an associated SKOS vocabulary, where all signals are defined as instances of skos:Concept. This vocabulary is a part of the STO.
Consequently, the ASO has a dual role in our Automotive Data Lake. The typing of ASO signals as skos:Concepts provides the means to tag resources in the data lake in a consistent way and enriches the semantic search capabilities provided by our DCPAC ontology. In addition, the formal semantics of the ASO itself enables expressive queries, which go beyond the hierarchical SKOS tag search and make the data lake truly semantic. For example: find all datasets that are tagged with signals of a certain type (e.g. aso:ObservableSignal) and that are associated with a specific vehicle component (e.g. aso:Window).
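A query of this kind could be sketched in SPARQL as follows. The ASO namespace IRI and the signal-to-component property (here aso:isSignalOf) are hypothetical illustrations, not the actual ASO terms; the sketch assumes that datasets are tagged via dcat:theme with the (punned) signal classes:

```sparql
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX aso:  <http://example.bosch.com/aso#>   # hypothetical namespace IRI

# Find all datasets tagged with an observable signal of a window component
SELECT DISTINCT ?dataset WHERE {
  ?dataset a dcat:Dataset ;
           dcat:theme ?signal .
  ?signal rdfs:subClassOf* aso:ObservableSignal ;
          aso:isSignalOf aso:Window .          # hypothetical property
}
```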
3.2 Data Catalog – Provenance – Access Control (DCPAC) Ontology [Prefix: dcpac]

The DCPAC ontology is our primary contribution to the ontology layer architecture shown in Fig. 1. It combines, aligns and extends the ontology vocabularies described in the previous section. The ontology directly imports the ODRL ontology, the DPA ontology, the FOAF ("Friend of a Friend") ontology [16] and optionally one or more STO ontologies, and recursively imports all other shown ontologies. We chose to reuse properties defined by the FOAF ontology – such as foaf:givenName, foaf:name, and foaf:mbox – to extend the existing definitions of prov:Agent and odrl:PartyCollection.
Alignments and Extensions to the Upper Layer Ontologies. The DCPAC ontology aligns the DCAT ontology with the ODRL ontology by declaring the classes dcat:Distribution and dcat:Resource to be subclasses of odrl:Asset, as can be seen in the upper part of Fig. 3. With odrl:Asset representing a resource or a collection of resources that are the subject of access authorization rules, this enables the definition of access control permissions for these DCAT classes and subclasses with the ODRL vocabulary. Furthermore, the DCPAC ontology extends the DCAT ontology by defining various types of dcat:Dataset subclasses (see Fig. 3), which allows for distinguishing different types of datasets in a data lake, such as raw data files, tabular data files, relational database and graph database resources.
Another contribution is the alignment of PROV-O with the ODRL ontology, as shown in Fig. 4. The PROV-O class prov:Agent is declared as a subclass of odrl:Party, hence enabling all instances of prov:Agent to undertake roles in access control permissions. Additionally, the DCPAC ontology defines new subclasses of prov:Activity, which allow for distinguishing different types of activities that created (dcpac:GenerationActivity) or modified (dcpac:ModificationActivity) a data lake resource.
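In Turtle, these alignments and extensions could be sketched roughly as follows (the dcpac: namespace IRI is a hypothetical placeholder):

```turtle
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix odrl:  <http://www.w3.org/ns/odrl/2/> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcpac: <http://example.bosch.com/dcpac#> .   # hypothetical namespace IRI

# Agents can play roles in ODRL access control rules
prov:Agent rdfs:subClassOf odrl:Party .

# Activity subtypes for creating and modifying data lake resources
dcpac:GenerationActivity   rdfs:subClassOf prov:Activity .
dcpac:ModificationActivity rdfs:subClassOf prov:Activity .
```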
SHACL Constraints. The DCPAC ontology is associated with a SHACL shapes definition file that defines a comprehensive set of SHACL constraints: SHACL shapes (node shapes, property shapes) and SPARQL-based constraints [17]. SHACL shapes define cardinalities and type restrictions on properties, and regular expressions on the allowed values of string datatype properties. One such SHACL shape, for example, validates that each dcat:Dataset instance has to have exactly one value of type string defined for the property dct:identifier, and the string must match the regular expression "^[a-z0-9][a-z0-9_\-]{2,59}$".
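The identifier constraint just described could be written as the following SHACL node shape (the shape name and dcpac: namespace IRI are hypothetical; note the doubled backslash required by Turtle string escaping):

```turtle
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix dcpac: <http://example.bosch.com/dcpac#> .   # hypothetical namespace IRI

# Every dcat:Dataset must have exactly one well-formed dct:identifier
dcpac:DatasetIdentifierShape
    a sh:NodeShape ;
    sh:targetClass dcat:Dataset ;
    sh:property [
        sh:path dct:identifier ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:pattern "^[a-z0-9][a-z0-9_\\-]{2,59}$" ;
    ] .
```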
SPARQL-based constraints have a higher expressivity and can capture complex dependencies as graph patterns. For the class dcat:Dataset, for example, we defined a constraint that validates that each instance must have at least one semantic tag (skos:Concept) attached, and the tags must be members of a skos:ConceptScheme that is associated with (i.e. enabled for) the catalog the dataset belongs to (see also the next section and Fig. 5).
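A SPARQL-based constraint of this kind could be sketched as follows. The property linking a catalog to its enabled concept schemes (here dcpac:enabledConceptScheme) and the shape name are hypothetical illustrations, not the actual DCPAC terms; full IRIs are used in the query string to keep the sketch self-contained:

```turtle
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix dcpac: <http://example.bosch.com/dcpac#> .   # hypothetical namespace IRI

# A dataset must carry at least one tag from a scheme enabled for its catalog
dcpac:DatasetTagShape
    a sh:NodeShape ;
    sh:targetClass dcat:Dataset ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "Dataset lacks a tag from an enabled concept scheme." ;
        sh:select """
            SELECT $this WHERE {
                FILTER NOT EXISTS {
                    ?catalog <http://www.w3.org/ns/dcat#dataset> $this ;
                             <http://example.bosch.com/dcpac#enabledConceptScheme> ?scheme .
                    $this <http://www.w3.org/ns/dcat#theme> ?tag .
                    ?tag <http://www.w3.org/2004/02/skos/core#inScheme> ?scheme .
                }
            }
        """ ;
    ] .
```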
A SHACL engine can process the constraints and validate the consistency of the KG (ABox)4. That improves the integrity and quality of the KG and prevents issues.
Fig. 3. DCPAC ontology: Refinement and alignment of the DCAT ontology with the ODRL ontology. (The diagram shows dcat:Resource and dcat:Distribution as subclasses of odrl:Asset, and the new dataset subclasses dcpac:DatabaseDataset, dcpac:FileBasedDataset and dcpac:FolderBasedDataset with their refinements dcpac:GraphDatabaseDataset, dcpac:RelationalDatabaseDataset, dcpac:RawDataFile and dcpac:TabularDataFile.)
Fig. 4. DCPAC ontology: Refinement and alignment of PROV-O with the ODRL ontology. (The diagram shows prov:Agent, with its subclasses prov:Person, prov:Organization and prov:SoftwareAgent, declared as a subclass of odrl:Party, and dcpac:GenerationActivity and dcpac:ModificationActivity as subclasses of prov:Activity.)
4 We use Stardog as a highly scalable triple store for our Bosch data lake. Stardog supports SHACL and has an inbuilt SHACL engine. https://www.stardog.com/platform/

3.3 The Core Vocabulary
This section provides an overview and explanation of the core vocabulary of the DCPAC ontology and the primary imported vocabularies, which were explained in the previous sections. For the explanation, we refer to Fig. 5, which shows the main ontology classes as well as the most important object properties and datatype properties. The stereotypes shown for some of the classes in Fig. 5 contain their superclasses and hence their alignment to the other ontologies described in the previous sections. We abstain from showing and explaining additional classes and properties that are specific to the Bosch Automotive Data Lake in order to maintain comprehensibility and domain-independence.
DCAT Entities. Let us start with the DCAT ontology classes shown in the center and bottom left of Fig. 5. The overall data catalog of the data lake is represented by one instance of class dcat:Catalog. It can contain many dcat:Dataset instances, one per resource in the data lake, e.g. raw data files, HBase or Hive tables, or RDF-based knowledge graphs. An instance of class dcat:Distribution models a specific representation of a dataset, comprising a specific serialization or schematic arrangement. Different distributions can exist for the same dataset, and are accessible via a URL (dcat:downloadURL). The data catalog and the datasets can each have several data distribution services (dcat:DataDistributionService), which are end-points that provide access. They are accessible via an endpoint URL (dcat:endpointURL).
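As a minimal instance-data sketch (all ex: instances and their identifiers are hypothetical), a catalog entry might look like this:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.bosch.com/catalog#> .   # hypothetical instances

ex:dataLakeCatalog a dcat:Catalog ;
    dcat:dataset ex:testDrive42 .

ex:testDrive42 a dcat:Dataset ;
    dct:identifier "test_drive_42" ;
    dcat:distribution ex:testDrive42csv .

ex:testDrive42csv a dcat:Distribution ;
    dct:format "text/csv" ;
    dcat:downloadURL <https://datalake.example.com/test_drive_42.csv> .
```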
PROV-O Entities. The PROV-O classes and properties shown in the top right part of Fig. 5 are used for modeling the provenance of the data catalog and its datasets (both declared as subclasses of prov:Entity, see Fig. 2), and for defining agents (e.g. person, software agent) they are attributed to (prov:wasAttributedTo) or that were involved in the activity of creating the dataset. Activities (prov:Activity) are initiated by agents (prov:wasAssociatedWith), create new datasets (prov:wasGeneratedBy), have a start and end time, and can use other datasets as input (prov:used).
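These properties can be illustrated with a small provenance sketch (all ex: instances, names and timestamps are hypothetical):

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.bosch.com/catalog#> .   # hypothetical instances

ex:testDrive42 prov:wasAttributedTo ex:ingestionService ;
    prov:wasGeneratedBy ex:ingestionRun1 .

ex:ingestionRun1 a prov:Activity ;
    prov:wasAssociatedWith ex:ingestionService ;
    prov:used ex:rawCanLog ;
    prov:startedAtTime "2020-06-15T08:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime   "2020-06-15T08:12:30Z"^^xsd:dateTime .

ex:ingestionService a prov:SoftwareAgent .
```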
ODRL Entities. Access control is defined by classes and properties from the ODRL ontology. An odrl:Permission can define an access rule for groups of agents (odrl:PartyCollection) to datasets, their distributions and/or data distribution services (odrl:target). The allowed actions (odrl:Action), such as display, read, modify, delete, are defined as skos:Concept and attached via the odrl:action object property.
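A permission granting a project group read access to a dataset could be sketched as follows (the ex: instances are hypothetical; odrl:read is taken from the standard ODRL action vocabulary):

```turtle
@prefix odrl: <http://www.w3.org/ns/odrl/2/> .
@prefix ex:   <http://example.bosch.com/catalog#> .   # hypothetical instances

# Members of a project group may read one dataset
ex:projectAGroup a odrl:PartyCollection .

ex:readPermission a odrl:Permission ;
    odrl:assignee ex:projectAGroup ;
    odrl:target   ex:testDrive42 ;
    odrl:action   odrl:read .
```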
Fig. 5. The main classes and properties of the DCPAC ontology (TBox).
SKOS Entities. Finally, SKOS is applied for defining the semantics of the content of a dataset. Therefore, the catalog refers to one or more sets of SKOS concepts (skos:ConceptScheme) that can be used for semantically tagging datasets. The defined SKOS tags can be either directly linked to a dataset (dcat:theme), or they can be bundled and linked as a collection (skos:Collection), which enables the definition and reuse of (large) sets of SKOS tags. For the Bosch Automotive Data Lake we use the ASO ontology, as described in Section 3.1.
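Tagging a dataset with an ASO signal concept could then be sketched as (the aso: and ex: IRIs are hypothetical placeholders; dcat:themeTaxonomy is the standard DCAT property linking a catalog to its concept schemes):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix aso:  <http://example.bosch.com/aso#> .       # hypothetical namespace IRI
@prefix ex:   <http://example.bosch.com/catalog#> .   # hypothetical instances

ex:dataLakeCatalog dcat:themeTaxonomy ex:asoConceptScheme .

ex:asoConceptScheme a skos:ConceptScheme .

aso:WindowPosition a skos:Concept ;
    skos:inScheme ex:asoConceptScheme .

# Tag a dataset directly with a signal concept
ex:testDrive42 dcat:theme aso:WindowPosition .
```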
4 Semantic Data Lake Catalog

At Bosch, we have built an Automotive Data Lake as a centralized platform for the engineering and testing of our autonomous driving applications [11]. To handle and manage the complexity and enormous volume of data from all our test drives, we have developed a holistic architecture, which is shown in Fig. 6 and explained in this section. Resources collected and stored in the Bosch Automotive Data Lake include a heterogeneous assortment of documents, datasets and databases. We have created a semantic layer, the Semantic Data Lake Catalog, which provides a meaningful semantic description of resources in the data lake and enables semantic search. The Semantic Data Lake Catalog comprises a knowledge graph that is built with the vocabulary defined in the DCPAC ontology (see Section 3). The semantic description of the resources includes information about their content, provenance and access control permissions. The ability to perform semantic search of all data in the data lake provides enhanced findability, access, interoperability, and re-use.
Fig. 6. Data lake architecture and role of the Semantic Data Lake Catalog. (The diagram shows external databases and files feeding a data ingestion process, the Data Lake Catalog Population Service, the Data Lake Catalog Search User Interface, the Data Access Engine, and the data stores, including a model store, a scene KG and the Semantic Data Lake Catalog, with applications for AI, business process tasks and data analytics on top.)
In the sub-sections below, we explain the other components of the Automotive Data Lake shown in Fig. 6, and clarify the process by which the Semantic Data Lake Catalog KG is populated and how it is used to query, find and access data assets.
Data Ingestion Process. As illustrated in Fig. 6, external data from different sources (test fleet vehicles, test benches, data warehouse, etc.) are ingested into the data lake, either continuously in streams or driven by users and applications. The ingestion process is responsible for extracting, transforming and loading new data assets into the data stores. The implementation of this ingestion process was containerized using our in-house DevOps tool in order to allow scaling out based on the load. It is important to note that this tool not only provides a mechanism to deploy the ingestion processes on our on-premises infrastructure, it also wraps the ingestion with a list of standard operators that are automatically called to report the process information as well as input and output data to a Kafka5 cluster. These reports, published as standardized Kafka messages, are consumed by the Data Lake Catalog Population Service.
Data Lake Catalog Population Service. Triggered by Kafka messages, the Data Lake Catalog Population Service reads the available metadata on the ingested data assets and constructs the relevant semantic data as input for our Semantic Data Lake Catalog. The Data Lake Catalog Population Service aligns, annotates and enriches the input data from the Data Ingestion Process with DCPAC concepts before populating the Semantic Data Lake Catalog6. Besides dictionary-based mappings (i.e. input data schema terms or tags are mapped to dedicated SKOS concepts of our Semantic Data Lake Catalog taxonomies), the population service also links signal name strings to relevant automotive signal concepts from the Automotive Signal Ontology (based on Levenshtein distance). This is a critical part of the knowledge construction process, as it enables us to search, integrate and process the various data assets based on a shared conceptualization. The Data Lake Catalog Population Service also records relevant provenance information, e.g. the activity that has created or modified a data asset, including information about the source asset as well as begin and end time.
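The Levenshtein-based linking step can be sketched in a few lines of Python. The label table, normalization rule and distance threshold below are illustrative assumptions, not Bosch's actual configuration:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def link_signal(raw_name, concepts, max_dist=3):
    """Map a raw signal name to the closest concept label, if close enough."""
    normalized = raw_name.lower().replace("_", "")
    best = min(concepts, key=lambda label: levenshtein(normalized, label))
    return concepts[best] if levenshtein(normalized, best) <= max_dist else None

# Illustrative label-to-concept table (hypothetical, not the actual ASO taxonomy)
vocab = {"windowposition": "aso:WindowPosition",
         "steeringangle": "aso:SteeringAngle"}
print(link_signal("Window_Position", vocab))  # exact match after normalization
```

In practice the threshold would be tuned per vocabulary, since short signal names tolerate far fewer edits than long ones.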
Data Access Engine. The Data Access Engine (DAE) provides applications with a uniform query interface and access to resources (e.g. files, tables, knowledge graphs) in the data lake based on a common HTTP-based API and endpoint. At this stage, the DAE supports data-type queries (i.e. in HBase7 or Hive8/Impala9 tables), knowledge queries (i.e. SPARQL queries of knowledge graphs) and task requests (i.e. Oozie10 jobs in the Hadoop11 cluster). The DAE secures and hides the details of the underlying data stores.

5 Apache Kafka: A distributed streaming platform, https://kafka.apache.org/
6 We use Stardog for storing and processing the semantic layer as knowledge graph.
7 Apache HBase: Distributed big data store that runs on Hadoop, https://hbase.apache.org/
8 Apache Hive: Data warehouse software for large distributed datasets, https://hive.apache.org/
9 Apache Impala: Native analytic database for Apache Hadoop, https://impala.apache.org/
10 Apache Oozie: Workflow scheduler for Hadoop, https://oozie.apache.org/
11 Apache Hadoop: Scalable, distributed computing software, https://hadoop.apache.org/