Intro to Data Engineering

Who am I?

Hello, my name is Bhawesh Mehta. I am a data engineer.

Prior to my current role, I worked as a business analyst, where I developed a strong foundation in data analysis and automation.

I am passionate about data engineering and enjoy staying up-to-date with the latest developments in the field.

Introduction To Data Engineering

Difference Between Various Roles in Data Field

Difference between Data Engineer, Data Analyst, Data Scientist, Business Analyst, and Business Intelligence Analyst

Data Engineers Ensure That Data Is:

  • Highly available

  • Consistent

  • Secure

  • Recoverable

Data scientists and data analysts make use of the data that data engineers provide.

Data engineers work with other data professionals to ensure the data matches their needs.

Responsibilities of Data Engineers

Technical Skills Required

Role of Data Engineering in Customer Sentiment Analysis

  • Extract data from various sources, such as social media, e-commerce portals, and blogs, through APIs or web scraping.

  • Store that data in temporary storage.

  • Manipulate and clean the data with tools like Python.

  • Store the processed data in databases.

  • Make the cleaned data available to data analysts, business analysts, and end users.

The steps above are not a one-time activity.

They should be set up as automated pipelines.
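The steps above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the sample records, the function names (extract, stage, transform, load, run_pipeline), and the reviews table are all hypothetical, and an in-memory SQLite database stands in for a real target database.

```python
import json
import sqlite3

# Hypothetical raw records, standing in for data pulled from an API or a scraper.
RAW_REVIEWS = [
    {"source": "blog", "text": "  Great product!  "},
    {"source": "social", "text": "Terrible support"},
    {"source": "ecommerce", "text": ""},  # empty text, dropped during cleaning
]


def extract():
    """Return raw records; a real pipeline would call an API or scrape pages here."""
    return list(RAW_REVIEWS)


def stage(records):
    """Serialize raw records to JSON text, standing in for temporary storage."""
    return json.dumps(records)


def transform(staged):
    """Clean the staged records: trim whitespace and drop empty reviews."""
    cleaned = []
    for record in json.loads(staged):
        text = record["text"].strip()
        if text:
            cleaned.append((record["source"], text))
    return cleaned


def load(conn, rows):
    """Store the cleaned rows in a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS reviews (source TEXT, text TEXT)")
    conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run every step in order, so a scheduler only has to call one function."""
    load(conn, transform(stage(extract())))


conn = sqlite3.connect(":memory:")
run_pipeline(conn)
rows = conn.execute("SELECT source, text FROM reviews ORDER BY source").fetchall()
print(rows)
```

Wrapping the steps in a single run_pipeline function is one way to make the work repeatable: a scheduler can call it on a timer instead of someone running each step by hand.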

Data Types

Data Repositories

Data Pipeline

Languages

Reporting Tools

Structured Data

Semi-Structured Data

Unstructured Data

Standard File Formats

Delimited Text

XML File Format

PDF

JSON

Sources of Data

Web Scraping

Data Streams and Feeds

Metadata and Metadata Management

Objectives

After completing this reading, you will be able to:

  • Define what metadata is

  • Describe what metadata management is

  • Explain the importance of metadata management

  • List popular tools for metadata management

What is metadata?

Metadata is data that provides information about other data.

This is a very broad definition. Here we will consider the concept of metadata within the context of databases, data warehousing, business intelligence systems, and all kinds of data repositories and platforms.

We’ll consider the following three main types of metadata:

  • Technical metadata

  • Process metadata, and

  • Business metadata

Technical metadata

Technical metadata is metadata which defines the data structures in data repositories or platforms, primarily from a technical perspective.

For example, technical metadata in a data warehouse includes assets such as:

1. Tables that record information about the tables stored in a database, like:

  1. each table’s name

  2. the number of columns and rows each table has

2. A data catalog, which is an inventory of tables that contain information, like:

  1. the name of each database in the enterprise data warehouse

  2. the name of each column present in each database

  3. the names of every table that each column is contained in

  4. the type of data that each column contains

The technical metadata for relational databases is typically stored in specialized tables in the database called the System Catalog.
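As a rough illustration of a system catalog, the sketch below reads technical metadata from SQLite using Python's built-in sqlite3 module. SQLite is chosen here only because it ships with Python; other relational databases expose their catalogs differently. The customers and orders tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables, created only so the catalog has something to describe.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# sqlite_master is SQLite's built-in catalog table: one row per table, index, etc.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
catalog = {}
for table in table_names:
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    catalog[table] = [(col[1], col[2]) for col in columns]

for table, columns in catalog.items():
    print(table, len(columns), "columns:", columns)
```

The resulting catalog dictionary holds the kind of information listed above: each table's name, how many columns it has, and the name and data type of each column.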

Process metadata

Process metadata describes the processes that operate behind business systems such as data warehouses, accounting systems, or customer relationship management tools.

Many important enterprise systems are responsible for collecting and processing data from various sources. Such critical systems need to be monitored for failures and any performance anomalies that arise. Process metadata for such systems includes tracking things like:

  1. process start and end times

  2. disk usage

  3. where data was moved from and to, and

  4. how many users access the system at any given time

This sort of data is invaluable for troubleshooting and optimizing workflows and ad hoc queries.
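A minimal sketch of capturing process metadata for a single job run is shown below. The RunRecord structure, the run_with_metadata helper, and the job names are all assumptions made for illustration; real monitoring tools track far more, such as disk usage and concurrent users.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RunRecord:
    """Process metadata for one job run (hypothetical structure)."""
    job_name: str
    source: str   # where the data was moved from
    target: str   # where the data was moved to
    started_at: Optional[datetime] = None
    ended_at: Optional[datetime] = None
    duration_seconds: float = 0.0
    rows_moved: int = 0
    status: str = "pending"


def run_with_metadata(job_name, source, target, job):
    """Run job() and record start time, end time, duration, row count, and status."""
    record = RunRecord(job_name=job_name, source=source, target=target)
    record.started_at = datetime.now(timezone.utc)
    timer_start = time.perf_counter()
    try:
        record.rows_moved = job()  # the job is assumed to return a row count
        record.status = "succeeded"
    except Exception:
        record.status = "failed"
    finally:
        record.duration_seconds = time.perf_counter() - timer_start
        record.ended_at = datetime.now(timezone.utc)
    return record


def failing_job():
    """Hypothetical job that always fails, to show how a failure is recorded."""
    raise RuntimeError("source unavailable")


ok = run_with_metadata("copy_reviews", "staging", "warehouse", lambda: 42)
failed = run_with_metadata("copy_orders", "staging", "warehouse", failing_job)
print(ok.status, ok.rows_moved)
print(failed.status, failed.rows_moved)
```

Keeping one record per run makes it possible to spot failed or unusually slow runs later, which is the troubleshooting use described above.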

Business metadata

Users who want to explore and analyze data within and outside the enterprise are typically interested in data discovery. They need to be able to find data which is meaningful and valuable to them and know where that data can be accessed from. These business-minded users are thus interested in business metadata, which is information about the data described in readily interpretable ways, such as:

  1. how the data is acquired

  2. what the data is measuring or describing

  3. the connection between the data and other data sources

Business metadata also serves as documentation for the entire data warehouse system.

Managing metadata

Managing metadata includes developing and administering policies and processes to ensure information can be accessed and integrated from various sources and appropriately shared across the entire enterprise.

Creation of a reliable, user-friendly data catalog is a primary objective of a metadata management model.

The data catalog is a core component of a modern metadata management system, serving as the main asset around which metadata management is administered.

It serves as the basis by which companies can inventory and efficiently organize their data systems. A modern metadata management model will include a web-based user interface that enables engineers and business users to easily search for and find information on key attributes such as CustomerName or ProductType. This kind of model is central to any Data Governance initiative.
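To make the idea of searching a data catalog concrete, here is a toy in-memory sketch. The catalog entries, table names, descriptions, and the search_catalog function are all hypothetical; a real metadata management tool would back this with a database and a web interface.

```python
# Hypothetical catalog entries: one per column, with business-friendly descriptions.
CATALOG = [
    {"table": "sales.customers", "column": "CustomerName",
     "type": "TEXT", "description": "Full name of the customer"},
    {"table": "sales.orders", "column": "CustomerName",
     "type": "TEXT", "description": "Name copied from the customer record at order time"},
    {"table": "sales.products", "column": "ProductType",
     "type": "TEXT", "description": "Category the product is sold under"},
]


def search_catalog(catalog, term):
    """Return entries whose column name or description contains the search term."""
    term = term.lower()
    return [
        entry
        for entry in catalog
        if term in entry["column"].lower() or term in entry["description"].lower()
    ]


matches = search_catalog(CATALOG, "CustomerName")
for entry in matches:
    print(entry["table"], entry["column"], "-", entry["description"])
```

Searching for an attribute such as CustomerName returns every table that holds it, which is the kind of lookup a catalog interface is meant to make easy.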

Why is metadata management important?

Good metadata management has many valuable benefits. Having access to a well-implemented data catalog greatly enhances data discovery, repeatability, and governance, and can also facilitate access to data.

Well managed metadata helps you to understand both the business context associated with the enterprise data and the data lineage, which helps to improve data governance. Data lineage provides information about the origin of the data and how it gets transformed and moved, and thus it facilitates tracing of data errors back to their root cause. Data governance is a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives.

The key focus areas of data governance include availability, usability, consistency, data integrity, and data security. Data governance also includes establishing processes to ensure effective data management throughout the enterprise, such as accountability for the adverse effects of poor data quality, and ensuring that the data an enterprise has can be used by the entire organization.
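The way data lineage helps trace an error back to its root cause can be illustrated with a small sketch. The dataset names and the LINEAGE mapping are hypothetical; the sketch assumes lineage is recorded as a mapping from each dataset to the datasets it was built from.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it was derived from.
LINEAGE = {
    "sentiment_report": ["clean_reviews"],
    "clean_reviews": ["raw_reviews_api", "raw_reviews_scraped"],
    "raw_reviews_api": [],
    "raw_reviews_scraped": [],
}


def trace_upstream(lineage, dataset):
    """Return every upstream dataset, nearest first (breadth-first walk)."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


def root_sources(lineage, dataset):
    """Return the upstream datasets that have no parents of their own."""
    return [d for d in trace_upstream(lineage, dataset) if not lineage.get(d)]


upstream = trace_upstream(LINEAGE, "sentiment_report")
roots = root_sources(LINEAGE, "sentiment_report")
print(upstream)
print(roots)
```

If a value in the report looks wrong, walking the lineage this way narrows the search to the datasets that fed it, ending at the original sources.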

Popular metadata management tools include:

  1. IBM InfoSphere Information Server

  2. CA Erwin Data Modeler

  3. Oracle Warehouse Builder

  4. SAS Data Integration Server

  5. Talend Data Fabric

  6. Alation Data Catalog

  7. SAP Information Steward

  8. Microsoft Azure Data Catalog

  9. IBM Watson Knowledge Catalog

  10. Oracle Enterprise Metadata Management (OEMM)

  11. Adaptive Metadata Manager

  12. Unifi Data Catalog

  13. data.world

  14. Informatica Enterprise Data Catalog

Data Repositories in Detail

Introduction:

Relational Databases

Non-Relational Databases

Data Warehouse

Data Marts

A dependent data mart draws its data from the data warehouse, where the data has already been cleaned and transformed.

Data Lakes

Relational Database

NoSQL Database

NoSQL Databases Types

  • Key-value store

  • Document-based

  • Column-based

  • Graph-based

Difference Between RDBMS and NoSQL Databases

ETL Process

ELT Process

Data Pipelines

Data Integration

Note

The term “data repositories” includes not just RDBMS and NoSQL databases but also data warehouses, data marts, and data lakes.

Big Data

The V’s of Big Data

Big Data Processing Tools

Architecting Data Platforms

Columns are attributes

Rows are records

Data Collection and Data Wrangling

Tools for Data Wrangling

Data Ops Methodology

Summary

Special Thanks

  • IBM Skills

  • Google

Selection based on different criteria