dp-900.md

Roles and services

Roles

  • DB admin: manage DBs and access, backup and restore
    • security, policies
  • DE: manage infra for data, cleaning, pipelines
    • privacy, performance
  • DA: explore and analyze data for visualizations and reports

Services

  • Azure SQL:
    • A SQL DB: PaaS, fully managed
    • A SQL managed instance: instance with SQL server, more flexible than DB, but more admin
    • A SQL VM: VM with SQL service, max flexibility
  • Azure DBs for open-source rel. DBs:
    • MySQL
    • MariaDB: rewritten and optimized MySQL
    • PostgreSQL: allows non-rel. properties
  • Azure Cosmos DB
    • NoSQL: JSON, k-v pairs, etc. (semi-structured)
    • Usually managed as part of application by developers
  • Azure Storage
    • Blob containers
    • File shares: basically a NAS
    • Tables: k-v stores for fast read and write
  • Azure Data Factory
    • Pipelines, ETL
  • Azure Synapse Analytics
    • Unified DA solution
    • Pipelines as in ADF
    • SQL
    • Spark
    • Synapse Data Explorer: query logs and telemetry with Kusto QL (KQL)
    • Azure's answer to Databricks
  • Azure Databricks
  • Azure HDInsight
    • Azure-hosted clusters for Spark, Hadoop, HBase, Kafka, and Storm.
      • HBase: open-source large-scale NoSQL
      • Storm: real-time data processing
  • Azure Stream Analytics
    • Stream processing
  • Azure Data Explorer
    • Standalone Synapse Data Explorer
  • Microsoft Purview
    • data governance and discoverability
    • track data lineage
  • Microsoft Power BI

Core data concepts

Optimized file formats

  • Avro: row-based format; a JSON header describes each record, data is stored as compressed binary
  • ORC (Optimized Row Columnar format): Optimized for Hive
    • Column-major format
    • Stripes contain columns + stats (count, min, etc.)
  • Parquet: column-oriented; data is stored in chunks (row groups) with metadata (e.g. column locations and stats) for each chunk. Can be queried in place (see the sketch below).
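
These formats can be queried in place by SQL engines. A minimal sketch using OPENROWSET in a Synapse serverless SQL pool over Parquet (the storage URL and column names are placeholder assumptions):

```sql
-- Query Parquet files in place from a Synapse serverless SQL pool.
-- Storage account, path, and columns (product_id, quantity) are hypothetical.
SELECT
    product_id,
    SUM(quantity) AS total_quantity
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
GROUP BY product_id;
```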

Databases

  • Normalization reduces redundancy by referencing entities in other tables via primary keys.

Non-relational

  • Key-value: record with unique key and some value
  • Document: Key-value with JSON value
  • Column-family: Rows and columns where columns can be grouped (e.g. nested schema in parquet)
  • Graph: Nodes and edges

Transactional systems

  • high-volume, low latency (e.g. retailer order database)
  • BOTH read and write optimized
  • CRUD: Create, Read, Update, Delete
  • ACID:
    • Atomicity: Each transaction either succeeds in all steps or fails all steps
    • Consistency: Only valid states, completed transactions are reflected in DB
    • Isolation: Concurrent transactions don't interfere
    • Durability: Committed transactions are persisted (e.g. written to disk, possibly backed up); a T-SQL sketch follows this list
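
A minimal T-SQL sketch of atomicity: both steps of a stock transfer commit together or not at all (table and column names are illustrative):

```sql
-- Move 5 units of a product between warehouses as one transaction.
-- The Inventory table and its columns are hypothetical.
BEGIN TRANSACTION;
BEGIN TRY
    UPDATE Inventory SET quantity = quantity - 5
    WHERE warehouse_id = 1 AND product_id = 42;

    UPDATE Inventory SET quantity = quantity + 5
    WHERE warehouse_id = 2 AND product_id = 42;

    COMMIT TRANSACTION;   -- both updates persist together (atomicity, durability)
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION; -- any error undoes both updates
END CATCH;
```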

Analytical data processing

  • Mostly read optimized
  • Data lake: raw data from e.g. transactional system
  • Data warehouse: optimized for read
    • ETL takes Data lake -> Data warehouse (DW)
    • fact: numerical value to analyze (target feature, e.g. sales)
    • dimension: table with entities related to fact (feature, e.g. customer, product)
  • OLAP: OnLine Analytical Processing model (cube with rollup etc.)
    • pre-aggregated data ready for very fast querying
  • Data scientists (DS) often use the data lake directly to model data
  • DA often uses DW to create OLAP and reports
  • Business users use OLAP to consume reports

Fundamental Relational Concepts

  • One common database structure

Relational data

  • Based on tables
  • Each row is an entity
  • Strict schema, every row in a table has the same columns
    • columns can be null
    • columns are typed

Normalization

  • Simple definition:
    1. Each entity type gets its own table
    2. Each attribute gets its own column
    3. Each entity (row) is uniquely identified using a primary key
    4. Foreign keys link related entities in other tables
  • Removes duplication of data
  • Enforces data types (e.g. decimal prices, integral quantities)
  • Primary keys can be composite, i.e. a unique combination of multiple columns (schema sketch below)
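
A minimal schema sketch of these rules (the tables and columns are hypothetical):

```sql
-- Each entity type gets its own table; each row has a primary key.
CREATE TABLE Customer (
    customer_id INT PRIMARY KEY,      -- rule 3: unique primary key
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(100)          -- columns are typed and may be NULL
);

CREATE TABLE SalesOrder (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL
        REFERENCES Customer (customer_id),  -- rule 4: foreign key link
    order_date  DATE NOT NULL,
    total       DECIMAL(10, 2) NOT NULL    -- enforced decimal price
);
```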

SQL

  • Open standard, many vendors have proprietary extensions
  • Dialect examples:
    • Transact-SQL (T-SQL): Microsoft SQL Server, Azure SQL. Adds procedural extensions for application logic in the DB
    • pgSQL: PostgreSQL
    • Procedural Language/SQL (PL/SQL): Oracle
  • Statement types:
    • Data Definition Language (DDL)
      • Create, modify, remove tables
      • CREATE, ALTER, DROP, RENAME
    • Data Control Language (DCL)
      • DB admins use DCL to manage permissions
      • GRANT, DENY, REVOKE
    • Data Manipulation Language (DML)
      • Manipulate rows, query, insert etc.
      • SELECT, INSERT, UPDATE, DELETE (rows are filtered with clauses such as WHERE); one example of each below
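
One hedged example per statement type (the table and user names are hypothetical):

```sql
-- DDL: define and modify structure
CREATE TABLE Product (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    price      DECIMAL(10, 2)
);
ALTER TABLE Product ADD category VARCHAR(50);

-- DCL: manage permissions
GRANT SELECT ON Product TO reporting_user;
REVOKE SELECT ON Product FROM reporting_user;

-- DML: manipulate and query rows
INSERT INTO Product (product_id, name, price) VALUES (1, 'Bike', 499.99);
UPDATE Product SET price = 450.00 WHERE product_id = 1;
SELECT name, price FROM Product WHERE price < 500;
DELETE FROM Product WHERE product_id = 1;
```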

Describe DB objects

  • Views: virtual tables based on SELECT query
  • Stored procedure: function in SQL
  • Index: optimize queries that filter on index column
    • Makes reads faster, but slows writes (a sketch of all three objects follows)
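
A hedged T-SQL sketch of all three objects (all names are illustrative):

```sql
-- View: a virtual table defined by a SELECT query
CREATE VIEW dbo.ExpensiveProducts AS
SELECT product_id, name, price
FROM dbo.Product
WHERE price > 1000;
GO

-- Stored procedure: reusable logic stored in the database
CREATE PROCEDURE dbo.DiscountProduct @ProductId INT, @Percent DECIMAL(5, 2)
AS
BEGIN
    UPDATE dbo.Product
    SET price = price * (1 - @Percent / 100)
    WHERE product_id = @ProductId;
END;
GO

-- Index: speeds up queries that filter on the indexed column
CREATE INDEX IX_Product_Price ON dbo.Product (price);
```

Azure SQL Services and Capabilities
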
  • SQL Server on VMs: IaaS
    • Good for moving on-prem SQL Server to Azure
  • SQL Managed Instance: PaaS
    • Automated software updates, backups, maintenance
  • SQL DB: PaaS
    • Good when creating new application in the cloud
    • One server that scales vertically
    • Single Database: one database per server
    • Elastic Pool: several databases per server, good for uneven loads
  • SQL Edge: for IoT
    • streaming time-series data
|  | SQL Server on Azure VMs | Azure SQL Managed Instance | Azure SQL Database |
| --- | --- | --- | --- |
| Type of cloud service | IaaS | PaaS | PaaS |
| SQL Server compatibility | Fully compatible with on-prem | Near-100% compatibility with SQL Server; most on-premises databases can be migrated with minimal code changes using the Azure Database Migration Service | Supports most core database-level capabilities of SQL Server; some features an on-premises application depends on may not be available |
| Architecture | SQL Server instances are installed in a virtual machine; each instance can support multiple databases | Each managed instance can support multiple databases; instance pools can share resources efficiently across smaller instances | A single database in a dedicated, managed (logical) server, or an elastic pool that shares resources across multiple databases with on-demand scalability |
| Availability | 99.99% | 99.99% | 99.995% |
| Management | You manage all aspects of the server: OS and SQL Server updates, configuration, backups, and other maintenance | Fully automated updates, backups, and recovery | Fully automated updates, backups, and recovery |
| Use cases | Migrate or extend an on-premises SQL Server solution and retain full control over server and database configuration | Most cloud migration scenarios, particularly when minimal changes to existing applications are needed | New cloud solutions, or migrating applications with minimal instance-level dependencies |
| Business benefits | Same as on-prem, but on cloud for easy scaling etc. | Less admin time; login with Azure AD | Very little admin; scalable; high availability; security; encryption |

Azure services for open-source DBs

  • MySQL:
    • free Community edition, more powerful Standard and Enterprise editions
  • MariaDB:
    • created by creators of MySQL.
    • Built-in Oracle compatibility and support for temporal data.
    • Tables can be versioned.
  • PostgreSQL
    • hybrid relational-object DB
    • geometric data (lines, circles, polygons)
    • pgsql is a variant of standard SQL

Azure DB

  • MySQL
    • Based on community edition
    • firewalls etc
    • easy scaling
    • security and server administration are handled by Azure (not user-configurable)
    • new projects should use flexible server
    • auto backups
    • Good compatibility with LAMP (Linux, Apache, MySQL, PHP)
  • MariaDB
    • Based on community edition
    • easy scaling
    • fully managed
  • PostgreSQL
    • Same services as MySQL
    • cannot perform specialized tasks, e.g. storing non-pgsql procedures or direct OS operations
    • Most common extensions are supported, more are added over time
    • single server, flexible server or hyperscale
      • choose tier for single server
      • flexible server is fully managed
      • hyperscale scales horizontally
        • for big data
    • failure detection
    • pgAdmin tool for DB management

Azure storage for non-relational data

Blob storage

  • Blob = Binary Large OBject
  • Organized in containers, access is managed on container level

three types of blob:

  • block blob
    • set of blocks up to 100 MB each
    • up to 50k blocks, ~4.75 TB in total
    • use for discrete, large binary objects that change infrequently
  • Page blob
    • collection of 512-byte pages
    • supports random read and write
    • up to 8 TB (~16 billion 512-byte pages)
    • virtual disks on VMs are page blobs
  • Append blob
    • optimized for append operations
    • cannot update or delete existing blocks
    • Up to 4 MB per block
    • Up to 195 GB

Storage tiers

  • Hot
    • frequent access
    • low latency
    • high-performance media (SSDs?)
  • Cool
    • infrequent access
    • cheaper storage than hot, more expensive access
    • lower performance
  • Archive
    • rare access
    • cheapest storage, most expensive access
    • high latency, offline storage.
    • access can take hours
    • to read, a blob must first be rehydrated to the Hot or Cool tier
  • Can create lifecycle management policies

Data Lake Storage Gen2

  • Used for big data analytics
  • all types of structured, semi-structured, and unstructured data
  • part of Azure Blob Storage
  • Mounting point for distributed file systems
  • Need to enable Hierarchical Namespace in Storage Account
    • Can upgrade from flat namespace
    • CANNOT downgrade to flat namespace

Azure Files

  • cloud-based NAS
  • up to 100 TB / Storage Account
    • can set quotas on file shares
  • File size up to 1 TB
  • upload via portal or azcopy
  • two performance tiers
    • standard: HDD
    • premium: SSD
  • two file sharing protocols
    • Server Message Block (SMB): Windows, Linux, macOS
    • Network File System (NFS): Linux, macOS
      • Only for premium tier
      • requires virtual network

Azure Tables

  • NoSQL, k-v pairs
  • Items are stored in rows
    • each row has unique key (partition key + row key)
    • timestamp for last modification
    • other columns can vary row-by-row
  • usually denormalized
  • partitioned on common property or partition key
    • partitions are independent
    • can filter on partition key to skip data reading
  • rows are stored in row-key order (efficient point lookups and range queries)

Fundamentals of Azure Cosmos DB

  • Scalable NoSQL DB
  • Supports multiple APIs
  • Indexed and partitioned data
  • Several Azure regions can be added to a Cosmos DB
    • Results in local replicas

When to use Cosmos DB

  • Partitions up to 10 GB
  • automatic indexing
  • IoT
    • Burst data
    • Can define triggers on data ingest
  • Retail and marketing
    • Windows Store, Xbox Live
  • Gaming
    • very low latency
    • spiking request rates
  • Web and mobile
    • Xamarin for Android/iOS

APIs

  • SQL
    • Native API
    • Returns JSON documents (query sketch below)
  • MongoDB
  • Table API
    • for k-v pairs
    • better scalability and performance than Table Storage
    • language-specific SDKs to call table
  • Cassandra
    • column-family
    • SQL-based syntax
  • Gremlin
    • graph DBs
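
The native SQL API queries JSON documents with a SQL dialect; a minimal sketch (c is the conventional container alias, the properties are placeholder assumptions):

```sql
-- Cosmos DB SQL API: query JSON documents in a container.
-- Property names (category, price) are hypothetical.
SELECT c.id, c.name, c.price
FROM c
WHERE c.category = 'gaming' AND c.price < 60
ORDER BY c.price DESC
```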

Fundamentals of Modern Data Warehousing

  • infra for large-scale analytics and BI

Usually four steps:

  • ETL from trans. data stores, files, streams etc. -> data lake/wh
    • optimize for analytical queries
    • often distributed
  • Analytical data store
    • rel. data wh, data lakes, or hybrid
  • Analytical data model
    • pre-aggregated "cubes"
  • Data visualization
    • reports, dashboards etc.
    • consumed by non-technical personnel

Ingestion pipelines

  • Can use Data Factory or Synapse Analytics
  • Activities operate on data
  • linked services load and process data
    • Linked service must be published in Synapse before use

Analytical data stores

Data warehouse

  • rel. DB optimized for analytics
    • often denormalized
  • Numeric values in central fact tables
    • one or more dimension tables with entities to aggregate over
    • e.g. fact = sales orders
      • dim1 = customer
      • dim2 = product
      • dim3 = store
      • dim4 = time
    • called a star schema
      • Extends to a snowflake schema when dimension tables relate to further tables (e.g. product categories)
  • Good choice for transactional data that fits naturally into a relational DB and will be queried with SQL (example query below)
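
A hedged sketch of querying the star schema above (table and column names are assumptions, not a fixed convention):

```sql
-- Aggregate the sales fact over the product and time dimensions.
SELECT
    p.category,
    t.calendar_year,
    SUM(f.sales_amount) AS total_sales
FROM FactSalesOrder AS f
JOIN DimProduct AS p ON f.product_key = p.product_key
JOIN DimDate    AS t ON f.date_key = t.date_key
GROUP BY p.category, t.calendar_year
ORDER BY t.calendar_year, p.category;
```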

Data Lake

  • General file store, often distributed for high-performance access
  • schema-on-read for (semi-) structured data

Hybrid approaches

  • lake database / data lakehouse
    • raw data in lake
    • rel. storage layer abstracts underlying files
    • e.g. Delta Lake
  • PolyBase in Synapse Analytics (external-table sketch below)
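
A hedged sketch of the PolyBase/external-table pattern in Synapse, exposing lake files as a relational table (locations, names, and schema are placeholder assumptions):

```sql
-- Register the lake location and file format, then expose Parquet
-- files as an external table. All names here are hypothetical.
CREATE EXTERNAL DATA SOURCE SalesLake
WITH (LOCATION = 'abfss://data@mydatalake.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.ExternalSales (
    product_id   INT,
    sales_amount DECIMAL(12, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = SalesLake,
    FILE_FORMAT = ParquetFormat
);
```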

Services for analytical stores

  • Synapse Analytics
    • end-to-end
    • Synapse Studio interface
  • Databricks
    • common on multiple clouds
  • HDInsight
    • supports open-source cluster types
    • not as user-friendly

Notes on Synapse

  • Execute permissions must be set on all directories affected by a query, in addition to read permission on files.

Fundamentals of real-time analytics

Batch and stream processing

Batch

Advantages

  • Large volumes of data can be processed at a convenient time
  • Can schedule to run e.g. overnight or during off-peak hours

Disadvantages

  • Latency between ingestion and processing
  • All data must be ready before processing starts
    • i.e. errors stop the whole process

Diffs between batch and stream

  • data scope
    • batch: access to all data at once
    • stream: only the most recent or rolling window
  • Data size
    • batch: optimized for parallel processing of large datasets
    • stream: processes one record at a time (or microbatching)
  • Latency
    • batch: up to several hours
    • stream: seconds or milliseconds
  • Analysis
    • batch: complex analysis tasks
    • stream: simple aggregations or response functions

Combining batch and stream

Do simple analysis as stream while persisting data for large scale historical batch analysis

Stream processing architecture

  1. Event happens
  2. Generated data is captured in source
  3. data is processed
  4. processed data is written to sink

In Azure

  • Azure Stream Analytics: PaaS for non-changing query
  • Spark Structured Streaming
    • used in Synapse Analytics, Databricks, HDInsight
  • Azure Data Explorer: real-time DB for time-series
    • used in Synapse Analytics

Sources:

  • Azure Event Hubs
    • Queue with strict ordering and exactly-once semantics
  • Azure IoT Hub
    • Event Hub optimized for IoT
  • Azure Data Lake Store Gen 2
    • usually batch, can be used for stream too
  • Apache Kafka
    • open-source
    • can be used with HDInsight

Sinks:

  • Event Hubs
    • queue for downstream processing
  • Data Lake Store Gen 2
    • persist result as file
  • SQL DB, Synapse Analytics, Databricks
    • persist result as table
  • Power BI
    • reports and visualizations

Azure Stream Analytics

  • Streaming pipeline
    • ingest
    • process
    • persist
  • runs perpetually
    • processes data as it comes in
  • Good for continually capturing data (query sketch below)
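
Jobs are defined with a SQL-like query language. A minimal sketch counting events per device in 10-second tumbling windows ([iothub-input], [datalake-output], and the EventTime field are placeholder assumptions configured on the job):

```sql
-- Count events per device in 10-second tumbling windows.
-- Input/output aliases and the EventTime payload field are hypothetical.
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO [datalake-output]
FROM [iothub-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 10);
```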

Stream Analytics Jobs

  • Can use static data for reference or joining
  • Intensive jobs can be run on clusters (scalable dedicated tenants)

Spark on Azure

Used in Synapse Analytics, Databricks, and HDInsight

  • Synapse and Databricks support Delta Lake

Azure Data Explorer

Analyze big data from

  • web
  • apps
  • IoT
  • etc.

Ex. use case: process Stream Analytics errors

Ingest with connectors or programmatically

  • stream and batch

Good for logs and IoT

  • Uses Kusto Query Language (KQL)
    • Optimized for read
    • Made for DAs
  • Especially relevant in Synapse (as Synapse Data Explorer)

Fundamentals of Data Visualization

Mainly for DAs

  • charts in Excel
  • vis. in notebooks
    • Synapse
    • Databricks
  • BI tools

Power BI

  • Several tools and services
    • desktop app
    • phone app
    • web browser
    • Power BI Service
  • Workflow
    • Create reports in Power BI Desktop
    • Publish to Power BI Service
      • Can do simple modeling and report editing in Web interface to Service
      • Schedule report refreshing based on new data
      • Create dashboards
    • Consume dashboards and reports in web browser or phone app

Core concepts of Data Modeling

  • OLAP Cube model
    • numeric values to analyze (e.g. revenue) are called measures
    • entities aggregated over are called dimensions
  • Model is stored as star schemas
    • measures are central fact tables
    • dimension tables are related to by keys
      • Time is almost always included as a dimension
    • If dimensions have further relations to tables (e.g. product to category), it's a snowflake schema
  • In the model, ALL measure aggregations over ALL dimensions are pre-calculated
    • Allows extremely fast switching of aggregation levels (ROLLUP sketch below)
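
In plain SQL, the effect of pre-aggregating across levels can be approximated with ROLLUP, which computes every level of a hierarchy in one pass (schema is hypothetical):

```sql
-- Revenue at (year, month), (year), and grand-total levels at once.
-- FactSales and DimDate are placeholder star-schema tables.
SELECT
    t.calendar_year,
    t.calendar_month,
    SUM(f.revenue) AS total_revenue
FROM FactSales AS f
JOIN DimDate AS t ON f.date_key = t.date_key
GROUP BY ROLLUP (t.calendar_year, t.calendar_month);
```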

Attribute hierarchies

  • hierarchy = aggregation level
    • e.g. customer/city/country
    • e.g. day/month/year
  • Hierarchy change is called drill-up or drill-down
    • drill-up: coarser level (month -> year)
    • drill-down: finer level (month -> day)

AMs in Power BI

  • Import tables with data for analytical model (AM)
  • data modeling interface
    • relations between facts and dimensions
    • hierarchies
    • data types
    • display formats

Considerations for visualization

  • Use model to create visualizations
  • Tables and text
    • simple
    • good for several related values
    • exact metrics
    • less intuitive
  • Bar and column charts
    • compare numeric values for discrete categories
  • Line charts
    • compare categorized values
    • shows trends
  • Pie charts
    • proportions of total
  • Scatter plots
    • identify correlations
  • Maps
    • geo data

Interactive reports in Power BI

  • Link several visualizations
    • link aggregation level
    • link filters (e.g. sales in Seattle)

Concepts of data analytics

ETL vs ELT

  • ETL
    • simple processing
    • can preserve privacy by not loading sensitive things to cloud
    • easy to do, no specialist knowledge
  • ELT
    • data lake support
    • transformations with e.g. spark
    • good for big data (CTAS sketch below)
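
A hedged sketch of the ELT pattern in a Synapse dedicated SQL pool: raw data already loaded (e.g. via the external table sketched earlier) is transformed inside the store with CREATE TABLE AS SELECT (all names are assumptions):

```sql
-- ELT: transform raw external data into a curated internal table.
-- dbo.ExternalSales is the hypothetical external table from the PolyBase sketch;
-- a DISTRIBUTION option is required for CTAS in dedicated SQL pools.
CREATE TABLE dbo.CuratedSales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM dbo.ExternalSales
GROUP BY product_id;
```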

DA activities

  • Descriptive
    • what has happened?
    • historical data
    • KPI, ROI
  • Diagnostic
    • why did it happen?
    • anomaly detection and analysis
  • Predictive
    • what will happen?
  • Prescriptive
    • what should we do to make it happen?
    • decision theory
  • Cognitive
    • draw inferences from data and patterns
    • e.g. video transcribing

Post-exam reflections

  • Test the connection in the room you plan to take the exam in.
    • I had to use the hotspot on my phone since the wifi was too bad in Zenit Bageriet.
  • Several questions on specifics of Power BI, did not really expect that from the official learning material.

If you know relational and non-relational database theory, basic data engineering, and what the different types of plots are called, then you mainly need to study which Azure service to use for which purpose.