dp-900.md

Roles and services

Roles

  • DB admin: manage DBs and access, backup and restore
    • security, policies
  • DE: manage infra for data, cleaning, pipelines
    • privacy, performance
  • DA: explore and analyze data for visualizations and reports

Services

  • Azure SQL:
    • A SQL DB: PaaS, fully managed
    • A SQL managed instance: instance with SQL server, more flexible than DB, but more admin
    • A SQL VM: VM with SQL service, max flexibility
  • Azure DBs for open-source rel. DBs:
    • MySQL
    • MariaDB: rewritten and optimized MySQL
    • PostgreSQL: allows non-rel. properties
  • Azure Cosmos DB
    • NoSQL: JSON, k-v pairs, etc. (semi-structured)
    • Usually managed as part of application by developers
  • Azure Storage
    • Blob containers
    • File shares: basically a NAS
    • Tables: k-v stores for fast read and write
  • Azure Data Factory
    • Pipelines, ETL
  • Azure Synapse Analytics
    • Unified DA solution
    • Pipelines as in ADF
    • SQL
    • Spark
    • Synapse Data Explorer: query logs and telemetry with Kusto QL (KQL)
    • Azure's answer to Databricks
  • Azure Databricks
  • Azure HDInsight
    • Azure-hosted clusters for Spark, Hadoop, HBase, Kafka, and Storm.
      • HBase: open-source large-scale NoSQL
      • Storm: real-time data processing
  • Azure Stream Analytics
    • Stream processing
  • Azure Data Explorer
    • Standalone Synapse Data Explorer
  • Microsoft Purview
    • data governance and discoverability
    • track data lineage
  • Microsoft Power BI

Core data concepts

Optimized file formats

  • Avro: row-based format; a JSON header describes each record, data is stored as compressed binary
  • ORC (Optimized Row Columnar format): Optimized for Hive
    • Column-major format
    • Stripes contain columns + stats (count, min, etc.)
  • Parquet: column-oriented; data is stored in chunks (row groups) with metadata (e.g. column locations and stats) for each chunk. Can be queried in place (see the sketch below).
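
These formats can be queried in place by SQL engines. A minimal sketch using OPENROWSET in a Synapse serverless SQL pool over Parquet (the storage URL and column names are placeholder assumptions):

```sql
-- Query Parquet files in place from a Synapse serverless SQL pool.
-- Storage account, path, and columns (product_id, quantity) are hypothetical.
SELECT
    product_id,
    SUM(quantity) AS total_quantity
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
GROUP BY product_id;
```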

Databases

  • Normalization reduces redundancy by referencing entities in other tables via primary keys.

Non-relational

  • Key-value: record with unique key and some value
  • Document: Key-value with JSON value
  • Column-family: Rows and columns where columns can be grouped (e.g. nested schema in parquet)
  • Graph: Nodes and edges

Transactional systems

  • high-volume, low latency (e.g. retailer order database)
  • BOTH read and write optimized
  • CRUD: Create, Read, Update, Delete
  • ACID:
    • Atomicity: Each transaction either succeeds in all steps or fails all steps
    • Consistency: Only valid states, completed transactions are reflected in DB
    • Isolation: Concurrent transactions don't interfere
    • Durability: Committed transactions are persisted (e.g. written to disk, possibly backed up); a T-SQL sketch follows this list
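
A minimal T-SQL sketch of atomicity: both steps of a stock transfer commit together or not at all (table and column names are illustrative):

```sql
-- Move 5 units of a product between warehouses as one transaction.
-- The Inventory table and its columns are hypothetical.
BEGIN TRANSACTION;
BEGIN TRY
    UPDATE Inventory SET quantity = quantity - 5
    WHERE warehouse_id = 1 AND product_id = 42;

    UPDATE Inventory SET quantity = quantity + 5
    WHERE warehouse_id = 2 AND product_id = 42;

    COMMIT TRANSACTION;   -- both updates persist together (atomicity, durability)
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION; -- any error undoes both updates
END CATCH;
```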

Analytical data processing

  • Mostly read optimized
  • Data lake: raw data from e.g. transactional system
  • Data warehouse: optimized for read
    • ETL takes Data lake -> Data warehouse (DW)
    • fact: numerical value to analyze (target feature, e.g. sales)
    • dimension: table with entities related to fact (feature, e.g. customer, product)
  • OLAP: OnLine Analytical Processing model (cube with rollup etc.)
    • pre-aggregated data ready for very fast querying
  • Data scientists (DS) often use the data lake directly to model data
  • DA often uses DW to create OLAP and reports
  • Business users use OLAP to consume reports

Fundamental Relational Concepts

  • One common database structure

Relational data

  • Based on tables
  • Each row is an entity
  • Strict schema, every row in a table has the same columns
    • columns can be null
    • columns are typed

Normalization

  • Simple definition:
    1. Each entity type gets its own table
    2. Each attribute gets its own column
    3. Each entity (row) is uniquely identified using a primary key
    4. Foreign keys link related entities in other tables
  • Removes duplication of data
  • Enforces data types (e.g. decimal prices, integral quantities)
  • Primary keys can be composite, i.e. a unique combination of multiple columns (schema sketch below)
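
A minimal schema sketch of these rules (the tables and columns are hypothetical):

```sql
-- Each entity type gets its own table; each row has a primary key.
CREATE TABLE Customer (
    customer_id INT PRIMARY KEY,      -- rule 3: unique primary key
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(100)          -- columns are typed and may be NULL
);

CREATE TABLE SalesOrder (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL
        REFERENCES Customer (customer_id),  -- rule 4: foreign key link
    order_date  DATE NOT NULL,
    total       DECIMAL(10, 2) NOT NULL    -- enforced decimal price
);
```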

SQL

  • Open standard, many vendors have proprietary extensions
  • Dialect examples:
    • Transact-SQL (T-SQL): Microsoft SQL Server, Azure SQL. Adds procedural extensions for application logic in the DB
    • pgSQL: PostgreSQL
    • Procedural Language/SQL (PL/SQL): Oracle
  • Statement types:
    • Data Definition Language (DDL)
      • Create, modify, remove tables
      • CREATE, ALTER, DROP, RENAME
    • Data Control Language (DCL)
      • DB admins use DCL to manage permissions
      • GRANT, DENY, REVOKE
    • Data Manipulation Language (DML)
      • Manipulate rows, query, insert etc.
      • SELECT, INSERT, UPDATE, DELETE (rows are filtered with clauses such as WHERE); one example of each below
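
One hedged example per statement type (the table and user names are hypothetical):

```sql
-- DDL: define and modify structure
CREATE TABLE Product (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    price      DECIMAL(10, 2)
);
ALTER TABLE Product ADD category VARCHAR(50);

-- DCL: manage permissions
GRANT SELECT ON Product TO reporting_user;
REVOKE SELECT ON Product FROM reporting_user;

-- DML: manipulate and query rows
INSERT INTO Product (product_id, name, price) VALUES (1, 'Bike', 499.99);
UPDATE Product SET price = 450.00 WHERE product_id = 1;
SELECT name, price FROM Product WHERE price < 500;
DELETE FROM Product WHERE product_id = 1;
```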

Describe DB objects

  • Views: virtual tables based on SELECT query
  • Stored procedure: function in SQL
  • Index: optimize queries that filter on index column
    • Makes reads faster, but slows writes (a sketch of all three objects follows)
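
A hedged T-SQL sketch of all three objects (all names are illustrative):

```sql
-- View: a virtual table defined by a SELECT query
CREATE VIEW dbo.ExpensiveProducts AS
SELECT product_id, name, price
FROM dbo.Product
WHERE price > 1000;
GO

-- Stored procedure: reusable logic stored in the database
CREATE PROCEDURE dbo.DiscountProduct @ProductId INT, @Percent DECIMAL(5, 2)
AS
BEGIN
    UPDATE dbo.Product
    SET price = price * (1 - @Percent / 100)
    WHERE product_id = @ProductId;
END;
GO

-- Index: speeds up queries that filter on the indexed column
CREATE INDEX IX_Product_Price ON dbo.Product (price);
```

Azure SQL Services and Capabilities
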
  • SQL Server on VMs: IaaS
    • Good for moving on-prem SQL Server to Azure
  • SQL Managed Instance: PaaS
    • Automated software updates, backups, maintenance
  • SQL DB: PaaS
    • Good when creating new application in the cloud
    • One server that scales vertically
    • Single Database: one database per server
    • Elastic Pool: several databases per server, good for uneven loads
  • SQL Edge: for IoT
    • streaming time-series data
|  | SQL Server on Azure VMs | Azure SQL Managed Instance | Azure SQL Database |
| --- | --- | --- | --- |
| Type of cloud service | IaaS | PaaS | PaaS |
| SQL Server compatibility | Fully compatible with on-prem | Near-100% compatibility with SQL Server; most on-premises databases can be migrated with minimal code changes using the Azure Database Migration Service | Supports most core database-level capabilities of SQL Server; some features an on-premises application depends on may not be available |
| Architecture | SQL Server instances are installed in a virtual machine; each instance can support multiple databases | Each managed instance can support multiple databases; instance pools can share resources efficiently across smaller instances | A single database in a dedicated, managed (logical) server, or an elastic pool that shares resources across multiple databases with on-demand scalability |
| Availability | 99.99% | 99.99% | 99.995% |
| Management | You manage all aspects of the server: OS and SQL Server updates, configuration, backups, and other maintenance | Fully automated updates, backups, and recovery | Fully automated updates, backups, and recovery |
| Use cases | Migrate or extend an on-premises SQL Server solution and retain full control over server and database configuration | Most cloud migration scenarios, particularly when minimal changes to existing applications are needed | New cloud solutions, or migrating applications with minimal instance-level dependencies |
| Business benefits | Same as on-prem, but on cloud for easy scaling etc. | Less admin time; login with Azure AD | Very little admin; scalable; high availability; security; encryption |

Azure services for open-source DBs

  • MySQL:
    • free Community edition, more powerful Standard and Enterprise editions
  • MariaDB:
    • created by creators of MySQL.
    • Built-in Oracle compatibility and support for temporal data.
    • Tables can be versioned.
  • PostgreSQL
    • hybrid relational-object DB
    • geometric data (lines, circles, polygons)
    • pgsql is a variant of standard SQL

Azure DB

  • MySQL
    • Based on community edition
    • firewalls etc
    • easy scaling
    • security and server administration are handled by Azure (not user-configurable)
    • new projects should use flexible server
    • auto backups
    • Good compatibility with LAMP (Linux, Apache, MySQL, PHP)
  • MariaDB
    • Based on community edition
    • easy scaling
    • fully managed
  • PostgreSQL
    • Same services as MySQL
    • cannot perform specialized tasks, e.g. storing non-pgsql procedures or direct OS operations
    • Most common extensions are supported, more are added over time
    • single server, flexible server or hyperscale
      • choose tier for single server
      • flexible server is fully managed
      • hyperscale scales horizontally
        • for big data
    • failure detection
    • pgAdmin tool for DB management

Azure storage for non-relational data

Blob storage

  • Blob = Binary Large OBject
  • Organized in containers, access is managed on container level

three types of blob:

  • block blob
    • set of blocks up to 100 MB each
    • up to 50k blocks, ~4.75 TB in total
    • use for discrete, large binary objects that change infrequently
  • Page blob
    • collection of 512-byte pages
    • supports random read and write
    • up to 8 TB (~16 billion 512-byte pages)
    • virtual disks on VMs are page blobs
  • Append blob
    • optimized for append operations
    • cannot update or delete existing blocks
    • Up to 4 MB per block
    • Up to 195 GB

Storage tiers

  • Hot
    • frequent access
    • low latency
    • high-performance media (SSDs?)
  • Cool
    • infrequent access
    • cheaper storage than hot, more expensive access
    • lower performance
  • Archive
    • rare access
    • cheapest storage, most expensive access
    • high latency, offline storage.
    • access can take hours
    • to read, a blob must first be rehydrated to the Hot or Cool tier
  • Can create lifecycle management policies

Data Lake Storage Gen2

  • Used for big data analytics
  • all types of structured, semi-structured, and unstructured data
  • part of Azure Blob Storage
  • Mounting point for distributed file systems
  • Need to enable Hierarchical Namespace in Storage Account
    • Can upgrade from flat namespace
    • CANNOT downgrade to flat namespace

Azure Files

  • cloud-based NAS
  • up to 100 TB / Storage Account
    • can set quotas on file shares
  • File size up to 1 TB
  • upload via portal or azcopy
  • two performance tiers
    • standard: HDD
    • premium: SSD
  • two file sharing protocols
    • Server Message Block (SMB): Windows, Linux, macOS
    • Network File System (NFS): Linux, macOS
      • Only for premium tier
      • requires virtual network

Azure Tables

  • NoSQL, k-v pairs
  • Items are stored in rows
    • each row has unique key (partition key + row key)
    • timestamp for last modification
    • other columns can vary row-by-row
  • usually denormalized
  • partitioned on common property or partition key
    • partitions are independent
    • can filter on partition key to skip data reading
  • rows are stored in row-key order (efficient point lookups and range queries)

Fundamentals of Azure Cosmos DB

  • Scalable NoSQL DB
  • Supports multiple APIs
  • Indexed and partitioned data
  • Several Azure regions can be added to a Cosmos DB
    • Results in local replicas

When to use Cosmos DB

  • Partitions up to 10 GB
  • automatic indexing
  • IoT
    • Burst data
    • Can define triggers on data ingest
  • Retail and marketing
    • Windows Store, Xbox Live
  • Gaming
    • very low latency
    • spiking request rates
  • Web and mobile
    • Xamarin for Android/iOS

APIs

  • SQL
    • Native API
    • Returns JSON documents (query sketch below)
  • MongoDB
  • Table API
    • for k-v pairs
    • better scalability and performance than Table Storage
    • language-specific SDKs to call table
  • Cassandra
    • column-family
    • SQL-based syntax
  • Gremlin
    • graph DBs
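
The native SQL API queries JSON documents with a SQL dialect; a minimal sketch (c is the conventional container alias, the properties are placeholder assumptions):

```sql
-- Cosmos DB SQL API: query JSON documents in a container.
-- Property names (category, price) are hypothetical.
SELECT c.id, c.name, c.price
FROM c
WHERE c.category = 'gaming' AND c.price < 60
ORDER BY c.price DESC
```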

Fundamentals of Modern Data Warehousing

  • infra for large-scale analytics and BI

Usually four steps:

  • ETL from trans. data stores, files, streams etc. -> data lake/wh
    • optimize for analytical queries
    • often distributed
  • Analytical data store
    • rel. data wh, data lakes, or hybrid
  • Analytical data model
    • pre-aggregated "cubes"
  • Data visualization
    • reports, dashboards etc.
    • consumed by non-technical personnel

Ingestion pipelines

  • Can use Data Factory or Synapse Analytics
  • Activities operate on data
  • linked services load and process data
    • Linked service must be published in Synapse before use

Analytical data stores

Data warehouse

  • rel. DB optimized for analytics
    • often denormalized
  • Numeric values in central fact tables
    • one or more dimension tables with entities to aggregate over
    • e.g. fact = sales orders
      • dim1 = customer
      • dim2 = product
      • dim3 = store
      • dim4 = time
    • called a star schema
      • Extends to a snowflake schema when dimension tables relate to further tables (e.g. product categories)
  • Good choice for transactional data that fits naturally into a relational DB and will be queried with SQL (example query below)
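
A hedged sketch of querying the star schema above (table and column names are assumptions, not a fixed convention):

```sql
-- Aggregate the sales fact over the product and time dimensions.
SELECT
    p.category,
    t.calendar_year,
    SUM(f.sales_amount) AS total_sales
FROM FactSalesOrder AS f
JOIN DimProduct AS p ON f.product_key = p.product_key
JOIN DimDate    AS t ON f.date_key = t.date_key
GROUP BY p.category, t.calendar_year
ORDER BY t.calendar_year, p.category;
```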

Data Lake

  • General file store, often distributed for high-performance access
  • schema-on-read for (semi-) structured data

Hybrid approaches

  • lake database / data lakehouse
    • raw data in lake
    • rel. storage layer abstracts underlying files
    • e.g. Delta Lake
  • PolyBase in Synapse Analytics (external-table sketch below)
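
A hedged sketch of the PolyBase/external-table pattern in Synapse, exposing lake files as a relational table (locations, names, and schema are placeholder assumptions):

```sql
-- Register the lake location and file format, then expose Parquet
-- files as an external table. All names here are hypothetical.
CREATE EXTERNAL DATA SOURCE SalesLake
WITH (LOCATION = 'abfss://data@mydatalake.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.ExternalSales (
    product_id   INT,
    sales_amount DECIMAL(12, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = SalesLake,
    FILE_FORMAT = ParquetFormat
);
```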

Services for analytical stores

  • Synapse Analytics
    • end-to-end
    • Synapse Studio interface
  • Databricks
    • common on multiple clouds
  • HDInsight
    • supports open-source cluster types
    • not as user-friendly

Notes on Synapse

  • Execute permissions must be set on all directories affected by a query, in addition to read permission on files.

Fundamentals of real-time analytics

Batch and stream processing

Batch

Advantages

  • Large volumes of data can be processed at a convenient time
  • Can schedule to run e.g. overnight or during off-peak hours

Disadvantages

  • Latency between ingestion and processing
  • All data must be ready before processing starts
    • i.e. errors stop the whole process

Diffs between batch and stream

  • data scope
    • batch: access to all data at once
    • stream: only the most recent or rolling window
  • Data size
    • batch: optimized for parallel processing of large datasets
    • stream: processes one record at a time (or microbatching)
  • Latency
    • batch: up to several hours
    • stream: seconds or milliseconds
  • Analysis
    • batch: complex analysis tasks
    • stream: simple aggregations or response functions

Combining batch and stream

Do simple analysis as stream while persisting data for large scale historical batch analysis

Stream processing architecture

  1. Event happens
  2. Generated data is captured in source
  3. data is processed
  4. processed data is written to sink

In Azure

  • Azure Stream Analytics: PaaS for non-changing query
  • Spark Structured Streaming
    • used in Synapse Analytics, Databricks, HDInsight
  • Azure Data Explorer: real-time DB for time-series
    • used in Synapse Analytics

Sources:

  • Azure Event Hubs
    • Queue with strict ordering and exactly-once semantics
  • Azure IoT Hub
    • Event Hub optimized for IoT
  • Azure Data Lake Store Gen 2
    • usually batch, can be used for stream too
  • Apache Kafka
    • open-source
    • can be used with HDInsight

Sinks:

  • Event Hubs
    • queue for downstream processing
  • Data Lake Store Gen 2
    • persist result as file
  • SQL DB, Synapse Analytics, Databricks
    • persist result as table
  • Power BI
    • reports and visualizations

Azure Stream Analytics

  • Streaming pipeline
    • ingest
    • process
    • persist
  • runs perpetually
    • processes data as it comes in
  • Good for continually capturing data (query sketch below)
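
Jobs are defined with a SQL-like query language. A minimal sketch counting events per device in 10-second tumbling windows ([iothub-input], [datalake-output], and the EventTime field are placeholder assumptions configured on the job):

```sql
-- Count events per device in 10-second tumbling windows.
-- Input/output aliases and the EventTime payload field are hypothetical.
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO [datalake-output]
FROM [iothub-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 10);
```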

Stream Analytics Jobs

  • Can use static data for reference or joining
  • Intensive jobs can be run on clusters (scalable dedicated tenants)

Spark on Azure

Used in Synapse Analytics, Databricks, and HDInsight

  • Synapse and Databricks support Delta Lake

Azure Data Explorer

Analyze big data from

  • web
  • apps
  • IoT
  • etc.

Ex. use case: process Stream Analytics errors

Ingest with connectors or programmatically

  • stream and batch

Good for logs and IoT

  • Uses Kusto Query Language (KQL)
    • Optimized for read
    • Made for DAs
  • Especially relevant in Synapse (as Synapse Data Explorer)

Fundamentals of Data Visualization

Mainly for DAs

  • charts in Excel
  • vis. in notebooks
    • Synapse
    • Databricks
  • BI tools

Power BI

  • Several tools and services
    • desktop app
    • phone app
    • web browser
    • Power BI Service
  • Workflow
    • Create reports in Power BI Desktop
    • Publish to Power BI Service
      • Can do simple modeling and report editing in Web interface to Service
      • Schedule report refreshing based on new data
      • Create dashboards
    • Consume dashboards and reports in web browser or phone app

Core concepts of Data Modeling

  • OLAP Cube model
    • numeric values to analyze (e.g. revenue) are called measures
    • entities aggregated over are called dimensions
  • Model is stored as star schemas
    • measures are central fact tables
    • dimension tables are related to by keys
      • Time is almost always included as a dimension
    • If dimensions have further relations to tables (e.g. product to category), it's a snowflake schema
  • In the model, ALL measure aggregations over ALL dimensions are pre-calculated
    • Allows extremely fast switching of aggregation levels (ROLLUP sketch below)
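
In plain SQL, the effect of pre-aggregating across levels can be approximated with ROLLUP, which computes every level of a hierarchy in one pass (schema is hypothetical):

```sql
-- Revenue at (year, month), (year), and grand-total levels at once.
-- FactSales and DimDate are placeholder star-schema tables.
SELECT
    t.calendar_year,
    t.calendar_month,
    SUM(f.revenue) AS total_revenue
FROM FactSales AS f
JOIN DimDate AS t ON f.date_key = t.date_key
GROUP BY ROLLUP (t.calendar_year, t.calendar_month);
```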

Attribute hierarchies

  • hierarchy = aggregation level
    • e.g. customer/city/country
    • e.g. day/month/year
  • Hierarchy change is called drill-up or drill-down
    • drill-up: coarser level (month -> year)
    • drill-down: finer level (month -> day)

AMs in Power BI

  • Import tables with data for analytical model (AM)
  • data modeling interface
    • relations between facts and dimensions
    • hierarchies
    • data types
    • display formats

Considerations for visualization

  • Use model to create visualizations
  • Tables and text
    • simple
    • good for several related values
    • exact metrics
    • less intuitive
  • Bar and column charts
    • compare numeric values for discrete categories
  • Line charts
    • compare categorized values
    • shows trends
  • Pie charts
    • proportions of total
  • Scatter plots
    • identify correlations
  • Maps
    • geo data

Interactive reports in Power BI

  • Link several visualizations
    • link aggregation level
    • link filters (e.g. sales in Seattle)

Concepts of data analytics

ETL vs ELT

  • ETL
    • simple processing
    • can preserve privacy by not loading sensitive things to cloud
    • easy to do, no specialist knowledge
  • ELT
    • data lake support
    • transformations with e.g. spark
    • good for big data (CTAS sketch below)
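
A hedged sketch of the ELT pattern in a Synapse dedicated SQL pool: raw data already loaded (e.g. via the external table sketched earlier) is transformed inside the store with CREATE TABLE AS SELECT (all names are assumptions):

```sql
-- ELT: transform raw external data into a curated internal table.
-- dbo.ExternalSales is the hypothetical external table from the PolyBase sketch;
-- a DISTRIBUTION option is required for CTAS in dedicated SQL pools.
CREATE TABLE dbo.CuratedSales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM dbo.ExternalSales
GROUP BY product_id;
```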

DA activities

  • Descriptive
    • what has happened?
    • historical data
    • KPI, ROI
  • Diagnostic
    • why did it happen?
    • anomaly detection and analysis
  • Predictive
    • what will happen?
  • Prescriptive
    • what should we do to make it happen?
    • decision theory
  • Cognitive
    • draw inferences from data and patterns
    • e.g. video transcribing

Post-exam reflections

  • Test the connection in the room you plan to take the exam in.
    • I had to use the hotspot on my phone since the wifi was too bad in Zenit Bageriet.
  • Several questions on specifics of Power BI, did not really expect that from the official learning material.

If you know relational and non-relational database theory, basic data engineering, and what the different types of plots are called, then you mainly need to study which Azure service to use for which purpose.