- DB admin: manage DBs and access, backup and restore
- security, policies
- DE: manage infra for data, cleaning, pipelines
- privacy, performance
- DA: explore and analyze data for visualizations and reports
- Azure SQL:
- Azure SQL Database: PaaS, fully managed
- Azure SQL Managed Instance: instance with SQL Server, more flexible than SQL DB, but more admin
- SQL Server on Azure VM: VM with SQL Server installed, max flexibility
- Azure DBs for open-source rel. DBs:
- MySQL
- MariaDB: rewritten and optimized MySQL
- PostgreSQL: allows non-rel. properties
- Azure Cosmos DB
- NoSQL: JSON, k-v pairs, etc. (semi-structured)
- Usually managed as part of application by developers
- Azure Storage
- Blob containers
- File shares: basically a NAS
- Tables: k-v stores for fast read and write
- Azure Data Factory
- Pipelines, ETL
- Azure Synapse Analytics
- Unified DA solution
- Pipelines as in ADF
- SQL
- Spark
- Synapse Data Explorer: query logs and telemetry with Kusto QL (KQL)
- Azure's answer to Databricks
- Azure Databricks
- Azure HDInsight
- Azure-hosted clusters for Spark, Hadoop, HBase, Kafka, and Storm.
- HBase: open-source large-scale NoSQL
- Storm: real-time data processing
- Azure Stream Analytics
- Stream processing
- Azure Data Explorer
- Standalone Synapse Data Explorer
- Microsoft Purview
- data governance and discoverability
- track data lineage
- Microsoft Power BI
- Avro: row-based; each record carries a JSON header describing its structure, data stored as compressed binary
- ORC (Optimized Row Columnar format): Optimized for Hive
- Column-major format
- Stripes contain columns + stats (count, min, etc.)
- Parquet: column-oriented; data stored in chunks (row groups) with metadata giving e.g. the location of each chunk of rows.
- Normalization reduces redundancy by splitting entities into their own tables and referencing them via primary/foreign keys (sketch below).
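A minimal normalization sketch in Python using the built-in sqlite3 module; all table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each entity type gets its own table; a customer is stored once...
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    -- ...and orders reference it via a foreign key instead of duplicating
    -- the customer's details on every order row.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL
    );
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (100, 1, 42.5);
""")

# Related entities are re-joined at query time.
print(conn.execute("""
    SELECT c.name, o.total
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchall())  # [('Alice', 42.5)]
```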
- Key-value: record with unique key and some value
- Document: Key-value with JSON value
- Column-family: Rows and columns where columns can be grouped (e.g. nested schema in parquet)
- Graph: Nodes and edges
- Transactional (OLTP) workloads: high-volume, low latency (e.g. retailer order database)
- BOTH read and write optimized
- CRUD: Create, Read, Update, Delete
- ACID (a transaction sketch follows this list):
- Atomicity: Each transaction either succeeds in all steps or fails all steps
- Consistency: Only valid states, completed transactions are reflected in DB
- Isolation: Concurrent transactions don't interfere
- Durability: Transactions are persisted (e.g. written to disk, possibly backed up)
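A minimal atomicity sketch with sqlite3, using a toy account table: either both updates commit, or the failed transfer rolls back entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE account (
    name TEXT PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0))""")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction: commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 500 WHERE name = 'alice'")
        conn.execute("UPDATE account SET balance = balance + 500 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    # The CHECK constraint fails (alice can't go negative), so the whole
    # transfer is rolled back -- bob never sees the 500 either.
    pass

print(conn.execute("SELECT * FROM account ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]
```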
- Analytical workloads: mostly read optimized
- Data lake: raw data from e.g. transactional system
- Data warehouse: optimized for read
- ETL takes Data lake -> Data warehouse (DW)
- fact: numerical value to analyze (target feature, e.g. sales)
- dimension: table with entities related to fact (feature, e.g. customer, product)
- OLAP: OnLine Analytical Processing model (cube with rollup etc.)
- pre-aggregated data ready for very fast querying
- Data scientists (DS) often use the DL directly to model data
- DA often uses DW to create OLAP and reports
- Business users use OLAP to consume reports
- One common database structure
- Based on tables
- Each row is an entity
- Strict schema: every row in a table has the same columns
- columns can be null
- columns are typed
- Simple definition:
- Each entity type gets its own table
- Each attribute gets its own column
- Each entity (row) is uniquely identified using a primary key
- Foreign keys link related entities in other tables
- Removes duplication of data
- Enforces data types (e.g. decimal prices, integral quantities)
- Primary keys can be composite, i.e. unique combinations of multiple columns.
- Open standard, many vendors have proprietary extensions
- Dialect examples:
- Transact-SQL (T-SQL): Microsoft SQL Server, Azure SQL. Adds procedural extensions so application logic can run in the DB
- pgSQL: PostgreSQL
- Procedural Language/SQL (PL/SQL): Oracle
- Statement types:
- Data Definition Language (DDL)
- Create, modify, remove tables
- CREATE, ALTER, DROP, RENAME
- Data Control Language (DCL)
- DB admins use DCL to manage permissions
- GRANT, DENY, REVOKE
- Data Manipulation Language (DML)
- Manipulate rows, query, insert etc.
- SELECT, INSERT, UPDATE, DELETE (WHERE is a clause used within these); sketch after this list
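A minimal sketch of the statement types (plus the database objects listed next) using sqlite3; DCL appears only as a comment, since SQLite has no user management, and all names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and modify schema objects.
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("ALTER TABLE product ADD COLUMN stock INTEGER DEFAULT 0")
conn.execute("CREATE VIEW cheap AS SELECT name FROM product WHERE price < 10")  # view
conn.execute("CREATE INDEX ix_product_name ON product(name)")                   # index

# DML: insert, query, update, and delete rows.
conn.execute("INSERT INTO product (id, name, price) VALUES (1, 'widget', 9.99)")
conn.execute("UPDATE product SET stock = 5 WHERE id = 1")
print(conn.execute("SELECT name, price FROM product WHERE stock > 0").fetchall())
conn.execute("DELETE FROM product WHERE id = 1")

# DCL has no SQLite equivalent; in T-SQL it would look like:
#   GRANT SELECT ON product TO analyst_role;
```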
- Database objects (created via DDL)
- Views: virtual tables based on SELECT query
- Stored procedure: function in SQL
- Index: optimize queries that filter on index column
- Makes reads faster, but slows writes

# Azure SQL Services and Capabilities
- SQL Server on VMs: IaaS
- Good for moving on-prem SQL Server to Azure
- SQL Managed Instance: PaaS
- Automated software updates, backups, maintenance
- SQL DB: PaaS
- Good when creating new application in the cloud
- One server that scales vertically
- Single Database: one database per server
- Elastic Pool: several databases per server, good for uneven loads
- SQL Edge: for IoT
- streaming time-series data
| | SQL Server on Azure VMs | Azure SQL Managed Instance | Azure SQL Database |
|---|---|---|---|
| Type of cloud service | IaaS | PaaS | PaaS |
| SQL Server compatibility | Fully compatible with on-prem | Near-100% compatibility; most on-premises databases can be migrated with minimal code changes using the Azure Database Migration Service | Supports most core database-level capabilities of SQL Server; some features an on-premises application depends on may not be available |
| Architecture | SQL Server instances are installed in a virtual machine; each instance can support multiple databases | Each managed instance can support multiple databases; instance pools can share resources efficiently across smaller instances | Provision a single database in a dedicated, managed (logical) server, or use an elastic pool to share resources across multiple databases with on-demand scalability |
| Availability | 99.99% | 99.99% | 99.995% |
| Management | You manage all aspects of the server: OS and SQL Server updates, configuration, backups, and other maintenance | Fully automated updates, backups, and recovery | Fully automated updates, backups, and recovery |
| Use cases | Migrate or extend an on-premises SQL Server solution and retain full control over server and database configuration | Most cloud migration scenarios, particularly when minimal changes to existing applications are needed | New cloud solutions, or migrating applications with minimal instance-level dependencies |
| Business benefits | Same as on-prem, but in the cloud for easy scaling etc. | Fewer admin tasks; login with Azure AD | Very little admin; scalable, high availability, security, encryption |
- MySQL:
- free Community edition, more powerful Standard and Enterprise editions
- MariaDB:
- created by creators of MySQL.
- Oracle Database compatibility features and built-in support for temporal data.
- Tables can be versioned.
- PostgreSQL
- hybrid relational-object DB
- geometric data (lines, circles, polygons)
- pgsql is a variant of standard SQL
- MySQL
- Based on community edition
- firewalls etc
- easy scaling
- some security and admin tasks are unavailable to users since Azure manages them
- new projects should use flexible server
- auto backups
- Good compatibility with LAMP (Linux, Apache, MySQL, PHP)
- MariaDB
- Based on community edition
- easy scaling
- fully managed
- PostgreSQL
- Same services as MySQL
- cannot perform specialized tasks, e.g. storing non-pgsql procedures or direct OS operations
- Most common extensions are supported, more are added over time
- single server, flexible server or hyperscale
- choose tier for single server
- flexible server is fully managed
- hyperscale scales horizontally
- for big data
- failure detection
- pgAdmin tool for DB management (a connection sketch follows)
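A minimal connection sketch with psycopg2; the server name, database, and credentials are placeholders, and Azure Database for PostgreSQL typically requires SSL:

```python
import psycopg2

# Placeholder values -- substitute your own server and credentials.
conn = psycopg2.connect(
    host="myserver.postgres.database.azure.com",
    dbname="mydb",
    user="myadmin",
    password="<password>",
    sslmode="require",  # Azure typically enforces TLS connections
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone())
conn.close()
```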
- Blob = Binary Large OBject
- Organized in containers, access is managed on container level
- block blob
- set of blocks up to 100 MB each
- up to 50,000 blocks, max ~4.7 TB
- use for discrete, large binary objects that change infrequently
- Page blob
- collection of 512-byte pages
- supports random read and write
- up to 8 TB (~16 billion pages)
- virtual disks on VMs are page blobs
- Append blob
- optimized for append operations
- cannot update or delete existing blocks
- Up to 4 MB per block
- Up to 195 GB
- Hot
- frequent access
- low latency
- high-performance media (SSDs?)
- Cool
- infrequent access
- cheaper storage than hot, more expensive access
- lower performance
- Archive
- rare access
- cheapest storage, most expensive access
- high latency, offline storage.
- access can take hours
- to read, a blob must first be rehydrated to the Hot or Cool tier
- Lifecycle management policies can move blobs between tiers automatically (an upload sketch follows)
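A minimal upload sketch with the azure-storage-blob SDK, choosing the access tier at upload time; the connection string, container, and file names are placeholders:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and names.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("logs")

with open("archive-2023.csv", "rb") as data:
    # Upload as a block blob directly into the Cool tier, since we expect
    # infrequent reads; a lifecycle policy could later move it to Archive.
    container.upload_blob(name="archive-2023.csv", data=data,
                          standard_blob_tier="Cool", overwrite=True)
```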
- Used for big data analytics
- all types of structured, semi-structured, and unstructured data
- part of Azure Blob Storage
- Mount point for distributed file systems (HDFS-compatible)
- Need to enable Hierarchical Namespace in Storage Account
- Can upgrade from flat namespace
- CANNOT downgrade to flat namespace
- cloud-based NAS
- up to 100 TB / Storage Account
- can set quotas on file shares
- File size up to 1 TB
- upload via portal or azcopy
- two performance tiers
- standard: HDD
- premium: SSD
- two file sharing protocols
- Server Message Block (SMB): Windows, Linux, macOS
- Network File System (NFS): Linux, macOS
- Only for premium tier
- requires a virtual network (an upload sketch follows this list)
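A minimal file-share upload sketch with the azure-storage-file-share SDK; the connection string, share name, and paths are placeholders:

```python
from azure.storage.fileshare import ShareFileClient

# Placeholder connection string, share, and file path.
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection-string>",
    share_name="teamfiles",
    file_path="q1.xlsx",
)
with open("q1.xlsx", "rb") as data:
    file_client.upload_file(data)  # creates or overwrites the file in the share
```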
- NoSQL, k-v pairs
- Items are stored in rows
- each row has unique key (partition key + row key)
- timestamp for last modification
- other columns can vary row-by-row
- usually denormalized
- partitioned on common property or partition key
- partitions are independent
- can filter on partition key to skip data reading
- rows are stored in row-key order (fast point lookups and range queries); sketch below
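A minimal sketch with the azure-data-tables SDK; the connection string, table name, and key values are placeholders:

```python
from azure.data.tables import TableServiceClient

# Placeholder connection string and table name.
service = TableServiceClient.from_connection_string("<connection-string>")
table = service.create_table_if_not_exists("Orders")

# PartitionKey + RowKey together form the unique key; all other
# columns can vary from row to row (there is no fixed schema).
table.create_entity({
    "PartitionKey": "store-001",
    "RowKey": "order-42",
    "Total": 42.5,
})

# Filtering on PartitionKey lets the service skip unrelated partitions.
for entity in table.query_entities("PartitionKey eq 'store-001'"):
    print(entity["RowKey"], entity["Total"])
```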
- Scalable NoSQL DB
- Supports multiple APIs
- Indexed and partitioned data
- Several Azure regions can be added to a Cosmos DB
- Results in local replicas
- Partitions up to 10 GB
- automatic indexing
- IoT
- Burst data
- Can define triggers on data ingest
- Retail and marketing
- Windows Store, Xbox Live
- Gaming
- very low latency
- spiking request rates
- Web and mobile
- Xamarin for Android/iOS
- SQL
- Native API
- Returns JSON documents (a sketch follows this list)
- MongoDB
- Table API
- for k-v pairs
- better scalability and performance than Table Storage
- language-specific SDKs to call table
- Cassandra
- column-family
- SQL-based syntax
- Gremlin
- graph DBs
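A minimal sketch with the azure-cosmos SDK using the SQL (native) API; the endpoint, key, database, and container names are placeholders:

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint, key, and names.
client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<key>")
container = client.get_database_client("shop").get_container_client("orders")

# Items are JSON documents; "id" plus the partition key identify an item.
container.upsert_item({"id": "42", "storeId": "store-001", "total": 42.5})

# The SQL API queries JSON documents with SQL-like syntax.
for item in container.query_items(
        query="SELECT c.id, c.total FROM c WHERE c.storeId = 'store-001'",
        enable_cross_partition_query=True):
    print(item)
```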
- infra for large-scale analytics and BI
Usually four steps:
- ETL from trans. data stores, files, streams etc. -> data lake/wh
- optimize for analytical queries
- often distributed
- Analytical data store
- rel. data wh, data lakes, or hybrid
- Analytical data model
- pre-aggregated "cubes"
- Data visualization
- reports, dashboards etc.
- consumed by non-technical personnel
- Can use Data Factory or Synapse Analytics
- Activities operate on data
- linked services load and process data
- Linked service must be published in Synapse before use
- rel. DB optimized for analytics
- often denormalized
- Numeric values in central fact tables
- one or more dimension tables with entities to aggregate over
- e.g. fact = sales orders
- dim1 = customer
- dim2 = product
- dim3 = store
- dim4 = time
- called a star schema
- Extended to a snowflake schema by adding further tables related to dimensions (e.g. product categories)
- Good choice for transactional data easily organized into a relational DB where we want to use SQL (a star-schema sketch follows)
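A toy star-schema sketch with sqlite3; all table names and values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: entities to aggregate over.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city     TEXT);
    -- Central fact table: numeric measures plus keys into each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'toys'), (2, 'food');
    INSERT INTO dim_store   VALUES (1, 'Seattle');
    INSERT INTO fact_sales  VALUES (1, 1, 10.0), (2, 1, 5.0), (1, 1, 7.5);
""")

# Typical analytical query: join facts to dimensions, then aggregate.
for row in conn.execute("""
    SELECT p.category, s.city, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.city
"""):
    print(row)  # e.g. ('toys', 'Seattle', 17.5)
```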
- General file store, often distributed for high-performance access
- schema-on-read for (semi-) structured data
- lake database / data lakehouse
- raw data in lake
- rel. storage layer abstracts underlying files
- e.g. Delta Lake
- PolyBase in Synapse Analytics
- Synapse Analytics
- end-to-end
- Synapse Studio interface
- Databricks
- common on multiple clouds
- HDInsight
- supports open-source cluster types
- not as user-friendly
- Execute permissions must be set on all directories affected by a query, in addition to read permission on files.
Advantages
- Large volumes of data can be processed at a convenient time
- Can schedule to run e.g. overnight or on off-peak hours
Disadvantages
- Latency between ingestion and processing
- All data must be ready before processing starts
- i.e. errors stop the whole process
- data scope
- batch: access to all data at once
- stream: only the most recent or rolling window
- Data size
- batch: optimized for parallel processing of large datasets
- stream: processes one record at a time (or microbatching)
- Latency
- batch: up to several hours
- stream: seconds or milliseconds
- Analysis
- batch: complex analysis tasks
- stream: simple aggregations or response functions
Common pattern: do simple analysis on the stream while persisting the data for large-scale historical batch analysis
- Event happens
- Generated data is captured in source
- data is processed
- processed data is written to sink
- Azure Stream Analytics: PaaS that runs a fixed, perpetual query
- Spark Structured Streaming
- used in Synapse Analytics, Databricks, HDInsight
- Azure Data Explorer: real-time DB for time-series
- used in Synapse Analytics
Sources:
- Azure Event Hubs
- Queue with strict ordering and exactly-once semantics
- Azure IoT Hub
- Event Hub optimized for IoT
- Azure Data Lake Store Gen 2
- usually batch, can be used for stream too
- Apache Kafka
- open-source
- can be used with HDInsight
Sinks:
- Event Hubs
- queue for downstream processing
- Data Lake Store Gen 2
- persist result as file
- SQL DB, Synapse Analytics, Databricks
- persist result as table
- Power BI
- reports and visualizations
- Streaming pipeline
- ingest
- process
- persist
- runs perpetually
- processes data as it comes in
- Good for continually capturing data
- Can use static data for reference or joining
- Intensive jobs can be run on clusters (scalable dedicated tenants)
Used in Synapse Analytics, Databricks, and HDInsight
- Synapse and Databricks support Delta Lake (a minimal streaming sketch follows)
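A minimal Spark Structured Streaming sketch in PySpark, using the built-in rate source as a stand-in for a real Event Hubs/Kafka input:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows for testing;
# a production job would read from Event Hubs or Kafka instead.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Tumbling one-minute window count: a typical simple streaming aggregation.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

# Write results to the console sink; a real pipeline would persist to
# a table or file sink instead.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```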
Analyze big data from
- web
- apps
- IoT
- etc.
Ex. use case: process Stream Analytics errors
Ingest with connectors or programmatically
- stream and batch
Good for logs and IoT
- Uses Kusto Query Language (KQL); a query sketch follows this list
- Optimized for read
- Made for DAs
- Especially in Synapse
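A minimal KQL sketch via the azure-kusto-data SDK; the cluster URI, database, and the LogEvents table are placeholders:

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URI; device-code login for simplicity.
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://<cluster>.kusto.windows.net")
client = KustoClient(kcsb)

# KQL reads as a pipeline: filter, then summarize, then sort.
# LogEvents is a hypothetical log table.
query = """
LogEvents
| where TimeGenerated > ago(1h)
| summarize errors = countif(Level == 'Error') by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
"""
for row in client.execute("<database>", query).primary_results[0]:
    print(row)
```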
Mainly for DAs
- charts in Excel
- vis. in notebooks
- Synapse
- Databricks
- BI tools
- Several tools and services
- desktop app
- phone app
- web browser
- Power BI Service
- Workflow
- Create reports in Power BI Desktop
- Publish to Power BI Service
- Can do simple modeling and report editing in Web interface to Service
- Schedule report refreshing based on new data
- Create dashboards
- Consume dashboards and reports in web browser or phone app
- OLAP Cube model
- numeric values to analyze (e.g. revenue) are called measures
- entities aggregated over are called dimensions
- Model is stored as star schemas
- measures are central fact tables
- dimension tables are related to by keys
- Time is almost always included as a dimension
- If dimensions have further relations to tables (e.g. product to category), it's a snowflake schema
- In the model, ALL measure aggregations over ALL dimensions are pre-calculated
- Allows extremely fast switching of aggregation levels
- hierarchy = aggregation level
- e.g. customer/city/country
- e.g. day/month/year
- Hierarchy change is called drill-up or drill-down
- drill-up: coarser level (month -> year)
- drill-down: finer level (month -> day); a pandas drill-up sketch follows
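A toy drill-up sketch with pandas, aggregating a day-grain measure up the time hierarchy; the data is made up:

```python
import pandas as pd

# Toy fact data at day grain.
sales = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-01", "2023-01-02", "2023-02-01"]),
    "amount": [10.0, 5.0, 7.5],
})

# Drill-up along the time hierarchy: day -> month.
by_month = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
print(by_month)  # 2023-01: 15.0, 2023-02: 7.5

# Drill-up again: month -> year.
by_year = sales.groupby(sales["date"].dt.to_period("Y"))["amount"].sum()
print(by_year)   # 2023: 22.5
```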
- Import tables with data for analytical model (AM)
- data modeling interface
- relations between facts and dimensions
- hierarchies
- data types
- display formats
- Use model to create visualizations
- Tables and text
- simple
- good for several related values
- exact metrics
- less intuitive
- Bar and column charts
- compare numeric values for discrete categories
- Line charts
- compare categorized values
- shows trends
- Pie charts
- proportions of total
- Scatter plots
- identify correlations
- Maps
- geo data
- Link several visualizations
- link aggregation level
- link filters (e.g. sales in Seattle)
- ETL
- simple processing
- can preserve privacy by not loading sensitive data into the cloud
- easy to do, no specialist knowledge
- ELT
- data lake support
- transformations with e.g. spark
- good for big data
- Descriptive
- what has happened?
- historical data
- KPI, ROI
- Diagnostic
- why did it happen?
- anomaly detection and analysis
- Predictive
- what will happen?
- Prescriptive
- what should we do to make it happen?
- decision theory
- Cognitive
- draw inferences from data and patterns
- e.g. video transcribing
- Test the connection in the room you plan to take the exam in.
- I had to use the hotspot on my phone since the wifi was too bad in Zenit Bageriet.
- Several questions on specifics of Power BI, did not really expect that from the official learning material.
If you know relational and non-relational database theory and basic data engineering, and know what different types of plots are called, then you mainly need to study which Azure service to use for which purpose.