DP-900 Microsoft Azure Data Fundamentals
Azure Data Fundamentals – Introduction Notes
1. Why Multiple Data Storage Solutions Are Needed
Organizations store many kinds of data.
Each type has its own storage and access needs.
Data serves different purposes, including operations, analytics, and global sharing.
As a result, Azure provides various storage options to meet these requirements.
2. Types of Data and Their Storage Solutions
A. File-Based Data
This type stores data as complete objects, such as pictures and documents.
You need the entire file to be useful; you can’t use half an image.
It’s best for data that should be stored and retrieved as a single unit.
Azure Solutions:
Azure Storage Account offers:
Files, which provide shared file system storage.
Blobs, for unstructured data like images, audio, and video.
B. Structured Data
This data is highly organized and follows a consistent structure.
Every record adheres to the same schema, like customer name, ID, and address.
It's ideal for transactional systems and applications that need reliability.
Azure Solutions:
Relational Databases, which include:
Azure SQL Database (Microsoft SQL Server)
Azure Database for MySQL
Azure Database for PostgreSQL
Support for third-party and open-source relational databases.
C. Semi-Structured Data
This type has a flexible schema and allows for variations between records.
It's used when records have different attributes, for example a product catalog where a dress has a size and color while a stove has a fuel type and dimensions.
It combines structure and flexibility.
Azure Solution:
Azure Cosmos DB
This is a multi-model, globally distributed database service.
It supports key-value, document, and graph data models.
It allows real-time data replication across regions, such as Berlin to Singapore.
It provides high scalability and performance for large data sets.
3. Using Data for Analysis
A. Data Collection and Transformation
Data is collected from multiple sources, including:
Files
Structured databases
Semi-structured sources
Transformation tools convert data into a consistent format for analysis.
B. Data Storage for Analysis
Data Warehouses store large volumes of historical data.
They are used for analyzing long-term trends and patterns.
Unlike transactional databases, which handle small, frequent queries, data warehouses process massive datasets at once.
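The difference between the two query styles can be sketched with Python's built-in sqlite3 module. SQLite is only a stand-in here (it is not a data warehouse), and the orders table and its values are made up for illustration.

```python
import sqlite3

# In-memory database with a small, hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, year INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (year, amount) VALUES (?, ?)",
    [(2022, 100), (2022, 250), (2023, 300), (2023, 50), (2024, 400)],
)

# Transactional style: fetch one record by its key.
one_order = conn.execute(
    "SELECT amount FROM orders WHERE id = ?", (3,)
).fetchone()

# Analytical style: scan every row and summarize by year.
totals = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()

print(one_order)
print(totals)
```

The first query touches a single row; the second reads the whole table, which is the kind of workload a warehouse is built for.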
C. Data Analysis Tools
Once data is loaded into the warehouse, it is analyzed to:
Understand past performance.
Generate insights from large datasets.
Predict future outcomes and trends.
4. Data Security and Privacy
Access control is crucial to prevent unauthorized access or corruption.
It protects intellectual property and sensitive customer data.
Privacy laws require that data remains confidential.
Only authorized users can view or modify specific data.
Role-based access ensures that people with technical expertise can manage systems without seeing private information.
5. Roles Involved in Data Management
| Role | Responsibilities |
|---|---|
| Users | Create, update, and use data for business activities. |
| Administrators | Control access, define permissions, manage who sees what data. |
| Data Analysts | Examine data to discover insights, trends, and predictions. |
| Auditors | Review how data is used and check for privacy/security compliance. |
| Engineers | Build, configure, and maintain data systems and infrastructure. |
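As a rough sketch of role-based access, each role in the table can be mapped to a set of permissions. The permission names below are hypothetical and are not Azure role definitions; the point is only that an engineer can manage systems without being granted access to the data itself.

```python
# Hypothetical permission names, loosely based on the roles table above.
ROLE_PERMISSIONS = {
    "user": {"read_data", "write_data"},
    "administrator": {"manage_access"},
    "data_analyst": {"read_data"},
    "auditor": {"read_access_logs"},
    "engineer": {"manage_infrastructure"},
}

def is_allowed(role, action):
    """Return True only if the role was explicitly granted the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("engineer", "manage_infrastructure"))
print(is_allowed("engineer", "read_data"))
```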
6. Key Takeaways
Azure provides various data storage solutions to meet business needs.
Files, structured, and semi-structured data each need unique handling.
Cosmos DB supports global distribution and multiple data models.
Data warehouses are optimized for large-scale analytical queries.
Security, privacy, and assigned roles ensure data integrity and controlled access.
Every role—user, admin, analyst, auditor, and engineer—contributes to a secure and efficient data ecosystem.
Module 1: File-Based Data in Azure
1. Understanding File-Based Storage
File-based storage involves keeping an entire file (object) as one unit.
You upload, download, and access the file as a whole.
You can't download or open just part of a file (like half an image or Word document).
It's also called:
Object Storage: the full file is seen as an object.
Unstructured Data: the content of the file can be anything and does not follow a fixed format.
2. Characteristics of Unstructured (Object) Data
Each file can hold various types of data:
An image file might contain several pictures.
An audio file can combine voices, music, and sound effects.
A Word file might store different kinds of text (emails, stories, etc.).
The main point is that you have to upload or download the entire file as one complete object.
3. Types of Files
All files are essentially binary data (zeros and ones).
They are usually divided into:
Text files: these can be edited in text editors like Notepad.
Binary files: these need specific software (like Photoshop for .jpg).
4. Common File Extensions
Examples include: .pdf, .json, .png, .csv, .xml, .gif, .jpeg
Text-based formats (for data storage and sharing):
CSV (Comma-Separated Values)
JSON (JavaScript Object Notation)
XML (Extensible Markup Language)
5. Real-World Examples
A company selling products around the world might have:
PDF files for product catalogs that customers can view.
Word documents with product specifications.
Image files with product photos in .png, .gif, or .jpeg.
Data exports for partners or vendors in formats like CSV, JSON, or XML.
These formats are widely used and supported across different systems.
6. Popular Text-Based File Formats
A. CSV (Comma-Separated Values)
Data is organized into rows and columns.
Each row ends with a carriage return and line feed.
Columns (fields) are separated by a delimiter, usually a comma.
Rows can contain different numbers of columns.
The first row may hold column names, which is optional.
Software that reads the file must know if the first row is a header.
Example:
Name,Age,City
Alice,25,New York
Bob,30,London,Engineer
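A minimal sketch of reading the sample above with Python's standard csv module. The has_header flag is an illustrative parameter: it stands for the point that the reader, not the file, decides whether the first row holds column names.

```python
import csv
import io

# The sample from the notes: a header row, then one row with an extra field.
SAMPLE = "Name,Age,City\r\nAlice,25,New York\r\nBob,30,London,Engineer\r\n"

def read_rows(text, has_header=True):
    """Return (header, rows); the caller states whether row 1 is a header."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if has_header:
        return rows[0], rows[1:]
    return None, rows

header, rows = read_rows(SAMPLE)
print(header)
for row in rows:
    print(len(row), row)
```

The parser accepts the four-field row without complaint, so any check that rows match the header has to be done by the consuming code.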
B. JSON (JavaScript Object Notation)
JSON is more structured and organized than CSV.
It stores data in key-value pairs using curly braces {}.
It supports nested data structures (data within data).
It's flexible; not every entry needs to have the same attributes.
Example:
{
"people": [
{"firstName": "Alice", "lastName": "Smith"},
{"firstName": "Bob", "middleName": "J", "lastName": "Brown"}
]
}
JSON clearly shows hierarchies and relationships (for example, address → city, country).
It is often used for exchanging data between systems.
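The flexibility described above can be sketched with Python's standard json module. The full_name helper is illustrative; it shows one way to handle an attribute (middleName) that only some records carry.

```python
import json

SAMPLE = """
{
  "people": [
    {"firstName": "Alice", "lastName": "Smith"},
    {"firstName": "Bob", "middleName": "J", "lastName": "Brown"}
  ]
}
"""

def full_name(person):
    """Join the name parts, skipping any attribute the record lacks."""
    parts = [
        person.get("firstName"),
        person.get("middleName"),
        person.get("lastName"),
    ]
    return " ".join(part for part in parts if part)

people = json.loads(SAMPLE)["people"]
names = [full_name(person) for person in people]
print(names)
```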
C. XML (Extensible Markup Language)
XML uses tags to describe and organize data.
Each element has opening and closing tags (< > and </ >).
It supports a hierarchical (tree-like) structure.
Example:
<person>
<firstName>Alice</firstName>
<lastName>Smith</lastName>
<address>
<city>New York</city>
<country>USA</country>
</address>
</person>
XML is still used, but it is now less popular than JSON.
It is good for defining structured documents and hierarchical relationships.
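A short sketch of walking the tree structure of the sample above with Python's standard xml.etree.ElementTree module. The to_dict helper is illustrative and assumes no repeated tags and no attributes, which holds for this sample but not for XML in general.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<person>
  <firstName>Alice</firstName>
  <lastName>Smith</lastName>
  <address>
    <city>New York</city>
    <country>USA</country>
  </address>
</person>"""

def to_dict(element):
    """Turn an element into nested dicts; leaf elements become their text."""
    if len(element) == 0:  # no child elements
        return element.text
    return {child.tag: to_dict(child) for child in element}

person = ET.fromstring(SAMPLE)
city = person.findtext("address/city")
as_dict = to_dict(person)
print(city)
print(as_dict)
```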
7. Other File Formats
Avro Format:
This format is commonly used in Hadoop ecosystems for large-scale storage.
It is mostly binary but has a JSON header at the top that describes the structure.
This combination makes it self-describing and flexible for big data systems.
It supports changes in structure without breaking compatibility.
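The idea of a self-describing file (a JSON header followed by a binary body) can be sketched as below. This is a toy layout invented for illustration and is not Avro's actual encoding; real Avro files are read and written with an Avro library.

```python
import json
import struct

def encode(schema, records):
    """Pack records as: 4-byte header length, JSON header, binary body."""
    header = json.dumps(schema).encode("utf-8")
    body = bytearray()
    for record in records:
        for field in schema["fields"]:
            value = record[field["name"]]
            if field["type"] == "int":
                body += struct.pack(">i", value)
            else:  # toy rule: everything else is a UTF-8 string
                raw = str(value).encode("utf-8")
                body += struct.pack(">H", len(raw)) + raw
    return struct.pack(">I", len(header)) + header + bytes(body)

def decode(data):
    """Read the JSON header first, then use it to interpret the body."""
    header_len = struct.unpack_from(">I", data, 0)[0]
    schema = json.loads(data[4:4 + header_len].decode("utf-8"))
    pos, records = 4 + header_len, []
    while pos < len(data):
        record = {}
        for field in schema["fields"]:
            if field["type"] == "int":
                record[field["name"]] = struct.unpack_from(">i", data, pos)[0]
                pos += 4
            else:
                size = struct.unpack_from(">H", data, pos)[0]
                pos += 2
                record[field["name"]] = data[pos:pos + size].decode("utf-8")
                pos += size
        records.append(record)
    return schema, records

schema = {
    "name": "Person",
    "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "int"}],
}
people = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
data = encode(schema, people)
print(decode(data)[1])
```

Because the header travels with the data, a reader needs no outside knowledge of the field layout, which is what "self-describing" means here.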
8. File Storage in Azure
Azure Storage Accounts are used to keep file-based (object) data.
Azure does not inspect the internal structure of a file; it only needs to:
Upload files as one object.
Download files as one object.
The contents of the file can be any type or format (text, image, JSON, binary, etc.).
9. Key Takeaways
File-based (object) storage means keeping complete files as single objects.
Unstructured data indicates that data inside files can vary and doesn't follow a fixed structure.
Common data formats include:
CSV: simple, row/column-based, separated by commas.
JSON: hierarchical, key-value pairs that support nesting.
XML: hierarchical with tagged elements.
Avro: binary with a JSON header, used in Hadoop.
Azure Storage can keep any file format as long as you upload or download the whole file.
File-based storage is perfect for media files, documents, and data exports.
Azure File-Based Storage
* Overview
File-based storage in Azure offers various solutions for storing and managing file data.
Main tools:
1) Azure Storage Accounts, for storing files and unstructured data.
2) Microsoft Purview, for managing and overseeing stored data.
1) Azure Storage Accounts
This is the main service for storing file-based or unstructured data.
It features two main storage types:
a) Azure Files, which supports traditional file share access for backward compatibility.
b) Blob Storage (Containers), which is optimized for internet-based access and stores binary large objects (BLOBs).
Blob = Binary Large Object
This term refers to binary data and does not necessarily imply that the data is large.
* Other Data Types in Storage Accounts
Azure Tables, used for semi-structured data (covered in Module 3).
Azure Data Lakes, used for data analytics and big data (covered in Module 5).
2) Microsoft Purview
This tool is for data governance and data discovery.
Data governance includes:
Defining rules for how long to keep files (e.g., retain for 7 years and then delete).
Setting policies for sharing and where to store data.
Preventing confidential documents from being shared outside the organization.
Discoverability helps find data across the company, which is useful in legal or compliance situations.
It allows control over the data lifecycle and who has access.
* Categorizing Files
Files typically fall into two categories:
1) Frequently used files
Example: Product catalogs.
These require fast upload and download times.
2) Infrequently used files
Slower access is acceptable for these files.
They usually make up most of the stored data.
* Storage Account Types
1) Premium Storage Account
This account uses SSD (Solid-State Drives).
It has no moving parts, which leads to faster performance and low latency.
It has a higher storage cost.
This option is best for frequently accessed files where speed is important.
2) Standard Storage Account
This account uses magnetic disks (spinning drives).
It provides slower performance and higher latency.
It has a lower cost.
This option is ideal for infrequently accessed files.
* Data Protection: Geo-Redundant Storage (GRS)
This feature keeps a secondary copy of data in a paired region, typically hundreds of miles away.
It ensures data availability during regional outages.
You can enable read access from the secondary region to reduce latency for users who are farther away.
It offers disaster recovery and improves data resilience.
* Summary
Azure Storage Accounts are the key solution for file storage, featuring options like Azure Files and Blob Storage.
Microsoft Purview offers governance, compliance, and discoverability features.
There is a trade-off between performance and cost when choosing between Premium and Standard Storage.
Geo-Redundancy adds reliability and provides global access options.
Create Azure Storage Account
Azure Files (File-Based Data in Azure Storage Accounts)
* Overview
Azure Files is a part of Azure Storage Accounts designed for file-based data.
It mimics a traditional hard disk file system, making it compatible with on-premises systems and applications.
It is best for scenarios that need standard file systems, mapped drives, or shared file access.
* Purpose & Use Cases
The main purpose is to ensure compatibility with existing file systems.
It is ideal for:
- Lift-and-shift migrations: move on-premises apps to the cloud without major code changes.
- Applications that expect a standard file system structure (e.g., folders for HTML pages, images, etc.).
- Docker containers that need a mounted or standard file system.
* Protocols Supported
1) SMB (Server Message Block)
This is the same protocol Windows uses to access shared folders and network drives.
It enables Azure Files to function like a local network drive.
It supports Windows Access Control Lists (ACLs) for managing user access.
It allows the continuation of existing file permissions and security rules from on-premises setups.
Note: Setting up SMB-based ACLs in Azure can be complex.
2) NFS (Network File System)
This protocol is commonly used by macOS and Linux systems.
It is available only in Premium storage accounts.
It requires setting up a Virtual Network (VNet) for access.
* Capacity & Limits
The maximum file share size is 100 TB per share.
The maximum file size is 1 TB per file.
There can be up to 2,000 active connections per file share.
* Quota Management:
You can define quotas to limit file share growth and control storage costs.
* Uploading Files to Azure Files
There are three main tools for uploading and managing files:
1) Azure Portal
This is a GUI-based method.
It allows browsing storage accounts, creating folders and subfolders, and uploading files directly.
2) PowerShell
This is a command-line tool with Azure-specific commands for file operations.
It offers scripting and automation capabilities, though it is less visual than the portal.
3) AzCopy
This is a dedicated command-line tool for transferring files to and from Azure Storage.
It is optimized for speed and large-scale uploads.
* Synchronization Tool: Azure File Sync
This tool allows syncing between on-premises servers and Azure File Shares.
Workflow:
Upload a file to an on-premises file share →
It is automatically synced to Azure File Share →
Then replicated to other servers or data centers.
This enables hybrid storage solutions and centralized file management.
* Summary
Azure Files provides cloud-based file shares that work with on-premises systems.
It supports SMB and NFS protocols for cross-platform access.
It offers 100 TB capacity, 1 TB per file, and 2,000 active connections.
You can upload files using Azure Portal, PowerShell, and AzCopy.
Azure File Sync enables automatic synchronization across locations.
It is best for lift-and-shift migrations, shared storage, and applications needing standard file systems.
File Shares
Azure Blob Storage
* Overview
Azure Storage Accounts support different data types, including tables, queues, files, and blobs. Azure Blob Storage, also known as Container Storage, is designed for unstructured data such as files, images, videos, documents, and backups.
Blobs are reached over HTTP/HTTPS by URL, which makes Blob Storage a better fit for internet-facing cloud applications than a traditional file share.
* Blob Storage Structure
Blob Storage employs a flat hierarchy:
A Storage Account contains one or more Containers.
Each Container holds multiple Blobs (files).
You cannot nest containers inside each other.
Each blob is accessed through a URL like:
https://<account>.blob.core.windows.net/<container>/<blob>
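The URL pattern above can be filled in with a small helper. The account, container, and blob names are placeholders, and percent-encoding the blob name with urllib is a design choice of this sketch, not something the notes prescribe.

```python
from urllib.parse import quote

def blob_url(account, container, blob_name):
    """Build a blob URL following the pattern shown above."""
    # Keep "/" unescaped so folder-style names such as "images/a.png" stay readable.
    encoded_name = quote(blob_name, safe="/")
    return f"https://{account}.blob.core.windows.net/{container}/{encoded_name}"

# Hypothetical names, for illustration only.
url = blob_url("contosodata", "catalog", "images/red dress.png")
print(url)
```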
There is no fixed limit on the number of containers or blobs, but performance might decline if too many blobs are in a single container.
The maximum blob size depends on the blob type: a block blob can hold roughly 190 TiB, and a page blob up to 8 TiB.
* Types of Blobs
Azure supports three blob types:
1) Block Blob – Used for uploading or downloading complete files; this is the most common type.
2) Page Blob – Used for random read/write operations inside a blob. This type is used by Azure Virtual Machines for virtual hard disks.
3) Append Blob – Optimized for adding data at the end, making it perfect for logs and continuous writes.
* Storage Account Types
There are two main storage account performance levels:
1) Premium Account – SSD-based, offers the fastest upload/download speed, costs more, and has no access tiers.
2) Standard Account – HDD-based, cost-effective, and supports Hot, Cool, Cold, and Archive access tiers.
* Access Tiers (Standard Accounts)
| Tier | Storage Cost | Transfer Cost | Use Case | Min Retention |
|---|---|---|---|---|
| Hot Tier | High | Low | Frequently accessed data (~20%) | — |
| Cool Tier | Low | High | Infrequently accessed (~80%) | 30 days |
| Cold Tier | Very Low | Higher | Rarely accessed data | 90 days |
| Archive Tier | Lowest | Highest | Long-term, offline storage (retrieval up to 15 hours) | 180 days |
Example:
Active product files → Hot Tier
Old files → Cool Tier
Backups → Cold Tier
Discontinued product files → Archive Tier
* Cost Considerations
Azure charges based on:
1) Storage amount (GB/month)
2) Data transfers (upload/download operations)
Hot tier offers expensive storage but affordable access.
Cool and Cold tiers provide cheaper storage with higher access costs.
The Archive tier has the lowest storage cost, but it comes with the highest latency and retrieval expense.
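The trade-off can be made concrete with a small calculation. The numbers below are made-up cost units chosen only to follow the ordering in the table (storage gets cheaper and access gets dearer from Hot to Archive); they are not Azure prices, and the sketch ignores minimum retention periods and archive retrieval delays.

```python
# Made-up cost units per GB per month; NOT real Azure prices.
PRICES = {
    "hot": {"store": 20, "read": 1},
    "cool": {"store": 10, "read": 10},
    "cold": {"store": 5, "read": 30},
    "archive": {"store": 1, "read": 50},
}

def monthly_cost(tier, stored_gb, read_gb):
    """Storage charge plus access charge for one month."""
    price = PRICES[tier]
    return stored_gb * price["store"] + read_gb * price["read"]

def cheapest_tier(stored_gb, read_gb):
    """Pick the tier with the lowest total for this usage pattern."""
    return min(PRICES, key=lambda tier: monthly_cost(tier, stored_gb, read_gb))

# Same 100 GB stored, with rising amounts read back each month.
for read_gb in (0, 50, 500):
    print(read_gb, cheapest_tier(100, read_gb))
```

With these placeholder numbers, data that is never read is cheapest in Archive, while heavily read data is cheapest in Hot even though its storage rate is the highest.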
* Lifecycle Management Policies
Azure allows you to automate file transitions between tiers to manage costs.
Examples:
Move files from Hot to Cool after 60 days of inactivity.
Move from Cool to Archive after one year.
Automatically delete files after a certain period.
Tiering rules apply only to Standard accounts, because Premium accounts have no access tiers; on a Premium account a policy can only delete blobs.
These policies simplify storage maintenance and help optimize costs.
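The example rules above can be sketched as a plain function. This simulates the decision logic only and is not the format Azure uses for lifecycle policies; the 60-day and one-year thresholds come from the examples, while delete_after_days is a hypothetical parameter.

```python
from datetime import date

def next_action(tier, last_access, today, delete_after_days=None):
    """Return the tier a blob should be in, or "delete"."""
    idle_days = (today - last_access).days
    if delete_after_days is not None and idle_days > delete_after_days:
        return "delete"
    if tier in ("hot", "cool") and idle_days > 365:
        return "archive"  # Cool to Archive after one year
    if tier == "hot" and idle_days > 60:
        return "cool"     # Hot to Cool after 60 days of inactivity
    return tier           # no rule applies yet

print(next_action("hot", date(2024, 1, 1), date(2024, 2, 1)))
print(next_action("hot", date(2024, 1, 1), date(2024, 3, 15)))
print(next_action("cool", date(2023, 1, 1), date(2024, 3, 1)))
```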
* Blob Change Feed
The Blob Change Feed keeps track of every change made to your blobs, which includes creating, modifying, and deleting, and stores this information in system blobs. It is off by default. However, when turned on, it enables you to:
Restore data to a specific point in time (for example, “files as of Friday 2 PM”).
Maintain audit logs and track versions.
Enable synchronization between storage accounts, similar to file sync.
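The idea behind restoring to a point in time can be sketched by replaying a change log up to a chosen moment. The event tuples below are a made-up structure, not the real change feed record format, and carrying the content inside each event is a simplification for illustration.

```python
from datetime import datetime

# Hypothetical log entries: (timestamp, operation, blob name, content).
EVENTS = [
    (datetime(2024, 5, 3, 9, 0), "create", "a.txt", "v1"),
    (datetime(2024, 5, 3, 13, 0), "modify", "a.txt", "v2"),
    (datetime(2024, 5, 3, 15, 0), "create", "b.txt", "v1"),
    (datetime(2024, 5, 4, 10, 0), "delete", "a.txt", None),
]

def state_as_of(events, moment):
    """Replay changes in time order and stop at the requested moment."""
    state = {}
    for timestamp, operation, name, content in sorted(events, key=lambda e: e[0]):
        if timestamp > moment:
            break
        if operation == "delete":
            state.pop(name, None)
        else:  # create or modify
            state[name] = content
    return state

# "Files as of Friday 2 PM" (3 May 2024 was a Friday).
friday_2pm = state_as_of(EVENTS, datetime(2024, 5, 3, 14, 0))
print(friday_2pm)
```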
* Quick Recap
Blob Storage is Azure’s cloud-optimized file system for unstructured data.
Structure: Storage Account → Container → Blob (flat structure).
Blob types include: Block (default), Page (for VMs), and Append (for logs).
Account types are Premium (for speed) and Standard (tiered).
Tiers consist of Hot (frequent access), Cool (infrequent access), Cold (rare access), and Archive (long-term).
Retention periods are: Cool – 30 days, Cold – 90 days, Archive – 180 days.
Lifecycle rules automate movement and deletion of files.
Change Feed assists in tracking changes and restoring data accurately.
Creating Blob Storage account and containers