DP-900-4

DP-900
Module 4 - Analyzing Data

* Introduction

- OLTP (Online Transaction Processing):

Systems that manage day-to-day operations such as inserting, updating, and deleting transactions.

- OLAP (Online Analytical Processing):

Systems that analyze large amounts of data and produce meaningful insights.

* What is OLAP?



OLAP is all about analyzing data, not updating it. The data is typically read-only. It may be loaded continuously but not modified. OLAP systems work with huge datasets - often terabytes of data collected over years. The goal is to transform raw data into valuable information that aids decision-making, such as predicting next year’s sales.


* Case Study: Sales Analysis Example

- Let’s consider a company that wants to analyze and predict next year’s sales. They require data from several sources:

1) Sales Forecasts (Predictions):

- Sales staff prepare these using Excel spreadsheets. Each spreadsheet lists customers, products, and expected sales quantities. These spreadsheets are uploaded to an Azure Storage Account as blobs in a container.

2) Actual Sales Data:

- This data is stored as semi-structured information using the Table API of Azure Cosmos DB. This choice supports the company's global operations, which handle hundreds of thousands of transactions daily and therefore need high performance and scalability. Data from various countries is sent back to headquarters for analysis.

3) Customer & Product Information:

- Both of these are structured data, so they are stored in Azure SQL Database. Product data is kept in structured tables since the products are similar. Customer data is stored in a Customers table because all customers share a consistent structure.

So, we have:

Excel data → Blob Storage

Sales data → Cosmos DB (Table Storage)

Customer/Product data → Azure SQL

All of this data must be combined into a single analysis system.


* ETL Process (Extract, Transform, Load)


- To prepare data for analysis, we use ETL (Extract, Transform, Load) tools.

1) Extract

- Data is pulled from multiple sources (Excel, SQL, Cosmos DB, etc.). Each source stores data differently, so extraction ensures we can access everything.

2) Transform

- Data is converted into a common format that can be analyzed together. This process may involve cleaning or adjusting the data. For example, we can convert sales data from “by customer” to “by country” and match IDs and formats across all datasets, such as customer IDs and product IDs. This transformation ensures that Excel, SQL, and Cosmos DB data can work together effectively.

3) Load

- After transformation, the data is loaded into a centralized analytical store—either a Data Warehouse or a Data Lake. This store must manage large volumes of data (terabytes) while allowing for quick access to specific sections, like “sales in Europe 2023.”

- Remember: 

- ETL = Extract → Transform → Load

- This preparation process makes analysis possible.
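The three steps can be sketched in plain Python. This is a toy illustration, not an Azure API: the source rows, field names, and the in-memory "warehouse" list are all made up for the example.

```python
# Toy ETL sketch: extract rows from two "sources", transform them to a
# common schema, and load them into one analytical store (a plain list).

def extract():
    # Extract: each source stores data differently.
    excel_forecasts = [{"customer": "C1", "product": "P1", "qty": 100}]
    cosmos_sales = [{"cust_id": "C1", "prod_id": "P1", "sold": 80, "country": "DE"}]
    return excel_forecasts, cosmos_sales

def transform(excel_forecasts, cosmos_sales):
    # Transform: match IDs and field names so the sources line up.
    rows = []
    for f in excel_forecasts:
        rows.append({"customer_id": f["customer"], "product_id": f["product"],
                     "kind": "forecast", "quantity": f["qty"]})
    for s in cosmos_sales:
        rows.append({"customer_id": s["cust_id"], "product_id": s["prod_id"],
                     "kind": "actual", "quantity": s["sold"]})
    return rows

def load(rows, warehouse):
    # Load: append the common-format rows to the central store.
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(*extract()), [])
```

In a real solution the extract step would read blobs, SQL tables, and Cosmos DB items, but the shape of the pipeline is the same.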


* Data Storage for Analysis


1. Data Warehouse

- This is optimized for structured and relational data. It is ideal for business intelligence (BI) and reporting. It supports complex queries and aggregations.

2. Data Lake

- This can store all types of data (structured, semi-structured, unstructured) and scales to massive volumes. It is used for big data analytics and machine learning. In many scenarios, data may flow from a data lake (raw data) into a data warehouse (processed data) for analysis.


* Analyzing and Reporting the Data

- Once the data is prepared and stored, use analytical tools like Azure Synapse Analytics, Power BI, or Azure Data Explorer to analyze trends, patterns, and predictions. Create visual reports and dashboards to communicate insights clearly. These reports help decision-makers act on data-driven insights.


* Cost Optimization Tip

- Azure charges for data moved out of a region (egress charges); inbound data (ingress) is generally free. Within the same Azure region, data movement is usually free or minimal. However, moving data across regions can be expensive.

* Best Practice: Keep your ETL processes, data warehouse, and data lake in the same Azure region as your data sources to minimize transfer costs.


* Key Takeaways

- OLTP: Transactional systems that process and update data.

- OLAP: Analytical systems that read and analyze large datasets.

- ETL: The essential process—Extract, Transform, Load.

- Storage: Use a Data Warehouse for structured data and a Data Lake for large mixed data.

- Tools Involved:

1) Data Sources: Excel, Cosmos DB, Azure SQL

2) ETL: Data Factory, Synapse Pipelines

3) Analysis & Reporting: Power BI, Synapse Analytics

4) Cost Tip: Keep all resources in one Azure region.


Step | Description | Azure Tools
Extract | Pull data from multiple sources | Data Factory, Synapse Pipelines
Transform | Convert, clean, and organize data | Data Flow, Databricks
Load | Store in Data Lake or Data Warehouse | Azure Synapse, Data Lake Storage
Analyze | Generate reports & insights | Power BI, Synapse Analytics


Processing Modes

* Overview :

- In analytic processing, we work with large volumes of data to generate insights.

* Types of Processing in Data Analysis


- There are two main processing modes:
1) Batch Processing
2) Stream Processing
- We can also combine both methods using Hybrid Processing.


1) Batch Processing (Most Common in Analytics)


- Batch processing is used when we gather data over a certain period and process it all at once. 
- It is triggered by events such as:
- The end of the day/week/month for report generation
- A data threshold being reached, for example, when 5,000 new sales are recorded, we run forecast updates.
- A typical scenario involves gathering sales data, loading it into a data warehouse, and then running analytics jobs.
- Batch processing means periodic, scheduled, or threshold-based analysis.
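A threshold trigger like the 5,000-sales example can be sketched as follows (toy code: the threshold is lowered to 5 for brevity, and "processing" is just summing the batch):

```python
# Toy batch trigger: buffer incoming sales and process the whole batch
# only once a threshold is reached (5,000 in the text; 5 here).
THRESHOLD = 5

buffer = []
processed_batches = []

def record_sale(amount):
    buffer.append(amount)
    if len(buffer) >= THRESHOLD:
        # "Run the analytics job" on the accumulated batch, then clear it.
        processed_batches.append(sum(buffer))
        buffer.clear()

for amount in [10, 20, 30, 40, 50, 60, 70]:
    record_sale(amount)
```

After seven sales, one batch of five has been processed and two sales remain buffered, waiting for the next trigger.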


2) Stream Processing (Real-Time or Near Real-Time)


- Stream processing is used when data flows in continuously without stopping. 
- It is common in telemetry and IoT devices. 
- Data arrives as a stream, often referred to as a queue of events.
- It is often processed using a rolling time window, such as:
- Analyzing the last 5 minutes of data
- Moving forward every minute
- It is called near real-time when processing does not occur every second but instead in short cycles.
- For example, a smart refrigerator sends temperature data continuously, which is processed every few minutes to monitor its performance.
- This approach is used when immediate insights are necessary.
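The rolling time window can be sketched with a queue of timestamped events (toy code; the 5-minute window and the refrigerator readings are the example from the text, the function names are made up):

```python
from collections import deque

# Toy rolling-window stream processor: keep only events from the last
# WINDOW seconds and report their average.
WINDOW = 300  # 5 minutes, in seconds

events = deque()  # (timestamp, value) pairs, oldest first

def ingest(timestamp, value):
    events.append((timestamp, value))
    # Drop events that have fallen out of the rolling window.
    while events and events[0][0] <= timestamp - WINDOW:
        events.popleft()

def window_average():
    return sum(v for _, v in events) / len(events)

# A refrigerator sending temperature readings over time:
for t, temp in [(0, 4.0), (60, 4.2), (400, 5.0)]:
    ingest(t, temp)
```

By the reading at t=400, the first two readings are older than the window and have been evicted, so only the latest reading contributes to the average.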

3) Hybrid Processing (Batch and Stream Together)

- In hybrid processing, data is streamed continuously but not processed right away. 
- The stream is stored first, usually in a data warehouse. 
- Later, it is analyzed using batch processing, which can occur hourly, daily, and so on. 
- This method is useful when real-time decisions are not critical.
- For instance, a refrigerator continuously sends temperature data that is stored and processed every two hours to track long-term cooling efficiency.
- This approach is known as HTAP, or Hybrid Transactional/Analytical Processing.

* Stream Processing Components 

Term | Meaning | Example
Event | A single piece of incoming data | Temperature reading / door sensor signal
Event Hub | Service that receives & organizes incoming events | Azure Event Hubs
IoT Hub | Event hub optimized for IoT devices | Smart home devices, factory sensors
Kafka | Open-source event ingestion alternative | Used in distributed streaming systems
Sink | Where events are sent for processing/storage | Data warehouse / analytical system
Analytic Cluster | A group of computers working together to process large data | Used when high-speed analytics is needed


* Cosmos DB + Azure Synapse Link (Important Feature)



- Normally, transactional data must undergo ETL (Extract, Transform, Load) before analysis. 
- Cosmos DB offers a shortcut with Azure Synapse Link. 
- This feature automatically and continuously copies data from Cosmos DB to Azure Synapse Analytics. 
- It enables fast analytics without affecting live operational systems.
- This method reduces the effort needed for ETL and speeds up analytics.

Processing Mode | Data Flow | When Used | Key Benefit
Batch | Collect → Store → Process later | Scheduled reports & analytics | Efficient for large historical datasets
Stream | Continuous → Real-time processing | IoT, telemetry, live monitoring | Immediate insights
Hybrid (HTAP) | Stream → Store → Batch analyze | When near real-time isn't necessary | Combines real-time collection + cost-efficient processing



Integrated Solutions

* Azure Synapse Analytics (Microsoft’s Integrated Solution)


- Azure Synapse Analytics is Microsoft’s comprehensive platform for:
1) Data ingestion (extracting data)
2) Data transformation
3) Data warehousing (storing structured analysis data)
4) Data analysis / querying
5) Running large-scale analytics workloads

* Core Components Used Inside Synapse  

Function | Tool Used by Synapse | Purpose
Extract & Transform | Azure Data Factory | Moves and cleans the data before storage
Warehouse / Storage | Azure Synapse (Data Lake Storage) | Stores data for analysis
Analysis / Querying | Azure Data Explorer & Apache Spark | Performs large-scale analytics and querying

* Important Cost Optimization Note

- Synapse can run continuously to analyze data at all times. However, many organizations run analytics periodically, such as for:

1) Daily sales reports
2) Monthly business summaries
3) Annual financial analysis

- If Synapse runs continuously, it can get costly. The recommended best practice is to:
- Pause and schedule Synapse workloads to run only when needed.

* Data Lake in Azure Synapse



- A Data Lake stores raw data in its original format, so there’s no need to convert everything into a single standard structure.

* Traditional Data Warehouse Process  

- Extract → Transform → Load into warehouse
- In this method, all data is changed into a single structure before loading.

* Data Lake Approach  

- Extract → Load (No formatting required)  
- Transform happens later during query time
- This setup allows for storing many types of files side by side, such as:

- .csv
- .json
- .xml
- .parquet (column-oriented storage format)

* Running Queries on Mixed Data Formats

- Synapse uses PolyBase, which:
1) Lets you query files directly in the Data Lake
2) Transforms data during query execution
- This approach is sometimes called ELT (Extract → Load → Transform during query).
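The ELT idea, keeping raw files as-is and transforming only when a query runs, can be sketched like this (toy code: PolyBase itself is a SQL engine feature, not a Python library, and the in-memory "lake" stands in for blob storage):

```python
import csv
import io
import json

# Toy data lake: raw files kept in their original formats, side by side.
lake = {
    "sales.csv": "country,amount\nDE,100\nFR,50\n",
    "sales.json": '[{"country": "DE", "amount": 25}]',
}

def query_total_by_country(country):
    # Transform happens here, at query time, per file format (ELT).
    total = 0
    for name, raw in lake.items():
        if name.endswith(".csv"):
            for row in csv.DictReader(io.StringIO(raw)):
                if row["country"] == country:
                    total += int(row["amount"])
        elif name.endswith(".json"):
            for row in json.loads(raw):
                if row["country"] == country:
                    total += row["amount"]
    return total
```

Note how no file was converted up front; each format is parsed only when the query touches it.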

* Enabling Storage for Data Lake: Hierarchical Namespace

- Data in a Data Lake is stored inside an Azure Storage Account as blobs. By default, Blob storage is flat, meaning:
- One container → many objects
- No folders, no subfolders
- If we enable Hierarchical Namespace (the feature that makes a storage account Azure Data Lake Storage Gen2, or ADLS Gen2), we get real folders and subfolders. This improves organization and access control.

Why Hierarchical Namespace Matters  

Benefit | Explanation
Organize data | Store files in logical folder groups (e.g., Sales, Inventory)
Secure data | Apply access permissions at folder level
Required for Data Lake | Data Lake won't function without it


* Delta Lake (Enhancement Layer over Data Lake)


- While a Data Lake is powerful, it lacks:
1) Strong transaction control
2) Data consistency handling
3) Simplified versioning
- Delta Lake adds these features on top of Data Lake storage.
- Key Benefit: 
Delta Lake makes batch processing and streaming appear the same. This means you can create a reporting solution that runs at night (batch) and later switch to real-time streaming data without changing your code. This significantly reduces development complexity.



* Azure Databricks (Non-Microsoft Integrated Solution)

- Azure Databricks is a unified analytics platform that:
- Is built on Apache Spark
- Uses Delta Lake
- Manages all parts of the analytics pipeline, which includes:
  - Extract
  - Transform
  - Load
  - Warehouse
  - Analyze

Why Many Organizations Prefer Databricks  

Advantage | Explanation
Easy Deployment | Comes as a packaged environment; no need to configure each component separately
Easy Collaboration | Analysts, data engineers, and scientists can work together in shared notebooks
Easy Sharing Across Teams/Regions | Useful for large organizations or multi-region deployments

Databricks focuses on scalable, collaborative, and reusable analytics workflows.

Feature | Azure Synapse Analytics | Azure Databricks
Vendor | Microsoft | Databricks
Based on | Integrated Microsoft services | Apache Spark (MS-hosted version available)
Data Storage | Data Lake (ADLS Gen2) | Data Lake + Delta Lake
Processing | SQL, Spark, Data Explorer | Spark (strong focus)
Best Use Case | Enterprise data warehousing + analytics | Collaborative data science + scalable ML/analytics
Deployment | Simplifies data integration automation | Simplifies data analytics collaboration and scaling


Demo: Setting Up an Azure Synapse Analytics Solution


* Azure Synapse Analytics is an integrated analytics environment. It combines multiple data tools so you do not have to set up each service separately.

1. Create a Synapse Workspace

- Go to the Azure Portal.
- Search for and select Azure Synapse Analytics.
- Click Create.
- The workspace is the core environment where you write and run analytical queries. You view query results and store reusable queries and artifacts. This is essentially your analytics working environment.

2. Resource Group Setup

- You select the resource group for your Synapse workspace. 
- However, Synapse needs other services to operate:
- Data transformation (Azure Data Factory)
- Data storage / Data Lake
- Other processing resources

- To keep your main resource group organized, Synapse uses a Managed Resource Group.
Managed Resource Group:
- Created automatically by Synapse
- Stores all components required by Synapse
- Managed by Azure (typically, you do not modify it)

- You can name it, but most people keep the default name.

3. Workspace Naming and Region

You enter:
- Workspace name (use lowercase only)
- Region (choose the same region where your data is located for better performance)

4. Selecting / Creating Data Lake Storage (ADLS Gen2)

Synapse uses Data Lake Storage Gen2 to hold data files.
You need to choose Data Lake Storage Gen2:

- If you already have a correctly configured storage account, select From Subscription.
- If you do not have one, choose Create New.

Important Setup Requirement:
The storage account must have Hierarchical Namespace enabled, because a Data Lake requires folder and subfolder support. This is why it is often easier to let Synapse create the storage account automatically, since:

- It enables hierarchical namespace.
- It ensures proper security configuration.
- It sets required permissions for your user identity.

You will give names to:

- Primary storage account (blob storage) e.g., phvsynblob
- Secondary storage account (files) e.g., phvsynfiles

Synapse will link your workspace to these automatically.

5. Security Configuration

You need to choose how users will authenticate.

Recommended: 
Use Azure Active Directory only.

This simplifies identity and permission management and enhances security.

You will also encounter an option regarding:

Allowing network access to the storage. This is locked because the storage is created automatically.

Encryption:

Default is Microsoft-managed encryption keys. 

You can choose your own keys, but remember that this cannot be changed later once the workspace is created.

6. Networking Settings

Default setting allows network access from all IP addresses. 

You can restrict this later if needed (e.g., only allow connections from corporate network ranges).

7. Review and Create

Azure will validate the configuration. 

Click Create.

Deployment may take several minutes.

When completed, Azure displays a list of all resources created, including:

- Synapse Workspace
- Data Lake Storage Accounts
- Managed Resource Group
- Spark Pools / SQL Pools (depending on configuration)

8. Viewing Your Workspace

Go to the resource group and select the Synapse Workspace.

Here, you will see:

- Synapse Studio link (where you run queries, notebooks, pipelines, etc.)

Analytics Tools available:

- Apache Spark (for big data processing)
- Data Explorer (fast query engine)

This is the operational interface for:

- Importing data
- Transforming data
- Running analytics jobs
- Building dashboards or reports


Concept | Meaning
Synapse Workspace | Your analytics working environment
Managed Resource Group | Automatically created to store supporting resources
Data Lake Storage Gen2 | Used for storing data files for analysis
Hierarchical Namespace | Enables folders/subfolders; required for Data Lake
Authentication | Best practice is Azure Active Directory only
Encryption | Storage encryption choice cannot be changed later

Azure Data Factory: Extract & Transform in ETL 

* What Are Extract and Transform?

- Extract: Pull data from multiple sources.  
- Examples include Excel files, SQL databases, Cosmos DB, and web APIs.
- Transform: Clean, convert, or reshape data.  
- Examples include:
1) Converting formats to match a common warehouse schema
2) Adding calculated fields
3) Re-grouping data for analysis
- These steps prepare data before it is loaded into a storage system, such as a warehouse or data lake.

* Azure Data Factory (ADF)



- Azure Data Factory is Microsoft’s cloud service used to:
1) Extract data
2) Transform data
3) Control automation and scheduling of these processes

* Azure Data Factory uses Transformation Pipelines.
* Transformation Pipelines

- A pipeline is a series of steps that Azure Data Factory runs to:
- Start with raw data, transform it, and deliver it to an output storage (sink).

- Key Terms
  
Term | Meaning
Source | Where data is extracted from
Activities | Steps that process or transform the data
Sink | The final destination storage where transformed data is placed

- You can also schedule pipelines to run:
1) Every hour
2) Daily
3) Weekly
4) Or event-triggered

* Pipeline Activities

ADF pipelines are built by dragging and dropping Activities, which define how data flows and changes.

Activity | Purpose | Example Use
Copy Data Activity | Moves data from one place to another | Copy data from Blob Storage → Synapse table
Data Flow Activity | Modifies data while transferring it | Convert formats, merge tables, clean fields
Lookup Activity | Retrieves reference / lookup data during transformation | Look up customer country from a SQL database during load

* No-Code / Low-Code Interface

- Azure Data Factory is visual. You do not need to write code. You configure each activity using property panels and parameters. This makes it usable for analysts, not just developers.

* Service Links (Linked Services)

- To extract and transform data, Azure Data Factory must connect to various systems. A Service Link, or Linked Service, is a connection configuration that tells ADF how to:
1) Authenticate
2) Communicate with
3) Retrieve data from a specific source.

- Examples of Linked Services  

Data Source | Purpose
Cosmos DB (MongoDB API) | Extract NoSQL documents
Azure Storage (Blob Containers) | Access .csv, .json, .parquet files
Azure SQL Database | Look up additional fields during transformation

- Service links are used throughout pipelines, not just in the extract phase.

* Summary  

Component | Role in ETL | Part of Azure Data Factory
Source Connection | Extract data | Service Links
Activities | Transform data | Copy, Transform, Lookup
Sink | Where final data is stored | Data warehouse / Data Lake / SQL DB


* Azure Data Factory:

- Automates extract and transform
- Provides flexible, visual pipeline building
- Connects to many cloud and on-premises systems
- Allows scheduled or triggered runs


 Analyzing Data

* Understanding the tools used to analyze data after it has gone through the Extract, Transform, Load (ETL) pipeline.

* Tools for Analyzing Data in Azure

1. Azure Synapse Data Explorer (Primary Analytics Tool)

- This tool is used when your data is loaded into Azure Synapse. 
- It allows you to run analytical queries on large datasets. 
- It uses a special query language called Kusto Query Language (KQL).

** About KQL (Kusto Query Language)

- KQL stands for Kusto Query Language. 
- It is easier to learn, especially if you know SQL. 
- It is very efficient for:
  - Aggregations
  - Summaries
  - Grouping operations
- KQL's piped syntax often makes summarizing and grouping simpler than the equivalent SQL.

- Built-in Visualization

- One of the best features of KQL is that you can generate charts directly from your query. 
- For example, after aggregating data, you can tell KQL to show it as:
  - a column chart
  - a line chart
- Charts render immediately as part of the query result.
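As an illustration of the summarize-and-group operations described, here is the equivalent in plain Python. The KQL in the comment is a sketch against a hypothetical Sales table, not a query from the course:

```python
from collections import defaultdict

# KQL sketch (hypothetical Sales table, illustrative only):
#   Sales
#   | summarize total = sum(Amount) by Country
#   | render columnchart
# The same summarize-and-group step in plain Python:
sales = [
    {"Country": "DE", "Amount": 100},
    {"Country": "FR", "Amount": 50},
    {"Country": "DE", "Amount": 25},
]

totals = defaultdict(int)
for row in sales:
    totals[row["Country"]] += row["Amount"]
```

In KQL the final `render` line would draw the chart directly from the aggregated result, which is the built-in visualization feature described above.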

* Open-Source Tools Available in Azure

2. Apache Spark

- This popular analytics engine processes large-scale data. 
- It supports SQL as its query language and works with both batch and streaming data.

3. Apache Hadoop

- This tool is useful when data cannot be processed quickly enough on a single machine. 
- It allows you to distribute processing across multiple computers. 
- It works closely with:
Apache HBase (a NoSQL datastore), which is used for large analytical clusters.

* Programming Languages for Analysis

- If you prefer coding your own analysis manually, you can use:
  - Python
  - R
- These languages give you full control over data processing and analysis.

* Combining Tools: HDInsight

4. Azure HDInsight

- This framework combines various open-source tools into one solution. It helps integrate:
- Apache Spark for analytics
- Apache Hive for data storage and warehousing
- Apache Kafka for streaming data
- Apache Hadoop for cluster management
- This is useful when you need different tools to work together. However, when too many independent tools are combined manually, management becomes complex, which is why integrated platforms are often easier.

* Integrated Analytics Platforms

- Instead of handling multiple services separately, platforms like Azure Synapse Analytics and Azure Databricks offer fully integrated, end-to-end analytics environments. These simplify data processing, analysis, and visualization.

* Summary

- Analysis comes at the end of the ETL pipeline. 
- Azure provides many tools, from integrated platforms to open-source engines. 
- KQL in Synapse Data Explorer is powerful for querying and visualizing data. 
- Tools like Spark, Hadoop, and HDInsight support large-scale distributed analytics. 
- Python and R are top programming choices for custom analytics workloads.


Reporting & Power BI

* What is Power BI?

Power BI is more than just a reporting tool. It supports the entire data analysis process, including:
- Extracting data
- Transforming data
- Loading data into datasets
- Creating interactive visualizations
- Publishing and sharing reports
- You can think of Power BI as having two main parts:

1. The Creating Part

This part is for designing and building interactive reports. It includes:
- Power Query
- Power View

2. The Distributing Part

This part is for sharing reports and dashboards with others. It includes:
- Power BI Service
- These components work together but are used individually during the report lifecycle.

* Power BI Designer / Power BI Desktop

- This free tool is used to create Power BI reports. Both free and licensed versions have the same features; the main difference is that the free version has a stricter data size limit.

* Power BI Desktop includes:
- Power Query
- Power View

* Power Query – Extract, Transform, Load (ETL)

- Power Query is the first step in building a report. It is used to:
- Extract data from various data sources
- Transform it into a format Power BI understands
- Load it into your dataset
- By default, Power Query performs a basic ETL process automatically.

* When Customization Is Needed

- Often, you need to customize the ETL steps in real-world scenarios, such as:
- Joining tables
- Splitting tables
- Cleaning invalid data
- Removing unnecessary rows
- Improving report performance
- Power Query keeps a record of every transformation step, building a pipeline similar to Azure Data Factory. Whenever you refresh your data, Power Query automatically replays all your custom steps.
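The record-and-replay behavior can be sketched as a list of transformation functions that is re-applied on every refresh (toy code, not the Power Query M engine; the steps and rows are made up):

```python
# Toy "applied steps" pipeline: each recorded step is a function; a
# refresh replays every step against freshly extracted data, in order.
steps = []

def add_step(fn):
    steps.append(fn)

def refresh(raw_rows):
    rows = list(raw_rows)
    for step in steps:
        rows = step(rows)
    return rows

# Recorded customizations: drop invalid rows, then add a calculated field.
add_step(lambda rows: [r for r in rows if r["qty"] > 0])
add_step(lambda rows: [dict(r, total=r["qty"] * r["price"]) for r in rows])

result = refresh([{"qty": 2, "price": 10}, {"qty": 0, "price": 99}])
```

Every refresh starts from the raw extract and replays the same pipeline, which is why the cleaned, calculated result stays consistent as the source data changes.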

* When You Don’t Need ETL Customization

- If your data comes from a well-designed data warehouse using a star schema, the default behavior of Power Query is usually sufficient.

* DirectQuery Mode

- Sometimes, importing data isn’t suitable due to:
- Constantly changing data
- Dataset size limits
- In these cases, use DirectQuery, which:
- Does not import the data
- Connects directly to the source system
- Performs real-time queries
- Uses an ELT (Extract → Load → Transform) process
- Trade-offs: some Power BI features may not be available, and reports may feel slower because they query the source system each time.


* Power View – Creating Visual Reports

- After preparing data with Power Query, you use Power View to build visual reports.
- What Power View offers:
- 40 to 50 built-in visualizations
- The ability to import hundreds more from the Power BI visuals gallery
- Interactive report-building features

* Interactivity in Power BI Reports

- Power BI visualizations are connected by default.
- Examples: selecting a product in one visual updates all other visuals on the page. A date slicer lets users choose a date range, and all visuals update accordingly. This makes Power BI reports dynamic and interactive.

* Drill-Up and Drill-Down Features

- Power BI supports hierarchies, such as:
- Region → Country → City → Customer
- Date → Year → Quarter → Month → Week

- Users can:
- Drill down to more detailed levels
- Drill up to see the broader view
- Power BI automatically creates hierarchies for date fields after import.

* Data Modeling in Power BI

- Power BI can import tables from multiple data sources:
- Excel spreadsheets
- Azure SQL Database
- Cosmos DB Tables
- Tabular models
- Many others


* Problem:

- Tables from different data sources don’t automatically relate to each other.

* Solution:

- Use Power BI Data Modeling to create relationships between tables.
You define relationships such as:
- CustomerID in Azure SQL ↔ CustomerID in Cosmos DB
- ProductID in Sales table ↔ ProductID in Excel forecast sheet
Sometimes, Power BI can automatically detect relationships when:
- All related columns have the same name
- All tables come from the same source
- You still need to review and confirm these relationships.
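Defining a relationship is essentially declaring a join key between two tables. A minimal sketch with made-up rows, using CustomerID as the key:

```python
# Toy relationship: match rows from two sources on CustomerID, the way a
# Power BI model relates an Azure SQL table to a Cosmos DB table.
sql_customers = [{"CustomerID": 1, "Name": "Contoso"},
                 {"CustomerID": 2, "Name": "Fabrikam"}]
cosmos_sales = [{"CustomerID": 1, "Amount": 500},
                {"CustomerID": 1, "Amount": 200},
                {"CustomerID": 2, "Amount": 300}]

# Build a lookup on the shared key, then resolve each sale's customer.
by_id = {c["CustomerID"]: c["Name"] for c in sql_customers}
sales_with_names = [{"Name": by_id[s["CustomerID"]], "Amount": s["Amount"]}
                    for s in cosmos_sales]
```

Without the shared key the two tables would have no way to relate, which is exactly the problem data modeling solves.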


* Summary

Power BI is the main reporting tool in Azure's data ecosystem. It serves two key functions:
- Creating reports (Power Query + Power View)
- Distributing reports (Power BI Service)
Power Query handles ETL/ELT and builds transformation pipelines. DirectQuery is used for live data and large datasets. Power View creates interactive visualizations with drill-down and slicers. Data modeling links tables from multiple sources to form a coherent dataset.


Power BI Visualizations

* Power BI offers many visualizations, with each one serving to tell a specific story with your data.

1) What types of visualization exist
2) Why we choose one visualization over another
3) How visual types fit into the four types of analysis

* 4 Types of Data Analysis

- Power BI visuals help tell stories that fall under four main analysis types:

1) Descriptive Analytics

- Answers the question, “What happened?” 
- Provides a record of past events.
- Common visuals include tables and basic charts.

2) Diagnostic Analytics

- Answers the question, “Why did it happen?”
- Compares trends, relationships, and categories.
- Example: Sales vs. Inventory, Sales vs. Temperature
- Common visuals include column charts, line charts, and scatter charts.

3) Predictive Analytics

- Answers the question, “What will happen next?”
- Uses patterns in past data to forecast outcomes.
- Common visuals include line charts, trend charts, and forecasting tools.

4) Prescriptive Analytics

- Answers the question, “What should we do?”
- Combines diagnostic and predictive analytics.
- Gives actions or suggestions based on past and future data.
- Example: “Keep more inventory to maintain sales.”

* Common Power BI Visualizations Explained

- Below are the main visual types covered in the lecture, along with reasons for using each.

1) Table Visualization

- Looks like a list of columns and rows.
- Best for simple descriptive reports.
- No trends or comparisons; just raw data.

2) Column and Bar Charts

- Used to compare values visually.
- Column Chart
- Columns rise upward.
- Helps understand variations (“Why is this column taller or shorter?”)

- Bar Chart
- Bars extend sideways.
- Useful when labels are long.

- Clustered Column Chart
- Shows two values side-by-side within each category.
- Helps compare relationships:
- How the dark-blue bar affects the light-blue bar
- How clusters differ from one another

- Stacked Column Chart
- Stacks values on top of one another.
- Good for comparing totals of multiple numbers.

- 100% Stacked Column Chart
- Forces all columns to the same height (100%).
- Shows each part’s contribution as a percentage.
- Useful when percentage share is more important than totals.

3) Line Chart

- Ideal for showing change over time.
- Displays trends month-by-month or year-by-year.
- Helps answer questions like:
Why did sales spike?
Why did expenses drop?
- Leads naturally into predictive analytics using trend projections.

4) Waterfall Chart

- Illustrates how a value changes from one stage to another.
- Starts at an initial value
- Adds or subtracts steps 
- Ends at a final value
Great for showing:
- Profit breakdown
- Budget changes
- Monthly growth
- It hides repeated values and focuses only on the change, not the entire amount.
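The start-plus-steps arithmetic behind a waterfall chart can be sketched as a running total (made-up figures):

```python
# Toy waterfall: a start value, a series of signed changes, and the
# running totals a waterfall chart would draw as its steps.
start = 1000
changes = [+200, -150, +50]  # e.g. revenue up, costs up, adjustment

levels = [start]
for change in changes:
    levels.append(levels[-1] + change)

final = levels[-1]
```

Each bar in the chart spans only the change between consecutive levels, which is why the visual emphasizes the deltas rather than the full amounts.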

5) Map Visualizations

- Displays numbers by geographical location.
- Bubble size indicates bigger or smaller values.
Helps identify:
- Regional performance
- Geographic trends
- Also supports some diagnostic thinking:
- Why are numbers higher in the West region?
- Why are values low on the East Coast?

6) Pie Chart

- Shows each category’s contribution to the whole.
Best when:
- There are only 4 to 5 segments.
- You only care about major contributors.
Not good when:
- There are too many tiny slices.
- You want detail on smaller components.

- Tree Map
- A better option when a pie chart has too many categories.
- Uses rectangles inside a larger square.
- Small values are easier to see.
- Good for showing the contribution of many items at once.

7) Scatter Chart

- One of the most powerful visuals in Power BI.
- Scatter Chart Uses:
- Descriptive: shows how two numbers relate (X-axis vs. Y-axis).
- Predictive: if points form a line, we can predict the Y value given the X value.
- Diagnostic: shows multiple clusters that indicate different patterns.
- Prescriptive: identifies outliers. A high-profit outlier suggests, "Do more of this." A low-performance outlier suggests, "Avoid this."
- Scatter charts support all four analytics types based on how they’re used.

* Summary

- Power BI visuals are chosen based on the story you want to tell:
- Tables represent raw facts.
- Column and bar charts make comparisons.
- Stacked charts show totals and contributions.
- Line charts display trends and predictions.
- Waterfall charts illustrate step-by-step value changes.
- Maps support regional analysis.
- Pie and tree maps show category contributions.
- Scatter charts clarify relationships, predictions, and outliers.
- Selecting the right visualization helps move from merely describing data to diagnosing problems, predicting outcomes, and making informed decisions.


Power BI Service 

1. What is Power BI Service?

- Cloud-based platform used to share, distribute, and manage Power BI reports.
- Not fully included with free Power BI Desktop; you get 30–90 days trial.
- Licensed Power BI users get a permanent Power BI Service account.

2. Publishing to Power BI Service

- When a report is ready → Publish to Power BI Service.
- Published into a workspace.
- Other users can access the workspace and view/interact with the report online.

3. Workspaces

Workspaces serve three key functions:

1) Security & Access Control

- Control who can access, publish, and use the reports.

2) Report & Dataset Management

When you publish, two items are created:
1) Report
2) Dataset
Datasets record:
- Data lineage (where data comes from)
- Transformations applied (Power Query)

3) Data Refresh

- You can refresh datasets manually or automatically.
- Supports scheduled refresh:
- Example: refresh nightly at 2 AM for “end-of-day” updates, or refresh every X hours.

4. Dashboards

- Dashboards allow combining visuals from multiple reports & datasets.
- You can pin:
- Individual visualizations
- Entire report pages
- Limitations:
- Dashboards have less interactivity compared to reports.
- All pinned visuals must come from the same workspace.

5. Apps

- Apps are created from workspaces.

Combine:

- Multiple reports
- Multiple dashboards
- Provide a single package for users to download/use.
- Useful when users need a group of related reports together.
- Again, everything must be in the same workspace.

6. Paginated Reports

- Traditional, page-based, printable reports.
- Built using SQL Server Reporting Services (SSRS) under the hood.
- Pulled from the Power BI dataset.
- Output can be PDF or printed.

- Best for:
- Tabular, report-like views
- High-detail lists
- Official documents

7. Power BI Service Types

- Two deployment models:

1) Cloud Power BI Service

- Default version paired with Power BI Desktop.

- Requires a gateway if the dataset refreshes from:
- On-premises databases
- File shares
- Local network sources

2) On-Premises Power BI Service

- Installed on your organization’s servers.
- No need for a gateway.
- Access strictly controlled within internal network.

8. Gateways

- Required when using cloud Power BI Service and refreshing data from on-premises sources.
A gateway allows:
- Cloud → On-premise connection
- Automated scheduled refresh
- Secure data transfer
Not required if:
- Data source is cloud-based (Azure SQL, SharePoint Online, etc.)
- Using on-premises Power BI service

* Summary

Power BI Service enables:

- Sharing (publish reports)
- Security (workspace access control)
- Automation (data refresh schedules)
- Consolidation (dashboards & apps)
- Traditional reporting (paginated reports)
- Hybrid data connectivity (gateway for on-premises sources)













