DP-900-3
DP-900
Module 3: Semi-Structured Data & Microsoft Table Storage
- Consider Amazon.com as an example, an online transaction system that processes millions of purchases daily around the globe. Relational databases often struggle at this scale for several reasons:
- A single transaction usually needs updates to multiple tables (customer, order, product).
- Each update requires time and adds overhead to keep transaction integrity intact.
- Fetching data from normalized tables can slow down performance.
- Every row in a relational table must have the same columns, which complicates handling diverse product data like books, clothes, and electronics.
- As a result, relational databases lack the flexibility and scalability needed for large, diverse workloads.
* The Solution: Semi-Structured Data
- Semi-structured databases provide a flexible method for storing and processing significant amounts of diverse data efficiently. Here’s how they are different:
- Each transaction is stored in a single row without needing joins or multiple table updates.
- There are no foreign or primary keys; all data for a transaction is stored together.
- Transaction integrity is simpler because only one record gets affected.
- They can easily support millions of transactions each day.
- To achieve this, semi-structured data discards rigid schemas. Each row is self-describing, like JSON or XML, defining its own fields and structure.
Example:
{
"CustomerName": "Peter Vogel",
"Product": "Book",
"Price": 20.99,
"City": "London"
}
Another transaction might omit “City” or include different fields, and that’s acceptable.
* From Structured to Semi-Structured
- Semi-structured data is essentially denormalized data. Instead of spreading information across several tables, everything is kept together.
* Advantages
- Faster writes and fewer updates.
- No complicated joins or transaction management.
- High scalability for millions of operations daily.
* Limitations
- Data duplication, such as repeating customer addresses in every transaction.
- Limited search capabilities with no standard indexes or WHERE clauses like SQL.
- Less flexibility for complex queries.
- Usually tailored for specific applications, not for organization-wide analytics.
- Searching Challenges
- In relational databases, every row shares the same columns, making indexing and searching straightforward. For instance, you can easily find all customers in London. In semi-structured systems, not all rows have the same attributes, which limits indexing and searching.
- However, since semi-structured databases are designed for specific, high-volume applications, this is often not a significant problem.
* What About SQL?
- SQL (Structured Query Language) was created for structured data, so it doesn’t work directly with semi-structured databases. That’s why they are often referred to as:
* NoSQL databases (Not Only SQL)
- or Non-relational databases.
- These databases use APIs or query models specific to their design rather than SQL commands.
- Microsoft's Solution: Azure Table Storage
* Azure Table Storage is Microsoft's NoSQL key-value store designed for:
- High scalability
- Massive transaction processing
- Low latency operations
- It stores semi-structured data in tables, but unlike SQL tables:
- Each row (called an entity) can have different columns (called properties).
- Every entity must have:
- A PartitionKey to group related data
- A RowKey that acts as a unique identifier within a partition
- This structure allows for fast and cost-effective queries in large-scale systems like e-commerce, IoT, or logging.
* Real-World Use
- Most organizations use both types of databases:
- NoSQL (Table Storage) to handle fast, large-scale transactions.
- Relational Databases (SQL Server, Azure SQL) to store structured, shared data for analytics and reporting.
- For example, Amazon records orders in a semi-structured system for speed and keeps customer profiles in a relational database for easy access.
| Feature | Relational (SQL) | Semi-Structured (NoSQL) |
|---|---|---|
| Data Model | Structured | Flexible (self-describing) |
| Schema | Fixed | Variable per record |
| Speed | Slower (joins, transactions) | Faster (single record writes) |
| Scalability | Limited | Very high |
| Query Language | SQL | NoSQL / API-based |
| Example | Azure SQL Database | Azure Table Storage |
* Key Takeaways :
- Semi-structured data supports high-speed, high-volume transactions.
- It’s schema-less, denormalized, and self-describing.
- Azure Table Storage is Microsoft’s NoSQL solution for semi-structured data.
- Most enterprises adopt a hybrid approach, combining both structured and semi-structured systems for better performance and flexibility.
Azure Table Storage, Semi-Structured Data Explained
* What is Table Storage?
- Table Storage is Microsoft’s semi-structured database for managing large-scale data.
- It’s a key-value database. Every record (row) is identified by:
- Partition Key, which groups related rows.
- Row Key, which uniquely identifies a record within the partition.
- It can store 25 times more data than a single Azure SQL Database.
- Every row can have different properties, unlike relational tables with fixed columns.
| Feature | Relational DB | Semi-Structured (Table Storage) |
|---|---|---|
| Relationships | Supported | Not supported |
| Views / Stored Procedures | Supported | (some NoSQL DBs support them) |
| Indexes | Any column | Only Partition + Row Keys |
| Normalization | Supported | Not applicable |
| Search | Flexible | Limited but very fast |
| Speed | Moderate | Millions of transactions/day |
Advantage: Fast inserts and reads at scale.
Trade-off: Less flexible queries and relationships.
* How Table Storage Works
- Each row includes:
- Partition Key, which divides data into groups for scalability.
- Row Key, which is unique within each partition.
- Timestamp, which auto-updates when data changes.
- The best performance occurs when both the Partition Key and Row Key are known.
* Choosing a Good Partition Key
- The partition key should split data into balanced, evenly active groups.
- Avoid hot partitions, which cause an uneven load.
- For example, if customers are grouped by country, some, like India or the USA, may have too much activity.
- It’s better to partition by something evenly distributed, like region, customer type, or date range.
* Flexibility in Use
- In relational databases, any column can be indexed or searched.
- In Table Storage, searches depend only on the Partition Key and Row Key.
- So, the design relies on specific application needs:
- For a Customer Management System, country might work as a partition key.
- For a Lead Generation App, it might not work, since new leads’ countries aren’t known.
* Timestamp Property
- This is an auto-generated field that shows the last update time.
- It is used for concurrency control; you can check if another user modified the record since you last retrieved it.
* Tools to Work with Table Storage
1. Azure Storage Browser (Web Tool)
- You can access this via the Azure Portal.
- It lets you add, edit, or delete rows.
- You can manage files, blobs, containers, and tables directly in your browser.
2. Azure Storage Explorer (Desktop App)
- This is a downloadable tool for Windows, Mac, and Linux.
- It’s similar to File Explorer but for Azure resources.
- You can manage tables, blobs, files, and queues offline.
* Summary
- Table Storage is Azure’s key-value, semi-structured data store.
- It supports massive scale, fast operations, and a flexible schema.
- It is ideal for high-volume apps like IoT, logs, and user data.
- Choose partition keys wisely to avoid performance bottlenecks.
Comments
Post a Comment