AI-900 10,11

Module : 10 

Introduction to Computer Vision Concepts 

Computer Vision Tasks & Techniques  

* What is Computer Vision?

Computer vision uses AI techniques to process and understand visual input such as:
- Images  
- Videos  
- Live camera streams  

It enables machines to "see" and interpret visual information.  

1. Image Classification  


- This predicts a single label for an entire image.  
- The model is trained using many labeled images.  
- For example, a smart grocery checkout can identify an apple, orange, or banana from a single item placed on a scale.  
- Use case: Recognizing the main subject in an image.  

2. Object Detection 


- This identifies multiple objects in a single image.  
It returns:  
- Object labels  
- Bounding box coordinates  
- For example, a checkout system can detect all fruits placed together.  
- Use case: Finding items and their positions.  

3. Semantic Segmentation


- This involves pixel-level classification.  
- It produces detailed masks that show exact object shapes.  
- This method is more precise than object detection.  
- Use case: Medical imaging, self-driving cars, and defining precise object boundaries.  

4. Contextual Image Analysis (Multimodal Models)  

- This technique understands the relationships between objects and text.  
- It can describe activities, generate captions, and suggest tags.  
- For example, it recognizes "A person eating an apple."  
- Use case: Caption generation, content moderation, and visual search.  


Images and Image Processing  

1. How a Computer Sees an Image  

- An image is a numeric array of pixel values.  
- Example: a 7 × 7 array represents an image that is 7 pixels wide and 7 pixels tall.  
- Pixel values in grayscale are:  
0 = black  
255 = white  
- Values in between represent shades of gray.  


* Grayscale Image  

- It is represented using 1 channel, which is a 2D array of rows and columns.  


* Color Image (RGB)  

- It has 3 channels:  
Red  
Green  
Blue  
- Each channel is a 2D array.  
- The final pixel color is a combination of the 3 numbers.  
Examples:  
- Purple pixel: R=150, G=0, B=255  
- Yellow pixel: R=255, G=255, B=0  
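As a quick sketch in Python using NumPy (the tiny 2 × 2 image here is invented purely for illustration):

```python
import numpy as np

# A 2x2 color image: 3 channels (R, G, B) in a (height, width, channels) array.
image = np.zeros((2, 2, 3), dtype=np.uint8)

# Top-left pixel set to purple (R=150, G=0, B=255), as in the example above
image[0, 0] = [150, 0, 255]

# Top-right pixel set to yellow (R=255, G=255, B=0)
image[0, 1] = [255, 255, 0]

# A grayscale image needs only 1 channel: a 2D array of rows and columns,
# where 0 = black and 255 = white.
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)

print(image[0, 0])  # [150   0 255]
print(gray.shape)   # (2, 2)
```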




2. What Are Filters?  

- Filters change pixel values to create visual effects.  
- A filter is defined by a small matrix called a filter kernel, such as a 3 × 3 matrix, which is applied across the image.  
- Example kernel for edge detection (Laplace filter):  

-1  -1  -1  
-1   8  -1  
-1  -1  -1  

Example:  
Let's start with the grayscale image we explored previously.  

First, we apply the filter kernel to the top-left patch of the image, multiplying each pixel value by the corresponding weight value in the kernel and adding the results.  

The result (-255) becomes the first value in a new array. Then we move the filter kernel one pixel to the right and repeat the operation.  

Again, the result is added to the new array, which now contains two values:  

-255  -510

3. Convolution (How Filters Work)  

- The filter kernel moves across the image.  
- For each 3×3 patch, you multiply pixel values by kernel values.  
- Then, add the results to get a weighted sum.  
- Place the result in a new output image.  
- Repeat this for every possible patch.  
- If values are below 0 or above 255, they must be clipped to the range of 0–255.  
- Borders use padding, usually with 0s.  
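The steps above can be sketched in Python with NumPy (a minimal illustration: the 5 × 5 test image is invented, and the loop-based implementation favors clarity over speed):

```python
import numpy as np

# Laplace edge-detection kernel from the text
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

def convolve(image, kernel):
    """Slide the 3x3 kernel over every patch, take the weighted sum,
    clip the result to 0-255, and pad the borders with zeros."""
    padded = np.pad(image.astype(int), 1, mode="constant")  # zero padding
    out = np.zeros(image.shape, dtype=int)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patch = padded[r:r + 3, c:c + 3]
            out[r, c] = np.sum(patch * kernel)      # weighted sum
    return np.clip(out, 0, 255).astype(np.uint8)    # clip out-of-range values

# A small grayscale image with a bright square in the middle
img = np.zeros((5, 5), dtype=np.uint8)
img[1:4, 1:4] = 255

edges = convolve(img, kernel)
```

Inside the uniform bright square the weighted sum cancels to 0, while pixels on its border produce large values that get clipped to 255 — which is how the Laplace filter highlights edges.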

4. Result of Filtering  

- Filtering produces a new image that is a transformed version of the original.  
Examples:  
- Laplace filter highlights edges.  
- Other filters can:  
  - Blur  
  - Sharpen  
  - Invert colors  
  - Detect edges  
  - Smooth noise  

* Key Concept: Convolution  

- Since the filter moves across the image and performs weighted sums, this is called convolutional filtering.  
- This idea is also the basis of Convolutional Neural Networks (CNNs).

Convolutional Neural Networks (CNNs)  

* Purpose of CNNs  

- CNNs are deep learning models used in computer vision.  
- Goal: extract meaning or insights from images (e.g., classification, detection).  
- CNNs use filters to automatically learn visual patterns from data.  


* How CNNs Work (Simplified Overview)  

1. Input  

- Images with known labels (e.g., apple = 0, banana = 1, orange = 2).  

2. Convolution Layers (Feature Extraction)  

- Use filter kernels to scan the image.  
- Kernels start with random weights.  
- As training progresses, weights adjust to detect meaningful features.  
- The output of each filter is a feature map.  
Feature maps:  
- Highlight edges, textures, shapes, colors, and more.  
- Multiple convolution layers extract increasingly complex patterns.  

3. Pooling Layers (Downsizing)  

- Reduce the size of feature maps (e.g., Max Pooling).  
- They keep important information while lowering computation.  
- This helps the model focus on the main features.  
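Max pooling (and the flattening step that follows) can be sketched like this — the 4 × 4 feature map is invented for illustration:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in each
    2x2 block, halving the height and width of the feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]  # drop odd edge rows/cols
    th, tw = trimmed.shape
    blocks = trimmed.reshape(th // 2, 2, tw // 2, 2)
    return blocks.max(axis=(1, 3))                 # max within each 2x2 block

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 2, 4, 6]])

pooled = max_pool_2x2(fm)        # [[4 5]
                                 #  [3 7]]
flattened = pooled.reshape(-1)   # flattening: 1D vector for the dense layers
```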

4. Flattening  

- This converts the final feature maps into a single 1D array (vector).  

5. Fully Connected Layers (Classification)  

- This acts like a traditional neural network.  
- It takes the extracted features and predicts the final class.  
Output layer:  
- Uses softmax to give class probabilities, e.g.  
[0.2, 0.5, 0.3] → highest = 0.5 → class = banana  
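The softmax step can be sketched as follows (the [0.2, 0.5, 0.3] probabilities are the example from above; the raw logits passed to `softmax` are invented):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1.
    Subtracting the max keeps the exponentials numerically stable."""
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

classes = ["apple", "banana", "orange"]     # label order from the notes
probs = np.array([0.2, 0.5, 0.3])           # example output from the text
predicted = classes[int(np.argmax(probs))]  # highest probability wins
# predicted == "banana"

# softmax turns any vector of raw scores into such a probability vector:
demo = softmax(np.array([0.5, 1.4, 0.9]))   # sums to 1.0
```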

6. Training Process  

- Predicted probabilities are compared with actual labels.  
- The loss is computed (the difference between the predicted and true labels).  
Weights in:  
- convolution filters  
- fully connected layers  
are adjusted using backpropagation.  
- This process repeats for many epochs until the model learns the best weights.  

7. After Training  

- Model weights are saved.  
- The model can now classify new, unseen images.  

* Key Notes  

- CNNs automatically learn which features matter (no manual feature engineering).  
- Early layers learn simple features (edges); deeper layers learn complex ones (object shapes).  
- CNN models consist of many layers not shown in simple diagrams:  
- Multiple convolutions  
- Pooling  
- Activation functions (ReLU)  
- Normalization layers, and more.

Vision Transformers and multimodal models 

1. CNNs in Computer Vision  

- CNNs have long been the basis of computer vision.  
They are effective for tasks like:  
- Image classification  
- Object detection (CNN features and region proposals)  
- Many improvements in vision came from enhancing CNN-based structures.  

2. Transformers in NLP  

- Transformers changed NLP.  
- They process large amounts of text using attention.  
- Transformers convert words and phrases (tokens) into embeddings, which are high-dimensional numeric vectors.  
- Attention captures contextual meaning:  
- Tokens used in similar contexts lead to vectors pointing in similar directions.  
- Transformers are used for tasks like translation, summarization, and text generation.  

3. Vision Transformers (ViT)  

- ViTs draw inspiration from the success of NLP transformers.  
They work by:  
- Splitting an image into patches  
- Flattening each patch into a vector  
- Applying attention to learn relationships between patches  
ViTs encode visual features such as:  
- Color  
- Shape  
- Texture  
- Contrast  
- They create a visual embedding space where patches with similar features have similar vectors.  
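The patch-splitting step can be sketched as below (patch size 16 and image size 224 are typical ViT choices, used here only for illustration; a real ViT would then project each vector into an embedding and apply attention):

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an image into non-overlapping square patches and flatten
    each patch into a vector -- the input step of a vision transformer."""
    h, w, c = image.shape
    patches = []
    for row in range(0, h, patch_size):
        for col in range(0, w, patch_size):
            patch = image[row:row + patch_size, col:col + patch_size]
            patches.append(patch.reshape(-1))  # flatten patch to a 1D vector
    return np.stack(patches)

# A 224x224 RGB image split into 16x16 patches gives 14*14 = 196 patches,
# each flattened to a 16*16*3 = 768-dimensional vector.
img = np.zeros((224, 224, 3), dtype=np.uint8)
patches = image_to_patches(img, 16)
# patches.shape == (196, 768)
```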

4. Visual Embedding Space  

- The model does not recognize specific objects, such as hats or heads.  
- It only understands relationships between visual features learned during training.  
For example:  
- “Hat” features often appear near “head” features, so their vectors become related.  

5. Multimodal Models  

These models combine:  
- Language embeddings (from text transformers)  
- Vision embeddings (from ViTs)  
- They use cross-modal attention to align the two.  
- This creates a shared vector space for images and text.  
It enables the model to:  
- Describe unseen images  
- Link text with image features  
- Understand relationships across different types of data  
- Develop rich semantic understanding  

6. Example Scenario  

- Image: A person in a park wearing a hat and a backpack.  
- The vision encoder extracts visual features.  
- The language encoder matches those visual features to words.  
- The combined model generates the description: “A person in a park with a hat and a backpack.” 

Module : 11

Azure Foundry Tools for Computer Vision

* Azure AI offers strong cloud-based tools for creating computer vision applications. These tools include ready-made models and options for training your own custom models.

* What Is Azure Vision?
- Azure Vision is a group of AI services that analyze images and videos. It features prebuilt deep learning models that help you understand visual content without needing to build a model from the ground up.
You can:
- Detect objects
- Read text from images
- Recognize faces
- Generate captions
- Tag images
- And much more

* Main Components of Azure Vision

1. Azure Vision Image Analysis

This service analyzes the content of images. It can:
- Detect common objects
- Generate image captions
- Tag visual features
- Perform OCR (extract text from images)
Useful for: SEO, content moderation, document digitization, accessibility apps, and more.

2. Azure AI Face Service

This is a specialized service for detecting and analyzing human faces. It can:
- Detect faces
- Recognize identity
- Analyze attributes such as age and emotion
- Match faces across images
Useful for: Security, attendance systems, device unlocking, social media tagging, searching for missing persons, and identity checks at airports.

* Real-World Applications of Azure Vision

1) Search Optimization

- Image tagging and captions help improve search rankings.

2) Content Moderation

- It detects sensitive or unsafe content before publishing.

3) Security

- Facial recognition supports access control and surveillance.

4) Social Media

- It allows auto-tagging of friends in photos.

5) Missing Persons

- CCTV and face recognition can identify missing individuals.

6) Identity Verification

- It can verify a person at ports, airports, kiosks, and more.

7) Museum and Archive Management

- OCR helps digitize old documents and photos.

* Azure Video Indexer

Modern systems combine several capabilities. Azure AI Video Indexer uses tools like:
- Image Analysis
- Face Recognition
- Speech and Translation
- This allows for deep analysis of videos, including objects, people, speech, and text.

* Summary

- Azure Vision is a set of tools for understanding images and videos.
- Image Analysis focuses on objects, text, tags, and captions.
- Face Service focuses on detecting, analyzing, and recognizing faces.
- This is useful for SEO, security, social media, identity checks, and moderation.
- Video Indexer combines multiple tools for video analysis.


Azure Vision Image Analysis

* Azure Vision Image Analysis offers strong prebuilt AI features to understand images. You can use these as they are or adjust them with custom models.

* Core Image Analysis Capabilities (No Customization Needed)  

1. Image Captioning

Azure Vision can:
- Understand objects and context in an image.  
- Generate a natural, human-like caption.  
Example  
- Image: A person skateboarding  
- Caption: “A person jumping on a skateboard”

2. Object Detection

Azure Vision can:
- Detect thousands of common objects.  
- Provide confidence scores.  
- Return bounding box coordinates (top, left, width, height).  
Example for the skateboarder image:  
- Person (95.5%)  
- Skateboard (90.4%)  
- Bounding boxes show where the objects are located in the image.

3. Tagging Visual Features

- Azure Vision automatically generates tags that describe key features of an image. Tags function like metadata and are great for search indexing.  
Example tags returned for a skateboarding image:  
sport  
person  
skating  
stunt  
extreme sport  
skateboarder  
outdoor  
jumping  
…and many more, each with a confidence score.

4. Optical Character Recognition (OCR)

- Azure Vision can extract text from printed images.  
Example:  
- Nutrition label ➝ detected text:  
- Nutrition Facts  
- Serving size: 1 bar (40g)  
- Total Fat 13g  
- Calories 190  
- Sodium 20mg  
…and more.  
OCR is useful for:  
- Digitizing documents  
- Extracting data from labels  
- Business automation  
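As a rough sketch of calling these capabilities from Python via REST (the endpoint, key, and image URL are placeholders; the api-version and feature names follow the Image Analysis 4.0 REST API at the time of writing, so check the current documentation before relying on them):

```python
import json
import urllib.request

# Placeholder values -- substitute your own resource's endpoint and key.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"

def build_analyze_request(image_url, features=("caption", "read")):
    """Build the URL, headers, and JSON body for an Image Analysis call
    that requests a caption and OCR ("read") for an image URL."""
    url = (f"{ENDPOINT}/computervision/imageanalysis:analyze"
           f"?api-version=2023-10-01&features={','.join(features)}")
    headers = {
        "Ocp-Apim-Subscription-Key": KEY,
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": image_url})
    return url, headers, body

url, headers, body = build_analyze_request(
    "https://example.com/skateboarder.jpg")

# To actually send the request (requires a real endpoint and key):
# req = urllib.request.Request(url, data=body.encode(), headers=headers)
# with urllib.request.urlopen(req) as resp:
#     result = json.loads(resp.read())
```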

* Custom Model Training (When You Need More Control)

- If built-in models do not suit your needs, Azure Vision allows you to train custom models using your own images. These models build on top of Azure’s pre-trained foundation models, so they require fewer images.

1. Custom Image Classification

- Predicts the category of an entire image.  
Example model:  
- Apple  
- Banana  
- Orange  
- Feed an image and the model returns the fruit type.

2. Custom Object Detection

- Detects multiple custom objects and returns bounding boxes.  
Example:  
- You can train a model to detect individual fruits like:  
- Apple  
- Banana  
- Orange  
…inside one image at the same time.  
Great for:  
- Retail  
- Manufacturing  
- Inventory management  
- Agriculture  

* Quick Summary

- Captioning: Generates natural descriptions.  
- Object Detection: Identifies objects and bounding boxes.  
- Tagging: Adds metadata tags for search.  
- OCR: Extracts text from images.  
- Custom Models: Train your own classifiers and detectors.


Azure Vision, Face Service Capabilities

* Azure AI Face is a specific service within Azure Vision. It concentrates on identity-related tasks like user verification, liveness detection, access control, and face recognition.

1. Facial Detection
- Face detection finds where human faces are in an image.
It provides:
- Bounding box coordinates around each face
- Facial landmarks such as:
  - Eyes
  - Nose
  - Eyebrows
  - Lips
  - Face contour
- These landmarks assist in further analysis, like recognition or emotion detection.

2. Facial Recognition

- Facial recognition identifies who a person is.
- A machine learning model trains using multiple images of the same person.
- Once trained, the model can recognize that person in new images.
It is useful for:
- Security systems
- Access control
- Attendance
- Personalized customer experiences
- When used responsibly, facial recognition boosts efficiency and security.

3. Azure AI Face – Supported Attributes

When Azure AI Face detects a face, it can return several attributes:

1) Accessories : Detects headwear, glasses, masks, each with a confidence score.

2) Blur : Shows how blurred the face is.

3) Exposure : Indicates if the face is underexposed or overexposed.

4) Glasses : Identifies if the person is wearing regular glasses or sunglasses.

5) Head Pose : Reports the angle and direction of the face in 3D space.

6) Mask : Indicates whether the person is wearing a mask.

7) Noise : Measures graininess or visual noise in the face area.

8) Occlusion : Checks if parts of the face are blocked by objects, like hands or hair.

9) Quality for Recognition : 
Rates the face image as:
- High
- Medium
- Low
- This helps determine if the image is suitable for recognition tasks.

4. Responsible AI & Limited Access Policy

1) Azure Vision and Face services follow Microsoft’s Responsible AI Standard.
Anyone can use the Face API for:
- Detecting face locations
- Getting attributes (glasses, noise, blur, exposure, occlusion, etc.)
- Getting head pose information

2) Limited Access Features (require an approval form):

- Face verification – Comparing two faces for similarity
- Face identification – Identifying named individuals
- Liveness detection – Detecting if input video is real or a spoof (like a photo, video replay, or mask)
- These require an intake form to ensure ethical and responsible use.

Quick Summary

- Face Detection: Finds faces and returns landmarks.
- Face Recognition: Identifies people using trained images.
- Attributes: Blur, head pose, mask, noise, exposure, occlusion, accessories.
- Responsible AI: Advanced features require approval.
- Use Cases: Identity verification, access control, redaction, fraud prevention.


 
