AI-900: Modules 10 & 11
Module: 10
Introduction to Computer Vision Concepts
Computer Vision Tasks & Techniques
* What is Computer Vision?
Computer vision uses AI techniques to process and understand visual input such as:
- Images
- Videos
- Live camera streams
- It enables machines to "see" and interpret visual information.
1. Image Classification
- This predicts a single label for an entire image.
- The model is trained using many labeled images.
- For example, a smart grocery checkout can identify an apple, orange, or banana from a single item placed on a scale.
- Use case: Recognizing the main subject in an image.
2. Object Detection
- This identifies multiple objects in a single image.
It returns:
- Object labels
- Bounding box coordinates
- For example, a checkout system can detect all fruits placed together.
- Use case: Finding items and their positions.
3. Semantic Segmentation
- This involves pixel-level classification.
- It produces detailed masks that show exact object shapes.
- This method is more precise than object detection.
- Use case: Medical imaging, self-driving cars, and defining precise object boundaries.
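Conceptually, a segmentation result is just a per-pixel label map. A minimal NumPy sketch (illustrative, not from the course material) of a mask whose labels trace an object's exact shape:

```python
import numpy as np

# Hypothetical 4x4 image segmented into two classes:
# 0 = background, 1 = object. Every pixel gets its own label,
# so the mask follows the object's exact outline.
mask = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])

object_pixels = int((mask == 1).sum())
print(object_pixels)  # 4 pixels belong to the object
```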
4. Contextual Image Analysis (Multimodal Models)
- This technique understands the relationships between objects and text.
- It can describe activities, generate captions, and suggest tags.
- For example, it recognizes "A person eating an apple."
- Use case: Caption generation, content moderation, and visual search.
Images and Image Processing
1. How a Computer Sees an Image
- An image is a numeric array of pixel values.
- Example: a 7 × 7 array represents an image that is 7 pixels wide and 7 pixels tall (its resolution).
- Pixel values in grayscale are:
0 = black
255 = white
- Values in between represent shades of gray.
- A grayscale image uses 1 channel: a single 2D array of rows and columns.
2. Color Image (RGB)
- It has 3 channels:
Red
Green
Blue
- Each channel is a 2D array.
- The final pixel color is a combination of the 3 numbers.
Examples:
- Purple pixel: R=150, G=0, B=255
- Yellow pixel: R=255, G=255, B=0
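The channel layout described above can be sketched with NumPy arrays (an illustrative example; the array values are made up apart from the purple and yellow pixels quoted above):

```python
import numpy as np

# A grayscale image is one 2D array (one channel):
# 0 = black, 255 = white, values in between are grays.
gray = np.array([[0, 128, 255],
                 [64, 192, 32],
                 [255, 0, 100]], dtype=np.uint8)

# A color image stacks three such 2D arrays: R, G, B.
purple = np.array([150, 0, 255], dtype=np.uint8)  # R, G, B
yellow = np.array([255, 255, 0], dtype=np.uint8)

rgb = np.zeros((2, 2, 3), dtype=np.uint8)  # 2x2 image, 3 channels
rgb[0, 0] = purple
rgb[1, 1] = yellow

print(gray.shape)  # (3, 3) -> one channel
print(rgb.shape)   # (2, 2, 3) -> height, width, 3 channels
```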
3. Filter Kernel
- Filters change pixel values to create visual effects.
- A small matrix, like a 3 × 3, is used to process images.
- Example kernel for edge detection (Laplace filter):
-1 -1 -1
-1 8 -1
-1 -1 -1
Example:
- Start with a grayscale image and apply the filter kernel to its top-left patch: multiply each pixel value by the corresponding weight in the kernel and add the results.
- The filter kernel moves across the image.
- For each 3×3 patch, you multiply pixel values by kernel values.
- Then, add the results to get a weighted sum.
- Place the result in a new output image.
- Repeat this for every possible patch.
- If values are below 0 or above 255, they must be clipped to the range of 0–255.
- Borders use padding, usually with 0s.
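The steps above can be sketched directly in NumPy (an illustrative implementation using the Laplace kernel from the notes, with zero padding and clipping; the 5 × 5 test image is made up):

```python
import numpy as np

def convolve(image, kernel):
    """Slide a 3x3 kernel over the image: for each patch, multiply
    pixel values by kernel weights, sum, and clip to 0-255."""
    padded = np.pad(image, 1, mode="constant")  # zero padding at the borders
    out = np.zeros_like(image, dtype=np.int32)
    h, w = image.shape
    for r in range(h):
        for c in range(w):
            patch = padded[r:r + 3, c:c + 3]
            out[r, c] = np.sum(patch * kernel)
    return np.clip(out, 0, 255).astype(np.uint8)

# Laplace edge-detection kernel from the notes
laplace = np.array([[-1, -1, -1],
                    [-1,  8, -1],
                    [-1, -1, -1]])

# A made-up 5x5 grayscale image with a bright square in the middle
image = np.zeros((5, 5), dtype=np.int32)
image[1:4, 1:4] = 200

edges = convolve(image, laplace)
print(edges)  # non-zero values trace the square's edges
```

Note that inside the uniform bright region the weighted sum cancels to 0, while at the square's border it does not, which is exactly why this kernel highlights edges.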
4. Result of Filtering
Examples:
- Laplace filter highlights edges.
Other filters can:
Blur
Sharpen
Invert colors
Detect edges
Smooth noise
* Key Concept: Convolution
- Since the filter moves across the image and performs weighted sums, this is called convolutional filtering.
- This idea is also the basis of Convolutional Neural Networks (CNNs).
Convolutional Neural Networks (CNNs)
* Purpose of CNNs
- CNNs are deep learning models used in computer vision.
- Goal: extract meaning or insights from images (e.g., classification, detection).
- CNNs use filters to automatically learn visual patterns from data.
1. Input
- Images with known labels (e.g., apple = 0, banana = 1, orange = 2).
2. Convolution Layers (Feature Extraction)
- Use filter kernels to scan the image.
- Kernels start with random weights.
- As training progresses, weights adjust to detect meaningful features.
- The output of each filter is a feature map.
Feature maps:
- Highlight edges, textures, shapes, colors, and more.
- Multiple convolution layers extract increasingly complex patterns.
3. Pooling Layers (Downsizing)
- Reduce the size of feature maps (e.g., Max Pooling).
- They keep important information while lowering computation.
- This helps the model focus on the main features.
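Max pooling from the step above, sketched in NumPy (the 4 × 4 feature map is a made-up example):

```python
import numpy as np

# 2x2 max pooling: each 2x2 block is replaced by its largest value,
# halving the feature map while keeping the strongest activations.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 6, 1, 1],
])

h, w = feature_map.shape
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 5]
#  [6 3]]
```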
4. Flattening
- This converts the final feature maps into a single 1D array (vector).
5. Fully Connected Layers (Classification)
- This acts like a traditional neural network.
- It takes the extracted features and predicts the final class.
Output layer:
- Uses softmax to give class probabilities, e.g.
[0.2, 0.5, 0.3] → highest = 0.5 → class = banana
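A minimal softmax sketch (illustrative; the raw scores are chosen so the rounded probabilities match the [0.2, 0.5, 0.3] example above):

```python
import numpy as np

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exps / exps.sum()

# Hypothetical raw scores for the classes apple=0, banana=1, orange=2
labels = ["apple", "banana", "orange"]
probs = softmax(np.array([1.0, 1.9, 1.4]))

predicted = labels[int(np.argmax(probs))]
print(probs.round(2), "->", predicted)  # highest probability wins
```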
6. Training Process
- Predicted probabilities are compared with actual labels.
- The loss is computed (the difference between the predicted and true labels).
- Weights in the convolution filters and the fully connected layers are adjusted using backpropagation.
- This process repeats for many epochs until the model learns the best weights.
7. After Training
- Model weights are saved.
- The model can now classify new, unseen images.
* Key Notes
- CNNs automatically learn which features matter (no manual feature engineering).
- Early layers learn simple features (edges); deeper layers learn complex ones (object shapes).
- CNN models consist of many layers not shown in simple diagrams:
- Multiple convolutions
- Pooling
- Activation functions (ReLU)
- Normalization layers, and more.
Vision Transformers and multimodal models
1. CNNs in Computer Vision
- CNNs have long been the basis of computer vision.
They are effective for tasks like:
- Image classification
- Object detection (CNN features and region proposals)
- Many improvements in vision came from enhancing CNN-based structures.
2. Transformers in NLP
- Transformers changed NLP.
- They process large amounts of text using attention.
- Transformers convert words and phrases (tokens) into embeddings, which are high-dimensional numeric vectors.
- Attention captures contextual meaning:
- Tokens used in similar contexts lead to vectors pointing in similar directions.
- Transformers are used for tasks like translation, summarization, and text generation.
3. Vision Transformers (ViT)
- ViTs draw inspiration from the success of NLP transformers.
They work by:
- Splitting an image into patches
- Flattening each patch into a vector
- Applying attention to learn relationships between patches
ViTs encode visual features such as:
- Color
- Shape
- Texture
- Contrast
- They create a visual embedding space where patches with similar features have similar vectors.
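Patch splitting and flattening can be sketched in NumPy (an illustrative 8 × 8 "image" with 4 × 4 patches; real ViTs use larger images and apply learned projections and attention after flattening):

```python
import numpy as np

# Hypothetical 8x8 grayscale image split into 4x4 patches, each
# flattened into a 16-element vector, as a ViT does before applying
# attention across the patch sequence.
image = np.arange(64).reshape(8, 8)
patch = 4

patches = (image.reshape(8 // patch, patch, 8 // patch, patch)
                .transpose(0, 2, 1, 3)       # group rows/cols by patch
                .reshape(-1, patch * patch)) # one flat vector per patch

print(patches.shape)  # (4, 16): 4 patches, each a 16-dim vector
```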
4. Visual Embedding Space
- The model does not recognize specific objects, such as hats or heads.
- It only understands relationships between visual features learned during training.
For example:
- “Hat” features often appear near “head” features, so their vectors become related.
5. Multimodal Models
These models combine:
- Language embeddings (from text transformers)
- Vision embeddings (from ViTs)
- They use cross-modal attention to align the two.
- This creates a shared vector space for images and text.
It enables the model to:
- Describe unseen images
- Link text with image features
- Understand relationships across different types of data
- Develop rich semantic understanding
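A common way to compare vectors in a shared embedding space is cosine similarity. A minimal sketch with made-up 4-dimensional embeddings (real models use hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of direction between two vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings: in a shared space, a caption and the image it
# describes should point in similar directions.
image_embedding = np.array([0.9, 0.1, 0.8, 0.2])
caption_match = np.array([0.8, 0.2, 0.9, 0.1])  # "a person with a hat"
caption_other = np.array([0.1, 0.9, 0.1, 0.8])  # "a bowl of soup"

sim_match = cosine_similarity(image_embedding, caption_match)
sim_other = cosine_similarity(image_embedding, caption_other)
print(round(sim_match, 2), round(sim_other, 2))  # matching caption scores higher
```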
6. Example Scenario
- Image: A person in a park wearing a hat and a backpack.
- The vision encoder extracts visual features.
- The language encoder matches those visual features to words.
- The combined model generates the description: “A person in a park with a hat and a backpack.”
Module: 11
Azure AI Foundry Tools for Computer Vision
* Azure AI offers strong cloud-based tools for creating computer vision applications. These tools include ready-made models and options for training your own custom models.
* What Is Azure Vision?
- Azure Vision is a group of AI services that analyze images and videos. It features prebuilt deep learning models that help you understand visual content without needing to build a model from the ground up.
You can:
- Detect objects
- Read text from images
- Recognize faces
- Generate captions
- Tag images
- And much more
* Main Components of Azure Vision
1. Azure Vision Image Analysis
This service analyzes the content of images. It can:
- Detect common objects
- Generate image captions
- Tag visual features
- Perform OCR (extract text from images)
Useful for: SEO, content moderation, document digitization, accessibility apps, and more.
2. Azure AI Face Service
This is a specialized service for detecting and analyzing human faces. It can:
- Detect faces
- Recognize identity
- Analyze attributes such as age and emotion
- Match faces across images
- Useful for: Security, attendance systems, device unlocking, social media tagging, searching for missing persons, and identity checks at airports.
* Real-World Applications of Azure Vision
1) Search Optimization
- Image tagging and captions help improve search rankings.
2) Content Moderation
- It detects sensitive or unsafe content before publishing.
3) Security
- Facial recognition supports access control and surveillance.
4) Social Media
- It allows auto-tagging of friends in photos.
5) Missing Persons
- CCTV and face recognition can identify missing individuals.
6) Identity Verification
- It can verify a person at ports, airports, kiosks, and more.
7) Museum and Archive Management
- OCR helps digitize old documents and photos.
* Azure Video Indexer
Modern systems combine several capabilities. Azure AI Video Indexer uses tools like:
- Image Analysis
- Face Recognition
- Speech and Translation
- This allows for deep analysis of videos, including objects, people, speech, and text.
* Summary
- Azure Vision is a set of tools for understanding images and videos.
- Image Analysis focuses on objects, text, tags, and captions.
- Face Service focuses on detecting, analyzing, and recognizing faces.
- This is useful for SEO, security, social media, identity checks, and moderation.
- Video Indexer combines multiple tools for video analysis.
Azure Vision Image Analysis
* Azure Vision Image Analysis offers strong prebuilt AI features to understand images. You can use these as they are or adjust them with custom models.
* Core Image Analysis Capabilities (No Customization Needed)
1. Image Captioning
Azure Vision can:
- Understand objects and context in an image.
- Generate a natural, human-like caption.
Example
- Image: A person skateboarding
- Caption: “A person jumping on a skateboard”
2. Object Detection
Azure Vision can:
- Detect thousands of common objects.
- Provide confidence scores.
- Return bounding box coordinates (top, left, width, height).
Example for the skateboarder image:
- Person (95.5%)
- Skateboard (90.4%)
- Bounding boxes show where the objects are located in the image.
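The kind of result described above can be pictured as JSON. A sketch that parses a simplified, hypothetical response (the field names and values are illustrative, not the exact Azure API payload):

```python
import json

# Simplified, made-up object detection response: labels, confidence
# scores, and bounding boxes (left, top, width, height in pixels).
response = json.loads("""
{
  "objects": [
    {"name": "person", "confidence": 0.955,
     "boundingBox": {"left": 120, "top": 30, "width": 180, "height": 320}},
    {"name": "skateboard", "confidence": 0.904,
     "boundingBox": {"left": 150, "top": 310, "width": 140, "height": 60}}
  ]
}
""")

for obj in response["objects"]:
    box = obj["boundingBox"]
    print(f"{obj['name']} ({obj['confidence']:.1%}) at "
          f"({box['left']}, {box['top']}), {box['width']}x{box['height']}")
```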
3. Tagging Visual Features
- Azure Vision automatically generates tags that describe key features of an image. Tags function like metadata and are great for search indexing.
Example tags returned for a skateboarding image:
sport
person
skating
stunt
extreme sport
skateboarder
outdoor
jumping
…and many more, each with a confidence score.
4. Optical Character Recognition (OCR)
- Azure Vision can extract printed text from images.
Example:
- Nutrition label ➝ detected text:
- Nutrition Facts
- Serving size: 1 bar (40g)
- Total Fat 13g
- Calories 190
- Sodium 20mg
…and more.
OCR is useful for:
- Digitizing documents
- Extracting data from labels
- Business automation
* Custom Model Training (When You Need More Control)
- If built-in models do not suit your needs, Azure Vision allows you to train custom models using your own images. These models build on top of Azure’s pre-trained foundation models, so they require fewer images.
1. Custom Image Classification
- Predicts the category of an entire image.
Example model:
- Apple
- Banana
- Orange
- Feed an image and the model returns the fruit type.
2. Custom Object Detection
- Detects multiple custom objects and returns bounding boxes.
Example:
- You can train a model to detect individual fruits like:
- Apple
- Banana
- Orange
…inside one image at the same time.
Great for:
- Retail
- Manufacturing
- Inventory management
- Agriculture
* Quick Summary
- Captioning: Generates natural descriptions.
- Object Detection: Identifies objects and bounding boxes.
- Tagging: Adds metadata tags for search.
- OCR: Extracts text from images.
- Custom Models: Train your own classifiers and detectors.
Azure Vision: Face Service Capabilities
* Azure AI Face is a specific service within Azure Vision. It concentrates on identity-related tasks like user verification, liveness detection, access control, and face recognition.
1. Facial Detection
- Face detection finds where human faces are in an image.
It provides:
- Bounding box coordinates around each face
- Facial landmarks such as:
- Eyes
- Nose
- Eyebrows
- Lips
- Face contour
- These landmarks assist in further analysis, like recognition or emotion detection.
2. Facial Recognition
- Facial recognition identifies who a person is.
- A machine learning model trains using multiple images of the same person.
- Once trained, the model can recognize that person in new images.
It is useful for:
- Security systems
- Access control
- Attendance
- Personalized customer experiences
- When used responsibly, facial recognition boosts efficiency and security.
3. Azure AI Face – Supported Attributes
When Azure AI Face detects a face, it can return several attributes:
1) Accessories : Detects headwear, glasses, masks, each with a confidence score.
2) Blur : Shows how blurred the face is.
3) Exposure : Indicates if the face is underexposed or overexposed.
4) Glasses : Identifies if the person is wearing regular glasses or sunglasses.
5) Head Pose : Reports the angle and direction of the face in 3D space.
6) Mask : Indicates whether the person is wearing a mask.
7) Noise : Measures graininess or visual noise in the face area.
8) Occlusion : Checks if parts of the face are blocked by objects, like hands or hair.
9) Quality for Recognition :
Rates the face image as:
- High
- Medium
- Low
- This helps determine if the image is suitable for recognition tasks.
4. Responsible AI & Limited Access Policy
1) Azure Vision and Face services follow Microsoft’s Responsible AI Standard.
Anyone can use the Face API for:
- Detecting face locations
- Getting attributes (glasses, noise, blur, exposure, occlusion, etc.)
- Getting head pose information
2) Limited Access Features (require an approval form):
- Face verification – Comparing two faces for similarity
- Face identification – Identifying named individuals
- Liveness detection – Detecting if input video is real or a spoof (like a photo, video replay, or mask)
- These require an intake form to ensure ethical and responsible use.
Quick Summary
- Face Detection: Finds faces and returns landmarks.
- Face Recognition: Identifies people using trained images.
- Attributes: Blur, head pose, mask, noise, exposure, occlusion, accessories.
- Responsible AI: Advanced features require approval.
- Use Cases: Identity verification, access control, redaction, fraud prevention.