AI-900 10,11

Module : 10 

Introduction to Computer Vision Concepts 

Computer Vision Tasks & Techniques  

* What is Computer Vision?

Computer vision uses AI techniques to process and understand visual input such as:
- Images  
- Videos  
- Live camera streams  

It enables machines to "see" and interpret visual information.  

1. Image Classification  


- This predicts a single label for an entire image.  
- The model is trained using many labeled images.  
- For example, a smart grocery checkout can identify an apple, orange, or banana from a single item placed on a scale.  
- Use case: Recognizing the main subject in an image.  

2. Object Detection 


- This identifies multiple objects in a single image.  
It returns:  
- Object labels  
- Bounding box coordinates  
- For example, a checkout system can detect all fruits placed together.  
- Use case: Finding items and their positions.  

3. Semantic Segmentation


- This involves pixel-level classification.  
- It produces detailed masks that show exact object shapes.  
- This method is more precise than object detection.  
- Use case: Medical imaging, self-driving cars, and defining precise object boundaries.  

4. Contextual Image Analysis (Multimodal Models)  

- This technique understands the relationships between objects and text.  
- It can describe activities, generate captions, and suggest tags.  
- For example, it recognizes "A person eating an apple."  
- Use case: Caption generation, content moderation, and visual search.  


Images and Image Processing  

1. How a Computer Sees an Image  

- An image is a numeric array of pixel values.  
- Example: a 7 × 7 array represents an image that is 7 pixels wide and 7 pixels tall.  
- Pixel values in grayscale are:  
0 = black  
255 = white  
- Values in between represent shades of gray.  


* Grayscale Image  

- It is represented using 1 channel, which is a 2D array of rows and columns.  


* Color Image (RGB)  

- It has 3 channels:  
Red  
Green  
Blue  
- Each channel is a 2D array.  
- The final pixel color is a combination of the 3 numbers.  
Examples:  
- Purple pixel: R=150, G=0, B=255  
- Yellow pixel: R=255, G=255, B=0  
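As a quick sketch in Python using NumPy (the tiny 2 × 2 image here is invented purely for illustration):

```python
import numpy as np

# A 2x2 color image: 3 channels (R, G, B) in a (height, width, channels) array.
image = np.zeros((2, 2, 3), dtype=np.uint8)

# Top-left pixel set to purple (R=150, G=0, B=255), as in the example above
image[0, 0] = [150, 0, 255]

# Top-right pixel set to yellow (R=255, G=255, B=0)
image[0, 1] = [255, 255, 0]

# A grayscale image needs only 1 channel: a 2D array of rows and columns,
# where 0 = black and 255 = white.
gray = np.array([[0, 255],
                 [128, 64]], dtype=np.uint8)

print(image[0, 0])  # [150   0 255]
print(gray.shape)   # (2, 2)
```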




2. What Are Filters?  

- Filters change pixel values to create visual effects.  
- A filter is defined by a small matrix called a filter kernel, such as a 3 × 3 matrix, which is applied across the image.  
- Example kernel for edge detection (Laplace filter):  

-1  -1  -1  
-1   8  -1  
-1  -1  -1  

Example:  
Let's start with the grayscale image we explored previously.  

First, we apply the filter kernel to the top-left patch of the image, multiplying each pixel value by the corresponding weight value in the kernel and adding the results.  

The result (-255) becomes the first value in a new array. Then we move the filter kernel one pixel to the right and repeat the operation.  

Again, the result is added to the new array, which now contains two values:  

-255  -510

3. Convolution (How Filters Work)  

- The filter kernel moves across the image.  
- For each 3×3 patch, you multiply pixel values by kernel values.  
- Then, add the results to get a weighted sum.  
- Place the result in a new output image.  
- Repeat this for every possible patch.  
- If values are below 0 or above 255, they must be clipped to the range of 0–255.  
- Borders use padding, usually with 0s.  
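The steps above can be sketched in Python with NumPy (a minimal illustration: the 5 × 5 test image is invented, and the loop-based implementation favors clarity over speed):

```python
import numpy as np

# Laplace edge-detection kernel from the text
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

def convolve(image, kernel):
    """Slide the 3x3 kernel over every patch, take the weighted sum,
    clip the result to 0-255, and pad the borders with zeros."""
    padded = np.pad(image.astype(int), 1, mode="constant")  # zero padding
    out = np.zeros(image.shape, dtype=int)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patch = padded[r:r + 3, c:c + 3]
            out[r, c] = np.sum(patch * kernel)      # weighted sum
    return np.clip(out, 0, 255).astype(np.uint8)    # clip out-of-range values

# A small grayscale image with a bright square in the middle
img = np.zeros((5, 5), dtype=np.uint8)
img[1:4, 1:4] = 255

edges = convolve(img, kernel)
```

Inside the uniform bright square the weighted sum cancels to 0, while pixels on its border produce large values that get clipped to 255 — which is how the Laplace filter highlights edges.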

4. Result of Filtering  

- Filtering produces a new image that is a transformed version of the original.  
Examples:  
- Laplace filter highlights edges.  
- Other filters can:  
  - Blur  
  - Sharpen  
  - Invert colors  
  - Detect edges  
  - Smooth noise  

* Key Concept: Convolution  

- Since the filter moves across the image and performs weighted sums, this is called convolutional filtering.  
- This idea is also the basis of Convolutional Neural Networks (CNNs).

Convolutional Neural Networks (CNNs)  

* Purpose of CNNs  

- CNNs are deep learning models used in computer vision.  
- Goal: extract meaning or insights from images (e.g., classification, detection).  
- CNNs use filters to automatically learn visual patterns from data.  


* How CNNs Work (Simplified Overview)  

1. Input  

- Images with known labels (e.g., apple = 0, banana = 1, orange = 2).  

2. Convolution Layers (Feature Extraction)  

- Use filter kernels to scan the image.  
- Kernels start with random weights.  
- As training progresses, weights adjust to detect meaningful features.  
- The output of each filter is a feature map.  
Feature maps:  
- Highlight edges, textures, shapes, colors, and more.  
- Multiple convolution layers extract increasingly complex patterns.  

3. Pooling Layers (Downsizing)  

- Reduce the size of feature maps (e.g., Max Pooling).  
- They keep important information while lowering computation.  
- This helps the model focus on the main features.  
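Max pooling (and the flattening step that follows) can be sketched like this — the 4 × 4 feature map is invented for illustration:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in each
    2x2 block, halving the height and width of the feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]  # drop odd edge rows/cols
    th, tw = trimmed.shape
    blocks = trimmed.reshape(th // 2, 2, tw // 2, 2)
    return blocks.max(axis=(1, 3))                 # max within each 2x2 block

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 2, 4, 6]])

pooled = max_pool_2x2(fm)        # [[4 5]
                                 #  [3 7]]
flattened = pooled.reshape(-1)   # flattening: 1D vector for the dense layers
```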

4. Flattening  

- This converts the final feature maps into a single 1D array (vector).  

5. Fully Connected Layers (Classification)  

- This acts like a traditional neural network.  
- It takes the extracted features and predicts the final class.  
Output layer:  
- Uses softmax to give class probabilities, e.g.  
[0.2, 0.5, 0.3] → highest = 0.5 → class = banana  
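The softmax step can be sketched as follows (the [0.2, 0.5, 0.3] probabilities are the example from above; the raw logits passed to `softmax` are invented):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1.
    Subtracting the max keeps the exponentials numerically stable."""
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

classes = ["apple", "banana", "orange"]     # label order from the notes
probs = np.array([0.2, 0.5, 0.3])           # example output from the text
predicted = classes[int(np.argmax(probs))]  # highest probability wins
# predicted == "banana"

# softmax turns any vector of raw scores into such a probability vector:
demo = softmax(np.array([0.5, 1.4, 0.9]))   # sums to 1.0
```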

6. Training Process  

- Predicted probabilities are compared with actual labels.  
- The loss is computed (the difference between the predicted and true labels).  
Weights in:  
- convolution filters  
- fully connected layers  
are adjusted using backpropagation.  
- This process repeats for many epochs until the model learns the best weights.  

7. After Training  

- Model weights are saved.  
- The model can now classify new, unseen images.  

* Key Notes  

- CNNs automatically learn which features matter (no manual feature engineering).  
- Early layers learn simple features (edges); deeper layers learn complex ones (object shapes).  
- CNN models consist of many layers not shown in simple diagrams:  
- Multiple convolutions  
- Pooling  
- Activation functions (ReLU)  
- Normalization layers, and more.

Vision Transformers and multimodal models 

1. CNNs in Computer Vision  

- CNNs have long been the basis of computer vision.  
They are effective for tasks like:  
- Image classification  
- Object detection (CNN features and region proposals)  
- Many improvements in vision came from enhancing CNN-based structures.  

2. Transformers in NLP  

- Transformers changed NLP.  
- They process large amounts of text using attention.  
- Transformers convert words and phrases (tokens) into embeddings, which are high-dimensional numeric vectors.  
- Attention captures contextual meaning:  
- Tokens used in similar contexts lead to vectors pointing in similar directions.  
- Transformers are used for tasks like translation, summarization, and text generation.  

3. Vision Transformers (ViT)  

- ViTs draw inspiration from the success of NLP transformers.  
They work by:  
- Splitting an image into patches  
- Flattening each patch into a vector  
- Applying attention to learn relationships between patches  
ViTs encode visual features such as:  
- Color  
- Shape  
- Texture  
- Contrast  
- They create a visual embedding space where patches with similar features have similar vectors.  
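The patch-splitting step can be sketched as below (patch size 16 and image size 224 are typical ViT choices, used here only for illustration; a real ViT would then project each vector into an embedding and apply attention):

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an image into non-overlapping square patches and flatten
    each patch into a vector -- the input step of a vision transformer."""
    h, w, c = image.shape
    patches = []
    for row in range(0, h, patch_size):
        for col in range(0, w, patch_size):
            patch = image[row:row + patch_size, col:col + patch_size]
            patches.append(patch.reshape(-1))  # flatten patch to a 1D vector
    return np.stack(patches)

# A 224x224 RGB image split into 16x16 patches gives 14*14 = 196 patches,
# each flattened to a 16*16*3 = 768-dimensional vector.
img = np.zeros((224, 224, 3), dtype=np.uint8)
patches = image_to_patches(img, 16)
# patches.shape == (196, 768)
```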

4. Visual Embedding Space  

- The model does not recognize specific objects, such as hats or heads.  
- It only understands relationships between visual features learned during training.  
For example:  
- “Hat” features often appear near “head” features, so their vectors become related.  

5. Multimodal Models  

These models combine:  
- Language embeddings (from text transformers)  
- Vision embeddings (from ViTs)  
- They use cross-modal attention to align the two.  
- This creates a shared vector space for images and text.  
It enables the model to:  
- Describe unseen images  
- Link text with image features  
- Understand relationships across different types of data  
- Develop rich semantic understanding  

6. Example Scenario  

- Image: A person in a park wearing a hat and a backpack.  
- The vision encoder extracts visual features.  
- The language encoder matches those visual features to words.  
- The combined model generates the description: “A person in a park with a hat and a backpack.” 

Module : 11

Azure Foundry Tools for Computer Vision

* Azure AI offers strong cloud-based tools for creating computer vision applications. These tools include ready-made models and options for training your own custom models.

* What Is Azure Vision?
- Azure Vision is a group of AI services that analyze images and videos. It features prebuilt deep learning models that help you understand visual content without needing to build a model from the ground up.
You can:
- Detect objects
- Read text from images
- Recognize faces
- Generate captions
- Tag images
- And much more

* Main Components of Azure Vision

1. Azure Vision Image Analysis

This service analyzes the content of images. It can:
- Detect common objects
- Generate image captions
- Tag visual features
- Perform OCR (extract text from images)
Useful for: SEO, content moderation, document digitization, accessibility apps, and more.

2. Azure AI Face Service

This is a specialized service for detecting and analyzing human faces. It can:
- Detect faces
- Recognize identity
- Analyze attributes such as age and emotion
- Match faces across images
Useful for: Security, attendance systems, device unlocking, social media tagging, searching for missing persons, and identity checks at airports.

* Real-World Applications of Azure Vision

1) Search Optimization

- Image tagging and captions help improve search rankings.

2) Content Moderation

- It detects sensitive or unsafe content before publishing.

3) Security

- Facial recognition supports access control and surveillance.

4) Social Media

- It allows auto-tagging of friends in photos.

5) Missing Persons

- CCTV and face recognition can identify missing individuals.

6) Identity Verification

- It can verify a person at ports, airports, kiosks, and more.

7) Museum and Archive Management

- OCR helps digitize old documents and photos.

* Azure Video Indexer

Modern systems combine several capabilities. Azure AI Video Indexer uses tools like:
- Image Analysis
- Face Recognition
- Speech and Translation
- This allows for deep analysis of videos, including objects, people, speech, and text.

* Summary

- Azure Vision is a set of tools for understanding images and videos.
- Image Analysis focuses on objects, text, tags, and captions.
- Face Service focuses on detecting, analyzing, and recognizing faces.
- This is useful for SEO, security, social media, identity checks, and moderation.
- Video Indexer combines multiple tools for video analysis.


Azure Vision Image Analysis

* Azure Vision Image Analysis offers strong prebuilt AI features to understand images. You can use these as they are or adjust them with custom models.

* Core Image Analysis Capabilities (No Customization Needed)  

1. Image Captioning

Azure Vision can:
- Understand objects and context in an image.  
- Generate a natural, human-like caption.  
Example  
- Image: A person skateboarding  
- Caption: “A person jumping on a skateboard”

2. Object Detection

Azure Vision can:
- Detect thousands of common objects.  
- Provide confidence scores.  
- Return bounding box coordinates (top, left, width, height).  
Example for the skateboarder image:  
- Person (95.5%)  
- Skateboard (90.4%)  
- Bounding boxes show where the objects are located in the image.

3. Tagging Visual Features

- Azure Vision automatically generates tags that describe key features of an image. Tags function like metadata and are great for search indexing.  
Example tags returned for a skateboarding image:  
sport  
person  
skating  
stunt  
extreme sport  
skateboarder  
outdoor  
jumping  
…and many more, each with a confidence score.

4. Optical Character Recognition (OCR)

- Azure Vision can extract text from printed images.  
Example:  
- Nutrition label ➝ detected text:  
- Nutrition Facts  
- Serving size: 1 bar (40g)  
- Total Fat 13g  
- Calories 190  
- Sodium 20mg  
…and more.  
OCR is useful for:  
- Digitizing documents  
- Extracting data from labels  
- Business automation  
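As a rough sketch of calling these capabilities from Python via REST (the endpoint, key, and image URL are placeholders; the api-version and feature names follow the Image Analysis 4.0 REST API at the time of writing, so check the current documentation before relying on them):

```python
import json
import urllib.request

# Placeholder values -- substitute your own resource's endpoint and key.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"

def build_analyze_request(image_url, features=("caption", "read")):
    """Build the URL, headers, and JSON body for an Image Analysis call
    that requests a caption and OCR ("read") for an image URL."""
    url = (f"{ENDPOINT}/computervision/imageanalysis:analyze"
           f"?api-version=2023-10-01&features={','.join(features)}")
    headers = {
        "Ocp-Apim-Subscription-Key": KEY,
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": image_url})
    return url, headers, body

url, headers, body = build_analyze_request(
    "https://example.com/skateboarder.jpg")

# To actually send the request (requires a real endpoint and key):
# req = urllib.request.Request(url, data=body.encode(), headers=headers)
# with urllib.request.urlopen(req) as resp:
#     result = json.loads(resp.read())
```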

* Custom Model Training (When You Need More Control)

- If built-in models do not suit your needs, Azure Vision allows you to train custom models using your own images. These models build on top of Azure’s pre-trained foundation models, so they require fewer images.

1. Custom Image Classification

- Predicts the category of an entire image.  
Example model:  
- Apple  
- Banana  
- Orange  
- Feed an image and the model returns the fruit type.

2. Custom Object Detection

- Detects multiple custom objects and returns bounding boxes.  
Example:  
- You can train a model to detect individual fruits like:  
- Apple  
- Banana  
- Orange  
…inside one image at the same time.  
Great for:  
- Retail  
- Manufacturing  
- Inventory management  
- Agriculture  

* Quick Summary

- Captioning: Generates natural descriptions.  
- Object Detection: Identifies objects and bounding boxes.  
- Tagging: Adds metadata tags for search.  
- OCR: Extracts text from images.  
- Custom Models: Train your own classifiers and detectors.


Azure Vision, Face Service Capabilities

* Azure AI Face is a specific service within Azure Vision. It concentrates on identity-related tasks like user verification, liveness detection, access control, and face recognition.

1. Facial Detection
- Face detection finds where human faces are in an image.
It provides:
- Bounding box coordinates around each face
- Facial landmarks such as:
  - Eyes
  - Nose
  - Eyebrows
  - Lips
  - Face contour
- These landmarks assist in further analysis, like recognition or emotion detection.

2. Facial Recognition

- Facial recognition identifies who a person is.
- A machine learning model trains using multiple images of the same person.
- Once trained, the model can recognize that person in new images.
It is useful for:
- Security systems
- Access control
- Attendance
- Personalized customer experiences
- When used responsibly, facial recognition boosts efficiency and security.

3. Azure AI Face – Supported Attributes

When Azure AI Face detects a face, it can return several attributes:

1) Accessories : Detects headwear, glasses, masks, each with a confidence score.

2) Blur : Shows how blurred the face is.

3) Exposure : Indicates if the face is underexposed or overexposed.

4) Glasses : Identifies if the person is wearing regular glasses or sunglasses.

5) Head Pose : Reports the angle and direction of the face in 3D space.

6) Mask : Indicates whether the person is wearing a mask.

7) Noise : Measures graininess or visual noise in the face area.

8) Occlusion : Checks if parts of the face are blocked by objects, like hands or hair.

9) Quality for Recognition : 
Rates the face image as:
- High
- Medium
- Low
- This helps determine if the image is suitable for recognition tasks.

4. Responsible AI & Limited Access Policy

1) Azure Vision and Face services follow Microsoft’s Responsible AI Standard.
Anyone can use the Face API for:
- Detecting face locations
- Getting attributes (glasses, noise, blur, exposure, occlusion, etc.)
- Getting head pose information

2) Limited Access Features (require an approval form):

- Face verification – Comparing two faces for similarity
- Face identification – Identifying named individuals
- Liveness detection – Detecting if input video is real or a spoof (like a photo, video replay, or mask)
- These require an intake form to ensure ethical and responsible use.

Quick Summary

- Face Detection: Finds faces and returns landmarks.
- Face Recognition: Identifies people using trained images.
- Attributes: Blur, head pose, mask, noise, exposure, occlusion, accessories.
- Responsible AI: Advanced features require approval.
- Use Cases: Identity verification, access control, redaction, fraud prevention.


 
