AI-900 8,9

Module 8: Introduction to AI Speech Concepts

Speech-Enabled Solutions

* Speech technologies change how users interact with AI by allowing voice input (speech recognition) and voice output (speech synthesis). They make applications more accessible, easy to use, and human-like.


* Why Add Speech Capabilities?

- Accessibility: Helps users with visual or mobility challenges.
- Productivity: Allows multitasking without screens or keyboards.
- User Experience: Encourages natural, human-like conversations.
- Global Reach: Works in many languages and dialects.


* Speech Recognition (Speech-to-Text)

- Speech recognition converts spoken audio into written text. It is widely used in customer service, productivity, automation, and accessibility.

** Common Speech Recognition Scenarios

1. Customer Service & Support

- Real-time call transcription
- Auto-routing calls based on spoken requests
- Sentiment analysis
- Searchable call records  
- Business Value: Better accuracy, less manual note-taking, improved customer insights.

2. Voice-Activated Assistants & Agents

- Execute voice commands
- Answer questions
- Set reminders, send messages
- Control smart devices and cars  
- Business Value: Higher engagement and hands-free operation.

3. Meeting & Interview Transcription

- Create searchable notes
- Real-time captions
- Summaries for interviews or focus groups
- Extract key discussion points  
- Business Value: Saves hours of manual work, ensures accuracy, boosts accessibility.

4. Healthcare Documentation

- Dictate patient notes
- Update records quickly
- Reduce paperwork and burnout
- Improve documentation accuracy  
- Business Value: More time for patient care and fewer errors.

* Speech Synthesis (Text-to-Speech)

- Speech synthesis converts text into natural-sounding audio for communication, guidance, and accessibility.

** Common Speech Synthesis Scenarios
1. Conversational AI & Chatbots

- Respond using realistic voices
- Adjust tone and speaking style
- Answer customer questions
- Provide a consistent brand voice  
- Business Value: More engaging, friendly, and accessible customer interactions.

2. Accessibility & Content Consumption

- Reads articles and website text aloud
- Helps users with dyslexia or reading challenges
- Enables listening while multitasking  
- Business Value: Expands audience and boosts inclusivity.

3. Alerts & Notifications

- Voice alerts, reminders, navigation cues
- Updates on industrial system status  
- Business Value: Offers critical information without using screens.

4. E-Learning & Training

- Narrated lessons
- Audio versions of written content
- Pronunciation and language learning
- Scalable multilingual course creation  
- Business Value: Saves production time and supports different learning styles.

5. Entertainment & Media

- Generate character voices
- Create podcasts and audiobooks
- Video voiceovers
- Personalized audio experiences  
- Business Value: Faster production and highly customizable content.

* Combining Speech Recognition with Speech Synthesis

Combining the two creates full conversational experiences:
- Voice-based customer service
- IVR systems with natural dialogue
- Language learning apps with voice feedback
- Voice-controlled cars and devices
- Result: Smooth, two-way communication that feels natural.
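The two-way loop described above can be sketched with stub functions standing in for the real speech services; the recognizer, reply logic, and synthesizer here are all illustrative placeholders, not an actual API.

```python
def recognize(audio: bytes) -> str:
    """Stub speech-to-text: pretend the audio decodes to a fixed utterance."""
    return "what is my account balance"

def respond(text: str) -> str:
    """Toy dialogue logic: map a recognized request to a reply."""
    if "balance" in text:
        return "Your balance is one hundred dollars."
    return "Sorry, I did not understand."

def synthesize(text: str) -> bytes:
    """Stub text-to-speech: a real system would return an audio waveform."""
    return text.encode("utf-8")  # placeholder for synthesized audio

# Two-way loop: audio in -> recognized text -> reply text -> audio out
reply_audio = synthesize(respond(recognize(b"\x00\x01")))
```

In a production system each stub would be replaced by a call to a speech service, but the chaining pattern stays the same.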

* Before Implementing Speech Features: Key Considerations

- Audio Quality: Background noise and mic quality affect accuracy.
- Language Support: Check which languages and dialects are supported.
- Privacy & Compliance: Securely handle and store audio.
- Latency: Real-time experiences need low latency.
- Accessibility Standards: Must follow WCAG guidelines.
- Always Provide Alternatives: Users should have text options as well.

* Pro Tip

- Start small: Choose one valuable speech feature, test it, and gradually expand. 


Speech Recognition

* Speech recognition (speech-to-text) converts spoken language into written text through six stages:  
Audio Capture, Pre-processing, Acoustic Modeling, Language Modeling, Decoding, Post-processing.


1. Audio Capture - Converting Sound to Digital

- A microphone converts analog sound waves into a digital signal.  
- Typical sampling rate is 16 kHz to balance clarity and efficiency.  
- Higher rates, like 44.1 kHz, provide more detail, but require more processing power.  
- Speech recognition typically uses 8-16 kHz.  
- Accuracy depends on noise, microphone quality, and distance.  
- Basic noise filters remove hum and clicks before processing starts.  

2. Pre-processing - Extracting Meaningful Features

- Raw audio contains too much detail to model directly; compact features help models recognize patterns.  
- MFCC (Mel-Frequency Cepstral Coefficients) is the most common feature extraction method.  


How MFCC works:  

- Split audio into 20-30 ms overlapping frames.  
- Apply Fourier Transform to convert to the frequency domain.  
- Map frequencies to the Mel scale, which mirrors human hearing's finer sensitivity to low frequencies.  
- Extract around 13 coefficients per frame to summarize the sound.  

- This produces feature vectors for each frame, which serve as inputs for acoustic models. 
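The framing step above can be sketched in a few lines: split a 16 kHz signal into overlapping 25 ms frames with a 10 ms hop. This covers only the framing; a real pipeline would then window each frame, apply the Fourier Transform, and map to the Mel scale. The frame and hop lengths are typical values, not fixed requirements.

```python
SAMPLE_RATE = 16_000
FRAME_MS, HOP_MS = 25, 10
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 400 samples per frame
HOP_LEN = SAMPLE_RATE * HOP_MS // 1000       # 160 samples between frame starts

def frame_signal(samples):
    """Return overlapping frames; a tail too short to fill a frame is dropped."""
    return [samples[start:start + FRAME_LEN]
            for start in range(0, len(samples) - FRAME_LEN + 1, HOP_LEN)]

one_second = [0.0] * SAMPLE_RATE
frames = frame_signal(one_second)
# one second of audio yields 98 overlapping 25 ms frames at a 10 ms hop
```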

3. Acoustic Modeling - Recognizing Phonemes

- Acoustic modeling identifies phonemes, the basic sound units like /k/, /æ/, and /t/.  
- Modern systems use Transformer models.  

Reasons Transformers are effective:  

- The attention mechanism interprets surrounding frames for better clarity.  
- They process in parallel, making them faster than older RNN-based systems.  
- They learn natural phoneme patterns for contextual predictions.  

- The output is a probability distribution of phonemes for each frame.  
- For example, frame 42 might show an 80% probability of /æ/ and a 15% probability of /ɛ/.  
- Note: Phonemes are specific to each language; English models do not handle Mandarin tones.  
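The acoustic model's per-frame output can be pictured as a probability distribution over phonemes; the toy numbers below echo the "frame 42" example above and are purely illustrative.

```python
# One frame's output from a (hypothetical) acoustic model: a probability
# distribution over candidate phonemes for that 20-30 ms slice of audio.
frame_42 = {"/æ/": 0.80, "/ɛ/": 0.15, "/ʌ/": 0.05}

def top_phoneme(dist):
    """Most likely phoneme for one frame."""
    return max(dist, key=dist.get)

best = top_phoneme(frame_42)
assert abs(sum(frame_42.values()) - 1.0) < 1e-9  # a valid distribution sums to 1
```

Real decoders do not simply take the argmax per frame; they pass these distributions to the language model and decoder, as the next stages describe.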

4. Language Modeling - Predicting Correct Word Sequences

- Acoustic predictions can confuse words like “their” and “there.”  
- A language model improves accuracy by using:  

1) Statistical patterns, like preferring “weather is nice” over “whether is nice.”  
2) Context awareness, expecting verbs when the phrase begins with “I need to…”  
3) Domain adaptation for specific vocabularies, such as medical or legal terms.  

- This guides the system to choose realistic and grammatically correct words.  
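A minimal bigram sketch shows how statistical patterns break acoustic ties such as "weather" vs. "whether"; the counts below are invented for illustration.

```python
# Made-up bigram counts standing in for statistics learned from a large corpus.
bigram_counts = {
    ("weather", "is"): 900,
    ("whether", "is"): 40,
    ("whether", "or"): 700,
    ("is", "nice"): 300,
}

def score(words):
    """Product of bigram counts as a crude sequence score (0 if any pair is unseen)."""
    total = 1
    for pair in zip(words, words[1:]):
        total *= bigram_counts.get(pair, 0)
    return total

candidates = [["weather", "is", "nice"], ["whether", "is", "nice"]]
best = max(candidates, key=score)  # the "weather" reading scores far higher
```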

5. Decoding - Selecting the Best Transcription

- Decoding searches through possible word combinations to find the most likely text.  
Beam Search Decoding:  
- Maintains a beam of the best partial hypotheses.  
- Expands these hypotheses step by step using probabilities from the acoustic and language models.  
- Prunes weaker candidates.  
- Selects the final best transcription.  
- Decoding is computationally intensive. The beam width influences the speed and accuracy.  
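The beam search steps above can be sketched directly: keep the best `beam_width` partial hypotheses, extend each by every candidate phoneme, prune, and repeat. The per-step probability tables are invented for illustration.

```python
# Invented per-frame phoneme probabilities for a three-frame utterance ("cat").
steps = [
    {"k": 0.9, "g": 0.1},
    {"æ": 0.8, "ɛ": 0.2},
    {"t": 0.7, "d": 0.3},
]

def beam_search(steps, beam_width=2):
    beams = [("", 1.0)]  # (partial hypothesis, probability)
    for dist in steps:
        # Expand every surviving hypothesis with every candidate symbol
        expanded = [(hyp + sym, p * q)
                    for hyp, p in beams for sym, q in dist.items()]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]  # prune weaker candidates
    return beams[0]

best, prob = beam_search(steps)  # -> "kæt" with probability 0.9 * 0.8 * 0.7
```

Widening the beam explores more hypotheses (better accuracy, slower decoding); narrowing it does the reverse, which is exactly the speed/accuracy trade-off noted above.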

6. Post-processing - Clean, Human-Readable Output

- Post-processing refines the raw decoded text into a readable format.  
Common tasks include:  
- Capitalization  
- Restoring punctuation  
- Formatting numbers (e.g., “one thousand twenty-three” to “1,023”)  
- Filtering profanity  
- Normalizing text (e.g., “three p m” to “3 PM”)  
- Scoring confidence to highlight uncertain words  
- Azure Speech also provides timestamps and confidence values.  
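A toy post-processing pass makes the idea concrete: capitalize, restore a final period, and normalize the spoken time "three p m" to "3 PM". The rules are illustrative only, not what Azure Speech actually implements.

```python
import re

SPOKEN_NUMBERS = {"one": "1", "two": "2", "three": "3"}

def post_process(raw: str) -> str:
    text = raw.strip()
    # Normalize spoken times like "three p m" -> "3 PM"
    text = re.sub(r"\b(one|two|three) p m\b",
                  lambda m: SPOKEN_NUMBERS[m.group(1)] + " PM", text)
    # Capitalize the first letter and restore a sentence-final period
    text = text[0].upper() + text[1:]
    if not text.endswith("."):
        text += "."
    return text

print(post_process("the meeting is at three p m"))  # The meeting is at 3 PM.
```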

* How the Full Pipeline Works  

- Audio capture leads to raw sound.  
- Pre-processing generates MFCC feature vectors.  
- Acoustic modeling produces phoneme probabilities.  
- Language modeling generates likely word sequences.  
- Decoding results in the best transcription.  
- Post-processing delivers the clean final text.  

* Troubleshooting Accuracy  

- If results are incorrect, the problems often stem from:  
- Poor audio input, such as noise or microphone issues  
- A weak language model that lacks domain-specific terms  
- Poorly tuned decoding parameters, such as the beam width  
- Overly aggressive or insufficient post-processing


Speech Synthesis (Text-to-Speech) 

* Speech synthesis (TTS) converts written text to spoken audio using a four-stage process:

1) Text Normalization, 2) Linguistic Analysis, 3) Prosody Generation, 4) Speech Synthesis

1. Text Normalization – Preparing Text for Speech

- This step changes raw text into a form that can be spoken.

Tasks include:

- Expanding abbreviations (e.g., Dr. to Doctor)
- Changing numbers to words (3 to three)
- Converting currency (25.50 to twenty-five dollars and fifty cents)
- Interpreting dates (12/15/2023 to December fifteenth, two thousand twenty-three)
- Handling symbols ($ to dollars, @ to at)
- Resolving homographs (read as /riːd/ vs /rɛd/)

Importance:
- It avoids awkward or robotic pronunciation by turning text into natural spoken language.

Domain note:
- Different fields have specific rules (medical, financial, legal, etc.).
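A minimal normalization sketch covers two of the tasks above: expanding abbreviations and turning digits into speakable words. The abbreviation and digit tables are tiny illustrative samples, not a complete rule set.

```python
import re

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = {"1": "one", "2": "two", "3": "three"}

def normalize(text: str) -> str:
    # Expand known abbreviations
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Convert single digits to words (real systems handle full numbers, dates, currency)
    text = re.sub(r"\b(\d)\b", lambda m: DIGITS.get(m.group(1), m.group(1)), text)
    return text

print(normalize("Dr. Chen arrives at 3"))  # Doctor Chen arrives at three
```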

2. Linguistic Analysis – Mapping Words to Phonemes

- This phase breaks text into phonemes, which are the smallest speech sounds.

What it does:

- Splits text into words and syllables
- Uses pronunciation dictionaries
- Applies G2P (grapheme-to-phoneme) models for unknown words
- Identifies stressed syllables
- Considers context to select the right pronunciation (e.g., read in present vs past)

** G2P Example:

- though to /θoʊ/
- through to /θruː/
- cough to /kɔːf/

- Modern systems deploy transformer-based neural models to manage irregular spellings and context.
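The lexicon-plus-fallback pattern can be sketched as a dictionary lookup with a naive letter-to-sound rule for unknown words; real systems replace the fallback with a trained neural G2P model, and the tables here are illustrative.

```python
# Pronunciation dictionary for known irregular words
LEXICON = {"though": "/θoʊ/", "through": "/θruː/", "cough": "/kɔːf/"}
# Deliberately naive letter-to-sound map for the fallback path
LETTER_SOUNDS = {"c": "k", "a": "æ", "t": "t"}

def g2p(word: str) -> str:
    if word in LEXICON:
        return LEXICON[word]  # dictionary hit: use the known pronunciation
    # Fallback: map letters one-to-one (real English needs contextual rules)
    return "/" + "".join(LETTER_SOUNDS.get(ch, ch) for ch in word) + "/"

assert g2p("though") == "/θoʊ/"   # irregular word comes from the lexicon
```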

3. Prosody Generation – Making Speech Sound Natural

- Prosody includes rhythm, pitch, stress, pauses, and intonation.

Elements of prosody:

- Pitch (rise/fall)
- Duration (length of sounds)
- Intensity (loudness)
- Pauses (boundaries)
- Stress patterns (emphasis)

Importance:
- Prosody is what makes speech sound human and not robotic.

Example:
- “I never said he ate the cake.”
- The meaning changes based on which word is emphasized.

Transformer-based prosody prediction:

- Encodes phonemes and linguistic features
- Uses self-attention to identify relationships
- Predicts pitch, duration, and energy
- Adapts to style, speaker identity, and emotion

Prosody is influenced by:

- Syntax (sentence structure)
- Semantics (meaning)
- Discourse context
- Speaker voice/style
- Emotional tone
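A toy prosody plan for the "I never said he ate the cake" example: assign every word a baseline pitch and duration, then boost the emphasized word. The numbers are invented; real models predict these values per phoneme from syntax, semantics, and context.

```python
def assign_prosody(words, emphasize):
    plan = []
    for word in words:
        pitch, duration = 100.0, 1.0        # baseline pitch and relative length
        if word == emphasize:
            pitch, duration = 140.0, 1.3    # stress: higher pitch, longer sound
        plan.append({"word": word, "pitch": pitch, "duration": duration})
    return plan

sentence = "I never said he ate the cake".split()
plan = assign_prosody(sentence, emphasize="never")
```

Re-running with `emphasize="he"` or `emphasize="cake"` yields a different plan, which is exactly how emphasis shifts the sentence's meaning.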

4. Speech Synthesis – Generating the Final Audio

- This step creates the audio waveform from phonemes and prosody.
- Modern approach: Neural Vocoders
- Popular vocoders include WaveNet, WaveGlow, and HiFi-GAN.

Steps in waveform generation:

1. The acoustic model generates mel-spectrograms from phonemes and prosody.
2. The vocoder converts the spectrogram to raw audio.
3. Post-processing includes noise removal, normalization, and effects.

Why neural vocoders sound natural:

- High audio quality (near studio quality)
- Captures subtle voice details (breathiness, tone)
- Operates in real time (for interactive apps)
- Easily adapts to any speaker, language, or style

- TTS is the inverse of speech recognition:
(ASR: audio to text; TTS: text to audio, with the vocoder producing the final waveform)

* Full Pipeline Summary (Example)

Input:
- “Dr. Chen's appointment is at 3:00 PM”

Pipeline:

Normalization:
“Doctor Chen’s appointment is at three o’clock P M”

Linguistic analysis produces phonemes:
/ˈdɑktər ˈtʃɛnz əˈpɔɪntmənt ɪz æt θri əˈklɑk pi ɛm/

Prosody prediction:
Slight rise on “appointment,” pause after “is,” emphasis on “three”

Speech synthesis:
The neural vocoder generates the final natural-sounding audio.

All this happens in under one second on modern systems.


Module 9: Get Started with Speech in Azure

Speech Recognition & Speech Synthesis  

* Speech Recognition (Speech-to-Text)

* Definition:  
- Converts spoken audio into data, usually text.

* Sources of audio:

- Recorded voice (audio files)
- Live microphone input

* How it works:  
- Speech recognition uses multiple models:

1. Acoustic Model  
- Converts audio signals into phonemes, which are the basic units of sound.

2. Language Model  
- Converts phonemes into words.  
- Uses statistical algorithms to predict the most likely word sequence.

* Uses of Speech Recognition:

- Closed captions for videos
- Transcribing calls or meetings
- Automatically dictating notes
- Understanding user commands in systems

* Speech Synthesis (Text-to-Speech)

* Definition:  
- Converts text into spoken audio.

* Required inputs:

- Text to speak
- Voice properties, like voice style and tone

* How it works:

- Tokenization: Breaks text into words.
- Assigns phonetic sounds to words.
- Prosody analysis: Breaks text into natural units, such as phrases and clauses.
- Generates phonemes and converts them to audio.
- Applies chosen voice, pitch, rate, and volume.

* Uses of Speech Synthesis:

- Voice responses in applications
- Interactive voice menus
- Reading text aloud, such as emails and SMS
- Public announcements in places like airports and stations

Azure AI Speech, Overview  

* Azure AI Speech Capabilities  

- Azure provides multiple speech-related services:  
- Speech to Text (STT)  
- Text to Speech (TTS)  
- Speech Translation  

* Speech to Text (STT)  

What it does:  

- Converts audio, whether real-time or recorded, into text.  

Model:  

- Uses Microsoft’s Universal Language Model  

Optimized for:  

- Conversational speech  
- Dictation  

- You can build custom acoustic, language, or pronunciation models if needed.  

1. Real-Time Transcription  

- Transcribes live audio from:  
- Microphone  
- Audio stream or file  

App must:  

- Listen to audio input  
- Stream audio to Azure Speech service  
- Receive continuous text output  
- Uses include presentations, demos, live captions, and voice-enabled apps.  

2. Batch Transcription  

- For non-real-time scenarios  
- Works with prerecorded audio stored:  
- On file shares  
- Servers  
- Azure Storage via SAS URI  
- Runs asynchronously  
- Jobs start within minutes, but timing isn’t guaranteed.  

* Text to Speech (TTS)  

What it does:  

- Converts text to spoken audio that can be played or saved as an audio file.  

Voice Options:  

- Many predefined voices across languages and accents  
- Includes neural voices that sound more natural with better intonation  
- You can create custom voices  

Choose:  

- Voice style  
- Speaking rate  
- Tone and pitch  

* Speech Translation  

What it does:  

- Provides real-time translation of spoken language.  

How it works:  

- Converts speech to text using ASR  
- Translates the recognized text into the target language(s)  

Output can be:  

- Text  
- Synthesized speech  
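The pipeline above (ASR, then translation, then optional synthesis) can be sketched with placeholders; every function and the phrasebook here are stubs for illustration, not real service calls.

```python
# Tiny made-up phrasebook standing in for a machine-translation service
PHRASEBOOK = {("hello", "fr"): "bonjour"}

def recognize_speech(audio: bytes) -> str:
    return "hello"  # stub ASR result

def translate(text: str, target: str) -> str:
    return PHRASEBOOK.get((text, target), text)

def speech_translation(audio: bytes, target: str, as_audio: bool = False):
    """ASR -> translate -> return text, or bytes standing in for synthesized speech."""
    text = translate(recognize_speech(audio), target)
    return text.encode("utf-8") if as_audio else text

result = speech_translation(b"...", "fr")  # text output path
```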

Supports:  

- Many source and target languages  
- REST APIs and SDK integration  

Use cases include:  

- Multilingual meetings  
- Live event captioning  
- Global customer support  
- Real-time translation apps

Using Azure AI Speech  

* Ways to Use Azure AI Speech  

You can access Azure AI Speech through:  

- Studio interfaces (Azure AI Foundry Speech Playground)  
- Command Line Interface (CLI)  
- REST APIs  
- SDKs (for multiple programming languages)  

1) Studio Interfaces  

- The Azure AI Foundry portal provides a Speech Playground.  

You can experiment with:  

- Speech to Text  
- Text to Speech  
- Speech Translation  

- No coding is needed for basic testing and prototyping.  

Azure Resources Needed  

- To use Azure AI Speech in any application, create one of the following:  

1. Speech Resource  
   This is dedicated to Azure AI Speech only.  
   It is best when:  
   - You want Speech separated from other services.  
   - You need different billing or access management.  

2. Azure AI Services Resource  
   This includes multiple Azure AI services, including Speech.  
   It is best when:  
   - You will use Speech with other AI services (Vision, Language, etc.).  
   - You prefer one billing and access control.    
