A Model That Captures Everyone's Words: Wav2vec for Automatic Speech Recognition

Speech recognition is the technique of converting audio into readable text. To do this, the vibrations of the speaker's voice are first converted into an electrical signal by a microphone and then into a digital signal. The speech recognition model then analyzes this digital signal and classifies it into the words or sentences it recognizes.
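To make the conversion into a digital signal concrete, here is a minimal sketch of sampling and quantizing a sound wave. It assumes a 440 Hz test tone, a 16 kHz sampling rate, and 16-bit samples; these values and the function name digitize_tone are illustrative choices, not part of any library.

```python
import math

def digitize_tone(freq_hz=440.0, sample_rate=16000, duration_s=0.01, bits=16):
    """Sample a sine wave and quantize it to signed integers (hypothetical helper)."""
    max_amp = 2 ** (bits - 1) - 1               # 32767 for 16-bit audio
    n_samples = round(sample_rate * duration_s)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                     # time of the n-th sample in seconds
        analog = math.sin(2 * math.pi * freq_hz * t)  # "analog" value in [-1.0, 1.0]
        samples.append(round(analog * max_amp))       # quantize to an integer
    return samples

digital = digitize_tone()
print(len(digital), "samples, first five:", digital[:5])
```

A real recording goes through the same two steps in hardware: the sampling rate fixes how many values are stored per second, and the bit depth fixes how finely each value is rounded.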

Technology is now quite advanced and automated thanks to techniques such as machine learning and deep learning, so automatic speech recognition models are readily available. Wav2Vec 2.0 is one of the current state-of-the-art models for automatic speech recognition (ASR). It is a pre-trained model and was released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau.

Wav2vec is a self-supervised learning model for automatic speech recognition, and it is available through the Hugging Face Transformers library. Now let's see how we can use this model for automatic speech recognition.
I am using Python to program the model. Python is a high-level, open-source, general-purpose programming language. It has a large collection of libraries that make programming easier.
Importing necessary libraries:
First of all, we need to import the required modules. Audio (from IPython.display) is used to play audio inside the notebook. wavfile (from scipy.io) reads a WAV file and returns its sampling rate and samples, from which we can compute the total duration. NumPy, short for Numerical Python, is the Python library for working with numerical arrays.
from IPython.display import Audio
from scipy.io import wavfile
import numpy as np
file_name = 'my_audio.wav'
# wavfile.read returns a (sample_rate, samples) tuple
data = wavfile.read(file_name)
framerate = data[0]
sounddata = data[1]
# Time stamp (in seconds) of each sample
time = np.arange(0, len(sounddata)) / framerate
print('Sample rate:', framerate, 'Hz')
print('Total time:', len(sounddata) / framerate, 's')
My audio file is named my_audio.wav. The variable data holds the sample rate and the samples returned by wavfile.read. This code prints the sample rate and the total duration of the audio file.
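If no recording is at hand, the same two numbers can be checked on a synthetic file. The sketch below is an alternative that uses only Python's standard-library wave module instead of scipy: it writes one second of silence to a temporary WAV file and reads the sample rate and duration back from the header. The file name silence.wav and the helper wav_info are made up for this example.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate_hz, duration_s) read from a WAV header (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate

# Write 1 second of 16-bit mono silence at 16 kHz to a temporary file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "silence.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

rate, duration = wav_info(path)
print("Sample rate:", rate, "Hz")
print("Total time:", duration, "s")
```

The duration is simply the number of frames divided by the sample rate, which is the same calculation as len(sounddata)/framerate.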
Generating Text from Audio
The wav2vec model is available through the Hugging Face Transformers library, so we need to install the transformers package. We use pip, Python's package installer, to install it.
!pip install -q transformers
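We also require libraries such as soundfile, librosa, and torch. Librosa is used for loading and resampling the audio file. Torch is the import name of PyTorch, the deep learning framework that runs the model.
Wav2Vec2Tokenizer prepares the audio as model input and decodes the output of the wav2vec model into text. Newer releases of Transformers mark this class as deprecated and recommend Wav2Vec2Processor instead.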
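import soundfile as sf
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
# Load the pre-trained tokenizer and model
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
# Load the audio and resample it to 16 kHz, the rate the model was trained on
input_audio, _ = librosa.load(file_name, sr=16000)
# Convert the waveform into a PyTorch tensor of input values
input_values = tokenizer(input_audio, return_tensors="pt").input_values
# Run the model without tracking gradients, since this is inference only
with torch.no_grad():
    logits = model(input_values).logits
# Pick the most likely token at each time step and decode the ids to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]
transcription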
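The last two steps, argmax followed by batch_decode, amount to greedy CTC decoding: take the most likely token for each time frame, merge consecutive repeats, drop the blank token, and turn the word delimiter into a space. The sketch below illustrates that idea in plain Python on a toy example. The vocabulary and frame ids are invented for illustration (the real model has a larger vocabulary), and ctc_greedy_decode is a hypothetical helper, not the library's implementation.

```python
from itertools import groupby

# Toy vocabulary (an assumption for this example): "<pad>" plays the role of
# the CTC blank and "|" marks a word boundary.
VOCAB = ["<pad>", "|", "H", "I", "T", "E", "R"]
BLANK_ID = 0

def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]  # merge repeats
    tokens = [vocab[i] for i in collapsed if i != blank_id]       # remove blanks
    return "".join(tokens).replace("|", " ").strip()

# One id per audio frame, like one row of torch.argmax(logits, dim=-1).
frame_ids = [2, 2, 0, 3, 3, 1, 1, 0, 4, 0, 2, 5, 5, 0, 6, 0, 5]
print(ctc_greedy_decode(frame_ids))
```

A blank between two identical ids keeps them as separate letters, which is how this scheme can still output doubled letters.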
The code above uses the wav2vec model to convert speech to text. In this blog, we have seen how to load an audio file, inspect it, and transcribe it with the wav2vec model. With this model, we can convert speech into text within a couple of minutes.
References:
1. https://huggingface.co/transformers/model_doc/wav2vec2.html
2. https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/