Speech recognition and transformation using deep neural network


Intro -

Deep neural networks are inspired by the human brain. Rather than following explicitly programmed “if-else” rules, a deep neural network learns to predict and produce solutions from data. The decision logic does not have to be hand-coded; instead, conclusions are drawn from learning and experience, much like in the human brain.

Deep Neural Network - A neural network consists of many connected units called nodes. These are the smallest parts of the network and play the role of neurons in the human brain. When a node receives a signal, it triggers a computation, and the result is passed on to other nodes depending on the input received. Together they form a complex network that learns from feedback.

The nodes are grouped into layers. A task is solved by passing data through the layers that lie between the input and output layers. The greater the number of layers, the deeper the network; hence the term deep learning.
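To make the idea of layered nodes concrete, here is a minimal sketch of a forward pass through a small fully connected network, written in plain NumPy. The layer sizes and random weights are arbitrary choices for illustration; a real network would learn its weights from data.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
layer_sizes = [16, 32, 32, 8]            # input layer, two hidden layers, output layer
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    # Pass the signal layer by layer; each node applies its weights, bias, and activation.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return x @ weights[-1] + biases[-1]  # output layer returns raw scores (logits)

print(forward(rng.standard_normal(16)).shape)   # (8,)
```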

Different types of DNN -

1. ANN: Artificial Neural Networks

2. CNN: Convolutional Neural Networks

3. RNN: Recurrent Neural Networks

SPEECH SIGNALS - Speech signals can provide us with many different kinds of information, for example:

• Speech recognition, which gives information about the content of the speech signal.

• Speaker recognition, which carries information about the speaker’s identity.

• Emotion recognition, which delivers information about the speaker’s emotional state.

• Health recognition, which offers information about the patient’s health status.

• Language recognition, which yields information about the spoken language.

• Accent recognition, which produces information about the speaker’s accent.

• Age recognition, which supplies information about the speaker’s age.

• Gender recognition, which carries information about the speaker’s gender.

Automatic speaker recognition can be defined as the process of recognizing an unknown speaker on the basis of the information embedded in his or her speech signal, using a machine (computer). Speaker recognition is divided into two tasks: speaker identification and speaker verification (authentication). Speaker identification is the process of determining to which of the registered speakers a given utterance belongs. It can be used in public facilities or in the media; such cases include, but are not limited to, district or other government institutions, calls to radio stations, insurance agencies, and documented conversations.
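As a rough illustration of the identification task, the sketch below compares a fixed-length “speaker embedding” of an unknown utterance against the embeddings of registered speakers and picks the closest one by cosine similarity. The embeddings here are random vectors standing in for the output of a real speaker model; the names and dimensionality are made up for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(utterance_embedding, enrolled):
    """Return the enrolled speaker whose embedding is most similar to the utterance."""
    scores = {name: cosine_similarity(utterance_embedding, emb)
              for name, emb in enrolled.items()}
    return max(scores, key=scores.get), scores

# Toy example: random vectors stand in for embeddings produced by a real speaker model.
rng = np.random.default_rng(0)
enrolled = {"alice": rng.standard_normal(128), "bob": rng.standard_normal(128)}
test_utterance = enrolled["alice"] + 0.1 * rng.standard_normal(128)   # a noisy "alice"
best, scores = identify_speaker(test_utterance, enrolled)
print(best, scores)
```

For verification (authentication) rather than identification, the same similarity score would simply be compared against a decision threshold instead of taking the maximum over all registered speakers.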

Deep learning in speech recognition -

In addition to offering excellent performance in image recognition, deep learning models have also shown state-of-the-art performance in speech recognition. A significant milestone was reached in acoustic modeling research with the aid of deep belief networks (DBNs) at multiple institutions. DBNs are trained in a layer-wise fashion followed by end-to-end fine-tuning for speech applications. This DBN architecture and training process has been extensively tested on a number of large-vocabulary speech recognition datasets, including TIMIT, Bing-Voice-Search speech, Switchboard speech, Google Voice Input speech, YouTube speech, and the English-Broadcast-News speech dataset. DBNs significantly outperform highly tuned Gaussian mixture model (GMM)-HMM systems, the previous state of the art in speech recognition.

RNNs have succeeded in improving speech recognition performance further because of their ability to learn the sequential patterns found in speech, language, and time-series data. However, RNNs are hard to train with the traditional backpropagation technique, which has difficulty carrying memory across portions of a sequence separated by large gaps. This problem is addressed by long short-term memory (LSTM) networks, which use special hidden units known as “gates” to retain memory over longer portions of a sequence. Early work studied the LSTM architecture in speech recognition over a large vocabulary set and found a double-layer deep LSTM to be superior to a baseline DBN model. LSTMs have also been used successfully in an end-to-end speech learning method known as Deep-Speech-2 (DS2) for two very different languages, English and Mandarin Chinese, and other LSTM-based studies have shown significant performance improvements over previous state-of-the-art DBN-based models. Extensive experiments have since compared various LSTM architectures for speech recognition against state-of-the-art models. The LSTM model is extended in Xiong et al., where a bidirectional LSTM (BLSTM) is stacked on top of convolutional layers to improve speech recognition performance.

The inclusion of attention enables LSTM models to outperform purely recurrent architectures. An attention mechanism called Listen, Attend, and Spell (LAS) is used to encode, attend, and decode, respectively; combining this LAS module with an LSTM improves speech recognition performance, and adding a pretraining technique to the attention-plus-LSTM model has pushed speech recognition performance to a new state-of-the-art level. Another memory network based on the RNN is proposed by Weston et al. to recognize speech content. This memory network stores pieces of information so that it can retrieve the answer related to an inquiry, which makes it distinct from standard RNNs and LSTMs. The contributions and limitations of several of these recurrent models are compared in Table IV below. RNN-based models have also reached far beyond speech recognition to support natural language processing (NLP). NLP aims to interpret language and semantics from speech or text in order to perform a variety of intelligent tasks, such as responding to human speech, powering smart assistants (Siri, Alexa, and Cortana), analyzing sentiment to identify positive or negative attitudes towards a situation, processing events or news, and translating language in both speech and text.
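As a minimal sketch of the kind of recurrent acoustic model discussed above, the PyTorch snippet below stacks a bidirectional LSTM on top of MFCC-style frame features and emits per-frame phoneme scores. It is a generic illustration, not the architecture of DS2 or any other cited system; the feature, hidden, and output sizes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Toy bidirectional LSTM acoustic model: MFCC frames in, per-frame phoneme logits out."""
    def __init__(self, n_features=40, hidden=256, n_layers=2, n_phonemes=48):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, num_layers=n_layers,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_phonemes)   # 2x for the two directions

    def forward(self, features):              # features: (batch, time, n_features)
        outputs, _ = self.blstm(features)     # (batch, time, 2 * hidden)
        return self.classifier(outputs)       # (batch, time, n_phonemes)

# Example: a batch of 8 utterances, 200 frames each, 40 MFCC coefficients per frame.
model = BLSTMAcousticModel()
logits = model(torch.randn(8, 200, 40))
print(logits.shape)                           # torch.Size([8, 200, 48])
```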

Speech emotion recognition and visual speech recognition are two important topics that have gained recent attention in the deep learning literature. Mirsamadi et al. use a deep recurrent network with local attention to automatically learn speech features from audio signals. Their RNN captures a large context region, while the attention mechanism focuses on the aspects of speech that are relevant to emotion detection. This idea is later extended in Chen et al., where a filter bank (frequency-band) representation of the speech signal is used as input to a convolutional layer, followed by LSTM and attention layers. Mirsamadi et al. have further improved on the work of Chen et al. to yield state-of-the-art performance on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) emotion recognition task, although they use heuristic features, including spectral and energy features of speech, as the network input. Visual speech recognition involves lip reading of human subjects in video data to generate text captions. Recently, two notable studies have used attention-based networks for this problem. Afouras et al. use a 3D CNN to capture spatio-temporal information of the face, and a transformer self-attention module guides the network to extract speech from the convolutional features. Stafylakis et al. consider zero-shot keyword spotting, where a phrase not seen during training is searched for in a visual speech video. The input video is first fed to a 3D spatio-temporal residual network to capture face information over time, followed by attention and LSTM layers that predict both the presence of the phrase in the video and the moment in time at which it occurs. Both studies consider “in the wild” speech recognition, i.e. a large breadth of natural sentences in speech.
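The attention-over-time idea used in these emotion recognition models can be sketched generically: a recurrent encoder produces a hidden state per frame, an attention layer scores each frame, and the attention-weighted average feeds an emotion classifier. This is not the exact model of Mirsamadi et al. or Chen et al.; the layer sizes and the choice of four emotion classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    """Toy recurrent encoder with attention pooling for utterance-level emotion classification."""
    def __init__(self, n_features=40, hidden=128, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.attention = nn.Linear(hidden, 1)        # scores each time step
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, features):                     # features: (batch, time, n_features)
        h, _ = self.encoder(features)                # (batch, time, hidden)
        weights = torch.softmax(self.attention(h).squeeze(-1), dim=1)   # (batch, time)
        pooled = torch.bmm(weights.unsqueeze(1), h).squeeze(1)          # (batch, hidden)
        return self.classifier(pooled)               # (batch, n_emotions)

model = EmotionRecognizer()
print(model(torch.randn(4, 300, 40)).shape)          # torch.Size([4, 4])
```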

Table IV. Comparison of recurrent neural network model contributions:

• Amodei et al. (gated recurrent unit network) - Application: English or Chinese speech recognition. Contribution: optimized speech recognition using gated recurrent units for processing speed, achieving near human-level results. Limitation: deployment requires a GPU server.

• Weston et al. (memory network) - Application: answering questions about simple text stories. Contribution: integration of a long-term memory component (readable and writable) within a neural network architecture. Limitation: questions and input stories are still rather simple.

• Wu et al. (deep LSTM) - Application: language translation (e.g. English-to-French). Contribution: multi-layer LSTM with an attention mechanism. Limitation: especially difficult translation cases and multi-sentence input yet to be tested.

• Karpathy et al. (CNN/RNN fusion) - Application: labeling images and image regions. Contribution: use of a CNN and an RNN together to generate natural language descriptions of images. Limitation: fixed image size; requires training the CNN and RNN models separately.

Datasets for vision and speech applications -

Several current datasets have been compiled for state-of-the-art benchmarking of computer vision. ImageNet is a large-scale dataset of annotated images, including bounding boxes, with over 14 million labeled images spanning more than 20,000 categories. CIFAR-10 is a dataset of smaller images, each containing a recognizable object class at low resolution: the images are only 32x32 pixels, and there are 60,000 of them spread evenly over 10 classes. Microsoft Common Objects in Context (COCO) provides segmentation of objects in images for benchmarking problems including saliency detection; it contains 2.5 million object instances in 328K images.

More complex image datasets are now being developed for UAV deployment, where detection and tracking take place in a highly unconstrained environment with different weather, obstacles, occlusions, and varied camera orientation relative to the flight path. Recently, two large-scale datasets were released for benchmarking detection and tracking in UAV applications. The Unmanned Aerial Vehicle Benchmark includes single and multiple bounding boxes for detection and tracking in various flight conditions. An even more ambitious project called Vision Meets Drones gathered a dataset with 2.5 million object annotations for detection and tracking in urban and suburban UAV flight environments.

Speech recognition also has several current datasets for state-of-the-art benchmarking. DARPA commissioned a collaboration between Texas Instruments and MIT (TIMIT) to produce a speech transcription dataset; TIMIT includes 630 speakers covering several American English dialects. VoxCeleb is a more recent speech dataset, with voice recordings of 1,000 celebrities in a more unconstrained, “in the wild” setting. In machine translation, Stanford’s natural language processing group has released several public translation datasets, including WMT'15 English-Czech, WMT'14 English-German, and IWSLT'15 English-Vietnamese; the English-Czech and English-German datasets have 15.8 and 4.5 million sentence pairs, respectively. CHiME-5 is a speech recognition dataset containing challenging conditions, including natural conversations with multiple speakers. A dataset called LRS3-TED has been compiled for visual speech recognition; it includes hundreds of hours of TED talk videos with subtitles aligned in time at the resolution of single words.
Many other niche datasets are freely available on the Kaggle challenge website, covering diverse computer vision and speech-related problems.
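As a quick example of pulling one of these benchmarks into a training script, the snippet below loads CIFAR-10 through the torchvision package (assuming torchvision is installed; the transform and batch size are arbitrary illustrative choices).

```python
import torch
from torchvision import datasets, transforms

# Download CIFAR-10 (60,000 32x32 colour images in 10 classes) and wrap it in a DataLoader.
transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.shape, labels.shape)   # torch.Size([64, 3, 32, 32]) torch.Size([64])
```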

How to formulate Automatic Speech Recognition (ASR)?

The overall flow of a typical ASR system can be represented by the following processing steps (a minimal sketch of the pipeline follows the list):

Pre-processing

Feature extraction

Classification

Language modeling.
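The hypothetical Python skeleton below wires these four stages together end to end. Every function body is a placeholder chosen for illustration, not the API of any real ASR toolkit.

```python
import numpy as np

def preprocess(waveform):
    """Placeholder pre-processing: normalise amplitude (real systems also denoise and filter)."""
    return waveform / (np.max(np.abs(waveform)) + 1e-10)

def extract_features(waveform, frame_len=400, hop=160):
    """Crude frame-level features: log energy of overlapping frames (real systems use MFCCs etc.)."""
    frames = [waveform[i:i + frame_len] for i in range(0, len(waveform) - frame_len, hop)]
    return np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])

def classify(features):
    """Placeholder acoustic model: would map each frame to phoneme or word scores."""
    return ["<unk>" for _ in features]

def language_model(tokens):
    """Placeholder language model: would rescore hypotheses into the most likely sentence."""
    return " ".join(tokens)

waveform = np.random.randn(16000)          # one second of fake 16 kHz audio
transcript = language_model(classify(extract_features(preprocess(waveform))))
print(len(transcript.split()), "tokens decoded")
```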

The pre-processing step aims to improve the audio signal by increasing the signal-to-noise ratio, that is, by reducing noise and filtering the signal.
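For the pre-processing stage specifically, a common concrete recipe is a pre-emphasis filter followed by a band-pass filter that keeps the frequency range where most speech energy lies. The cutoff frequencies and filter order below are illustrative assumptions, not fixed requirements.

```python
import numpy as np
from scipy.signal import butter, lfilter

def preemphasis(signal, coeff=0.97):
    """Boost high frequencies relative to low ones before feature extraction."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def bandpass(signal, sample_rate, low_hz=100.0, high_hz=7000.0, order=5):
    """Keep only the band where most speech energy lives, attenuating hum and hiss."""
    nyquist = 0.5 * sample_rate
    b, a = butter(order, [low_hz / nyquist, high_hz / nyquist], btype="band")
    return lfilter(b, a, signal)

sample_rate = 16000
noisy = np.random.randn(sample_rate)        # stand-in for one second of recorded audio
clean = bandpass(preemphasis(noisy), sample_rate)
```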

Datasets for ASR

Various databases of audiobooks, conversations, and talks have been recorded and transcribed.

The CallHome English, Spanish, and German databases (Post et al.) contain conversational data with a high number of out-of-vocabulary words. They are challenging databases, with foreign words and telephone-channel distortion. The English CallHome database has 120 spontaneous telephone conversations between native English speakers. The training set contains 80 conversations amounting to about 15 hours of speech, while the test and development sets contain 20 conversations each, with 1.8 hours of audio per set.

Conclusion -

It is evident that deep architectures have already had a significant impact on automatic speech recognition. Convolutional neural networks, recurrent neural networks, and transformers have all been used with great success, and today’s state-of-the-art (SOTA) models are based on some combination of these techniques.
