OPENAI WHISPER – Your speech-to-text AI
When asked to define “whisper,” most people would describe a soft or confidential tone of voice. Amid the recent attention on ChatGPT and DALL-E 2, other releases from OpenAI, like “Whisper,” have been overlooked. Whisper is an Automatic Speech Recognition (ASR) system that can transcribe audio files in nearly 100 languages and even translate them into English if required.
To better understand the concept of Whisper and how it can be used, let’s begin by defining the tasks it performs.
Automatic Speech Recognition (ASR)
As evident from the title, the task requires an algorithm or system that can accurately distinguish human speech from background noise and other sounds, and then generate text based on the recognized speech. This task can be categorized into online and offline Automatic Speech Recognition (ASR), depending on the specific use case and available resources.
Online ASR systems have numerous applications. They can be used for tasks that require real-time conversion of speech to text, such as generating live stream subtitles, automatically transcribing court proceedings, providing assistance in contact centers, and moderating content, among others.
Furthermore, it could also be integrated into pipelines to accomplish more complex tasks. Just imagine a platform that utilizes ASR technology to convert your voice input into text, and then feeds that text into ChatGPT for further processing.
Consequently, users could put questions to ChatGPT quickly, without manual typing. Moreover, one could take it a step further and add a voice generation model at the end of the pipeline, converting ChatGPT’s response into speech the user can hear; a rough sketch of such a pipeline follows.
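This is only a minimal illustration, not a production design: it assumes the openai-whisper and pyttsx3 packages plus ffmpeg are installed, uses the legacy (pre-1.0) openai client with an API key already configured, and the file name voice_input.wav is hypothetical.

```python
import whisper   # openai-whisper: speech-to-text
import openai    # legacy (pre-1.0) OpenAI client; assumes OPENAI_API_KEY is set
import pyttsx3   # offline text-to-speech, standing in for a voice generator

# 1. Speech-to-text: transcribe the user's recorded question
asr_model = whisper.load_model("base")
question = asr_model.transcribe("voice_input.wav")["text"]  # hypothetical file

# 2. Text-to-text: forward the transcript to ChatGPT
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)
answer = response.choices[0].message.content

# 3. Text-to-speech: read the answer back to the user
engine = pyttsx3.init()
engine.say(answer)
engine.runAndWait()
```

In a real assistant, each stage would need to stream incrementally rather than wait for the previous one to finish; the sketch only shows the data flow.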
This technology would enable users to hold an almost natural, spoken conversation with ChatGPT, making it a cutting-edge voice assistant. However, for such scenarios, the ASR component must operate in real time, which means it must be computationally efficient while maintaining high transcription accuracy.
In contrast, offline ASR systems do not have strict speed requirements, so they tend to be more accurate but also larger. These systems are suitable for tasks that do not require real-time performance, such as voice search, extracting lyrics from songs, and generating subtitles for video files, among others.
Automatic Speech Recognition Technology
- Mathematical Modeling Era
The origin of Automatic Speech Recognition can be traced back to 1952 with Bell Labs’ “Audrey,” a rudimentary ASR system that could only recognize spoken digits from 0 to 9. There was then a gap of almost 20 years before the introduction of Hidden Markov Models (HMMs), a statistical method that brought significant advancements to ASR technology.
Researchers combined HMM-based acoustic models with audio processing techniques and n-gram language models, most famously the “trigram” model. Refined versions of this HMM-plus-trigram approach are still used in some modern ASR systems because of their real-time speed and solid accuracy.
- The Deep Learning Era
In recent decades, there has been a rapid increase in the popularity of neural networks, although they were originally conceived in the 1960s. The limited availability of computational resources at that time prevented their widespread adoption, but advancements in technology have enabled the massive parallel computations required for neural networks to become feasible.
Deep learning models are essentially neural networks that consist of multiple layers of artificial neurons organized in a specific architecture. Their widespread adoption began with the significant increase in accuracy achieved by AlexNet in the ImageNet image classification challenge, and this success quickly extended to various other domains and tasks.
Initial efforts to utilize deep learning models for ASR were focused on enhancing the performance of “trigram” models by employing more advanced feature extractors for spoken language. Thus, the ASR technology involved a combination of a deep learning model, HMM, and various audio and text processing techniques.
Nevertheless, a major limitation of these models was that they were not trainable end-to-end. This meant that the deep learning model had to be trained separately from the HMM. As a result, researchers and developers took a significant step forward by focusing on end-to-end models, with the Whisper model being a notable example of this approach.
The Whisper Model
As progress was made toward end-to-end training, the next obstacle arose: the scarcity of labeled data for speech recognition. The much-discussed models of the previous year, DALL-E 2 and ChatGPT, were trained on vast amounts of data.
For ChatGPT, the training run reportedly required approximately 1,000 GPUs and cost around 4.8 million US dollars. However, unlike computer vision and natural language processing, which have abundant large datasets, researchers in speech recognition faced challenges in obtaining datasets of comparable scale.
To tackle this problem, OpenAI introduced a solution called Whisper. They pointed out that public datasets alone are inadequate for the requirements of Automatic Speech Recognition (ASR) models, as these datasets typically focus on a single task and a single language.
They suggested implementing multitask learning, wherein a single model is trained to perform multiple tasks. These tasks may include speech detection, transcription in non-English languages, transcription in English, and translation from any language to English.
Earlier ASR models were largely restricted to the single task of transcribing English speech, because English dominated the available speech recognition datasets. Whisper broadens this by incorporating similar datasets in many other languages.
In addition, there is an ample supply of speech detection and language-to-English translation data, as numerous videos and movies are available with English translations. This allowed them to compile a training dataset that exceeds 680,000 hours of audio.
When it comes to speech recognition, the model takes in a raw audio file and converts it into a log-Mel spectrogram, a representation of the audio’s frequency content over time. The model then generates the corresponding text of what is spoken in the audio.
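The openai-whisper package exposes helpers for exactly this preprocessing step. The following sketch mirrors the pattern in the project’s README; it assumes ffmpeg is installed, and the file name audio.mp3 is a placeholder:

```python
import whisper

model = whisper.load_model("base")

# Load the waveform and pad/trim it to Whisper's fixed 30-second window
audio = whisper.load_audio("audio.mp3")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Convert the waveform into a log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)
print(mel.shape)  # torch.Size([80, 3000]): 80 Mel bins x 3000 time frames
```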
A common way to make a deep learning model handle multiple tasks is to create separate branches within the architecture, each dedicated to one task. Whisper instead keeps a single sequence-to-sequence model: during inference, special tokens are provided along with the input audio to tell the decoder which task’s output to produce, such as transcription in the source language or translation into English.
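Concretely, Whisper’s decoder is conditioned on tokens such as <|startoftranscript|>, a language tag like <|en|>, and a task tag like <|transcribe|> or <|translate|>. In the Python API these are not typed by hand but set through DecodingOptions; a minimal sketch, under the same assumptions as above:

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder file
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# task="translate" makes decoding start from the <|translate|> task token,
# so the output is English text regardless of the input language;
# fp16=False avoids half precision, which is unsupported on CPU
options = whisper.DecodingOptions(task="translate", fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```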
How OpenAI Whisper Works
Consequently, the Whisper model can perform speech recognition in nearly 100 languages. It does so by first identifying the language of the input speech and then, if requested, translating the resulting text into English. Unlike ChatGPT, which is offered as a hosted service, the authors of Whisper released the code and pre-trained models publicly, available here, rather than deploying them on a specific platform.
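The language identification step is exposed in the Python API as well. A minimal sketch under the same assumptions (ffmpeg installed, placeholder file name):

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))  # placeholder file
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns token ids and a dict mapping language codes
# (e.g. "en", "fr") to probabilities
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```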
Using Whisper
Because Whisper was released as source code and model weights rather than a finished application, prospective users need a basic understanding of Python to work with it. For perspective, about 20 years ago engineers had to hand-code every mathematical component of a deep learning algorithm from scratch to train even a simple classification model; today, powerful Python libraries let you import pre-built modules for these tasks.
With the current trend of making AI models user-friendly, it is likely that speech recognition technology will soon ship behind intuitive, easy-to-use interfaces, so that anyone can use it without writing code.
The creators of Whisper have released pre-trained models in several sizes, ranging from tiny to large. The tiny model is faster but less precise than the larger ones. In the tutorial, we used the tiny model, since it is accurate enough to distinguish between the speeches of the two presidential candidates.
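For reference, the highest-level usage pattern takes just a few lines (following the project’s README; the file name is a placeholder):

```python
import whisper

# Load the smallest (fastest, least accurate) of the pre-trained models;
# other sizes include "base", "small", "medium", and "large"
model = whisper.load_model("tiny")

# transcribe() handles loading, 30-second chunking, and decoding in one call
result = model.transcribe("speech.mp3")  # placeholder file name
print(result["text"])
```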
Conclusion
Whisper primarily functions as an offline Automatic Speech Recognition technology. However, with a powerful Graphics Processing Unit (GPU), it can also approach real-time performance, unless perhaps it is asked to transcribe Eminem’s “Rap God,” whose sheer speed and density would challenge any ASR system.