Status quo & trends in automatic speech recognition

You can also find many more useful tips in our eBook "Recording, typing, analyzing - a guide to conducting interviews and transcriptions".

The book is available as a free download: Now everything about transcription & Co!


What is speech recognition?


Machines that interact with humans are part of almost every good science fiction film. More than fifty years ago, in his novel "2001: A Space Odyssey", made into a film by Stanley Kubrick, Arthur C. Clarke created the vision of the computer HAL, which communicated with the humans on board the spaceship by speech as a matter of course.

Although machines already have some of the capabilities of HAL - such as playing chess or navigating a spaceship - we are still a long way from intelligent, meaningful and bidirectional communication between humans and machines.

Speech recognition software refers to special computer programs or apps that recognize spoken language and automatically convert it into written text. The speech is analyzed in terms of spoken words, meaning and speaker characteristics in order to achieve the most accurate result possible. This should not be confused with voice recognition, which is a biometric process used to identify people by their voice.

With the help of speech recognition software, spoken language is automatically converted into text - a distinction is made between speaker-dependent and speaker-independent speech recognition.

Speech recognition can now be used to control a PC, write emails or surf the internet. Numerous smart speakers with integrated voice control, such as Amazon's Alexa or Google Home, also use this technology. In addition, it is now included as standard in most smartphones.

A distinction is made between two types of speech recognition:

  • Speaker-independent speech recognition: Any voice can be recognized and processed, making it possible for anyone to operate the device. Although this type of application is aimed at a broad target group, the available vocabulary is limited.
  • Speaker-dependent speech recognition: In this variant, the program is trained for the individual language of the respective user, allowing specific abbreviations and phrases to be learned. The vocabulary is therefore much more extensive.

From a technical point of view, there are two possible ways of handling this process. Either it takes place directly on the user's device, whereby the result is available almost immediately (front-end), or the implementation takes place on a separate server, independent of the user's device (back-end).

A major role in this process is of course played by the quality of the sound recording. Many speakers, background noise or too great a distance from the microphone have a negative impact on the result. Due to these limitations and other difficulties, such as individual speaker behavior or dialect, completely automated transcription is not (yet) possible without errors and is therefore inferior in quality to manual transcription by a human. Human correction is therefore necessary in any case if a certain level of quality is to be achieved. However, under optimal conditions and with prior training on the user's voice, the results are already good. There are already numerous users, particularly among professional groups such as doctors and lawyers.

For automatic speech recognition, the quality of the recording is particularly important - the challenges are multiple speakers, background noise and deviations from standard pronunciation. In general, human correction is necessary.

The market leader in this field is the manufacturer Nuance Communications with its "Dragon" program series. The latest version Dragon Professional Individual 15 offers not only voice control of the PC but also a transcription function, even for any number of speakers. The following formats are supported:

.mp3, .aif, .aiff, .wav, .mp4, .m4a and .m4v

The market leader in this field is Dragon - Dragon Professional 15 offers extensive functions for transcription

The manufacturers promise that even non-dictated punctuation marks are set automatically. However, tests have shown that this does not work without errors, especially in interviews with a lot of background noise. In addition, the program cannot assign speakers. In the case of a single person whose voice the software has been trained on beforehand, the results are much better. However, you must always bear in mind that extensive training on your own voice requires a lot of work. This solution is not very practical for a group discussion or interview, as each speaker would need a license to use the program and the system would have to learn the voice of each individual interviewee.

The program cannot assign speakers and should be trained on your own voice for good results.

Accordingly, the software is comparatively expensive at €399. It can be used with Windows 7 or later, or with macOS. However, it should be noted that the transcription function is only included in the "Professional" version; the cheaper "Home" version only offers speech recognition and voice control. In addition, the software can only be used with Nuance-certified dictation devices. However, the "Dragon Anywhere" app allows mobile use of the functions on a smartphone.

In the meantime, other large corporations such as Google have also discovered this market for themselves and offer solutions for automated transcription in addition to voice-controlled speakers. With the help of Google Cloud Speech API, speech can also be converted into text. Neural networks and machine learning are also used to continuously improve the results.

Google Cloud Speech offers an alternative - here, speaker assignment is still in the test phase.

In conclusion, it can be said that the software is not yet worthwhile due to the high price and the many errors with several speakers or slight background noise. Satisfactory results cannot be achieved without learning the speech patterns of the people in advance. Added to this is the subsequent high correction effort. A speaker assignment must also be carried out manually. This cannot yet be done by the AI. This function is currently being tested by Google, among others, and here too the speaker assignment is still too imprecise. The automated setting of time stamps is also not possible; this function is also still in the test phase (e.g. at f4).

Without previously trained speech patterns, the correction effort is usually very high - speaker assignment still has to be done manually.


Scientific study: speech recognition is 67.6% accurate 


abtipper.de conducted a scientific study in 2019 and 2020 to assess the performance of the seven speech recognition systems currently available for the German-speaking world. In addition to large providers such as Google and Alexa, a number of smaller niche providers were also examined.

The test examined how high the word recognition rate is in a normal conversation recording with two people, i.e. a typical interview situation. Depending on the subject area and experience, a human being achieves a rate of 96-99% in a manual audio transcription. This means that for every 100 words, there are usually 1-4 errors in human transcription.

The best speech recognition system achieved a value of 67.6%. This means that 2/3 of the words are currently recognized correctly. However, even some of the larger systems are currently far below this value, with Bing's system performing the worst.

Overview of the quality (in percent) of machine-generated transcripts, as results of the scientific study: quality of transcripts created by automatic speech recognition.

All in all, however, machine transcription generally does not yet reach the level of a manually created transcription. For a first impression, here is an example of the transcription of an interview (with two female speakers) using artificial intelligence. This was created by one of the currently best-known transcription programs, Google Cloud Speech-to-Text.

Exemplary result of a language recognition:
Interview Anette Bronder at the Hannover Messe
(excerpt from: https://www.youtube.com/watch?v=Es-CIO9dEwA, accessed on 08.05.2019)

"Digitization and networking will once again play a major role at this year 's HannoverMesse Industrie Telekom will berepresented with a stand for thethird time and will be showing very specific application examples the motto is "Making digitization simple" Ms. Bronder what do you actually mean by "making it simple" can we give you an example yes very good keyword delivered to me simply make it simple you just said the trade fair will be the third time on the subject of digitization here at the Hannover Messe.I think now is the time to move fromthe lab to the real world, we have to expect that now it is also waiting for Germany as a location to come up with solutions explicitly for SMEs but also for large customers that are applicable and standardized .Kit a box with the hardware with sensor technology where we make the topic of data collection and data evaluation very easy for customers what other technologies and solutions Telekom is still presenting here I would like to point out, however, that it is important for us to say this year would not be technology and solutions that we have status, but we are offering the Internet of Things as a service package for the very first time we are able to deliver connectivity via our good network cloud solutions security solutions through to individual detailed solutions in analytics

Here it can be seen once again that no speaker assignment is made by the "AI". Punctuation is also not taken into account.

Overall, it can be said that automated speech recognition is currently suitable for two fields of application:

  • For dictations (e.g. from lawyers or doctors): For these recordings with usually only one speaker and excellent audio quality, as well as a limited vocabulary, a tool can be trained very well for the corresponding voice and vocabulary and thus deliver good results.
  • It can also be useful if the transcription quality requirements are low. This is the case, for example, in the digitization of radio archives where searchability is the goal and therefore perfect transcripts are not required. With an often extremely large amount of material, manual transcription is ruled out from the outset in such applications for economic reasons.

Unfortunately, automated speech recognition is not yet suitable for all other purposes, e.g. interviews, at the current technical level. However, further developments may be expected here in the coming years and decades.


Order your transcription now at abtipper.de! 

 

The result shows that automated speech recognition systems still leave a lot to be desired, especially in situations with several speakers. For transcription, they are only suitable for very specific applications (e.g. digitization of archives that would otherwise not be financially viable). The situation is different for recordings with only one speaker (e.g. typical dictation). Here, the systems currently already achieve values of around 85% and can therefore already be used sensibly for some practical applications.

There are already some comparable surveys for the recognition of previously known commands (e.g. Alexa skills). However, these reflect an unnatural speech situation with previously known topics and commands. The quality of free speech recognition without an artificially limited vocabulary has now been scientifically investigated for the first time by abtipper.de for the German-speaking world.


Fields of application for automated speech recognition


There are already numerous practical applications for audio transcription. In addition to the rapid increase in the use of smartphone speech recognition, for example to quickly compose text messages and emails or to control voice assistants such as Apple's Siri, Amazon's Alexa or Microsoft's Cortana, voice transcription technologies have also become indispensable in call centers and hospitals.

In fact, we at abtipper.de have been the first provider in Germany to offer transcriptions using artificial intelligence since 2018:

In the case of transcription by artificial intelligence, the transcription is carried out using automated speech recognition.

Thanks to our speech recognition system specially developed for transcriptions, recordings with few, clearly speaking speakers and perfect sound quality achieve particularly good results.

Even if the quality of transcription using artificial intelligence does not yet quite match that of manual transcription, there are many fields of application for which it is particularly suitable. This applies above all to the digitization of large amounts of data where manual transcription would not be worthwhile in terms of price.

Click here for an example of a transcript generated by artificial intelligence.

Procedure for transcription with artificial intelligence: Acceptable results can only be achieved with this type of transcription if the above criteria are met. For this reason, we first check all relevant submissions with our experts. If, for example, a good transcript cannot be produced due to dialect, background noise or too many speakers, you will be informed of this within 6 to a maximum of 24 hours, including a detailed explanation. You are then free to choose a different type of transcription.

With this type of transcription, we offer to create two minutes of your file as a sample transcript free of charge and without obligation so that you can check the result of this new type of transcription. You can then decide for the specific case whether the quality meets your requirements or whether a manual transcription would be more suitable. To do this, please place an order and note in the comments field that you would like a free trial transcription.

Order your transcription by artificial intelligence from abtipper now!


The history of automatic speech recognition - a look back


John Pierce, pioneer of speech recognition

Research into speech recognition systems began in the early 1960s but did not initially produce promising results. The first systems, developed by IBM, made it possible to recognize individual words under laboratory conditions, but due to a lack of technical knowledge in what was then a new field of research, no significant progress was made - a conclusion also drawn in a 1969 report by the US engineer John Pierce of Bell Labs, an expert in high-frequency technology, telecommunications and acoustics.

 

The IBM Shoebox from the 1960s could recognize 16 words. (Source: IBM)

It was not until the mid-1980s that research was given new impetus by the insight that similar-sounding words can be distinguished by their context. By compiling and systematically evaluating statistics on the frequency of certain word combinations, it was possible to automatically deduce which word was meant when words sounded alike.

An important milestone here was the introduction of a new speech recognition system by IBM in 1984, which was able to understand 5,000 individual English words and convert them into text using so-called "trigram statistics". However, the recognition process required several minutes of processing time on an industrial mainframe computer and was therefore practically unusable. In contrast, a system developed by Dragon Systems only a little later, which could be used on a portable PC, was much more advanced.

 

Excerpt from an IBM speech recognition commercial, 1984 (Source: IBM)

In the following years, IBM worked intensively on improving its speech recognition software. In 1993, the first speech recognition system developed for the mass market and commercially available, the IBM Personal Dictation System, was introduced.

In 1997, both the successor version IBM ViaVoice and version 1.0 of the Dragon NaturallySpeaking software were released. While the further development of IBM ViaVoice was discontinued after a few years, Dragon NaturallySpeaking became the most widely used speech recognition software for Windows PCs. The software has been produced and distributed by Nuance Communications since 2005.

With the acquisition of Philips Speech Recognition Systems in 2008, Nuance also acquired the rights to the SpeechMagic software development kit, which is widely used in the healthcare sector in particular.

Siri Inc. was founded in 2007 and acquired by Apple in April 2010. With the launch of the iPhone 4s in 2011, the automatic voice assistant Siri was presented to the public for the first time and has been continuously developed ever since.


How the speech-to-text systems work


Modern speech recognition systems have become an integral part of our everyday lives. But how do they actually work?

The basic principle of transcription is very simple: when we speak, we push air out of our lungs. Depending on the spoken syllables, this causes the air to vibrate in certain patterns, which are captured by the microphone and stored as a sound file. This file is then divided into small segments that are searched for known sounds. However, because not all sounds are recognized reliably, an intermediate step is necessary.
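To make this first step concrete, here is a minimal Python sketch of the segmentation described above. It assumes a mono, 16-bit WAV file (the filename recording.wav is a placeholder); the 25 ms window and 10 ms step are typical values, not fixed requirements:

```python
import wave

import numpy as np

# Load a mono 16-bit WAV file (placeholder filename).
with wave.open("recording.wav", "rb") as wav:
    sample_rate = wav.getframerate()
    samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# Recognizers typically analyze short overlapping windows (~25 ms),
# stepping forward ~10 ms at a time, so each frame captures
# roughly one stable portion of a sound.
frame_len = int(0.025 * sample_rate)  # 25 ms window
hop_len = int(0.010 * sample_rate)    # 10 ms step

frames = [
    samples[start:start + frame_len]
    for start in range(0, len(samples) - frame_len + 1, hop_len)
]
print(f"{len(frames)} frames of {frame_len} samples each")
```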

Using the so-called "hidden Markov method", the speech recognition software calculates which sound is likely to follow another and which in turn could come after it. In this way, a list of possible words is created, with which the same thing happens in a second run that previously happened with the letters: the computer analyzes the probability with which a certain word follows another - after "I'm going to..." comes "home" rather than "shower" or "break". However, the computer can only know this if it knows a large number of spoken sentences and how often and in what context the words occur.
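As an illustration of these word statistics, here is a toy bigram model in Python; the three-sentence corpus is invented purely for demonstration, whereas real systems learn from millions of sentences:

```python
from collections import Counter, defaultdict

# A tiny invented training corpus; real systems use millions of sentences.
corpus = [
    "i am going home",
    "i am going home now",
    "i am going to shower",
]

# Count how often each word follows another (bigram statistics).
follower_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1

def next_word_probability(current_word: str, candidate: str) -> float:
    counts = follower_counts[current_word]
    total = sum(counts.values())
    return counts[candidate] / total if total else 0.0

# After "going", "home" is more likely than "to" in this toy corpus.
print(next_word_probability("going", "home"))  # 0.666...
print(next_word_probability("going", "to"))    # 0.333...
```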

Illustration of how the Hidden Markov Model works

Such a computational task exceeds the processing capabilities of a pocket-sized cell phone many times over. It can only be solved by using cloud computing, i.e. outsourcing difficult computing operations to large stationary computers. The cell phone itself simply records the voice command, converts it into a sound file, sends it to the computer center via the Internet and has it analyzed there. The result is then sent back to the smartphone via the Internet.
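A hedged sketch of this division of labor between device and data center is shown below; the endpoint URL and the response format are purely hypothetical, as every real service defines its own API:

```python
import requests  # third-party HTTP library

# Hypothetical recognition endpoint; real services each define their own API.
RECOGNITION_URL = "https://example.com/speech/recognize"

def transcribe_remotely(audio_path: str) -> str:
    # The device only records and uploads the sound file;
    # the heavy computation happens in the data center.
    with open(audio_path, "rb") as f:
        response = requests.post(
            RECOGNITION_URL,
            files={"audio": f},
            timeout=30,
        )
    response.raise_for_status()
    # Assumed response shape: {"transcript": "..."}
    return response.json()["transcript"]

print(transcribe_remotely("voice_command.wav"))  # placeholder filename
```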

The huge databases of speech and text files that have already been spoken and correctly transcribed by humans, which are stored via cloud computing, are the real secret behind the success of the new speech recognizers. Good speech recognition software cannot simply be programmed like a new computer game or a printer driver. "The trick is to obtain good data and integrate it optimally into the learning process," says Joachim Stegmann, Head of the Future Telecommunication department at Telekom Innovation Laboratories.

Really good and accurate speech recognition software also requires a particularly large number of recordings of everyday speech, so that dialects, speech errors, mumbling and falsetto voices can also be captured. The speakers should also be demographically diverse: children, men, women, old and young people, and people of different regional origins should all be equally represented. In practice, transcripts of parliamentary speeches, manuscripts read aloud or recordings of radio broadcasts are used, for example.


Opportunities and challenges in the development of automatic speech recognition


Well-functioning speech recognition systems promise to make our everyday lives much easier. In professional applications, they could automate the transcription of spoken language in the future - for example, the recording of minutes or the often laborious manual transcription of speeches, interviews or videos. They are also becoming increasingly popular in the private sphere, whether for voice-controlled operation of smartphones in the car, calling up Google searches or operating smart home applications such as switching the lights on and off or turning the heating down.

The big challenge with electronic speech recognition, however, is that nobody pronounces a term in exactly the same way in every situation. Sometimes the user is tired, sometimes hectic, sometimes loud, sometimes quiet, sometimes concentrated, sometimes drunk, sometimes angry, sometimes has a cold. It is therefore very difficult for software to recognize words by searching for congruent tone sequences.

Older people or people on the move are particularly difficult for the systems to understand. Background noises make recognition even more difficult - Microsoft is therefore already working on the new "CRIS" software, which enables the individual configuration of frequently occurring background noises and vocabulary and should therefore also allow use in noisy production areas or in retirement homes.

Current systems achieve recognition rates of around 99 percent when dictating continuous text on personal computers and therefore meet practical requirements for many areas of application, e.g. scientific texts, business correspondence or legal briefs. Their use reaches its limits where an author constantly needs new words and word forms that the software does not yet know; adding them manually is possible, but simply not efficient if they occur only once in texts by the same speaker.

Benchmark of speech recognition systems for English (Source: The Economist)

 

The most important providers of automatic speech recognition systems


As with many modern technologies, new providers are springing up like mushrooms in the field of audio transcription.

Nuance is the market leader in automatic speech recognition and transcription with its Dragon NaturallySpeaking software. The use of deep learning technology enables the software to be used even in environments with strong background noise. Through targeted training on a specific speaker, an accuracy of up to 99% can be achieved in speech-to-text conversion with just a few minutes of invested "reading time". Meanwhile, Nuance is working on the next generation of in-car electronics, which in future should make it possible to accurately write complicated texts by voice input, use social networks and consult search engines without distracting the driver's attention from the road.

Based on the same technology, but far better known than Nuance, is Siri, the personal voice assistant that has been available to Apple users since the release of the iPhone 4s. The software can be started with the command "Hey Siri" and therefore requires almost no manual operation at all. However, it is only suitable to a limited extent as speech recognition software for dictating entire letters or longer texts, as it does not continuously record speech and continuously output digital text. Siri stores a few spoken sentences until they are sent to the central translation server with a "Done" command, or stops recording when its buffer is full. Dictation must then pause until the digital text has been sent back. This transmission poses risks for information security, and if it is interrupted, e.g. in a cellular dead zone, the dictated text is lost.

Similar to Apple's Siri, Microsoft operates the virtual assistant Cortana on Windows Phone 8.1, which uses Bing search and personal information stored on the smartphone to provide the user with personalized recommendations. There are already plans to extend its functions to the smart control of household appliances such as fridges, toasters and thermostats using Internet of Things technology. Microsoft also reached a historic milestone in October 2016 with its speech recognition software based on the Computational Network Toolkit: using deep learning technology, the software achieved an error rate of just 5.9% in comparative tests between humans and machines - the same error rate as its human counterparts. The software thus achieved parity between humans and machines for the first time.

Google also opened a programming interface for cloud services as a beta version in March 2016. The Cloud Speech API translates spoken text into written text and recognizes around 80 languages and language variants. The API can deliver the text as a stream while it is being recognized and automatically filters out background noise. It is currently only available to developers.
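For illustration, here is a minimal usage sketch with the official Python client library (package google-cloud-speech). The filename, sample rate and language code are assumptions that must be adapted to the actual recording, and valid Google Cloud credentials are required:

```python
from google.cloud import speech

client = speech.SpeechClient()

# Read a local recording (placeholder filename).
with open("interview.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,  # must match the recording
    language_code="de-DE",    # German, as in the study above
)

# Synchronous recognition for short clips; long recordings
# would use long_running_recognize instead.
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```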

Amazon also recently announced the release of the new "Amazon Lex" service for the development of conversational interfaces with voice and text. It is based on the technology for automatic speech recognition and natural language understanding that Amazon Alexa also uses. Developers will be able to use the new service to build and test intelligent voice assistants - so-called bots.

And the IBM Watson cognitive system, which marked the dawn of the era of cognitive computing in 2011, makes use of neural networks, machine learning and text analysis tools, in particular speech recognition, in order to learn on its own. According to IBM, even irony, metaphors and puns are no longer an obstacle for Watson.


Conclusion


In recent years, the technology has developed rapidly, supported in particular by cloud computing and the automated processing of extremely large amounts of data that it enables as the basis for intelligent systems. With the help of professional speech recognition software, largely error-free automatic transcription is already possible today under optimal conditions, such as a single clear speaker and a trained voice profile.

However, pure speech recognition systems are just the beginning. True interaction between man and machine - as prophesied in science fiction films - requires machines that not only reproduce information, but also understand contexts and can make intelligent decisions.


Order your transcription by artificial intelligence from abtipper now!


Further questions and answers

✅ How does speech recognition work?

Automatic speech recognition systems basically all work in the same way.

Simply put, the core is always a large database in which many possible variants of the pronunciation of one or more words with the matching text are stored. When a new recording is fed into the system, it compares the sound with the database and outputs the text that is most likely to match this recording.
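A deliberately simplified sketch of this lookup idea follows; the stored feature vectors are invented for illustration, and real systems compare statistical sound models rather than raw vectors:

```python
import numpy as np

# Invented "database": one stored feature vector per known word.
database = {
    "yes": np.array([0.9, 0.1, 0.2]),
    "no":  np.array([0.1, 0.8, 0.3]),
}

def recognize(features: np.ndarray) -> str:
    # Output the stored entry whose features are closest to the input,
    # i.e. the text most likely to match the recording.
    return min(database, key=lambda word: np.linalg.norm(database[word] - features))

print(recognize(np.array([0.85, 0.15, 0.25])))  # -> "yes"
```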

The larger and better maintained this database is, the better the speech recognition will be. Furthermore, of course, the recording quality plays a major role in achieving a good recognition rate.

✅ Is it possible to transcribe with speech recognition?

Transcription with speech recognition is possible.

With a dictation from a person with clear pronunciation, no dialect and no background noise, a quality level of approx. 90% can be achieved with speech recognition. This is only just below the usual human transcription level of approx. 95%. If one of these prerequisites is missing, as in almost all interviews or group conversations, today's speech recognition systems are not yet able to generate comprehensible texts.

According to current scientific studies, speech recognition in interviews currently only reaches a level of around 65%, resulting in largely incomprehensible texts.

✅ Which provider has the best speech recognition?

There are now many providers for automatic speech recognition.

The systems differ in terms of
- recognition rate (how many words are recognized correctly)
- spelling and punctuation
- format (e.g. with or without speaker assignment)
- usability (as a program, app or only via an API interface)
- price and billing model

Google Speech-to-Text and Nuance (Dragon) achieve good results for the German language. Overall, the best systems currently achieve a recognition rate of approx. 67% under good conditions, i.e. approx. 67 words are recognized correctly out of 100. A manual transcription has a recognition rate of approx. 97%.
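The recognition rate quoted here is simply one minus the word error rate. As a sketch (the two sentences below are invented examples), it can be computed with an edit distance over words:

```python
def word_errors(reference: str, hypothesis: str) -> int:
    # Levenshtein distance over words: substitutions, insertions, deletions.
    ref, hyp = reference.split(), hypothesis.split()
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
    return dist[-1][-1]

reference = "the quality of the recording is important"
hypothesis = "the quality of a recording is important"
errors = word_errors(reference, hypothesis)
rate = 1 - errors / len(reference.split())
print(f"recognition rate: {rate:.0%}")  # 86% for this invented example
```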

We will start your project today: