Status Quo & Trends in Automatic Speech Recognition

Automated speech recognition

What is speech recognition?


Machines that interact with people are part of almost every good science fiction film. More than sixty years ago, Arthur C. Clarke, in his novel "2001 - A Space Odyssey", filmed by Stanley Kubrick, created the vision of the computer HAL, which communicated linguistically with the people on board the spaceship as a matter of course.

Although machines today already have some of the capabilities of HAL - such as playing chess or navigating a spaceship - we are still a long way from intelligent, meaningful and bidirectional communication between humans and machines.

Speech recognition software refers to special computer programs or apps that recognise spoken language and automatically convert it into written text. The speech is analysed in terms of spoken words, meaning and speaker characteristics to achieve the most accurate result possible. This is not to be confused with voice recognition, i.e. a biometric procedure to identify people by their voice.

With the help of speech recognition software, speech is automatically converted into text - a distinction can be made between speaker-dependent and speaker-independent speech recognition.

In the meantime, speech recognition can be used to control the PC, write e-mails or surf the Internet. Numerous speakers with integrated voice control, such as Alexa from Amazon or Google Home, also use this technology. In addition, it is now included as standard in most smartphones.

A distinction is made between two types of speech recognition:

  • Speaker-independent speech recognition: Here, any voice can be recognised and processed, making it possible for anyone to operate the device. Although this type of application is aimed at a broad target group, the available vocabulary is limited.
  • Speaker-dependent speech recognition: With this variant, the programme is trained on the individual speech of the respective user, so that specific abbreviations and phrases can be learned. The vocabulary is thus much more extensive.

From a technical point of view, there are two possible ways of handling this process. Either it takes place directly on the user's device, whereby the result is available almost immediately (front-end), or the implementation takes place on a separate server, independent of the user's device (back-end).

The quality of the sound recording plays a major role in this process, of course. Many speakers, background noise or too great a distance from the microphone have a negative influence on the result. Due to these limitations and other difficulties, such as individual speaker behaviour or dialect, completely automated transcription is not (yet) possible without errors and is therefore qualitatively inferior to manual transcription by a human. Human post-correction is therefore necessary in any case if a certain level of quality is to be achieved. However, under optimal conditions and with prior training based on the user's voice, the results are already good. There are already numerous users, especially among professional groups such as doctors or lawyers.

For automatic speech recognition, the quality of the recording is particularly important - challenges are many speakers, background noise and deviations from the standard pronunciation. Generally, human correction is necessary.

The market leader in this field is the manufacturer Nuance Communications with its "Dragon" programme series. The latest version Dragon Professional Individual 15 offers a transcription function in addition to the voice control of the PC, also for any number of speakers. The following formats are supported:

.mp3, .aif, .aiff, .wav, .mp4, .m4a and .m4v

The market leader in this field is Dragon - Dragon Professional Individual 15 offers extensive functions for transcription.

The manufacturer promises that even non-dictated punctuation marks are set automatically. However, tests show that this by no means works without errors, especially in interviews with a lot of background noise. In addition, the programme cannot assign speakers. With a single person on whose voice the software has been trained beforehand, the results are much better. However, one must always bear in mind that extensive training on one's own voice requires a lot of work. This solution is not very practical for a group conversation or interview, as each speaker would need a licence to use the programme and the system would have to learn the voice of each individual interlocutor.

The programme cannot assign speakers and should be trained on your own voice.

Accordingly, the software is comparatively expensive at 399€. It can be used with Windows 7 or higher or with MacOS. It should be noted, however, that the transcription function is only included in the "Professional" version. The cheaper "Home" version only offers speech recognition and control. In addition, the software can only be used with dictation devices certified by Nuance. On the other hand, the "Dragon Anywhere" app allows mobile use of the functions on a smartphone.

In the meantime, other large corporations such as Google have also discovered this market for themselves and, in addition to voice-controlled speakers, also offer solutions for automated transcriptions. With the help of Google Cloud Speech API, speech can also be converted into text. In addition, neural networks and machine learning are used to constantly improve the results.

An alternative is offered by Google Cloud Speech - here, speaker assignment is in the test phase.

In conclusion, it can be said that the software is not yet worthwhile due to the high price and the many errors with multiple speakers or slight noise. Without learning the speech patterns of the persons in advance, no satisfactory results can be achieved. In addition, there is the high subsequent correction effort. Speaker assignment must also be corrected manually, as the AI cannot yet do this. At Google, among others, this function is in the test phase; here, too, the speaker assignment is still too imprecise. Automated time stamps are not yet possible either; this function is also still in the test phase (e.g. at f4).

Without pre-trained speech patterns, the correction effort is usually very high - speaker assignment must still be done manually.

Scientific study: speech recognition is 67.6% accurate 

A scientific study was undertaken in 2019 and 2020 to assess the performance of the seven speech recognition systems currently available for the German-speaking world. In addition to large providers such as Google and Alexa, a number of smaller niche providers were also examined.

The test examined how high the word recognition rate is in a normal conversation recording with two people, i.e. a typical interview situation. A human achieves a rate of 96-99% in a manual audio transcription, depending on the subject area and his or her experience. This means that for 100 words, there are usually 1-4 errors in the human transcription.

The best speech recognition system achieved a value of 67.6%. This means that currently around two thirds of the words are recognised correctly. However, even some of the larger systems are currently far below this value, with Bing's system performing the worst.
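The study's exact scoring method is not described here. One common way to arrive at such a figure is to count the word-level substitutions, insertions and deletions needed to turn the machine transcript into a human reference transcript, and to relate that count to the length of the reference. The following Python sketch assumes that approach; the function names `word_errors` and `word_recognition_rate` and the example sentences are invented for illustration.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, insertions and deletions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing (deletion)
                current[j - 1] + 1,      # extra hypothesis word (insertion)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_recognition_rate(reference: str, hypothesis: str) -> float:
    """Share of reference words counted as correct, in percent."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return max(0.0, 100.0 * (1 - word_errors(reference, hypothesis) / n))


# Invented example: one word misheard ("plays" -> "place"), one word dropped ("very")
human = "the quality of the recording plays a very important role"
machine = "the quality of the recording place a important role"
print(word_errors(human, machine), round(word_recognition_rate(human, machine), 1))
```

In this invented example there are two errors in ten reference words, i.e. a rate of 80%. A value of 67.6% would correspond to roughly one error in every three words.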

Overview of the quality (in percent) of machine-generated transcripts, as results of a scientific study:

[Figure: Quality of transcripts produced by automatic speech recognition]


All in all, however, the machine transcription does not yet reach the level of a manually created transcription. For a first impression, here is an example of the transcription of an interview (with two speakers) with artificial intelligence. This was created by one of the currently most popular transcription programmes, Google Cloud Speech-to-Text.

Example result of speech recognition:
Interview Anette Bronder at Hannover Messe
(excerpt from:, accessed 08.05.2019)

"Digitization and networking are also playing an important role thisyear at the Hannover Messe Industrie Telekom is represented for the third time witha stand and is showing very specific examples of applications themotto is Making digitization simple Ms. Bronder what do you actually mean by making it simple can we give ourselves an example yes very good keyword already delivered make it simple you said yes just nowthe trade fair is becoming the theme for the third time ondigitization here at theHannover Messe.Fair I believe now the time is come of mummy from thelaboratory purely in the practice must expect could now he is waiting also the location Germany take I with solutions to come explicitly also for middle class in addition, for large customers the applicable arestandardized he me to the first Starter-Kit a box with the hardware with sensor technology where we make the topic data collect that data evaluate already customer very simply which further technologies and solutions theTelekom comes still here before each quantity I would like to point out however that it is important to us this year to say would not be technology and solutions that are we status have we however we offer the topic Internet of the thingsas service package for the very first time we are ableconnectivity over our good net to supply Cloud solutions Security solutionsup to individual detail solutions in the Analytics"

Here it can be seen once again that the "AI" makes no speaker attribution. Punctuation is also not taken into account.

Overall, it can be said that automated speech recognition is currently suitable for two fields of application:

  • For dictations (e.g. from lawyers or doctors): For these recordings with usually only one speaker who is always the same and an excellent audio quality, in addition to a limited vocabulary, a tool can be trained very well to the corresponding voice and vocabulary and thus deliver good results.
  • If the requirements for transcription quality are low, the use can also make sense. This is the case, for example, in the digitisation of radio archives where searchability is the goal and therefore perfect transcripts are not necessary. With an often extremely large amount of material, manual transcription is ruled out from the outset in such applications for reasons of economy.

For all other purposes, e.g. interviews, automated speech recognition is unfortunately not yet suitable at the current technical level. However, further developments can possibly be expected here in the coming years and decades.

The result shows that especially in situations with multiple speakers, automated speech recognition systems still leave a lot to be desired. For transcription, they are only suitable for very specific use cases (e.g. digitisation of archives that would otherwise not be financially worthwhile). The situation is different, however, for recordings with only one speaker (e.g. typical dictation). Here, the systems currently already achieve values around 85% and can thus already be used sensibly for some practical applications.

There are already some comparable surveys for the recognition of previously known commands (e.g. Alexa Skills). However, these reflect an unnatural speech situation with previously known topics and commands. The quality of free speech recognition without an artificially limited vocabulary has now been scientifically investigated for the first time for the German-speaking area.

Fields of application of automated speech recognition

Already today, there are numerous practical areas of use for audio transcriptions. In addition to the exponential increase in the use of speech recognition on smartphones, for example for quickly composing short messages and emails or for controlling voice assistants such as Apple's Siri, Amazon's Alexa or Microsoft's Cortana, speech transcription technologies are now also indispensable in call centres and hospitals.

In fact, since 2018, we have been the first provider in Germany to offer transcriptions through artificial intelligence:

In artificial intelligence transcription, the transcription is done through the use of automated speech recognition.

Thanks to our speech recognition system specially developed for transcriptions, recordings with few, clearly speaking speakers and flawless sound quality achieve particularly good results.

Even if the quality of transcription by artificial intelligence does not yet quite reach that of manual transcription, there are many fields of application for which it is particularly suitable. This is especially true for the digitisation of large amounts of data where manual transcription would not be worth the price.

Procedure for transcription with artificial intelligence: Acceptable results can only be achieved with this type of transcription if the above criteria are met. Therefore, we first check all corresponding submissions by our experts. If, for example, a good transcript cannot be produced due to dialect, background noise or too many speakers, you will be informed of this, including detailed reasons, within 6 to a maximum of 24 hours. You are then free to choose another transcription type.

With this type of transcription, we offer to create two minutes of your file as a test transcript, free of charge and without obligation, so that you can check the result of this new type of transcription. You can then decide for the specific case whether the quality meets your requirements or whether a manual transcription would be more appropriate. To do so, please place an order and note in the comment field that you would like the free trial transcription.

Order your artificial intelligence transcription from abtipper now!

The history of automatic speech recognition - a review

John Pierce, pioneer of speech recognition

Research into speech recognition systems began in the early 1960s, but did not yield promising results at first. The first systems developed by IBM made it possible to recognise individual words under laboratory conditions, but due to a lack of technical knowledge in what was then a new field of research, they did not deliver any significant progress. This also emerged from a report presented in 1969 by the US engineer John Pierce, who, as head of the Bell Group, was an expert in the field of high-frequency technology, telecommunications and acoustics.


IBM Shoebox for speech recognition
The IBM Shoebox from the 1960s could recognise 16 words. (Source: IBM)

It was not until the mid-1980s that research gained new momentum with the discovery of the differentiability of homophones by means of contextual tests. By compiling statistics on the frequency of certain word combinations and systematically evaluating them, it was possible to automatically deduce which one was meant in the case of similar-sounding words.
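The idea of telling similar-sounding words apart by the frequency of word combinations can be illustrated with a few lines of Python. This is only a minimal sketch under simplifying assumptions: it counts pairs of neighbouring words rather than longer combinations, the five-sentence corpus is invented, and the names `count_word_pairs` and `pick_homophone` are hypothetical. A real system would gather its statistics from a very large body of text.

```python
from collections import Counter

# Tiny invented corpus; a real system would count over millions of sentences.
CORPUS = [
    "i want to go there",
    "they left their keys at home",
    "their house is over there",
    "put it over there",
    "their car is new",
]


def count_word_pairs(sentences):
    """Count how often each pair of neighbouring words occurs."""
    pairs = Counter()
    for sentence in sentences:
        words = sentence.split()
        pairs.update(zip(words, words[1:]))
    return pairs


def pick_homophone(previous_word, next_word, candidates, pairs):
    """Choose the candidate seen most often next to its two neighbours."""
    def score(candidate):
        return pairs[(previous_word, candidate)] + pairs[(candidate, next_word)]
    return max(candidates, key=score)


PAIRS = count_word_pairs(CORPUS)

# "left ___ keys": the corpus contains "left their" and "their keys"
print(pick_homophone("left", "keys", ["there", "their"], PAIRS))
# "over ___ now": the corpus contains "over there" twice
print(pick_homophone("over", "now", ["there", "their"], PAIRS))
```

With this corpus the first call selects "their" and the second "there", because only those spellings have been observed next to the given neighbours.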

An important milestone was the presentation of a new speech recognition system by IBM in 1984, which was able to understand 5,000 individual English words and convert them into text with the help of so-called "trigram statistics". However, at the time, the recognition process required several minutes of processing time on an industrial mainframe computer and was thus practically unusable. By contrast, a system developed only a little later by Dragon Systems was much more advanced and could be used on a portable PC.


IBM as a pioneer for speech-to-text
Excerpt from an advertising film for IBM speech recognition, 1984 (Source: IBM)

In the following years, IBM worked intensively on improving its speech recognition software. Thus, in 1993, the first speech recognition system developed for the mass market and commercially available, the IBM Personal Dictation System, was introduced.

In 1997, both the successor version IBM ViaVoice and version 1.0 of the Dragon NaturallySpeaking software appeared. While further development of IBM ViaVoice was discontinued after a few years, Dragon NaturallySpeaking became the most widely used speech recognition software for Windows PCs. Since 2005, the software has been produced and distributed by Nuance Communications.

In 2008, with the acquisition of Philips Speech Recognition Systems, Nuance also obtained the rights to the SpeechMagic software development kit, whose use is particularly widespread in the healthcare sector.

In 2007, the company Siri Inc. was founded; Apple bought it in April 2010. With the introduction of the iPhone 4s in 2011, the automatic voice assistant Siri was presented to the public for the first time and has been continuously developed since then.



The functionality behind Speech-to-Text systems

Modern speech recognition systems have become an indispensable part of our everyday lives. But how do they actually work?

The basic principle of transcription is very simple: When we speak, we breathe out air through our lungs. Depending on the composition of the spoken syllables, we set the air into certain vibration patterns, which are recognised by the speech recognition software and converted into a sound file. This is then divided into small parts and specifically searched for known sounds. However, because not all sounds are recognised, an intermediate step is necessary.
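The step of dividing the recording into small parts can be sketched as follows. The example is a rough illustration, not the method of any particular product: the frame length of 25 ms, the step of 10 ms and the sample rate of 16 kHz are assumed values, a synthetic sine tone stands in for recorded speech, and all function names are hypothetical.

```python
import math


def synthesize_tone(frequency_hz, duration_s, sample_rate=16000):
    """Generate a sine tone as a stand-in for recorded speech."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * frequency_hz * t / sample_rate) for t in range(n)]


def split_into_frames(samples, sample_rate=16000, frame_ms=25, step_ms=10):
    """Cut a signal into short, overlapping frames (assumed sizes)."""
    frame_len = sample_rate * frame_ms // 1000
    step = sample_rate * step_ms // 1000
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, step)]


def frame_energy(frame):
    """Mean squared amplitude: a crude measure of how loud a frame is."""
    return sum(s * s for s in frame) / len(frame)


signal = synthesize_tone(440, 0.1)   # 0.1 s of a 440 Hz tone
frames = split_into_frames(signal)
print(len(frames), len(frames[0]))
```

Each frame would then be described by a handful of numbers (here only its energy) and compared against known sounds. Real systems use considerably richer descriptions of each frame.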

Using the so-called "Hidden Markov Model", the speech recognition software calculates which sound is likely to follow another and which in turn could come after it. In this way, a list of possible words is created. In a second run, the same thing that happened before with the sounds happens with these words: the computer analyses the probability with which a certain word follows another - after "I'm going to..." comes "home" rather than "sherbet" or "break". But the computer can only know this if it knows a great many spoken sentences and how often and in what context the words occur.

Hidden Markov model for speech recognition
Illustration of how the Hidden Markov Model works
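How a Hidden Markov Model picks the most probable sequence of sounds can be shown with a toy example. The sketch below uses the Viterbi algorithm, a standard way of decoding such models; whether a given product uses exactly this procedure is an assumption. The three sounds, the three acoustic labels and all probabilities are invented for illustration.

```python
# Toy model: hidden states are sounds, observations are coarse acoustic labels.
# All numbers are invented; a real system learns them from training data.
STATES = ["h", "o", "m"]
START = {"h": 0.6, "o": 0.2, "m": 0.2}
TRANSITION = {  # probability that one sound follows another
    "h": {"h": 0.1, "o": 0.8, "m": 0.1},
    "o": {"h": 0.1, "o": 0.2, "m": 0.7},
    "m": {"h": 0.3, "o": 0.3, "m": 0.4},
}
EMISSION = {  # probability that a sound produces a given acoustic label
    "h": {"breathy": 0.7, "vowel": 0.2, "nasal": 0.1},
    "o": {"breathy": 0.1, "vowel": 0.8, "nasal": 0.1},
    "m": {"breathy": 0.1, "vowel": 0.2, "nasal": 0.7},
}


def viterbi(observations):
    """Return the most probable sound sequence and its probability."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (START[s] * EMISSION[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor that makes arriving in s most probable
            prob, path = max(
                ((best[p][0] * TRANSITION[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMISSION[s][obs], path + [s])
        best = step
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


path, probability = viterbi(["breathy", "vowel", "nasal"])
print(path, probability)
```

For the three labels in the example, the most probable path is h, o, m. The same principle, applied to words instead of sounds, underlies the choice of "home" after "I'm going to...".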

Such a computing task exceeds the processing capabilities of a pocket-sized mobile phone many times over. It can only be solved by using cloud computing, i.e. outsourcing difficult computing operations to stationary large computers. The mobile phone itself simply records the voice command, converts it into a sound file, sends it via the Internet to the computer centre and has it analysed there. The result is then sent back to the smartphone via the internet.

The huge databases of speech and text files already spoken and correctly transcribed by humans, held via cloud computing, are the real secret behind the success of the new speech recognizers. So good speech recognition software can't just be programmed like a new computer game or printer driver. "The trick is to get hold of good data and incorporate it optimally into the learning process," says Joachim Stegmann, head of the Future Telecommunication department at Telekom Innovation Laboratories.

For really good and accurate speech recognition software, a particularly large number of recordings of everyday speech are also necessary, so that dialects, speech errors, mumbled and falsetto voices can also be recorded. The speakers should also differ demographically - there should be an equal number of children, men, women, old and young people as well as people of different regional origins among them. In practice, for example, transcripts of speeches in the Bundestag, manuscripts read aloud or recordings of radio broadcasts are used.

Opportunities and challenges in the development of automatic speech recognition

Well-functioning speech recognition systems promise to make our everyday lives much easier. In professional fields of application, they could automate the transcription of spoken language in particular in the future - for example, the recording of minutes or the often laborious manual transcription of speeches, interviews or videos. They are also becoming increasingly widespread in the private sphere, whether for voice-controlled operation of the smartphone in the car, calling up Google searches or operating smart home applications such as switching the lights on and off or turning down the heating.

The big challenge in electronic speech recognition, however, is that nobody pronounces a term in exactly the same way in every situation. Sometimes the user is tired, sometimes hectic, sometimes loud, sometimes quiet, sometimes concentrated, sometimes drunk, sometimes angry, sometimes suffering from a cold. It is therefore very difficult for software to recognise words by searching for congruent sound sequences.

Especially elderly people or people on the move are difficult for the systems to understand. Background noises make recognition even more difficult - Microsoft is therefore already working on the new software "CRIS", which should enable individual configuration of frequently occurring background noises and vocabulary and thus also permit use in noisy production areas or in retirement homes.

In the meantime, current systems achieve recognition rates of approximately 99 per cent when dictating continuous texts on personal computers and thus fulfil the requirements of practice for many areas of application, e.g. for scientific texts, business correspondence or legal briefs. However, their use is limited when the author constantly needs new words and word forms that cannot be recognised by the software at first. Although it is possible to add these words manually, it is simply not efficient if they only occur once in texts by the same speaker.

Benchmarks for speech recognition
Benchmark of speech recognition systems for English (Source: Economist)


The most important providers of automatic speech recognition systems

As with many modern technologies, new providers are mushrooming in the field of audio transcription.

The market leader in automatic speech recognition and transcription is Nuance with its Dragon NaturallySpeaking software. The use of Deep Learning technology enables the software to be used even in environments with strong background noise. Through targeted training on a specific speaker, an accuracy of up to 99% in speech-to-text conversion can be achieved with only a few minutes of invested "reading time". Nuance, meanwhile, is working on the next generation of in-car electronics that will in the future enable the accurate writing of complicated texts via voice input, the use of social networks and the querying of search engines without diverting the driver's attention from the road.

Siri, the personal voice assistant that has been available to Apple users since the release of the iPhone 4s, uses the same technology and is probably far better known than Nuance. The software can be started with the command "Hey Siri" and thus requires almost no manual operation at all. However, it is only suitable to a limited extent as speech recognition software for dictating entire letters or longer texts, since speech is not recorded continuously and digital text is not output continuously. Siri saves a few spoken sentences until they are sent to the central translation server with a "Done" command, or stops recording text for transmission when the maximum memory is reached. Until the digital text has been transmitted back, dictation must pause. This transmission poses risks for information security; furthermore, if the transmission is interrupted, e.g. in a GSM dead spot, the dictated text is lost.

Comparable to Apple's Siri, Microsoft operates the virtual assistant Cortana on its Windows Phone 8.1, which uses the Bing search as well as personal information stored on the smartphone to provide the user with personalised recommendations. An extension of the functions to the smart control of household appliances such as refrigerators, toasters or thermostats through the technology of the Internet of Things is already planned. With its speech recognition software, the so-called "Computational Network Toolkit", Microsoft was also able to set a historic milestone in October 2016: With the help of Deep Learning technology, the software achieved an error rate of only 5.9% in comparative tests between humans and machines - the same error rate as its human counterparts. The software has thus achieved parity between humans and machines for the first time.

Google also opened a programming interface for cloud services as a beta version in March 2016. The Cloud Speech API translates spoken text into written text and recognises around 80 languages and language variants. The API can already deliver the text as a stream during recognition and automatically filters out background noise. It is currently only available to developers.

Most recently, Amazon also announced the release of the new service "Amazon Lex" for the development of conversational interfaces with voice and text. It is based on the technology for automatic speech recognition and natural language understanding that Amazon Alexa also uses. Developers can use the new service to build and test intelligent voice assistants - so-called bots - in the future.

And the IBM Watson cognitive system, which marked the dawn of the era of cognitive computing in 2011, makes use of neural networks, machine learning and text analysis tools, in particular speech recognition, to learn for itself. In the meantime, even irony, metaphors and puns are no longer an obstacle for IBM Watson.


In recent years, technology has developed rapidly, supported in particular by cloud computing and the automated processing of extremely large amounts of data that this makes possible as a basis for intelligent systems. With the help of professional speech recognition software, automatic transcription of dictation is already possible today with almost no errors under favourable conditions.

Pure speech recognition systems in themselves, however, are only the beginning. True interaction between humans and machines - as prophesied in science fiction films - requires machines that not only reproduce information, but can understand contexts and make intelligent decisions.

Further questions and answers

✅ How does speech recognition work?

Automatic speech recognition systems all basically work in the same way.

Simply put, the core is always a large database in which many possible variants of the pronunciation of one or more words are stored with the matching text. When a new recording is then fed into the system, it compares the sound with the database and outputs the text most likely to match that recording.

The larger and better maintained this database is, the better the speech recognition will be. Furthermore, the recording quality plays a major role in achieving a good recognition rate.

✅ Can you transcribe with speech recognition?

Transcription with speech recognition is possible.

In the case of dictation from a person with clear pronunciation, no dialect and no background noise, a quality level of approx. 90% can be achieved with speech recognition. This is only just below the usual human transcription level of approx. 95%. If one of these prerequisites is missing, as in almost all interviews or group conversations, today's speech recognition systems are not yet capable of generating comprehensible texts.

According to current scientific studies, speech recognition in interviews currently reaches a level of only about 65%, resulting in largely incomprehensible texts.

✅ Which provider has the best speech recognition?

There are now many providers of automatic speech recognition.

The systems differ in terms of
- recognition rate (how many words are correctly recognised)
- spelling and punctuation
- format (e.g. with or without speaker assignment)
- usability (available as a programme, app or only via an API)
- price and billing model

Google Speech-to-Text and Nuance (Dragon) achieve good results for the German language. Overall, the best systems currently achieve a recognition rate of approx. 67% under good conditions, i.e. approx. 67 out of 100 words are recognised correctly. Manual transcription has a recognition rate of approx. 97%.
