The suitability of cloud-based speech recognition engines for language learning

Koji iwago

As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with computer-assisted language learning (CALL) software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines, Apple's Siri and Google Speech Recognition (GSR), to determine which engine is more accurate at transcribing L2 learners' speech. The average recognition accuracy of Siri and GSR is reported using language samples of Japanese learners speaking English. The study also presents a series of computerized speech assessment tasks that were developed by the researchers using a cloud-based speech recognition engine in conjunction with Moodle, a widely used course management system.



Background of speech recognition
Computerized speech recognition systems were being designed as far back as the early 1930s, when Bell Labs began conducting research on computerized transcription of human speech. As personal computers became more widespread, speech recognition software, such as Dragon NaturallySpeaking, shifted to the desktop market. While speech recognition was initially lauded as an effective text input method, users unsurprisingly preferred keyboards to microphones for text input. Speech recognition technology has seen wider use in assisting people who are not able to use traditional text input devices such as keyboards. As the accuracy and the efficiency of speech recognition software improve, a wider range of users may embrace it.
It was not long before language educators and CALL developers became interested in integrating speech recognition technology with CALL activities, particularly with language production practice. Speech recognition software was utilized early on in DynEd's language learning software, in Subarashii, an interactive dialog system for learning Japanese, and in ECHOS, a voice-interactive French language training system. Voice recognition was also adopted by companies debuting automated speech assessment technology. PhonePass, now Pearson Versant, offered one of the first fully automated tests of spoken language.
With the growing popularity of mobile devices, speech recognition is now becoming a useful tool for mobile users as it enhances multitasking. It was initially used to assist users with hands-free searching for contacts and with dialing numbers, which is useful when driving. With early mobile devices, speech recognition was rudimentary since the recognition engine was installed on the devices themselves. Speech recognition changed significantly with the introduction of smartphones. These new 'smart' devices are typically bundled with data services, allowing users to be connected to the Internet anytime, anywhere. With today's robust mobile networks, speech recognition engines are able to process speech on powerful cloud servers. The mobile device simply acts as a microphone that sends the audio over the Internet to a server, which performs the CPU-intensive processing of the speech and sends the transcribed text back to the mobile device. This kind of software system, called a client-server model, has several advantages. One advantage is that applications that use speech recognition can be easily deployed on a mobile device without additional strain on the device's CPU or memory. Another advantage is that the speech recognition software is easy to update and maintain because it is installed on the server side. Today, cloud-based speech recognition is embedded in almost all mobile operating systems.

Cloud-based speech recognition
Apple's Siri and Google Speech Recognition (GSR) have evolved as two of the most promising cloud-based speech recognition technologies. When designing language learning tasks, it is possible to use either of these cloud-based recognition engines to analyze L2 pronunciation. While it is no surprise that past research (Ploger, 2015) concludes that human beings can understand accents and mispronunciation better than speech recognition software, Siri and GSR can handle accented and mispronounced speech to some extent. L2 speakers are often unaware of their pronunciation problems. However, by using a recognition engine such as Siri or GSR, pronunciation problems can be instantly identified by the learner because the actual utterance is transcribed to text in real time. When mistakes are identified, L2 speakers can become more aware of their pronunciation problems. With online speaking tasks, learners can practice and easily check their pronunciation again and again. However, how to improve pronunciation is not always obvious to a learner. The learner must first identify which syllables are mispronounced. Once problematic areas are identified, a specific remedy can be suggested. The accuracy of L2 speech transcription therefore becomes an important element when employing speech recognition tools for language learning (Neri, et al., 2003). The purpose of this study is to determine whether Siri or GSR is more accurate at transcribing L2 speech.

Background research
Previous research on ASR and language learning has focused predominantly on pronunciation training. Studies conducted by Neri, et al. (2002), Ploger (2015), Hincks (2002), and Elimat & AbuSeileek (2014) suggest that ASR holds potential benefits for language learners, particularly when coupled with self-study CALL activities that incorporate practical learner feedback. Neri, et al. (2002) observed that pronunciation training using ASR offered a valuable, stress-free learner experience, particularly when learners were provided verification of correct responses as well as effective remedies for their learning errors. Ploger (2015), reporting on a single learner in a case study, found that dramatic pronunciation improvements occurred when using dialogue practice along with ASR. Ploger (2015) also suggested that feedback was more helpful to the learner when a score or accuracy percentage was provided by the ASR application rather than a simple positive or negative response, although the researcher also pointed out that the ASR's false negatives posed a problem for the learner. The importance of immediate and useful feedback is a recurrent theme and, therefore, a feature which needs to be given careful consideration when designing ASR activities for language learning purposes.
The majority of the research on ASR was conducted before cloud-based speech recognition tools were readily available to the public sector. Older ASR systems often provided pronunciation feedback using speech waveforms that illustrate air movement of fricative consonants or aspiration of stops. Hincks (2002) reported that learners found these waveforms to be ineffective. This may explain why the results of that study suggest that the ASR software and activities employed did not discernibly improve pronunciation. Newer cloud-based ASR engines, such as Siri and GSR, which convert L2 utterances into text, can help improve learner feedback by returning a transcription of a spoken utterance to the user. Locating errors in pronunciation using a real-time transcription may therefore be easier for the learner than interpreting waveforms and spectrograms, which tend to be difficult for L2 learners to utilize effectively.
Feedback on pronunciation needs to be accurate to ensure that correct pronunciation is not mistakenly modified and that poor pronunciation is not reinforced. Although the quality of pronunciation cannot be accurately analyzed, cloud-based speech recognition engines are far less complicated and less expensive to deploy compared to traditional ASR engines, which typically need to be installed and maintained on a local server. The ease of use makes cloud-based ASR suitable for quick self-pronunciation practice. Additionally, since instructors often do not have enough time to constantly monitor and provide feedback to individual learners in a large class, cloud-based ASR language tasks can be both effective and motivating. L2 learners who are afraid of making mistakes in public can comfortably practice speaking in a private setting.
In addition to feedback, a wider range of ASR tasks needs to be employed when designing ASR systems for language learning. While most ASR research focuses on pronunciation training, a few studies suggest other innovative uses of ASR for language learning. Cai, et al. (2013) report on a study of how ASR could be used to apply gamification theory to a word/picture matching task. The researchers claimed that by relaxing the constraints of ASR, or making it more lenient, users became more engaged in the activity. Because false negatives are common with non-native speakers using cloud-based ASR systems designed for native speakers, learning engagement can be negatively affected.
ASR has been shown to be effective as a language learning tool in language games and pronunciation practice. It is able to provide learners with greater opportunities to practice language. ASR appears to offer numerous advantages for oral practice, and further research needs to be conducted on its effectiveness not only in improving the accuracy of the language through pronunciation training but also in improving oral fluency. ASR can easily be implemented in tasks that encourage extensive speaking. Using ASR, speech reports can be compiled for evaluative purposes which summarize, for example, word counts of spoken utterances, length of utterance, and lexical density of the language produced. As speech recognition applications become more popular in CALL, educators often question the effectiveness of the speech technology, particularly as CALL developers continue to add features to their applications. To date, few studies have been conducted on the effectiveness of language learning activities that incorporate speech recognition technology. One can also look at the motivational aspects of using speech recognition technology in EFL settings where limited speaking opportunities exist.
Since popular speech recognition engines such as Siri and GSR are developed specifically for L1 speakers, it is important to verify whether these tools can adequately transcribe L2 speech in order for the output to be meaningfully applied to language learning activities. Both Apple's and Google's speech engines rely on the context of the utterance in order to 'guess' the meaning of the phrase when transcribing speech. Siri and GSR more accurately transcribe strings of speech that occur more frequently, such as "have you ever" or "went to the." Therefore, we can assume that if an L2 speaker leaves out an article or uses a preposition incorrectly, the software may run into difficulty with the transcription. This grouping of language may play an important role in both authentic listening activities and speech recognition accuracy. With this in mind, it seems appropriate that before speech recognition activities can be adequately assessed as to how well they aid language instruction, the performance of the speech recognition engines needs to be assessed to determine how well they deal with L2 speech. Since Siri and GSR are the most pervasive engines available on mobile devices, with iOS devices using Siri and Android devices using GSR, the researchers set out to determine how accurate these two tools are at transcribing L2 speech.

Research questions
1. Which online speech recognition tool is more accurate at transcribing L2 spoken utterances?
2. Which tool could be used more effectively to design online speaking activities for L2 learners studying English?

Procedure
The participants consisted of 41 undergraduate students at two separate Japanese universities who were enrolled in general English language courses. The students' majors ranged from science to humanities; however, none of the majors were related to English or language studies. Each participant was instructed to speak a total of 8 sentences into a microphone, one sentence at a time. The transcription of each student recording was then entered into a spreadsheet and compared to the target sentence to determine the accuracy of the transcription. The vocabulary and grammar of the target sentences used in the task were at a similar level to the language being introduced in the English course in which the participants were enrolled. Each of the 8 sentences was spoken by the participants and transcribed a total of four times: twice using Siri and twice using GSR. To ensure a more objective evaluation of the two transcription engines, half of the students started the speaking task using GSR while the other half started with Siri. This counterbalancing was intended to equalize the number of speaking attempts and the order in which the engines were used, so that neither Siri nor GSR had the advantage of transcribing speech that the participant had practiced more.
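The split of participants by starting engine can be illustrated with a short sketch. This is only an assumption about how such an assignment might be produced: the study does not state how the two halves were chosen, and the function name, the fixed seed, and the use of random shuffling are all hypothetical.

```python
import random


def assign_starting_engine(participant_ids, engines=("GSR", "Siri"), seed=2016):
    """Map each participant to the engine used first (hypothetical helper)."""
    ids = list(participant_ids)
    # Shuffle with a fixed seed so the assignment is reproducible. Random
    # assignment is an assumption; the study only says the groups were halves.
    random.Random(seed).shuffle(ids)
    # Alternate engines down the shuffled list so that the two group sizes
    # differ by at most one participant.
    return {pid: engines[i % len(engines)] for i, pid in enumerate(ids)}


if __name__ == "__main__":
    assignment = assign_starting_engine(range(1, 42))  # 41 participants
    starts = list(assignment.values())
    print(starts.count("GSR"), starts.count("Siri"))
```

With an odd number of participants, such as the 41 in this study, one group is necessarily one person larger than the other.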

Data analysis
Table 1 provides a summary of the accuracy of the data transcribed by both the GSR and Siri transcription engines. The columns correspond to the target sentences, and the accuracy of the transcription of each speech recognition engine is listed in the corresponding column. The accuracy of the transcriptions was determined using a string comparison tool that calculates a similarity coefficient between two texts (Oliver, 1993). For example, if all of the words in the transcribed text matched the target sentence and were in the same order, a score of 100% was assigned. As seen in Table 1, the data reveal that the average accuracy score of GSR is considerably higher than that of Siri for seven sentences out of eight. The overall averages were 82.0% for GSR and 66.9% for Siri.
When each transcription is analyzed, it becomes apparent that Siri sometimes missed words as if they were not pronounced at all. For instance, the first sentence, "Where are you from," was transcribed as "Where are you." Siri may not recognize some sounds, such as a weakly pronounced 'R'. For instance, GSR transcribed 'earth' correctly, whereas Siri transcribed it as 'us.' GSR appears to make use of contextual clues to make corrections as a sentence is being transcribed. Siri, on the other hand, did not appear to make context-based corrections as intelligently as GSR while transcribing. Furthermore, after one word was transcribed incorrectly, the remaining words in the sentence were sometimes transcribed incorrectly as well. Siri appeared to be confused by a single word, at which point it was not able to process the rest of the sentence.
The average accuracy of the first sentence is very high because it is relatively short and easy to pronounce. The second sentence is longer and more difficult to pronounce. The Japanese language does not have the R sound, which may be the reason many participants have a problem pronouncing it correctly. That said, not all words with the R have the same degree of difficulty. When the R is at the beginning of a word, it is typically easier for a Japanese speaker to pronounce. However, if it is in the middle or at the end of a word, it may be dropped or mispronounced. For example, 'born' is sometimes wrongly transcribed as 'bone.' It may also have been useful to look at phoneme matches, since quite often Siri and GSR would transcribe the student's speech with part of a word matching. For example, if the target word is 'there' and the student's speech is transcribed as 'they', partial credit should be given for the correctly matched 'th', or /ð/, phoneme.
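The similarity coefficient used above can be sketched in a few lines. The code below is a minimal Python approximation of the longest-common-substring approach that PHP's `similar_text` function is generally described as using; it is not the tool used in the study. The `normalize` step (lowercasing and removing punctuation) and all function names are assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation (an assumed preprocessing step)."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(text.split())


def _longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the first longest shared substring."""
    best_i = best_j = best_len = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_i, best_j, best_len = i, j, k
    return best_i, best_j, best_len


def similar_chars(a, b):
    """Count matching characters: longest shared run, then recurse on both sides."""
    if not a or not b:
        return 0
    i, j, n = _longest_common_substring(a, b)
    if n == 0:
        return 0
    left = similar_chars(a[:i], b[:j])
    right = similar_chars(a[i + n:], b[j + n:])
    return n + left + right


def similarity_percent(target, transcript):
    """Similarity between target sentence and transcript as a percentage."""
    a, b = normalize(target), normalize(transcript)
    if not a and not b:
        return 0.0  # avoid dividing by zero when both strings are empty
    return 2 * similar_chars(a, b) * 100 / (len(a) + len(b))


if __name__ == "__main__":
    print(round(similarity_percent("Where are you from?", "Where are you."), 1))  # 83.9
    print(round(similarity_percent("there", "they"), 1))  # 66.7
```

Because the comparison in this sketch works on characters rather than whole words, a transcription such as 'they' for 'there' already earns partial credit for the shared letters. This is a cruder measure than the phoneme-level matching suggested above.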

Implications and ASR activities
Not only does GSR appear to be more accurate at recognizing L2 speech than Siri, it is also relatively easy to integrate into web-based language-learning apps. Apple only allows developers to make use of Siri via a native app. GSR, on the other hand, offers a web-based API for voice technology, allowing developers to add voice recognition capabilities to ordinary HTML web pages as well as web-based apps. Because of the numerous advantages of GSR technology, the researchers decided to employ the GSR API with an automated speech assessment plugin for Moodle to allow teachers to administer a number of online speaking tasks which incorporate automated scoring and feedback. The following section provides a description of the types of speaking tasks that can be administered online.
Using the speech assessment activity, tasks can be administered online to capture audio, transcribe the captured audio, and perform basic text analysis of the transcription. Depending on the assessment algorithm, a speaking score can be automatically generated by comparing the transcribed text to the model answers. This automated assessment is typically beneficial with closed-ended questions that have a limited or restricted number of responses, for example, when a learner is asked to respond to a question while looking at an illustration which provides a clue to the correct answer. An example of a closed-ended question might be "What is the circumference of the circle?" Possible correct answers may include "The circumference of the circle is 10 centimeters" or "The circle has a circumference of 10 centimeters." Dictation tasks can also be set up to be closed-ended.
As seen in Figure 1, the learner is able to listen to and participate in conversational dialogues, which learners typically encounter in language textbooks. In this example, each active line of the dialogue is highlighted. The user selects the play icon to listen to that particular line. After listening to one line of the dialogue, the user can then select the record button and repeat that line of the dialogue. The learner is then presented with a score as well as the transcribed text, which appears to the right of the target phrase. The score is generated by comparing the target text to the transcribed text using the 'similar_text' PHP function [3]. The score, the transcript, and the captured audio are saved to the Moodle course for both the learner and the instructor to access.
Open-ended speaking tasks can also be administered online and, to some extent, automatically scored. One such example is a task where the learner listens to a short story and then attempts to retell the story. With this task, an automated text comparison can be performed to match words or phrases from the target story with the student's transcribed text of the story. The transcription can also be automatically analyzed for word count, number of sentences, average words per sentence, and lexical density. In addition, the student audio is captured at the same time for self, peer, or instructor assessment.
From the analysis of this study, the researchers determined that GSR was both more accurate at transcribing L2 speech and easier to deploy than Siri. Therefore, GSR was chosen as the recognition engine for a series of speech assessment plugins that are currently being developed by the researchers for Moodle. Several types of online speaking tasks were illustrated which can be used to automatically score L2 speech. These speaking tasks will be employed in the next stage of this research project, where the reliability of the scoring algorithm and student responses to the use of online speaking tasks will be evaluated. It will be important to determine the motivational aspects of online speaking activities as well as the importance of reaching out to different learner types. As students learn in different ways, online speaking activities should be administered as supplemental practice activities. Ideally, learners should be able to make choices as to how they practice speaking, with automated online speaking tasks as one of the options.
In addition, GSR should not be used as an assessment tool, as its assessment algorithm cannot be verified. Both Siri and GSR are closed source, and educators do not have access to the recognition algorithms that are employed. GSR, for example, has an option for different types of native English input, such as American, British, or Australian, but no option exists to tell GSR that the language input is from an L2 speaker, which may give non-native speakers inaccurate speech recognition results. Finally, educators and learners need to be aware of the privacy concerns of these cloud-based services. The audio as well as the transcription are captured on Google's servers with little knowledge of how this user data will be used.
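The text measures mentioned for open-ended tasks (word count, number of sentences, average words per sentence, and lexical density) can be sketched as follows. This is an illustrative approximation, not the plugin's implementation. Lexical density is taken here as the share of content words among all words, and the `FUNCTION_WORDS` set is a deliberately small, hypothetical list. Sentence counting also assumes the transcript contains end punctuation, which speech recognition output often lacks.

```python
import re

# Hypothetical, deliberately small list of English function words; a real
# analysis would need a fuller list or a part-of-speech tagger.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "if", "of", "in", "on", "at", "to",
    "from", "by", "for", "with", "about", "i", "you", "he", "she", "it", "we",
    "they", "my", "your", "our", "is", "am", "are", "was", "were", "be",
    "been", "do", "does", "did", "have", "has", "had", "will", "would",
    "can", "could", "this", "that", "there", "not",
}


def analyze_transcript(text):
    """Return basic text measures for a transcribed utterance."""
    # Split on end punctuation; transcripts without punctuation count as one sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    content_words = [w for w in words if w not in FUNCTION_WORDS]
    word_count = len(words)
    sentence_count = len(sentences)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "avg_words_per_sentence": word_count / sentence_count if sentence_count else 0.0,
        "lexical_density": len(content_words) / word_count if word_count else 0.0,
    }


if __name__ == "__main__":
    print(analyze_transcript("I was born in Japan. I like music."))
```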

Appendix 1
Materials provided to participants for the study

At https://www.google.com/intl/en/chrome/demos/speech.html, please speak the text below:
"Hello. Today I will practice speaking English using a computer. I am speaking into the microphone now. The words that I speak appear on the screen as text. It is difficult but the computer understands some of my words."
III. Please speak each sentence a second time, and take a photo or screenshot of the results after each time.

Using Apple Siri on an iPad
I. Please speak the 8 sentences below clearly, one at a time.
II. After you speak each sentence, please take a photo or screenshot of the results that appear on your screen after each time.
III. Please speak each sentence a second time, and take a photo or screenshot of the results after each time.

1. Where are you from?
2. I was born and grew up in a small town in western Japan.
3. How long does it take to go from your home to school?
4. It takes about thirty minutes to walk from my home to school.
5. How many people are living on our earth?
6. There are over seven billion people living on our earth.
7. What is the diameter of the earth?
8. Earth has a diameter of about twelve thousand seven hundred kilometers.

Appendix 2
Notes for educators and developers interested in using speech recognition & audio capture.

Transcribe audio:
Using the HTML5 Speech Recognition API, JavaScript has access to a browser's audio stream, which is converted to text using Google's speech recognition engine and returned to the browser as raw text.
Tools: webkitSpeechRecognition API

Capture audio:
The Recorder.js JavaScript library can be used to capture audio from any input device. The audio stream is saved as a .wav file using getUserMedia. The .wav file can then be converted to an .mp3 file in real time within the browser using libmp3lame.js.
Tools: getUserMedia API, record_wav.js and libmp3lame.js JavaScript libraries

Capture & transcribe audio:
Audio capture is performed using Recorder.js as outlined in the previous example. The audio is then transcribed using Google's webkitSpeechRecognition API. The trick is that a Python proxy is required to convert the captured WAV audio file to FLAC (mono, 22Hz), which is the format that Google's speech recognition engine requires. The transcribed text reply from Google's server then needs to be parsed.
Tools: speech_recognition module written in Python.
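As a rough illustration of the server-side step, the sketch below uses the third-party SpeechRecognition package (imported as `speech_recognition`) to send a captured WAV file to Google's recognizer. It is not the researchers' proxy: the function name, the default language code, and the file name in the usage example are hypothetical. The package appears to handle the FLAC conversion and the parsing of the server reply internally, so neither step is shown.

```python
def transcribe_wav(path, language="en-US"):
    """Transcribe a WAV file with Google's recognizer (hypothetical helper)."""
    # Imported lazily so a script still loads when the third-party package
    # (pip install SpeechRecognition) is not installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""  # the engine could not make sense of the speech
    except sr.RequestError as err:
        raise RuntimeError(f"Speech service request failed: {err}") from err


if __name__ == "__main__":
    # "captured.wav" is a placeholder for audio saved by the browser recorder.
    print(transcribe_wav("captured.wav"))
```

Returning an empty string for unintelligible speech lets a scoring routine treat the attempt as a zero-similarity transcript instead of failing outright. Whether that is the right behaviour for learners is a design choice that would need checking.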

Figure 1. A conversational dialogue using speech recognition

Figure 2. A scrambled word task using speech recognition

Figure 3 illustrates an online speaking task where the learner listens to an audio prompt, for example, "How often do you study in the library?" and is then shown three possible responses. The learner should then speak the best response from the following: [everyday] ---- [for 3 hours] ---- [in between classes].

Figure 3. Speaking the best response task using speech recognition

A completely open-ended speaking task might be a simple prompt such as "Speak for 1 minute about your weekend." Like the open-ended story retelling task, the audio, transcript, and analyzed text data can be saved to the course.

Figure 4. Text analysis of transcribed speech

Figure 5. Speech transcription practice page

Table 1. Data analysis from the string comparison tool