How accurately can the Google Web Speech API recognize and transcribe Japanese L2 English learners' oral production?

Jesse R. Elam
Tokyo Denki University
elamj@mail.dendai.ac.jp

The ultimate aim of our research project was to use the Google Web Speech API to automate the scoring of elicited imitation (EI) tests. However, in order to achieve this goal, we had to take a number of preparatory steps. We needed to assess how accurate this speech recognition tool is in recognizing native speakers' production of the test items; we had to assess its accuracy with our Japanese EFL learners; and, on the basis of these trials, we needed to evaluate the potential for using the API for our purposes. By comparing our own assessments of the learners' pronunciation with the system's ability to transcribe utterances, we were able to ascertain that the learners' pronunciation of certain sounds is probably the single biggest reason for a fall in recognition accuracy compared to native speaker input. However, we argue that pronunciation may not be an insurmountable barrier to using this speech recognition system for our EFL purposes. By going through this double screening process, we feel we have arrived at a set of items which can be used to assess students' grammatical ability in an EI test using a custom Google Web Speech system.

Keywords: Automated Speech Recognition (ASR), Elicited Imitation (EI) tests, Google Web Speech API, pronunciation, Sphinx

Introduction

Automated speech recognition (ASR) is revolutionizing the way humans interact with computers; hence, second language (L2) learning researchers are understandably excited about its potential to help revolutionize how EFL students acquire a second language. Unfortunately, even with the hugely powerful systems that are in wide use on smartphones and PCs nowadays, recognition accuracy is not perfect. In this study, we were interested in exploring whether the Google Web Speech API could be used to help automatically score L2 learners' performance on elicited imitation (EI) tests designed to measure spoken grammatical ability, but in order to reach that goal a number of exploratory, preparatory steps needed to be taken to assess the viability of using the API to create a custom Google Web Speech system. Therefore, the aim of this study was to ascertain how accurately our custom Google Web Speech system could recognize and transcribe both native speakers' and Japanese L2 English learners' oral production.

A brief history of ASR
Over the past two decades, the race for a reliable ASR has attracted a number of major tech-based companies, with large players such as Apple, Amazon, Microsoft, Nuance, and Google all vying for space in the field; however, ASR has had a longer history than one might expect. "Research on speech recognition dates back to the 1930s when AT&T's Bell Labs began using [mechanical] computers to transcribe human speech" (Daniels, 2015, p. 177). This later inspired the development of a one-syllable ten-digit recognition system in the 1950s by Bell Labs, MIT, and NEC, which could identify numbers spoken by users (Juang & Rabiner, 2005). Nevertheless, the early systems faced serious limitations, as they used transistors and frequency sensors to conduct speech recognition. Hence, they were only able to identify a restricted range of phonetic sounds (e.g., the numbers 1-10), and they were typically only reliable with an acoustic model of the original speaker whose voice was recorded to establish the original waveform. As technology improved in the 1970s and 1980s, ASR systems could handle up to 1,000 different words by utilizing language models and more sophisticated algorithms for analyzing data (Juang & Rabiner, 2005). During this period, researchers and programmers began to develop pattern clustering methods for speaker-independent recognizers and introduced dynamic programming methods for improving connected word recognition (2005). This meant that ASR systems could distinguish between a wider range of speakers regardless of their inherently different pronunciations and could begin to recognize more than the single words that had been their limit until that point.
Dragon was one of the first commercial companies to develop a product that brought voice recognition to personal computers using these new methods. "In the late 1990s Dragon Naturally Speaking [was] released . . . [and] was later purchased by Nuance which offered speech recognition applications for Windows and for mobile devices" (Daniels, 2015, pp. 177-178). Eventually, Nuance and other commercially available ASR systems were developed that used Hidden Markov Models (HMMs) to analyze syntax and semantics, which in turn increased the recognition accuracy of the programs. Even today the HMM is one of the most influential algorithms for automated speech recognition and, thanks to the tremendous advancements of the last decade, systems are now able to handle an almost unlimited vocabulary set integrated with text-to-speech processing (Juang & Rabiner, 2005). Some of these systems are even open-source and/or cloud-based, making it possible for consumers to experience the latest technology first-hand.
In 1986 Sphinx was launched and became one of the most successful open-source ASR systems developed for research purposes using HMMs (Juang & Rabiner, 2005). This was a major advancement in ASR because now anyone with programming skills could implement their own customized ASR system. Since its initial release, however, Sphinx has gone through a number of modifications. In the past, "the decoding strategy of these systems tended to be deeply entangled with the rest of the system. As a result of these constraints, the systems were difficult to modify for experiments in other areas" (Walker et al., 2004, p. 1). Nonetheless, the newest version, Sphinx-4, which was released in 2010, "works with various kinds of language specifications such as grammars, statistical language models (SLMs), or blends of both" (Twiefel, Baumann, Heinrich, & Wermter, 2014, p. 1). This means that researchers are given more flexibility in the way they can incorporate acoustic models, which allows constraints to be imposed on the expected input (language models) from the user.
Thanks to advancements in mobile technology, internet speed, and cloud-based computing, voice recognition systems like Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google's Assistant are becoming ubiquitous in everyday life. Furthermore, these systems are continually improving their respective accuracy rates by constantly gathering acoustic information and utilizing machine learning. According to Twiefel et al. (2014), Google's acoustic models were originally based on data collected from a free telephone service and had over 5,000 hours of training when version 2 was released in 2010. Through this endeavour, it was believed that "distributed speech recognition systems [could] offer better recognition accuracy than local customization systems" (Twiefel et al., 2014, p. 2).
That is, systems like Google's ASR no longer needed to rely on data stored locally on the computer, as they had the ability to transcribe speech to text in real time, making the number of identifiable words seemingly limitless. "While voice activity detection (VAD) and feature extraction may be performed on the client . . . the computationally expensive decoding step of speech recognition is performed on Google's servers" (p. 1), ultimately allowing third-party developers to easily integrate the Google Speech API into their own custom ASR systems. At the time of writing this research paper, the Google Cloud Platform had just been released, including the Google Speech API alpha, which could potentially be used on virtually any platform, including iOS (Google, 2016).
One of the negative aspects of Google ASR is that the expected input cannot be controlled; nonetheless, Twiefel et al. (2014) believe that integrating Google ASR with Sphinx might alleviate this issue. This is because all of the computation happens on Google's servers, and the output is the text string that best represents the audio input statistically, using machine learning. This means Google ASR is quite accurate with the acoustic models it uses to decipher speech input, but it is a black box when it comes to analyzing the phonemes produced by the users. Hence, researchers at Brigham Young University have realized that although Sphinx is limited in its acoustic models, its strength lies in its ability to break down input at the syllable level, input which can then be transcribed into phonemes (2014). To this end, Twiefel et al. have suggested combining the benefits of each system by making a hybrid ASR which allows phonetic post-processing. This means the original input can be run through Google ASR, with the resulting text string sent to Sphinx to be deconstructed into phonetic form. As a result, investigators have access to more useful data for conducting analysis. Moreover, Sphinx allows researchers to constrain the exact expected input, making it a great candidate for identifying particular phonetic sounds in set phrases. The resulting output can then be compared to the expected input at the syllable level. Such a combination would increase the reliability and accuracy of voice recognition dramatically, creating more opportunities to employ voice recognition for educational purposes.
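The post-processing idea can be illustrated with a toy sketch. Everything below is our own illustration, not Twiefel et al.'s implementation: the lexicon is a tiny hand-made stand-in for a real pronouncing dictionary (such as the one Sphinx uses), and the function names are invented.

```javascript
// Toy sketch of phonetic post-processing: map the recognizer's text output
// to phoneme sequences and measure how much of the expected pronunciation
// it covers. LEXICON is a hypothetical stand-in for a real pronouncing
// dictionary (e.g., the CMU dictionary used by Sphinx).
const LEXICON = {
  who: ["HH", "UW"],
  is: ["IH", "Z"],
  taller: ["T", "AO", "L", "ER"],
};

// Convert a text string into a flat phoneme sequence; words not in the
// lexicon contribute nothing.
function toPhonemes(text) {
  return text.toLowerCase().split(/\s+/).flatMap(w => LEXICON[w] || []);
}

// Fraction of the expected utterance's phonemes that also appear in the
// recognized output.
function phonemeOverlap(expectedText, recognizedText) {
  const expected = toPhonemes(expectedText);
  const got = new Set(toPhonemes(recognizedText));
  const hits = expected.filter(p => got.has(p)).length;
  return hits / expected.length;
}
```

If the recognizer returns only "who is" for the expected "who is taller", half of the expected phonemes are covered, pointing the analyst at the syllables that were lost.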

Pedagogical applications of ASR
In the field of L2 learning, researchers have become interested in how ASR systems can be used to increase students' confidence, pronunciation, and motivation (see Chiu, Liou, & Yeh, 2007; Elimat & Abuseileek, 2014; Golonka, Bowles, Frank, Richardson, & Freynik, 2014; Kim, 2006; McCrocklin, 2016; Wang & Young, 2014). After analyzing 350 research articles in a historical account of the pedagogical use of technology for foreign language learning, Golonka et al. (2014) concluded that Computer-Assisted Language Learning (CALL) and ASR systems have had a reasonable influence in increasing L2 students' motivation, been a useful aid in giving feedback, and helped learners develop metalinguistic skills.
Much ASR research has focused on pronunciation in the past (Chiu et al., 2007; Elimat & Abuseileek, 2014; Golonka et al., 2014; Kim, 2006; Luo, 2016; McCrocklin, 2016). According to McCrocklin (2016), "Research in pronunciation learning strategies has struggled to provide methods for autonomous pronunciation practice in which students can also get clear feedback to help them improve" (p. 26). Consequently, many educators and researchers have been turning to ASR systems - specifically Sphinx - to provide students with pronunciation feedback (Chiu et al., 2007; Golonka et al., 2014; Kim, 2006; McCrocklin, 2016). Recent studies have shown that pronunciation and intelligibility are connected to Japanese students' self-efficacy (Lear, 2013; Toyama, 2015), and these researchers have advocated pronunciation activities utilizing ASR systems as the most beneficial for increasing pronunciation accuracy due to the immediate corrective feedback (CF) they provide (Golonka et al., 2014). Relatedly, De Vries, Cucchiarini, Bodnar, Strik, and Van Hout (2014) maintain that the current technological limitations of ASR systems make them more suitable for giving implicit forms of feedback on pronunciation, which require students to concentrate more on their errors, thus promoting teachable moments.
Although the role of CF in language acquisition has a long-debated history (Russell, 2009), it is clear from many studies that some form of feedback is better than no feedback at all (Ellis, 2012; Lyster & Ranta, 1997). In one experimental study, for example, De Vries et al. (2014) used an ASR system called GREET (utilizing Sphinx) to determine the effectiveness of CF on grammar correction. When users in the experimental group made mistakes, the system would notify them of their errors in red, while the control group did not receive any feedback. After analyzing the pretest-posttest, student logs, and surveys, De Vries et al. found that students who received CF enjoyed the system more. Furthermore, in a similar Taiwanese study, Wang and Young (2014) explored the idea of using a Sphinx-based ASR system named ICASL to measure multiple levels of CF in pronunciation practice for self-paced learning. An experimental group received a three-step error correction that included implicit and explicit error correction, while the control group only received implicit correction. After using t-tests to compare the two groups and analyzing the qualitative data, the researchers concluded that the experimental group largely improved; a combination of both implicit and explicit correction through the use of an ASR system enabled them to gain better pronunciation skills over an 8-week period (Wang & Young, 2014). These studies demonstrate how ASR systems can now be used to test CF in a more controlled environment.
ASR systems have also recently been the focus of research into motivation and learner autonomy (Chiu et al., 2007; Golonka et al., 2014; Kim, 2006; McCrocklin, 2016). For example, McCrocklin (2016) examined students' beliefs about autonomy using the feedback of different ASR systems. He found that some ESL students did not even want to experiment with Dragon Dictation, and three students did not even attempt to use the system because they could not get the ASR to recognize their voices properly (2016). Overall, the students felt that "Dragon Dictation was too [inaccurate] for the program to be useful for pronunciation practice" (p. 31). In short, Dragon may be a suitable program for native English speakers to use; however, as McCrocklin points out, even when Dragon's ASR is trained to a single speaker's voice, it lacks the acoustic models that would make it beneficial for EFL or ESL students.

Elicited imitation tests
The present investigation was aimed at assessing whether it may be possible to use an ASR system to score learner performance on an elicited imitation (EI) test. EI tests can be used to measure L2 learners' spoken grammatical ability (Purpura, 2004). The simplest format for an EI test is one in which the learner hears a sentence and then imitates it, with the response recorded onto tape or computer (Bley-Vroman & Chaudron, 1994).
Although the learner may have little difficulty imitating simple sentences perfectly, when the length of test sentences is increased, the load on working memory increases and the learner may begin to have difficulty reproducing certain parts of a sentence. The EI test is taken to be a measure of how well the grammatical structures contained in the sentence have been automatized as part of the learner's interlanguage system. It is believed that the ability to chunk information (Abney, 1991) allows a limited-capacity working memory to cope with the demands of reproducing sentences of greater length, and that chunking into larger units at the phrasal and clausal level is what enables more expert speakers to produce longer and more complex utterances even under the demands of ongoing interaction (DeKeyser, 2001). When an L2 learner has difficulty reproducing a grammatical feature contained in a stimulus sentence, this is believed to be because the feature is still not fully automatized as part of the learner's interlanguage knowledge. The EI procedure allows particular grammatical features to be elicited and tested, and also allows productive grammatical ability, as opposed to receptive grammatical ability, to be assessed. The tests are therefore an attractive way of measuring a learner's mastery of particular grammatical features.
It has been argued that by manipulating certain aspects of the test, it is possible to produce EI instruments that can measure underlying implicit as well as explicit knowledge of grammatical features (Erlam, 2006). For example, by introducing an intervening step between the stimulus sentence and its imitation, it is possible to encourage the learner to focus on meaning rather than form. If the stimulus sentence is a question, the learner can be asked to provide a simple answer before imitating the question. Or, if the stimulus sentence is a contentious statement, the learner can be asked to say "True" or "False" before imitating the sentence. Thus, the learner's attention can be diverted away from the accuracy of the form, and performance can be claimed to be more likely based on implicit knowledge than explicit knowledge. This claim is further strengthened if a time limit is set for the response so that the learner is not given a chance to reflect on explicit knowledge before responding (Ellis, 2005). It is also possible to use pictures to make the context of the stimulus sentence clear and to link test items together thematically so as to induce a focus on meaning and away from form.

Graham et al. (2008) identified several obstacles to applying ASR to scoring EI tests. Firstly, at that time, ASR was still an emerging technology and recognition accuracy, even of native speaker input, was variable. Secondly, using the Sphinx ASR system to automate scoring required integrating complex systems that were hard for non-computer specialists to manipulate. And finally, the speakers who took their EI tests were non-native speakers, sometimes with heavy accents, which an ASR system designed for use with native speakers might find difficult to recognize. However, Graham et al. (2008) also pointed out several reasons for optimism. One was that with an EI test the expected input is already known, so the ASR task is far more constrained than for systems designed to deal with unpredictable input of any kind. Over several trials, Graham et al. were finally able to score non-native EI data using the Sphinx open-source ASR tool, achieving good correlations with human scoring.
Nowadays, with the availability of tools to develop ASR applications such as the Google Web Speech API, the possibility of automatically scoring tests for EFL has come a step closer, as developers can access the open source code. Sphinx is a complex system to set up, whereas the Google Web Speech API is already in use and is growing in power all the time. By using this API and its transcription function, and by matching the transcribed output string against the original string of the EI item stimulus, we believed that we could begin to develop a system that automatically scored EI tests. As an initial step, however, we needed to investigate how good the Google Web Speech API was at recognizing both native speaker and non-native speaker input.

Research questions
1. How accurate is the Google Web Speech API in recognizing and transcribing English native speakers' oral production? Which words does the Google Web Speech API have difficulty recognizing?
2. How accurate is the Google Web Speech API in recognizing and transcribing Japanese L2 English learners' oral production? Which words does the Google Web Speech API have difficulty recognizing?
3. To what extent does the learners' pronunciation affect the system's ability to recognize and transcribe words?
The EI test

The 13 EI test items used in this study formed the first section of a 39-item EI test designed to elicit performance on 13 grammatical features (possessive -s, plural -s, 3rd person -s, articles, question tags, comparative adjectives, relative clauses, conditionals, modal verbs, relative adverbs, verb complements, since/for, and direct/indirect objects) and four tenses (simple present, simple past, present perfect, and present perfect continuous). Multiple instances of each feature appeared in the test. Items ranged in length between 4 and 16 syllables. The 13 items in each section were arranged in increasing order of length so that the first items would be relatively easy to imitate and later items would become progressively more difficult. Each item was in the form of a question and was accompanied by a slide displayed on the computer to help provide context. It is hoped that this test can eventually be used to assess grammatical ability under varying conditions (timed/untimed; requiring an intervening answer or not), but in this study we were testing to see how well our Google Web Speech API-based system could recognize native and non-native speaker input using just the items in the first section of the test.

The custom ASR system for EI
The conceptualization of the custom ASR system used in this research project originated in a prior study conducted by the content specialist (Author1), who aspired to digitize EI tests to help automate the process of assessing students' grammatical ability in future studies. Although Google has now discontinued the use of external queries, the initial program was designed by the technology specialist (Author2), who used Linux shell scripting to send .flac-formatted audio files to Google's server, which were transcribed and returned as simple text strings. After analyzing multiple audio file transcriptions of sample EI inputs (varying in syllable length), it appeared that Google's ASR system was accurate enough to warrant further development. Therefore, an initial plan for a customized Google Web Speech API program was drawn up (Figure 1) that would be capable of presenting EI test items and analyzing L2 learners' oral reproductions of EI items.
Although the Google Cloud Platform (which includes Google Speech alpha and can run on any operating system) was released on March 1st, 2016 (Google, 2016), the system shown in Figure 1 was built on the Google Web Speech API beta, which required the use of JavaScript and PHP. Nevertheless, access to the Google Web Speech API through HTML5 had been discontinued at the time of development, so the technology specialist used JavaScript, PHP, and Ajax to import and export external files instead.
Initially, each student heard a prompt (an EI test item), clicked the record button, and then imitated it. After they spoke into the microphone, they would push the record button again; the Google ASR would capture the student's input, send it to Google's server for decoding, and finally return a text string of the closest match to the student's utterance. The resulting string was analysed against the original prompt string imported from the .csv file to see if the sentences matched perfectly. If the strings did not match, the system would report which words were regarded as missing, as well as the student's accuracy in terms of the number of matching words divided by the total words in the prompt. All of the data was processed using an algorithm in results.php (Figure 1), which exported and amended the .csv file in the data folder.
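The matching step just described can be sketched as follows. This is a minimal reconstruction for illustration, not the original code (which lived in results.php); the function names are our own, and we assume a simple bag-of-words match.

```javascript
// Sketch of the scoring step: compare Google's returned transcription with
// the original EI prompt, report missing words, and compute accuracy as
// matched words / total words in the prompt.

// Lowercase and strip punctuation so "Who is taller?" matches "who is taller".
function normalize(sentence) {
  return sentence.toLowerCase().replace(/[^a-z'\s]/g, " ").trim().split(/\s+/);
}

function scoreImitation(promptText, transcribedText) {
  const promptWords = normalize(promptText);
  const recognized = new Set(normalize(transcribedText));
  const missing = promptWords.filter(word => !recognized.has(word));
  return {
    missing, // words the ASR did not return
    accuracy: (promptWords.length - missing.length) / promptWords.length,
  };
}
```

For example, if the transcription of "Who is taller?" comes back as only "who is", the sketch reports "taller" as missing and an accuracy of 2/3.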

Pilot study
A pilot research project was conducted during the spring semester of 2015 at Meiji Gakuin University in a TOEFL iBT preparation course to determine the usability, functionality, and limitations of the custom ASR system. In a controlled environment, using microphones in a CALL laboratory located on campus, seven participants were asked to take three different five-minute tests: single words, phrases, and sentences. Additionally, the overall accuracy of the system was tested to see if the students' pronunciation affected the system's ability to transcribe their input. To achieve this, each input was compared word by word against the students' actual audio recordings to determine the accuracy of the custom ASR system; initial findings suggested an accuracy rate of nearly 70%.

Figure 3. Usability results from pilot research
After using the ASR system in the pilot study, the students were asked to take two different surveys to identify any difficulties (Figure 3). Both surveys made it obvious that the system was not as easy to use as initially thought. Responses to the Usability Questionnaire (USE) showed that the students recognized inconsistencies in the design, and they also noted that the system would not be usable without directions. As can be seen from the System Usability Scale (SUS) results in Figure 3, the system used in the pilot study was in the lower 30th percentile of acceptability. Furthermore, qualitative feedback from the users revealed other negative aspects of the system that made it difficult for them to utilize it properly. All of this feedback was used to redesign the system for a better user experience, ensuring that the usability of the custom ASR system had no influence on the students' input (see Figure 4).

Procedure
Ultimately, we would like to use a custom ASR system to score an EI test, but first we needed to check how accurate the Google Web Speech API-based system designed for this study was at recognizing, transcribing, and scoring input. We reasoned that if the custom ASR system did not work well even with native speaker input, we were unlikely to have much success with non-native speaker input. However, if the system performed reasonably well with native speaker input, we could then go on to see how it performed with non-native speaker input. We assumed that having one American English (AE) and one British English (BE) speaker repeat the 13 EI items 40 times each would give us a large enough sample of native speaker input to work with; we did not feel it necessary to find 40 AE and 40 BE native speakers for this purpose. However, we did choose to see how setting the system to expect four different varieties of English might affect recognition of native speaker input. Due to time constraints, we did not feel that we could impose on the students who participated in the study more than to have them record each item once in a language lab at the end of a regular lesson. As we were not intending to make a principled, detailed comparison between accuracy rates for native and non-native speakers, we did not feel the need to have identical conditions for the two groups. The next step in the process will be to compare automated and manual scoring accuracy for non-native speaker input. For now, we wanted to see whether the system recognized native and non-native input accurately at all, and what features of the input might affect recognition.

Results pertaining to RQ1: NS input
Each of the 13 EI items in Trial 1 was recorded 40 times by a British English (BE) native speaker and 40 times by an American English (AE) native speaker. The custom ASR system was set to expect American English (US) for the first 10 recordings, Australian English (AU) for the next 10, British English (GB) for the next 10, and Canadian English (CA) for the final 10 for each speaker. The custom ASR system scanned the transcribed output and gave an accuracy score for each utterance based on the number of words the system recognized divided by the total number of words in the item. Thus, for example, with Item 1, "Who is taller", the system would give a score of 100% if all three words were fully transcribed, and a score of 66.7% if, for example, the word "taller" was not transcribed. Table 1 shows the mean accuracy scores for 20 recordings of each item (10 by the BE speaker and 10 by the AE speaker) with the system set to expect input in each of the four varieties of English, and an overall accuracy score based on all 80 recordings of each item by the two speakers. The words in two of the items (4 and 6) were recognized and transcribed with an overall accuracy score of over 99%, and five other items (1, 3, 7, 11 and 13) had mean accuracy scores of 90% or over. Item 12 was over 90% for the BE speaker but below 90% for the AE native speaker. The remaining five items (2, 5, 8, 9 and 10) were not recognized and transcribed as effectively for either speaker, with the scores for Items 5, 8 and 10 being particularly low for the BE native speaker.
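The per-item and per-variety means reported in Table 1 amount to a simple grouped average over the recordings. The sketch below illustrates this aggregation with invented scores and hypothetical field names (the real data are not reprinted here).

```javascript
// Average accuracy grouped by a chosen field (e.g., item number or the
// English variety the recognizer was set to expect).
function meanBy(recordings, key) {
  const groups = {};
  for (const rec of recordings) {
    const k = rec[key];
    if (!groups[k]) groups[k] = { total: 0, n: 0 };
    groups[k].total += rec.accuracy;
    groups[k].n += 1;
  }
  const means = {};
  for (const k of Object.keys(groups)) {
    means[k] = groups[k].total / groups[k].n;
  }
  return means;
}

// Example with invented scores: two recordings of Item 1, one of Item 2.
const sample = [
  { item: 1, variety: "US", accuracy: 1.0 },
  { item: 1, variety: "GB", accuracy: 2 / 3 },
  { item: 2, variety: "US", accuracy: 0.5 },
];
const byItem = meanBy(sample, "item");
const byVariety = meanBy(sample, "variety");
```

The same function produces both views of the data, which is convenient when comparing items against recognizer locale settings.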
Looking at the five common problematic items in more detail, it was possible to identify particular words, collocations, and word order issues that seem to have caused recognition difficulties. Item 2 contains four words and was recorded 80 times, making a total of 320 words for the system to transcribe. Of these 320 words, 43 were not transcribed and were regarded as "missing", giving the overall mean accuracy score of 86.6%. Thirty-five of the missing words were "Sue", suggesting that the system had particular difficulty recognizing this proper noun. Recognition of the name "Emma" in Item 9 was also particularly poor. It was subsequently found that by replacing these proper nouns with "Tom" or "Robert", recognition accuracy improved dramatically. For the BE native speaker, recognition of "Sue" and "Emma" was particularly poor in the US English input mode, but recognition of "Sue" was also poor in the GB English mode, which was his own variety.
For Item 5, 86 of the 162 missing words were either "do" or "they", and "do they" was overwhelmingly the most commonly missed collocation. This suggests that the system is not good at recognizing question tags. Subsequently, replacing "Mike and Sue" with "Tom and Robert" did not have any impact on recognition of "do they".
For Item 8, 128 of the 138 missing words were in the second half of the sentence. Subsequently, by replacing "coat" with "time" and "bought" with "sang", recognition accuracy improved dramatically not only for these two words but also for the intervening words "she" and "ever". This suggests that the highly frequent collocation "first time" helps the system to recognize the following words more effectively than a highly unusual collocation like "first coat".
In Item 9, "would" accounted for 46 of the 106 missing words and "Emma" for 39. Afterwards, by replacing "Emma" with "Robert", the recognition of "would" improved dramatically, suggesting that the ASR system is good at back-forming initial auxiliary verbs when it recognizes the proper noun that appears in second place in the sentence.
Finally, for Item 10, "after", "did", "they" and "both" accounted for 115 of the 136 missing words, and "did they both" was the only collocation found to be commonly missing. Subsequently, it was found that by moving "after dinner" to the end of the sentence, recognition of "after" and "did" improved noticeably.

Results pertaining to RQ2: NNS Input
Table 2 shows the mean accuracy scores for 44 university students recording the same 13 EI items as above. Two of the members of this student cohort were Chinese nationals; the other 42 were Japanese. Table 2 also shows the two words the ASR system judged to be missing most often.
Most obviously, and as might be expected, the mean accuracy scores for the NNS input are generally much lower than those for the NS input (65.7% overall accuracy). Only on Item 11 is there close parity between the NNS and NS mean scores. Naturally, one assumes that the discrepancy must be caused by pronunciation issues. It will be noted, however, that the two proper nouns, "Sue" and "Emma", which were problematic for the system with native speaker input, were also among the words that were most problematic for the system with non-native speaker input.

Results pertaining to RQ3: NNS Pronunciation
Table 3 shows which word was judged to be missing most often by the custom ASR system, the extent to which these words were judged to be mispronounced by a rater (one of the authors), and the word in each item judged to be mispronounced most often. A word was judged to be mispronounced if it was thought that someone not familiar with Japanese English would have difficulty catching the word, even in context. For example, for Item 1, "taller" was the word the system judged to be missing most often, representing 72% of all missing words. "Taller" also represented 82% of the words the rater thought were mispronounced, and overall it was the word judged to be mispronounced most often. In Item 2, "Sue" was the word the system most often failed to recognize and regarded as missing, accounting for 55% of all missing words. According to the rater, however, "Sue" accounted for only 19% of the mispronounced words; "swimming" was the word most often mispronounced, accounting for 51% of mispronounced words.
It is clear that the word most often judged to be mispronounced in each item was not always the word most often regarded as missing by the system. In six cases the word most often mispronounced coincided with the word recorded as most often missing, but in seven cases there was no correspondence.
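Tallies like those in Table 3 can be computed mechanically once each recording's missing words are known. The sketch below is our own illustration, not the project's actual scoring code, and the sample data is a made-up miniature of the Item 1 pattern reported above:

```python
from collections import Counter

def missing_word_shares(missing_lists):
    """Given one list of missing words per recording, return each word's
    share (%) of all missing tokens, most common first."""
    counts = Counter(w for lst in missing_lists for w in lst)
    total = sum(counts.values())
    return {w: round(100 * n / total) for w, n in counts.most_common()}

# Illustrative data: "taller" dominates the missing words for an item.
missing = [["taller"], ["taller"], ["taller", "than"], [], ["taller"]]
shares = missing_word_shares(missing)
print(shares)  # "taller" accounts for 4 of the 5 missing tokens, i.e. 80%
```

The word with the largest share can then be compared directly against the rater's mispronunciation tallies, item by item.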

Discussion
In answer to the first research question, it is possible from the results presented in Table 1 to conclude that, although the overall recognition accuracy of the system was 89.4%, the recognition component of the system had difficulty with particular words, collocations, and word orders even when the input came from what we take to be typical BE and AE native speakers.
We are certain the system faithfully transcribed what was recognized, and we found no malfunction in the part of the custom ASR system that matched the transcription against the original EI test item letter string, so we are led to believe that particular features caused the system difficulty. Certain proper nouns such as "Sue" and "Emma", certain collocations such as "first coat", certain structures such as question tags ("do they" at the end of the question), and certain word orders such as starting a question with an adverbial phrase before the question word ("After dinner did they…") caused problems for the system even with native speaker input. Setting the input mode to one of four different varieties of native speaker English did not appear to strongly affect recognition accuracy except in the case of certain proper nouns.
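The matching step described above can be approximated with standard word-level sequence alignment. The following sketch is not the authors' implementation; it simply shows, using Python's difflib, how target words absent from a transcription can be identified. The example sentence pair is hypothetical:

```python
import difflib

def missing_words(target, transcript):
    """Return the words of the target EI item that the recognizer's
    transcript failed to reproduce, via word-level alignment."""
    t_words = target.lower().split()
    h_words = transcript.lower().split()
    sm = difflib.SequenceMatcher(a=t_words, b=h_words)
    missing = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        # "delete" and "replace" spans cover target words with no match.
        if op in ("delete", "replace"):
            missing.extend(t_words[i1:i2])
    return missing

# Hypothetical response in which the first two words go unrecognized.
print(missing_words("Would Emma like to come with us",
                    "like to come with us"))
```

A per-item accuracy score then falls out as the proportion of target words that are not missing.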
In answer to the second research question, we know from the evidence presented in Table 2 that the Google Web Speech API incorporated into the custom ASR system judged these Japanese L2 English learners' oral production to be 65.7% accurate overall. Although direct comparison is unwise given the different conditions set for native and non-native speakers, this accuracy rate is much lower than the overall accuracy assigned to NS input.
It was noticeable, however, that some of the words, collocations, and word orders that seemed to cause recognition problems with NS input were also responsible for the low accuracy score given to NNS input. Some of the same proper nouns ("Sue" and "Emma"), the same question tag ending ("do they"), and the same unusual word order placing an adverbial phrase in front of the question word ("After dinner did they both…") were partly responsible for poor recognition of NNS input, just as they were for NS input. It was also noticeable that for items which avoid these particular features and follow a more canonical word order, recognition accuracy can climb to be on a par with that for NS input.
By investigating the third research question, we found that while the learners' pronunciation obviously does affect the system's ability to recognize and transcribe words, there are often other factors affecting recognition that outweigh non-target-like pronunciation. In Table 3, it is clear that certain pronunciations typically difficult for Japanese speakers caused recognition problems. The medial lateral in "taller" is difficult and means the word is often pronounced more like "tora". This is similar to a typical problem with the word "golfer", pronounced "gorufaa", a problem exacerbated by its existence as a loanword in Japanese. The initial "r" in "Robert" and the second vowel sound mean it is often pronounced "lobaat". The initial "th" (/ð/) in "they" and "there" is often pronounced as "z" (/z/). And the initial glide in "would" is often approximated with "oo", so that "would" becomes "ood". These undoubtedly caused many of the recognition difficulties for the system. However, it is also clear from the data in Table 3 that NNS pronunciation was not the principal problem in around half the items. Words judged to be mispronounced most often were not always the ones the system failed to recognize most often. For example, in Item 6, "Tom" was the word the system failed to recognize most often, yet it was judged to be pronounced perfectly well by all 44 NNS, while "old", the word judged to be mispronounced most often, was recognized by the system on 41 out of 44 occasions.
The conclusion we draw from this is that, because the system does not rely solely on acoustic information but also draws on predictive statistical language models (SLM) and vast amounts of stored collocational data (HMM), it tends to overlook local pronunciation problems when the input conforms to canonical order and typical sentence patterns, thanks to the machine learning algorithms in play.
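This interplay can be illustrated with a toy decoder that combines an acoustic score with a bigram language-model score. The probabilities below are invented purely for illustration and bear no relation to Google's internals; the point is only that a frequent collocation ("first time") can win even when the acoustics slightly favor another word ("first coat"), which matches the pattern we observed with Item 8:

```python
import math

# Invented probabilities: the acoustics slightly favor "coat", but a
# bigram language model strongly favors "time" after the word "first".
acoustic = {"coat": 0.55, "time": 0.45}
bigram_after_first = {"coat": 0.001, "time": 0.20}

def best_word(candidates):
    """Pick the candidate maximizing log P(acoustics) + log P(word | 'first')."""
    return max(candidates,
               key=lambda w: math.log(acoustic[w]) + math.log(bigram_after_first[w]))

print(best_word(["coat", "time"]))  # "time" wins despite weaker acoustics
```

The same mechanism explains why a mispronounced word in a canonical, high-frequency pattern can still be transcribed correctly.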

Recommendations and next steps
One recommendation we would like to make to anyone considering using the Google Web Speech API for L2 learning and testing purposes is that, even for pronunciation practice, it is advisable to check that the items learners are asked to imitate are not difficult for the system to recognize. This may sound counter-intuitive: one might expect the system to be rather good at showing learners what they need to work on, especially in terms of pronunciation. To some extent this is true, but, as we have shown through the native speaker screening process, some items are recognized poorly because they contain certain problematic words, collocations, or word orders. It therefore seems inadvisable to use these "faulty" items with L2 learners, who will either be disheartened or marked down when they get the item "wrong". What we are advocating is adapting to the system's strengths by constraining the input in some ways so that learners are judged fairly according to their true abilities.
The second recommendation is to tailor the system to the learners who will use it, keeping the ultimate purpose in mind. In our case, we hope to develop a system that will allow us to automatically score an EI test for Japanese learners of English and provide information on learners' productive grammatical ability. It therefore follows that we should screen out features in items which cause the system to fail to recognize input even when it is grammatically correct. For those who want to use the Google Web Speech API for pronunciation practice and training, this may not apply, but for our purposes (EI tests) we need items that allow learners to display their grammatical ability and allow the system to recognize what is being said. Thus, we intend to constrain the input even further by developing items that avoid the word, collocation, and word order issues and the typical pronunciation pitfalls for Japanese speakers identified above. Using these items, we intend to test whether the system can score target grammatical features as well as a human scorer. We also hope to add new features to the system to make it more flexible and adaptable for use by other teachers and researchers.

Conclusion
This investigation has alerted us to some of the problems with using the Google Web Speech API to score EI test responses. By working with the strengths of the application, we believe we will be able to develop a system that can reliably and efficiently score Japanese L2 learners' spoken production for grammatical accuracy. Because we are heavily constraining what the system needs to recognize, it does not need to work perfectly for all possible input. Through trial and error, we can find a set of items that are recognized with a high degree of accuracy for our learners, and we can then test how well the system measures performance. At the same time, we can develop the ASR system to have greater functionality, making it more useful to other researchers who may wish to employ it for other purposes.

Figure 1. Custom ASR design using Google Web Speech API

Figure 4. Custom ASR user interface

One major problem with EI tests is that scoring is extremely labor intensive. It can take nearly as long to score one test manually as it does for one individual to take the test, making it impossible […]. In a one-off project, this may not be such a problem, but when the results are to be used diagnostically in an on-going program, for example, or if tests are to be used in class as pedagogical tools in themselves, the scoring issue becomes a major hurdle to the usefulness of EI tests. If it were possible to replace manual scoring with automated scoring, EI tests could be used much more widely because immediate feedback would become a reality. Interest in automated scoring of EI tests has grown with the development of the Sphinx ASR system (e.g., Christensen, Hendrickson, & Lonsdale, 2010; De Wet, Muller, Van der Walt, & Niesler, 2011; Lonsdale & Christiansen, 2011). Graham, Lonsdale, Kennington, Johnson, and McGhee (2008), for example, developed a computer-based EI test and subsequently attempted to score test performance by L2 learners from a variety of language backgrounds using Sphinx. They describe several reasons why they thought it might be difficult to apply […]

Table 1. Mean accuracy scores (%) for NS input in four input modes and overall

Table 2. Mean accuracy scores for NNS input and most problematic words

Table 3. Words judged to be missing by the system and mispronounced by a rater