An Analysis of the Use of Criterion in a Writing Classroom in Japan

Criterion is a web-based writing assessment system developed by Educational Testing Service (ETS) that automatically provides learners with feedback as well as a score on essays written using the system. This descriptive study examined 28 Japanese adult students’ TOEFL writing essays to explore what Criterion can and cannot do with regard to providing feedback on the essays. Criterion’s critique function was compared with human instructors’ error feedback, focusing on five error categories: verbs, word choice, nouns, articles, and sentence structures. The results revealed that Criterion experienced difficulties in detecting errors in all five categories. The study suggests that Criterion should be implemented in the classroom alongside a team of teachers rather than as a stand-alone evaluation tool.

the United States or Canada over the past five years and the TWE has been mandatory since 2000. For those learners, improving the score on the TWE is a major purpose for studying English writing. Currently, the TOEFL has shifted to computer-based formats in Japan, and test-takers also need to take the TWE on the computer screen. In computer-based tests, test-takers need to complete an argumentative essay on an assigned topic in 30 minutes. After 30 minutes, the screen is automatically closed and no revision is allowed. Therefore, time is one of the most significant factors for test-takers. In TWE writing, test-takers need to be concerned with both writing fluency and accuracy in the assigned essay.
As a writing evaluation tool, Criterion, a web-based writing assessment tool, has been developed by ETS. Criterion gives students a holistic score and feedback on their essays using natural language processing. Students can instantly receive their score and feedback on their essay on the screen. Criterion comprises two applications: E-rater and Critique. E-rater extracts linguistically based features from an essay and uses a statistical model of how these features are related to overall writing quality to assign a holistic score to the essay. Critique, on the other hand, comprises a suite of programs that evaluate and provide feedback on errors in grammar, usage, and mechanics, identify the essay's discourse structure, and recognize undesirable stylistic features (Burstein et al., 2004).
Furthermore, Criterion also has two applications for human instructors: Pop-ups and Instructor's Commentary. Using these two applications, human instructors can freely give the learners feedback in both a holistic and an analytical way. The Pop-ups function is a highlighted box on the essay screen where instructors usually write a brief correction concerning sentence-level errors. Students can view this correction by clicking the "I" marks. In the Commentary, instructors can write a short paragraph under each student's essay, where they will often make a general comment on the overall quality of the essay.
As can be seen from the functions of Criterion above, this writing assessment tool seems systematic and organized, and can be expected to have a positive influence in a process-writing approach by providing students with instant feedback.
However, although this tool has been utilized in writing instruction in Japan for the past three years, the actual use of Criterion has not received much attention in the literature. Despite its apparent effectiveness, it must be noted that complicated error feedback issues cannot be resolved easily with a new technology. Therefore, Criterion's assessment should be carefully examined to see where it has potential limitations in a language learning context.
This study deals with Criterion's ability to provide feedback on sentence-level errors in essays written using the system. Specifically, it seeks to determine which sentence errors Criterion can detect and which it has difficulties with, and how its error detection differs from that of human instructors. Providing specific data on Criterion through micro-level research will be valuable both for improving its functionality and for showing how the application is best used. As Burstein et al. (2003) argue, Criterion is intended to be an aid to, not a replacement for, teachers' writing assessment and error feedback. Therefore, both researchers and practitioners should take a critical attitude toward Criterion when they use it. In doing so, it will become clearer how to implement Criterion in second language writing pedagogy.

The Study
This was a descriptive study aimed at exploring how Criterion's feedback differs from the human instructors' feedback. In order to obtain specific data, the following five major error categories were examined in light of error feedback issues: verb errors, noun ending errors, article errors, wrong word, and sentence structure. These five categories have been treated as major errors in much of the second language writing literature, and the description of each category was based on Ferris' study (2001), as shown in Table 1. The sentence structure category, for example, covers errors in sentence/clause boundaries (run-ons, fragments, comma splices), word order, omitted words or phrases, unnecessary words or phrases, and other unidiomatic sentence construction.
This study was conducted using 28 Japanese adult students' TWE essays. The students were enrolled in a large private language institution in Tokyo between April 2003 and March 2004. Most of them were taking the TWE course in order to apply for graduate programs at American universities. The average age of the subjects was 27.
In the TWE course at the institution, after attending classroom sessions, the students took the TWE in Criterion and received feedback from the human instructors within a week. A native-speaker English instructor and a Japanese instructor cooperated in giving the error feedback. The feedback was mainly direct correction of sentence-level errors; however, a coding system was also used for the five major error categories above. The coding system was introduced to the students during the classroom sessions.
In terms of the error feedback procedure, the Japanese instructor gave error feedback using Pop-ups on the students' TWE essay screens, and the native-speaker instructor checked the Japanese instructor's corrections. The native-speaker instructor also wrote a general comment concerning the overall quality of each essay in the Commentary box. After receiving the feedback from the human instructors on the screen, the students took the TWE essay again with a different topic.
For the comparison of error feedback between Criterion and the human instructors, the Japanese instructor (the author) classified the detected errors into the five error categories. The errors detected by Criterion were classified from the grammar and mechanics sections of Critique. The errors detected by the human instructors were classified by examining the Pop-ups.
In this study, the error rate was measured. Error counts were normalized by dividing the number of errors by the number of words and multiplying by a standard, which was set at 200 (see Biber, Conrad, & Reppen, 1988), for both Criterion's and the human instructors' feedback.
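The normalization described above can be sketched in a few lines of Python. This is only an illustration of the stated formula (errors divided by words, multiplied by the standard of 200); the function name and the sample counts are hypothetical and are not taken from the study's data.

```python
def normalized_error_rate(error_count: int, word_count: int, standard: int = 200) -> float:
    """Return the number of errors per `standard` words of text.

    Implements the normalization stated in the text:
    (number of errors / number of words) * standard, with standard = 200.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    if error_count < 0:
        raise ValueError("error_count must not be negative")
    # Multiply before dividing so that simple cases stay exact in floating point.
    return error_count * standard / word_count


# Hypothetical essay: 5 errors marked in a 250-word text.
rate = normalized_error_rate(5, 250)
print(f"{rate:.2f} errors per 200 words")
```

Normalizing in this way lets essays of different lengths be compared on the same scale, which matters here because the same essays are scored against both Criterion's and the human instructors' feedback.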

Results and Discussion
Table 2 shows the average number of words, number of words per sentence, and score. As shown in the table, the average number of words per text was 209, the average number of words per sentence was 15.3, and the average score was 3.28 (maximum of 6).
Table 3 shows the error rate marked by Criterion and the human instructors. As explained earlier, error counts were normalized by dividing the number of errors by the number of words and multiplying by the standard of 200. As shown in Table 3, there is a large difference between Criterion and the human instructors in detecting the five major surface errors. In all five categories, the human instructors detected more errors than Criterion did. Of particular interest is that Criterion did not detect noun errors at all in this study.
In Tables 4 through 8, each category is subdivided by error type. First, Table 4 shows the verb error types detected by Criterion and the human instructors. As seen in the table, while both Criterion and the human instructors detected subject-verb agreement errors, Criterion did not detect tense errors at all. Because the tense form changes depending on the context (Celce-Murcia & Larsen-Freeman, 1999), contextual verb tense errors are likely to be difficult for Criterion to detect.
Table 5 shows the noun errors detected by Criterion and the human instructors. No noun errors were detected by Criterion. The human instructors, on the other hand, detected 24 cases, all of which were incorrect uses of singular and plural forms. Noun errors are considered surface errors; however, Criterion did not detect this type of error in this study.
Table 6 shows the article errors detected by Criterion and the human instructors. The two cases detected by Criterion were incorrect uses of "these". The human instructors, on the other hand, detected various types of article errors made by the Japanese writers. It is well known that the command of articles is very difficult for Japanese writers of English (e.g., Leki, 1991). Since the human instructors were aware of this point, it is assumed that they carefully detected the article errors made by the Japanese learners.
It should also be noted that the Japanese instructor sometimes failed to detect article errors in the error feedback process. The native-speaker instructor therefore had to correct article errors made by both the students and the Japanese instructor. Since article use is closely tied to the fluency of discourse (Celce-Murcia & Larsen-Freeman, 1999), this is one category where a native speaker's assistance is needed in error feedback.
Table 7 shows the wrong word errors detected by Criterion and the human instructors. The human instructors detected various types of errors in this category, too. Criterion detected some spelling, confused word, and word choice errors; however, it did not detect preposition and word form errors at all. Although both are considered surface errors, Criterion did not have a function to detect them. The human instructors, on the other hand, detected many word choice errors. In this study, the human instructors not only pointed out wrong words in the essays but also supplied more appropriate words in light of the discourse context. The wrong word category is one of the major categories in which human instructors can contribute to error feedback.
Table 8 shows the sentence structure errors detected by Criterion and the human instructors. The human instructors detected more errors in this category than in any other. It is notable that there were 75 cases which the human instructors considered unclear types of errors. These cases involved a mix of various sentence errors and unidiomatic English expressions. Both the Japanese and the native-speaker English instructor tended to rewrite the sentences when they could guess the meaning from the context. In other words, the human instructors were trying to understand from the context what the writers wanted to say. In these 75 cases, therefore, the human instructors rewrote the sentences as well as giving error feedback.
As shown in the tables above, the human instructors detected far more errors than Criterion did. In particular, Criterion had difficulty detecting errors with regard to nouns and articles. Although these errors are considered surface errors, they are not rule-governed, and non-native English writers need more time to gain control over them as their overall English proficiency improves (Truscott, 1999; Reid, 1997). In the same way, the other three error categories also involve the issue of rule-governed and non-rule-governed errors (Celce-Murcia & Larsen-Freeman, 1999).
Additionally, Criterion in this study did not detect errors that relate to the context. Criterion was especially unable to detect wrong word and sentence structure errors, which the human instructors detected easily. These error categories are closely related to the discourse context, and Criterion could not make the best use of its Critique function in them.
These findings make it difficult to say that Criterion can be used as a stand-alone error feedback tool, and we need to consider how Criterion should be implemented in practical settings. In this study, the subjects were advanced writers who aimed to apply for graduate programs in the U.S. or Canada. It is assumed that they need to achieve a certain score on the TWE. In such a situation, the learners might not be too concerned about Criterion's critique function; they may just check the total score. Therefore, Criterion can be used as an evaluation tool. However, this aspect must be explored further by conducting a survey of learners' attitudes toward the use of Criterion. It might be a good idea to investigate which functions of Criterion learners are concerned about. In doing so, researchers can discover how learners actually make use of Criterion in order to improve their writing scores.
Otoshi: An Analysis of the Use of Criterion in a Writing Classroom in Japan
With regard to cooperation with human instructors, human instructors are still required to check surface errors as well as global errors because of the complicated characteristics of errors mentioned above. As many writing instructors have experienced, they sometimes provide students with the right words or sentences, hoping that such feedback will eventually help the learners' writing growth in the long run. As shown in Table 8, both the Japanese and the native-speaker English instructor in this study rewrote several sentences in the Pop-ups. Even though such model sentences are not considered effective unless learners can understand them, human instructors cannot help introducing them, in the hope that showing better sentences will eventually help the learners.
As Burstein et al. (2004) suggested, Criterion can never become a replacement for human instructors' feedback; rather, it should be considered a guide that helps human instructors provide feedback based on the holistic score given by Criterion. In many English classrooms in Japan, writing instruction has attracted the least attention among the four language skills because it involves time-consuming work and yields less productive results in a short time. Criterion has the potential to alleviate some of the load on the teacher in this regard, as well as affording students opportunities for writing both inside and outside the classroom. Having said this, however, while various technical support tools for language instruction such as Criterion are expected in this century, both researchers and instructors should take a critical attitude towards using them through careful examination of the tools and their potential strengths and weaknesses.

Table 1 :
Description of Error Categories Used for Feedback and Analysis
Noun ending errors: Plural or possessive ending incorrect, omitted, or unnecessary; includes relevant subject-verb agreement errors.
Article errors: Article or other determiner incorrect, omitted, or unnecessary.
Wrong word: All specific lexical errors in word choice or word form, including preposition and pronoun errors. Spelling errors only included if the (apparent) misspelling results in an actual English word.
Sentence structure: Errors in sentence/clause boundaries (run-ons, fragments, comma splices), word order, omitted words or phrases, unnecessary words or phrases, other unidiomatic sentence construction.

Table 2 :
Average Number of Words, Number of Words per Sentence, and Score (Means/SDs)*

Table 3 :
Errors Marked by Criterion and Human Instructors (Means/SDs)*

Table 4 :
Error Types of Verbs Detected by Criterion and Human Instructors

Table 5 :
Error Types of Nouns Detected by Criterion and Human Instructors

Table 6 :
Error Types of Articles Detected by Criterion and Human Instructors

Table 7 :
Error Types of Wrong Word Detected by Criterion and Human Instructors

Table 8 :
Error Types of Sentence Structure Detected by Criterion and Human Instructors
This study was designed to explore what Criterion can and cannot do when compared with human instructors' error feedback, focusing on five major error categories.