Constructing a blog corpus for Japanese learners of English

Kwansei Gakuin University
pfoss@kwansei.ac.jp

Researchers have directed increased attention to building and analysing written learner corpora (databases of written language produced by language learners) to address issues such as the words that non-native learners of English use in their writing and how their word use differs from that of native speakers. This paper offers an initial look at a new written learner corpus, currently under construction, which is composed of lower/intermediate-level learner blogs. Preliminary data from the corpus on high frequency vocabulary use are compared to frequency lists from the British National Corpus in order to illustrate basic usage differences.


Introduction
What words do non-native learners of English use in writing? How does their word use differ from that of native speakers? Increasingly, researchers have been building and analysing written learner corpora, i.e. databases of written language produced by language learners, to answer questions like these. Concrete evidence concerning issues such as learner use of vocabulary can only lead to better-targeted materials and more efficient language learning.
Probably the best known and most researched written learner corpus is the International Corpus of Learner English (ICLE), a . million word collection of essays written by university students of various language backgrounds with advanced-level English (Granger, ). Many other major corpora are also composed of essays written by advanced university learners. Examples include the Uppsala Student English Corpus (Axelsson, ; Axelsson & …

The corpus
This written learner corpus, hereafter referred to as the Japanese Learner English Blog Corpus (JLEBC), is being constructed at a private university in Japan. First and second-year students in one of the English programs at this school write regularly on individually-created blogs as part of their English coursework. This is a required component of the program. Students are graded according to production; grammatical mistakes and spelling errors are ignored. Dictionary use is allowed but not encouraged. As there are seven instructors and approximately thirty classes involved, there are a number of variables concerning how these blogs are assigned and produced. These include:
1. Number of blog entries required per semester. This typically ranges from -.
2. Number of words required per entry. This ranges from approximately -.
3. Use of class time. Some teachers use the first -minutes of class time for blog-writing. Other teachers assign blogs mostly for homework.
4. Choice of topic. Some blog topics are teacher-generated, others student-generated.
5. Availability of models. Some teachers write models on their own blogs for student reference.
6. Level of interaction. Most, but not all, class blogs are organized into 'blog circles,' and students are asked to read -other student blogs per week and contribute brief comments regarding them.
7. Type of blog. A variety of free blog providers are utilized.
Most students in the program have low/intermediate-level English skills as measured by institutional TOEFL exams and other tests.
During the -academic year, students wrote a total of , , words (tokens) in , blog entries, which is where the JLEBC stands as of this writing. The mean per student was , words spread over . blog entries, for an average of words per entry. For inclusion in the corpus, all entries have been anonymized, copied into computer text files, and organized by 1) learner, 2) semester, 3) entry number, and 4) topic choice: teacher-generated or student-generated. Of the , blog entries, ( % of the total) were written in response to different teacher-generated topics. The remaining entries ( % of the total) were written in response to student-generated topics. No spelling or grammatical errors have been corrected.
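The organization described above (learner, semester, entry number, topic choice) could be represented with a simple record type. The following is only a minimal sketch in Python, not the tooling actually used for the JLEBC; the `BlogEntry` class, its field names, the whitespace tokenizer, and the toy entries are all hypothetical and invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BlogEntry:
    """One anonymized blog entry; all field names are hypothetical."""
    learner_id: str    # anonymized learner code
    semester: int      # semester in which the entry was written
    entry_number: int  # position of the entry within the learner's blog
    topic_source: str  # "teacher" or "student"
    text: str          # uncorrected learner text


def token_count(text: str) -> int:
    """Count whitespace-separated tokens (a deliberately crude tokenizer)."""
    return len(text.split())


def summarize(entries):
    """Return total tokens plus mean tokens per entry and per learner."""
    per_learner = defaultdict(int)
    total = 0
    for entry in entries:
        n = token_count(entry.text)
        per_learner[entry.learner_id] += n
        total += n
    return {
        "tokens": total,
        "entries": len(entries),
        "mean_per_entry": total / len(entries),
        "mean_per_learner": total / len(per_learner),
    }


# Toy data, invented purely for illustration (not taken from the corpus).
toy = [
    BlogEntry("S001", 1, 1, "teacher", "I went to high school in Osaka ."),
    BlogEntry("S001", 1, 2, "student", "I enjoyed playing tennis ."),
    BlogEntry("S002", 1, 1, "teacher", "My hobby is music ."),
]
stats = summarize(toy)
```

Keeping the topic source on each record would make it straightforward to build separate subcorpora for teacher-generated and student-generated topics later.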

Rationale for using blogs
Blogs have been defined by Ward ( ) as websites which are "updated regularly and organized chronologically according to date, and in reverse order from most recent entry backwards" (p. ). Carney ( ) has noted several key characteristics of blogs: their (potentially) broad audience; their ownership by individuals; their frequent updates; and their communicative features such as commenting and hyperlinks.
As far as the use of blogs in language or writing education is concerned, researchers have particularly noted the 'real world' nature of blogs and other forms of online writing and how learners respond to this authenticity in ways they may not respond to less authentic forms of writing such as academic essays (Blanton, ; Bloch, ; Lowe & Williams, ; Pinkman, ). Although using blogs in the classroom is not inherently authentic, the potential audience and interactive features of blogs "make them more likely, or at least more simply, used in authentic communicative ways" (Carney, , p. ). In a related vein, the personal nature of blogs can make them a more authentic method of self-expression than standard academic rhetorical forms (Farmer, ; Thorne & Payne, ). Blogs have also been noted for their immediacy, making them particularly suitable for peer review and collaboration (Lowe & Williams, ). The less formal, less structured nature of blog writing or online writing in general has furthermore been shown to benefit lower-level or insecure learners without a solid background in academic writing (Blanton, ; Bloch, ; Lam, ). For all of these reasons, blogs seem to be an ideal medium for lower/intermediate level learners using general (i.e. non-academic) English. A corpus constructed from blogs produced by these learners would also seem to be an ideal starting point for research into lower/intermediate level vocabulary use.

Methodology
Learner corpora are valuable sources of information in and of themselves. They are also commonly compared with other corpora, particularly native speaker corpora, in order to ascertain differences in language usage. For this exploratory study, statistical data on the overall size of the corpus and the most frequent words were obtained for the JLEBC using WordSmith Tools . (Scott, ). Using keyword analysis, this word frequency list was then compared to a similar list generated from the World Edition of the British National Corpus (BNC) (Scott, ). Although the JLEBC technically consists of writing in only one genre (blogs), it is designed to be representative of the general English language usage of a group of learners on a wide variety of general topics. Second, the BNC is one of the largest native speaker corpora in existence; size is an important consideration where the authority of a particular corpus is concerned (Granger & Tribble, ). Finally, wordlists from the BNC are freely available (Scott, ; see also Leech, Rayson, & Wilson, ). Of course, the BNC is not a perfect model of English usage, but then no single corpus is. Even describing a native speaker corpus as a model of any sort for learners is troubling for some. Ringbom ( a) pointed out in a study concerning high frequency vocabulary in the ICLE that comparing a learner corpus with a native speaker corpus can be problematic, as frequently employed terms such as 'overuse' and 'underuse' "presuppose a norm" from which learners fall short (p. ). Cook ( ) asked the question plainly: "Why should the attested language use of a native speaker community be a model for learners of English as an international language?" (p. ). The chosen model for this study, the BNC, is composed primarily of texts written in British English by professional adult writers; is it fair to make the English found in this type of writing the norm for young Japanese university learners writing on blogs?
Though blogs are technically a written medium, their personal nature and less-structured form also give them certain qualities usually found in speech. Would it be more appropriate to compare them to a primarily spoken corpus? Or is it, to take Ringbom's (or Cook's) point to one possible conclusion, actually inappropriate to 'presuppose a norm' and measure them against a native speaker corpus at all?
Perhaps, to all of these questions. However, it is natural for learners to look for norms, for standards by which to measure their own production. It is also understandable if the language usage of any large group of native speakers is considered standard (if not the standard). Though national, regional, and cultural differences must be taken into account, if the language usage of native speakers taken in the aggregate cannot be considered standard usage, then what can be? Significant differences between learner usage and native speaker usage of vocabulary, for example, could therefore be an indication that this vocabulary has not been learned to the necessary extent (in the case of underuse) or is being used at the expense of normal linguistic variation (in the case of overuse). These conclusions can be tested, and if found valid, materials can be developed and targeted instruction employed to help the learners move forward in an efficient manner.
Particularly where high frequency vocabulary is concerned, differences in learner and native speaker usage should be closely examined. As they are largely function or basic-content words, high frequency words by their very nature are the words found in virtually all types of written and spoken discourse. They are the glue that holds the language together; it is therefore necessary for learners to know and be able to use them appropriately. As Nation ( ) has written, "high frequency words are so important that anything that teachers and learners can do to make sure they are learned is worth doing" (p. ). Caveats aside, it seems 'worth doing,' then, to make comparisons between the learner writing in the JLEBC and the native speaker language in the BNC, despite its limitations, and see what sort of differences exist, especially concerning use of high frequency vocabulary. Given the above concerns, however, for the purposes of this study the descriptive phrases 'used more' and 'used less' will be employed rather than the possibly value-laden terms 'overuse' and 'underuse.' Before making these comparisons, a second issue, topic sensitivity, must also be briefly addressed. As Ringbom ( b) has noted, topic sensitivity "will to some extent be present whenever word frequency patterns are established for texts with different content" (p. ). Even a corpus as large and varied as the BNC has a number of words at very high levels of frequency (London and British among them) which likely would not be present in a differently constructed corpus. A small corpus of texts created by Japanese learners of roughly the same age, studying in the same program and sharing many of the same experiences, is bound to have even more of these words. For this initial study, content words with obvious connections to assigned blog topics (e.g. school, English, university) have been removed from the frequency lists.
However, it is likely that other words with less obvious connections are present and influencing the results.
For this reason, and because the JLEBC is still incomplete, all results should be considered tentative. The planned inclusion of additional blog entries will affect frequency ratios, and additional statistical analysis will be necessary in the future to substantiate any claims regarding usage differences. Think of the words in the JLEBC as runners in a marathon, and of this paper as a report on the race in progress. Final standings are likely to change.
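The basic steps described in this section (building a word frequency list, removing topic-linked content words, and expressing counts as a rate per fixed number of words) can be illustrated with a short Python sketch. It is not a reproduction of what WordSmith Tools does internally: the tokenizer, the function names, and the normalization base `n` are assumptions, and the exclusion set contains only the three example words named above, since the full list is not given.

```python
import re
from collections import Counter

# Only the three example words named in the text; the full list is not given.
TOPIC_WORDS = frozenset({"school", "english", "university"})


def tokenize(text):
    """Lowercase and keep alphabetic runs (apostrophes allowed inside words)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def frequency_list(texts, exclude=frozenset()):
    """Return (Counter of word frequencies, total token count).

    Excluded words are dropped from the list but still counted in the
    corpus size, so normalized rates stay comparable across corpora.
    """
    counts = Counter()
    total = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        counts.update(t for t in tokens if t not in exclude)
    return counts, total


def per_n(count, total, n=1000):
    """Normalize a raw count to occurrences per n tokens (n is an assumption)."""
    return count * n / total


# Toy sentences, invented for illustration.
texts = ["I think English is fun.", "I think I like my school."]
counts, total = frequency_list(texts, exclude=TOPIC_WORDS)
top = counts.most_common(2)
```

Normalizing to a common base is what allows a small learner corpus to be set beside a much larger reference corpus; raw counts alone would not be comparable.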

High frequency words used more in the JLEBC than in the BNC
Of the top most frequent words in the JLEBC, which were used significantly more by learners than by native speakers in the BNC? Keyword analysis revealed several categories of words on this list that are not surprising and indeed might be expected from lower/intermediate-level writing. One of the most obvious is first-person pronouns. Table (and all subsequent tables in this section) lists these words followed by the approximate number of occurrences per , words in each corpus and the log-likelihood (LL) measure of the difference (the higher the number, the bigger the difference; see Dunning, , for more on log-likelihood). Some of these differences are quite large. Learner language is probably not the only factor here; the personal nature of blogs, noted previously, also lends itself to first-person pronoun use. Second, this list is peppered with basic adjectives, the most notable of which are shown in Table .
Why the heavy use of high by the first and second-year university learners represented in this study? Using the concordance feature in WordSmith Tools reveals that high school is still very much on their minds. There are also numerous examples of high price, as in "…it is very high price and I cannot buy it," possibly indicating difficulties with the adjective expensive.
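For readers unfamiliar with the log-likelihood measure, the sketch below shows one common way to compute it for a single word from a 2x2 contingency table (the word versus all other tokens, in each of two corpora). It is an illustration only: the exact variant and settings used by the keyword analysis software are not stated here, the function names are hypothetical, and the counts in the example are invented rather than taken from the JLEBC or the BNC.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one word in two corpora.

    Builds the 2x2 table (word / all other tokens) x (corpus A / corpus B)
    and sums observed * ln(observed / expected) over the four cells.
    """
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    grand = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, grand - freq_a - freq_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def direction(freq_a, size_a, freq_b, size_b):
    """Say whether the word is relatively more or less frequent in corpus A."""
    rate_a, rate_b = freq_a / size_a, freq_b / size_b
    if rate_a > rate_b:
        return "used more"
    if rate_a < rate_b:
        return "used less"
    return "same rate"


# Invented counts: 300 occurrences in 10,000 learner tokens versus
# 1,000 occurrences in 100,000 reference tokens.
ll = log_likelihood(300, 10_000, 1_000, 100_000)
label = direction(300, 10_000, 1_000, 100_000)
```

Because the statistic itself is always non-negative, the direction of the difference has to be read from the relative rates, which is why the sketch reports it separately using the 'used more' and 'used less' labels adopted in this study.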
Other categories of words used more include words which function as quantifiers or intensifiers (Table ), words typically labelled as 'vague' (Table ), and common ordinals and other words used for signalling purposes (Table ). The substantial presence of this last set of words is perhaps due to the influence of the learners' instructors; many of these words are explicitly taught in the writing program at this university. Overuse of the verb think is mentioned often in the literature with regard to learner writing (e.g. Aijmer, ; Ringbom, b); as Table shows, the learners in this study used this word much more than native speakers as well. A final category of words used more by the Japanese learners in this study concerns the different forms of enjoy and play (Table ), which perhaps deserve special mention. In Japanese, the concepts of 'having fun' or 'having a good time' are almost invariably expressed by the verb tanoshimu or the adjective tanoshii, which are frequently translated as enjoy or enjoyable, respectively. As for play, the Japanese equivalent asobu is used in a much wider variety of situations than play is in English; the sentence I went out with my friends, for example, is commonly expressed in Japanese as Tomodachi to asobimashita (literally, 'I played with my friends'). Given the context, then (and, to some extent, the age of the subjects involved), it is perhaps not surprising that these two words are used so often by the Japanese learners represented here.

Words used less in the JLEBC than in the BNC
Of the top most frequent words in the BNC, which were used significantly less than might be expected by the Japanese learners in this study? Words used less are perhaps of even greater interest than words used more; as Granger and Tribble ( ) have written concerning the benefits of non-native learner corpus data, "perhaps the greatest gain comes from the way in which the NNS corpus shows what is absent in learner writing" (p. ). Though the range of topics in the JLEBC is naturally more limited than in the BNC, it should be noted again that words at the highest levels of frequency are commonly found across topics. The absence or less frequent occurrence of these words in learner writing may indicate a lack of understanding or confidence where production of these words is concerned and therefore a need for further instruction.
Keyword analysis of the two corpora revealed several categories of words used significantly less by the learners in the JLEBC. Though one of these categories could be labelled 'adult' or 'professional' words (e.g., local, national, public, system) not likely to be used by university-age Japanese learners, the majority are simple function words, as should be expected given their high level of frequency. No one with teaching experience in Japan should be surprised to learn that articles and words commonly used as determiners headline this group. Table (and all subsequent tables in this section) lists these words followed by the approximate number of occurrences per , words in each corpus and the log-likelihood (LL) measure of the difference. The Japanese language, of course, does not have articles, which might partly explain why the learners in this study used them to a lesser degree than the native speakers in the BNC. Also striking is the number of prepositions used less in the JLEBC. As with articles, Japanese does not have prepositions in the English sense. Between, for example, can be translated as ~ (no) aida (ni), which is technically a noun construction (literally 'the middle position'). Further notable for their underrepresentation are common verb modals (Table ), other auxiliary verbs (Table ), and verbs for reported speech (Table ), suggesting possible difficulties with these forms for this level of Japanese learner.

Conclusion
This preliminary study has sought to point out basic usage differences concerning high frequency vocabulary between the writing on blogs of a particular group of lower/intermediate-level Japanese learners and the writing of native speakers collected in the British National Corpus by examining the top most frequent words in each corpus.
That there are large numbers of basic words used more often by these learners than by native speakers should not be surprising; learners of all types have fewer words to draw from to begin with and, in what Hasselgren ( ) famously termed 'the teddy-bear principle,' tend to use those words with which they feel most comfortable. However, discovering what these words are, specifically, can help local educators develop materials to wean learners away from these words in an efficient manner. Even more useful from a pedagogical point of view is discovering the words used less often by learners. Though production involves choice (see Corson, , regarding motivation and language use) and any conclusions as to why learners avoid individual words must be considered educated guesses at best without specific information from the learners themselves, it seems reasonable to suggest that high frequency words which can be identified as underutilized may be words not fully understood by the learners in question. They therefore represent teaching opportunities, and these opportunities often go overlooked, as it can be difficult on the spot to notice what is absent in learner speech or writing.
Hopefully one day materials based on the data in the JLEBC can be developed to help educators in Japan make vocabulary instruction better targeted and more useful for lower/intermediate-level learners. In any case, as the JLEBC continues to grow, there will be further opportunities to study vocabulary use by this group and possibly even other groups of learners. While this corpus, at . million words, is large enough as a whole for some types of frequency analysis, a still larger one would be far more reliable. Furthermore, at this point in time, the subcorpora for the JLEBC are still too small to examine longitudinal issues effectively. How, for example, does vocabulary use by first and second-year university students change as they progress through an EFL program? Questions like this await further data collection.