Corpus linguistics is a methodology in linguistics that involves computer-based empirical analyses (both quantitative and qualitative) of actual patterns of language use by employing electronically available, large collections of naturally occurring spoken and written texts, so-called corpora. Corpus-based and other types of empirical linguistic research have shown that speakers' intuitions oftentimes provide only limited access to the open-ended nature of language, which can cause problems when examining unexpected or infrequent linguistic structures, e.g. as regards lexical co-occurrence patterns, patterns of variation between grammatical constructions, word meaning, or idioms and metaphorical language.
The factors that condition the choice between competing grammatical variants are one topic that features prominently in research and students' projects at Mainz University. While grammar books make us believe that e.g. yet is a trigger of the present perfect, we can observe U.S. election campaigns featuring the sentence "Did you vote yet?". While standard reference works used by school teachers advise pupils to use the synthetic comparative -er with monosyllabic adjectives, we observe that native speakers use more apt and more proud rather than apter and prouder in the majority of cases. While the 's-genitive is described as being used with persons and the of-genitive is allegedly to be used with things, linguists who do research on actual language use find a marked discrepancy between what is taught and what is done. Thus, such variation cannot be stigmatized as an exception or even be marked as incorrect. The issue of variation poses an intriguing challenge for English teachers and researchers. While to some the task of bringing schoolbook knowledge up to scratch with actual language use seems insurmountable, English Linguistics at Mainz University tries to offer ways out of the dilemma.
Most (advanced) English linguistics classes in Mainz involve at some point students' own collection, processing and analysis of empirical data, often by making use of electronic corpora. In advanced classes in particular, students will be asked to carry out corpus-based projects, sometimes involving replications and extensions of earlier case studies. The Department of English and Linguistics hence offers its students a wide range of computerized corpora comprising British and American English. MACOCO, the Mainz Corpus Collection, is a progressively enhanced source for student research on the correctness, use, historical development, etc. of certain language structures.
Examples for research projects with electronic corpora as research tools
1. Project goal: Compilation of the Prothom-alo corpus, conversion of the corpus to Unicode format, and production of a document on the approach to compiling a Bangla corpus.

2. Corpora and their importance in language research: A corpus can most simply be defined as a collection of texts, in one language or in several. It is as important a resource as any other in linguistic research. Natural language processing has always been an interesting research area, and computational linguistics is one important part of it. The key resource for any linguistic research is a trained, annotated corpus that can enhance language processing capabilities such as automatic part-of-speech tagging, information extraction, etc. For example, many lexicographers have found that they can create dictionaries more effectively by studying word usage in very large linguistic corpora. Corpora have significantly affected research in linguistics and have succeeded in opening a new area of research. Because the corpus is such an important resource, linguistic researchers have produced corpora of their own languages. English has many corpora varying in size, genre and purpose. The first English corpus is the Brown corpus, which was created by W. Nelson Francis and Henry Kucera in the early 1960s. Many other English corpora are available, such as the British National Corpus, the London-Lund corpus, the Penn Treebank corpus, the International Corpus of English and many more. Unfortunately, no corpus is available for Bangla. The goal of the project was to take the text collected from the online version of Prothom-alo and convert it to Unicode format to make it usable for further research.

3. Compilation procedure. The corpus has been created in two phases: 1) collecting raw text from the “Prothom-alo” website; 2) converting it to Unicode.
3.1 Collection of text: The raw text for the corpus was collected from the Prothom-alo home page, WWW.Prothom-alo.net. This was done using a web crawler program that surfed through the Prothom-alo website and downloaded all the news available for the year 2005, including the magazines and periodicals published by the newspaper. The crawler ran for one night to collect all the text, which was in HTML. After that, a Linux script was used to convert all the files to text files.
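The HTML-to-text step described above can be sketched as follows. The source only says that a Linux script was used; this is a minimal Python stand-in under that assumption, using only the standard library. The names TextExtractor and html_to_text are hypothetical and do not come from the project.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

A real newspaper page would also need its navigation and advertising blocks filtered out, which depends on the site's markup and is not attempted here.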
Fig: Prothom-alo corpus (diagram labels: December; day 1, day 12, day 31; news categories 1, 2, 15, 38)

3.2 Unicode conversion: The second part of creating the corpus was to convert all the files to Unicode format. This was needed because Unicode support for Bangla is much richer than that of any other format. Prothom-alo uses two fonts, namely “Bansi Alpona” and “Prothoma”. The former was in use up to 2005, and “Prothoma” is currently used for the online version of the newspaper. A Java application was therefore written which recursively searched the folders and subfolders and converted all the text files to Unicode.
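The recursive conversion described above might look roughly like the sketch below. The project used a Java application; this Python version is only an illustration. The font-to-Unicode mapping tables for “Bansi Alpona” and “Prothoma” are not given in the source, so the mapping argument, the source_encoding default and both function names are assumptions.

```python
import os


def convert_text(text, mapping):
    """Replace legacy font codes with Unicode strings, longest codes first."""
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # unmapped character: keep it unchanged
            i += 1
    return "".join(out)


def convert_tree(root, mapping, source_encoding="latin-1"):
    """Convert every .txt file under root in place; return the file count."""
    count = 0
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.endswith(".txt"):
                path = os.path.join(folder, name)
                with open(path, encoding=source_encoding) as handle:
                    converted = convert_text(handle.read(), mapping)
                with open(path, "w", encoding="utf-8") as handle:
                    handle.write(converted)
                count += 1
    return count
```

A plain substitution table is unlikely to be sufficient for Bangla, since legacy fonts usually store some vowel signs before the consonant while Unicode stores them after it; a real converter would need a reordering pass on top of this.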
4. Processing the text: Categorizing the news: After conversion to Unicode, the corpus is ready for any further processing required. One important and useful processing step is categorizing the news. Prothom-alo presents news in 27 different categories. Each category has a category id; e.g. category 1 is for “prothom pata”, category 2 is for “sesh pata”, etc. If all the news items that belong to the same category are merged together, this enables analysis and research on tasks such as text categorization. A Java application is used which goes through the news of all the days and collects the news of each category in one file. The corpus is also available as a single text file.

5. Analysis. We now have a corpus which is:
• 318 MB in size
• 12 million words/tokens
which is a big one. Some basic statistical analysis is now on offer which.
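The category merge and the token count described above can be sketched in a few lines. This is an illustration only: the input format (pairs of category id and text) and the function names are assumptions, and whitespace splitting is a crude tokenizer that may not match how the 12 million figure was obtained.

```python
from collections import defaultdict


def merge_by_category(items):
    """Merge (category_id, text) pairs into one text per category id."""
    merged = defaultdict(list)
    for category_id, text in items:
        merged[category_id].append(text)
    return {cid: "\n".join(texts) for cid, texts in merged.items()}


def token_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())
```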
British National Corpus Table of Contents Introduction 2 Introduction to the British National Corpus 2 What is a corpus . 3 What is a corpus for? 5 Earlier interest in computerised corpora for linguistic research 6 Corpus Design and Construction 7 Written language, spoken language 7 A monolingual corpus 8 Conclusion 9 Introduction The paper aims to analyse the introduction to the British national Corpus program that is a lingual program of leaning more than 100 million expressions and words that a learner learn as a monolingual group. Moreover, it discusses the program from different angles to understand its concept. Lastly, it focuses on reasons and considerations that is taken under notice while selecting a non-UK based language course for studying. Introduction to the British National Corpus The British National Corpus is an accumulation of 100 million words of contemporary British English content held in PC intelligible structure. It is accessible as an examination device for those professionally keen on how the English language is being utilised as a part of the late twentieth century inside the United Kingdom. These incorporate especially word specialists and makers of English language reference lives up to expectations, additionally scholarly etymologists. The hunt/recovery programming supplied with the corpus will.
synonyms mentioned above. The aim of the thesis is to identify the use of the discourse maker actually in the corpus of data according to the functions of discourse markers as suggested by Vivian De Klerk (2006). Due to its many senses and functions in spoken/ written texts it becomes a problem to translate this item into Lithuanian without changing the effect of the utterance that is meant in the source language. Therefore, another research aim is to analyze the translation variants of actually in Lithuanian in the parallel corpus of English-Lithuanian fiction. In their research most Lithuanian scholars focused on the distribution of English discourse markers through registers (Buitkienė 2005, Šiniajeva 2005), and only few attempts were to study Lithuanian translation equivalents of the markers (see Masaitienė 2003). As an English philology student, I got interested in the topic which can reveal much about the actual usage of discourse markers and the corresponding Lithuanian equivalents as used in different situations. To achieve the aim, the following tasks have been set: To provide a theoretical overview of the basic characteristics of actually; To analyze the corpus of data and to group the equivalents of actually according to their functions performed in different contexts. to find out what translation equivalents of actually are found in Lithuanian corpus texts A limitation of this analysis has to be.
State Linguistic University after Valery Brusov. Paper: Corpus Linguistics, Lexicology and Translation. Subject: Lexicology. Faculty: IC. Year: II. Group: III. Lecturer: K. Soghikyan. Student: Mane Nersisyan. Yerevan 2013. Introduction This paper includes information about corpus linguistics, its connection with lexicology and translation. The latter is the most important one and I am keen on finding and introducing something which is mainly connected with my future profession. Frankly speaking that was not an easy journey but I am hopeful it is destined to be successful. A corpus is an electronically stored collection of samples of naturally occurring language. Most modern corpora are at least 1 million words in size and consist either of complete texts or of large extracts from long texts. Usually the texts are selected to represent a type of communication or a variety of language; for example, a corpus may be compiled to represent the English used in history textbooks, or Canadian French, or Internet discussions of genetic modification. Corpora are investigated through the use of dedicated software. Corpus linguistics can be regarded as a sophisticated method of finding answers to the kinds of questions linguists have always asked. A large corpus can be a test bed for hypotheses and can be used to add a quantitative dimension to many linguistic studies.
An Empirical Study on a Computer-Based Corpus Approach to English Vocabulary Teaching and Learning Song Yunxia Sun Zhuo Yang Min Foreign Languages Section of Agriculture Division of Jilin University, Changchun 130062 email@example.com Abstract This paper investigates the effectiveness of a computer-based corpus approach to English vocabulary teaching and learning. The empirical study in this paper illustrates that the use of corpus does benefit English vocabulary teaching and learning but there are some problems in its application. Pedagogical implications are then provided to enhance the effective integration of corpus and English Vocabulary Teaching and Learning: 1) teachers and learners need to negotiate frequently during the whole process of corpus -based DDL; 2) corpus use in vocabulary teaching and learning must adapt to the students’ needs and teaching environments; 3) the alternative use of corpus and dictionaries will enhance the efficiency of corpus -based vocabulary learning of the students; 4) Integrating cooperative learning and DDL will promote students’ vocabulary learning greatly. Keywords: Corpus ; English Vocabulary Teaching and Learning; information technology 1.Introduction 1.1.Corpus Linguistics With the conspicuous development of information technology and computer science.
Translation and Interpreting Studies I A Critical Assessment of Universals of Translation: In the light of corpus -based approach Introduction The 1990s witnessed the rapid development of the corpus -based approach to translation studies. According to Laviosa (2010: 80), “a corpus is a collection of authentic texts held in electronic forms assembled according to specific design criteria”. It is of great interest to apply corpora to all sorts of disciplines due to the fact that corpora can be used to access vast quantities of data and prepared for computer processing. Baker (1993: 243) first put forward the corpus -based methodology to translation studies and since then, numerous translation scholars have attempted to identify the nature of translated texts by carefully examining the product and process of translation. Thus, various features of translated texts are observed and summarized, forming the basis for universals of translation which is defined by Baker (1993: 243) as “features typically occur in translated texts rather than original utterances and which is not the result of interference from specific language systems”. Frawley (1984) regards these features as the ‘third code’ of translation, which differs from both the source language and the target language. This paper first takes a brief look at typologies of corpus frequently used in translation studies and then moves on to examine the four.
to understand this we need to start off from the beginning by learning what Habeas Corpus is, where it comes from and how America follows its traditions. The best place to start is what Habeas Corpus means: it comes from a Latin term which means “you have the body”. It means to bring a person who is under arrest to court or before a judge. The reason for Habeas Corpus is so that a prisoner who is held unlawfully can be released if there is a lack of evidence or cause. Habeas Corpus originated in the English legal system and is now used in many countries around the world; it is also known as the “Great Writ”. Habeas Corpus was first used in England not for helping detainees, but for helping government officials in the judicial process and remuneration. The origin explains why there has been much controversy, because Habeas Corpus wasn’t made to help; as Edward Jenks famously said, “Originally not intended to get people out of prison, but to put them in it” (Gregory, 2011). Habeas Corpus is talked about in the Constitution in Article I, Section 9. There it is stated that “The Privilege of the writ of habeas corpus shall not be suspended, unless when in cases of rebellion or invasion the public safety may require it”. During our United States history there have been many times that Habeas Corpus was suspended. During the beginning of the Civil War in.
RUNNING AHEAD: THE RIGHT OF HABEAS CORPUS ON THE WAR OF TERROR. POLITIC 201 Monday, April 29, 2013 RUNNING AHEAD: THE RIGHT OF HABEAS CORPUS ON WAR TERROR. Habeas Corpus . The meaning of Habeas Corpus comes from a Latin base meaning “you have the body” (National Archives). It refers to the right of a person to question his/her incarceration before a judge, intriguingly; the violation of the right of habeas corpus has not been the most severe of civil liberties granted not to Americans only, but many other countries. The right of Habeas Corpus protects a prisoner. It allows a prisoner to point that his or her integrally guaranteed rights to fair treatment in a trial have been broken upon. The most recent controversy regarding habeas corpus was during the Bush administration when hundreds of suspected Afghan and Iraqi terrorists were imprisoned. (http://www.enotes.com). While telling about the Habeas Corpus and the war on terror, my main focus in writing this essay will be on the general meaning of Habeas Corpus . its relationship with civil liberties, its American and English history, its evolution in U.S history of suspension. I will show the relevance of Habeas Corpus to the contempory U.S situation during the war on terror. I will also talk to about its interpretation by the U.S Supreme Court with the respect to “enemy.
Civil Liberties, Habeas Corpus . and the War on Terror POL201: American National Government Jamie Way September Barron May 5, 2013 The history of the Right of Habeas and the war on terror, it stated in the article The Tissue of Structure by Anthony Gregory “It has been celebrated for centuries in the Anglo-American tradition as a means of questioning government power. It is probably the most revered of all of the checks and balances in our legal history—as William Blackstone commented,” “the most celebrated writ in English law” (Gregory, A. 2011, 2nd par.). The Habeas corpus is to protect the individual from being imprisoned wrongly and due to a fair trial. Although, questions arise regarding whether proper use of habeas corpus been brought into focus over the last ten years. In this essay I will explore the history of Habeas Corpus and how it has evolved over the many years. I will try to briefly explain how the habeas corpus originated and the role the U. S. has and the current actions being taken with it. I will look into the Bush administration and the way the way they dealt with habeas corpus during his administration. Let’s look at the history of habeas corpus it stated in an article entitled Habeas Corpus The most extraordinary writ that the history of “Habeas Corpus is ancient”. “Although the precise origin of Habeas Corpus .
A corpus (plural corpora; German “das Korpus”, not “der”) is a collection of texts used for linguistic analyses, usually stored in an electronic database so that the data can be accessed easily by means of a computer. Corpus texts usually consist of thousands or millions of words and are made up not of the linguist’s or a native speaker’s invented examples but of authentic (naturally occurring) spoken and written language.
The majority of present-day corpora are “balanced” or “systematic”. This means that the texts are collected (“compiled”) according to specific principles, such as different genres, registers or styles of English (e.g. written or spoken English, newspaper editorials or technical writing); these sampling principles do not follow language-internal but language-external criteria. For example, the texts for a corpus are not selected because of their high number of relative clauses but because they are instances of a predefined text type, say broadcast English in a hypothetical corpus of Australian British English. Examples of balanced corpora are the International Corpus of English (ICE), the British National Corpus (BNC), or the Brown and Lancaster-Oslo/Bergen (LOB) corpora and their Freiburg updates (Frown and F-LOB).
What is corpus linguistics and why is it useful?
Based on the above definition of a corpus, corpus linguistics is the study of language by means of naturally occurring language samples; analyses are usually carried out with specialised software programmes on a computer. Corpus linguistics is thus a method to obtain and analyse data quantitatively and qualitatively rather than a theory of language or even a separate branch of linguistics on a par with e.g. sociolinguistics or applied linguistics. The corpus-linguistic approach can be used to describe language features and to test hypotheses formulated in various linguistic frameworks. To name but a few examples, corpora recording different stages of learner language (beginners, intermediate, and advanced learners) can provide information for foreign language acquisition research; by means of historical corpora it is possible to track the development of specific features in the history of English like the emergence of the modal verbs gonna and wanna ; or sociolinguistic markers of specific age groups such as the use of like as a discourse marker can be investigated for purposes of sociolinguistic or discourse-analytical research.
The great advantage of the corpus-linguistic method is that language researchers do not have to rely on their own or other native speakers’ intuition or even on made-up examples. Rather, they can draw on a large amount of authentic, naturally occurring language data produced by a variety of speakers or writers in order to confirm or refute their own hypotheses about specific language features on the basis of an empirical foundation.
What types of corpora are there?
In the following, a list of some of the most common types of corpora is provided.
There is also a large variety of specialized corpora, e.g. Michigan Corpus of Academic Spoken English (MICASE ), useful for various types of research (cf. e.g. http://www.helsinki.fi/varieng/CoRD/corpora/index.html ).
It should be pointed out that the above listed types of corpora are not necessarily mutually exclusive: F-LOB and Frown, for example, are both synchronic and regional corpora, and even “become” historical when paired with their 1960s counterparts LOB and Brown.
2. List of corpora available in Freiburg
You can download a list of corpora here. Please note that with new corpora constantly being compiled this list is not exhaustive but constitutes a selection of well known and widely used corpora of the English language.
In order to analyse a corpus and search for certain words or phrases (strings), you need special software. Some software packages are designed for a specific corpus, for example Sara for the BNC or ICECUP for the ICE Great Britain. ‘Concordancers’, on the other hand, can be used for the analysis of almost any corpus.
One of the most frequently used concordancers is Wordsmith Tools. Its two most important tools, Concord and WordList, will be explained in more detail below.
As an alternative to Wordsmith, you can also use a concordancer called AntConc which can be downloaded for free. At www.antlab.sci.waseda.ac.jp/antconc_index.html you will find both links for the download and an online help system explaining its basic functions. The most useful functions of AntConc are explained below.
3.1 WordSmith Concord
Click on the Wordsmith icon on the desktop to open the program. Select concord in order to search a corpus for a certain word or phrase. You can now choose a corpus and select those files of the corpus you want to analyse.
As a case study, let us analyse the use of English prepositional phrases by German and Italian learners of English in the International Corpus of Learner English (ICLE). The underlying assumption is that German learners frequently use ‘possibility to do something’ (native language interference from German ‘Möglichkeit, etwas zu tun’) while Italian learners prefer an of-construction as the direct translation of possibilità di fare qc (possibility of doing sth.).
In the ‘choose text ’ option, mark all texts written by Italian learners (those beginning with ‘it’) and put them into the directory by drag and drop. Click ok.
Then go to concord – settings – search word, type in e.g. ‘possibility’, and click go now (see below for further options to type in a search word or phrase). This will get all occurrences of ‘possibility’ for you to analyse. For a better overview, you can sort them e.g. according to the first word to the right.
The number on the left indicates the number of occurrences; on the right, further information, such as the source files, is provided. In the toolbar, you find a number of functions which are useful to work with the data.
In order to view the samples with larger or smaller context, click on view - grow or view - shrink .
You can also re-sort the data (edit - resort) according to the first to fifth word to the left or the right of ‘possibility’. To compare the use of prepositions occurring with ‘possibility’, for example, sort according to first word right (R1).
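What a concordancer does when it lists hits and sorts them on the first word to the right can be sketched as follows. This is not how WordSmith is implemented; the tokenizer, the span of five words and the function names are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split text into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+(?:'\w+)?", text)


def concordance(tokens, node, span=5):
    """Return (left context, node, right context) for every hit of node."""
    node = node.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, token, right))
    return lines


def sort_by_r1(lines):
    """Sort concordance lines by the first word to the right of the node."""
    return sorted(lines, key=lambda line: line[2][0].lower() if line[2] else "")
```

Sorting on R1 groups lines such as ‘possibility for’, ‘possibility of’ and ‘possibility to’ together, which makes the competing prepositions easy to scan.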
Additionally, you can delete irrelevant examples by pressing delete on your keyboard and then choosing the zap -option ( edit - zap ) from the toolbar. In our example, all hits where ‘possibility’ is not used with a verb phrase (e.g. ‘the possibility for women’, ‘possibility of an adoption’) are deleted, leaving only the occurrences of ‘possibility’ + prepositional verb phrase.
Furthermore, you can view the most frequent collocations or the most frequent clusters by clicking on the respective tabs below. This will show you that Italian learners use ‘possibility to’ (39 occurrences) slightly less often than ‘ possibility of’ (43 occurrences), thus neither firmly corroborating nor contradicting the above hypothesis:
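The counts behind such a comparison can be reproduced outside the concordancer. The sketch below counts the words within a span around a node word and computes the share of one variant among the competing ones; the function names and default span are assumptions, and only the figures 39 and 43 come from the text above.

```python
from collections import Counter


def collocates(tokens, node, left=0, right=1):
    """Count words within `left`/`right` positions of each hit of node."""
    node = node.lower()
    lowered = [t.lower() for t in tokens]
    counts = Counter()
    for i, token in enumerate(lowered):
        if token == node:
            counts.update(lowered[max(0, i - left):i])
            counts.update(lowered[i + 1:i + 1 + right])
    return counts


def share(counts, word, among):
    """Proportion of `word` among the competing variants in `among`."""
    total = sum(counts[w] for w in among)
    return counts[word] / total if total else 0.0
```

With the Italian figures reported above, share(Counter(to=39, of=43), "to", ["to", "of"]) is 39/82, i.e. just under half, which matches the observation that the two constructions are almost evenly split.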
Now you can start a new concordance, e.g. with the German ICLE subcorpus and compare the results.
To continue working with this data at a later point in time, you can save it as a Concord file ( file - save as ). If you do not have Wordsmith on your computer at home, it is better to save the data as a .txt file or as an Excel spreadsheet, as you will not be able to open a Concord file without the Wordsmith software. Saving it as an Excel table allows you to work with the data later on, e.g. to copy and paste some of the examples, or to count the occurrences, compare frequencies across different corpora or points in time, and finally draw graphics.
Some further options for entering a search word or phrase:
By using the asterisk *, you can widen the scope of your search. For example, entering going as a search word will provide you only with all instances of going; entering going to with all instances of going to. If you type in go*, on the other hand, you will get all words beginning with go-, e.g. going, goes, gold. Searching for *ing, you will get all words ending in -ing, e.g. swimming, dancing, sing.
You can also type in several words as your search words, e.g. go / going / goes.
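The asterisk and slash notation described above can be mimicked with regular expressions. The sketch below is an approximation for single-word patterns only, not WordSmith's own matching logic; the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate 'go*' or 'go / going / goes' into a compiled regex."""
    alternatives = [part.strip() for part in pattern.split("/") if part.strip()]
    regex_parts = []
    for alternative in alternatives:
        # '*' stands for any run of word characters, including none.
        pieces = [re.escape(piece) for piece in alternative.split("*")]
        regex_parts.append(r"\w*".join(pieces))
    return re.compile(r"^(?:" + "|".join(regex_parts) + r")$", re.IGNORECASE)


def search_words(tokens, pattern):
    """Return the tokens that match the wildcard pattern."""
    regex = pattern_to_regex(pattern)
    return [token for token in tokens if regex.match(token)]
```

The example also shows why wildcard results need manual checking: go* returns gold alongside going and goes, and *ing returns sing alongside swimming.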
Additionally, it is possible to analyse the co-occurrence of two words within a certain distance. In order to do this, you need to type in one word as ‘Search Word or Phrase’ and the other as ‘Context Words’ in the tab ‘advanced’. Additionally, indicate the ‘search horizons’, i.e. how many words to the left or right the second word might occur with respect to the first. For example, if you want to analyse a collocation such as have a look in a span of five words, enter have/has/had/having as search word and look as context word; click on ‘0L’ and ‘5R’ to find all instances where look is found within five words to the right of have.
WordSmith WordList
The tool WordList generates word lists of the selected text files and enables you to compare the length of text files or corpora. Moreover, you can use WordList to compare the frequency of a word in different text files or across genres and to identify common clusters.
Choose the text files for your analysis as described in the section above and use WordList now instead of Concord.
In the tabs below you can select between three different types of word lists: listed by frequency, listed in alphabetical order, or containing statistical information.
‘Wordlist statistics’ compares the frequencies of words in each category of the respective corpus (e.g. in each text of the German and Italian sub-corpora of the ICLE), providing information on the number of words, the average length of words and sentences, etc.
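The three views of a word list described above (by frequency, alphabetical, statistics) can be approximated as follows. This is a simplified sketch: the tokenizer and the sentence splitter are assumptions, and WordSmith's own counts may differ, for example in how it treats hyphens and apostrophes.

```python
import re
from collections import Counter


def word_list(text):
    """Count lower-cased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def by_frequency(counts):
    """Most frequent first; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def alphabetical(counts):
    """Word list in alphabetical order."""
    return sorted(counts.items())


def statistics(text):
    """Basic figures: tokens, types, mean word and sentence length."""
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "mean_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }
```

Running statistics on each sub-corpus separately gives the figures needed to compare, say, the German and Italian learner texts.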
For further information and explanations of the different tools, you can always resort to the Wordsmith Tools Help window.
3.2 AntConc Concordance tool
This tool shows the words or word strings you want to analyse in their textual context.
2. Select the files you want to analyse: File > Open file(s)
3. Choose the tab "Concordance"
4. Type in a search word (“Search Term”, bottom left-hand corner)
Example: how to find all occurrences of make :
- only one word form: type in make, makes, made, making separately
- several word forms: use of wildcards
i. ma* gives you all of the above word forms, but also all other words beginning in ma-, e.g. man, mankind, marry, etc.; * stands for 0 or more characters
ii. ma?e gives you make and made, but also maze, male and mate; ? stands for any 1 character
- further wildcards:
i. @ stands for 0 or 1 word
ii. # stands for any one word
iii. | stands for OR
2) Determine how large the context of the concordance line is supposed to be: Default setting of “Search Window Size” is 50 characters, but generally you need more context (200 or 250 characters)
3) Click “Start”
4) “Concordance Hits” shows you the overall amount of occurrences (remember that not all occurrences need to be relevant for your analysis!)
5) If you want to see the whole text of one concordance line, move the mouse over the highlighted search term in the concordance line and click.
6) Deleting unwanted concordance lines: on your keyboard press “control” + click on the line you want to delete, then press “delete” on your keyboard. Click on “Sort” (under the “Search Term” box) to reorder the remaining concordance lines so that you are left with consistent numbering.
7) Save your results: File > Save output to text file
How to refine your search:
- Click on “Advanced” next to the “Search Term” box
- Type in make in the “Search Term” box
- Activate the box “Contexts Words and Horizons”
- Type in “up” in the box under “Context Words”, then click on “Add”
- Define the search horizon (e.g. 0 words to the left and 5 words to the right of make )
- Click on “Apply”
- Click on “Start”
Example 2: Finding words clustering around take
- Select the tab “Clusters”
- Type in take as search term
- “Search Term Position”: Decide if you want to find the words preceding (activate “on right”, i.e. take is on the right) or following take (activate “on left”, i.e. take is on the left)
- Using “Cluster Size”, define how long you want your cluster to be (e.g. at least 3 words including take )
- Click on “Start”
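A cluster search of this kind amounts to counting word sequences that start or end with the search term. The sketch below uses one fixed cluster size rather than a minimum size, and its function name and position argument are assumptions that only loosely follow the options named above.

```python
from collections import Counter


def clusters(tokens, node, size=3, position="left"):
    """Count clusters of `size` words with node first ('left') or last ('right')."""
    node = node.lower()
    lowered = [token.lower() for token in tokens]
    counts = Counter()
    for i, token in enumerate(lowered):
        if token != node:
            continue
        start = i if position == "left" else i - size + 1
        end = start + size
        # Skip clusters that would run past either end of the text.
        if start >= 0 and end <= len(lowered):
            counts[" ".join(lowered[start:end])] += 1
    return counts
```

With the node on the left the counts show what follows take (e.g. take care of); with the node on the right they show what precedes it.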
Example 3: Finding collocates of take
- Select the tab “Collocates”
- Type in take as search term
- Define the span of words to the left and right of take: “Window span” from e.g. 0L to 5R
- Click on “Start”
WATCH OUT:
- When you have done a search with context words via the “Advanced” search function, and then want to do a search without context words, make sure to clear the context words you used for your previous search.
- When you are using a part-of-speech-tagged corpus like F-LOB or Frown, and you do not want the tags to show up, go to “Global Settings” > “Tag Settings” > “Hide Tags”
4. Exercises
1. The BNC online (www.natcorp.ox.ac.uk ) offers a free search facility for simple searches. For example, you can check whether a certain collocation is used by British native speakers. The BNC online counts all instances of your search items but displays at most 50 random examples.
a) Here you can find some translations of German collocations. Do they exist in English and are they used by native British speakers?
b) The following words have a different meaning in German and English. Check their use in the BNC and verify their meaning in the OED (www.oed.com ; can only be accessed in the campus network):
2. If you want to analyse different varieties of English, you can use the ICE (International Corpus of English) corpora. For this exercise, refer to the ICE New Zealand.
a) ‘Wahine’ is a word from Maori meaning woman/female/wife. How many occurrences can you find both in the written and in the spoken part of ICE NZ?
b) The word ‘panache’ occurs once in ICE NZ - in which file?
c) Find the word nice in ICE NZ. Which adverb does it most frequently collocate with?
d) Which is the most efficient search strategy for finding all instances of to shake x’s head (including its inflected forms)?