Objective tools for analysing linguistic structures
Is a new beginning, a raid on the inarticulate
With shabby equipment always deteriorating
In the general mess of imprecision of feeling.
- T.S. Eliot, East Coker
Despite all that has been said in the previous chapter about language fabricating reality, it is perhaps nevertheless most easily imagined as some kind of message delivery apparatus: Words and sentences trundling back and forth like cocopans on the overhead rails of an automated office mail system, transporting the ore of meaning from 'sender' to 'receiver'. Whether language really functions as a message carrier, or has a more sinister purpose, it should in principle be possible to draw up a blueprint showing the exact arrangement of gears, pulleys, springs, counterweights, and so on which keeps the system moving.
Unfortunately, or perhaps fortunately, language does not come with a user's manual, and more indirect methods have had to be employed to expose its inner workings. The problem with many of these methods (some of which were reviewed in the previous chapter) is that they are themselves constituted in language. The reflexive absurdities which result are aptly described by Richards (1989): "trying to see what 'see' means, trying to hold onto the meaning of 'hold', looking for the meaning of 'look', following the meaning of 'follow' round in circles" (p. 61). While it would be naive to imagine that some kind of artificial language could be invented to analyse 'natural' language (perhaps using physical tokens as in Swift's satire), it is not unreasonable to assume that there may be some utility in surveying language by fitting an intentionally synthetic grid over it. This is indeed precisely what structuralist and post-structuralist authors do, for example Derrida's sous rature, Lacan's bogus algebraic formulas involving combinations of signifier and signified, or Deleuze and Guattari's proliferating neologisms.
In this chapter I evaluate a number of more traditional quantitative approaches to language analysis in linguistics and psychology with particular emphasis on the automation of procedures, and conclude with an assessment of their implications for a proposed quantitatively informed discourse analysis. Despite the technical and hyper-quantitative nature of the material in this chapter, the intention is not to propose quantitative language analysis as a substitute for qualitative analysis, but to ask how quantification may be used in conjunction with what must necessarily remain an essentially qualitative enterprise.
Perhaps the most obvious empirical approach to language studies is to examine, quantitatively, the frequencies and patterns of occurrence of various linguistic features in large samples (or corpora) of speech or writing. The purpose of this, to revert to structuralist terminology, is quite simply to study langue (the structure of language) through redundant patterns in parole (actual utterances) (Engwall, 1994). Zipf (1935)(1), one of the earliest proponents of what later came to be called corpus linguistics, described the impulse behind corpus work as follows:
It occurred to me that it might be fruitful to investigate speech as a natural phenomenon, much as a physiologist may study the beating of the heart, or an entomologist the tropisms of an insect, or an ornithologist the nesting-habits of a bird. That is, speech was to be regarded as a peculiar form of behavior of a very unusual extant species; it was to be investigated, in the manner of the exact sciences, by the direct application of statistical principles to the objective speech-phenomena (p. xi).
At the most superficial level this kind of approach may do little more than confirm already known facts, such as that 'e' is the most frequent letter in the English language or that the word 'shall' is now virtually extinct in Australian English (Collins, 1991). However, by calculating exhaustive statistics not only on the frequency of various linguistic categories, but also on their patterns of co-occurrence, corpus linguists hope that a more profound understanding of how language works may emerge.
In Johansson's (1994) wide definition, a corpus is "a body of texts put together in a principled way(2), often for the purposes of linguistic research" (p. 3), and can refer to virtually any collection of writing (or transcribed speech), such as the psychiatric textbooks or set of interview transcripts used in this dissertation. The term now most often refers to a large collection of texts and transcripts captured into a computer database. The earliest and best known computer corpora are the Brown corpus (Francis & Kucera, 1964), which contains just over a million words of American English extracted from sources such as newspapers, magazines and books, and the London-Lund Corpus of Spoken English (Svartvik & Quirk, 1980), which consists of approximately half a million words transcribed from radio broadcasts, surreptitiously and openly recorded telephone conversations, and the like. However, there has since been a proliferation of corpora, with Taylor, Leech & Fligelstone (1991) listing no fewer than 36 English corpora, ranging from the relatively tiny Corpus for Dialectometry (38 000 words) to the over 5 million words of the American Heritage Intermediate Corpus. The total number of words in the corpora listed by Taylor et al. is of the order of 45 million, and a single corpus currently under development by the Longman group is set to double this figure. Edwards (1993) lists projects intent on establishing corpora of 100 million words each.
Computer corpora vary considerably in the purposes for which they were originally intended and the format in which they have been captured. Many make use of standard English orthography augmented by minimal text markers indicating the source of each text fragment, its date of recording and so on. Others have been extensively tagged or annotated to identify features such as voice pitch (Wichmann, 1991), grammatical categories and the like.
The number of ways in which corpora can be annotated(3) is endless. The three basic components of language identified by Longacre (1976) - lexis, grammar, and phonology - are each imperfectly represented in standard orthography and special markers have to be added to the text to signal the occurrence of particular lexical, grammatical or phonological events. One may for instance wish to tag different 'case roles' (agent, goal, instrument, location, patient; Starosta, 1978) or lexical categories such as Human Noun, Concrete Noun, Motion Verb, Physical Verb and Locative Verb (Longacre, 1983) or stative, action and process verbs (Gervasio, Taylor & Hirschfield, 1992). Another approach is participant indexing (Grimes, 1975) which is used to identify the characters in a story or the participants in a conversation. Phonetic markers of varying complexity can also be added to texts.
Whatever theoretical perspective a linguist or social scientist may adopt towards language, there is usually an abundance of classification schemes which can be used for coding purposes. An example is Austin (1975) and Searle's (1969) work on 'speech acts', which helped popularise the idea that language is not only descriptive but also a form of action. Apart from Austin and Searle's own taxonomies of speech acts (Searle, 1969, 1976), rival taxonomies have been suggested by at least five other theorists (reviewed in Hancher, 1979; Stiles, 1981), and there is thus no shortage of coding schemes which may be used by an analyst wishing to annotate texts in terms of speech act theory.
According to Longacre (1976), language has deep structure and surface structure in its lexical, grammatical and phonological components(4) (see Figure 6.1), and it is probably accurate to say that text annotation is in essence an attempt to move beyond the surface phenomena to the deep structure. In the lexical field, for instance, elaborated vocabulary items such as 'saunter', 'amble' and 'trot' all belong to the more basic meaning category of walk/run. Much also depends on the size of the textual chunks which one uses as unit of analysis - vastly different deep and surface structures exist at different levels such as phoneme, sentence, plot or dialogue (what Longacre calls 'repartee').
The types and complexity of deep and surface structures which can be identified are virtually limitless, depending as much on the kinds of phenomena of interest to different researchers as on objective qualities of language. As Simons and Versaw (1992) point out, ordinary English orthography tends to mislead one into thinking that text is a one-dimensional string of characters, while in fact it could more accurately be viewed as a string of (more-or-less) 'ordinary' orthography, plus "a multidimensional set of annotations provided by the analyst" (p. 1.3), the 'analyst' being not necessarily a linguist, but also, for example, an ordinary person engaged in conversation.
Figure 6.1 Surface structure and deep structure for three components of language (from Longacre, 1976)
However, hand-annotated texts are not cheaply or easily produced. A grammar tagging project initiated by Leech & Garside (1991) illustrates the difficulties encountered when extensive annotation of a large corpus is attempted. Although the authors based their work on a simplified phrase structure grammar, automated several aspects of the annotation process, and set up a "grammar factory" of more than 15 highly trained individuals to parse sections of text, the project repeatedly foundered under the sheer volume of work. So time-consuming is the task of even relatively simple annotation, that projects such as these often seem to lose sight of the original purpose for which the annotations were required (such as to generate a probabilistic grammar of spoken English), and are presented in academic forums as if the act of annotation were in itself a sufficient achievement.
The effort involved in coding is one reason why the proportion of corpora containing text annotations is likely to remain small. Another is that the production, distribution and consumption of large bodies of computer text is increasingly in the hands of individuals other than professional linguists. With the advent of CD-ROM drives for microcomputers, CDs containing several million words of copyright-free corpus material are now available at very low cost to researchers (Atkins, Levin & Zampoli, 1994). There is also a growing library of encyclopaedias, technical reference books and even magazines available in machine-readable format to the 'ordinary' user. Market forces have ensured that the volume of texts available from these sources has become far larger than the corpora painstakingly assembled by linguists over several decades. In addition to commercially available sources of machine-readable texts, the now almost complete computerisation of office work means that large corpora can be collected with little effort from newspaper offices, schools, hospitals and other large and small bureaucracies. One of the text samples used in this dissertation comes from such a source.
Finally, the advent of the internet has now made literally billions of pages of machine-readable text available to virtually anybody.
Carefully constructed and thoroughly annotated corpora will no doubt remain much sought-after commodities (Ide & Veronis, 1998), as will corpora containing transcribed speech. However, social scientists interested in language are increasingly finding themselves awash in machine-readable texts with no hope of even the tiniest proportion of these ever being hand-annotated. If, as post-structuralists claim, "the reality that any individual inhabits is a vast inverted pyramid of discourse poised on a tiny apex of experience" (Tallis, 1989, p. 13), then the availability of machine-readable texts in large quantities opens up hitherto unheard of possibilities for exploring that reality. These possibilities can however only be realised if ways are found to extract more than the most trivial information from 'plain-English' texts. Although techniques from corpus linguistics have not progressed far beyond the trivial (Church, Gale, Hanks, Hindle and Moon, 1994, compare a lexicographer to "a person standing underneath Niagara Falls holding a rainwater gauge, while the evidence sweeps by in immeasurable torrents", p. 153), their combination with discourse analytic methods could enrich both approaches.
Three broad classes of techniques used in automated analysis of unannotated texts can be identified. These are word frequencies and type-token ratios; concordances and collocations; and automated tagging.
Word frequencies and type-token ratios
The most obvious way of processing machine-readable text is simply to count words. Thus it may be of some academic interest that the total number of words in an earlier version of the previous two pages was 677, that 343 unique words were used, and that the most frequently used word (n=37) was the, as indeed it is in the Brown corpus where it occurs more than 68 000 times (Hofland, 1991). A sorted list of words from the two pages in question (Table 6.1) may seem to provide extremely trivial information compared to an actual reading of the pages. However, faced with the task of reading the 3 500 or so pages of the Brown corpus one may well be grateful for such scraps of information as can be revealed by frequency counts.
Table 6.1 Sample word frequency list from two pages of text
N % Word N % Word
0037 5.26 THE 0034 4.84 OF
0019 2.70 TO 0016 2.28 AND
0015 2.13 IN 0013 1.85 IS
0012 1.71 AS 0011 1.56 A
0007 1.00 TEXT 0006 0.85 BE
0006 0.85 SUCH 0006 0.85 THAT
0005 0.71 CORPORA 0005 0.71 OR
0005 0.71 WHICH 0005 0.71 ARE
0005 0.71 ANNOTATION 0005 0.71 ONE
0004 0.57 LEXICAL 0004 0.57 DEEP
0004 0.57 TEXTS 0004 0.57 FOR
0004 0.57 DIFFERENT 0004 0.57 STRUCTURE
0004 0.57 LARGE 0004 0.57 ON
0004 0.57 SURFACE 0004 0.57 WORK
0003 0.43 SOURCES 0003 0.43 FROM
0003 0.43 THESE 0003 0.43 EVEN
Note. The table has been truncated
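A frequency list in the format of Table 6.1 can be generated in a few lines of code. The following Python sketch is illustrative only (it is not the program used to produce the table, and its simple pattern for what counts as a 'word' is an assumption):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (count, percentage, word) triples, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenisation
    total = len(words)
    return [(n, 100.0 * n / total, w) for w, n in Counter(words).most_common()]

sample = "the cat sat on the mat and the dog sat down"
for n, pct, word in word_frequencies(sample)[:3]:
    print(f"{n:04d} {pct:5.2f} {word.upper()}")
# 0003 27.27 THE
# 0002 18.18 SAT
# 0001  9.09 CAT
```

Even so small a sketch makes the central design decision visible: everything depends on how 'word' is defined before counting begins.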
Comparative counts may be particularly informative. Thus the fact that shall occurs 0.9 times for every 10 000 words in an Australian corpus compared to 4.2 in the Lancaster-Oslo/Bergen corpus of British English (Collins, 1991) is more interesting than the Australian data alone. Similarly, the fact that the word corpora occurs five times in the two sample pages from this dissertation, but not once in the Brown corpus, does give some idea of the nature of the discourse produced on these pages. Words such as text, annotation and lexical also appear much more frequently in this chapter than one would expect from their prevalence in other general English texts.
Methods of comparing texts in terms of word frequency vary in sophistication. A typical example is the work of Kukulska-Hulme (1992), whose comparison between frequency word lists (function words removed) from a data security handbook and a user's manual for a particular computer system reveals a low degree of overlap or 'hit rate' (20%), leading to the conclusion that the handbook would prove confusing to many users. However useful word frequencies may prove for particular purposes, it is hard to escape the impression that for the most part the inferences which can be drawn will remain inconsequential. One attempt to use word frequencies in a more sophisticated manner is the so-called type/token ratio.
'Tokens' refer to the total number of words in a section of text, while 'types' are unique words. The type/token ratio is therefore quite simply "the number of different words as a ratio of the total number of running words" (Butler, 1985, p. 14). What the type/token ratio reveals is 'vocabulary richness' and it is for instance used to compare different authors' writing styles. Together with other stylistic 'fingerprints' such as word and sentence length, vocabulary richness can, amongst other things, help settle questions of disputed authorship.
A major drawback of the type/token ratio is that it depends not only on the author's style, but also on text length: The longer the text, the smaller the ratio. Thus the very high type/token ratio of 0.51 (343/677) for the two sample pages taken from this dissertation says as much about the shortness of the sample as about the richness of the vocabulary. If the size of the sample is doubled, the ratio drops to 0.44, while in a 30-page sample it is 0.32. If the type-token ratio at different points in a discourse is graphed, a parabolic curve such as that in Figure 6.2 is produced as the writer or speaker gradually 'uses up' the vocabulary available to him or her in the particular context. Despite this drawback, type/token ratios may have some utility, provided that care is taken to keep text length constant when different corpora are compared.
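The computation itself is trivial, as the following sketch shows (the example sentence is arbitrary):

```python
def type_token_ratio(tokens):
    """Unique word types divided by total running tokens."""
    return len(set(tokens)) / len(tokens)

tokens = "to be or not to be that is the question".split()
print(type_token_ratio(tokens))  # 8 types / 10 tokens = 0.8
```

The length-dependence noted above follows directly from this definition: as the text grows, the denominator grows without limit while the numerator is bounded by the size of the speaker's vocabulary.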
Figure 6.2 Sample type/token graph
Another disadvantage of type/token ratios is that a single index can hardly be expected to provide an adequate summary of an entire body of text. This problem is addressed to some extent by an extension to the type/token ratio, the Vocabulary-Management Profile (VMP), introduced by Youmans (1991). The VMP involves plotting a curve which shows the number of new types over a moving interval thirty-five tokens long. According to Youmans, peaks and valleys in VMP curves are closely related to constituent boundaries (such as breaks between paragraphs and chapters) and 'information flow' in works of fiction. One would expect similar peaks and valleys in accounts by psychiatric hospital patients, corresponding, for example, to the initial admission, an induction phase, discharge and so on.
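One plausible reading of Youmans' procedure (a sketch only; his exact windowing conventions may differ) is to count, at each position of a moving window, how many tokens in the window are the first occurrence of their type in the whole text. A tiny window is used below so the output can be checked by hand:

```python
def vocabulary_management_profile(tokens, window=35):
    """Count first-occurrence types within each moving window position."""
    first = {}
    for i, tok in enumerate(tokens):
        first.setdefault(tok, i)  # index of each type's first occurrence
    return [
        sum(1 for j in range(i, i + window) if first[tokens[j]] == j)
        for i in range(len(tokens) - window + 1)
    ]

print(vocabulary_management_profile("a b a c a b d".split(), window=3))
# [2, 2, 1, 1, 1]
```

Peaks in the resulting curve mark stretches where new vocabulary floods in (as at a constituent boundary), and valleys mark stretches where the speaker recycles what is already established.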
Concordances and collocations
While counting word frequencies and computing type-token ratios could be compared to trawling for fish and using the catch to estimate the variety and number of species present in a particular area, concordance building is an attempt to describe the kinds of ecological interdependencies which exist among the different species.
A concordance is essentially an indexed word list indicating the location of words in a text, in terms firstly of formal positional markers (such as Act and Scene in a play) and secondly in terms of the surrounding text (Klein, 1991). An excerpt from parts of a concordance (Wright, 1893) for the King James Bible dealing with the words word and fish is reproduced in Table 6.2. The table has been re-arranged to be similar to the popular KWIC (Key Words In Context) format for concordances which prints key words in a column down the centre of the page with sections of surrounding text on either side.
Constructing a concordance by hand, as Wright (1893) and others have done for the Bible, requires years of painstaking work, and it is therefore not surprising that concordance-building was one of the first literary and linguistics tasks to which computers were put. Computer-generated concordances are less prone to clerical errors than their manually produced counterparts, but usually require some manual editing to make them suitable for publication. The earliest concordance programs were the COCOA (word COunt and COncordance on Atlas) package and the Oxford Concordance Program (Hockey & Marriott, 1979), and most programs for analysing text corpora now include concordance building as one of their standard features. The indexing facilities available in high-end word processing and desktop publishing programs can also produce output similar to standard concordances.
Table 6.2 Part of a concordance for the King James Bible(5)
Deut 4.18 likeness of any fish in waters
Eccl. 9.12 fish taken in an evil net
Hab 1.14 makest men as fish of the sea
Matt 7.10 if he ask a fish?
Matt 17.27 take up the fish that first cometh
Lk. 24.42 piece of broiled fish
John 21.9 they saw fish laid
1 Cor 15.39 shall fish them
Matt. 4.4 every word of God
Rom. 10.8 the word is nigh
Job 38.3 darkeneth counsel by word
Ps 19.14 let the word of my mouth be acceptable
Prov. 15.11 a word fitly spoken
Ps. 68.11 the Lord gave the word
Is. 29.21 an offender for a word
Dan 7.25 speak great word against the Most High
Jer. 18.18 nor shall the word perish
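The KWIC format mentioned above is simple to generate automatically. The following sketch (an illustration, not any of the concordance packages cited) centres each occurrence of a key word with a fixed span of co-text on either side:

```python
def kwic(tokens, keyword, width=4):
    """Keyword-in-context lines: the key word flanked by `width` tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            out.append(f"{left:>25}  {tok}  {right}")
    return out

verse = "they saw fish laid and bread and he said bring of the fish".split()
for line in kwic(verse, "fish", width=3):
    print(line)
```

A full concordance is simply this listing run for every word in the text, which is why, as noted below, it ends up bulkier than the text itself.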
Where type/token ratios represent the quantitative pole of automated textual analysis, concordances represent the qualitative pole. Wright's (1893) biblical concordance entry for fish is certainly useful in forming an impression of the kind of submarine ecology within which the word thrives, but this information remains essentially qualitative. Unlike the type/token ratio, which purports to summarise a central feature of a text in a single number (or a series of numbers in the case of the VMP), a full concordance is more bulky than the text itself, leading Brodda (1991) to suggest that one should limit a concordance to between 500 and 1000 'relevant' words. Deciding relevancy is of course not an easy matter. One strategy, that of including the most frequent words, certainly will not work, as high frequency words, such as the, of and in, are often semantically the least interesting.
A derivative of the concordance idea, but one somewhat more quantitative in flavour, is the tabulation of word collocations, i.e. words that habitually occur together. The term 'collocation' was popularised in 1951 by Firth (quoted in Ide & Veronis, 1998), who explained it as follows: "One of the meanings of ass is its habitual collocation with an immediately preceding you silly ..." (p. 19). Similarly, the collocations of sin in the Bible are all those words which occur within a span of a certain number of words of sin. In effect collocations are multiple word frequency lists, with a separate list computed for each key word. The list of collocates for sin may for instance include fathers, deadly, and repent as relatively high frequency items, while ass is most likely a low frequency collocate, or not a collocate at all.
The assumption behind the idea of collocation is that words are not evenly distributed through semantic space, but clump together in more or less distinct constellations separated by lesser or greater tracts of meaninglessness (or unsaid meaning), and furthermore that the force with which words attract and repel is reflected in their relative distance in spoken or written language. Unlike the stars, however, which (as the King James tells us) have been 'set in the firmament' (Gen. 1.17), the ways in which words are constellated may vary from one discourse to another.
That words do tend to fall into each other's gravitational fields is supported by the fact that "roughly 70% of the running words in the London-Lund Corpus form part of recurrent word combinations of some kind" (Johansson & Stenström, 1991, p. 5). However, the forces which impel words to come together or propel them away from one another are not all of a purely semantic nature. As a weakly inflected language, English, for example, relies heavily on word order to differentiate between the functional elements in a sentence, and the grammatical constraints thus placed on the co-occurrence of words are of a different order from the more abstract semantic similarities and differences which cause words to repel and attract. In general it is likely that the sorts of results obtained from a collocational study will, amongst other things, depend on the distance allowed between collocates (e.g. adjacent words only or all words within a span of say 15 words) as well as on the boundaries which are set (e.g., all collocates, or only those occurring in the same sentence or paragraph). Unfortunately there is currently no consensus on the optimal size of the span of words to use for different purposes, and the value fluctuates among different studies "more or less arbitrarily" (Ide & Veronis, 1998, p. 19).
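The span-dependence just described is easy to see in code. The following sketch (illustrative, with an invented fragment standing in for a biblical text) counts collocates of a key word within a configurable span:

```python
from collections import Counter

def collocates(tokens, keyword, span=5):
    """Frequency of words within `span` tokens of each occurrence of
    `keyword` (the keyword itself excluded)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - span):i])   # left co-text
            counts.update(tokens[i + 1:i + 1 + span])   # right co-text
    return counts

tokens = "repent of your sin for the wages of sin is death".split()
print(collocates(tokens, "sin", span=2).most_common(1))  # [('of', 2)]
```

Changing `span` from 2 to 15 in a real corpus would produce a markedly different collocate list, which is precisely why the lack of consensus noted above matters.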
Automated and semi-automated tagging
Accepting that annotated text is in principle more informative than 'plain English' as it allows glimpses into lexical, grammatical and phonological 'deep structure', one may ask if the task of text annotation itself cannot somehow be automated. To an extent this is indeed possible, and in many cases a thorough automated analysis of unannotated text can be thought of as proceeding in two passes: First, automated tagging of particular features of the text; second, automated analysis based on the higher-level units thus identified.
Before discussing the successes so far achieved in the automated tagging of texts, it is important to acknowledge that automation is not necessarily an all or nothing affair. Although more reliable than hand-annotation, automated annotation is, due to the deterministic nature of computer algorithms, incapable of assigning correct codes to text sequences not explicitly provided for in advance. Thus the output from automated tagging programs often has to be manually checked and adjusted (e.g. Leech & Garside, 1991) before being submitted to further analysis. Another strategy is to start by annotating the text manually, but with gradually increasing automated assistance. Simons & Versaw (1992) have developed a method where the analyst assigns codes to text strings (e.g. synonyms in a dictionary project), which are then suggested as possible codes by the computer program when similar instances occur. As the analyst works her way through a text she will therefore start by having to type in most of the codes herself, but eventually reach a point where, for the most part, she simply has to indicate acceptance of the codes suggested by the program.
Moving from computer assisted to more fully automated text annotation, progress has been made in phonological, grammatical and lexical tagging, of which only the latter two will be discussed here(6). In the field of grammatical tagging, Gervasio, Taylor and Hirschfield (1992) describe a system which assigns a grammatical class to each word in a sentence with an accuracy of more than 80%. This is done without recourse to a large dictionary, but rather by deducing each word's class from its position relative to a small number of 'function' words (articles, auxiliary verbs and prepositions). Words are also automatically grouped into phrases and clauses with a high degree of accuracy. The frequency of different verb types identified by the system has been used to track changes in the language used by patients and therapists in the course of psychotherapy, and to study the impact of assertiveness training.
DeRose (1991) describes a stochastic tagging system which achieves an even higher accuracy level (96%) in assigning grammatical classes to words. DeRose's system is based on conditional probabilities derived from the Brown corpus. Thus in the sentence To everything there is a season, and a time to every purpose under the heaven (Ecclesiastes 3.1), time would be judged a noun both because articles are far more likely to precede nouns than verbs (collocational probability) and because time occurs 1000 times more frequently in the Brown corpus as a noun than as a verb (absolute probability). The system determines word class by assigning a 30-35% weight to absolute probabilities and a 65-70% weight to collocational probabilities.
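The weighting scheme described above can be sketched as follows. This is a simplification of DeRose's system (which derives its probabilities from the Brown corpus and resolves whole sequences of tags at once); the probability figures below are invented for illustration:

```python
def best_tag(word, prev_tag, absolute, collocational, w_abs=0.35, w_coll=0.65):
    """Pick the candidate tag maximising a weighted sum of its absolute
    probability for this word and its probability of following prev_tag."""
    candidates = absolute[word]
    return max(
        candidates,
        key=lambda tag: w_abs * candidates[tag]
        + w_coll * collocational[(prev_tag, tag)],
    )

# Hypothetical probabilities for 'time' following an article:
absolute = {"time": {"noun": 0.999, "verb": 0.001}}
collocational = {("article", "noun"): 0.9, ("article", "verb"): 0.1}
print(best_tag("time", "article", absolute, collocational))  # noun
```

In the Ecclesiastes example, both evidence sources point the same way; the weighting only becomes decisive when absolute and collocational probabilities disagree.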
Automatic tagging of grammatical categories is valuable (amongst other reasons) because of the increased fidelity of word frequency, collocational and other analyses performed on tagged texts. Roughly 11% of word types and 48% of word tokens in the Brown corpus belong to more than one grammatical category, depending on context (DeRose, 1991). Even at the most simple level, frequency or collocational information on a word such as swallow in any particular text (such as a therapy transcript) would therefore be more accurate if reported separately for its different grammatical senses - according to the dictionary, a verb meaning "to make or let pass down one's throat" or a noun meaning "a kind of fork-tailed swift insectivorous bird".
Unfortunately grammatical category is not the only source of lexical ambiguity: a word may not only belong to several grammatical categories, but may also have several different possible meanings within a particular category. The word parade, taken as a noun, is for instance given six different meanings in the Little Oxford Dictionary (Ostler, 1969), ranging from "muster of troops for inspection" and "ground used for this" to "public promenade". Fortunately the various meanings of a word within a particular grammatical category are often related - even if only metaphorically - so that misclassification only becomes an issue in circumstances, such as automated translation, where finer distinctions in meaning are important. This is, however, by no means always the case. For example: Like swallow, the word hawk may refer either to a kind of bird or to an action performed by the human throat (depending on whether it is used as a noun or a verb), but as a verb it also has another quite unrelated meaning, namely "to carry about for sale".
The phenomenon of one word form having several different meanings may be due to a variety of factors, including homophony (similar-sounding but different words), homography and homonymy (words of the same form but different meaning), and polysemy(7) (identical words with related but different meanings, e.g. head of a body and head of an organisation). The resulting lexical ambiguity is, according to Brekke (1991), "an all-pervasive phenomenon in Modern English" (p. 83). Brekke points out that "the common core vocabulary of English contains hundreds of high frequency items like board, stamp and wall, which in isolation carry no clue as to which of their specific meanings is intended" (p. 83). In ordinary language use humans perform very rapid disambiguation of such words by referring to the context, but (as Brekke demonstrates for the word wall) this is not an easy process to simulate by means of computer algorithms.
As if the problem of single word forms having multiple meanings were not enough, natural language is also plagued by the opposite phenomenon of synonymy - i.e. apparently different words having closely related or identical meanings. In highly inflected languages this is a particularly common problem. The German verb aufnehmen for instance appears in over 30 different forms, while the Finnish verb ottaa appears in about 60 forms (Butler, 1985). Often the underlying lexical unit, called a lexeme or lemma, is of more interest than its various forms. Thus one may wish to know the frequency and collocations of the lexeme LOVE, rather than of love, loves, loved and loving separately, or of BE rather than of be, is, am, are, was, were, been and being.
The process of grouping together the forms of a lexeme, called lemmatisation, is according to Butler (1985) difficult to automate, because "the rules for recognising a form as an instance of a particular lemma are complex and have not been specified in a completely explicit way for any language" (p. 14). An additional difficulty is that the nature and degree of lemmatisation required may differ from application to application. In certain circumstances it may be important to consider word forms entirely separately, while in others very loose semantic groupings (such as all colour terms) may be treated as if they formed a lemma. The principles governing depth of lemmatisation, i.e. the choice of which thesaurus to use to sort words into semantically related groupings, have not been explicated.
Automated tagging can be conceptualised as the implementation of rewrite rules (Brodda, 1991), i.e. applying a set of rules to a text so as to produce a systematically transformed version of the text as output. A rewrite rule concerned with lemmatisation of BE will thus produce output in which all instances of be, is, am and so forth have either been changed to be or marked as belonging to the same lemma. The rules which have been discussed thus far apply to the traditional linguistic categories of grammar and lexis, but rewrite rules are of arbitrary complexity, and may be set up to identify and transform any kind of language unit, including those used in content analysis.
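The BE example can be written as a literal rewrite rule in a few lines. The sketch below (a minimal illustration of the rewrite-rule idea, not Brodda's own formalism) transforms its input so that every form of BE is replaced by the lemma:

```python
BE_FORMS = {"be", "is", "am", "are", "was", "were", "been", "being"}

def lemmatise_be(tokens):
    """A single rewrite rule: map every form of BE to its lemma."""
    return ["be" if t.lower() in BE_FORMS else t for t in tokens]

print(lemmatise_be("there is a time and there was a season".split()))
# ['there', 'be', 'a', 'time', 'and', 'there', 'be', 'a', 'season']
```

A full lemmatiser is of course vastly harder, as Butler's remark above indicates: this table-lookup approach works for BE only because its forms are few, closed and unambiguous, which is not true of most lexemes.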
Whereas the kinds of automated and semi-automated linguistic analyses discussed thus far originate in linguistic and literary studies, content analysis historically belongs very much to the social sciences, especially psychology. Now often referred to as a qualitative technique, content analysis in fact straddles the divide between numerical and scholarly approaches, and thus may provide some clues for developing a quantitatively informed discourse analysis.
Berelson (1952), one of the early developers of the technique, defined content analysis as the "objective, systematic and quantitative description of the manifest content of communication" (p. 18) - hardly the kind of language that one would normally associate with a qualitative procedure. A decade or so later Stone, Dunphy, Smith & Ogilvie (1966) had dropped quantitative from their definition, but stayed with systematic and objective: "Content analysis is any research technique for making inferences by systematically and objectively identifying specified characteristics within the text" (p. 5). A more recent review of content analysis studies (Viney, 1983) seems to veer back towards the quantitative, focusing mainly on content analysis scales, and discussing these in terms of psychometric properties such as correction factors, reliability and validity.
However, despite its claim to core scientific characteristics such as being objective, systematic, and even quantitative, content analysis also has inescapable qualitative elements. In part this is perhaps simply because it deals with qualitative data (relatively unstructured samples of speech or writing), rather than with discrete physiological, psychometric or behavioural indices. Equally importantly, however, content analysis is partially qualitative because inferring content categories from a sample of text and subsequently identifying category instances in other samples is a subjective rather than an objective enterprise. The use of multiple raters and reporting of inter-rater reliability coefficients serve to underscore rather than diminish this point.
Attempts have been made to apply computer techniques to improve content analysis as a tool both for discovering appropriate classification schemes to describe texts and for applying these schemes to other texts. Brown, Taylor, Baldy, Edwards and Oppenheimer (1990) describe a system designed to assist in the qualitative exploration of textual data, similar to a manual system which involves the sorting and grouping of index cards. The advantage of the Brown et al. method is that a particular section of text can conveniently be related to more than one classification scheme at the same time, instances of any category can be automatically retrieved, while hierarchical and other relations among categories are easily represented in the system. Many of the popular computer-aided qualitative analysis techniques such as Atlas and Nudist (reviewed in Kelle, 1995) work along the same principles.
Rather than assisting in the induction of classification schemes, Gottschalk and Bechtel's (1982) system is aimed at automating the parsing of texts in terms of predetermined classification schemes. Many such schemes (also known as content analysis 'scales') have been developed. Viney (1983) mentions scales for anxiety, hostility, sociability, locus of control (the origin and pawn scales), hope and positive affect; Schnurr, Rosenberg and Oxman (1992) refer to scales measuring pessimism, optimism, rumination and helplessness; and Peterson, Bettes and Seligman (1985) describe a scale for measuring causal attributions to negative events. Gottschalk and Bechtel's (1982) program is based on the well-known Gottschalk-Gleser anxiety scale, which produces scores for death anxiety, mutilation anxiety, separation anxiety, guilt, shame and diffuse anxiety. The Gottschalk-Gleser method of analysing transcribed speech involves three steps, of which only the second and third have been computerised by Gottschalk and Bechtel. They are: Dividing the text into grammatical clauses (defined as language structures which contain an active verb); scoring each clause for the presence (and in some cases intensity) of a particular construct; and applying correction factors to the summed scores to adjust for text length.
Identification of scoreable instances is achieved by checking each phrase against dictionaries containing key words and word combinations thought to be indicative of the different kinds of anxiety. Content scoring of this sort is therefore an exact analogy of lemmatisation, although a lemma such as DEATH ANXIETY would no doubt appear rather strange to classical linguists. Gottschalk and Bechtel (1982) do not report any attempt to resolve the lexical ambiguity problem which inevitably limits the success of any lemmatisation attempt. Nevertheless, correlations between hand scoring and machine scoring range from .58 to .92 with a mean of .85 (although machine scores are consistently lower).
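The dictionary look-up itself is straightforward; the following sketch uses a five-word set invented purely for illustration (the actual Gottschalk-Bechtel dictionaries are far larger and include word combinations and intensity weights):

```python
# Invented stand-in for a content analysis dictionary.
DEATH_ANXIETY = {"die", "dying", "death", "dead", "funeral"}

def score_clause(clause, dictionary):
    """Score 1 if any dictionary key word occurs in the clause, else 0."""
    tokens = {w.strip(".,;:!?").lower() for w in clause.split()}
    return int(bool(tokens & dictionary))

clauses = ["I keep thinking about death.", "We walked to the shops."]
print([score_clause(c, DEATH_ANXIETY) for c in clauses])  # [1, 0]
```

Because a look-up of this kind matches surface forms only, it inherits the lexical ambiguity problem noted above (e.g. 'dead tired' would be scored as death anxiety).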
Towards a quantitatively informed discourse analysis
It would be difficult to imagine a body of work more clearly different in temperament from discourse analysis than that reviewed in this chapter, yet it seems possible that discourse analysis could benefit from borrowing some of the principles and techniques of corpus-based linguistics and content analysis. The two major shortcomings of discourse analysis as viewed from a traditional empirical perspective (insufficient concern for sampling and subjective analysis) are both to a greater or lesser extent addressed in corpus studies.
Corpus studies could represent something of a model of how discourse analysts could go about ensuring more adequate sampling. Potter and Wetherell's (1987) bald statement that "for discourse analysts the success of a study is not in the least dependent on sample size" (p. 161) was probably provoked by a desire to distance themselves from the kind of social science which takes the individual as its basic unit and considers studies employing more of these units as (potentially) superior. However, as has been shown in the previous chapter, there is nevertheless an effort to sample texts for diversity of origin. Likewise, the constructors of linguistic corpora, while usually quite unconcerned with sample size and composition in the social scientific sense, also make use of such (at least minimally) stratified random sampling techniques to select textual fragments for inclusion. There appears to be a tacit consensus that a 'good' corpus (from the point of view of sampling) is one that a) includes texts from a variety of different sources (oral and written, published and unpublished, dialogues and monologues, spontaneous and prepared, formal and informal); b) clearly identifies the nature and source of each fragment as well as the overall proportions of different kinds of fragment in the corpus; and c) is large. The immense diversity in language use, even in terms of narrowly defined syntactic features, makes it imperative that large, well-structured samples of text should be studied. The question is not if the sample is representative of a population of individuals, but if it adequately represents a certain kind of language situation (e.g. British written English in general or British tabloid newspaper reportage on the royal family during the 1990s)(8).
As discussed in the previous chapter, an issue which is related to that of the initial sample selected is the extent to which it is exhaustively surveyed in the course of the analysis. Unlike some discourse analytic studies which quote selected illustrative examples in corroboration of whatever inferences are made, corpus studies and content analyses are explicitly concerned with systematically parsing the entire text, and presenting results in the context of their frequency in the text as a whole. While type/token ratios, collocational frequencies or content analysis scale scores may not contain quite the sorts of information useful from a discourse analytic perspective, similar kinds of indicators can, as will be shown below, be developed.
In terms of subjectivity, discourse analysis could also benefit from the relative success of attempts at automated analysis reviewed in this chapter. While one has to bear in mind warnings by Potter and Wetherell (1987) and other discourse analysts (reviewed in the previous chapter) against an overly mechanical application of analytic techniques, it is difficult to see why at least certain aspects of the process could not be executed by means of objective algorithms. The product of a discourse analytic study of the Potter and Wetherell variety is an 'interpretative repertoire' which is "basically a lexicon of terms and metaphors drawn upon to characterize and evaluate actions and events" (p. 138), and "often a repertoire will be organized around specific metaphors and figures of speech" (p. 149). Potter and Reicher (1987) return to the same theme in their definition of a discourse as "terms which are used with stylistic and grammatical regularities, often combined with certain metaphors" (p. 27)(9).
To identify a discourse, or a category in a discursive repertoire, one therefore has to identify certain lexical terms, metaphors, figures of speech, and stylistic and grammatical regularities. At least at face value this seems rather similar to the application of rewrite rules in content analysis or lemmatisation studies, and may be equally susceptible to automation. One could, for example, imagine discourse studies that not only identify discourses in texts, but that report on the frequency and distribution of previously identified discursive elements in different textual situations, in the same way as different sorts of content analytic studies are concerned either with developing novel content categories or with applying previously developed content analytic scales.
This kind of scenario of course immediately raises the spectre of reification, precisely the issue which is the main bone of contention in the debate between Potter, Wetherell, Gill & Edwards (1990) and Parker (1990a, 1990b) reviewed in the previous chapter. The moment we give discourses and technologies for discovering their presence in texts explicit definition, we may have created a pseudo-scientific regime every bit as totalitarian as the one currently set up to detect and describe individual subjectivity. A partial rebuttal to this argument is contained in the observation that reification is inevitable, even desirable, and that the question is rather one of degree: Not if discourse analysis should create reified objects, but for how long it should leave them standing. This issue is returned to more fully below.
Apart from the obvious danger of reification, there is also another reason why discourse analysts have avoided adopting more explicitly structured techniques. As discussed in the previous chapter, this has to do with discourse analysis' identity as a 'bottom-up' qualitative technique. The objects discovered by discourse analytic research differ from those of, for example, quantitatively oriented content analysis not only in their theoretical underpinning, but also in how they are constructed. While the latter are often deduced from psychological constructs (such as depression or anxiety) and then rediscovered in individual texts, the former are supposed to emerge directly from the texts, remaining as far as possible at the level of describing how texts are organised.
Rather than the dictionary approach used in lemmatisation studies or automated content analysis (i.e. matching textual fragments to predetermined categories), a quantitatively informed discourse analysis should therefore ideally approach texts without any preconceived ideas as to the entities which will be found there. This almost a-theoretical posture is not unlike that adopted in collocational studies, which seek to describe the internal ecology of texts from the point of view of nothing more complex than the co-occurrence of lexical items. Potter and Reicher's (1987) definition concerning "stylistic and grammatical regularities, often combined with certain metaphors" (p. 27) as well as similar definitions quoted earlier suggest that the idea of collocation may be the minimum ingredient for a quantitatively informed version of discourse analysis.
One way of using techniques from corpus linguistics in discourse analysis while maintaining the latter's identity as a 'bottom-up' and qualitative technique is to start the analysis with linguistic techniques in order to gain an overview of the text, and then to use this information to guide more in-depth qualitative analysis. This progression is the reverse of that commonly found in discourse analytic studies, in which the units of interest are first determined qualitatively, followed in some cases (e.g. Gilbert & Mulkay, 1984; Levett, 1988; Van Dijk, 1987a) by tabulation of the frequency with which the different units are found in the text. Content analytic studies similarly follow a progression from qualitatively (or 'rationally') derived content categories to quantitative frequency counts and content scales (e.g. O'Dell & Weideman, 1993; Laffal, 1990; Schnurr, Rosenberg & Oxman, 1992). More generally, the idea that qualitative analysis precedes and prepares the ground for quantitative analysis, that it is good at identifying which questions to ask but less so at providing definitive answers to them, is intuitively appealing and often stated (e.g. Kirk & Miller, 1986; Miles & Huberman, 1984).
The inversion of the qualitative-quantitative sequence, although a significant departure from the norm, has a purpose similar to that of most attempts to combine quantitative and qualitative methods, namely to render the analysis more rigorous while retaining flexibility and richness of detail. The reason why such an inversion is worth exploring relates to the issue of representativeness, which is often identified as the Achilles heel of qualitative research(10).
Qualitative research not infrequently makes strong claims with regard to representativeness, such as the discourse analytic assertion that the phenomena it identifies somehow emerge spontaneously from the text rather than being imposed on it, but as often has difficulty in demonstrating such representativeness when reporting on the results of an analysis. An apparent lack of representativeness may manifest on at least three levels, each of which is sometimes addressed by recourse to quantitative data, although in each case at the risk of reification.
The first level at which qualitative research may appear unrepresentative, which is the level Miles and Huberman (1984) appear to be alluding to, is when there is the possibility that instances are incorrectly or arbitrarily assigned to categories, i.e., when there is the suspicion of inadequate inter-rater reliability. As has been shown in the proliferation of computer-based content analytic scales, the appearance of unrepresentativeness is perhaps most easily overcome at this level. However, although it is always possible to devise a perfectly consistent classification algorithm, the categories used may themselves lack adequate justification, and it can be argued that content analytic researchers (like psychometrists) have been quick to find solutions for what are essentially trivial measurement problems, while ignoring more fundamental theoretical questions with regard to what they are measuring. As has often been stated, objects such as 'authoritarianism', 'depression' or 'anxiety' easily acquire a spurious substantiality by virtue of the reliability with which they can be identified.
The second level at which qualitative research may appear to lack representativeness is where there is no indication of the frequency with which particular phenomena occur in a text. An example is Potter and Reicher's (1987) 'discursive repertoire' which details the kinds of metaphors used to construct the idea of community in various accounts of a 'riot', but does not reveal the relative frequency of the different elements of the repertoire. Again, this problem is relatively easily overcome. Some discourse analytic studies, such as those of Levett (1988), Van Dijk (1987a) and Gilbert and Mulkay (1984), take care to report the frequencies of the different discursive phenomena they identify so that it is possible to gauge the relative importance of each. Thus the racist themes 'They have a different mentality' (N=20) and 'They do not respect women' (N=15) can be seen to be much more frequent in the discourses about 'our neighbourhood' analysed by Van Dijk (1987a) than, for instance, the theme 'They steal, are dishonest' (N=7).
The third level at which qualitative research may appear to lack representativeness concerns the ubiquitousness of the identified categories in the text as a whole. Not only would it be interesting to know how frequent the metaphors in Potter and Reicher's (1987) community repertoire are relative to each other, but also how frequent they are relative to the total volume of talk. Did Potter and Reicher have to sift through piles and piles of transcripts before coming up with the occasional nugget, or were community metaphors relatively common in talk about the riot? Did Van Dijk's (1987a) field workers have to plough through hours and hours of conversation before racial issues were introduced by their respondents, or was this one of the main themes when people were asked to talk about the neighbourhood? Even if Van Dijk's textual universe is limited to those extracts which deal with racial issues, there is the suspicion (particularly in the absence of a catch-all 'other' category) that the racial sub-themes he identifies may cover only part of a larger, less easily organised, domain of talk about race.
The danger of reification at this level of reporting compounds that found at the previous two: Not only is the implication that the identified phenomena stand out as definite topographical features above the general textual landscape, but that the text studied is itself a representative sample of some naturally demarcated region of discourse. Content analytic scales again provide the most obvious example. These usually report on the frequency of particular content categories as a ratio of the total text (measured either in words or phrases) produced by an individual - the unspoken assumption being that individuals are the source of meaning and that the natural fault lines in discourse run between individuals, rather than, for instance, between discourse situations, between different strata of social power or between different discourse communities.
What such quantitative adjuncts to qualitative research appear to have in common is that they tend to be introduced after the fact, as a means of minimally demonstrating the extent to which the illustrative examples provided represent a larger collection of similar instances. The proposed use of corpus linguistic techniques in discourse analysis is also concerned with representativeness, particularly at the third level discussed above (i.e., the prominence of the identified phenomena in the text as a whole), but rather than as a post-hoc check it is intended as a tool to help ensure from the outset that the overall features of the text are used as the ground against which more specific elements are selected for discussion. Thus quantitative information about the text is used to guide qualitative analysis, not to summarise qualitative information.
Corpus linguistic techniques adapted for discourse analysis
In this section I discuss specific ways in which corpus linguistic techniques can be applied as a precursor to qualitative discourse analysis. Seven kinds of techniques are discussed here: lemmatisation, manual mark-up, frequency counts, target-word collocations, collocation counts, contextual markup and lexical nets. Of these, the first four are essentially the same as those commonly used in corpus linguistics, as reviewed above, while the last three constitute elaborations on corpus linguistic techniques. The use of these techniques in actual analysis is demonstrated in the next two chapters.
The corpus linguistic technique of manually annotating texts using the COCOA (word COunt and COncordance on Atlas) markup scheme, the most commonly used standard (Butler, 1985), can be useful in preparing texts for further analysis. Although complex markup of syntax is probably of little use in discourse analysis, more basic annotations labelling particular sections of text are helpful for later isolating those sections for separate analysis. Markup is done by enclosing in angle brackets an identifier or category, followed by a space and then the actual identification. Some typical examples are: <ACT III>; <AUTHOR SHAKESPEARE>; <SOURCE BROADCAST>.
'Root form' and 'Parts of speech' lemmatisation
As discussed above, in principle any text consists not only of a one-dimensional string of orthographic markers, but also of a multi-dimensional set of implicit 'annotations'. One way in which textual analysis can be facilitated is by making such annotations explicit, thus identifying diverse linguistic forms as belonging to a smaller set of lemmas. While the overly psychologistic types of assumption involved in content analytic lemmatisation may be unacceptable from a discourse analytic perspective, more 'neutral' linguistic lemmatisation could be useful.
Although numerous exceptions and special cases had to be provided for, I found it possible to write a relatively simple computer program to strip away several word suffixes without altering the root form, with a success rate of over 90%. This applied both to suffixes with a largely grammatical function - the final 's' of most words (which converts nouns from plural to singular and verbs to the infinitive) and the present participle (-ing) and past participle (-ed) endings of verbs (which also convert to the infinitive) - and to suffixes which form nouns (-ness), adjectives (-able) and adverbs (-ly). Other suffixes which can in certain cases be stripped away are -al, -ance, -ment, -ive, and -ion.
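A bare-bones sketch of such a suffix stripper follows; it omits the exception handling the actual program required, so its error rate would be correspondingly higher:

```python
# Suffixes are tried longest-first so that e.g. '-ness' is found
# before '-s'. The list is the one given in the text above.
SUFFIXES = ["ness", "able", "ance", "ment", "ing", "ion",
            "ive", "ly", "ed", "al", "s"]

def strip_suffix(word):
    # Strip at most one suffix, and only if a stem of at least three
    # letters remains (a crude guard against over-stripping).
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([strip_suffix(w) for w in ["loves", "emotional", "actively"]])
# -> ['love', 'emotion', 'active']
```

A single pass handles most of the cases discussed above; repeated application would reduce actively further to act, but would also over-strip forms such as emotion, which is presumably where the program's special cases come in.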
The effect of stripping away suffixes is to greatly reduce the degree of (possibly spurious) variation in a text at the cost of giving up some finer distinctions between word forms. Thus act, active, actively, activist and activity would all be lemmatised as ACT; emotion, emotional and emotionally would become EMOTION; and sociable, sociably, social, socialise and socially would become SOCIAL. Apart from the information which is lost, this form of lemmatisation is of course also rather inconsistent. The past tense form of a regular verb, e.g. passed, would for instance be changed to the infinitive, while an irregular past tense form such as could remains as a separate type.
Root-form lemmatisation of this sort was, after much experimentation, not used in the studies reported on in the next two chapters on the grounds that it tended to obliterate too much of the rhetorical and stylistic texture of the texts. Once tokens were reduced to their root forms, there appeared to be little that could be done with the text other than counting of content categories as is done in computerised content analysis. There may however be types of text other than verbal transcripts for which this may be the most sensible option, and the relative ease with which the kind of automated root-form lemmatisation outlined above was possible, suggests that it may yet prove a useful aid in certain analyses.
Another form of lemmatisation, based on parts of speech, also proved practicable and was eventually used. The procedure is primarily aimed at differentiating grammatical from lexical words(11). Grammatical words identified by the program are articles (a, an and the), auxiliaries (can, could, have, has, had, may, might, must, should, would, ought and in some cases be, being, been, is, am, are, was, were, shall, will and do, did, doing, done, does), conjunctions (after, and, because, but, for, however, since, until, till, yet, while, although, as, moreover, either, so, only and also), prepositions (about, above, across, after, against, along, amid, around, at, before, behind, below, beneath, beside, between, beyond, by, down, except, for, from, in, inside, like, near, of, off, over, since, through, till, to, toward, under, until, up, upon and with) and pronouns (I, me, my, mine, myself, you, your, yours, yourself, yourselves, he, him, his, hers, himself, she, her, herself, it, its, itself, we, us, our, ours, ourselves, they, them, their, theirs, themselves, who, this, that, these, those, such, what, whose and which). In addition, a list of colloquial and contracted forms provided by Butler (1985) which may be considered as grammatical words are coded as such, supplemented with additional South African colloquial forms (Table 6.3). Ambiguous words are coded as grammatical rather than lexical.
Lexical words are semantically richer than grammatical words and a frequency list of lexical words provides a good initial overview of the basic content of a text.
Table 6.3 Grammatical and ambiguous words (adapted from Butler, 1985)
didn't, i've, not, there, why, no, or, very, all, that's, some, uhm, any, going, yes, one, more, when, other, can't, whether, then, want, nothing, go, there's, outside, anything, couldn't, everybody, less, wasn't, OK, haven't, ag, uh, something, weren't, if, anyway, where, you're, they've, shouldn't, within, whatever, i'd, i'll, both, everyone, isn't, that'll, what's, aren't, mmm, oneself, many, mustn't, put, than, they're, how, oh, few, everything, gonna, doesn't, having, should've, never, i'm, don't, here, it's, ja
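The grammatical/lexical distinction described above reduces to a set look-up; the following sketch includes only a small subset of the full word lists:

```python
# Subset only: the full program uses the complete lists of articles,
# auxiliaries, conjunctions, prepositions and pronouns given in the
# text, plus the colloquial and ambiguous forms in Table 6.3.
GRAMMATICAL = {
    "a", "an", "the",                    # articles
    "can", "could", "was", "is", "am",   # auxiliaries (subset)
    "and", "but", "because", "while",    # conjunctions (subset)
    "of", "in", "to", "with", "at",      # prepositions (subset)
    "i", "you", "it", "they", "my",      # pronouns (subset)
    "not", "don't", "it's", "ja",        # colloquial/ambiguous (subset)
}

def classify(token):
    # Ambiguous words are coded as grammatical rather than lexical.
    return "grammatical" if token.lower() in GRAMMATICAL else "lexical"

tokens = "I was unable to cope with the struggle".split()
print([t for t in tokens if classify(t) == "lexical"])
# -> ['unable', 'cope', 'struggle']
```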
Frequency counts
Frequency counts constitute the most obvious means of summarising a text and can be used in discourse analysis to provide an initial overview of the material prior to more detailed analysis. Counts can be given of all types, and of lexical types only. The former tend to provide information on stylistic and pragmatic features of the text, while the latter give an indication of the substantive topics addressed.
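Both kinds of count reduce to simple tallies. A minimal sketch (with an abbreviated stop-list of grammatical words and an invented example text):

```python
from collections import Counter

# Stand-in stop-list; the full analysis uses the complete set of
# grammatical words described in the previous section.
GRAMMATICAL = {"i", "was", "to", "and", "the", "a"}

text = "I was unable to cope and I was unable to sleep"
tokens = [t.lower() for t in text.split()]

all_types = Counter(tokens)                                  # all types
lexical_types = Counter(t for t in tokens if t not in GRAMMATICAL)
print(lexical_types.most_common(2))  # [('unable', 2), ('cope', 1)]
```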
Target-word collocations
These are lists of words which co-occur in close proximity to selected word types, and can be used to describe the general lexical environment in which high-frequency word types occur.
Collocation counts
These are sorted lists of highly collocated types and provide a somewhat more comprehensive idea of the patterns of redundancy which characterise a text. While target-word collocations require only that the collocational probability of a selected sample of types be computed, collocation counts are derived from a matrix of all possible collocations among word types in the text. This matrix is systematically examined, and the most highly related word pairs (calculated as described below) extracted. Although this would seem a natural extension to the idea of target-word collocations, I have not seen this method used in corpus studies. It is however ideal for use as a precursor to qualitative discourse analysis as it provides an overview of the entire text while imposing minimal assumptions about the types of themes or categories to be looked for.
Contextual markup
This technique is also derived from the idea of collocation, although again I have not seen it used in corpus linguistics. It involves the modification of transcripts based on the collocational redundancy of words as revealed by a matrix of all possible collocations. The markup reveals the extent to which the lexical environment of each word token in a transcript is typical of the word type's positioning in the text as a whole. The procedure is to systematically determine the positional 'typicalness' of each word token by computing the average strength of association (described below) between the corresponding word type and the word types in its immediate vicinity. Thus if the types unable and struggle often occur in the vicinity of cope, then cope will be marked as being in a highly typical situation in a text fragment such as: "I am unable to cope - I struggle to survive." Each of the other words in the fragment will be similarly marked, depending on the extent to which they find themselves in a lexical neighbourhood which is typical of their occurrence in the transcript as a whole.
Different forms of contextual markup have been developed for the studies presented in the next two chapters, the most sophisticated of which translates strength of association into font size. The program developed to perform this markup determines the font size of each token in a transcript and then re-encodes the token in Rich Text Format, a protocol which is recognised by most word processing and desktop publishing programs. However, the resultant text tends to be quite cumbersome with some words printed in very large font while others appear in a very small font, and this form of markup is therefore probably best used sparingly for illustrative purposes. A markup method which proved more practicable was simply to capitalise all words for which the strength of association (described below) with any word within a certain span of surrounding words exceeds a certain minimum level.
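The capitalisation variant of contextual markup can be sketched as follows; the z-scores here are invented toy values, whereas in the actual analyses they come from the full collocation matrix described below:

```python
# Invented toy association scores between word-type pairs.
Z = {("cope", "unable"): 4.1, ("cope", "struggle"): 3.8,
     ("unable", "struggle"): 3.0}

def z(a, b):
    # The stored scores are treated as symmetrical for this purpose.
    return Z.get((a, b)) or Z.get((b, a)) or 0.0

def mark_up(tokens, span=5, threshold=2.57):
    out = []
    for i, tok in enumerate(tokens):
        # The token's context: up to `span` words on either side.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        strong = any(z(tok, w) >= threshold for w in window)
        out.append(tok.upper() if strong else tok)
    return out

print(" ".join(mark_up("i am unable to cope i struggle to survive".split())))
# -> i am UNABLE to COPE i STRUGGLE to survive
```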
Lexical nets
Finally, lexical nets can be drawn using the information contained in the table of collocations. This is a newly developed method in which the collocational pattern in a text is graphically represented by manually drawing lines between pairs of words which are statistically related (as described below). Various forms of lexical net were experimented with, including ones in which the thickness or length of the lines between pairs of words reflected the strength of the relationship. However, this again proved impracticable and cumbersome, and in their final form lines of uniform thickness were drawn between all words where the strength of the relationship exceeded a certain minimum. The way in which reasonable values for this and other parameters were established is described in greater detail in Appendix 2. The length of the line between words is arbitrary and depends on the way in which words are arranged for each net so as to produce a coherent picture with as few lines crossing each other as possible.
Target-word collocations, collocation counts, contextual markup and lexical nets are all based on some indicator of the strength of association between words. The statistical index used in this study was the z-score (Miall, 1992; Bradley, 1990). The z-score was chosen as this is the most frequently used index of collocation in corpus linguistics. Other measures, such as Yule's Y or mutual information have however also been proposed (Church, Gale, Hanks, Hindle & Moon, 1994; Church & Hanks, 1990). Given the total number of words in a text, the 'span' of words considered to be a target word's typical context (e.g. 5 words on either side), the frequency with which the target word occurs in the text, and the frequency with which another word (or 'collocate') occurs within the target word's context, the z-score returns a coefficient of collocation which is significant at the 1% probability level when it reaches 2.57 or above.
The formula used in the program used for the analyses in the next two chapters was taken from Bradley (1990), and is computed as follows:
z = (C - (P x L)) / sqrt(P x L x (1 - P))
where C = the frequency with which a collocate occurs in the same context as a target word type;
L = the total number of word tokens in the same context as the target word (at most the span x the total frequency of the target word); and
P = the frequency with which the collocate occurs in the text as a whole divided by the total number of tokens in the text as a whole.
It is worth noting that the z-score is not symmetrical, so that the probability of word a occurring as a collocate of word b is not the same as the probability of word b occurring as a collocate of word a. This may be illustrated as follows: If a occurs 500 times in the text, b occurs 3 times and the two words occur in the same context 3 times, then a is a strong collocate of b (since it is present whenever b is present), but b is unlikely to be a strong collocate of a (since it is only present in 3 of 500 cases when a is present). In practice, however, the discrepancy between the two z-scores is usually small and where the direction of the association is not of importance the average is taken.
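Bradley's formula translates directly into code; the figures in the example below are invented for illustration:

```python
from math import sqrt

def z_score(C, L, P):
    """C: collocate frequency within the target word's contexts;
    L: total tokens falling within those contexts;
    P: collocate's relative frequency in the text as a whole."""
    return (C - P * L) / sqrt(P * L * (1 - P))

# Invented toy figures: a collocate seen 8 times among 200 context
# tokens, with an overall relative frequency of 1 in 100.
z = z_score(C=8, L=200, P=0.01)
print(round(z, 2), z >= 2.57)  # 4.26 True -> significant at the 1% level
```

Computing the score twice with the roles of target and collocate exchanged, and averaging, gives the symmetrical version used where the direction of the association does not matter.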
In order to generate a comprehensive overview of the collocational patterns in a text, a matrix of z-scores for each possible pair of the 1000 most common word types is computed, consisting of (1000 x 999) / 2 = 499500 (i.e. just less than half a million) individual z-scores. This matrix can be computed separately for each section of text analysed, and using various lengths for the 'span' of words constituting a context. This is described in greater detail in Appendix 2. While it would have been desirable to include all word types in the matrix, available computer resources did not allow for this. However, the 1000 most common types account for the majority of tokens in most texts.
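A toy version of such a matrix, computed here over the few most common types of a short text rather than the 1000 most common types of a corpus, might be sketched as follows:

```python
import math
from collections import Counter

def collocation_matrix(tokens, span=5, top_k=4):
    """Pairwise z-scores for the top_k most frequent word types.

    For every ordered pair of frequent types, co-occurrences within
    `span` tokens on either side of the target are counted and
    converted to a z-score.
    """
    n = len(tokens)
    freq = Counter(tokens)
    types = [w for w, _ in freq.most_common(top_k)]
    matrix = {}
    for target in types:
        # gather all tokens falling inside the target's context windows
        context, L = Counter(), 0
        for i in (i for i, t in enumerate(tokens) if t == target):
            for j in range(max(0, i - span), min(n, i + span + 1)):
                if j != i:
                    context[tokens[j]] += 1
                    L += 1
        for collocate in types:
            if collocate == target:
                continue
            P = freq[collocate] / n
            C = context[collocate]
            matrix[(target, collocate)] = (C - P * L) / math.sqrt(P * L * (1 - P))
    return matrix

tokens = "the cat sat on the mat the cat ate the rat".split()
matrix = collocation_matrix(tokens, span=2, top_k=2)
print(matrix)
```

Because the matrix holds both ordered directions of each pair, halving the number of cells (as in the unordered-pair count above) assumes the two directional scores are averaged.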
In this chapter an overview was provided of various techniques used in corpus linguistics and related fields, and ways were suggested in which these could be adapted for use with discourse analysis. The use suggested for corpus techniques in discourse analysis goes beyond a mere adaptation of techniques, however: it also encompasses a reversal of the usual quantitative-qualitative sequence, with quantitative techniques employed as a precursor to further qualitative analysis. This is demonstrated in the next two chapters.
1. Illustrating the differences in mentality between discourse analysis and corpus linguistics, Miller (1965) says of Zipf that "he was the kind of man who would take roses apart to count their petals" (p. v).
2. Engwall (1994) details criteria which should be used in collecting corpus material, placing particular emphasis on careful sampling from text category (e.g. literary, scholarly, newspapers, conversations), genre (e.g. imaginative prose, drama, scientific texts, dialogue), and period (e.g. diachronic or synchronic).
3. I use 'annotated' here interchangeably with 'coded' or 'tagged' to refer to any system for marking up text so as to identify features not visible in the surface orthography. As is typical of linguistic research in general, there is a proliferation of mark-up conventions, some of which are reviewed in Edwards and Lampert (1993). One of the oldest and most common formats is the COCOA format, which is similar to Standard Generalised Markup Language (SGML), the most widely accepted current convention and the one on which Hypertext Markup Language (HTML), the de facto standard for internet documents, is based (Johansson, 1994).
4. In this Longacre draws, of course, on Chomsky (1957). However, Chomsky's work on formal linguistic transformation rules, which relied heavily on contrived examples and artificially limited domains, contributed to the waning of interest in corpus linguistics during the 1960s and 1970s. This interest has only fully revived since the 1980s (Ide & Veronis, 1998).
5. This extract from Wright (1893) is based on what remains the standard concordance, compiled by Alexander Cruden (1700-71). Cruden's other notable publication, in 1739, is The London-Citizen Exceedingly Injured or A British Inquisition Display'd in an Account of the Unparallel'd Case of a Citizen of London, Bookseller to the late Queen, who was in a most unjust and arbitrary Manner sent on the 23rd March last by one Robert Wightman, a mere Stranger, to a Private Madhouse. Cruden was repeatedly confined in Bethlem and elsewhere for disruptive behaviour consequent upon religious fanaticism (Porter, 1987a).
6. Details of the tagging process are also omitted. The first step in an automated analysis is usually text normalisation (Brodda, 1991), i.e. the removal of 'noise' such as control characters and (in some cases) punctuation, the expansion of abbreviations, and (in some cases) the conversion of the text to upper case.
7. Lacan of course claimed that such quirks of language constituted not a problem of disambiguation, but a model for the workings of the unconscious.
8. However, in the conclusion to this dissertation I reconsider the implications of specifying the boundaries of textual data sets.
9. The reader may recognise that this and the next paragraph themselves contain phrases recycled from previous chapters of the dissertation, thus demonstrating the emergence of a discursive repertoire in the current text.
10. An example is Miles and Huberman's (1984) warning: "Avoid the 'sprinkling' of vivid or interesting examples to spice up the narrative. Rather, look for genuinely representative exemplars of the conclusions you are presenting" (p. 213).
11. Grammatical words are words such as articles, auxiliaries, prepositions and exclamations which seem to carry less semantic weight than, for instance, main verbs and nouns, which are lexical words (Butler, 1985). The proportion of lexical words in a text is termed the 'lexical density'.
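The text normalisation step described in the note on tagging above might be sketched as follows; the abbreviation table is a hypothetical example, not a list used in the study:

```python
import re

# Hypothetical abbreviation table; a real analysis would use a fuller list
ABBREVIATIONS = {"e.g.": "for example", "i.e.": "that is"}

def normalise(text):
    # Expand abbreviations first, while their internal full stops survive
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    text = re.sub(r"[\x00-\x1f]", " ", text)  # strip control characters
    text = re.sub(r"[^\w\s]", "", text)       # (in some cases) drop punctuation
    return re.sub(r"\s+", " ", text).strip().upper()

print(normalise("Noise,\te.g.\x07 control codes."))
```

The ordering matters: expanding abbreviations before stripping punctuation prevents 'e.g.' from being mangled into 'eg' and then left unexpanded.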