Collections and Corpora

Gathering language materials and linguistic data through work with speech communities

Introduction

Corpora are collections of written text and transcribed speech compiled during language documentation projects and research. Proper organisation and archiving of corpora improve data collection, academic integrity, preservation efforts and — importantly — accessibility, so that materials produced by CoEDL members are available to the speakers and communities, including those that assisted this research.

Corpus collection and management were critical research priorities for CoEDL and central to the Centre’s Archiving program. Coordinated by CI Nick Thieberger, Data Manager Julia Miller and Corpus Manager Wolfgang Barth, the program provided Centre members with training and guidance to ensure that material created in the course of Centre work was managed and archived effectively and responsibly [1].

CoEDL saw this work as part of its responsibility to communities, an acknowledgement of the colonial extraction of information that characterised much fieldwork-based research in the past, and a way to influence and promote research methods that are both more collaborative and more rigorous. There is also an important human element in corpora formation and circulation, as many in the community may be able to locate recordings of grandparents and other family members.

The three corpora introduced below demonstrate the time and effort required to collect and properly document a corpus as well as the impact the collection can have for both academic work and members of the communities CoEDL collaborated with.

Gurindji Kriol

The Gurindji Kriol corpus, which CI Felicity Meakins deposited into PARADISEC in 2021 with the help of Data Manager Julia Miller, is the largest digital corpus of an Australian First Nations language in the CoEDL Corpus Collection. Gurindji Kriol is a mixed language spoken in northern Australia; it combines the Kriol verb phrase with the Gurindji noun phrase and a relatively even mix of vocabulary. For younger generations of Gurindji people who speak Gurindji Kriol, this language represents the continuity of their local identity as well as their changing world.

The corpus consists of 165 hours of fully transcribed, translated and annotated recordings of 157 Gurindji adults and children. The corpus was jointly created with Cassandra Algy (Gurindji community linguist) and Sasha Wilmoth (corpus linguist); many UQ Summer Research Program students also contributed to its development through transcription and coding.

Already this corpus has made important contributions to different fields of linguistics. Previously linguists had believed that new languages such as Gurindji Kriol, born from contact between two languages, required special mechanisms to develop. However, Felicity and CoEDL Affiliate Patrick McConvell have shown that Gurindji Kriol developed from unremarkable contact processes like code-switching, which occurs when a multilingual speaker alternates between two or more languages in the course of one conversation.

Similarly, it has long been held that grammar simplifies when a language comes into contact with other languages. Using innovative population genetics methods in joint work with Affiliates Lindell Bromham and Xia Hua, the team showed that this generalisation does not apply to all language contact situations. Felicity and her collaborators have also used data from the Gurindji Kriol corpus to challenge neo-Whorfian underpinnings that link language and cognition.

Gurindji Kriol is now a case study language in many linguistics textbooks, for example Linguistics: An Introduction (Bloomsbury, McGregor, 2015), For the Love of Language: An Introduction to Linguistics (Cambridge, Burridge & Stebbins, 2015) and Linguistic Fieldwork: A Practical Guide (Palgrave, Bowern, 2015).

Ku Waru

In 2021, the English-translated Ku Waru child language corpus, compiled by a team led by CI Alan Rumsey and prepared for archiving by Corpus Manager Wolfgang Barth, became a significant addition to the CoEDL corpus collection. Containing over 1.3 million transcribed words, it is the Centre’s largest English-translated Indigenous language corpus, and one of the largest for any Pacific language.

Ku Waru is a language of the Western Highlands Province of Papua New Guinea; it is actively spoken by about 10,000 people and still learned as a first language by children. Younger generations of Ku Waru speakers also speak Tok Pisin, a largely English-based creole and one of Papua New Guinea’s national languages.

While Alan has worked with the Ku Waru community since the early 1980s, all of the CoEDL-archived corpus was gathered for the Ku Waru Child Language Socialisation Study (KWCLSS). This study ran from 2013 to 2016, supported by ARC Discovery Project funding. Associate Investigator Francesca Merlan and several CoEDL PhD students, along with field assistants John Onga and Andrew Noma (both members of the Ku Waru community), worked with Alan on KWCLSS, which explored how children learn language and whether and how they are socialised to particular behaviours and ways of life as they acquire language.

To answer these questions, the study took a longitudinal approach. It followed five children, ranging in age from 20 months to five years old, as they learned Ku Waru and Tok Pisin. John and Andrew filmed each child for one hour per month as the child interacted with parents and other people. The field assistants then transcribed the recordings and translated the Ku Waru transcriptions into English. This work was recorded in hundreds of notebooks. Partner institution Appen scanned the notebooks and created typed transcripts before the material passed to the research team for analysis.

The full corpus contains over 2.5 million words in Ku Waru and Tok Pisin, spanning 364 sessions. Now that the corpus is available online, words and phrases can be searched and word frequency can be compared for each speaker, age group or gender.
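For readers who want a sense of what a per-speaker frequency comparison involves, the following is a minimal Python sketch. It is not the CoEDL search interface or its data format: the speaker labels, the placeholder tokens and the `word_frequencies` helper are all hypothetical, and real transcripts would need a tokeniser suited to their transcription conventions.

```python
from collections import Counter, defaultdict

# Hypothetical records: (speaker label, transcribed utterance).
# The tokens are placeholders, not real Ku Waru or Tok Pisin words.
utterances = [
    ("child_1", "wordA wordB wordA"),
    ("adult_1", "wordB wordC"),
    ("child_1", "wordC wordA"),
]


def word_frequencies(records):
    """Return a mapping from speaker label to a Counter of token counts."""
    freqs = defaultdict(Counter)
    for speaker, text in records:
        # Lowercase and split on whitespace (an assumption made for
        # illustration; it ignores punctuation and annotation markup).
        freqs[speaker].update(text.lower().split())
    return dict(freqs)


frequencies = word_frequencies(utterances)
for speaker, counts in sorted(frequencies.items()):
    # Show the two most frequent tokens for each speaker.
    print(speaker, counts.most_common(2))
```

Grouping by age group or gender instead would only require replacing the speaker label in each record with the relevant category.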

The selection of transcripts archived in the CoEDL Corpus Collection is available here, while there are also files archived with PARADISEC and with the Language, Acquisition, Diversity Lab (ACQDIV) at the University of Zurich.

Bislama

Efforts under the CoEDL Corpus Project resulted in what is one of the largest collections of text for any Pacific language when the Bislama corpus exceeded two million words in 2020.

Work to assemble a corpus for Bislama accelerated in 2018 when CoEDL received funding from the Defence Science and Technology Group to build a corpus for the Vanuatu language. The project subsequently hired Ricky Taleo as a collection assistant based in Port Vila, Vanuatu. Ricky helped to transcribe and organise over 30 hours of spoken Bislama, turning up some surprisingly significant files in the process. One recording featured a two-hour-long interview with Jimmy Stevens, the leader of a rebellion movement against Vanuatu independence in the 1980s.

Gems like this continue to surface as PARADISEC gains access to linguistic and musicological records. In this respect, notes CoEDL CI Nick Thieberger, the Centre has been instrumental not only in training new researchers to properly format primary material and add these materials to the PARADISEC archive, but also in providing resources to acquire and digitise vulnerable collections. Materials like cassette tapes and reel-to-reel recordings are subject to severe deterioration. With CoEDL’s help, PARADISEC has been able to digitise such recordings before it is too late, collaborating with museums and collections from around the world.

Further information

In addition to supporting the compilation of corpora and collections, the CoEDL Archiving Thread did important work in repatriation and preservation; read more here.

To learn more about these and other CoEDL members, explore the Languages subset of Connections data in map or list form.

Captions

Hero image: Language documentation notebooks. Image: CoEDL.

Image 1: Cassandra Algy Nimarra and Felicity Meakins record director-matcher tasks with Jamieisha Barry Nangala, Regina Crowson Nangari and Quitayah Frith Namija (Image: Jennifer Green, 2017).

Image 2: Alan Rumsey with the Ku Waru community. Image: Alan Rumsey.

Image 3: Nick Thieberger in a meeting about the Bislama corpus. Image: Nick Thieberger/Robert Early.

References

[1] CoEDL Data Manager Julia Miller produced several guides on the principles and good practices of recording, managing and archiving data. These are available here.

Alan Rumsey (collector), 1983. Western Highlands of PNG recordings. Collection AR1 at catalog.paradisec.org.au [Closed Access]. https://dx.doi.org/10.4225/72/56E823B109D31

Hua, Xia, Meakins, Felicity, Algy, Cassandra, & Bromham, Lindell. (2022). Language change in multidimensional space: New methods for modelling linguistic coherence. Language Dynamics and Change, 12, 78-123.

McConvell, Patrick, & Meakins, Felicity. (2005). Gurindji Kriol: A mixed language emerges from code-switching. Australian Journal of Linguistics, 25(1), 9-30.

Meakins, Felicity. (2016). No fixed address: The grammaticalisation of the Gurindji locative as a progressive suffix. In F. Meakins & C. O'Shannessy (Eds.), Loss and Renewal: Australian Languages Since Colonisation (pp. 367-396). Berlin: Mouton de Gruyter.

Meakins, Felicity, & Algy, Cassandra. (2016). Deadly reckoning: Changes in Gurindji children's knowledge of cardinals. Australian Journal of Linguistics, 36(4), 479-501.

Meakins, Felicity, Hua, Xia, Algy, Cassandra, & Bromham, Lindell. (2019). The birth of a new language does not favour simplification. Language, 95(2), 294-332.

Meakins, Felicity, Jones, Caroline, & Algy, Cassandra. (2016). Bilingualism, language shift and the corresponding expansion of spatial cognitive systems. Language Sciences, 54, 1-13.

Meakins, Felicity, & Wilmoth, Sasha. (2020). Overabundance resulting from language contact: Complex cell-mates in Gurindji Kriol. In P. Arkadiev & F. Gardani (Eds.), The complexities of morphology (pp. 81-104). Oxford: Oxford University Press.

Thieberger, Nick. Bislama Corpus v03. In: Barth, Wolfgang (ed.). CoEDL Corpus Collection, https://go.coedl.net/bislama_corpus, accessed 10.08.2023.

We acknowledge all Aboriginal and Torres Strait Islander Traditional Custodians of Country and recognise their continuing connection to land, sea, culture and community. We pay our respects to Elders past and present. Aboriginal and Torres Strait Islander people should be aware that this website may contain images, voices and names of deceased persons.

The ARC Centre of Excellence for the Dynamics of Language was funded by the Australian Research Council (CE140100041), The Australian National University, The University of Melbourne, The University of Queensland and Western Sydney University.