Digital corpus of Santali language

Author	M.A.K. Akhtar, G. Sahoo, M. Kumar
Keywords	CSE
Abstract	In corpus preparation, part-of-speech (POS) tagging adds POS information to the corpus in the form of tags such as noun, pronoun, verb, adjective, adverb, preposition and conjunction. The literature shows a lack of corpora for the Santali language. In this paper we create and describe a Santali language corpus using the Sketch Engine corpus query tool, which is useful for linguistic research. We report statistical information about the corpus, such as the number of words, tokens and sentences, and lexicon sizes, such as the number of words, tags, lemmas and lempos. We also show the part-of-speech tags and the lemmatized corpus in terms of noun, numeral, preposition, pronoun, verb, adjective, adverb and conjunction. The corpus contains 590,314 tokens, 425,238 words and 63,199 sentences. © 2017 IEEE.
Year of Conference	2017
Conference Name	2017 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2017
Volume	2017-January
Number of Pages	934-938
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN Number	978-150906367-3
DOI	10.1109/ICACCI.2017.8125961
	Conference Proceedings
Citations	5