Digital corpus of Santali language

Author
Keywords
Abstract

In corpus preparation, part-of-speech (POS) tagging adds POS information to the corpus in the form of tags. The tagset includes categories such as noun, pronoun, verb, adjective, adverb, preposition and conjunction. The literature shows a lack of corpora for the Santali language. In this paper we create and describe a Santali language corpus using the Sketch Engine corpus query tool, which is useful for linguistic research. We report statistical information about the Santali corpus, such as the number of words, tokens and sentences, and lexicon sizes such as the number of words, tags, lemmas and lempos. We also present the part-of-speech tags and the lemmatized corpus in terms of noun, numeral, preposition, pronoun, verb, adjective, adverb and conjunction. We have added 590,314 tokens, 425,238 words and 63,199 sentences to our Santali corpus. © 2017 IEEE.

Year of Conference
2017
Conference Name
2017 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2017
Volume
2017-January
Number of Pages
934-938
Publisher
Institute of Electrical and Electronics Engineers Inc.
ISBN Number
978-150906367-3 (ISBN)
DOI
10.1109/ICACCI.2017.8125961
Conference Proceedings
Citations
5