Digital corpus of Santali language
Author | |
---|---|
Keywords | |
Abstract |
In corpus preparation, part-of-speech (POS) tagging adds POS information to the corpus in the form of tags. The tag set includes categories such as noun, pronoun, verb, adjective, adverb, preposition, and conjunction. The literature shows a lack of corpora for the Santali language. In this paper we create and describe a Santali language corpus using the Sketch Engine corpus query tool, which is useful for linguistic research. We report statistical information about the corpus, such as the number of words, tokens, and sentences, and lexicon sizes, such as the number of words, tags, lemmas, and lempos (lemma plus POS). We also show the POS-tagged and lemmatized corpus in terms of noun, numeral, preposition, pronoun, verb, adjective, adverb, and conjunction. The corpus contains 590,314 tokens, 425,238 words, and 63,199 sentences. © 2017 IEEE. |
Year of Conference |
2017
|
Conference Name |
2017 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2017
|
Volume |
2017-January
|
Number of Pages |
934-938
|
Publisher |
Institute of Electrical and Electronics Engineers Inc.
|
ISBN Number |
978-150906367-3 (ISBN)
|
DOI |
10.1109/ICACCI.2017.8125961
|
Conference Proceedings
|
|
Citations |
5
|