SMART TAMIL: A DIALECT-AWARE SMALL LANGUAGE MODEL FOR TAMIL NLP
Department of CS & IT
Smart Tamil is a dialect-sensitive language system that aims to capture the diversity and richness of Tamil as it is spoken across the dialects of Tamil Nadu. Most language systems deployed at scale do not explicitly account for dialect differences, so their output is grammatically correct but reads as non-idiomatic. To address this, the Small Language Model (SLM) of Smart Tamil will be trained on small-corpus spoken data, heterogeneous written data, and video data to capture the dialect variations and spoken styles of the five major dialect zones of Tamil Nadu: Kongu Tamil (Coimbatore/Erode), Nellai Tamil (Tirunelveli/Thoothukudi), Kanyakumari Tamil, Central Tamil (Trichy/Thanjavur), and the Urban Tamil of Chennai. The Smart Tamil system has been built as a full-stack React + Flask application with built-in speech synthesis and speech recognition through the Web Speech API.
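As a rough illustration of how the five dialect zones named in the abstract might be represented on the backend, the sketch below defines a small registry and a helper that validates a dialect choice before a request is passed on to the model. It is only a sketch under stated assumptions: the short keys (`kongu`, `nellai`, and so on), the `DialectZone` class, and the `build_request` function are hypothetical names chosen for illustration and do not come from the paper, which does not describe its internal data structures or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialectZone:
    key: str        # hypothetical short identifier, not taken from the paper
    name: str       # dialect name as listed in the abstract
    regions: tuple  # places the abstract associates with the dialect


# The five zones listed in the abstract. Keys are illustrative assumptions;
# for Kanyakumari the region is simply the one implied by the dialect name.
DIALECT_ZONES = {
    zone.key: zone
    for zone in (
        DialectZone("kongu", "Kongu Tamil", ("Coimbatore", "Erode")),
        DialectZone("nellai", "Nellai Tamil", ("Tirunelveli", "Thoothukudi")),
        DialectZone("kanyakumari", "Kanyakumari Tamil", ("Kanyakumari",)),
        DialectZone("central", "Central Tamil", ("Trichy", "Thanjavur")),
        DialectZone("chennai", "Urban Tamil of Chennai", ("Chennai",)),
    )
}


def build_request(text: str, dialect: str) -> dict:
    """Validate the inputs and return a JSON-serialisable request payload.

    Hypothetical helper: the payload field names are assumptions.
    """
    key = dialect.strip().lower()
    if key not in DIALECT_ZONES:
        raise ValueError(f"unknown dialect zone: {dialect!r}")
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text must not be empty")
    return {
        "text": cleaned,
        "dialect": key,
        "dialect_name": DIALECT_ZONES[key].name,
    }
```

A front end could send the user's text together with one of these keys, and the server would reject anything outside the five zones before any model call is made. Keeping the zone list in one registry means that adding or renaming a zone touches a single place.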
K, G., R, K. K. & R, T. (2026). Smart Tamil: A Dialect-Aware Small Language Model for Tamil NLP. International Journal of Science, Strategic Management and Technology, 02(05). https://doi.org/10.55041/ijsmt.v2i5.046
K, Gokul, et al. "Smart Tamil: A Dialect-Aware Small Language Model for Tamil NLP." International Journal of Science, Strategic Management and Technology, vol. 02, no. 05, 2026. doi:10.55041/ijsmt.v2i5.046.
K, Gokul, Kishore R, and Tholkappiyan R. "Smart Tamil: A Dialect-Aware Small Language Model for Tamil NLP." International Journal of Science, Strategic Management and Technology 02, no. 05 (2026). https://doi.org/10.55041/ijsmt.v2i5.046.