Research Article

Authorship Attribution based on Data Compression for Telugu Text

by S. Nagaprasad, P. Vijayapal Reddy, A. Vinaya Babu
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 110 - Number 1
Year of Publication: 2015
DOI: 10.5120/19277-0686

S. Nagaprasad, P. Vijayapal Reddy, A. Vinaya Babu. Authorship Attribution based on Data Compression for Telugu Text. International Journal of Computer Applications. 110, 1 (January 2015), 1-5. DOI=10.5120/19277-0686

@article{ 10.5120/19277-0686,
author = { S. Nagaprasad and P. Vijayapal Reddy and A. Vinaya Babu },
title = { Authorship Attribution based on Data Compression for Telugu Text },
journal = { International Journal of Computer Applications },
issue_date = { January 2015 },
volume = { 110 },
number = { 1 },
month = { January },
year = { 2015 },
issn = { 0975-8887 },
pages = { 1-5 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume110/number1/19277-0686/ },
doi = { 10.5120/19277-0686 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:45:12.725372+05:30
%A S. Nagaprasad
%A P. Vijayapal Reddy
%A A. Vinaya Babu
%T Authorship Attribution based on Data Compression for Telugu Text
%J International Journal of Computer Applications
%@ 0975-8887
%V 110
%N 1
%P 1-5
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Authorship attribution (AA) can be defined as the task of inferring characteristics of a document's author from the textual characteristics of the document itself. In this paper we evaluate compression-based models for AA on Telugu text. We consider six compressors (Zip, BZip, GZip, LZW, PPM and PPMd), each combined with three compression distance measures: Normalized Compression Distance (NCD), Compression Dissimilarity Measure (CDM) and Conditional Complexity of Compression (CCC). The results show that compression models are a good alternative to classification models built on hand-selected features for authorship attribution.
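
As a rough illustration of how the three distance measures can be computed, the Python sketch below uses their commonly cited formulations; the exact definitions used in the paper may differ. Python's zlib and bz2 modules stand in for the GZip and BZip compressors. The function names, the toy English profiles and the nearest-profile decision rule are all assumptions made for illustration, not the authors' implementation. Telugu text would be handled the same way after encoding it to UTF-8 bytes.

```python
import bz2
import zlib
from typing import Callable, Dict

Compressor = Callable[[bytes], bytes]


def csize(data: bytes, compress: Compressor = zlib.compress) -> int:
    """Compressed length in bytes, used as a proxy for the complexity C(data)."""
    return len(compress(data))


def ncd(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = csize(x, compress), csize(y, compress)
    return (csize(x + y, compress) - min(cx, cy)) / max(cx, cy)


def cdm(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> float:
    """Compression Dissimilarity Measure: C(xy) / (C(x) + C(y))."""
    return csize(x + y, compress) / (csize(x, compress) + csize(y, compress))


def ccc(x: bytes, y: bytes, compress: Compressor = zlib.compress) -> int:
    """Conditional Complexity of Compression: C(xy) - C(x),
    the extra bytes needed to encode y once x has been seen."""
    return csize(x + y, compress) - csize(x, compress)


def attribute(unknown: bytes, profiles: Dict[str, bytes],
              measure=ncd, compress: Compressor = zlib.compress) -> str:
    """Return the author whose profile text is closest to the unknown text."""
    return min(profiles,
               key=lambda author: measure(profiles[author], unknown, compress))


# Hypothetical toy data: one concatenated training text per candidate author.
profiles = {
    "author_a": ("the quick brown fox jumps over the lazy dog " * 20).encode("utf-8"),
    "author_b": ("lorem ipsum dolor sit amet consectetur adipiscing elit " * 20).encode("utf-8"),
}
unknown = ("the lazy dog jumps over the quick brown fox " * 5).encode("utf-8")

for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    for measure in (ncd, cdm, ccc):
        print(name, measure.__name__, attribute(unknown, profiles, measure, compress))
```

For all three measures a smaller value means the unknown text compresses well alongside the author's profile, so the sketch picks the author with the minimum value. Passing the compressor as a parameter makes it possible to compare compressors under the same measure, which mirrors the compressor and measure combinations described in the abstract.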

Index Terms

Computer Science
Information Sciences

Keywords

Authorship attribution, Compressors, Compression distance measures, Macro-average, Micro-average, Accuracy, Telugu data set