Authorship Attribution using Rough Sets based Feature Selection Techniques

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2016
Authors:
Ignatius Ikechukwu Ayogu, Victor Akinbola Olutayo
DOI: 10.5120/ijca2016911889

Ignatius Ikechukwu Ayogu and Victor Akinbola Olutayo. Authorship Attribution using Rough Sets based Feature Selection Techniques. International Journal of Computer Applications 152(6):38-46, October 2016.

BibTeX

@article{10.5120/ijca2016911889,
	author = {Ignatius Ikechukwu Ayogu and Victor Akinbola Olutayo},
	title = {Authorship Attribution using Rough Sets based Feature Selection Techniques},
	journal = {International Journal of Computer Applications},
	issue_date = {October 2016},
	volume = {152},
	number = {6},
	month = {Oct},
	year = {2016},
	issn = {0975-8887},
	pages = {38-46},
	numpages = {9},
	url = {http://www.ijcaonline.org/archives/volume152/number6/26327-2016911889},
	doi = {10.5120/ijca2016911889},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

This paper presents an investigation into the usefulness of rough set theory for authorship attribution based on writing style. The problem was set up as a standard supervised machine learning problem. The rough set based feature subset computation techniques reduced the dimensionality of the feature space from 346 conditional attributes to an average of 8 features. Experiments were performed using five different subsets of the original attributes computed using rough set techniques, and the results show that these techniques improved the performance of the neural network (NN) and Support Vector Machine (SVM) models. The overall classification accuracy increased from 8.712% on the baseline data to 50.505% for the NN model and from 7.197% to 28.662% for the SVM model. The improvements over the baseline model are evident across all other performance metrics used. However, the NN model generally performed better than the SVM model.
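
The abstract does not say which reduct algorithm was used, so the following is only an illustrative sketch of how a rough set based feature subset can be computed. It assumes a greedy QuickReduct-style search driven by the dependency degree (the fraction of objects whose equivalence class under the selected attributes contains a single decision label). The function names and the toy decision table are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Dependency degree: share of rows lying in the positive region."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        # A block is in the positive region if all its rows share one label.
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def quick_reduct(rows, labels):
    """Greedily add the attribute that raises the dependency degree most.

    Assumption: this is a generic QuickReduct-style heuristic, not
    necessarily the procedure used by the authors.
    """
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    best = dependency(rows, labels, reduct)
    while best < target:
        candidates = [a for a in all_attrs if a not in reduct]
        scores = {a: dependency(rows, labels, reduct + [a]) for a in candidates}
        # Ties are broken in favour of the lowest attribute index.
        choice = max(candidates, key=lambda a: scores[a])
        reduct.append(choice)
        best = scores[choice]
    return reduct


# Hypothetical toy table: four discretised style features, three authors.
rows = [
    (0, 0, 1, 0),
    (0, 1, 1, 0),
    (1, 0, 0, 0),
    (1, 1, 0, 1),
    (1, 0, 1, 0),
    (1, 1, 1, 1),
]
labels = ["A", "A", "B", "B", "C", "C"]

selected = quick_reduct(rows, labels)
print("selected attributes:", selected)
```

On this toy table the search keeps attributes 0 and 2, which together separate the three authors as well as all four attributes do. The reduced columns would then be passed to the NN or SVM classifier in place of the full attribute set.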

Keywords

Stylometry, Feature Selection, Neural Networks, Support Vector Machines, Supervised Machine Learning.