A Detailed Survey on Topic Modeling for Document and Short Text Data

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2019
Authors:
S. Likhitha, B. S. Harish, H. M. Keerthi Kumar
10.5120/ijca2019919265

S. Likhitha, B. S. Harish and H. M. Keerthi Kumar. A Detailed Survey on Topic Modeling for Document and Short Text Data. International Journal of Computer Applications 178(39):1-9, August 2019.

BibTeX

@article{10.5120/ijca2019919265,
	author = {S. Likhitha and B. S. Harish and H. M. Keerthi Kumar},
	title = {A Detailed Survey on Topic Modeling for Document and Short Text Data},
	journal = {International Journal of Computer Applications},
	issue_date = {August 2019},
	volume = {178},
	number = {39},
	month = {Aug},
	year = {2019},
	issn = {0975-8887},
	pages = {1-9},
	numpages = {9},
	url = {http://www.ijcaonline.org/archives/volume178/number39/30790-2019919265},
	doi = {10.5120/ijca2019919265},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

Text mining is one of the most significant fields in the digital era due to the rapid growth of textual information. Topic models have gained popularity in the last few years. A topic comprises a group of words that frequently occur together. Topic models are effective techniques for extracting the semantic knowledge present in the data. The methods used for topic modeling include LSA (Latent Semantic Analysis), PLSA (Probabilistic Latent Semantic Analysis), and LDA (Latent Dirichlet Allocation). These methods have gained popularity for extracting hidden themes from a document collection (corpus). Various topic modeling algorithms have been developed to query, summarize, and extract the hidden semantic structure of large corpora. In this paper, we present a detailed survey covering the various topic modeling techniques proposed in the last decade. Additionally, we focus on different strategies for extracting topics from social media text, where the goal is to find and aggregate the topics within short texts. Further, we summarize the various applications and the quantitative evaluation of the various methods, with statistical and mathematical knowledge to predict the convergence of results.
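
As a rough illustration of the idea that a topic is a group of words that often occur together, the following minimal sketch counts document-level word co-occurrences in a toy corpus and groups words that are linked by frequent pairs. It is not LSA, PLSA, or LDA, and it is not taken from the paper; the corpus, the function names (cooccurrence_counts, group_words), and the min_count threshold are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): two documents about sport, two about baking.
docs = [
    "team wins match goal",
    "goal match team coach",
    "recipe oven bake flour",
    "flour bake recipe sugar",
]


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, how many documents contain both words."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.split()))
        counts.update(combinations(words, 2))
    return counts


def group_words(documents, min_count=2):
    """Group words connected by pairs that co-occur in at least min_count documents."""
    parent = {}

    def find(word):
        # Union-find lookup with path halving.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for (a, b), n in cooccurrence_counts(documents).items():
        if n >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return sorted(sorted(group) for group in groups.values())


topics = group_words(docs)
print(topics)
```

Words that appear in only one document (such as "coach" or "sugar") never reach the threshold and are left out of every group. Probabilistic models such as PLSA and LDA replace this hard grouping with per-topic word distributions, so a word can contribute to several topics with different weights.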

Keywords

Text Mining, Topic Modeling, Short Text, Semantic Analysis.