Research Article

Automating Corpora Generation with Semantic Cleaning and Tagging of Tweets for Multi-dimensional Social Media Analytics

by Nazura Javed, Muralidhara B.L.
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 127 - Number 12
Year of Publication: 2015

Nazura Javed, Muralidhara B.L. Automating Corpora Generation with Semantic Cleaning and Tagging of Tweets for Multi-dimensional Social Media Analytics. International Journal of Computer Applications. 127, 12 (October 2015), 11-16. DOI=10.5120/ijca2015906548

@article{ 10.5120/ijca2015906548,
author = { Nazura Javed and Muralidhara B.L. },
title = { Automating Corpora Generation with Semantic Cleaning and Tagging of Tweets for Multi-dimensional Social Media Analytics },
journal = { International Journal of Computer Applications },
issue_date = { October 2015 },
volume = { 127 },
number = { 12 },
month = { October },
year = { 2015 },
issn = { 0975-8887 },
pages = { 11-16 },
numpages = {9},
url = { },
doi = { 10.5120/ijca2015906548 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T23:19:42.513050+05:30
%A Nazura Javed
%A Muralidhara B.L.
%T Automating Corpora Generation with Semantic Cleaning and Tagging of Tweets for Multi-dimensional Social Media Analytics
%J International Journal of Computer Applications
%@ 0975-8887
%V 127
%N 12
%P 11-16
%D 2015
%I Foundation of Computer Science (FCS), NY, USA

Developing corpora from social media content involves convoluted cleaning. In this paper we propose and implement the automation of corpus building to facilitate social media mining and analytics. The automation process incorporates: a) metadata extraction and structuring, b) semantic cleaning with tagging, and c) learning domain terms/entities. The implementation performs comprehensive cleaning, including abbreviation and slang correction, phonetic matching using the Metaphone algorithm, splitting joined words, and identifying/learning entities. It identifies the entities, tags them, and creates/updates a knowledgebase (KB) comprising domain terms. The corpus thus constructed facilitates multidimensional analysis and summarization. The proposed technique was tested in an experiment in which real-world streaming tweets pertaining to Indian politics were collected, structured, cleaned, and tagged. The results of the experiment are as follows: a) The tweets, although primarily in English, at times contained words from regional languages. The algorithm does not recognize these words and construes them as domain terms. An accuracy of 85.55% was achieved in identifying the correct domain terms and entities. b) The automation required human feedback and intervention, which progressively reduced to 18% as the KB was updated and enhanced. This paper is relevant because the implementation automates the entire process of collecting and cleaning tweets and yields a corpus suitable for multi-faceted analysis.
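
The cleaning and tagging steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SLANG` and `VOCAB` lexicons, the seed `knowledge_base`, the tag names, and all function names are hypothetical, and `phonetic_key` is a crude consonant-skeleton key used as a simplified stand-in for the Metaphone algorithm. The human feedback step described in the abstract is omitted, so every unrecognized token is simply recorded as a new domain term.

```python
import re

# Hypothetical toy resources; the paper's actual lexicons and KB are not shown here.
SLANG = {"u": "you", "gr8": "great", "pls": "please", "govt": "government"}
VOCAB = {"you", "great", "please", "government", "vote", "for", "change",
         "the", "is", "election"}
knowledge_base = {"modi"}  # domain terms/entities learned so far


def phonetic_key(word):
    """Crude consonant-skeleton key; a simplified stand-in for Metaphone."""
    word = word.lower()
    if not word:
        return ""
    # Keep the first letter, drop later vowels, collapse repeated letters.
    key = word[0] + re.sub(r"[aeiou]", "", word[1:])
    return re.sub(r"(.)\1+", r"\1", key)


# Map each phonetic key to a vocabulary word (sorted for determinism).
PHONETIC_INDEX = {phonetic_key(w): w for w in sorted(VOCAB)}


def split_joined(token):
    """Split a run-together token into vocabulary words, or return None."""
    if not token:
        return []
    for i in range(len(token), 0, -1):  # try the longest prefix first
        head = token[:i]
        if head in VOCAB:
            rest = split_joined(token[i:])
            if rest is not None:
                return [head] + rest
    return None


def clean_tweet(text):
    """Return (token, tag) pairs; unknown tokens are added to the KB."""
    result = []
    for raw in re.findall(r"[#@]?\w+", text.lower()):
        word = raw.lstrip("#@")
        if word in SLANG:                       # abbreviation/slang correction
            result.append((SLANG[word], "WORD"))
        elif word in VOCAB:
            result.append((word, "WORD"))
        elif word in knowledge_base:            # already-learned entity
            result.append((word, "ENTITY"))
        elif (parts := split_joined(word)):     # joined words, e.g. hashtags
            result.extend((p, "WORD") for p in parts)
        elif phonetic_key(word) in PHONETIC_INDEX:  # phonetic matching
            result.append((PHONETIC_INDEX[phonetic_key(word)], "WORD"))
        else:                                   # construed as a domain term
            knowledge_base.add(word)
            result.append((word, "NEW_TERM"))
    return result


if __name__ == "__main__":
    for token, tag in clean_tweet("Pls #VoteForChange Modi is gr8 for the chnge Rahul"):
        print(token, tag)
```

In this sketch a word from a regional language would fall through to the last branch and be stored as a domain term, which mirrors the limitation reported in the abstract.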

Index Terms

Computer Science
Information Sciences


Corpora, Tweets, Social Media Mining, Analytics, Knowledgebase