Classifying Short Text in Social Media: Twitter as Case Study

Faris Kateb; Jugal Kalita

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 111 - Number 9

Year of Publication: 2015

Authors: Faris Kateb, Jugal Kalita

10.5120/19563-1321

Faris Kateb, Jugal Kalita. Classifying Short Text in Social Media: Twitter as Case Study. International Journal of Computer Applications. 111, 9 (February 2015), 1-12. DOI=10.5120/19563-1321

@article{10.5120/19563-1321,
  author = {Faris Kateb and Jugal Kalita},
  title = {Classifying Short Text in Social Media: Twitter as Case Study},
  journal = {International Journal of Computer Applications},
  issue_date = {February 2015},
  volume = {111},
  number = {9},
  month = {February},
  year = {2015},
  issn = {0975-8887},
  pages = {1-12},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume111/number9/19563-1321/},
  doi = {10.5120/19563-1321},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T22:47:23.615222+05:30
%A Faris Kateb
%A Jugal Kalita
%T Classifying Short Text in Social Media: Twitter as Case Study
%J International Journal of Computer Applications
%@ 0975-8887
%V 111
%N 9
%P 1-12
%D 2015
%I Foundation of Computer Science (FCS), NY, USA

Abstract

With the rapid growth of social media, and with 500 million Twitter messages posted per day, the analysis of these messages has attracted intense interest from researchers. Topics of interest include micro-blog summarization, breaking news detection, opinion mining, and the discovery of trending topics. In information extraction, researchers face challenges in applying data mining techniques because tweets are short, unlike conventional documents of greater length, and short messages lead to less accurate results. This has motivated the investigation of efficient algorithms that overcome the problems arising from the short and often informal text of tweets. Another challenge is stream data: the huge and dynamic flow of text generated continuously by social media. In this paper, we discuss the possibility of implementing successful solutions that can be used to overcome the inconclusiveness of short texts. In addition, we discuss methods that address the problems of stream data.

Index Terms

Computer Science

Information Sciences

Keywords

Social Media Mining; Short Text Classification; Stream Data