A Novel SSPS Framework for String Similarity Join

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Year of Publication: 2017
Authors:
P. Selvaramalakshmi, S. Hari Ganesh, Florence Tushabe
DOI: 10.5120/ijca2017912955

P Selvaramalakshmi, S Hari Ganesh and Florence Tushabe. A Novel SSPS Framework for String Similarity Join. International Journal of Computer Applications 160(1):32-38, February 2017.

BibTeX

@article{10.5120/ijca2017912955,
	author = {P. Selvaramalakshmi and S. Hari Ganesh and Florence Tushabe},
	title = {A Novel SSPS Framework for String Similarity Join},
	journal = {International Journal of Computer Applications},
	issue_date = {February 2017},
	volume = {160},
	number = {1},
	month = {Feb},
	year = {2017},
	issn = {0975-8887},
	pages = {32-38},
	numpages = {7},
	url = {http://www.ijcaonline.org/archives/volume160/number1/27040-2017912955},
	doi = {10.5120/ijca2017912955},
	publisher = {Foundation of Computer Science (FCS), NY, USA},
	address = {New York, USA}
}

Abstract

The rapid growth of information challenges existing string analysis techniques, which struggle to process very large volumes of data, and this leaves room for new approaches. Traditional methods also suffer from low pruning power, a high number of false positives and poor scalability, and these problems call for appropriate solutions. This paper therefore proposes an improved similarity join based on the SSPS MapReduce framework, which consists of a novel PSS stemming algorithm together with three newly proposed filtering techniques, SSize, SPositional and UI (Union-Intersection), that can process large-scale data effectively while addressing the limitations of traditional filtering methods. Experiments show that the framework reduces false positives and run-time cost and scales better than existing frameworks.
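
For readers unfamiliar with the filter-and-verify pattern that similarity joins rely on, the following is a minimal single-machine sketch. It is not the SSPS framework and it does not reproduce the paper's SSize, SPositional or UI filters, whose definitions are not given on this page. It assumes word-level tokens, Jaccard similarity, and the standard length filter (two sets can only reach a Jaccard threshold t if the smaller set has at least t times as many tokens as the larger one). All function names are hypothetical.

```python
from itertools import combinations


def tokens(s):
    """Split a string into a set of lowercase word tokens (an assumed tokenization)."""
    return frozenset(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def passes_size_filter(a, b, threshold):
    """Standard length filter, not the paper's SSize filter.

    If jaccard(a, b) >= threshold, then min(|a|, |b|) >= threshold * max(|a|, |b|),
    so any pair failing this test can be pruned without computing the similarity.
    """
    small, large = sorted((len(a), len(b)))
    # Small tolerance so floating-point rounding never prunes a true match.
    return small + 1e-9 >= threshold * large


def similarity_join(strings, threshold):
    """Self-join: return matching (i, j, similarity) triples and the number of verified pairs."""
    sets = [tokens(s) for s in strings]
    results = []
    verified = 0
    for i, j in combinations(range(len(sets)), 2):
        # Filter step: cheap test that discards pairs which cannot match.
        if not passes_size_filter(sets[i], sets[j], threshold):
            continue
        # Verification step: exact similarity on the surviving candidates.
        verified += 1
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results, verified


if __name__ == "__main__":
    data = [
        "string similarity join",
        "string similarity joins framework",
        "a novel framework for string similarity join on mapreduce",
        "string join similarity",
    ]
    matches, checked = similarity_join(data, 0.8)
    print(matches, checked)
```

In this sketch the filter removes most of the six candidate pairs before verification, which is the effect that pruning power refers to. A MapReduce version would distribute candidate generation and verification across mappers and reducers instead of looping over all pairs.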

Keywords

Similarity joins, Hadoop, MapReduce, filtering and verification