A Novel SSPS Framework for String Similarity Join

P. Selvaramalakshmi; S. Hari Ganesh; Florence Tushabe

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 160 - Number 1

Year of Publication: 2017

Authors: P. Selvaramalakshmi, S. Hari Ganesh, Florence Tushabe

DOI: 10.5120/ijca2017912955

P. Selvaramalakshmi, S. Hari Ganesh, Florence Tushabe. A Novel SSPS Framework for String Similarity Join. International Journal of Computer Applications. 160, 1 (Feb 2017), 32-38. DOI=10.5120/ijca2017912955

@article{10.5120/ijca2017912955,
  author = {P. Selvaramalakshmi and S. Hari Ganesh and Florence Tushabe},
  title = {A Novel SSPS Framework for String Similarity Join},
  journal = {International Journal of Computer Applications},
  issue_date = {Feb 2017},
  volume = {160},
  number = {1},
  month = {Feb},
  year = {2017},
  issn = {0975-8887},
  pages = {32-38},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume160/number1/27040-2017912955/},
  doi = {10.5120/ijca2017912955},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-07T00:05:30.007723+05:30
%A P. Selvaramalakshmi
%A S. Hari Ganesh
%A Florence Tushabe
%T A Novel SSPS Framework for String Similarity Join
%J International Journal of Computer Applications
%@ 0975-8887
%V 160
%N 1
%P 32-38
%D 2017
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The rapid growth of information challenges existing string analysis techniques for processing large volumes of data and motivates new approaches. Traditional methods suffer from low pruning power, a high number of false positives, and poor scalability, and these problems need to be addressed. This paper therefore proposes an improved similarity join based on the SSPS MapReduce framework, which consists of a novel PSS stemming algorithm together with three newly proposed filtering techniques, SSize, SPositional, and UI (Union-Intersection), designed to process large-scale data effectively while addressing the limitations of traditional filtering methods. Experiments show that the framework reduces false positives and run-time cost and scales better than existing frameworks.
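The abstract does not define SSize, SPositional, or UI, so the sketch below is not the authors' method. It only illustrates the general filter-and-verify pattern that similarity joins rely on, using the standard length filter for Jaccard similarity over word tokens, on a single machine rather than MapReduce. All function names, the tokenization, and the choice of Jaccard similarity are assumptions made for illustration.

```python
from itertools import combinations


def tokens(s):
    """Split a string into a set of lowercase word tokens (assumed tokenization)."""
    return set(s.lower().split())


def passes_size_filter(a, b, threshold):
    """Standard length filter: Jaccard(a, b) >= t implies min(|a|, |b|) >= t * max(|a|, |b|)."""
    small, large = sorted((len(a), len(b)))
    return small >= threshold * large


def jaccard(a, b):
    """Verification step: |a & b| / |a | b|, with two empty sets treated as identical."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_self_join(strings, threshold):
    """Return (i, j, similarity) for pairs whose Jaccard similarity meets the threshold."""
    sets = [tokens(s) for s in strings]
    results = []
    for i, j in combinations(range(len(sets)), 2):
        if not passes_size_filter(sets[i], sets[j], threshold):
            continue  # pruned by the filter; the similarity is never computed
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results
```

The filter is safe in the sense that it never discards a pair that would pass verification; it only avoids computing the similarity for pairs whose sizes already rule them out. Pairs that survive the filter but fail verification are the false positives that stronger filters aim to reduce.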

Index Terms

Computer Science

Information Sciences

Keywords

Similarity joins, Hadoop, MapReduce, filtering and verification