De-duplicating a large crowd-sourced catalogue of bibliographic records

Subasić, Ilija; Gvozdenović, Nebojša; Jack, Kris

Please use this identifier to cite or link to this item: https://open.uns.ac.rs/handle/123456789/4757

DC Field	Value	Language
dc.contributor.author	Subasić, Ilija	en_US
dc.contributor.author	Gvozdenović, Nebojša	en_US
dc.contributor.author	Jack, Kris	en_US
dc.date.accessioned	2019-09-30T08:41:25Z	-
dc.date.available	2019-09-30T08:41:25Z	-
dc.date.issued	2016-04-04	-
dc.identifier.issn	00330337	en_US
dc.identifier.uri	https://open.uns.ac.rs/handle/123456789/4757	-
dc.description.abstract	© 2016, Emerald Group Publishing Limited. Purpose – The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from crowd-sourced data, demonstrate how to learn an optimal combination of distance metrics for duplicate detection and introduce a parallel duplicate clustering algorithm. Design/methodology/approach – The authors developed the algorithm and compared it with state-of-the-art systems tackling the same problem. The authors used benchmark data sets (3k data points) to test the effectiveness of the algorithm and real-life data (>90 million data points) to test its efficiency and scalability. Findings – The authors show that duplicate detection can be improved by an additional step, which they call duplicate clustering. The authors also show how to improve the efficiency of a map/reduce similarity calculation algorithm by introducing a sampling step. Finally, the authors find that the system is comparable to state-of-the-art systems for duplicate detection, and that it can scale to deal with hundreds of millions of data points. Research limitations/implications – Academic researchers can use this paper to understand some of the issues of transitivity in duplicate detection and its effects on digital catalogue generation. Practical implications – Industry practitioners can use this paper as a case study in building a large-scale, real-life catalogue generation system that deals with millions of records in a scalable and efficient way. Originality/value – In contrast to other similarity calculation algorithms developed for map/reduce (m/r) frameworks, the authors present a specific variant of similarity calculation that is optimized for duplicate detection of bibliographic records by extending a previously proposed e-algorithm based on inverted index creation.
In addition, the authors are concerned with more than duplicate detection, and investigate how to group detected duplicates. The authors develop distinct algorithms for duplicate detection and duplicate clustering and use the canopy clustering idea for multi-pass clustering. The work extends the current state of the art by including the duplicate clustering step and demonstrating new strategies for speeding up m/r similarity calculations.	en
dc.relation.ispartof	Program	en
dc.title	De-duplicating a large crowd-sourced catalogue of bibliographic records	en_US
dc.type	Journal/Magazine Article	en_US
dc.identifier.doi	10.1108/PROG-02-2015-0021	-
dc.identifier.scopus	2-s2.0-84961635256	-
dc.identifier.url	https://api.elsevier.com/content/abstract/scopus_id/84961635256	-
dc.description.version	Unknown	en_US
dc.relation.lastpage	156	en
dc.relation.firstpage	138	en
dc.relation.issue	2	en
dc.relation.volume	50	en
item.grantfulltext	none	-
item.fulltext	No Fulltext	-
crisitem.author.dept	Ekonomski fakultet, Departman za poslovnu informatiku i kvantitativne metode	-
crisitem.author.orcid	0000-0002-9230-9528	-
crisitem.author.parentorg	Ekonomski fakultet	-
Appears in Collections:	EF Publikacije/Publications
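
The abstract distinguishes duplicate detection (deciding whether two records match) from duplicate clustering (grouping the detected matches), and points to transitivity as the difficulty between the two. The sketch below is only an illustration of that distinction and is not the authors' algorithm: it assumes a single token-set Jaccard similarity on titles in place of the learned combination of distance metrics, a fixed threshold, an all-pairs comparison in place of the map/reduce inverted-index step, and a simple union-find grouping. The names `jaccard`, `find` and `cluster_duplicates` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two title strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find(parent: list, i: int) -> int:
    """Return the root of i in the union-find forest, halving paths on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_duplicates(titles: list, threshold: float = 0.8) -> list:
    """Group record indices into clusters of (transitively) linked duplicates.

    Assumption: two records are 'detected' as duplicates when their Jaccard
    similarity reaches the threshold; clusters are the connected components
    of the resulting duplicate graph.
    """
    parent = list(range(len(titles)))
    # Detection step: compare every pair (quadratic, fine only for a toy input).
    for i, j in combinations(range(len(titles)), 2):
        if jaccard(titles[i], titles[j]) >= threshold:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                # Clustering step: merge the two groups under the smaller root.
                parent[max(ri, rj)] = min(ri, rj)
    clusters = {}
    for i in range(len(titles)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    titles = [
        "de-duplicating a large crowd-sourced catalogue of bibliographic records",
        "de-duplicating a large catalogue of bibliographic records",
        "de-duplicating a large catalogue of records",
        "canopy clustering for record linkage",
    ]
    print(cluster_duplicates(titles, threshold=0.8))
```

In this toy input the first and third titles fall below the threshold when compared directly, yet they end up in one cluster because each matches the second title. Taking connected components is the simplest way to resolve such chains, and it can over-merge on real data.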
