Duplicate and near-duplicate documents in the web: detection by means of fuzzy-hash techniques
DOI: https://doi.org/10.54886/scire.v17i1.3895
Keywords: World Wide Web, Duplicate detection, Fuzzy hashing
Abstract
The detection of duplicates on the web is important because it makes it possible to lighten databases and to improve the efficiency of information retrieval engines and the precision of cybermetric analyses, web mining studies, etc. The standard hash techniques used to detect duplicates only find exact ones, identical at the bit level. However, many of the duplicates found in the real world are not exactly alike: they have the same content but different formats, headers, meta tags or style sheets. The obvious solution is to compare plain-text conversions of all these formats, but these conversions are never identical, because converters treat the various formatting elements differently (textual characters, diacritics, spacing, paragraphs…). In this article, we introduce the possibility of using fuzzy hashing to produce fingerprints of files or documents that can be compared to estimate the closeness or distance between them. Based on the concept of the “rolling hash”, fuzzy hashing has been used successfully in computer security tasks such as malware identification, spam detection and virus scanning. We have added fuzzy-hashing capabilities to a lightweight crawler and run several tests on a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
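To make the idea of a rolling-hash-based fingerprint concrete, the following is a minimal Python sketch. It is an illustration only, not the implementation used in the article: the function names, the window size, the trigger modulus, the use of SHA-256 for chunk hashing and the use of `difflib.SequenceMatcher` for comparison are all assumptions chosen for brevity. Production fuzzy-hashing tools use more refined rolling hashes and scoring.

```python
import hashlib
from difflib import SequenceMatcher

# 64-symbol alphabet used to encode one character per chunk (assumption).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def _chunk_char(chunk: bytes) -> str:
    """Map a chunk to a single character via the first byte of its SHA-256 digest."""
    return ALPHABET[hashlib.sha256(chunk).digest()[0] % len(ALPHABET)]


def fuzzy_fingerprint(data: bytes, window: int = 7, modulus: int = 64) -> str:
    """Return a short signature with one character per content-defined chunk.

    A rolling hash (here simply the sum of the last `window` bytes) is updated
    byte by byte. Whenever it hits a trigger value, the current chunk ends.
    Chunk boundaries therefore depend only on nearby content, so a local edit
    changes only the characters of the chunks around it.
    """
    signature = []
    chunk_start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte                    # byte entering the window
        if i >= window:
            rolling -= data[i - window]    # byte leaving the window
        if i + 1 >= window and rolling % modulus == modulus - 1:
            signature.append(_chunk_char(data[chunk_start:i + 1]))
            chunk_start = i + 1
    if chunk_start < len(data):            # trailing chunk without a trigger
        signature.append(_chunk_char(data[chunk_start:]))
    return "".join(signature)


def similarity(fp_a: str, fp_b: str) -> float:
    """Score two fingerprints between 0.0 and 1.0 (1.0 means identical signatures)."""
    return SequenceMatcher(None, fp_a, fp_b).ratio()
```

Calling `similarity(fuzzy_fingerprint(a), fuzzy_fingerprint(b))` on two plain-text conversions of the same page would then yield a score that can be compared against a threshold, whereas a standard cryptographic hash of the two conversions would only report whether they are bit-for-bit identical.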
License
Copyright (c) 2011 Authors retain their copyright, but transfer the exploitation rights (reproduction, distribution, public communication and transformation) to the journal in a non-exclusive way and guarantee the right to the first publication of their work to the journal, which will be simultaneously subject to the CC BY-NC-ND license. Authors take full personal responsibility for complying with all the appropriate ethical codes and laws, and for obtaining all the necessary copyright permissions regarding their articles. Institutional and self-archiving is allowed and encouraged.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.