Duplicate and near-duplicate documents in the web: detection by means of fuzzy-hash techniques

Authors

  • Carlos G. Figuerola Departamento de Informática y Automática, Facultad de Traducción y Documentación, Universidad de Salamanca, España
  • Raquel Gómez Díaz
  • José Luis Alonso Berrocal
  • Angel Zazo Rodríguez

DOI:

https://doi.org/10.54886/scire.v17i1.3895

Keywords:

World Wide Web, Duplicate detection, Fuzzy hashing

Abstract

Detecting duplicates on the web is important because it makes it possible to slim down databases and to improve the efficiency of information retrieval engines and the precision of cybermetric analyses, web mining studies, etc. The standard hash techniques used to detect duplicates find only exact copies, identical at the bit level. However, many of the duplicates found in the real world are not exactly alike: they have the same content but different formats, headers, meta tags or style sheets. The obvious solution is to compare plain-text conversions of all these formats, but these conversions are never identical, because converters treat the various formatting elements differently (textual characters, diacritics, spacing, paragraphs…). In this article, we introduce the possibility of using fuzzy hashing to produce fingerprints of files or documents that can be compared to estimate how close or distant two of them are. Based on the concept of the “rolling hash”, fuzzy hashing has been used successfully in computer security tasks such as identifying malware and spam and scanning for viruses. We have added fuzzy hashing capabilities to a lightweight crawler and run several tests in a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
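
To make the rolling-hash idea concrete, the following is a minimal Python sketch of a simplified fuzzy hash. It is an illustration only, not the implementation used in the article or in any existing fuzzy-hashing tool. The names fuzzy_signature and similarity, the 7-byte window, the block size of 16, the use of the first MD5 byte to label each chunk, and the comparison with difflib are all assumptions chosen to keep the example short.

```python
import hashlib
import string
from difflib import SequenceMatcher

WINDOW = 7  # assumed size, in bytes, of the rolling window
ALPHABET = string.ascii_letters + string.digits + "+/"  # 64 symbols


def _chunk_char(chunk: bytes) -> str:
    """Map a chunk to one symbol (first MD5 byte; an illustrative choice)."""
    return ALPHABET[hashlib.md5(chunk).digest()[0] % len(ALPHABET)]


def fuzzy_signature(data: bytes, block_size: int = 16) -> str:
    """Return a toy fuzzy signature: one symbol per content-defined chunk."""
    signature = []
    chunk_start = 0
    rolling = 0  # sum of the last WINDOW bytes seen so far
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte leaving the window
        # The cut point depends only on the last few bytes, so chunk
        # boundaries realign shortly after a local edit.
        if rolling % block_size == block_size - 1:
            signature.append(_chunk_char(data[chunk_start:i + 1]))
            chunk_start = i + 1
    if chunk_start < len(data):  # trailing bytes after the last cut point
        signature.append(_chunk_char(data[chunk_start:]))
    return "".join(signature)


def similarity(a: bytes, b: bytes) -> float:
    """Compare two signatures; 1.0 means identical signatures."""
    matcher = SequenceMatcher(
        None, fuzzy_signature(a), fuzzy_signature(b), autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    page = b"The detection of duplicates in the web is important. " * 20
    variant = page.replace(b"important", b"very important", 1)
    # An exact hash only says whether the two inputs are bit-identical.
    print(hashlib.md5(page).hexdigest() == hashlib.md5(variant).hexdigest())
    # The fuzzy signatures can still be compared on a graded scale.
    print(similarity(page, variant))
```

Because each cut point depends only on the bytes inside the window, a small change alters the symbols of the chunks around it and leaves the rest of the signature intact, whereas an exact hash of the whole file changes completely. The score returned here is a plain sequence-matching ratio and is not calibrated against the similarity thresholds measured in the article.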

Author Biography

Carlos G. Figuerola, Departamento de Informática y Automática, Facultad de Traducción y Documentación, Universidad de Salamanca, España

He teaches in the Grado en Documentación (Bachelor's Degree in Documentation) and the Máster en Sistemas de Información Digital (Master's in Digital Information Systems) at the Universidad de Salamanca. His teaching focuses on the computing techniques that underpin the Information and Documentation Sciences. He is also a member of E-lectra, a recognized research group of the Universidad de Salamanca whose interests are information retrieval and cybermetrics: implementing linguistic knowledge in retrieval systems, natural language processing, multilingual information retrieval, automatic classification, robust retrieval, interactive retrieval, web information retrieval, cybermetrics, etc.

Published

2011-12-30

How to Cite

G. Figuerola, C., Gómez Díaz, R., Alonso Berrocal, J. L., & Zazo Rodríguez, A. (2011). Duplicate and near-duplicate documents in the web: detection by means of fuzzy-hash techniques. Scire: Knowledge Representation and Organization (ISSNe 2340-7042; ISSN 1135-3716), 17(1), 49–54. https://doi.org/10.54886/scire.v17i1.3895

Section

Articles