Documentos duplicados y casi duplicados en el Web: detección con técnicas de hashing borroso

Carlos G. Figuerola; Raquel Gómez Díaz; José Luis Alonso Berrocal; Angel Zazo Rodríguez

doi:10.54886/scire.v17i1.3895

Duplicate and near-duplicate documents in the web: detection by means of fuzzy-hash techniques

Authors

Carlos G. Figuerola, Departamento de Informática y Automática, Facultad de Traducción y Documentación, Universidad de Salamanca, España
Raquel Gómez Díaz
José Luis Alonso Berrocal
Angel Zazo Rodríguez


Keywords:

World Wide Web, Duplicate detection, Fuzzy hashing

Abstract

Detecting duplicates on the web is important because it makes it possible to lighten databases and to improve the efficiency of information retrieval engines and the precision of cybermetric analyses, web mining studies, etc. The standard hash techniques used to detect duplicates find only exact ones, identical at the bit level. However, many duplicates found in the real world are not exactly alike: they have the same content but different formats, headers, meta tags or style sheets. The obvious solution is to compare plain-text conversions of all these formats, but the conversions are never identical, because converters treat the various formatting elements differently (textual characters, diacritics, spacing, paragraphs…). In this article, we introduce the possibility of using fuzzy hashing to produce fingerprints of files or documents that can be compared to estimate how close or distant two of them are. Based on the concept of the “rolling hash”, fuzzy hashing has been used successfully in computer security tasks such as identifying malware and spam and scanning for viruses. We added fuzzy-hashing capabilities to a lightweight crawler and ran several tests on a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
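The rolling-hash idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' crawler code, nor the algorithm of any particular fuzzy-hashing tool; it is a simplified illustration in the spirit of piecewise hashing. The names `fuzzy_hash` and `similarity`, the constants `WINDOW` and `TRIGGER`, the use of CRC32 for chunk hashes and the use of `difflib` for comparison are all assumptions made for this example.

```python
import string
import zlib
from difflib import SequenceMatcher

WINDOW = 7     # bytes covered by the rolling hash (assumed value)
TRIGGER = 64   # a chunk ends roughly every TRIGGER bytes (assumed value)
ALPHABET = string.ascii_letters + string.digits + "+/"  # 64 symbols


def fuzzy_hash(data: bytes) -> str:
    """Return a signature with one symbol per content-defined chunk."""
    symbols = []
    rolling = 0       # sum of the last WINDOW bytes: a very simple rolling hash
    chunk_start = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        # The chunk boundary depends only on nearby content, so a local edit
        # disturbs only the chunks around it, not the whole signature.
        if rolling % TRIGGER == TRIGGER - 1:
            chunk = data[chunk_start:i + 1]
            symbols.append(ALPHABET[zlib.crc32(chunk) % len(ALPHABET)])
            chunk_start = i + 1
    if chunk_start < len(data):
        # Hash whatever remains after the last boundary.
        tail = data[chunk_start:]
        symbols.append(ALPHABET[zlib.crc32(tail) % len(ALPHABET)])
    return "".join(symbols)


def similarity(sig_a: str, sig_b: str) -> float:
    """Score two signatures between 0.0 (unrelated) and 1.0 (identical)."""
    return SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()
```

Under these assumptions, two plain-text conversions of the same page that differ only in a few characters would yield signatures that share most of their symbols, whereas a conventional hash of the whole file would differ completely. The similarity score above which two documents count as duplicates has to be set empirically.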


Author Biography

Carlos G. Figuerola, Departamento de Informática y Automática, Facultad de Traducción y Documentación, Universidad de Salamanca, España

He teaches in the Degree in Documentation and the Master's in Digital Information Systems at the Universidad de Salamanca. His teaching focuses on the computing techniques that underpin Information and Documentation Sciences. He is also a member of the E-lectra research group, a recognized group of the Universidad de Salamanca, whose topics of interest are information retrieval and cybermetrics: implementation of linguistic knowledge in retrieval systems, natural language processing, multilingual information retrieval, automatic classification, robust retrieval, interactive retrieval, information retrieval on the web, cybermetrics, etc.


Published

2011-12-30

How to Cite

G. Figuerola, C., Gómez Díaz, R., Alonso Berrocal, J. L., & Zazo Rodríguez, A. (2011). Duplicate and near-duplicate documents in the web: detection by means of fuzzy-hash techniques. Scire: Knowledge Representation and Organization (ISSNe 2340-7042; ISSN 1135-3716), 17(1), 49–54. https://doi.org/10.54886/scire.v17i1.3895


Issue

Vol.17, N.1 (2011)

Section

Articles

License

Copyright (c) 2011. Authors retain their copyright but transfer the exploitation rights (reproduction, distribution, public communication and transformation) to the journal on a non-exclusive basis and guarantee the journal the right of first publication of their work, which will simultaneously be subject to the CC BY-NC-ND license. Authors take full personal responsibility for complying with all appropriate ethical codes and laws and for obtaining all necessary copyright permissions for their articles. Institutional and self-archiving is allowed and encouraged.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
