Document (#30060)

Conrad, J.G.
Schriber, C.P.
Managing déjà vu : collection building for the identification of nonidentical duplicate documents
Journal of the American Society for Information Science and Technology. 57(2006) no.7, pp.921-932
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
Contribution to the problem of automatic duplicate detection

Similar documents (content)

  1. Zhang, J.; Mostafa, J.; Tripathy, H.: Information retrieval by semantic analysis and visualization of the concept space of D-Lib® magazine (2002) 0.14
    0.14003526 = sum of:
      0.14003526 = product of:
        0.31826195 = sum of:
          0.02114336 = weight(abstract_txt:users in 1211) [ClassicSimilarity], result of:
            0.02114336 = score(doc=1211,freq=11.0), product of:
              0.057146076 = queryWeight, product of:
                1.0164431 = boost
                3.569778 = idf(docFreq=3384, maxDocs=44218)
                0.015749332 = queryNorm
              0.36998796 = fieldWeight in 1211, product of:
                3.3166249 = tf(freq=11.0), with freq of:
                  11.0 = termFreq=11.0
                3.569778 = idf(docFreq=3384, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.026366623 = weight(abstract_txt:subsequently in 1211) [ClassicSimilarity], result of:
            0.026366623 = score(doc=1211,freq=1.0), product of:
              0.116867654 = queryWeight, product of:
                1.0278318 = boost
                7.2195506 = idf(docFreq=87, maxDocs=44218)
                0.015749332 = queryNorm
              0.22561096 = fieldWeight in 1211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2195506 = idf(docFreq=87, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.016802715 = weight(abstract_txt:search in 1211) [ClassicSimilarity], result of:
            0.016802715 = score(doc=1211,freq=6.0), product of:
              0.06000727 = queryWeight, product of:
                1.041578 = boost
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.015749332 = queryNorm
              0.28001133 = fieldWeight in 1211, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.029645834 = weight(abstract_txt:variants in 1211) [ClassicSimilarity], result of:
            0.029645834 = score(doc=1211,freq=1.0), product of:
              0.12636703 = queryWeight, product of:
                1.0687885 = boost
                7.5072327 = idf(docFreq=65, maxDocs=44218)
                0.015749332 = queryNorm
              0.23460102 = fieldWeight in 1211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5072327 = idf(docFreq=65, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.043832153 = weight(abstract_txt:algorithmic in 1211) [ClassicSimilarity], result of:
            0.043832153 = score(doc=1211,freq=2.0), product of:
              0.13016969 = queryWeight, product of:
                1.0847504 = boost
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.015749332 = queryNorm
              0.33673087 = fieldWeight in 1211, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.0134407105 = weight(abstract_txt:both in 1211) [ClassicSimilarity], result of:
            0.0134407105 = score(doc=1211,freq=3.0), product of:
              0.065149195 = queryWeight, product of:
                1.0852865 = boost
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.015749332 = queryNorm
              0.20630662 = fieldWeight in 1211, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.032321285 = weight(abstract_txt:permit in 1211) [ClassicSimilarity], result of:
            0.032321285 = score(doc=1211,freq=1.0), product of:
              0.13385987 = queryWeight, product of:
                1.1000187 = boost
                7.7265954 = idf(docFreq=52, maxDocs=44218)
                0.015749332 = queryNorm
              0.2414561 = fieldWeight in 1211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7265954 = idf(docFreq=52, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.04615882 = weight(abstract_txt:minimizing in 1211) [ClassicSimilarity], result of:
            0.04615882 = score(doc=1211,freq=1.0), product of:
              0.16975704 = queryWeight, product of:
                1.2387645 = boost
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.015749332 = queryNorm
              0.27191108 = fieldWeight in 1211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.031299848 = weight(abstract_txt:method in 1211) [ClassicSimilarity], result of:
            0.031299848 = score(doc=1211,freq=6.0), product of:
              0.09084737 = queryWeight, product of:
                1.281581 = boost
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.015749332 = queryNorm
              0.34453222 = fieldWeight in 1211, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.018011618 = weight(abstract_txt:test in 1211) [ClassicSimilarity], result of:
            0.018011618 = score(doc=1211,freq=1.0), product of:
              0.11420974 = queryWeight, product of:
                1.4369494 = boost
                5.046608 = idf(docFreq=772, maxDocs=44218)
                0.015749332 = queryNorm
              0.1577065 = fieldWeight in 1211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.046608 = idf(docFreq=772, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
          0.039238963 = weight(abstract_txt:documents in 1211) [ClassicSimilarity], result of:
            0.039238963 = score(doc=1211,freq=4.0), product of:
              0.15233617 = queryWeight, product of:
                2.346964 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.015749332 = queryNorm
              0.2575814 = fieldWeight in 1211, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.03125 = fieldNorm(doc=1211)
        0.44 = coord(11/25)
  2. Desrichard, Y.: Le dedoublonage des banques de donnees bibliographiques : un etat de l'art (1997) 0.14
    0.1392734 = sum of:
      0.1392734 = product of:
        1.1606117 = sum of:
          0.02549985 = weight(abstract_txt:users in 669) [ClassicSimilarity], result of:
            0.02549985 = score(doc=669,freq=1.0), product of:
              0.057146076 = queryWeight, product of:
                1.0164431 = boost
                3.569778 = idf(docFreq=3384, maxDocs=44218)
                0.015749332 = queryNorm
              0.44622225 = fieldWeight in 669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.569778 = idf(docFreq=3384, maxDocs=44218)
                0.125 = fieldNorm(doc=669)
          0.5539059 = weight(abstract_txt:duplicates in 669) [ClassicSimilarity], result of:
            0.5539059 = score(doc=669,freq=1.0), product of:
              0.50927114 = queryWeight, product of:
                3.7162938 = boost
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.015749332 = queryNorm
              1.0876443 = fieldWeight in 669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.701155 = idf(docFreq=19, maxDocs=44218)
                0.125 = fieldNorm(doc=669)
          0.58120596 = weight(abstract_txt:duplicate in 669) [ClassicSimilarity], result of:
            0.58120596 = score(doc=669,freq=1.0), product of:
              0.5787949 = queryWeight, product of:
                4.5747485 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.015749332 = queryNorm
              1.0041656 = fieldWeight in 669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.125 = fieldNorm(doc=669)
        0.12 = coord(3/25)
  3. Lawrence, S.; Giles, C.L.: Inquirus, the NECI meta search engine (1998) 0.13
    0.12621681 = sum of:
      0.12621681 = product of:
        0.6310841 = sum of:
          0.029103154 = weight(abstract_txt:search in 3604) [ClassicSimilarity], result of:
            0.029103154 = score(doc=3604,freq=2.0), product of:
              0.06000727 = queryWeight, product of:
                1.041578 = boost
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.015749332 = queryNorm
              0.48499382 = fieldWeight in 3604, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.6580524 = idf(docFreq=3098, maxDocs=44218)
                0.09375 = fieldNorm(doc=3604)
          0.023279995 = weight(abstract_txt:both in 3604) [ClassicSimilarity], result of:
            0.023279995 = score(doc=3604,freq=1.0), product of:
              0.065149195 = queryWeight, product of:
                1.0852865 = boost
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.015749332 = queryNorm
              0.35733357 = fieldWeight in 3604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.09375 = fieldNorm(doc=3604)
          0.08393798 = weight(abstract_txt:identification in 3604) [ClassicSimilarity], result of:
            0.08393798 = score(doc=3604,freq=1.0), product of:
              0.1531885 = queryWeight, product of:
                1.6641902 = boost
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.015749332 = queryNorm
              0.5479392 = fieldWeight in 3604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8446846 = idf(docFreq=347, maxDocs=44218)
                0.09375 = fieldNorm(doc=3604)
          0.058858447 = weight(abstract_txt:documents in 3604) [ClassicSimilarity], result of:
            0.058858447 = score(doc=3604,freq=1.0), product of:
              0.15233617 = queryWeight, product of:
                2.346964 = boost
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.015749332 = queryNorm
              0.38637212 = fieldWeight in 3604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1213026 = idf(docFreq=1949, maxDocs=44218)
                0.09375 = fieldNorm(doc=3604)
          0.43590447 = weight(abstract_txt:duplicate in 3604) [ClassicSimilarity], result of:
            0.43590447 = score(doc=3604,freq=1.0), product of:
              0.5787949 = queryWeight, product of:
                4.5747485 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.015749332 = queryNorm
              0.75312424 = fieldWeight in 3604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.09375 = fieldNorm(doc=3604)
        0.2 = coord(5/25)
  4. Yu, L.-C.; Wu, C.-H.; Chang, R.-Y.; Liu, C.-H.; Hovy, E.H.: Annotation and verification of sense pools in OntoNotes (2010) 0.12
    0.11507737 = sum of:
      0.11507737 = product of:
        0.5753868 = sum of:
          0.051112432 = weight(abstract_txt:method in 4236) [ClassicSimilarity], result of:
            0.051112432 = score(doc=4236,freq=4.0), product of:
              0.09084737 = queryWeight, product of:
                1.281581 = boost
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.015749332 = queryNorm
              0.56261873 = fieldWeight in 4236, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.0625 = fieldNorm(doc=4236)
          0.05094455 = weight(abstract_txt:test in 4236) [ClassicSimilarity], result of:
            0.05094455 = score(doc=4236,freq=2.0), product of:
              0.11420974 = queryWeight, product of:
                1.4369494 = boost
                5.046608 = idf(docFreq=772, maxDocs=44218)
                0.015749332 = queryNorm
              0.44606134 = fieldWeight in 4236, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.046608 = idf(docFreq=772, maxDocs=44218)
                0.0625 = fieldNorm(doc=4236)
          0.017754953 = weight(abstract_txt:results in 4236) [ClassicSimilarity], result of:
            0.017754953 = score(doc=4236,freq=1.0), product of:
              0.08157519 = queryWeight, product of:
                1.4873548 = boost
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.015749332 = queryNorm
              0.21765138 = fieldWeight in 4236, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.482422 = idf(docFreq=3693, maxDocs=44218)
                0.0625 = fieldNorm(doc=4236)
          0.16497192 = weight(abstract_txt:near in 4236) [ClassicSimilarity], result of:
            0.16497192 = score(doc=4236,freq=3.0), product of:
              0.2183807 = queryWeight, product of:
                1.986996 = boost
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.015749332 = queryNorm
              0.75543267 = fieldWeight in 4236, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9783883 = idf(docFreq=111, maxDocs=44218)
                0.0625 = fieldNorm(doc=4236)
          0.29060298 = weight(abstract_txt:duplicate in 4236) [ClassicSimilarity], result of:
            0.29060298 = score(doc=4236,freq=1.0), product of:
              0.5787949 = queryWeight, product of:
                4.5747485 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.015749332 = queryNorm
              0.5020828 = fieldWeight in 4236, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.0625 = fieldNorm(doc=4236)
        0.2 = coord(5/25)
  5. Diodato, V.: User preferences for features in back of the book indexes (1994) 0.11
    0.11155219 = sum of:
      0.11155219 = product of:
        0.6972012 = sum of:
          0.019124888 = weight(abstract_txt:users in 7762) [ClassicSimilarity], result of:
            0.019124888 = score(doc=7762,freq=1.0), product of:
              0.057146076 = queryWeight, product of:
                1.0164431 = boost
                3.569778 = idf(docFreq=3384, maxDocs=44218)
                0.015749332 = queryNorm
              0.33466667 = fieldWeight in 7762, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.569778 = idf(docFreq=3384, maxDocs=44218)
                0.09375 = fieldNorm(doc=7762)
          0.023279995 = weight(abstract_txt:both in 7762) [ClassicSimilarity], result of:
            0.023279995 = score(doc=7762,freq=1.0), product of:
              0.065149195 = queryWeight, product of:
                1.0852865 = boost
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.015749332 = queryNorm
              0.35733357 = fieldWeight in 7762, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.811558 = idf(docFreq=2657, maxDocs=44218)
                0.09375 = fieldNorm(doc=7762)
          0.038334325 = weight(abstract_txt:method in 7762) [ClassicSimilarity], result of:
            0.038334325 = score(doc=7762,freq=1.0), product of:
              0.09084737 = queryWeight, product of:
                1.281581 = boost
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.015749332 = queryNorm
              0.42196405 = fieldWeight in 7762, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.50095 = idf(docFreq=1333, maxDocs=44218)
                0.09375 = fieldNorm(doc=7762)
          0.616462 = weight(abstract_txt:duplicate in 7762) [ClassicSimilarity], result of:
            0.616462 = score(doc=7762,freq=2.0), product of:
              0.5787949 = queryWeight, product of:
                4.5747485 = boost
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.015749332 = queryNorm
              1.0650785 = fieldWeight in 7762, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.033325 = idf(docFreq=38, maxDocs=44218)
                0.09375 = fieldNorm(doc=7762)
        0.16 = coord(4/25)
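
The score breakdowns above all follow the same arithmetic, which can be reproduced from the listed values. The sketch below is a minimal illustration, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); both are consistent with the figures shown (for example, tf(freq=11.0) = 3.3166249). The function names are hypothetical and are not part of any library API; boost, queryNorm, and fieldNorm are taken as given from the listing rather than derived.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assumed form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term frequency factor, assumed form: square root of the raw frequency."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm      # boost * idf * queryNorm
    field_weight = tf(freq) * term_idf * field_norm   # tf * idf * fieldNorm
    return query_weight * field_weight


def document_score(term_scores, matched_terms, query_terms):
    """Sum of the term scores, scaled by coord = matched / total query terms."""
    return sum(term_scores) * (matched_terms / query_terms)


# "users" in doc 1211, using the values from the first breakdown.
users_1211 = term_score(
    freq=11.0, doc_freq=3384, max_docs=44218,
    boost=1.0164431, query_norm=0.015749332, field_norm=0.03125,
)
print(round(users_1211, 5))  # listing shows 0.02114336

# Doc 669: three matching terms out of 25, i.e. coord(3/25) = 0.12.
doc_669 = document_score([0.02549985, 0.5539059, 0.58120596], 3, 25)
print(round(doc_669, 5))  # listing shows 0.1392734
```

The sketch computes in double precision, so results may differ from the single-precision values in the listing in the last digits.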