Document (#34563)

Jele, H.
Erkennung bibliographischer Dubletten mittels Trigrammen : Messungen zur Performanz [Detecting bibliographic duplicates with trigrams: performance measurements]
Source 12(2009) H.3,
Trigram formation is frequently used in automated duplicate detection when "very similar" but not identical records are to be identified as duplicates. This paper applies three trigram-based detection methods in practice (the Jaccard measure, the Euclidean distance, and the KOBV similarity value), implements every step they require, and then measures their consumption of time and resources (= the "performance"). The data set used comprises 392,616 bibliographic title records created in the Austrian Library Network (Österreichischer Bibliothekenverbund).
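
Two of the three measures named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the normalization step (lowercasing, collapsing whitespace), the absence of padding, and the function names `trigrams`, `jaccard`, and `euclidean` are assumptions made for illustration. The KOBV similarity value is left out because the abstract does not define it.

```python
import math
import re
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count the overlapping character trigrams of a normalized string."""
    # Assumption: lowercase and collapse whitespace before splitting.
    s = re.sub(r"\s+", " ", text.strip().lower())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def jaccard(a: str, b: str) -> float:
    """Jaccard measure: shared trigrams divided by all distinct trigrams."""
    ta, tb = set(trigrams(a)), set(trigrams(b))
    union = ta | tb
    if not union:
        # Assumption: two strings without any trigram count as identical.
        return 1.0
    return len(ta & tb) / len(union)


def euclidean(a: str, b: str) -> float:
    """Euclidean distance between the trigram count vectors."""
    ca, cb = trigrams(a), trigrams(b)
    return math.sqrt(sum((ca[g] - cb[g]) ** 2 for g in set(ca) | set(cb)))


# Two hypothetical title variants that differ in one spelling.
t1 = "Erkennung bibliographischer Dubletten"
t2 = "Erkennung bibliografischer Dubletten"
print(jaccard(t1, t2), euclidean(t1, t2))
```

A higher Jaccard value and a lower Euclidean distance both indicate more similar records; where to place the threshold for calling two records duplicates is a separate decision that this sketch does not address.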

Similar documents (content)

  1. Schneider, W.: ¬Ein verteiltes Bibliotheks-Informationssystem auf Basis des Z39.50 Protokolls (1999) 0.05
    0.048503194 = sum of:
      0.048503194 = product of:
        0.40419328 = sum of:
          0.09686548 = weight(abstract_txt:bibliographische in 4773) [ClassicSimilarity], result of:
            0.09686548 = score(doc=4773,freq=1.0), product of:
              0.1349953 = queryWeight, product of:
                1.0275823 = boost
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.017164174 = queryNorm
              0.7175471 = fieldWeight in 4773, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.09375 = fieldNorm(doc=4773)
          0.10702998 = weight(abstract_txt:datensätze in 4773) [ClassicSimilarity], result of:
            0.10702998 = score(doc=4773,freq=1.0), product of:
              0.14428115 = queryWeight, product of:
                1.0623364 = boost
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.017164174 = queryNorm
              0.74181545 = fieldWeight in 4773, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.09375 = fieldNorm(doc=4773)
          0.2002978 = weight(abstract_txt:dubletten in 4773) [ClassicSimilarity], result of:
            0.2002978 = score(doc=4773,freq=1.0), product of:
              0.21910724 = queryWeight, product of:
                1.3091387 = boost
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.017164174 = queryNorm
              0.9141542 = fieldWeight in 4773, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.09375 = fieldNorm(doc=4773)
        0.12 = coord(3/25)
  2. Fürste, F.M.: Linked Open Library Data : Bibliographische Daten und ihre Zugänglichkeit im Web der Daten (2009) 0.04
    0.03813433 = sum of:
      0.03813433 = product of:
        0.3177861 = sum of:
          0.08927282 = weight(abstract_txt:notwendigen in 2900) [ClassicSimilarity], result of:
            0.08927282 = score(doc=2900,freq=1.0), product of:
              0.1278455 = queryWeight, product of:
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.017164174 = queryNorm
              0.6982868 = fieldWeight in 2900, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.09375 = fieldNorm(doc=2900)
          0.09686548 = weight(abstract_txt:bibliographische in 2900) [ClassicSimilarity], result of:
            0.09686548 = score(doc=2900,freq=1.0), product of:
              0.1349953 = queryWeight, product of:
                1.0275823 = boost
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.017164174 = queryNorm
              0.7175471 = fieldWeight in 2900, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.09375 = fieldNorm(doc=2900)
          0.13164781 = weight(abstract_txt:bibliographischer in 2900) [ClassicSimilarity], result of:
            0.13164781 = score(doc=2900,freq=1.0), product of:
              0.16563357 = queryWeight, product of:
                1.1382338 = boost
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.017164174 = queryNorm
              0.7948135 = fieldWeight in 2900, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.478011 = idf(docFreq=24, maxDocs=44218)
                0.09375 = fieldNorm(doc=2900)
        0.12 = coord(3/25)
  3. Schaffner, V.: FRBR in MAB2 und Primo - ein kafkaesker Prozess? : Möglichkeiten der FRBRisierung von MAB2-Datensätzen in Primo exemplarisch dargestellt an Datensätzen zu Franz Kafkas "Der Process" (2011) 0.04
    0.036524184 = sum of:
      0.036524184 = product of:
        0.3043682 = sum of:
          0.09786929 = weight(abstract_txt:bibliographische in 907) [ClassicSimilarity], result of:
            0.09786929 = score(doc=907,freq=3.0), product of:
              0.1349953 = queryWeight, product of:
                1.0275823 = boost
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.017164174 = queryNorm
              0.724983 = fieldWeight in 907, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.653836 = idf(docFreq=56, maxDocs=44218)
                0.0546875 = fieldNorm(doc=907)
          0.12486831 = weight(abstract_txt:datensätze in 907) [ClassicSimilarity], result of:
            0.12486831 = score(doc=907,freq=4.0), product of:
              0.14428115 = queryWeight, product of:
                1.0623364 = boost
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.017164174 = queryNorm
              0.86545134 = fieldWeight in 907, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.912698 = idf(docFreq=43, maxDocs=44218)
                0.0546875 = fieldNorm(doc=907)
          0.08163058 = weight(abstract_txt:bibliothekenverbund in 907) [ClassicSimilarity], result of:
            0.08163058 = score(doc=907,freq=1.0), product of:
              0.17251626 = queryWeight, product of:
                1.161642 = boost
                8.652365 = idf(docFreq=20, maxDocs=44218)
                0.017164174 = queryNorm
              0.47317618 = fieldWeight in 907, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.652365 = idf(docFreq=20, maxDocs=44218)
                0.0546875 = fieldNorm(doc=907)
        0.12 = coord(3/25)
  4. Bürger, T.: ¬Die Digitalisierung der kulturellen und wissenschaftlichen Überlieferung : Versuch einer Zwischenbilanz (2011) 0.03
    0.03265789 = sum of:
      0.03265789 = product of:
        0.40822366 = sum of:
          0.07439401 = weight(abstract_txt:notwendigen in 4717) [ClassicSimilarity], result of:
            0.07439401 = score(doc=4717,freq=1.0), product of:
              0.1278455 = queryWeight, product of:
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.017164174 = queryNorm
              0.5819056 = fieldWeight in 4717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.448392 = idf(docFreq=69, maxDocs=44218)
                0.078125 = fieldNorm(doc=4717)
          0.33382964 = weight(abstract_txt:performanz in 4717) [ClassicSimilarity], result of:
            0.33382964 = score(doc=4717,freq=1.0), product of:
              0.43821448 = queryWeight, product of:
                2.6182773 = boost
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.017164174 = queryNorm
              0.7617951 = fieldWeight in 4717, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.7509775 = idf(docFreq=6, maxDocs=44218)
                0.078125 = fieldNorm(doc=4717)
        0.08 = coord(2/25)
  5. Figge, U.L.: Technische Anleitungen und der Erwerb kohärenten Wissens (2004) 0.03
    0.03183245 = sum of:
      0.03183245 = product of:
        0.39790562 = sum of:
          0.14198655 = weight(abstract_txt:situationen in 3144) [ClassicSimilarity], result of:
            0.14198655 = score(doc=3144,freq=2.0), product of:
              0.15612829 = queryWeight, product of:
                1.1050911 = boost
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.017164174 = queryNorm
              0.90942234 = fieldWeight in 3144, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.231152 = idf(docFreq=31, maxDocs=44218)
                0.078125 = fieldNorm(doc=3144)
          0.25591907 = weight(abstract_txt:angewandt in 3144) [ClassicSimilarity], result of:
            0.25591907 = score(doc=3144,freq=1.0), product of:
              0.36706126 = queryWeight, product of:
                2.396302 = boost
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.017164174 = queryNorm
              0.6972108 = fieldWeight in 3144, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.078125 = fieldNorm(doc=3144)
        0.08 = coord(2/25)
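
The score trees above follow the arithmetic of Lucene's ClassicSimilarity: each term contributes queryWeight times fieldWeight, and the sum is scaled by the coord factor. The sketch below recomputes entry 1 (doc 4773) from the values printed in its tree. The helper names are hypothetical, and `query_norm` and the per-term boosts are copied from the listing rather than derived. Where a tree shows no boost line (as for `notwendigen`), a boost of 1.0 is consistent with the printed queryWeight.

```python
import math

MAX_DOCS = 44218          # maxDocs, as printed in the listing
QUERY_NORM = 0.017164174  # queryNorm, as printed in the listing


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    """queryWeight * fieldWeight for a single matching term."""
    i = idf(doc_freq)
    query_weight = boost * i * QUERY_NORM
    field_weight = tf(freq) * i * field_norm
    return query_weight * field_weight


def doc_score(terms, matched: int, total: int) -> float:
    """Sum of term scores scaled by coord = matched / total query terms."""
    return sum(term_score(*t) for t in terms) * (matched / total)


# (boost, docFreq, freq, fieldNorm) for doc 4773, copied from entry 1.
terms_4773 = [
    (1.0275823, 56, 1.0, 0.09375),  # bibliographische
    (1.0623364, 43, 1.0, 0.09375),  # datensätze
    (1.3091387, 6, 1.0, 0.09375),   # dubletten
]

score_4773 = doc_score(terms_4773, matched=3, total=25)
print(round(score_4773, 4))  # close to the 0.0485 shown for entry 1
```

Small differences in the last digits are expected, since the listing was produced with single-precision floats and this sketch uses Python's double precision.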