Document (#26610)

Author
Bookstein, A.
Kulyukin, V.
Raita, T.
Nicholson, J.
Title
Adapting measures of clumping strength to assess term-term similarity
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.7, S.611-620
Year
2003
Abstract
Automated information retrieval relies heavily on statistical regularities that emerge as terms are deposited to produce text. This paper examines statistical patterns expected of a pair of terms that are semantically related to each other. Guided by a conceptualization of the text generation process, we derive measures of how tightly two terms are semantically associated. Our main objective is to probe whether such measures yield reasonable results. Specifically, we examine how the tendency of a content-bearing term to clump, as quantified by previously developed measures of term clumping, is influenced by the presence of other terms. This approach allows us to present a toolkit from which a range of measures can be constructed. As an illustration, one of several suggested measures is evaluated on a large text corpus built from an on-line encyclopedia.
Theme
Computational linguistics

Similar documents (author)

  1. Bookstein, A.: Probability and Fuzzy-set applications to information retrieval (1985) 1.89
    1.8918184 = sum of:
      1.8918184 = product of:
        3.7836368 = sum of:
          3.7836368 = weight(author_txt:bookstein in 781) [ClassicSimilarity], result of:
            3.7836368 = score(doc=781,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.082592495 = queryNorm
              5.3508706 = fieldWeight in 781, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.625 = fieldNorm(doc=781)
        0.5 = coord(1/2)
    
  2. Bookstein, A.: Relevance (1979) 1.89
    1.8918184 = sum of:
      1.8918184 = product of:
        3.7836368 = sum of:
          3.7836368 = weight(author_txt:bookstein in 839) [ClassicSimilarity], result of:
            3.7836368 = score(doc=839,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.082592495 = queryNorm
              5.3508706 = fieldWeight in 839, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.625 = fieldNorm(doc=839)
        0.5 = coord(1/2)
    
  3. Nicholson, D.: Subject-based interoperability : issues from the High Level Thesaurus (HILT) Project (2002) 1.89
    1.8918184 = sum of:
      1.8918184 = product of:
        3.7836368 = sum of:
          3.7836368 = weight(author_txt:nicholson in 2917) [ClassicSimilarity], result of:
            3.7836368 = score(doc=2917,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.082592495 = queryNorm
              5.3508706 = fieldWeight in 2917, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.625 = fieldNorm(doc=2917)
        0.5 = coord(1/2)
    
  4. Bookstein, A.: Fuzzy requests : an approach to weighted Boolean searches (1979) 1.89
    1.8918184 = sum of:
      1.8918184 = product of:
        3.7836368 = sum of:
          3.7836368 = weight(author_txt:bookstein in 5504) [ClassicSimilarity], result of:
            3.7836368 = score(doc=5504,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.082592495 = queryNorm
              5.3508706 = fieldWeight in 5504, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.625 = fieldNorm(doc=5504)
        0.5 = coord(1/2)
    
  5. Bookstein, A.: Informetric distributions : I. Unified overview (1990) 1.89
    1.8918184 = sum of:
      1.8918184 = product of:
        3.7836368 = sum of:
          3.7836368 = weight(author_txt:bookstein in 6902) [ClassicSimilarity], result of:
            3.7836368 = score(doc=6902,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.082592495 = queryNorm
              5.3508706 = fieldWeight in 6902, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.561393 = idf(docFreq=22, maxDocs=44218)
                0.625 = fieldNorm(doc=6902)
        0.5 = coord(1/2)
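
The author-match scores above can be reproduced by hand. The following is a minimal sketch, assuming the standard Lucene ClassicSimilarity definitions idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq). The function names are hypothetical, and queryNorm, fieldNorm and coord are copied from the listing for doc 781 rather than recomputed. Lucene works in 32-bit floats, so the results should agree with the listing only to about three decimal places.

```python
import math

# Values copied from the explain listing for doc 781 (author_txt:bookstein).
MAX_DOCS = 44218
DOC_FREQ = 22
QUERY_NORM = 0.082592495
FIELD_NORM = 0.625
COORD = 1 / 2  # one of two query clauses matched


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency as assumed for ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq):
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    """Score of one term: queryWeight * fieldWeight, as in the explain tree."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight


author_term = term_score(1.0, DOC_FREQ, MAX_DOCS, QUERY_NORM, FIELD_NORM)
author_total = author_term * COORD
print(f"term score {author_term:.4f}, after coord {author_total:.4f}")
```

All five author matches share the same docFreq, fieldNorm and coord, which is why they receive the identical score of 1.89.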
    

Similar documents (content)

  1. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.20
    0.20056558 = sum of:
      0.20056558 = product of:
        0.62676746 = sum of:
          0.03998324 = weight(abstract_txt:strength in 5188) [ClassicSimilarity], result of:
            0.03998324 = score(doc=5188,freq=1.0), product of:
              0.12061455 = queryWeight, product of:
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.017055431 = queryNorm
              0.33149597 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.043582436 = weight(abstract_txt:yield in 5188) [ClassicSimilarity], result of:
            0.043582436 = score(doc=5188,freq=1.0), product of:
              0.12774837 = queryWeight, product of:
                1.029148 = boost
                7.2780466 = idf(docFreq=82, maxDocs=44218)
                0.017055431 = queryNorm
              0.34115845 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2780466 = idf(docFreq=82, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.05104038 = weight(abstract_txt:pair in 5188) [ClassicSimilarity], result of:
            0.05104038 = score(doc=5188,freq=1.0), product of:
              0.14193527 = queryWeight, product of:
                1.0847892 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.017055431 = queryNorm
              0.35960323 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.1035199 = weight(abstract_txt:bearing in 5188) [ClassicSimilarity], result of:
            0.1035199 = score(doc=5188,freq=3.0), product of:
              0.15768482 = queryWeight, product of:
                1.1433918 = boost
                8.085969 = idf(docFreq=36, maxDocs=44218)
                0.017055431 = queryNorm
              0.65649885 = fieldWeight in 5188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.085969 = idf(docFreq=36, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.022427388 = weight(abstract_txt:text in 5188) [ClassicSimilarity], result of:
            0.022427388 = score(doc=5188,freq=1.0), product of:
              0.11831521 = queryWeight, product of:
                1.7154619 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017055431 = queryNorm
              0.18955624 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.05980637 = weight(abstract_txt:terms in 5188) [ClassicSimilarity], result of:
            0.05980637 = score(doc=5188,freq=4.0), product of:
              0.15775363 = queryWeight, product of:
                2.2872827 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017055431 = queryNorm
              0.37911248 = fieldWeight in 5188, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.08668388 = weight(abstract_txt:term in 5188) [ClassicSimilarity], result of:
            0.08668388 = score(doc=5188,freq=3.0), product of:
              0.22237511 = queryWeight, product of:
                2.7156465 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.017055431 = queryNorm
              0.38980925 = fieldWeight in 5188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
          0.21972387 = weight(abstract_txt:clumping in 5188) [ClassicSimilarity], result of:
            0.21972387 = score(doc=5188,freq=1.0), product of:
              0.47323394 = queryWeight, product of:
                2.8012578 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.017055431 = queryNorm
              0.46430284 = fieldWeight in 5188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.046875 = fieldNorm(doc=5188)
        0.32 = coord(8/25)
    
  2. Bookstein, A.; Raita, T.: Discovering term occurrence structure in text (2001) 0.19
    0.19252409 = sum of:
      0.19252409 = product of:
        1.2032756 = sum of:
          0.08631216 = weight(abstract_txt:tendency in 5751) [ClassicSimilarity], result of:
            0.08631216 = score(doc=5751,freq=1.0), product of:
              0.12691386 = queryWeight, product of:
                1.025781 = boost
                7.2542357 = idf(docFreq=84, maxDocs=44218)
                0.017055431 = queryNorm
              0.6800846 = fieldWeight in 5751, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2542357 = idf(docFreq=84, maxDocs=44218)
                0.09375 = fieldNorm(doc=5751)
          0.05980637 = weight(abstract_txt:terms in 5751) [ClassicSimilarity], result of:
            0.05980637 = score(doc=5751,freq=1.0), product of:
              0.15775363 = queryWeight, product of:
                2.2872827 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017055431 = queryNorm
              0.37911248 = fieldWeight in 5751, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=5751)
          0.6214729 = weight(abstract_txt:clumping in 5751) [ClassicSimilarity], result of:
            0.6214729 = score(doc=5751,freq=2.0), product of:
              0.47323394 = queryWeight, product of:
                2.8012578 = boost
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.017055431 = queryNorm
              1.3132467 = fieldWeight in 5751, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.905128 = idf(docFreq=5, maxDocs=44218)
                0.09375 = fieldNorm(doc=5751)
          0.4356841 = weight(abstract_txt:measures in 5751) [ClassicSimilarity], result of:
            0.4356841 = score(doc=5751,freq=4.0), product of:
              0.42750314 = queryWeight, product of:
                4.611534 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.017055431 = queryNorm
              1.0191367 = fieldWeight in 5751, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.09375 = fieldNorm(doc=5751)
        0.16 = coord(4/25)
    
  3. Ruge, G.: Experiments on linguistically-based term associations (1992) 0.17
    0.16518039 = sum of:
      0.16518039 = product of:
        0.6882516 = sum of:
          0.07714957 = weight(abstract_txt:statistical in 1810) [ClassicSimilarity], result of:
            0.07714957 = score(doc=1810,freq=1.0), product of:
              0.14837477 = queryWeight, product of:
                1.5685384 = boost
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.017055431 = queryNorm
              0.5199642 = fieldWeight in 1810, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.09375 = fieldNorm(doc=1810)
          0.044854775 = weight(abstract_txt:text in 1810) [ClassicSimilarity], result of:
            0.044854775 = score(doc=1810,freq=1.0), product of:
              0.11831521 = queryWeight, product of:
                1.7154619 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017055431 = queryNorm
              0.37911248 = fieldWeight in 1810, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=1810)
          0.14704469 = weight(abstract_txt:semantically in 1810) [ClassicSimilarity], result of:
            0.14704469 = score(doc=1810,freq=1.0), product of:
              0.2280888 = queryWeight, product of:
                1.944765 = boost
                6.8766055 = idf(docFreq=123, maxDocs=44218)
                0.017055431 = queryNorm
              0.64468175 = fieldWeight in 1810, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.8766055 = idf(docFreq=123, maxDocs=44218)
                0.09375 = fieldNorm(doc=1810)
          0.05980637 = weight(abstract_txt:terms in 1810) [ClassicSimilarity], result of:
            0.05980637 = score(doc=1810,freq=1.0), product of:
              0.15775363 = queryWeight, product of:
                2.2872827 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017055431 = queryNorm
              0.37911248 = fieldWeight in 1810, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=1810)
          0.14155416 = weight(abstract_txt:term in 1810) [ClassicSimilarity], result of:
            0.14155416 = score(doc=1810,freq=2.0), product of:
              0.22237511 = queryWeight, product of:
                2.7156465 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.017055431 = queryNorm
              0.6365558 = fieldWeight in 1810, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.09375 = fieldNorm(doc=1810)
          0.21784206 = weight(abstract_txt:measures in 1810) [ClassicSimilarity], result of:
            0.21784206 = score(doc=1810,freq=1.0), product of:
              0.42750314 = queryWeight, product of:
                4.611534 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.017055431 = queryNorm
              0.50956833 = fieldWeight in 1810, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.09375 = fieldNorm(doc=1810)
        0.24 = coord(6/25)
    
  4. Seo, H.-C.; Kim, S.-B.; Rim, H.-C.; Myaeng, S.-H.: Improving query translation in English-Korean cross-language information retrieval (2005) 0.12
    0.11736188 = sum of:
      0.11736188 = product of:
        0.5868094 = sum of:
          0.0850673 = weight(abstract_txt:pair in 1023) [ClassicSimilarity], result of:
            0.0850673 = score(doc=1023,freq=1.0), product of:
              0.14193527 = queryWeight, product of:
                1.0847892 = boost
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.017055431 = queryNorm
              0.5993387 = fieldWeight in 1023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6715355 = idf(docFreq=55, maxDocs=44218)
                0.078125 = fieldNorm(doc=1023)
          0.0642913 = weight(abstract_txt:statistical in 1023) [ClassicSimilarity], result of:
            0.0642913 = score(doc=1023,freq=1.0), product of:
              0.14837477 = queryWeight, product of:
                1.5685384 = boost
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.017055431 = queryNorm
              0.43330348 = fieldWeight in 1023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5462847 = idf(docFreq=468, maxDocs=44218)
                0.078125 = fieldNorm(doc=1023)
          0.1114426 = weight(abstract_txt:terms in 1023) [ClassicSimilarity], result of:
            0.1114426 = score(doc=1023,freq=5.0), product of:
              0.15775363 = queryWeight, product of:
                2.2872827 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017055431 = queryNorm
              0.7064345 = fieldWeight in 1023, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=1023)
          0.14447312 = weight(abstract_txt:term in 1023) [ClassicSimilarity], result of:
            0.14447312 = score(doc=1023,freq=3.0), product of:
              0.22237511 = queryWeight, product of:
                2.7156465 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.017055431 = queryNorm
              0.64968204 = fieldWeight in 1023, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.078125 = fieldNorm(doc=1023)
          0.18153507 = weight(abstract_txt:measures in 1023) [ClassicSimilarity], result of:
            0.18153507 = score(doc=1023,freq=1.0), product of:
              0.42750314 = queryWeight, product of:
                4.611534 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.017055431 = queryNorm
              0.4246403 = fieldWeight in 1023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.078125 = fieldNorm(doc=1023)
        0.2 = coord(5/25)
    
  5. Efron, M.: Linear time series models for term weighting in information retrieval (2010) 0.10
    0.10330326 = sum of:
      0.10330326 = product of:
        0.6456454 = sum of:
          0.10272478 = weight(abstract_txt:yield in 3688) [ClassicSimilarity], result of:
            0.10272478 = score(doc=3688,freq=2.0), product of:
              0.12774837 = queryWeight, product of:
                1.029148 = boost
                7.2780466 = idf(docFreq=82, maxDocs=44218)
                0.017055431 = queryNorm
              0.80411816 = fieldWeight in 3688, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.2780466 = idf(docFreq=82, maxDocs=44218)
                0.078125 = fieldNorm(doc=3688)
          0.099677294 = weight(abstract_txt:terms in 3688) [ClassicSimilarity], result of:
            0.099677294 = score(doc=3688,freq=4.0), product of:
              0.15775363 = queryWeight, product of:
                2.2872827 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.017055431 = queryNorm
              0.6318542 = fieldWeight in 3688, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=3688)
          0.18651399 = weight(abstract_txt:term in 3688) [ClassicSimilarity], result of:
            0.18651399 = score(doc=3688,freq=5.0), product of:
              0.22237511 = queryWeight, product of:
                2.7156465 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.017055431 = queryNorm
              0.83873594 = fieldWeight in 3688, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.078125 = fieldNorm(doc=3688)
          0.25672933 = weight(abstract_txt:measures in 3688) [ClassicSimilarity], result of:
            0.25672933 = score(doc=3688,freq=2.0), product of:
              0.42750314 = queryWeight, product of:
                4.611534 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.017055431 = queryNorm
              0.60053205 = fieldWeight in 3688, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.078125 = fieldNorm(doc=3688)
        0.16 = coord(4/25)
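
The content-based scores combine several boosted terms and then scale the sum by the fraction of query terms matched. As an illustration, the sketch below recomputes the score for doc 5751 (Bookstein and Raita, 2001) under the same assumed ClassicSimilarity formulas, idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq). The helper name and the tuple layout are hypothetical; boost, docFreq, freq, queryNorm and fieldNorm are copied from the listing, and agreement with the listed values is expected only up to float rounding.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.017055431   # copied from the listing
FIELD_NORM = 0.09375       # fieldNorm(doc=5751)
QUERY_TERMS = 25           # denominator of coord(4/25)

# (term, freq in doc, docFreq, boost) as shown for doc 5751.
MATCHED_TERMS = [
    ("tendency", 1.0, 84, 1.025781),
    ("terms", 1.0, 2106, 2.2872827),
    ("clumping", 2.0, 5, 2.8012578),
    ("measures", 4.0, 523, 4.611534),
]


def term_score(freq, doc_freq, boost):
    """queryWeight * fieldWeight for one term, as in the explain tree."""
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight


per_term = {t: term_score(f, df, b) for t, f, df, b in MATCHED_TERMS}
raw_sum = sum(per_term.values())
coord = len(MATCHED_TERMS) / QUERY_TERMS   # 4/25 = 0.16
content_total = raw_sum * coord

for term, value in per_term.items():
    print(f"{term:10s} {value:.4f}")
print(f"sum {raw_sum:.4f}, after coord {content_total:.4f}")
```

Because idf enters both queryWeight and fieldWeight, a rare term such as "clumping" (docFreq=5) contributes the largest share of the sum even though it occurs only twice in the abstract.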