Document (#30809)

Author
Baumgartner, R.
Title
Methoden und Werkzeuge zur Webdatenextraktion
Source
Semantic Web: Wege zur vernetzten Wissensgesellschaft. Ed. by T. Pellegrini and A. Blumauer
Imprint
Berlin : Springer
Year
2006
Pages
pp. 419-435
Series
X.media.press
Abstract
The World Wide Web can be regarded as the largest "database" known to us. Unfortunately, today's Web is largely designed for presentation to human users and consists of very heterogeneous data sets. Moreover, the Web lacks facilities for querying information in a structured way and aggregated across different sources. Today's Web is therefore not suited to automatic machine processing. To use Web data effectively nonetheless, languages, methods, and tools for extracting and aggregating these data have been developed. This article gives an overview and a categorization of different approaches to data extraction from the Web. Several example scenarios in B2B data exchange, in business intelligence, and in particular in the generation of data for Semantic Web ontologies illustrate the effective use of these technologies.
Theme
Data Mining

Similar documents (content)

  1. Frohner, H.: Social Tagging : Grundlagen, Anwendungen, Auswirkungen auf Wissensorganisation und soziale Strukturen der User (2010) 0.13
    0.12789486 = sum of:
      0.12789486 = product of:
        0.53289527 = sum of:
          0.07302384 = weight(abstract_txt:ansätzen in 4723) [ClassicSimilarity], result of:
            0.07302384 = score(doc=4723,freq=1.0), product of:
              0.15334398 = queryWeight, product of:
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.020125598 = queryNorm
              0.47620937 = fieldWeight in 4723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.0625 = fieldNorm(doc=4723)
          0.07302384 = weight(abstract_txt:heterogenen in 4723) [ClassicSimilarity], result of:
            0.07302384 = score(doc=4723,freq=1.0), product of:
              0.15334398 = queryWeight, product of:
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.020125598 = queryNorm
              0.47620937 = fieldWeight in 4723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.0625 = fieldNorm(doc=4723)
          0.0839964 = weight(abstract_txt:effektiv in 4723) [ClassicSimilarity], result of:
            0.0839964 = score(doc=4723,freq=1.0), product of:
              0.1683439 = queryWeight, product of:
                1.0477685 = boost
                7.983315 = idf(docFreq=40, maxDocs=44218)
                0.020125598 = queryNorm
              0.4989572 = fieldWeight in 4723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.983315 = idf(docFreq=40, maxDocs=44218)
                0.0625 = fieldNorm(doc=4723)
          0.19356172 = weight(abstract_txt:kategorisierung in 4723) [ClassicSimilarity], result of:
            0.19356172 = score(doc=4723,freq=2.0), product of:
              0.23310949 = queryWeight, product of:
                1.2329533 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.020125598 = queryNorm
              0.8303468 = fieldWeight in 4723, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0625 = fieldNorm(doc=4723)
          0.067796506 = weight(abstract_txt:daten in 4723) [ClassicSimilarity], result of:
            0.067796506 = score(doc=4723,freq=2.0), product of:
              0.14593579 = queryWeight, product of:
                1.3796297 = boost
                5.255941 = idf(docFreq=626, maxDocs=44218)
                0.020125598 = queryNorm
              0.46456394 = fieldWeight in 4723, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.255941 = idf(docFreq=626, maxDocs=44218)
                0.0625 = fieldNorm(doc=4723)
          0.041492954 = weight(abstract_txt:dieser in 4723) [ClassicSimilarity], result of:
            0.041492954 = score(doc=4723,freq=1.0), product of:
              0.15172143 = queryWeight, product of:
                1.722863 = boost
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.020125598 = queryNorm
              0.27348116 = fieldWeight in 4723, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.0625 = fieldNorm(doc=4723)
        0.24 = coord(6/25)
    
  2. Röhle, T.: Die Demontage der Gatekeeper : relationale Perspektiven zur Macht der Suchmaschinen (2009) 0.08
    0.081011645 = sum of:
      0.081011645 = product of:
        0.40505823 = sum of:
          0.07302384 = weight(abstract_txt:ansätzen in 23) [ClassicSimilarity], result of:
            0.07302384 = score(doc=23,freq=1.0), product of:
              0.15334398 = queryWeight, product of:
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.020125598 = queryNorm
              0.47620937 = fieldWeight in 23, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.0625 = fieldNorm(doc=23)
          0.080416396 = weight(abstract_txt:strukturiert in 23) [ClassicSimilarity], result of:
            0.080416396 = score(doc=23,freq=1.0), product of:
              0.16352595 = queryWeight, product of:
                1.0326662 = boost
                7.8682456 = idf(docFreq=45, maxDocs=44218)
                0.020125598 = queryNorm
              0.49176535 = fieldWeight in 23, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.8682456 = idf(docFreq=45, maxDocs=44218)
                0.0625 = fieldNorm(doc=23)
          0.056856822 = weight(abstract_txt:verschiedenen in 23) [ClassicSimilarity], result of:
            0.056856822 = score(doc=23,freq=1.0), product of:
              0.16351415 = queryWeight, product of:
                1.4603579 = boost
                5.563489 = idf(docFreq=460, maxDocs=44218)
                0.020125598 = queryNorm
              0.34771806 = fieldWeight in 23, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.563489 = idf(docFreq=460, maxDocs=44218)
                0.0625 = fieldNorm(doc=23)
          0.08298591 = weight(abstract_txt:dieser in 23) [ClassicSimilarity], result of:
            0.08298591 = score(doc=23,freq=4.0), product of:
              0.15172143 = queryWeight, product of:
                1.722863 = boost
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.020125598 = queryNorm
              0.5469623 = fieldWeight in 23, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.0625 = fieldNorm(doc=23)
          0.11177526 = weight(abstract_txt:werkzeuge in 23) [ClassicSimilarity], result of:
            0.11177526 = score(doc=23,freq=1.0), product of:
              0.25660437 = queryWeight, product of:
                1.829421 = boost
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.020125598 = queryNorm
              0.43559372 = fieldWeight in 23, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.0625 = fieldNorm(doc=23)
        0.2 = coord(5/25)
    
  3. Weigel, U.: Internet - (k)ein Netz mit doppeltem Boden? : T.1: Eine erste Annäherung; T.2: Dienste; T.3: World-Wide Web (1994) 0.08
    0.0809434 = sum of:
      0.0809434 = product of:
        1.0117925 = sum of:
          0.34114096 = weight(abstract_txt:verschiedenen in 58) [ClassicSimilarity], result of:
            0.34114096 = score(doc=58,freq=1.0), product of:
              0.16351415 = queryWeight, product of:
                1.4603579 = boost
                5.563489 = idf(docFreq=460, maxDocs=44218)
                0.020125598 = queryNorm
              2.0863085 = fieldWeight in 58, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.563489 = idf(docFreq=460, maxDocs=44218)
                0.375 = fieldNorm(doc=58)
          0.67065156 = weight(abstract_txt:werkzeuge in 58) [ClassicSimilarity], result of:
            0.67065156 = score(doc=58,freq=1.0), product of:
              0.25660437 = queryWeight, product of:
                1.829421 = boost
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.020125598 = queryNorm
              2.6135623 = fieldWeight in 58, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.375 = fieldNorm(doc=58)
        0.08 = coord(2/25)
    
  4. Krüger, S.: Wissen ist Macht : Portale weisen den Weg und öffnen Türen (2001) 0.07
    0.070661396 = sum of:
      0.070661396 = product of:
        0.29442248 = sum of:
          0.0456399 = weight(abstract_txt:ansätzen in 5737) [ClassicSimilarity], result of:
            0.0456399 = score(doc=5737,freq=1.0), product of:
              0.15334398 = queryWeight, product of:
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.020125598 = queryNorm
              0.29763085 = fieldWeight in 5737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.61935 = idf(docFreq=58, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5737)
          0.05026025 = weight(abstract_txt:strukturiert in 5737) [ClassicSimilarity], result of:
            0.05026025 = score(doc=5737,freq=1.0), product of:
              0.16352595 = queryWeight, product of:
                1.0326662 = boost
                7.8682456 = idf(docFreq=45, maxDocs=44218)
                0.020125598 = queryNorm
              0.30735335 = fieldWeight in 5737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.8682456 = idf(docFreq=45, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5737)
          0.035535514 = weight(abstract_txt:verschiedenen in 5737) [ClassicSimilarity], result of:
            0.035535514 = score(doc=5737,freq=1.0), product of:
              0.16351415 = queryWeight, product of:
                1.4603579 = boost
                5.563489 = idf(docFreq=460, maxDocs=44218)
                0.020125598 = queryNorm
              0.21732378 = fieldWeight in 5737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.563489 = idf(docFreq=460, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5737)
          0.05645235 = weight(abstract_txt:methoden in 5737) [ClassicSimilarity], result of:
            0.05645235 = score(doc=5737,freq=2.0), product of:
              0.17669527 = queryWeight, product of:
                1.5180781 = boost
                5.7833843 = idf(docFreq=369, maxDocs=44218)
                0.020125598 = queryNorm
              0.31948987 = fieldWeight in 5737, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7833843 = idf(docFreq=369, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5737)
          0.036674935 = weight(abstract_txt:dieser in 5737) [ClassicSimilarity], result of:
            0.036674935 = score(doc=5737,freq=2.0), product of:
              0.15172143 = queryWeight, product of:
                1.722863 = boost
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.020125598 = queryNorm
              0.24172547 = fieldWeight in 5737, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5737)
          0.06985953 = weight(abstract_txt:werkzeuge in 5737) [ClassicSimilarity], result of:
            0.06985953 = score(doc=5737,freq=1.0), product of:
              0.25660437 = queryWeight, product of:
                1.829421 = boost
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.020125598 = queryNorm
              0.27224606 = fieldWeight in 5737, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.0390625 = fieldNorm(doc=5737)
        0.24 = coord(6/25)
    
  5. Cejpek, J.: Wie die neuen Medien bewerten : die Informationswissenschaft als Wissenschaft mit Gewissen (1996) 0.07
    0.06848179 = sum of:
      0.06848179 = product of:
        0.5706816 = sum of:
          0.07261267 = weight(abstract_txt:dieser in 6276) [ClassicSimilarity], result of:
            0.07261267 = score(doc=6276,freq=1.0), product of:
              0.15172143 = queryWeight, product of:
                1.722863 = boost
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.020125598 = queryNorm
              0.47859204 = fieldWeight in 6276, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3756986 = idf(docFreq=1511, maxDocs=44218)
                0.109375 = fieldNorm(doc=6276)
          0.19560671 = weight(abstract_txt:werkzeuge in 6276) [ClassicSimilarity], result of:
            0.19560671 = score(doc=6276,freq=1.0), product of:
              0.25660437 = queryWeight, product of:
                1.829421 = boost
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.020125598 = queryNorm
              0.76228905 = fieldWeight in 6276, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9694996 = idf(docFreq=112, maxDocs=44218)
                0.109375 = fieldNorm(doc=6276)
          0.3024622 = weight(abstract_txt:heutige in 6276) [ClassicSimilarity], result of:
            0.3024622 = score(doc=6276,freq=1.0), product of:
              0.34312758 = queryWeight, product of:
                2.1154826 = boost
                8.059301 = idf(docFreq=37, maxDocs=44218)
                0.020125598 = queryNorm
              0.88148606 = fieldWeight in 6276, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.059301 = idf(docFreq=37, maxDocs=44218)
                0.109375 = fieldNorm(doc=6276)
        0.12 = coord(3/25)
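
The score breakdowns above appear to follow the classic TF-IDF scheme named in the listing (ClassicSimilarity): each matching term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is scaled by the coord factor. The following is a minimal sketch of that arithmetic for the first entry (doc 4723), not the search engine's actual implementation. The function names and the `DOC_4723` table are hypothetical. The idf formula 1 + ln(maxDocs / (docFreq + 1)) and a boost of 1.0 for terms listed without a boost line are assumptions, chosen because they are consistent with the listed numbers.

```python
import math

MAX_DOCS = 44218          # maxDocs, taken from the listing
QUERY_NORM = 0.020125598  # queryNorm, taken from the listing
QUERY_TERMS = 25          # denominator of coord(6/25)


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assumed to be 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, field_norm, boost=1.0):
    """Contribution of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM            # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * term_idf * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight


def document_score(matches, field_norm):
    """Sum the term contributions and scale by coord = matched / total query terms."""
    total = sum(term_score(freq, doc_freq, field_norm, boost)
                for freq, doc_freq, boost in matches)
    coord = len(matches) / QUERY_TERMS
    return total * coord


# (termFreq, docFreq, boost) for the six terms matched in doc 4723.
# A boost of 1.0 is assumed where the listing shows no boost line.
DOC_4723 = [
    (1.0, 58, 1.0),         # ansätzen
    (1.0, 58, 1.0),         # heterogenen
    (1.0, 40, 1.0477685),   # effektiv
    (2.0, 9, 1.2329533),    # kategorisierung
    (2.0, 626, 1.3796297),  # daten
    (1.0, 1511, 1.722863),  # dieser
]

if __name__ == "__main__":
    print(document_score(DOC_4723, field_norm=0.0625))
```

Under these assumptions the result should agree with the listed score of 0.12789486 for doc 4723 up to floating-point rounding. The other entries can be checked the same way by substituting their term frequencies, document frequencies, boosts, and fieldNorm values.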