r52545 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r52545
Date:15:04, 29 June 2009
Author:daniel
Status:deferred
Tags:
Comment:
reorganize
Modified paths:
  • /trunk/WikiWord/WikiWordIntegrator/src/docbook/Architecture.xml (deleted)
  • /trunk/WikiWord/WikiWordIntegrator/src/docbook/BeanShell.xml (deleted)
  • /trunk/WikiWord/WikiWordIntegrator/src/docbook/Database.xml (deleted)
  • /trunk/WikiWord/WikiWordIntegrator/src/docbook/Manual.xml (added)
  • /trunk/WikiWord/WikiWordIntegrator/src/docbook/Outline.xml (deleted)
  • /trunk/WikiWord/WikiWordIntegrator/src/docbook/Process.xml (deleted)
  • /trunk/WikiWord/WikiWordIntegrator/src/docbook/SourceDescriptor.xml (deleted)

Diff

Index: trunk/WikiWord/WikiWordIntegrator/src/docbook/BeanShell.xml
@@ -1,5 +0,0 @@
2 -<?xml version="1.0" encoding="UTF-8"?>
3 -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
4 -<article lang="en-US">
5 -<para/>
6 -</article>
Index: trunk/WikiWord/WikiWordIntegrator/src/docbook/Process.xml
@@ -1,30 +0,0 @@
2 -<?xml version="1.0" encoding="UTF-8"?>
3 -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
4 -<article lang="en-US">
5 -<para>This document outlines the process of semantic vocabulary integration using WikiWord.</para>
6 -<para>Building the WikiWord Thesaurus</para>
7 -<para>The WikiWord thesaurus is extracted from Wikipedia's XML dumps.</para>
8 -<para>Building the WikiWord thesaurus is a long-running task, best performed centrally on appropriately powerful hardware. It is assumed that this will be performed as a service by Wikimedia.</para>
9 -<para>Thesaurus Extraction</para>
10 -<para>For each Wikipedia dump (i.e. for each language), a thesaurus is extracted.</para>
11 -<para>Thesaurus Merging</para>
12 -<para>The thesauri are then merged into a multi-lingual thesaurus.</para>
13 -<para>Building Concept Properties</para>
14 -<para>Semantic integration means mapping concepts from one vocabulary to another. In the context of this document, concepts from a foreign authority are mapped to WikiWord concepts.</para>
15 -<para>Concept properties are the basis for mapping concepts. Properties include information supplied from the foreign authority, properties extracted from Wikipedia articles as well as the information in the thesaurus proper, that is, the terms/labels used to refer to a given concept.</para>
16 -<para>Property Extraction</para>
17 -<para>WikiWord allows additional properties to be extracted from Wikipedia pages and attached to an existing thesaurus. This way, a generic thesaurus can be used for a variety of domains, while the properties extracted can be tailored to the desired domain. Concept properties are most often extracted from so-called “infobox” templates contained in Wikipedia articles, but also from other special-purpose templates and categorization tags.</para>
18 -<para>Property Import</para>
19 -<para>Concepts defined by a third party authority are imported into the WikiWord database as sets of properties attached to a concept id. They are represented in the database as generic triples of concept id, property name and property value.</para>
20 -<para>External concepts can be imported directly from CSV/TSV files or from the result of an SQL query.</para>
21 -<para>Mapping Concepts</para>
22 -<para>Building Concept Associations</para>
23 -<para>Concept <emphasis>associations</emphasis> represent individual links between foreign authority concepts and WikiWord concepts. There may be several such associations between the same pair of concepts. Associations may be annotated with a variety of information about how the association was derived and how it is weighted.</para>
24 -<para>Building Concept Mappings</para>
25 -<para>Concept <emphasis>mappings</emphasis> represent aggregated links between foreign authority concepts and WikiWord concepts. There may be only one mapping between a given pair of concepts. However, the same foreign concept may be mapped to several WikiWord concepts, and a single WikiWord concept may be mapped to several different foreign concepts.</para>
26 -<para>Mappings are generally derived from associations by a kind of “grouping” operation: all associations between a given pair of concepts are grouped into a single mapping entry. The annotations of a mapping are reduced to figures aggregated from the associations that define it.</para>
27 -<para>Filtering Concept Mappings</para>
28 -<para>To get the mappings desired for a given purpose, different kinds of filters can be applied. One very common filter uses a threshold: all mappings that fall below a given value according to some measure (often the value of a specific “weight” annotation) are ignored.</para>
29 -<para>Often, it is desired to get <emphasis>unique</emphasis> mappings, that is, to have a given foreign concept map to only one WikiWord concept. There are two default ways to achieve this: either by using the <emphasis>best</emphasis> available mapping for a given foreign concept, according to some measure, or by simply ignoring all ambiguous mappings. The latter of course reduces the number of mappings, but it also improves the level of confidence in them.</para>
30 -<para>Sometimes, it is desired to get only <emphasis>exact, exclusive</emphasis> matches, that is, to exclude not only any foreign concept for which there exists more than one mapping to WikiWord, but also all WikiWord concepts mapped to more than one foreign concept. This yields a strict 1:1 relationship and avoids any mismatches in scope or granularity. This is particularly useful when transferring definitions from one authority to another.</para>
31 -</article>
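
Note on the Process text above (carried over into Manual.xml): the grouping and filtering steps it describes can be sketched roughly as follows. This is only an illustration in Python, not WikiWord code. The record layout (foreign id, WikiWord id, weight), the use of a summed weight as the aggregated figure, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical association records: (foreign_id, wikiword_id, weight).
ASSOCIATIONS = [
    ("f1", "w1", 6),
    ("f1", "w1", 3),  # a second association for the same pair
    ("f1", "w2", 2),
    ("f2", "w2", 7),
    ("f3", "w3", 5),
]


def build_mappings(associations):
    """Group associations by concept pair, aggregating count and total weight."""
    mappings = defaultdict(lambda: {"count": 0, "weight": 0})
    for foreign, concept, weight in associations:
        entry = mappings[(foreign, concept)]
        entry["count"] += 1
        entry["weight"] += weight
    return dict(mappings)


def above_threshold(mappings, minimum):
    """Ignore every mapping whose aggregated weight is below the threshold."""
    return {pair: m for pair, m in mappings.items() if m["weight"] >= minimum}


def best_per_foreign(mappings):
    """Keep only the highest-weighted mapping for each foreign concept."""
    best = {}
    for pair, m in mappings.items():
        held = best.get(pair[0])
        if held is None or m["weight"] > mappings[held]["weight"]:
            best[pair[0]] = pair
    return {pair: mappings[pair] for pair in best.values()}


def _degree(mappings, index):
    """Count how many mappings each concept on one side takes part in."""
    counts = defaultdict(int)
    for pair in mappings:
        counts[pair[index]] += 1
    return counts


def unambiguous(mappings):
    """Drop every foreign concept that maps to more than one WikiWord concept."""
    foreign = _degree(mappings, 0)
    return {p: m for p, m in mappings.items() if foreign[p[0]] == 1}


def exclusive(mappings):
    """Keep strict 1:1 pairs: both sides occur in exactly one mapping."""
    foreign, concepts = _degree(mappings, 0), _degree(mappings, 1)
    return {p: m for p, m in mappings.items()
            if foreign[p[0]] == 1 and concepts[p[1]] == 1}


mappings = build_mappings(ASSOCIATIONS)
print(sorted(best_per_foreign(mappings)))
print(sorted(exclusive(mappings)))
```

In this sketch the "best" filter keeps one mapping per foreign concept, while the exclusive filter additionally drops any WikiWord concept that is claimed by more than one foreign concept, which is why it returns fewer pairs.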
Index: trunk/WikiWord/WikiWordIntegrator/src/docbook/Outline.xml
@@ -1,5 +0,0 @@
2 -<?xml version="1.0" encoding="UTF-8"?>
3 -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
4 -<article lang="en-US">
5 -<para/>
6 -</article>
Index: trunk/WikiWord/WikiWordIntegrator/src/docbook/SourceDescriptor.xml
@@ -1,5 +0,0 @@
2 -<?xml version="1.0" encoding="UTF-8"?>
3 -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
4 -<article lang="en-US">
5 -<para/>
6 -</article>
Index: trunk/WikiWord/WikiWordIntegrator/src/docbook/Architecture.xml
@@ -1,30 +0,0 @@
2 -<?xml version="1.0" encoding="UTF-8"?>
3 -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
4 -<article lang="en-US">
5 -<para>Overview</para>
6 -<orderedlist>
7 -<listitem>
8 -<para>The entry point is the Application.</para>
9 -</listitem>
10 -<listitem>
11 -<para>The Application sets up the Store (target) and DataCursor (source).</para>
12 -</listitem>
13 -<listitem>
14 -<para>The Application then creates a Processor around the Store and calls it on the DataCursor.</para>
15 -</listitem>
16 -<listitem>
17 -<para>The Processor fetches one entry after another from the DataCursor and passes it to the StoreBuilder. Note that any logic for filtering, grouping and converting of entries is usually implemented in the DataCursor, not in the Processor.</para>
18 -</listitem>
19 -</orderedlist>
20 -<para>Application</para>
21 -<para>DB Configuration, Tweaks, SourceDescriptor</para>
22 -<para>Store, StoreBuilder</para>
23 -<para>FeatureSet</para>
24 -<para>DataCursor</para>
25 -<para>FeatureSetSourceDescriptor</para>
26 -<para>Processor</para>
27 -<para>Associations</para>
28 -<para>MappingCandidates</para>
29 -<para>Filter, Selector</para>
30 -<para>Scorer</para>
31 -</article>
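
Note on the Architecture overview above: the control flow it lists (Application sets up Store and DataCursor, Processor moves entries from one to the other, filtering lives in the cursor) can be sketched as below. This is a minimal Python illustration, not the WikiWord classes themselves. The class names ListCursor and MemoryStore, every method signature, and the sample entries are assumptions made for the example.

```python
class ListCursor:
    """Stand-in for a DataCursor: hands out entries, filtering as it goes."""

    def __init__(self, entries, keep=lambda entry: True):
        self._entries = iter(entries)
        self._keep = keep

    def next(self):
        """Return the next accepted entry, or None when the source is exhausted."""
        for entry in self._entries:
            if self._keep(entry):
                return entry
        return None


class MemoryStore:
    """Stand-in for a Store/StoreBuilder: collects entries in a list."""

    def __init__(self):
        self.entries = []

    def store(self, entry):
        self.entries.append(entry)


class Processor:
    """Moves entries from a cursor into the store, with no logic of its own."""

    def __init__(self, store):
        self._store = store

    def process(self, cursor):
        count = 0
        entry = cursor.next()
        while entry is not None:
            self._store.store(entry)
            count += 1
            entry = cursor.next()
        return count


def run_application(raw_entries):
    """Entry point: set up target and source, then run the processor."""
    store = MemoryStore()
    # Filtering is configured on the cursor, not in the processor.
    cursor = ListCursor(raw_entries, keep=lambda e: bool(e.get("label")))
    processed = Processor(store).process(cursor)
    return store, processed


store, processed = run_application([
    {"id": "x1", "label": "Ada Lovelace"},
    {"id": "x2", "label": ""},
    {"id": "x3", "label": "Alan Turing"},
])
print(processed, [entry["id"] for entry in store.entries])
```

Keeping the processor free of filtering logic means the same loop can be reused for any source, which appears to be the intent of the note in the last list item.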
Index: trunk/WikiWord/WikiWordIntegrator/src/docbook/Database.xml
@@ -1,5 +0,0 @@
2 -<?xml version="1.0" encoding="UTF-8"?>
3 -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
4 -<article lang="en-US">
5 -<para/>
6 -</article>
Index: trunk/WikiWord/WikiWordIntegrator/src/docbook/Manual.xml
@@ -0,0 +1,144 @@
 2+<?xml version="1.0" encoding="UTF-8"?>
 3+<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
 4+<article lang="en-US">
 5+<title>WikiWord: Integrator</title>
 6+
 7+ <para>WikiWord is a system for extracting a thesaurus from Wikipedia.
 8+ The Integrator module is designed to use this data as glue between
 9+ different data sets, that is, to map between different vocabularies,
 10+ standardized or natural.</para>
 11+
 12+<sect1>
 13+ <title>Process</title>
 14+ <para>This section outlines the process of semantic vocabulary integration using WikiWord.</para>
 15+<sect2>
 16+ <title>Building the WikiWord Thesaurus</title>
 17+ <para>The WikiWord thesaurus is extracted from Wikipedia's XML dumps.</para>
 18+ <para>Building the WikiWord thesaurus is a long-running task, best performed centrally on appropriately powerful hardware. It is assumed that this will be performed as a service by Wikimedia.</para>
 19+<sect3>
 20+ <title>Thesaurus Extraction</title>
 21+ <para>For each Wikipedia dump (i.e. for each language), a thesaurus is extracted.</para>
 22+</sect3>
 23+<sect3>
 24+ <title>Thesaurus Merging</title>
 25+ <para>The thesauri are then merged into a multi-lingual thesaurus.</para>
 26+</sect3>
 27+<sect3>
 28+ <title>Building Concept Properties</title>
 29+ <para>Semantic integration means mapping concepts from one vocabulary to another. In the context of this document, concepts from a foreign authority are mapped to WikiWord concepts.</para>
 30+ <para>Concept properties are the basis for mapping concepts. Properties include information supplied from the foreign authority, properties extracted from Wikipedia articles as well as the information in the thesaurus proper, that is, the terms/labels used to refer to a given concept.</para>
 31+</sect3>
 32+<sect3>
 33+ <title>Property Extraction</title>
 34+ <para>WikiWord allows additional properties to be extracted from Wikipedia pages and attached to an existing thesaurus. This way, a generic thesaurus can be used for a variety of domains, while the properties extracted can be tailored to the desired domain. Concept properties are most often extracted from so-called &quot;infobox&quot; templates contained in Wikipedia articles, but also from other special-purpose templates and categorization tags.</para>
 35+</sect3>
 36+<sect3>
 37+ <title>Property Import</title>
 38+ <para>Concepts defined by a third party authority are imported into the WikiWord database as sets of properties attached to a concept id. They are represented in the database as generic triples of concept id, property name and property value.</para>
 39+ <para>External concepts can be imported directly from CSV/TSV files or from the result of an SQL query.</para>
 40+</sect3>
 41+</sect2>
 42+
 43+<sect2>
 44+ <title>Mapping Concepts</title>
 45+<sect3>
 46+<title>Building Concept Associations</title>
 47+<para>Concept <emphasis>associations</emphasis> represent individual links between foreign authority concepts and WikiWord concepts. There may be several such associations between the same pair of concepts. Associations may be annotated with a variety of information about how the association was derived and how it is weighted.</para>
 48+</sect3>
 49+<sect3>
 50+<title>Building Concept Mappings</title>
 51+<para>Concept <emphasis>mappings</emphasis> represent aggregated links between foreign authority concepts and WikiWord concepts. There may be only one mapping between a given pair of concepts. However, the same foreign concept may be mapped to several WikiWord concepts, and a single WikiWord concept may be mapped to several different foreign concepts.</para>
 52+<para>Mappings are generally derived from associations by a kind of &quot;grouping&quot; operation: all associations between a given pair of concepts are grouped into a single mapping entry. The annotations of a mapping are reduced to figures aggregated from the associations that define it.</para>
 53+</sect3>
 54+<sect3>
 55+<title>Filtering Concept Mappings</title>
 56+<para>To get the mappings desired for a given purpose, different kinds of filters can be applied. One very common filter uses a threshold: all mappings that fall below a given value according to some measure (often the value of a specific &quot;weight&quot; annotation) are ignored.</para>
 57+<para>Often, it is desired to get <emphasis>unique</emphasis> mappings, that is, to have a given foreign concept map to only one WikiWord concept. There are two default ways to achieve this: either by using the <emphasis>best</emphasis> available mapping for a given foreign concept, according to some measure, or by simply ignoring all ambiguous mappings. The latter of course reduces the number of mappings, but it also improves the level of confidence in them.</para>
 58+<para>Sometimes, it is desired to get only <emphasis>exact, exclusive</emphasis> matches, that is, to exclude not only any foreign concept for which there exists more than one mapping to WikiWord, but also all WikiWord concepts mapped to more than one foreign concept. This yields a strict 1:1 relationship and avoids any mismatches in scope or granularity. This is particularly useful when transferring definitions from one authority to another.</para>
 59+</sect3>
 60+</sect2>
 61+</sect1>
 62+<sect1>
 63+<title>Architecture</title>
 64+
 65+<sect2>
 66+<title>Classes</title>
 67+<orderedlist>
 68+<listitem>
 69+<para>The entry point is the Application.</para>
 70+</listitem>
 71+<listitem>
 72+<para>The Application sets up the Store (target) and DataCursor (source).</para>
 73+</listitem>
 74+<listitem>
 75+<para>The Application then creates a Processor around the Store and calls it on the DataCursor.</para>
 76+</listitem>
 77+<listitem>
 78+<para>The Processor fetches one entry after another from the DataCursor and passes it to the StoreBuilder. Note that any logic for filtering, grouping and converting of entries is usually implemented in the DataCursor, not in the Processor.</para>
 79+</listitem>
 80+</orderedlist>
 81+<para>Application</para>
 82+<para>DB Configuration, Tweaks, SourceDescriptor</para>
 83+<para>Store, StoreBuilder</para>
 84+<para>FeatureSet</para>
 85+<para>DataCursor</para>
 86+<para>FeatureSetSourceDescriptor</para>
 87+<para>Processor</para>
 88+<para>Associations</para>
 89+<para>MappingCandidates</para>
 90+<para>Filter, Selector</para>
 91+<para>Scorer</para>
 92+<para>Aggregator, Accessor</para>
 93+</sect2>
 94+
 95+
 96+<sect2>
 97+<title>Database</title>
 98+</sect2>
 99+
 100+</sect1>
 101+
 102+<sect1>
 103+<title>Environment</title>
 104+
 105+<sect2>
 106+<title>Configuration files</title>
 107+</sect2>
 108+
 109+<sect2>
 110+<title>Command Line</title>
 111+</sect2>
 112+
 113+<sect2>
 114+<title>BeanShell Commands</title>
 115+</sect2>
 116+
 117+<sect2>
 118+<title>Parameters</title>
 119+
 120+<sect3>
 121+<title>Database</title>
 122+</sect3>
 123+
 124+<sect3>
 125+<title>Tweaks</title>
 126+</sect3>
 127+
 128+<sect3>
 129+<title>Source Descriptor</title>
 130+</sect3>
 131+
 132+<sect3>
 133+<title>Source Descriptor Defaults</title>
 134+</sect3>
 135+
 136+<sect3>
 137+<title>Built-In Scripts</title>
 138+</sect3>
 139+
 140+</sect2>
 141+
 142+
 143+</sect1>
 144+
 145+</article>
Property changes on: trunk/WikiWord/WikiWordIntegrator/src/docbook/Manual.xml
___________________________________________________________________
Added: svn:mergeinfo
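
Note on the "Property Import" section of Manual.xml: the manual says external concepts are stored as generic triples of concept id, property name and property value, and can be imported from CSV/TSV files. A minimal sketch of that flattening step, assuming a TSV file with a header row and one id column, could look like this. The function name, the column names and the sample data are hypothetical, and the actual WikiWord importer and database schema may differ.

```python
import csv
import io


def rows_to_triples(tsv_text, id_column):
    """Flatten each TSV row into (concept_id, property_name, property_value)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        concept_id = row[id_column]
        for name, value in row.items():
            # The id column identifies the concept; empty cells carry no property.
            if name != id_column and value:
                triples.append((concept_id, name, value))
    return triples


# Hypothetical sample data standing in for a foreign authority file.
SAMPLE = "id\tlabel\tborn\nx1\tAda Lovelace\t1815\nx2\tAlan Turing\t\n"

triples = rows_to_triples(SAMPLE, "id")
for triple in triples:
    print(triple)
```

Because every property becomes its own row, new property names can be imported without changing the table layout, which matches the "generic triples" wording in the manual.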
