r59927 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r59926‎ | r59927 | r59928 >
Date:12:23, 10 December 2009
Author:daniel
Status:deferred
Tags:
Comment:
document and tweak proximity measures
Modified paths:
  • /trunk/WikiWord/WikiWord/src/main/java/de/brightbyte/wikiword/schema/ProximityStoreSchema.java (modified)
  • /trunk/WikiWord/WikiWordBuilder/src/main/java/de/brightbyte/wikiword/builder/BuildProximity.java (modified)
  • /trunk/WikiWord/WikiWordBuilder/src/main/java/de/brightbyte/wikiword/store/builder/DatabaseProximityStoreBuilder.java (modified)
  • /trunk/WikiWord/WikiWordBuilder/src/main/java/de/brightbyte/wikiword/store/builder/ProximityStoreBuilder.java (modified)

Diff

Index: trunk/WikiWord/WikiWord/src/main/java/de/brightbyte/wikiword/schema/ProximityStoreSchema.java
@@ -43,18 +43,21 @@
4444 * concepts the relation applies to. That is, A's in_degree is the size of the set of all Bs for which in(A, B) applies.
4545 * This bias is combined with the bias coefficient for a given relation to form the effective bias for that relation, e.g.:
4646 * <tt>in_effective_bias(B) = 1 - ( ( 1 - in_bias(B) ) * in_bias_coef )</tt> which amounts to <tt>1 - ( ( log(in_degree(B)) / log(number_of_concepts) ) * in_bias_coef )</tt>.
47 - * For each relation, there's also weight factor provided, which is applied to complement's bias. So if in(A, B) applies, in_w(A, B) is given by:
48 - * <tt>in_weight_factor * out_effective_bias(B) </tt>; the feature vector for A is then calculated for each feature B as follows:
49 - * <tt>A[B] = w(A,B) = in_w(A, B) + out_w(A, B) + up_w(A, B) + down_w(A, B)</tt>. Note that A[A] = c, where c is the "self-weight", which usually equals 1.
 47+ * For each relation, there's also a weight factor provided, which is applied to the complement's bias.
 48+ * To calculate the effective weight of an association, the effective bias on both "sides" of the relation is combined with the weight factor for that relation.
 49+ * So if in(A, B) applies, in_w(A, B) is given by: <tt>in_weight_factor * in_effective_bias(A) * out_effective_bias(B) </tt>.
 50+ * The feature vector for A is then calculated for each feature B as follows:
 51+ * <tt>A[B] = w(A,B) = in_w(A, B) + out_w(A, B) + up_w(A, B) + down_w(A, B)</tt>. Note that A[A] = c, where c is the "self-weight", which usually equals 1.
 52+ * Depending on the weight-factors used, the weight function may or may not be symmetric: in_w(A, B) may differ from in_w(B, A); however,
 53+ * if in_weight_factor = out_weight_factor, then in_w(A, B) = out_w(B, A), and up_w(A, B) = down_w(B, A) if up_weight_factor = down_weight_factor.
 54+ * Thus, w(A,B) = w(B, A) and A[B] = B[A] if in_weight_factor = out_weight_factor and up_weight_factor = down_weight_factor.
5055 * </p>
5156 *
52 - * <p>The self, weight, the four bias-coeficients and the four weight-factors are the parameters for the feature vector calculation.
 57+ * <p>The self-weight, the four bias-coefficients and the four weight-factors are the parameters for the feature vector calculation.
5358 * They can be tweaked to adjust the relative weight given to the different types of relations in the thesaurus with respect to determining the semantic proximity,
5459 * that is, the thematic similarity, of concepts. E.g. having similar incoming links (i.e. frequent co-occurrence of references) is a stronger indicator
5560 * of similarity than common outgoing links.
5661 * </p>
57 - *
58 - * <p>Note that as a result of the rules above, the weight of the association is not symmetrical: w(A, B) may be different from w(B,A)</p>
5962 *
6063 * <h4>Table <tt>proximity</tt></h4>
6164 * <p>Holds statistical figures relating to the entire thesaurus.</p>
@@ -63,8 +66,10 @@
6467 * <dt>concept2</dt><dd>The second concept. Comprises a unique key together with concept1.</dd>
6568 * <dt>proximity</dt><dd>The semantic proximity of concept1 and concept2. This is given by the scalar product
6669 * of the normalized feature vectors of concept1 and concept2, as stored in the feature table.
 70+ * This can be interpreted as the cosine of the angle between the concepts' feature vectors.
6771 * Entries with a low proximity value may be omitted (subject to tweak value <tt>proximity.threshold</tt>).</dd>
6872 * </dl>
 73+ * <p>Note that the proximity relation is symmetrical, i.e. prox(A, B) = prox(B, A), regardless of the weight factors used.</p>
6974 *
7075 * @author daniel
7176 */
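
The weight rules documented in the Javadoc above can be tried out in isolation. The following is only a sketch under stated assumptions: the class and method names are hypothetical, the degrees and thesaurus size are made up, and the bias is taken to be 1 - log(degree) / log(number_of_concepts), which is what the two equivalent <tt> formulas in the Javadoc imply. It assumes a degree of at least 1.

```java
final class AssociationWeightSketch {
    /** Bias of a concept for one relation: 1 - log(degree) / log(numberOfConcepts). Assumes degree >= 1. */
    static double bias(long degree, long numberOfConcepts) {
        return 1.0 - Math.log(degree) / Math.log(numberOfConcepts);
    }

    /** Effective bias: 1 - ((1 - bias) * biasCoef). A coefficient of 0 switches the bias off. */
    static double effectiveBias(double bias, double biasCoef) {
        return 1.0 - (1.0 - bias) * biasCoef;
    }

    /** Weight of one relation: the weight factor times the effective bias on both sides. */
    static double weight(double weightFactor, double baseEffectiveBias, double targetEffectiveBias) {
        return weightFactor * baseEffectiveBias * targetEffectiveBias;
    }

    public static void main(String[] args) {
        long concepts = 1_000_000;                // made-up thesaurus size
        double inBiasA = bias(1_000, concepts);   // A is referenced from 1000 concepts
        double outBiasB = bias(10, concepts);     // B references 10 concepts

        double inEffA = effectiveBias(inBiasA, 1.0);
        double outEffB = effectiveBias(outBiasB, 0.2);

        double inW = weight(1.5, inEffA, outEffB);   // in_w(A, B)
        double outW = weight(1.5, outEffB, inEffA);  // out_w(B, A), same weight factor
        System.out.println("in_w(A, B)  = " + inW);
        System.out.println("out_w(B, A) = " + outW);
    }
}
```

With equal weight factors the two printed values agree up to rounding, which is the symmetry condition stated in the Javadoc.
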
Index: trunk/WikiWord/WikiWordBuilder/src/main/java/de/brightbyte/wikiword/builder/BuildProximity.java
@@ -44,7 +44,7 @@
4545 this.proximityStore.buildProximity();
4646
4747 section("-- statistics --------------------------------------------------");
48 - conceptStore.getConceptStore().getStatisticsStore().dumpStatistics(getLogOutput());
 48+ conceptStore.getProximityStoreBuilder().dumpTableStats(out);
4949 }
5050
5151 public static void main(String[] argv) throws Exception {
Index: trunk/WikiWord/WikiWordBuilder/src/main/java/de/brightbyte/wikiword/store/builder/DatabaseProximityStoreBuilder.java
@@ -12,6 +12,7 @@
1313 import de.brightbyte.db.RelationTable;
1414 import de.brightbyte.util.PersistenceException;
1515 import de.brightbyte.wikiword.TweakSet;
 16+import de.brightbyte.wikiword.processor.ImportProgressTracker;
1617 import de.brightbyte.wikiword.schema.ProximityStoreSchema;
1718 import de.brightbyte.wikiword.schema.WikiWordConceptStoreSchema;
1819
@@ -47,29 +48,29 @@
4849 proximityThreshold = tweaks.getTweak("proximity.threshold", 0.15);
4950 }
5051
 52+ private static String getBiasFormula(String biasField, double biasCoef) {
 53+ if ( biasField == null || biasCoef <= 0) return "1";
 54+ else if (biasCoef==1) return biasField;
 55+ else if (biasCoef>1) throw new IllegalArgumentException("biasCoef must not be greater than 1");
 56+ else return "( 1 - ( ( 1 - "+biasField+" ) * "+biasCoef+" ) ) ";
 57+ }
 58+
5159 /**
5260 * Builds feature vectors. For a specification, refer to ProximityStoreSchema
5361 */
54 - protected int buildFeatures(DatabaseTable t, String conceptField, String featureField, String suffix, double w, String biasField, double biasCoef) throws PersistenceException {
 62+ protected int buildFeatures(DatabaseTable t, String conceptField, String featureField, String suffix, double w, String baseBiasField, double baseBiasCoef, String targetBiasField, double targetBiasCoef) throws PersistenceException {
5563 if (!conceptStore.areStatsComplete()) throw new IllegalStateException("statistics need to be built before concept infos!");
5664
5765 String v = ""+w;
 66+ if (baseBiasField!=null && baseBiasCoef>0) v = getBiasFormula("B."+baseBiasField, baseBiasCoef) + " * " + v;
 67+ if (targetBiasField!=null && targetBiasCoef>0) v = getBiasFormula("D."+targetBiasField, targetBiasCoef) + " * " + v;
5868
59 - //NOTE: conider bias of reference target
60 - //FIXME: also consider local (outgoing) bias? feature vectors will be normalized, so that's not so relevant maybe?
61 - //NOTE: since there are usually more link than categories, there's a bias in favor of categories!
62 - // number of links grows with article length, number of categories does not!
63 - if (biasField!=null && biasCoef>0) {
64 - if (biasCoef==1) v = "D."+biasField+" * "+w;
65 - else if (biasCoef>1) throw new IllegalArgumentException("biasCoef must not be greater than 1");
66 - else v = "( 1 - ( ( 1 - D."+biasField+" ) * "+biasCoef+" ) ) * "+w;
67 - }
68 -
6969 DatabaseTable degreeTable = conceptStore.getStatisticsStoreBuilder().getDatabaseAccess().getTable("degree");
7070
7171 String sql = "INSERT INTO "+featureTable.getSQLName()+" (concept, feature, total_weight) ";
7272 sql += " SELECT T."+conceptField+", T."+featureField+", "+v+" FROM "+t.getSQLName()+" as T ";
73 - if (biasField!=null && biasCoef!=0) sql += " JOIN "+degreeTable.getSQLName()+" as D ON T."+featureField+" = D.concept ";
 73+ if (baseBiasField!=null && baseBiasCoef!=0) sql += " JOIN "+degreeTable.getSQLName()+" as B ON T."+conceptField+" = B.concept ";
 74+ if (targetBiasField!=null && targetBiasCoef!=0) sql += " JOIN "+degreeTable.getSQLName()+" as D ON T."+featureField+" = D.concept ";
7475
7576 if (suffix!=null) sql += " "+suffix+" ";
7677
@@ -97,22 +98,22 @@
9899 }
99100
100101 if (beginTask("buildFeatures", "feature#down")) {
101 - int n = buildFeatures(broaderTable, "broad", "narrow", null, featureVectorFactors.downWeight, "up_bias", featureVectorFactors.downBiasCoef);
 102+ int n = buildFeatures(broaderTable, "broad", "narrow", null, featureVectorFactors.downWeight, "down_bias", featureVectorFactors.downBiasCoef, "up_bias", featureVectorFactors.upBiasCoef);
102103 endTask("buildFeatures", "feature#down", n+" entries");
103104 }
104105
105106 if (beginTask("buildFeatures", "feature#up")) {
106 - int n = buildFeatures(broaderTable, "narrow", "broad", null, featureVectorFactors.upWeight, "down_bias", featureVectorFactors.upBiasCoef);
 107+ int n = buildFeatures(broaderTable, "narrow", "broad", null, featureVectorFactors.upWeight, "up_bias", featureVectorFactors.upBiasCoef, "down_bias", featureVectorFactors.downBiasCoef);
107108 endTask("buildFeatures", "feature#up", n+" entries");
108109 }
109110
110111 if (beginTask("buildFeatures", "feature#out")) {
111 - int n = buildFeatures(linkTable, "anchor", "target", null, featureVectorFactors.outWeight, "in_bias", featureVectorFactors.outBiasCoef);
 112+ int n = buildFeatures(linkTable, "anchor", "target", null, featureVectorFactors.outWeight, "out_bias", featureVectorFactors.outBiasCoef, "in_bias", featureVectorFactors.inBiasCoef);
112113 endTask("buildFeatures", "feature#out", n+" entries");
113114 }
114115
115116 if (beginTask("buildFeatures", "feature#in")) {
116 - int n = buildFeatures(linkTable, "target", "anchor", null, featureVectorFactors.inWeight, "out_bias", featureVectorFactors.inBiasCoef);
 117+ int n = buildFeatures(linkTable, "target", "anchor", null, featureVectorFactors.inWeight, "in_bias", featureVectorFactors.inBiasCoef, "out_bias", featureVectorFactors.outBiasCoef);
117118 endTask("buildFeatures", "feature#in", n+" entries");
118119 }
119120
@@ -183,12 +184,17 @@
184185 protected String name;
185186 protected DatabaseTable conceptTable;
186187 protected int lastId ;
 188+
 189+ protected ImportProgressTracker conceptTracker;
 190+ protected ImportProgressTracker featureTracker;
187191
188192 public CollectProximityQuery(String context, String name) {
189193 super();
190194 this.context = context;
191195 this.name = name;
192196 this.conceptTable = conceptStore.getDatabaseAccess().getTable("concept");
 197+ this.conceptTracker = new ImportProgressTracker("concepts");
 198+ this.featureTracker = new ImportProgressTracker("features");
193199 }
194200
195201 public String getChunkField() {
@@ -228,12 +234,25 @@
229235 sql += " ORDER BY id ASC";
230236
231237 int n = 0;
 238+ int i = 0;
232239 try {
233240 ResultSet res = DatabaseProximityStoreBuilder.this.executeQuery(context+"::"+name+"#chunk"+chunk, sql);
234241 while (res.next()) {
235242 lastId = res.getInt(1);
236243
237 - n+= insertProximity(lastId); //TODO: progress tracker!
 244+ int c = insertProximity(lastId);
 245+ n+= c; //accumulate the count returned by insertProximity
 246+ i+= 1;
 247+
 248+ conceptTracker.step();
 249+ featureTracker.step(c);
 250+
 251+ if ( (i % 1000) == 0 ) {
 252+ conceptTracker.chunk();
 253+ featureTracker.chunk();
 254+ log("- "+conceptTracker);
 255+ log("- "+featureTracker);
 256+ }
238257 }
239258
240259 res.close();
@@ -241,6 +260,11 @@
242261 throw new PersistenceException(e);
243262 }
244263
 264+ conceptTracker.chunk();
 265+ featureTracker.chunk();
 266+ log("- "+conceptTracker);
 267+ log("- "+featureTracker);
 268+
245269 flush();
246270 return n;
247271 }
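
To see which SQL expression the new two-sided bias handling produces, the string building from getBiasFormula and buildFeatures can be replayed outside the store builder. This is a sketch, not the committed code: the class name and the weightExpression helper are hypothetical, the trailing space of the original formula is dropped, and the coefficient check is reordered slightly. The table aliases B and D are the ones used in the diff.

```java
final class BiasFormulaSketch {
    /** Same rules as getBiasFormula in the diff: returns an SQL expression for the effective bias. */
    static String biasFormula(String biasField, double biasCoef) {
        if (biasField == null || biasCoef <= 0) return "1";
        if (biasCoef > 1) throw new IllegalArgumentException("biasCoef must not be greater than 1");
        if (biasCoef == 1) return biasField;
        return "( 1 - ( ( 1 - " + biasField + " ) * " + biasCoef + " ) )";
    }

    /** Combines the weight factor with the bias of both sides, as buildFeatures does for its SELECT list. */
    static String weightExpression(double w, String baseBiasField, double baseBiasCoef,
                                   String targetBiasField, double targetBiasCoef) {
        String v = "" + w;
        if (baseBiasField != null && baseBiasCoef > 0) {
            v = biasFormula("B." + baseBiasField, baseBiasCoef) + " * " + v;
        }
        if (targetBiasField != null && targetBiasCoef > 0) {
            v = biasFormula("D." + targetBiasField, targetBiasCoef) + " * " + v;
        }
        return v;
    }

    public static void main(String[] args) {
        // feature#in with the new defaults: inWeight = 1.5, inBiasCoef = 1, outBiasCoef = 0.2
        System.out.println(weightExpression(1.5, "in_bias", 1, "out_bias", 0.2));
        // prints: ( 1 - ( ( 1 - D.out_bias ) * 0.2 ) ) * B.in_bias * 1.5
    }
}
```
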
Index: trunk/WikiWord/WikiWordBuilder/src/main/java/de/brightbyte/wikiword/store/builder/ProximityStoreBuilder.java
@@ -10,20 +10,20 @@
1111 //NOTE: since there are usually more link than categories, there's a bias in favor of categories!
1212 // number of links grows with article length, number of categories does not!
1313
14 - public final double selfWeight = 1;
 14+ public final double selfWeight = 4;
1515 //public final double weightOffset = 1;
1616
17 - public final double downWeight = 0.5; //having common children is not very relevant
18 - public final double downBiasCoef = 0; //if a child has many parents doesn't matter
 17+ public final double downWeight = 0.2; //having common children is not very relevant; also, categorization is favored by systemic bias, so tone it down.
 18+ public final double downBiasCoef = 1; //if the parent has many children should be considered
1919
20 - public final double upWeight = 1.2; //having common parents is interesting
21 - public final double upBiasCoef = 1; //if the parent has many children should be considered
 20+ public final double upWeight = 1.2; //having common parents is interesting; note: categorization is favored by systemic bias, but the bias is tuned out here anyway.
 21+ public final double upBiasCoef = 0.1; //if a child has many parents doesn't matter
2222
23 - public final double inWeight = 1.2; //bein referenced from the same place is a string factor
24 - public final double inBiasCoef = 0.2; //if the link's origin contains a lot of links is not so important; note: dampen link bias
 23+ public final double inWeight = 1.5; //being referenced from the same place is a strong factor
 24+ public final double inBiasCoef = 1; //if the concept is referenced a lot, co-reference becomes less relevant
2525
26 - public final double outWeight = 0.5; //referencing the same thing isn't so very important
27 - public final double outBiasCoef = 0.5; //if the link's target is used a lot is a major factor; note: dampen link bias
 26+ public final double outWeight = 1.0; //referencing the same thing is a good indicator
 27+ public final double outBiasCoef = 0.2; //if the concept has many outgoing links doesn't matter much
2828 }
2929
3030 public void buildFeatures() throws PersistenceException;
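
The change of selfWeight from 1 to 4 can be explored with a toy proximity calculation. This is an illustrative sketch only: the class, the helper names and the two-entry vectors are hypothetical. It assumes that each concept's vector holds the concept itself with the self-weight (A[A] = c in the schema Javadoc) and that proximity is the scalar product of the normalized vectors. In this toy case, two concepts that share a single feature of weight 1 and do not reference each other have a proximity of 1 / (selfWeight² + 1), so raising the self-weight to 4 puts the pair below the default proximity.threshold of 0.15.

```java
import java.util.HashMap;
import java.util.Map;

final class ProximitySketch {
    /** Euclidean length of a sparse feature vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0;
        for (double x : v.values()) sum += x * x;
        return Math.sqrt(sum);
    }

    /** Scalar product of the normalized vectors, i.e. the cosine of the angle between them. */
    static double proximity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        return dot / (norm(a) * norm(b));
    }

    /** Hypothetical vector: the concept itself with the self-weight, plus one shared feature X. */
    static Map<String, Double> vector(String self, double selfWeight) {
        Map<String, Double> v = new HashMap<>();
        v.put(self, selfWeight);
        v.put("X", 1.0);
        return v;
    }

    public static void main(String[] args) {
        for (double selfWeight : new double[] {1, 4}) {
            double p = proximity(vector("A", selfWeight), vector("B", selfWeight));
            System.out.println("selfWeight=" + selfWeight + " proximity=" + p);
        }
    }
}
```
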
