r77729 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r77728 | r77729 | r77730
Date: 17:12, 4 December 2010
Author: reedy
Status: deferred
Tags:
Comment:
Fixup svn props
Modified paths:
  • /trunk/tools/editor_trends (modified) (history)
  • /trunk/tools/editor_trends/.svn_ignore (deleted) (history)
  • /trunk/tools/editor_trends/README.1ST (modified) (history)
  • /trunk/tools/editor_trends/algorithms (modified) (history)
  • /trunk/tools/editor_trends/analyses (modified) (history)
  • /trunk/tools/editor_trends/analyses/cohort_confidence_intervals.py (modified) (history)
  • /trunk/tools/editor_trends/bots (modified) (history)
  • /trunk/tools/editor_trends/data (modified) (history)
  • /trunk/tools/editor_trends/data/csv (modified) (history)
  • /trunk/tools/editor_trends/data/database (modified) (history)
  • /trunk/tools/editor_trends/data/objects (modified) (history)
  • /trunk/tools/editor_trends/database (modified) (history)
  • /trunk/tools/editor_trends/database/sqlite_logic.py (modified) (history)
  • /trunk/tools/editor_trends/datasets (modified) (history)
  • /trunk/tools/editor_trends/documentation (modified) (history)
  • /trunk/tools/editor_trends/etl (modified) (history)
  • /trunk/tools/editor_trends/experience (modified) (history)
  • /trunk/tools/editor_trends/logs (modified) (history)
  • /trunk/tools/editor_trends/namespaces (modified) (history)
  • /trunk/tools/editor_trends/run.bat (modified) (history)
  • /trunk/tools/editor_trends/scripts (modified) (history)
  • /trunk/tools/editor_trends/statistics (modified) (history)
  • /trunk/tools/editor_trends/statistics/r (modified) (history)
  • /trunk/tools/editor_trends/statistics/stata (modified) (history)
  • /trunk/tools/editor_trends/statistics/stata/cohort_charts.do (modified) (history)
  • /trunk/tools/editor_trends/tests (modified) (history)
  • /trunk/tools/editor_trends/tests/mongodb (modified) (history)
  • /trunk/tools/editor_trends/utils (modified) (history)
  • /trunk/tools/editor_trends/utils/namespace_downloader.py (modified) (history)
  • /trunk/tools/editor_trends/wikitree (modified) (history)
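This revision normalizes Subversion properties across the tree: `svn:eol-style native` on source files and a shared `svn:ignore` list on each directory (`svn:ignore` is per-directory, which is why the same list repeats for every path in the diff below). The archive does not record the exact invocation used; a sketch of commands that would produce equivalent property changes, with paths and values taken from the diff, is:

```shell
# Mark text files for native line-ending conversion
# (recorded below as "Added: svn:eol-style ... native")
svn propset svn:eol-style native \
    trunk/tools/editor_trends/README.1ST \
    trunk/tools/editor_trends/analyses/cohort_confidence_intervals.py \
    trunk/tools/editor_trends/statistics/stata/cohort_charts.do \
    trunk/tools/editor_trends/utils/namespace_downloader.py

# The shared ignore list, as it appears in each "Added: svn:ignore" block
cat > ignore.txt <<'EOF'
wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
EOF

# Apply it to each affected directory (one example; the diff lists the rest)
svn propset svn:ignore -F ignore.txt trunk/tools/editor_trends/analyses
```

With the ignore patterns carried as a property, the checked-in `.svn_ignore` file becomes redundant, which is why the diff below deletes it.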

Diff

Index: trunk/tools/editor_trends/.svn_ignore
@@ -1,12 +0,0 @@
2 -*.pyc
3 -*.xml
4 -*.db
5 -*.bin
6 -*.zip
7 -*.csv
8 -.*
9 -zips/
10 -wikistats/
11 -datasets/
12 -data/
13 -notes.txt
\ No newline at end of file
Property changes on: trunk/tools/editor_trends/namespaces
___________________________________________________________________
Added: svn:ignore
141 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Index: trunk/tools/editor_trends/analyses/cohort_confidence_intervals.py
@@ -1,49 +1,49 @@
2 -__author__ = '''\n'''.join(['Diederik van Liere (dvanliere@gmail.com)', ])
3 -__author__email = 'dvanliere at gmail dot com'
4 -__date__ = '2010-11-24'
5 -__version__ = '0.1'
6 -
7 -import sys
8 -sys.path.append('..')
9 -
10 -import configuration
11 -settings = configuration.Settings()
12 -from utils import utils
13 -from database import db
14 -
15 -
16 -#def dataset_edits_by_month(dbname, **kwargs):
17 -# dbname = kwargs.pop('dbname')
18 -# mongo = db.init_mongo_db(dbname)
19 -# editors = mongo['dataset']
20 -# name = dbname + '_edits_by_month.csv'
21 -# fh = utils.create_txt_filehandle(settings.dataset_location, name, 'w', settings.encoding)
22 -# x = 0
23 -# vars_to_expand = ['monthly_edits']
24 -# while True:
25 -# try:
26 -# id = input_queue.get(block=False)
27 -# print input_queue.qsize()
28 -# obs = editors.find_one({'editor': id})
29 -# obs = expand_observations(obs, vars_to_expand)
30 -# if x == 0:
31 -# headers = obs.keys()
32 -# headers.sort()
33 -# headers = expand_headers(headers, vars_to_expand, obs)
34 -# utils.write_list_to_csv(headers, fh)
35 -# data = []
36 -# keys = obs.keys()
37 -# keys.sort()
38 -# for key in keys:
39 -# data.append(obs[key])
40 -# utils.write_list_to_csv(data, fh)
41 -#
42 -# x += 1
43 -# except Empty:
44 -# break
45 -# fh.close()
46 -
47 -
48 -if __name__ == '__main__':
49 -
 2+__author__ = '''\n'''.join(['Diederik van Liere (dvanliere@gmail.com)', ])
 3+__author__email = 'dvanliere at gmail dot com'
 4+__date__ = '2010-11-24'
 5+__version__ = '0.1'
 6+
 7+import sys
 8+sys.path.append('..')
 9+
 10+import configuration
 11+settings = configuration.Settings()
 12+from utils import utils
 13+from database import db
 14+
 15+
 16+#def dataset_edits_by_month(dbname, **kwargs):
 17+# dbname = kwargs.pop('dbname')
 18+# mongo = db.init_mongo_db(dbname)
 19+# editors = mongo['dataset']
 20+# name = dbname + '_edits_by_month.csv'
 21+# fh = utils.create_txt_filehandle(settings.dataset_location, name, 'w', settings.encoding)
 22+# x = 0
 23+# vars_to_expand = ['monthly_edits']
 24+# while True:
 25+# try:
 26+# id = input_queue.get(block=False)
 27+# print input_queue.qsize()
 28+# obs = editors.find_one({'editor': id})
 29+# obs = expand_observations(obs, vars_to_expand)
 30+# if x == 0:
 31+# headers = obs.keys()
 32+# headers.sort()
 33+# headers = expand_headers(headers, vars_to_expand, obs)
 34+# utils.write_list_to_csv(headers, fh)
 35+# data = []
 36+# keys = obs.keys()
 37+# keys.sort()
 38+# for key in keys:
 39+# data.append(obs[key])
 40+# utils.write_list_to_csv(data, fh)
 41+#
 42+# x += 1
 43+# except Empty:
 44+# break
 45+# fh.close()
 46+
 47+
 48+if __name__ == '__main__':
 49+
5050
\ No newline at end of file
Property changes on: trunk/tools/editor_trends/analyses/cohort_confidence_intervals.py
___________________________________________________________________
Added: svn:eol-style
5151 + native
Property changes on: trunk/tools/editor_trends/analyses
___________________________________________________________________
Added: svn:ignore
5252 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/wikitree
___________________________________________________________________
Added: svn:ignore
5353 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/etl
___________________________________________________________________
Added: svn:ignore
5454 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/scripts
___________________________________________________________________
Added: svn:ignore
5555 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/experience
___________________________________________________________________
Added: svn:ignore
5656 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/statistics/r
___________________________________________________________________
Added: svn:ignore
5757 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Index: trunk/tools/editor_trends/statistics/stata/cohort_charts.do
@@ -1,25 +1,25 @@
2 -label var months_3 "3 Months"
3 -label var months_6 "6 Months"
4 -label var months_9 "9 Months"
5 -label var months_12 "1 Year"
6 -label var months_24 "2 Years"
7 -label var months_36 "3 Years"
8 -label var months_48 "4 Years"
9 -label var months_60 "5 Years"
10 -label var months_72 "6 Years"
11 -label var months_84 "7 Years"
12 -label var months_96 "8 Years"
13 -label var months_108 "9 Years"
14 -generate one_year_exp = months_3+ months_6+ months_9+ months_12
15 -
16 -generate fewer_one_year_abs = (one_year_exp/100) * n
17 -generate more_one_year_abs = n - fewer_one_year_abs
18 -label var fewer_one_year_abs "Editors with less than one year experience"
19 -label var more_one_year_abs "Editors with more than one year experience"
20 -
21 -graph bar (asis) months_3 months_6 months_9 months_12 months_24 months_36 months_48 months_60 months_72 months_84 months_96 months_108, over(year, label(labsize(small))) stack ylabel(, labsize(vsmall) format(%9.0g)) title(Wikipedia Age Composition by Year) subtitle(Editors are getting older and influx of new editors has stagnated) note("Based on English Wikipedia, 345.000 editors." "An editor is a person who has made at least 10 edits in the main namespace.", size(tiny)) legend(nocolfirst rowgap(tiny) colgap(tiny) size(vsmall))
22 -
23 -twoway (line one_year_exp year), ytitle(%) ytitle(, size(vsmall)) xtitle() xlabel(2001(1)2010, labsize(vsmall)) title(Percentage of Wikipedia editors with 1 year experience) note("Based on the English Wikipedia, dataset 345.000 editors.", size(vsmall))
24 -
25 -
26 -graph bar (asis) fewer_one_year_abs more_one_year_abs, over(year, label(labsize(vsmall))) stack blabel(bar, size(tiny) position(inside) format(%9.0f)) ylabel(, labsize(vsmall) format(%9.0g)) title(Editors with one year vs multiple years of experience) legend(colfirst cols(1))
 2+label var months_3 "3 Months"
 3+label var months_6 "6 Months"
 4+label var months_9 "9 Months"
 5+label var months_12 "1 Year"
 6+label var months_24 "2 Years"
 7+label var months_36 "3 Years"
 8+label var months_48 "4 Years"
 9+label var months_60 "5 Years"
 10+label var months_72 "6 Years"
 11+label var months_84 "7 Years"
 12+label var months_96 "8 Years"
 13+label var months_108 "9 Years"
 14+generate one_year_exp = months_3+ months_6+ months_9+ months_12
 15+
 16+generate fewer_one_year_abs = (one_year_exp/100) * n
 17+generate more_one_year_abs = n - fewer_one_year_abs
 18+label var fewer_one_year_abs "Editors with less than one year experience"
 19+label var more_one_year_abs "Editors with more than one year experience"
 20+
 21+graph bar (asis) months_3 months_6 months_9 months_12 months_24 months_36 months_48 months_60 months_72 months_84 months_96 months_108, over(year, label(labsize(small))) stack ylabel(, labsize(vsmall) format(%9.0g)) title(Wikipedia Age Composition by Year) subtitle(Editors are getting older and influx of new editors has stagnated) note("Based on English Wikipedia, 345.000 editors." "An editor is a person who has made at least 10 edits in the main namespace.", size(tiny)) legend(nocolfirst rowgap(tiny) colgap(tiny) size(vsmall))
 22+
 23+twoway (line one_year_exp year), ytitle(%) ytitle(, size(vsmall)) xtitle() xlabel(2001(1)2010, labsize(vsmall)) title(Percentage of Wikipedia editors with 1 year experience) note("Based on the English Wikipedia, dataset 345.000 editors.", size(vsmall))
 24+
 25+
 26+graph bar (asis) fewer_one_year_abs more_one_year_abs, over(year, label(labsize(vsmall))) stack blabel(bar, size(tiny) position(inside) format(%9.0f)) ylabel(, labsize(vsmall) format(%9.0g)) title(Editors with one year vs multiple years of experience) legend(colfirst cols(1))
Property changes on: trunk/tools/editor_trends/statistics/stata/cohort_charts.do
___________________________________________________________________
Added: svn:eol-style
2727 + native
Property changes on: trunk/tools/editor_trends/statistics/stata
___________________________________________________________________
Added: svn:ignore
2828 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/statistics
___________________________________________________________________
Added: svn:ignore
2929 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Index: trunk/tools/editor_trends/utils/namespace_downloader.py
@@ -1,44 +1,44 @@
2 -
3 -
4 -'''
5 -Copyright (C) 2010 by Diederik van Liere (dvanliere@gmail.com)
6 -This program is free software; you can redistribute it and/or
7 -modify it under the terms of the GNU General Public License version 2
8 -as published by the Free Software Foundation.
9 -This program is distributed in the hope that it will be useful,
10 -but WITHOUT ANY WARRANTY; without even the implied warranty of
11 -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
12 -See the GNU General Public License for more details, at
13 -http://www.fsf.org/licenses/gpl.html
14 -'''
15 -
16 -__author__ = '''\n'''.join(['Diederik van Liere (dvanliere@gmail.com)', ])
17 -__author__email = 'dvanliere at gmail dot com'
18 -__date__ = 'Oct 27, 2010'
19 -__version__ = '0.1'
20 -
21 -import languages
22 -import dump_downloader as dd
23 -import configuration
24 -settings = configuration.Settings()
25 -
26 -
27 -
28 -def retrieve_json_namespace():
29 - path = '/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json'
30 - visited = set()
31 - for language in languages.MAPPING:
32 - language = languages.MAPPING[language]
33 - filename = '%s_ns.json' % language
34 - if language not in visited:
35 - domain = 'http://%s.wikipedia.org' % language
36 - dd.download_wiki_file(domain, path, filename, settings.namespace_location, 'w', True)
37 - visited.add(language)
38 -
39 -
40 -def launch_downloader():
41 - retrieve_json_namespace()
42 -
43 -
44 -if __name__ == '__main__':
 2+
 3+
 4+'''
 5+Copyright (C) 2010 by Diederik van Liere (dvanliere@gmail.com)
 6+This program is free software; you can redistribute it and/or
 7+modify it under the terms of the GNU General Public License version 2
 8+as published by the Free Software Foundation.
 9+This program is distributed in the hope that it will be useful,
 10+but WITHOUT ANY WARRANTY; without even the implied warranty of
 11+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 12+See the GNU General Public License for more details, at
 13+http://www.fsf.org/licenses/gpl.html
 14+'''
 15+
 16+__author__ = '''\n'''.join(['Diederik van Liere (dvanliere@gmail.com)', ])
 17+__author__email = 'dvanliere at gmail dot com'
 18+__date__ = 'Oct 27, 2010'
 19+__version__ = '0.1'
 20+
 21+import languages
 22+import dump_downloader as dd
 23+import configuration
 24+settings = configuration.Settings()
 25+
 26+
 27+
 28+def retrieve_json_namespace():
 29+ path = '/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=json'
 30+ visited = set()
 31+ for language in languages.MAPPING:
 32+ language = languages.MAPPING[language]
 33+ filename = '%s_ns.json' % language
 34+ if language not in visited:
 35+ domain = 'http://%s.wikipedia.org' % language
 36+ dd.download_wiki_file(domain, path, filename, settings.namespace_location, 'w', True)
 37+ visited.add(language)
 38+
 39+
 40+def launch_downloader():
 41+ retrieve_json_namespace()
 42+
 43+
 44+if __name__ == '__main__':
4545 launch_downloader()
\ No newline at end of file
Property changes on: trunk/tools/editor_trends/utils/namespace_downloader.py
___________________________________________________________________
Added: svn:eol-style
4646 + native
Property changes on: trunk/tools/editor_trends/utils
___________________________________________________________________
Added: svn:ignore
4747 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/tests/mongodb
___________________________________________________________________
Added: svn:ignore
4848 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/tests
___________________________________________________________________
Added: svn:ignore
4949 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Index: trunk/tools/editor_trends/README.1ST
@@ -1,89 +1,89 @@
2 -===============================================================================
3 -
4 - Wikipedia Editor Trends Analytics
5 -
6 -===============================================================================
7 -
8 -BACKGROUND:
9 -This package offers a set of tools used to create datasets to analyze Editor
10 -Trends. By Editor Trends we refer to the overall pattern of entering and leaving
11 -a Wikipedia site. The main information source for this package is:
12 - http://strategy.wikimedia.org/wiki/Editor_Trends_Study
13 -
14 -REQUIREMENTS:
15 -
16 -* Python 2.6 or higher (this code has not been tested with Python 3.x)
17 -
18 -OPTIONAL
19 -* MongoDB
20 -
21 -If you don't want to install / use MongDB then the package will use the built-in
22 -Sqlite library. However, this not optimized for speed and may take a serious
23 -amount of time. If possible, install MongoDB.
24 -
25 -INSTALLING USING VIRTUALENV
26 -It's recommended to use Python virtualenv. If you are not familiar with
27 -virtualenv then have a look over here:
28 - http://groups.google.com/group/python-virtualenv/browse_thread/thread/f2f19d2cc93a844e
29 -
30 -To install Editor Trends Analytics:
31 -
32 - virtualenv --no-site-packages --distribute editor_trends
33 - pip install -E editor_trends -r /editor_trends/requirements.txt
34 -
35 -
36 -The first command creates a new virtualenv called editor_trends and the second
37 -command installs the dependencies. Currently the dependencies are:
38 -* PyMongo
39 -* Progressbar
40 -
41 -INSTALLING WITHOUT VIRTUALENV
42 -If you don't like virtualenv then do the following:
43 -
44 - easy_install pymongo
45 - easy_install progressbar
46 -
47 -IMPORTANT MONGODB NOTES
48 -If you decide to use MongDB to store the results then you have to install the
49 -64-bit version. 32-bit versions of MongoDB are limited to 2GB of data and the
50 -databases created by this package will definitely be larger than that. For more
51 -background information on this limitation, please read:
52 - http://blog.mongodb.org/post/137788967/32-bit-limitations
53 -
54 -
55 -CONFIGURATION:
56 -If you would like to create a dataset for your own analyses then you should
57 -first make the appropriate changes to settings.py. Settings.py contains
58 -configuration variables such as the location of input and output files. Most
59 -settings are self-explanatory but in cases of any questions please drop me a
60 -line.
61 -
62 -PROCESSING TIMES:
63 -
64 -CONFIG NAMESPACE FILENAME CHUNKING STORING INDEXING RETRIEVING TOTAL
65 -1 0 stub-meta-history 7 3 1 ? 11
66 -
67 -
68 -*CHUNKING == splitting XML file in smaller pieces
69 -*STORING == parsing xml files and storing it in MongoDB
70 -*INDEXING == creating an index in MongoDB
71 -*RETRIEVING == generating a dataset
72 -*TOTAL == sum of all parts
73 -
74 -MACHINE CONFIGURATIONS
75 -
76 -ID OS VERSION MEMORY PROCESSOR SPEED
77 -1 Windows 7 64-bit 4GB Duo Core 2.8MHZ
78 -Please add your processing times plus configuration to help improve performance.
79 -
80 -HARDDISK REQUIREMENTS
81 -You will need at least 3x the size of xml dump file in free space on your hard
82 -disk if you want to create the databases and datasets to run your own analyses.
83 -The English stub-meta-history.xml is about 15Gb so you need about 45Gb of free
84 -diskspace.
85 -
86 -CODE:
87 -The Python code adheres to PEP8. Function names are deliberately expressive to
88 -ease understanding what's going. If you find a bug please email me at dvanliere
89 -at gmail dot com or leave a message on my Talk page.
90 -
 2+===============================================================================
 3+
 4+ Wikipedia Editor Trends Analytics
 5+
 6+===============================================================================
 7+
 8+BACKGROUND:
 9+This package offers a set of tools used to create datasets to analyze Editor
 10+Trends. By Editor Trends we refer to the overall pattern of entering and leaving
 11+a Wikipedia site. The main information source for this package is:
 12+ http://strategy.wikimedia.org/wiki/Editor_Trends_Study
 13+
 14+REQUIREMENTS:
 15+
 16+* Python 2.6 or higher (this code has not been tested with Python 3.x)
 17+
 18+OPTIONAL
 19+* MongoDB
 20+
 21+If you don't want to install / use MongDB then the package will use the built-in
 22+Sqlite library. However, this not optimized for speed and may take a serious
 23+amount of time. If possible, install MongoDB.
 24+
 25+INSTALLING USING VIRTUALENV
 26+It's recommended to use Python virtualenv. If you are not familiar with
 27+virtualenv then have a look over here:
 28+ http://groups.google.com/group/python-virtualenv/browse_thread/thread/f2f19d2cc93a844e
 29+
 30+To install Editor Trends Analytics:
 31+
 32+ virtualenv --no-site-packages --distribute editor_trends
 33+ pip install -E editor_trends -r /editor_trends/requirements.txt
 34+
 35+
 36+The first command creates a new virtualenv called editor_trends and the second
 37+command installs the dependencies. Currently the dependencies are:
 38+* PyMongo
 39+* Progressbar
 40+
 41+INSTALLING WITHOUT VIRTUALENV
 42+If you don't like virtualenv then do the following:
 43+
 44+ easy_install pymongo
 45+ easy_install progressbar
 46+
 47+IMPORTANT MONGODB NOTES
 48+If you decide to use MongDB to store the results then you have to install the
 49+64-bit version. 32-bit versions of MongoDB are limited to 2GB of data and the
 50+databases created by this package will definitely be larger than that. For more
 51+background information on this limitation, please read:
 52+ http://blog.mongodb.org/post/137788967/32-bit-limitations
 53+
 54+
 55+CONFIGURATION:
 56+If you would like to create a dataset for your own analyses then you should
 57+first make the appropriate changes to settings.py. Settings.py contains
 58+configuration variables such as the location of input and output files. Most
 59+settings are self-explanatory but in cases of any questions please drop me a
 60+line.
 61+
 62+PROCESSING TIMES:
 63+
 64+CONFIG NAMESPACE FILENAME CHUNKING STORING INDEXING RETRIEVING TOTAL
 65+1 0 stub-meta-history 7 3 1 ? 11
 66+
 67+
 68+*CHUNKING == splitting XML file in smaller pieces
 69+*STORING == parsing xml files and storing it in MongoDB
 70+*INDEXING == creating an index in MongoDB
 71+*RETRIEVING == generating a dataset
 72+*TOTAL == sum of all parts
 73+
 74+MACHINE CONFIGURATIONS
 75+
 76+ID OS VERSION MEMORY PROCESSOR SPEED
 77+1 Windows 7 64-bit 4GB Duo Core 2.8MHZ
 78+Please add your processing times plus configuration to help improve performance.
 79+
 80+HARDDISK REQUIREMENTS
 81+You will need at least 3x the size of xml dump file in free space on your hard
 82+disk if you want to create the databases and datasets to run your own analyses.
 83+The English stub-meta-history.xml is about 15Gb so you need about 45Gb of free
 84+diskspace.
 85+
 86+CODE:
 87+The Python code adheres to PEP8. Function names are deliberately expressive to
 88+ease understanding what's going. If you find a bug please email me at dvanliere
 89+at gmail dot com or leave a message on my Talk page.
 90+
Property changes on: trunk/tools/editor_trends/README.1ST
___________________________________________________________________
Added: svn:eol-style
9191 + native
Property changes on: trunk/tools/editor_trends/logs
___________________________________________________________________
Modified: svn:ignore
9292 - *.bin
split_xml
9393 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Index: trunk/tools/editor_trends/database/sqlite_logic.py
@@ -1,156 +1,156 @@
2 -def retrieve_editor_ids_db():
3 - contributors = set()
4 - connection = db.init_database()
5 - cursor = connection.cursor()
6 - if settings.PROGRESS_BAR:
7 - cursor.execute('SELECT MAX(ROWID) FROM contributors')
8 - for id in cursor:
9 - pass
10 - pbar = progressbar.ProgressBar(maxval=id[0]).start()
11 -
12 - cursor.execute('SELECT contributor FROM contributors WHERE bot=0')
13 -
14 - print 'Retrieving contributors...'
15 - for x, contributor in enumerate(cursor):
16 - contributors.add(contributor[0])
17 - if x % 100000 == 0:
18 - pbar.update(x)
19 - print 'Serializing contributors...'
20 - utils.store_object(contributors, 'contributors')
21 - print 'Finished serializing contributors...'
22 -
23 - if pbar:
24 - pbar.finish()
25 - print 'Total elapsed time: %s.' % (utils.humanize_time_difference(pbar.seconds_elapsed))
26 -
27 - connection.close()
28 -
29 -def retrieve_edits_by_contributor(input_queue, result_queue, pbar):
30 - connection = db.init_database()
31 - cursor = connection.cursor()
32 -
33 - while True:
34 - try:
35 - contributor = input_queue.get(block=False)
36 - if contributor == None:
37 - break
38 -
39 - cursor.execute('SELECT contributor, timestamp, bot FROM contributors WHERE contributor=?', (contributor,))
40 - edits = {}
41 - edits[contributor] = set()
42 - for edit, timestamp, bot in cursor:
43 - date = utils.convert_timestamp_to_date(timestamp)
44 - edits[contributor].add(date)
45 - #print edit, timestamp, bot
46 -
47 - utils.write_data_to_csv(edits, retrieve_edits_by_contributor)
48 - if pbar:
49 - utils.update_progressbar(pbar, input_queue)
50 -
51 - except Empty:
52 - pass
53 -
54 - connection.close()
55 -
56 -
57 -def store_data_db(data_queue, pids):
58 - connection = db.init_database()
59 - cursor = connection.cursor()
60 - db.create_tables(cursor, db_settings.CONTRIBUTOR_TABLE)
61 - empty = 0
62 - values = []
63 - while True:
64 - try:
65 - chunk = data_queue.get(block=False)
66 - contributor = chunk['contributor'].encode(settings.encoding)
67 - article = chunk['article']
68 - timestamp = chunk['timestamp'].encode(settings.encoding)
69 - bot = chunk['bot']
70 - values.append((contributor, article, timestamp, bot))
71 -
72 - if len(values) == 50000:
73 - cursor.executemany('INSERT INTO contributors VALUES (?,?,?,?)', values)
74 - connection.commit()
75 - #print 'Size of queue: %s' % data_queue.qsize()
76 - values = []
77 -
78 - except Empty:
79 -
80 - if all([utils.check_if_process_is_running(pid) for pid in pids]):
81 - pass
82 - else:
83 - break
84 - connection.close()
85 -
86 -
87 -def create_bots_db(db_name):
88 - '''
89 - This function reads the csv file provided by Erik Zachte and constructs a
90 - sqlite memory database. The reason for this is that I suspect I will need
91 - some simple querying capabilities in the future, else a dictionary would
92 - suffice.
93 - '''
94 - connection = db.init_database('db_name')
95 - #connection = db.init_database('data/database/bots.db')
96 - cursor = connection.cursor()
97 - db.create_tables(cursor, db_settings.BOT_TABLE)
98 - values = []
99 - fields = [field[0] for field in db_settings.BOT_TABLE['bots']]
100 - for line in utils.read_data_from_csv('data/csv/StatisticsBots.csv', settings.encoding):
101 - line = line.split(',')
102 - row = []
103 - for x, (field, value) in enumerate(zip(fields, line)):
104 - if db_settings.BOT_TABLE['bots'][x][1] == 'INTEGER':
105 - value = int(value)
106 - elif db_settings.BOT_TABLE['bots'][x][1] == 'TEXT':
107 - value = value.replace('/', '-')
108 - #print field, value
109 - row.append(value)
110 - values.append(row)
111 -
112 - cursor.executemany('INSERT INTO bots VALUES (?,?,?,?,?,?,?,?,?,?);', values)
113 - connection.commit()
114 - if db_name == ':memory':
115 - return cursor
116 - else:
117 - connection.close()
118 -
119 -def retrieve_botnames_without_id(cursor, language):
120 - return cursor.execute('SELECT name FROM bots WHERE language=?', (language,)).fetchall()
121 -
122 -
123 -def add_id_to_botnames():
124 - '''
125 - This is the worker function for the multi-process version of
126 - lookup_username.First, the names of the bots are retrieved, then the
127 - multiprocess is launched by making a call to pc.build_scaffolding. This is a
128 - generic launcher that takes as input the function to load the input_queue,
129 - the function that will do the main work and the objects to be put in the
130 - input_queue. The launcher also accepts optional keyword arguments.
131 - '''
132 - cursor = create_bots_db(':memory')
133 - files = utils.retrieve_file_list(settings.input_location, 'xml')
134 -
135 - botnames = retrieve_botnames_without_id(cursor, 'en')
136 - bots = {}
137 - for botname in botnames:
138 - bots[botname[0]] = 1
139 - pc.build_scaffolding(pc.load_queue, lookup_username, files, bots=bots)
140 - cursor.close()
141 -
142 -
143 -def debug_lookup_username():
144 - '''
145 - This function launches the lookup_username function but then single
146 - threaded, this eases debugging. That's also the reason why the queue
147 - parameters are set to None. When launching this function make sure that
148 - debug=False when calling lookup_username
149 - '''
150 - cursor = create_bots_db(':memory')
151 - botnames = retrieve_botnames_without_id(cursor, 'en')
152 - bots = {}
153 - for botname in botnames:
154 - bots[botname[0]] = 1
155 -
156 - lookup_username('12.xml', None, None, bots, debug=True)
157 - cursor.close()
 2+def retrieve_editor_ids_db():
 3+ contributors = set()
 4+ connection = db.init_database()
 5+ cursor = connection.cursor()
 6+ if settings.PROGRESS_BAR:
 7+ cursor.execute('SELECT MAX(ROWID) FROM contributors')
 8+ for id in cursor:
 9+ pass
 10+ pbar = progressbar.ProgressBar(maxval=id[0]).start()
 11+
 12+ cursor.execute('SELECT contributor FROM contributors WHERE bot=0')
 13+
 14+ print 'Retrieving contributors...'
 15+ for x, contributor in enumerate(cursor):
 16+ contributors.add(contributor[0])
 17+ if x % 100000 == 0:
 18+ pbar.update(x)
 19+ print 'Serializing contributors...'
 20+ utils.store_object(contributors, 'contributors')
 21+ print 'Finished serializing contributors...'
 22+
 23+ if pbar:
 24+ pbar.finish()
 25+ print 'Total elapsed time: %s.' % (utils.humanize_time_difference(pbar.seconds_elapsed))
 26+
 27+ connection.close()
 28+
 29+def retrieve_edits_by_contributor(input_queue, result_queue, pbar):
 30+ connection = db.init_database()
 31+ cursor = connection.cursor()
 32+
 33+ while True:
 34+ try:
 35+ contributor = input_queue.get(block=False)
 36+ if contributor == None:
 37+ break
 38+
 39+ cursor.execute('SELECT contributor, timestamp, bot FROM contributors WHERE contributor=?', (contributor,))
 40+ edits = {}
 41+ edits[contributor] = set()
 42+ for edit, timestamp, bot in cursor:
 43+ date = utils.convert_timestamp_to_date(timestamp)
 44+ edits[contributor].add(date)
 45+ #print edit, timestamp, bot
 46+
 47+ utils.write_data_to_csv(edits, retrieve_edits_by_contributor)
 48+ if pbar:
 49+ utils.update_progressbar(pbar, input_queue)
 50+
 51+ except Empty:
 52+ pass
 53+
 54+ connection.close()
 55+
 56+
 57+def store_data_db(data_queue, pids):
 58+ connection = db.init_database()
 59+ cursor = connection.cursor()
 60+ db.create_tables(cursor, db_settings.CONTRIBUTOR_TABLE)
 61+ empty = 0
 62+ values = []
 63+ while True:
 64+ try:
 65+ chunk = data_queue.get(block=False)
 66+ contributor = chunk['contributor'].encode(settings.encoding)
 67+ article = chunk['article']
 68+ timestamp = chunk['timestamp'].encode(settings.encoding)
 69+ bot = chunk['bot']
 70+ values.append((contributor, article, timestamp, bot))
 71+
 72+ if len(values) == 50000:
 73+ cursor.executemany('INSERT INTO contributors VALUES (?,?,?,?)', values)
 74+ connection.commit()
 75+ #print 'Size of queue: %s' % data_queue.qsize()
 76+ values = []
 77+
 78+ except Empty:
 79+
 80+ if all([utils.check_if_process_is_running(pid) for pid in pids]):
 81+ pass
 82+ else:
 83+ break
 84+ connection.close()
 85+
 86+
 87+def create_bots_db(db_name):
 88+ '''
 89+ This function reads the csv file provided by Erik Zachte and constructs
 90+ an in-memory sqlite database. A database is used instead of a plain
 91+ dictionary because simple querying capabilities will probably be needed
 92+ later; otherwise a dictionary would suffice.
 93+ '''
 94+ connection = db.init_database(db_name)
 95+ #connection = db.init_database('data/database/bots.db')
 96+ cursor = connection.cursor()
 97+ db.create_tables(cursor, db_settings.BOT_TABLE)
 98+ values = []
 99+ fields = [field[0] for field in db_settings.BOT_TABLE['bots']]
 100+ for line in utils.read_data_from_csv('data/csv/StatisticsBots.csv', settings.encoding):
 101+ line = line.split(',')
 102+ row = []
 103+ for x, (field, value) in enumerate(zip(fields, line)):
 104+ if db_settings.BOT_TABLE['bots'][x][1] == 'INTEGER':
 105+ value = int(value)
 106+ elif db_settings.BOT_TABLE['bots'][x][1] == 'TEXT':
 107+ value = value.replace('/', '-')
 108+ #print field, value
 109+ row.append(value)
 110+ values.append(row)
 111+
 112+ cursor.executemany('INSERT INTO bots VALUES (?,?,?,?,?,?,?,?,?,?);', values)
 113+ connection.commit()
 114+ if db_name == ':memory':
 115+ return cursor
 116+ else:
 117+ connection.close()
 118+
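As the docstring of `create_bots_db` explains, the bot list is loaded into an in-memory sqlite database so it can be queried with SQL instead of dictionary lookups. A self-contained sketch of that approach (the CSV content and two-column layout are invented for illustration; the real data comes from StatisticsBots.csv):

```python
import csv
import io
import sqlite3

# Hypothetical bot list in CSV form, one "name,language" record per line.
csv_text = 'RamBot,en\nSmackBot,en\nZwBot,de\n'

conn = sqlite3.connect(':memory:')  # sqlite3's special in-memory database name
cur = conn.cursor()
cur.execute('CREATE TABLE bots (name TEXT, language TEXT)')
# csv.reader yields one [name, language] row per line, which executemany consumes.
cur.executemany('INSERT INTO bots VALUES (?,?)', csv.reader(io.StringIO(csv_text)))
conn.commit()

# SQL filtering is the payoff over a plain dictionary.
english_bots = [row[0] for row in
                cur.execute('SELECT name FROM bots WHERE language=?', ('en',))]
```

Note that sqlite3's built-in spelling of the in-memory database name is `':memory:'` with a trailing colon.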
 119+def retrieve_botnames_without_id(cursor, language):
 120+ return cursor.execute('SELECT name FROM bots WHERE language=?', (language,)).fetchall()
 121+
 122+
 123+def add_id_to_botnames():
 124+ '''
 125+ This is the worker function for the multi-process version of
 126+ lookup_username. First, the names of the bots are retrieved; then the
 127+ multi-process run is launched by calling pc.build_scaffolding, a
 128+ generic launcher that takes the function that loads the input_queue,
 129+ the function that does the main work, and the objects to be put in the
 130+ input_queue. The launcher also accepts optional keyword arguments.
 131+ '''
 132+ cursor = create_bots_db(':memory')
 133+ files = utils.retrieve_file_list(settings.input_location, 'xml')
 134+
 135+ botnames = retrieve_botnames_without_id(cursor, 'en')
 136+ bots = {}
 137+ for botname in botnames:
 138+ bots[botname[0]] = 1
 139+ pc.build_scaffolding(pc.load_queue, lookup_username, files, bots=bots)
 140+ cursor.close()
 141+
 142+
 143+def debug_lookup_username():
 144+ '''
 145+ This function launches lookup_username single-threaded, which eases
 146+ debugging; that is also why the queue parameters are set to None. When
 147+ launching this function, make sure that lookup_username is called with
 148+ debug=True.
 149+ '''
 150+ cursor = create_bots_db(':memory')
 151+ botnames = retrieve_botnames_without_id(cursor, 'en')
 152+ bots = {}
 153+ for botname in botnames:
 154+ bots[botname[0]] = 1
 155+
 156+ lookup_username('12.xml', None, None, bots, debug=True)
 157+ cursor.close()
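`debug_lookup_username` relies on a common trick: run a multi-process worker in a single thread by passing `None` for the queues, so breakpoints and stack traces behave normally. A minimal sketch of a worker written to support both modes (all names are illustrative, not from this codebase):

```python
def process_file(filename, input_queue=None, result_queue=None, debug=False):
    # Illustrative "work": normalise the filename.
    result = filename.upper()
    if debug or result_queue is None:
        # Single-threaded debug path: return the result directly to the caller.
        return result
    # Multi-process path: hand the result to the collector process.
    result_queue.put(result)

# Debug invocation mirrors lookup_username('12.xml', None, None, bots, debug=True).
print(process_file('12.xml', None, None, debug=True))
```

Keeping both paths in one function means the debugged code is exactly the code that runs under multiprocessing.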
Property changes on: trunk/tools/editor_trends/database/sqlite_logic.py
___________________________________________________________________
Added: svn:eol-style
158158 + native
Property changes on: trunk/tools/editor_trends/database
___________________________________________________________________
Added: svn:ignore
159159 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/bots
___________________________________________________________________
Added: svn:ignore
160160 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/datasets
___________________________________________________________________
Modified: svn:ignore
161161 - cohort_data.txt
cohorts.dta
difference observations erik vs diederik.ods
difference observations erik vs diederik.xlsx
editors.dta
enwiki_editors.csv
enwiki_long_editors.csv
enwiki_wide_editors.csv
162162 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/documentation
___________________________________________________________________
Modified: svn:ignore
163163 - language_codes.xlsx
164164 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/data/database
___________________________________________________________________
Modified: svn:ignore
165165 - *.db
166166 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/data/objects
___________________________________________________________________
Modified: svn:ignore
167167 - *.bin
168168 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/data/csv
___________________________________________________________________
Modified: svn:ignore
169169 - *.csv
170170 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/data
___________________________________________________________________
Added: svn:ignore
171171 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Property changes on: trunk/tools/editor_trends/algorithms
___________________________________________________________________
Added: svn:ignore
172172 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data
Index: trunk/tools/editor_trends/run.bat
@@ -1,3 +1,3 @@
2 -@echo off
3 -python split_xml_file.py
4 -python map_wiki_editors.py
 2+@echo off
 3+python split_xml_file.py
 4+python map_wiki_editors.py
Property changes on: trunk/tools/editor_trends/run.bat
___________________________________________________________________
Added: svn:eol-style
55 + native
Property changes on: trunk/tools/editor_trends
___________________________________________________________________
Modified: svn:ignore
66 - wikistats
zips
notes.txt
*.pyc
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
77 + wikistats
zips
notes.txt
*.pyc
*.xml
*.db
*.bin
*.zip
*.csv
datasets
errors
.settings
.project
.pydevproject
wiki.cfg
fabric.py
fabfile.py
deployment
data