r95685 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: < r95684 | r95685 | r95686 >
Date: 18:57, 29 August 2011
Author: giovanni
Status: deferred
Tags:
Comment:
added documentation to editor_lifecycle
Modified paths:
  • /trunk/tools/wsor/editor_lifecycle/README.rst (modified) (history)
  • /trunk/tools/wsor/editor_lifecycle/TODO.rst (added) (history)

Diff

Index: trunk/tools/wsor/editor_lifecycle/TODO.rst
@@ -0,0 +1,2 @@
 2+* Use `oursql.Cursor.executemany` in `fetchrates`. Presently this is not possible,
 3+ because of a bug in `oursql`. See https://answers.launchpad.net/oursql/+question/166877
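Until that `oursql` bug is fixed, the usual workaround is to emulate `executemany` with a plain loop over `execute`. A minimal sketch of that fallback follows; `executemany_fallback` and `FakeCursor` are hypothetical names used only for illustration, not part of `fetchrates`:

```python
def executemany_fallback(cursor, query, param_seq):
    """Emulate Cursor.executemany with repeated execute() calls."""
    for params in param_seq:
        cursor.execute(query, params)

class FakeCursor:
    """Minimal DB-API-style cursor that records executed statements."""
    def __init__(self):
        self.calls = []

    def execute(self, query, params=()):
        self.calls.append((query, tuple(params)))

cursor = FakeCursor()
executemany_fallback(cursor, 'SELECT ? + ?', [(1, 2), (3, 4)])
```

The trade-off is one client/server round trip per parameter tuple instead of a single batched call, which matters for large ID lists.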
Index: trunk/tools/wsor/editor_lifecycle/README.rst
@@ -1,7 +1,11 @@
2 -============
3 -README
4 -============
 2+Editor lifecycle
 3+================
54
 5+Author: Giovanni Luca Ciampaglia
 6+
 7+License
 8+-------
 9+
610 Copyright (C) 2011 GIOVANNI LUCA CIAMPAGLIA, GCIAMPAGLIA@WIKIMEDIA.ORG
711 This program is free software; you can redistribute it and/or modify
812 it under the terms of the GNU General Public License as published by
@@ -18,33 +22,54 @@
1923 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
2024 http://www.gnu.org/copyleft/gpl.html
2125
22 -workflow
 26+Installation
 27+------------
2328
24 -This package is a collection of python and shell scripts that can assist
25 -creating and analyzing data on user life cycle.
 29+To install this package you can use the normal distutils command::
2630
27 -Sample selection
 31+ python setup.py install
2832
29 -TBD
 33+See http://docs.python.org/install/index.html#install-index for more options.
 34+You might require root access (sudo) to perform a system-wide installation.
3035
31 -Edit activity data collection
 36+Usage
 37+-----
 38+See http://meta.wikimedia.org/wiki/Research:Editor_lifecycle. All scripts
 39+accept arguments from the command line and understand the common -h/--help
 40+option.
3241
33 -First use `fetchrates` to download the rate data from the MySQL database. This
34 -script takes a user_id in input (and stores the rate data in a file called
35 -<user_id>.npy). This script can be parallelized. At the end you will end up with
36 -a bunch of NPY files.
 42+Workflow
 43+--------
3744
38 -Cohort selection
 45+1. Fetch user rates using `ratesnobots.sql`::
3946
40 -See the docstring in `mkcohort`.
 47+ mysql -BN < ratesnobots.sql > rates.tsv
4148
42 -Cohort analysis
 49+Note: running this query requires access to an internal resource of the
 50+Wikimedia Foundation; see http://collab.wikimedia.org/wiki/WSoR_datasets/bot
 51+for more information. If you can't access that page, you can recreate this
 52+information from a public dump of the `user_groups` and `user` tables in the
 53+following way:
4354
44 -See `graphlife`, `fitting`, `fitting_batch.sh`, and `relax`.
 55+ a. Gather user names from the bot status page
 56+ (http://en.wikipedia.org/w/index.php?title=Wikipedia:Bots/Status) and the
 57+ list of bots by number of edits
 58+ (http://en.wikipedia.org/wiki/Wikipedia:List_of_bots_by_number_of_edits)
4559
 60+ b. Select the user IDs of the gathered user names from `user`
 61+
 62+ c. Take the union of the above data with `user_groups`::
 63+
 64+ SELECT DISTINCT ug_user FROM user_groups WHERE ug_group = 'bot'
 65+
 66+2. Use `mkcohort` to make cohorts. This will create a file where each line is a
 67+ cohort, specified by the first two columns. Columns after the second are the
 68+ IDs of users.
 69+
 70+3. Use `fetchrates` to fetch daily edit counts using the cohort data. See
 71+ `sge/rates.sh` if you want to run this query from within the toolserver.
 72+
 73+4. At this point you can use the other utilities to analyze the rate data. To
 74+ compute and plot activity peaks, use `comppeak` and `plotpeak`.
 75+
 76+5. Happy hacking/researching!
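The bot-filtering, cohort-file, and peak-computation steps above can be sketched end to end with the standard library. This is a toy illustration under stated assumptions, not the actual `mkcohort`/`fetchrates`/`comppeak` code: the table schemas are simplified stand-ins for the MediaWiki `user`/`user_groups` tables, and all helper names are hypothetical.

```python
import sqlite3
from collections import Counter
from datetime import date

# Step 1 (recreating the bot list without the internal resource):
# union of manually gathered bot user names, resolved to IDs via
# `user`, with the IDs flagged as "bot" in `user_groups`.
db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE user (user_id INTEGER, user_name TEXT);
    CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
""")
db.executemany('INSERT INTO user VALUES (?, ?)',
               [(1, 'Alice'), (2, 'ClueBot'), (3, 'Bob')])
db.executemany('INSERT INTO user_groups VALUES (?, ?)',
               [(3, 'sysop'), (4, 'bot')])

gathered_bot_names = ['ClueBot']  # from the two wiki pages in step 1a
placeholders = ','.join('?' * len(gathered_bot_names))
rows = db.execute(f"""
    SELECT user_id FROM user WHERE user_name IN ({placeholders})
    UNION
    SELECT DISTINCT ug_user FROM user_groups WHERE ug_group = 'bot'
""", gathered_bot_names).fetchall()
bot_ids = sorted(r[0] for r in rows)

# Step 2 (cohort file format): the first two columns identify the
# cohort, the remaining columns are the user IDs in that cohort.
line = '2005-01\tregistered\t7\t12\t41'
fields = line.split('\t')
cohort_key, user_ids = tuple(fields[:2]), [int(f) for f in fields[2:]]

# Steps 3-4: aggregate per-edit timestamps into daily edit counts and
# take the activity peak as the day with the most edits.
timestamps = [date(2005, 1, 3), date(2005, 1, 3), date(2005, 1, 4)]
daily = Counter(timestamps)
peak_day, peak_rate = max(daily.items(), key=lambda kv: kv[1])
```

With these toy tables, `bot_ids` comes out as `[2, 4]` (ClueBot's row in `user` plus the `user_groups` entry), and the peak falls on 2005-01-03 with two edits.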
