Index: trunk/tools/wsor/editor_lifecycle/TODO.rst
— | — | @@ -0,0 +1,2 @@
| 2 | +* Use `oursql.Cursor.executemany` in `fetchrates`. Presently this is not possible
| 3 | +  because of a bug in `oursql`. See https://answers.launchpad.net/oursql/+question/166877
Index: trunk/tools/wsor/editor_lifecycle/README.rst |
— | — | @@ -1,7 +1,11 @@
2 | | -============ |
3 | | -README |
4 | | -============ |
| 2 | +Editor lifecycle |
| 3 | +================ |
5 | 4 | |
| 5 | +Author: Giovanni Luca Ciampaglia |
| 6 | + |
| 7 | +License |
| 8 | +------- |
| 9 | + |
6 | 10 | Copyright (C) 2011 GIOVANNI LUCA CIAMPAGLIA, GCIAMPAGLIA@WIKIMEDIA.ORG |
7 | 11 | This program is free software; you can redistribute it and/or modify |
8 | 12 | it under the terms of the GNU General Public License as published by |
— | — | @@ -18,33 +22,54 @@
19 | 23 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. |
20 | 24 | http://www.gnu.org/copyleft/gpl.html |
21 | 25 | |
22 | | -workflow |
| 26 | +Installation |
| 27 | +------------ |
23 | 28 | |
24 | | -This package is a collection of python and shell scripts that can assist |
25 | | -creating and analyzing data on user life cycle. |
| 29 | +To install this package, you can use the standard distutils command::
26 | 30 | |
27 | | -Sample selection |
| 31 | + python setup.py install |
28 | 32 | |
29 | | -TBD |
| 33 | +See http://docs.python.org/install/index.html#install-index for more options.
| 34 | +You might require root access (sudo) to perform a system-wide installation. |
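| | +
| | +If you don't have root access, a per-user installation (a standard distutils
| | +option) should also work::
| | +
| | +    python setup.py install --user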
30 | 35 | |
31 | | -Edit activity data collection |
| 36 | +Usage |
| 37 | +----- |
| 38 | +See http://meta.wikimedia.org/wiki/Research:Editor_lifecycle. All scripts
| 39 | +accept arguments from the command line and understand the common -h/--help |
| 40 | +option. |
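| | +
| | +For example, to see the options of `fetchrates`::
| | +
| | +    fetchrates --help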
32 | 41 | |
33 | | -First use `fetchrates` to download the rate data from the MySQL database. This |
34 | | -script takes a user_id in input (and stores the rate data in a file called |
35 | | -<user_id>.npy). This script can be parallelized. At the end you will end up with |
36 | | -a bunch of NPY files. |
| 42 | +Workflow |
| 43 | +-------- |
37 | 44 | |
38 | | -Cohort selection |
| 45 | +1. Fetch user rates using `ratesnobots.sql`:: |
39 | 46 | |
40 | | -See the docstring in `mkcohort`. |
| 47 | +    mysql -BN < ratesnobots.sql > rates.tsv
41 | 48 | |
42 | | -Cohort analysis |
| 49 | +Note: Running this query requires access to an internal resource of the
| 50 | +Wikimedia Foundation; see http://collab.wikimedia.org/wiki/WSoR_datasets/bot
| 51 | +for more information. If you can't access that page, you can recreate the
| 52 | +information from a public dump of the `user_groups` and `user` tables in the
| 53 | +following way:
43 | 54 | |
44 | | -See `graphlife`, `fitting`, `fitting_batch.sh`, and `relax`. |
| 55 | + a. Gather usernames from the bot status page
| 56 | +    (http://en.wikipedia.org/w/index.php?title=Wikipedia:Bots/Status) and the
| 57 | +    list of bots by number of edits
| 58 | +    (http://en.wikipedia.org/wiki/Wikipedia:List_of_bots_by_number_of_edits)
45 | 59 | |
| 60 | + b. Select the user IDs of the gathered usernames from the `user` table
| 61 | +
| 62 | + c. Take the union of the above data with the bot entries in `user_groups`
| | +    (see the combined sketch after this list)::
| 63 | +
| 64 | +       SELECT DISTINCT ug_user FROM user_groups WHERE ug_group = 'bot';
| 65 | + |
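| | +A sketch of the combined query for steps b. and c.: it assumes the usernames
| | +gathered in step a. have been loaded into a helper table `gathered_bots` with
| | +a single `bot_name` column (this helper table is illustrative, not part of
| | +the dumps)::
| | +
| | +    SELECT user_id FROM user
| | +        JOIN gathered_bots ON user_name = bot_name
| | +    UNION
| | +    SELECT DISTINCT ug_user FROM user_groups WHERE ug_group = 'bot';
| | +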
| 66 | +2. Use `mkcohort` to group users into cohorts. This will create a file where
| 67 | +   each line is a cohort, identified by the first two columns; the columns
| 68 | +   after the second are the IDs of the users in that cohort.
| 69 | + |
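| | +   For illustration, a line of this file might look like this (tab-separated;
| | +   the first two fields are placeholders for the cohort key, and the user IDs
| | +   are made up)::
| | +
| | +       <key1>  <key2>  1021    4223    15842
| | +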
| 70 | +3. Use `fetchrates` to fetch daily edit counts using the cohort data. See
| 71 | +   `sge/rates.sh` if you want to run this step on the Toolserver.
| 72 | + |
| 73 | +4. At this point you can use the other utilities to analyze the rate data. To |
| 74 | + compute and plot activity peaks, use `comppeak` and `plotpeak`. |
| 75 | + |
| 76 | +5. Happy hacking/researching! |
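| | +
| | +Schematically, a complete run might look like this (the arguments of
| | +`mkcohort`, `fetchrates`, `comppeak`, and `plotpeak` are hypothetical
| | +placeholders here; consult each tool's -h/--help for the real ones)::
| | +
| | +    mysql -BN < ratesnobots.sql > rates.tsv    # 1. fetch user rates
| | +    mkcohort ...                               # 2. group users into cohorts
| | +    fetchrates ...                             # 3. fetch daily edit counts
| | +    comppeak ... ; plotpeak ...                # 4. compute and plot peaks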