Index: branches/ariel/xmldumps-backup/README |
— | — | @@ -6,40 +6,45 @@ |
7 | 7 | |
8 | 8 | === Worker === |
9 | 9 | |
10 | | -Each dump machine runs a worker process which continuously generates dumps. |
| 10 | +Each dump machine runs a worker process, a shell script which continuously |
| 11 | +calls a python script to generate a dump for the next available wiki. |
11 | 12 | At each iteration, the set of wikis is ordered by last dump date, and the |
12 | 13 | least-recently-touched wiki is selected. |
13 | 14 | |
14 | | -Workers are kept from stomping on each other by creating a lock file in |
15 | | -the private dump directory. To aid in administration, the lock file contains |
| 15 | +There are two directory trees used by the dumps processes, one for public |
| 16 | +tables and files of public wikis, and one for private wikis or for private |
| 17 | +tables and files (such as the user table) of public wikis. |
| 18 | + |
| 19 | +Workers (the python scripts) are kept from stomping on each other by creating |
| 20 | +a lock file in the private dump directory for the specific wiki. The lock file contains |
16 | 21 | the hostname and process ID of the worker process holding the lock. |
17 | 22 | |
18 | 23 | Lock files are touched every 10 seconds while the process runs, and removed |
19 | 24 | at the end. |
20 | 25 | |
21 | | -On each iteration, the script and configuration are reloaded, so additions |
22 | | -to the database list or dump code will be made available without manually |
23 | | -restarting things. |
| 26 | +On each iteration, a new copy of the python script is run, which reads its |
| 27 | +configuration files from scratch, so additions to the database list files or |
| 28 | +changes to the dump script made during one dump run will
| 29 | +go into effect at the start of the next dump. |
24 | 30 | |
25 | | - |
26 | 31 | === Monitor === |
27 | 32 | |
28 | | -One master machine runs the monitor process, which periodically sweeps all |
29 | | -wikis for their current status. This accomplishes two tasks: |
| 33 | +One server runs the monitor process, which periodically sweeps all |
| 34 | +public dump directories (one per wiki) for their current status. This accomplishes two tasks: |
30 | 35 | |
31 | 36 | * The index page is updated with a summary of dump states |
32 | | -* Aborted dumps are detected and cleaned up |
| 37 | +* Aborted dumps are detected and cleaned up (how complete is this?) |
33 | 38 | |
34 | 39 | A lock file that has not been touched in some time is detected as stale, |
35 | 40 | indicating that the worker process holding the lock has died. The status |
36 | 41 | for that dump can then be updated from running to stopped, and the lock |
37 | | -file is removed so that the wiki will get redumped later. |
| 42 | +file is removed so that the wiki will get dumped again later. |
38 | 43 | |
| 44 | +== Code == |
39 | 45 | |
40 | | -== Code files == |
41 | | - |
42 | 46 | worker.py |
43 | | -- Runs a dump for the least-recently dumped wiki in the stack. |
| 47 | +- Runs a dump for the least-recently dumped wiki in the stack, or for a wiki
| 48 | +  specified on the command line.
44 | 49 | |
45 | 50 | monitor.py |
46 | 51 | - Generates the site-wide index summary and removes stale locks. |
— | — | @@ -47,7 +52,16 @@ |
48 | 53 | WikiDump.py |
49 | 54 | - Shared classes and functions |
50 | 55 | |
| 56 | +CommandManagement.py |
| 57 | +- Classes for running multiple commands concurrently, used to run some phases
| 58 | +  of the dumps in multiple pieces in parallel, for speed.
51 | 59 | |
| 60 | +mwbzutils/ |
| 61 | +- Library of utilities for working with bzip2 files, used for locating |
| 62 | + an arbitrary XML page in a dump file, checking that the file was written |
| 63 | + out completely without truncation, and other tools. See the README in |
| 64 | + the directory for more details. |
| 65 | + |
52 | 66 | == Configuration == |
53 | 67 | |
54 | 68 | Configuration is done with an INI-style configuration file wikidump.conf. |