r102746 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r102746
Date: 03:25, 11 November 2011
Author: khorn
Status: ok
Tags:
Comment:
Adding donate.wikimedia.org to the landing page filters (and slightly improved performance).
Modified paths:
  • /trunk/udplog/filters/lp-filter.c (modified)

Diff

Index: trunk/udplog/filters/lp-filter.c
@@ -2,18 +2,24 @@
 #include <stdio.h>
 #include <string.h>
 
-char burl[]="wikimediafoundation.org/";
+char url1[] ="wikimediafoundation.org/";
+char url2[] ="donate.wikimedia.org/";
 
 
 main() {
-	char line[10240];
-	char title[10240];
-	char *urlstart, *urlend;
+	char line[65534];
+	char title[65534];
 	char *t = 0;
+	char *u = 0;
+
+	//to cut down on the processing time: Guess first.
+	//longest filter is 25 characters.
+	//I'm allowing for one heck of a subdomain, here.
+	int search_length = 75;
 
 	while (!feof(stdin)) {
 		char *r;
-		r=fgets(line, 10000, stdin);
+		r=fgets(line, 65534, stdin);
 
 		int pos=0;
 		t = line;
@@ -28,13 +34,26 @@
 		}
 		if (!t)
 			continue;
-		urlstart = t;
-		urlend = strstr(urlstart, " ");
-		strncpy(title, urlstart, urlend-urlstart);
-		title[urlend-urlstart]=0;
-		if (strstr(title, burl) )
-			printf("%s", line);
 
+		strncpy(title, t, search_length);
+		title[search_length]=0;
+
+		if (strstr(title, url1) || strstr(title, url2) ){
+			u = strstr(title, " ");
+
+			if (!u){ //no spaces, just do it.
+				printf("%s", line);
+			} else {
+				//make sure it was before the first space.
+				t = strstr(title, url1);
+				if (!t)
+					t = strstr(title, url2);
+				if ( (t) && t < u ) {
+					printf("%s", line);
+				}
+			}
+		}
+
 	}
 
 }
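
For readers who want to try the new matching rule in isolation, the following is a minimal sketch of it as a standalone function. The helper name lp_filter_matches and the SEARCH_LENGTH constant are illustrative assumptions and do not appear in lp-filter.c; the filter strings and the 75-byte window are taken from the diff. It accepts a URL field when either filter string occurs inside the window and before the first space in that window.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Filter strings as they appear in the diff. */
static const char url1[] = "wikimediafoundation.org/";
static const char url2[] = "donate.wikimedia.org/";

/* Only this many leading bytes of the URL field are inspected
 * (the value of search_length in the diff). */
enum { SEARCH_LENGTH = 75 };

/*
 * Hypothetical helper, not part of lp-filter.c.
 * Returns 1 when a filter string occurs in the first SEARCH_LENGTH bytes
 * of `field` and before the first space in that window, otherwise 0.
 */
static int lp_filter_matches(const char *field)
{
    char window[SEARCH_LENGTH + 1];
    const char *hit, *space;

    assert(field != NULL);

    /* Copy the prefix. strncpy pads with NULs when `field` is shorter;
     * the explicit terminator covers the case where it is longer. */
    strncpy(window, field, SEARCH_LENGTH);
    window[SEARCH_LENGTH] = '\0';

    /* Same precedence as the diff: url1 is tried first, then url2. */
    hit = strstr(window, url1);
    if (!hit)
        hit = strstr(window, url2);
    if (!hit)
        return 0;

    /* Accept when there is no space in the window, or the match
     * starts before the first space. */
    space = strchr(window, ' ');
    return space == NULL || hit < space;
}
```

Under these assumptions, a field such as "http://donate.wikimedia.org/wiki/X" is accepted, while a field whose first space precedes the filter string is rejected.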

Follow-up revisions

Revision | Commit summary | Author | Date
r102811 | After confirming with metrics, made changes that filter out a whole lot of ga... | khorn | 20:07, 11 November 2011

Comments

Comment by Petrb, 11:52, 11 November 2011

Seems OK, although I am not sure the performance improvement is significant, given the multiple calls to strstr (you call it many more times than in the previous version).

Comment by Khorn (WMF), 20:21, 11 November 2011

Actually, with the data we usually get, it was about 4% faster than it had been with the single filter string, because the vast majority of lines got booted on the first pair of strstr checks. I ran it a bunch of times against a 5.4 million line log sample. Of course, the worst case would be a log where most lines are something we want to let through, but that's not what the data looks like.
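
The trade-off discussed in these comments can be made concrete with a small sketch that replays the branch structure of the new code and counts strstr calls per URL field. The function count_strstr_calls and the SEARCH_LENGTH constant are hypothetical instrumentation, not part of lp-filter.c, and the sketch says nothing about the measured 4% figure. It only shows that a rejected line costs two calls over at most 75 bytes, while an accepted line costs between two and five.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Filter strings and window size as they appear in the diff. */
static const char url1[] = "wikimediafoundation.org/";
static const char url2[] = "donate.wikimedia.org/";
enum { SEARCH_LENGTH = 75 };

/*
 * Hypothetical instrumentation, not part of lp-filter.c.
 * Follows the same branches as the new filter code and returns how many
 * strstr calls that code would make for the given URL field.
 */
static int count_strstr_calls(const char *field)
{
    char title[SEARCH_LENGTH + 1];
    const char *t, *u;
    int calls = 0;

    assert(field != NULL);
    strncpy(title, field, SEARCH_LENGTH);
    title[SEARCH_LENGTH] = '\0';

    /* First pair of checks; in the diff, || short-circuits after url1. */
    calls++;
    if (!strstr(title, url1)) {
        calls++;
        if (!strstr(title, url2))
            return calls;          /* rejected line: exactly 2 calls */
    }

    /* Look for the first space in the window. */
    calls++;
    u = strstr(title, " ");
    if (!u)
        return calls;              /* no space: line is printed */

    /* Locate the match again to compare it with the space position. */
    calls++;
    t = strstr(title, url1);
    if (!t) {
        calls++;
        t = strstr(title, url2);
    }
    (void)t;                       /* position compare is not a strstr call */
    return calls;
}
```

With this sketch, "http://en.wikipedia.org/wiki/Main_Page" yields 2, and "http://donate.wikimedia.org/ 200" yields 5, which is the most expensive path through the new code.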
