Wikipedia edit networks (tutorial)
The edit network associated with the history of Wikipedia pages is a network whose nodes are the page(s) and all contributing users, and whose edges encode time-stamped, typed, and weighted interaction events (edit events) between users and pages and between users and users. Specifically, an edit event encodes the exact time at which an edit was made, along with one or several of the following types of edit interaction:
- the amount of new text that a user adds to a page;
- the amount of text that a user deletes (along with the user or users who previously added this text);
- the amount of previously deleted text that a user restores (along with the users that previously deleted and the ones that originally added the text).
Together these edit events form a highly dynamic network revealing the emergent collaboration structure among contributing users. For instance, from the network one can derive
- who the users are that contributed most of the text;
- what the implicit roles of users are (e.g., contributors of new content, vandalism fighters, watchdogs);
- whether there are opinion groups, i.e., groups of users that mutually fight against each other's edits.
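As a rough illustration of the first and third question, the following sketch aggregates a handful of edit events. The events are hypothetical toy data, and the helper names words_added and mutual_deleters are invented for this example; treating mutual deletion as a sign of opposing users is a simplification, not the method of the cited papers.

```python
from collections import defaultdict

# Hypothetical toy events: (active_user, event_type, word_count, target).
# The target is the page for "added" events and a user otherwise.
events = [
    ("Alice", "added", 120, "Page"),
    ("Bob", "added", 40, "Page"),
    ("Bob", "deleted", 30, "Alice"),
    ("Alice", "deleted", 25, "Bob"),
    ("Carol", "restored", 30, "Alice"),
    ("Carol", "undeleted", 30, "Bob"),
]


def words_added(events):
    """Total number of words each user added to the page."""
    totals = defaultdict(int)
    for user, kind, words, _target in events:
        if kind == "added":
            totals[user] += words
    return dict(totals)


def mutual_deleters(events):
    """Unordered pairs of users who have deleted each other's text."""
    deleted = {(u, t) for u, kind, _w, t in events if kind == "deleted"}
    return {frozenset((u, t)) for (u, t) in deleted if (t, u) in deleted and u != t}


print(words_added(events))
print(mutual_deleters(events))
```

On real data the same aggregations would run over the edit event files described later in this tutorial.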
This tutorial is a practical how-to guide giving an example-based introduction to the computation, analysis, and visualization of Wikipedia edit networks. More background can be found in the papers cited in the references. To follow the steps outlined here (or to do a similar study) you should download WikiEvent, a small graphical Java program with which Wikipedia edit networks can be computed.
How to download the edit history?
Wikipedia not only provides access to the current version of each page but also to all of its previous versions. To view the page history in your browser you can just click on the history link on top of each page and browse through the versions. However, for automatic extraction of edit events we need to download the complete history in a more structured format. To do this there are various possibilities that are appropriate in different scenarios (and dependent on your computational resources and internet bandwidth).
To get the history of all pages you can go to the Wikimedia database dumps, select the wiki of interest (for instance, enwiki for the English-language Wikipedia), and download all files linked under the headline All pages with complete edit history. The complete database is extremely large (several terabytes for the English-language Wikipedia) and probably cannot be managed with an ordinary desktop computer.
Another possibility to get the complete history of a Wikipedia page (or of a small set of pages) is to use the wiki's Export page. (The preceding link is for the English-language Wikipedia; for other languages just change the language identifier en in the URL to, for instance, de, fr, or es.)
For instance, to download the history of the page Social network analysis, choose the settings shown in the screenshot above and click on the Export button. However, as noted on that page, exporting is limited to 1000 revisions, and the example page (Social network analysis) already has more than 2700 revisions. In principle it is possible to download the next 1000 revisions by specifying an appropriate offset (as explained on the manual page for Special:Export) and then pasting the files together. Since this is rather tedious, WikiEvent can do it automatically. (Internally WikiEvent proceeds exactly as described above, retrieving revisions in chunks of 1000 and appending them to a single output file.)
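The chunked download described above can be sketched as follows. This is only an illustration under assumptions, not WikiEvent's actual code: the form fields pages, offset, limit, and action are the ones we understand the manual page for Special:Export to describe (please check them there), and the helper names export_request and next_offset are made up for this example. The sketch only builds a request and picks the offset for the next chunk; it does not contact Wikipedia.

```python
import re
from urllib.parse import urlencode


def export_request(lang, title, offset="1", limit=1000):
    """Return (url, encoded_form) for one chunk of a page history.

    Assumption: Special:Export accepts the fields pages, offset, limit
    and action, and offset is a revision timestamp after which the
    chunk starts.
    """
    url = f"https://{lang}.wikipedia.org/wiki/Special:Export"
    form = {
        "pages": title,
        "offset": offset,
        "limit": str(limit),
        "action": "submit",
    }
    return url, urlencode(form)


def next_offset(chunk_xml):
    """Timestamp of the last revision in a chunk, or None if it has none."""
    stamps = re.findall(r"<timestamp>([^<]+)</timestamp>", chunk_xml)
    return stamps[-1] if stamps else None


url, body = export_request("en", "Social network analysis")
print(url)
print(body)
```

To fetch a full history one would send each request (the manual page suggests a POST), append the returned revisions to one file, and repeat with the value of next_offset until no further revisions come back.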
To download a page history with WikiEvent, start the program (download it from http://www.inf.uni-konstanz.de/algo/software/wikievent/ and execute it by double-clicking) and click on the entry download history in the net menu. You have to specify the language of the Wikipedia (for instance, en for English, de for German, fr for French, etc.), the title of the page to download, and a directory on your computer in which the file should be saved.
The program gives no feedback until the download is complete; for instance, you do not see a progress bar. The time it takes to download depends on many factors, among them the size of the page history (which might be several gigabytes for some popular pages!) and the bandwidth of your internet connection. At the end you see the number of downloaded revisions in the message area of WikiEvent.
For information: the size of the history file for the page Social network analysis was about 83 megabytes on July 20, 2012 (and it keeps growing). The history is saved in a file Social_network_analysis.xml in the directory that you have chosen. If you are interested, the XML format is described on the page http://meta.wikimedia.org/wiki/Help:Export, but you never have to read these files yourself since they are processed automatically as described below.
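If you do want to check a downloaded history file by hand, a short script suffices. The sketch below is an illustration under assumptions: it expects the usual export layout with revision elements containing id, timestamp, and contributor (username or ip), it ignores the XML namespace, and the helper names local and list_revisions are invented here. The embedded sample is a hand-written fragment modelled on the example page, not real downloaded data; the contributor id 0 and the namespace version are placeholders.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def list_revisions(source):
    """Yield (revision_id, timestamp, contributor) for each revision."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "revision":
            continue
        fields = {local(child.tag): child for child in elem}
        contributor = None
        if "contributor" in fields:
            for part in fields["contributor"]:
                if local(part.tag) in ("username", "ip"):
                    contributor = part.text
        yield fields["id"].text, fields["timestamp"].text, contributor
        elem.clear()  # free memory; history files can be very large


# Hand-written sample fragment (assumed layout, placeholder values).
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Social network analysis</title>
    <revision>
      <id>1711088</id>
      <timestamp>2003-09-23T21:08:52Z</timestamp>
      <contributor><ip>142.177.104.40</ip></contributor>
      <text>...</text>
    </revision>
    <revision>
      <id>2036847</id>
      <timestamp>2003-12-19T22:42:43Z</timestamp>
      <contributor><username>Davodd</username><id>0</id></contributor>
      <text>...</text>
    </revision>
  </page>
</mediawiki>"""

revisions = list(list_revisions(io.BytesIO(sample)))
print(len(revisions), "revisions")
```

For a real file, pass the file name (for instance Social_network_analysis.xml) to list_revisions instead of the in-memory sample.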
Computing the edit network
To compute the edit events from a Wikipedia history file select the entry extract edit events in the page-menu of WikiEvent. You have to specify one or more history file(s) and a directory to save the files with the edit events. (These output files have the same names as the input files - just with the ending .xml replaced by .csv.)
If we compute the edit events from the history of the page Social network analysis, then the first few lines of the edit event file look like this:
PageTitle;RevisionID;Time(calendar);Time(milliseconds);InteractionType;WordCount;ActiveUser;Target
"Social network analysis";1711088;2003-09-23T21:08:52Z;1064344132000;added;196;"142.177.104.40";"Social network analysis"
"Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;added;10;"63.228.105.175";"Social network analysis"
"Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;deleted;192;"63.228.105.175";"142.177.104.40"
"Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;added;54;"Davodd";"Social network analysis"
"Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;deleted;7;"Davodd";"63.228.105.175"
"Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;added;1;"210.49.82.219";"Social network analysis"
"Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;deleted;1;"210.49.82.219";"Davodd"
...
The file encodes a table with entries separated by semicolons (;). The columns from left to right encode
- The title of the page (since a history file can contain the history of several pages, the title field can vary).
- The revision id which is a number uniquely identifying a revision in Wikipedia (not just in one page). A single edit can produce more than just one line in the output file (we say more on this below); the revision id makes it possible to recognize which lines belong to the same edit.
- The time of the edit given as a date/time string. For instance, the first edit happened on September 23, 2003 at 21:08:52 (where time is measured in the UTC time zone).
- Once again the edit time, given as a number encoding milliseconds since January 1, 1970 at 00:00:00.000 Greenwich Mean Time. (This value is obtained by the method getTimeInMillis of the Java class Calendar.) The time in milliseconds is helpful if you just need the time difference between revisions and not the actual time or date; it is obviously easier to compute the time difference from numbers than from date/time strings.
- The edit type, which can be added, deleted, restored, or undeleted; we say more on this below.
- The word count, i.e., the number of words that are added, deleted, restored, or undeleted with respect to the given target.
- The active user is the user that has done the edit; it is the source node of the edit event. The user is identified by a user name if logged in; otherwise (if it is an anonymous edit) the user is identified by an IP address.
- The target node of the edit event is either the page or a user. If the event type is added, then the target is the page (the active user adds text to the page). If the event type is deleted, restored, or undeleted, then the target is the user who has previously written or deleted the text (the active user deletes/restores/undeletes text that has been added/deleted by the target user).
We say more on the different event types in the following.
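The column layout described above can be read with a few lines of code. The following is a minimal sketch, assuming the file is semicolon-separated with double-quoted text fields as in the sample rows; the helper names read_events, group_by_revision, and revision_gaps_seconds are invented for this example. It groups lines by revision id and computes the time between consecutive revisions from the millisecond column. (In the sample rows the millisecond values are one to two hours smaller than what the calendar strings give when read as UTC, which suggests that a local time zone was involved; the sketch therefore uses the millisecond column as given.)

```python
import csv
import io
from collections import defaultdict

# The sample rows shown above.
SAMPLE = '''PageTitle;RevisionID;Time(calendar);Time(milliseconds);InteractionType;WordCount;ActiveUser;Target
"Social network analysis";1711088;2003-09-23T21:08:52Z;1064344132000;added;196;"142.177.104.40";"Social network analysis"
"Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;added;10;"63.228.105.175";"Social network analysis"
"Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;deleted;192;"63.228.105.175";"142.177.104.40"
"Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;added;54;"Davodd";"Social network analysis"
"Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;deleted;7;"Davodd";"63.228.105.175"
"Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;added;1;"210.49.82.219";"Social network analysis"
"Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;deleted;1;"210.49.82.219";"Davodd"
'''


def read_events(handle):
    """Read edit events into a list of dicts with numeric fields converted."""
    events = []
    for row in csv.DictReader(handle, delimiter=";", quotechar='"'):
        for key in ("RevisionID", "Time(milliseconds)", "WordCount"):
            row[key] = int(row[key])
        events.append(row)
    return events


def group_by_revision(events):
    """Map each revision id to the list of events it produced."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["RevisionID"]].append(event)
    return dict(grouped)


def revision_gaps_seconds(events):
    """Seconds between consecutive revisions, from the millisecond column."""
    times = sorted({event["Time(milliseconds)"] for event in events})
    return [(later - earlier) // 1000 for earlier, later in zip(times, times[1:])]


events = read_events(io.StringIO(SAMPLE))
for revision, rows in group_by_revision(events).items():
    print(revision, [(r["InteractionType"], r["WordCount"]) for r in rows])
print(revision_gaps_seconds(events))
```

For a real file, replace io.StringIO(SAMPLE) by an opened file, for instance open("Social_network_analysis.csv", newline="", encoding="utf-8"); the encoding is an assumption and may need adjusting.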
The structure of edit network data
Consider an example of three revisions on one page where
- (in Revision 1) user Alice adds some new text to the page;
- subsequently (in Revision 2), user Bob deletes this text;
- then (in Revision 3), user Charlie reverts Bob's edit - setting back the page text to the one submitted in Revision 1.
These three edits together give rise to four dyadic edit events (shown in the image below):
- An edit event of type added from user Alice to the edited page.
- An edit event of type deleted from user Bob directed to user Alice.
- An edit event of type restored from user Charlie directed to user Alice (Charlie restored text that has been previously written by Alice).
- An edit event of type undeleted from user Charlie directed to user Bob (Charlie restored text that has been previously deleted by Bob). Note that after the revert the restored text is (again) authored by Alice and not by Charlie.
All edit events are weighted by the number of words that have been added, deleted, restored, or undeleted and all edit events have a time stamp marking the time when the edit has been submitted.
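The bookkeeping behind the Alice, Bob, and Charlie example can be sketched in a few lines. This is a deliberately simplified toy model, not the algorithm used by WikiEvent: text is treated as whole blocks with one author each, and the class ToyPage, the block id b1, the word count of 50, and the time stamps 1 to 3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditEvent:
    time: int      # time stamp of the revision
    kind: str      # added, deleted, restored, or undeleted
    words: int     # weight of the event
    source: str    # the active user
    target: str    # the page (for added) or another user


class ToyPage:
    """Tracks text blocks with their author and, if deleted, their deleter."""

    def __init__(self, title):
        self.title = title
        self.blocks = {}
        self.events = []

    def add(self, time, user, block_id, words):
        self.blocks[block_id] = {"author": user, "words": words, "deleted_by": None}
        self.events.append(EditEvent(time, "added", words, user, self.title))

    def delete(self, time, user, block_id):
        block = self.blocks[block_id]
        block["deleted_by"] = user
        self.events.append(EditEvent(time, "deleted", block["words"], user, block["author"]))

    def restore(self, time, user, block_id):
        block = self.blocks[block_id]
        deleter = block["deleted_by"]
        block["deleted_by"] = None  # visible again; the author stays unchanged
        self.events.append(EditEvent(time, "restored", block["words"], user, block["author"]))
        self.events.append(EditEvent(time, "undeleted", block["words"], user, deleter))


page = ToyPage("Example page")
page.add(1, "Alice", "b1", 50)     # Revision 1
page.delete(2, "Bob", "b1")        # Revision 2
page.restore(3, "Charlie", "b1")   # Revision 3
for event in page.events:
    print(event)
```

The three revisions yield the four events listed above, and the restored block is still recorded as authored by Alice.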
For determining the amount of text modified in an edit we make some choices. For instance, if complete sentences are just moved from one part of the page to another, we do not count this as any change. More detailed information about the text-processing conventions can be found in
- Ulrik Brandes, Patrick Kenis, Jürgen Lerner, and Denise van Raaij: Network Analysis of Collaboration Structure in Wikipedia. Proc. 18th Intl. World Wide Web Conference (WWW 2009).
and more technically in
- Ulrik Brandes, Patrick Kenis, Jürgen Lerner, and Denise van Raaij: Computing Wikipedia Edit Networks. Technical Report, 2009.
Analysis and visualization of edit networks
Statistical modeling of edit event networks
Computing simple edit events
The discussion network
References
Published papers that propose and/or make use of Wikipedia edit networks include the following.
- Jürgen Lerner, Ulrik Brandes, Patrick Kenis, and Denise van Raaij: Modeling Open, Web-based Collaboration Networks: The Case of Wikipedia. In Markus Gamper, Linda Reschke, Michael Schönhuth (Eds.): Knoten und Kanten 2.0, pp 141-162. transcript-Verlag, 2012.
- Jürgen Lerner, Patrick Kenis, Denise van Raaij and Ulrik Brandes: Will they stay or will they go? How network properties of WebICs predict dropout rates of valuable Wikipedians. European Management Journal, 29(5):404-413, 2011.
- Ulrik Brandes, Patrick Kenis, Jürgen Lerner, and Denise van Raaij: Network Analysis of Collaboration Structure in Wikipedia. Proc. 18th Intl. World Wide Web Conference (WWW 2009).
More technical details about the computation of Wikipedia edit networks can be found in
- Ulrik Brandes, Patrick Kenis, Jürgen Lerner, and Denise van Raaij: Computing Wikipedia Edit Networks. Technical Report, 2009.