Wikipedia edit networks (tutorial)


The edit network associated with the history of Wikipedia pages is a network whose nodes are the page(s) and all contributing users and whose edges encode time-stamped, typed, and weighted interaction events (edit events) between users and pages and between users and users. Specifically, each edit event encodes the exact time when an edit was made along with one or several of the following types of edit interaction:

  • the amount of new text that a user adds to a page;
  • the amount of text that a user deletes (along with the other user(s) who previously added this text);
  • the amount of previously deleted text that a user restores (along with the users who previously deleted it and the ones who originally added it).

Together these edit events form a highly dynamic network revealing the emergent collaboration structure among contributing users. For instance, one can derive

  • who the users are that contributed most of the text;
  • what the implicit roles of users are (e.g., contributors of new content, vandalism fighters, watchdogs);
  • whether there are opinion groups, i.e., groups of users that mutually fight against each other's edits.

This tutorial is a practically oriented "how-to" guide giving an example-based introduction to the computation, analysis, and visualization of Wikipedia edit networks. More background can be found in the papers cited in the references. To follow the steps outlined here (or to do a similar study) you should download WikiEvent - a small graphical Java program with which Wikipedia edit networks can be computed.

How to download the edit history?

Wikipedia not only provides access to the current version of each page but also to all of its previous versions. To view the page history in your browser you can just click on the history link on top of each page and browse through the versions. However, for automatic extraction of edit events we need to download the complete history in a more structured format. To do this, there are various possibilities that are appropriate in different scenarios (depending on your computational resources and internet bandwidth).

To get the history of all pages you can go to the Wikimedia database dumps, select the wiki of interest (for instance, enwiki for the English-language Wikipedia), and download all files linked under the headline All pages with complete edit history. The complete database is extremely large (several terabytes for the English-language Wikipedia) and probably cannot be managed with an ordinary desktop computer.

Another possibility to get the complete history of a Wikipedia page (or of a small set of pages) is to use the wiki's Export page. (The preceding link is for the English-language Wikipedia - for other languages just change the language identifier en in the URL to, for instance, de or fr or es, etc.)

[Screenshot: the Special:Export page with the settings for exporting the complete history of a page]

For instance, to download the history of the page Social network analysis, make the settings as in the screenshot above and click on the Export button. However, as noted on the page, exporting is limited to 1000 revisions, and the example page (Social network analysis) already has more than 2700 revisions. In principle it is possible to download the next 1000 revisions by specifying an appropriate offset (as explained on the manual page for Special:Export) and then pasting the files together. However, since this is rather tedious, the software WikiEvent offers the possibility to do this automatically. (Internally, WikiEvent proceeds exactly as described above, retrieving revisions in chunks of 1000 and appending them to a single output file.)
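
To illustrate the chunked retrieval that WikiEvent automates, here is a minimal sketch in Python. The parameter names (pages, offset, limit, action) are taken from the Special:Export documentation; treat them, and the exact offset semantics, as assumptions to be verified against the manual page linked above.

 # Sketch: fetch up to 1000 revisions of a page through Special:Export.
 # Parameter names follow the Special:Export documentation (assumption).
 import requests
 
 lang = "en"                            # Wikipedia language code
 title = "Social network analysis"      # title of the page to export
 
 resp = requests.post(
     f"https://{lang}.wikipedia.org/wiki/Special:Export",
     data={
         "pages": title,       # one page title per line
         "offset": "1",        # start with the oldest revision
         "limit": "1000",      # server-side maximum per request
         "action": "submit",
     },
 )
 resp.raise_for_status()
 
 with open(title.replace(" ", "_") + ".xml", "w", encoding="utf-8") as f:
     f.write(resp.text)
 
 # To get the next 1000 revisions, repeat the request with 'offset' set to the
 # timestamp of the last revision in the previous response and merge the
 # results - this chunked retrieval is exactly what WikiEvent automates.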

To download a page history with WikiEvent, start the program (download it from http://www.inf.uni-konstanz.de/algo/software/wikievent/ and run it by double-clicking) and click on the entry download history in the net menu. You have to specify the language of the Wikipedia (for instance, en for English, de for German, fr for French, etc.), the title of the page to download, and a directory on your computer in which the file should be saved.

[Screenshots: WikiEvent dialogs asking for the language code, the page title, and the download directory]

The program is actually very silent - for instance, you don't see a progress bar - until the download is complete. The time it takes to download depends on many factors, among them the size of the page history (which might be several gigabytes for some popular pages!) and the bandwidth of your internet connection. At the end you see the number of downloaded revisions in the message area of WikiEvent.

For information: the size of the history file for the page Social network analysis was about 83 megabytes on July 20, 2012 (and obviously growing). The history is saved in a file Social_network_analysis.xml in the directory that you have chosen. If you are interested, the XML format is described on the page http://meta.wikimedia.org/wiki/Help:Export - but you never have to read these files since they are processed automatically as described below.
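
If you want to inspect a downloaded file yourself - for instance, to compare the number of revisions with the count reported by WikiEvent - a small sketch like the following suffices; it simply counts the <revision> elements of the export XML (the tag name comes from the format description at Help:Export) and is only an illustration, not part of WikiEvent.

 # Sketch: count the revisions in a downloaded history file by counting the
 # <revision> elements of the export XML (tag name per Help:Export).
 count = 0
 with open("Social_network_analysis.xml", encoding="utf-8") as f:
     for line in f:
         count += line.count("<revision>")
 print(count, "revisions")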

Computing the edit network

To compute the edit events from a Wikipedia history file select the entry extract edit events in the page-menu of WikiEvent. You have to specify one or more history file(s) and a directory to save the files with the edit events. (These output files have the same names as the input files - just with the ending .xml replaced by .csv.)

[Screenshot: the WikiEvent dialog for extracting edit events from history files]

If we compute the edit events from the history of the page Social network analysis, then the first few lines of the edit event file look like this:

 PageTitle;RevisionID;Time(calendar);Time(milliseconds);InteractionType;WordCount;ActiveUser;Target
 "Social network analysis";1711088;2003-09-23T21:08:52Z;1064344132000;added;196;"142.177.104.40";"Social network analysis"
 "Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;added;10;"63.228.105.175";"Social network analysis"
 "Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;deleted;192;"63.228.105.175";"142.177.104.40"
 "Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;added;54;"Davodd";"Social network analysis"
 "Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;deleted;7;"Davodd";"63.228.105.175"
 "Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;added;1;"210.49.82.219";"Social network analysis"
 "Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;deleted;1;"210.49.82.219";"Davodd"
 ...

The file encodes a table with entries separated by semicolons (;). The columns from left to right encode

  • The title of the page (since a history file can contain the history of several pages, the title field can actually vary).
  • The revision ID, which is a number uniquely identifying a revision in Wikipedia (not just in one page). A single edit can produce more than one line in the output file (we say more on this below).
  • The time of the edit given as a date/time string. For instance, the first edit happened on September 23, 2003 at 21:08:52 (where time is measured in the UTC time zone).
  • Once again the edit time, given as a number encoding milliseconds since January 1, 1970 at 00:00:00.000 Greenwich Mean Time. (This value is actually obtained by the method getTimeInMillis of the Java class Calendar.) The time in milliseconds is helpful if you just need the time difference between revisions and not the actual time or date; the sketch after this list shows how to convert it back to a date.
  • The edit type, which can be added, deleted, restored, or undeleted; we say more on this below.
  • The word count, i.e., the number of words that are added, deleted, restored, or undeleted with respect to the given target.
  • The active user, i.e., the name (or IP address) of the user who performed the edit.
  • The target of the event: for added events this is the page itself; for the other event types it is the user whose earlier edit the event refers to (for instance, the user who had added the text that is now deleted).
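
Since the edit events are stored in a plain semicolon-separated file, they can be processed with any spreadsheet or scripting tool. As an illustration (not part of WikiEvent), the following Python sketch reads the example file from above, converts the millisecond timestamps back to dates, and sums up the number of words each user added - one way to see who contributed most of the text. The file name and column names are the ones shown in the listing.

 # Sketch: words added per user, read from a WikiEvent edit event file.
 # File name and column names are taken from the example listing above.
 import csv
 from collections import Counter
 from datetime import datetime, timezone
 
 added = Counter()          # words added per user
 first_edit = None          # time of the earliest edit event
 
 with open("Social_network_analysis.csv", newline="", encoding="utf-8") as f:
     for row in csv.DictReader(f, delimiter=";"):
         # the millisecond column converts back to a UTC date/time like this
         when = datetime.fromtimestamp(int(row["Time(milliseconds)"]) / 1000,
                                       tz=timezone.utc)
         if first_edit is None or when < first_edit:
             first_edit = when
         if row["InteractionType"] == "added":
             added[row["ActiveUser"]] += int(row["WordCount"])
 
 print("first edit event:", first_edit)
 for user, words in added.most_common(10):
     print(f"{words:8d} words added by {user}")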

The structure of edit network data

Analysis and visualization of edit networks

Statistical modeling of edit event networks

Computing simple edit events

The discussion network

References

Published papers that propose and/or make use of Wikipedia edit networks include the following.

More technical details about the computation of Wikipedia edit networks can be found in