Wikipedia edit networks (tutorial): Difference between revisions

From visone manual
Jump to navigation Jump to search
Line 104: Line 104:
After these five components have been chosen visone needs some information about the interpretation of time. (visone can handle very general date/time formats - but some information is necessary.) The first choice is the selection between '''numeric time''' (if the time fields correspond to integer numbers) or '''calendar time''' (if time fields can somehow, specified below, be turned into a date/time). We have calendar time in our example.
After these five components have been chosen visone needs some information about the interpretation of time. (visone can handle very general date/time formats - but some information is necessary.) The first choice is the selection between '''numeric time''' (if the time fields correspond to integer numbers) or '''calendar time''' (if time fields can somehow, specified below, be turned into a date/time). We have calendar time in our example.


If time is given by calendar, a '''time format pattern''' has to be specified. visone proposes some known pattern - among others the pattern '''yyyy-MM-dd'T'HH:mm:ss'Z'''' which is appropriate for the Wikipedia edit times. You can enter other than the proposed patterns in the textfield if date/time is formatted differently (see the webpage on the java class [http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html SimpleDateFormat] for guidance).
If time is given by calendar, a '''time format pattern''' has to be specified. visone proposes some known pattern - among others the pattern '''yyyy-MM-dd'T'HH:mm:ss'Z'''' which is appropriate for the Wikipedia edit times. You can enter other than the proposed patterns in the textfield if date/time is formatted differently (see the webpage on the java class [http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html SimpleDateFormat] for guidance). visone assists you in finding the right pattern by showing some date/time strings as they appear in the file and - whenever you select a date format pattern - the dialog shows you the current time formatted by the specified pattern.


== Analysis and visualization of edit networks ==
== Analysis and visualization of edit networks ==

Revision as of 15:17, 23 July 2012

The edit network associated with the history of Wikipedia pages is a network whose nodes are the page(s) and all contributing users and whose edges encode time-stamped, typed, and weighted interaction events (edit events) between users and pages and between users and users. Specifically, edit events encode the exact time when an edit has been done along with one or several of the following types of edit interaction:

  • the amount of new text that a user adds to a page;
  • the amount of text that a user deletes (along with the other user/s that has/have previously added this text);
  • the amount of previously deleted text that a user restores (along with the users that previously deleted and the ones that originally added the text).

Together these edit events form a highly dynamic network revealing the emergent collaboration structure among contributing users. For instance, it can be derived

  • who are the users that contributed most of the text;
  • what are the implicit roles of users (e.g., contributors of new content, vanalism fighters, watchdogs);
  • whether there are opinion groups, i.e., groups of users that mutually fight against each others edits.

This tutorial is a practically oriented "how-to"-guide giving an example based introduction to the computation, analysis, and visualization of Wikipedia edit networks. More background can be found in the papers cited in the references. To follow the steps outlined here (or to do a similar study) you should download WikiEvent - a small graphical java software with which the Wikipedia edit networks can be computed.

How to download the edit history?

Wikipedia not only provides access to the current version of each page but also to all of its previous versions. To view the page history in your browser you can just click on the history link on top of each page and browse through the versions. However, for automatic extraction of edit events we need to download the complete history in a more structured format. To do this there are various possibilities that are appropriate in different scenarios (and dependent on your computational resources and internet bandwidth).

To get the history of all pages you can go to the Wikimedia database dumps, select the wiki of interest (for instance, enwiki for the English-language Wikipedia), and download all files linked under the headline All pages with complete edit history. The complete database is extremely large (several terabytes for the English-language Wikipedia) and probably cannot be managed with an ordinary desktop computer.

Another possibility to get the complete history of a Wikipedia page (or of a small set of pages) is to use the wiki's Export page. (The preceeding link is for the English-language Wikipedia - for other languages just change the language identifier en in the URL to, for instance, de or fr or es, etc.)

Wikipedia export pages.png

For instance, to download the history of the page Social network analysis make settings as in the screenshot above and click on the Export button. However, as it is noted on the page, exporting is limited to 1000 revisions and the example page (Social network analysis) has already more than 2700 revisions. In principle it is possible to download the next 1000 revisions by specifying an appropriate offset (as explained on the manual page for Special:Export) and then pasting the files together. However, since this is rather tedious the software WikiEvent offers a possibility to do this automatically. (Internally WikiEvent proceeds exactly as described above by retrieving revisions in chunks of 1000 and appending these to a single output file.)

To download a page history with WikiEvent you start the program (download it from http://www.inf.uni-konstanz.de/algo/software/wikievent/ and execute by double-clicking) and click on the entry download history in the net menu. You have to specify the language of the Wikipedia (for instance, en for English, de for German, fr for French, etc), the title of the page to download and a directory on your computer in which the file should be saved.

Wikipedia lang code.png Wikipedia page title.png Wikipedia download sna.png

The program is actually very silent - for instance, you don't see a progres bar - until the download is complete. The time it takes to download depends on many factors, among them the size of the page history (which might be several gigabytes for some popular pages!) and the bandwidth of your internet connection. At the end you see the number of downloaded revisions in the message area of WikiEvent.

For information: the size of the history file for the page Social network analysis is about 83 Megabytes on July 20, 2012 (obviously growing). The history is saved in a file Social_network_analysis.xml in the directory that you have chosen. If you are interested, the XML format is described in the page http://meta.wikimedia.org/wiki/Help:Export - but you never have to read these files since they are automatically processed as described below.

Computing the edit network

To compute the edit events from a Wikipedia history file select the entry extract edit events in the page-menu of WikiEvent. You have to specify one or more history file(s) and a directory to save the files with the edit events. (These output files have the same names as the input files - just with the ending .xml replaced by .csv.)

Wikipedia extract edit events.png

If we compute the edit events from the history of the page Social network analysis, then the first few lines of the edit event file look like this:

 PageTitle;RevisionID;Time(calendar);Time(milliseconds);InteractionType;WordCount;ActiveUser;Target
 "Social network analysis";1711088;2003-09-23T21:08:52Z;1064344132000;added;196;"142.177.104.40";"Social network analysis"
 "Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;added;10;"63.228.105.175";"Social network analysis"
 "Social network analysis";2002109;2003-11-11T06:13:44Z;1068527624000;deleted;192;"63.228.105.175";"142.177.104.40"
 "Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;added;54;"Davodd";"Social network analysis"
 "Social network analysis";2036847;2003-12-19T22:42:43Z;1071870163000;deleted;7;"Davodd";"63.228.105.175"
 "Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;added;1;"210.49.82.219";"Social network analysis"
 "Social network analysis";2210638;2003-12-24T13:29:11Z;1072268951000;deleted;1;"210.49.82.219";"Davodd"
 ...

The file encodes a table with entries separated by semicolons (;). The columns from left to right encode

  • The title of the page (since a history file can contain the history of several pages the title-field can actually vary.).
  • The revision id which is a number uniquely identifying a revision in Wikipedia (not just in one page). A single edit can produce more than just one line in the output file (we say more on this below); the revision id makes it possible to recognize which lines belong to the same edit.
  • The time of the edit given as a date/time-string. For instance the first edit happend on September 23, 2003 at 21:08:52 (where time is measured in the UTC time zone).
  • Once again the edit time given as a number encoding milliseconds since January 1, 1970 at 00:00:00.000 Greenwich Mean Time. (This value is actually obtained by the method getTimeInMillis of the java class Calendar.) The time in milliseconds is helpful if you just need the time difference between revisions and not the actual time or date; it is obvisously easier to compute the time difference from numbers than from data/time strings.
  • The edit type which can be added, deleted, restored, or undeleted; we say more on this below.
  • The word count, i.e., the number of words that are added, deleted, restored, or undeleted with respect to the given target.
  • The active user is the user that has done the edit; it is the source node of the edit event. The user is identified by a user name if logged in; otherwise (if it is an anonymous edit) the user is identified by an IP address.
  • The target node of the edit event is either the page or a user. If the event type is added, then the target is the page (the active user adds text to the page). If the event type is deleted, restored, or undeleted, then the target is the user who has previously written or deleted the text (the active user deletes/restores/undeletes text that has been added/deleted by the target user).

We say more on the different event types in the following.

The structure of edit network data

Consider an example of three revisions on one page where

  • (in Revision 1) user Alice adds some new text to the page;
  • subsequently (in Revision 2), user Bob deletes this text;
  • then (in Revision 3), user Charlie reverts Bob's edit - setting back the page text to the one submitted in Revision 1.

These three edits together give rise to four dyadic edit events (shown in the image below):

Edit network example.png

  • An edit event of type added from user Alice to the edited page.
  • An edit event of type deleted from user Bob directed to user Alice.
  • An edit event of type restored from user Charlie directed to user Alice (Charlie restored text that has been previously written by Alice).
  • An edit event of type undeleted from user Charlie directed to user Bob (Charlie restored text that has been previously deleted by Bob). Note that after the revert the restored text is (again) authored by Alice and not by Charlie.

All edit events are weighted by the number of words that have been added, deleted, restored, or undeleted and all edit events have a time stamp marking the time when the edit has been submitted.

A single edit on a Wikipedia page generates one hyperedge linking the active user to (potentially) several other users and the edited page. Such a hyperedge has been turned into several lines of the CSV file, each encoding one (dyadic) edge, linking the source (active user) to one target in one interaction type. Note that the hyperedges can be reconstructed from the data using the revision ids (see above).

For determining the amount of text modified in an edit we make some choices. For instance, if complete sentences are just moved from one part of the page to another, we do not count this as any change. More detailed information about the text-processing conventions can be found in

and more technically in

Importing edit event networks into visone

Note: this section describes a fuctionality that will be in the next visone release (around September 2012).

The CSV file with the computed edit events can be imported in visone when opening it as an event list file. Visone's capabilities in importing and analyzing event networks are documented in the tutorial on event networks; here we treat the special case of edit event networks.

To open an event list file (such as the newly created Social_network_analysis.csv) click on open in visone's file menu, select files of type: event list files, navigate to the CSV file, and click on ok. In the import options, choose the semicolon (;) as a cell delimiter and double quotes (")as textframe.

File open eventlist.png Import options event list.png

visone can now read the various entries of the input file - and you have to specify how these should be mapped to the resulting networks. Concretely you have to specify how the various components of an event are encoded in the file (Event format tab); how to iterate over the network sequence (Event iterator tab); how the events are mapped to the network's link attributes (Event network tab); and, if desired, which statistics should be computed while constructing the event network (Eventnet statistics tab). The tabs should be filled out in the order as they are numbered in the dialog since choice-possibilities for the latter tabs depend on previous settings. If you make changes in some tab you have to subsequently set (again) the values for the latter tabs.

Eventnet dialog format.png

In the event format tab (see the image above) you first have to specify which columns of the input file hold the information about the five components of an event (source, target, time, type, and weight). You can set the values as in the image above.

After these five components have been chosen visone needs some information about the interpretation of time. (visone can handle very general date/time formats - but some information is necessary.) The first choice is the selection between numeric time (if the time fields correspond to integer numbers) or calendar time (if time fields can somehow, specified below, be turned into a date/time). We have calendar time in our example.

If time is given by calendar, a time format pattern has to be specified. visone proposes some known pattern - among others the pattern yyyy-MM-dd'T'HH:mm:ss'Z' which is appropriate for the Wikipedia edit times. You can enter other than the proposed patterns in the textfield if date/time is formatted differently (see the webpage on the java class SimpleDateFormat for guidance). visone assists you in finding the right pattern by showing some date/time strings as they appear in the file and - whenever you select a date format pattern - the dialog shows you the current time formatted by the specified pattern.

Analysis and visualization of edit networks

Statistical modeling of edit event networks

Computing simple edit events

The discussion network

References

Published papers that propose and/or make use of Wikipedia edit networks include the following.

More technical details about the computation of Wikipedia edit networks can be found in