Wikipedia edit networks (tutorial)

From visone manual



Revision as of 09:10, 20 July 2012

The edit network associated with the history of one or more Wikipedia pages is a network whose nodes are the page(s) and all contributing users, and whose edges encode time-stamped, typed, and weighted interaction events (edit events) between users and pages and between users and users. Specifically, an edit event records the exact time at which an edit was made, along with one or several of the following types of edit interaction:

  • the amount of new text that a user adds to a page;
  • the amount of text that a user deletes (along with the user or users who previously added this text);
  • the amount of previously deleted text that a user restores (along with the users who previously deleted it and the users who originally added it).
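The three interaction types listed above can be pictured as one record per edit event. The following is only a minimal sketch of such a record; the class names, field names, and the choice of unit for the amount are assumptions made for illustration and are not WikiEvent's actual data format.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Tuple


class EditType(Enum):
    """The three interaction types named in the list above."""
    ADD = "add"          # a user adds new text to a page
    DELETE = "delete"    # a user deletes text written by others
    RESTORE = "restore"  # a user restores previously deleted text


@dataclass(frozen=True)
class EditEvent:
    """Hypothetical record for one time-stamped, typed, weighted edit event."""
    time: datetime                 # exact time of the edit
    actor: str                     # user who made the edit
    page: str                      # page that was edited
    kind: EditType                 # type of interaction
    amount: int                    # weight, e.g. number of words (assumed unit)
    affected_users: Tuple[str, ...] = ()  # earlier authors/deleters, if any


# Example: Bob deletes 30 words that Alice had previously added.
example = EditEvent(
    time=datetime(2012, 7, 20, 9, 10),
    actor="Bob",
    page="Social network analysis",
    kind=EditType.DELETE,
    amount=30,
    affected_users=("Alice",),
)
```

A user-to-page edge is described by `actor` and `page`, while each entry in `affected_users` gives rise to a user-to-user edge.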

Together these edit events form a highly dynamic network that reveals the emergent collaboration structure among contributing users. For instance, one can derive

  • which users contributed most of the text;
  • what the implicit roles of users are (e.g., contributors of new content, vandalism fighters, watchdogs);
  • whether there are opinion groups, i.e., groups of users that mutually fight against each other's edits.
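As a rough illustration of the first and last question, the sketch below aggregates a handful of made-up edit events. The tuple layout, the user names, and the notion of a "mutual deletion pair" are assumptions chosen for this example; the measures used in the cited papers may differ.

```python
from collections import Counter

# Hypothetical events: (actor, kind, amount, affected_user)
events = [
    ("alice", "add", 120, None),
    ("bob", "add", 40, None),
    ("bob", "delete", 30, "alice"),    # bob deletes text written by alice
    ("alice", "delete", 10, "bob"),    # alice deletes text written by bob
    ("carol", "add", 15, None),
]


def text_added_per_user(events):
    """Sum the amount of newly added text for each user."""
    totals = Counter()
    for actor, kind, amount, _ in events:
        if kind == "add":
            totals[actor] += amount
    return totals


def mutual_deletion_pairs(events):
    """Return pairs of distinct users who have each deleted the other's text."""
    deleted = {
        (actor, target)
        for actor, kind, _, target in events
        if kind == "delete" and target is not None
    }
    return {
        frozenset((a, b))
        for a, b in deleted
        if a != b and (b, a) in deleted
    }


top_contributors = text_added_per_user(events).most_common(1)
opponents = mutual_deletion_pairs(events)
```

Here `top_contributors` holds the user with the most added text, and `opponents` holds the pairs that could serve as a starting point for detecting opinion groups.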

This tutorial is a practically oriented how-to guide that gives an example-based introduction to the computation, analysis, and visualization of Wikipedia edit networks. More background can be found in the papers cited in the references. To follow the steps outlined here (or to do a similar study) you should download WikiEvent, a small graphical Java program with which Wikipedia edit networks can be computed.

How to download the edit history?

Wikipedia provides access not only to the current version of each page but also to all of its previous versions. To view the page history in your browser, click on the history link at the top of a page and browse through the versions. For automatic extraction of edit events, however, we need to download the complete history in a more structured format. There are several ways to do this; which one is appropriate depends on the scenario and on your computational resources and internet bandwidth.

To get the history of all pages, go to the Wikimedia database dumps (http://dumps.wikimedia.org/backup-index.html), select the wiki of interest (for instance, enwiki for the English-language Wikipedia), and download all files linked under the headline "All pages with complete edit history". The complete database is extremely large (several terabytes of data) and certainly cannot be managed with an ordinary desktop computer.

Another way to get the complete history of a single Wikipedia page (or of a small set of pages) is to use the wiki's Export page (http://en.wikipedia.org/wiki/Special:Export). This link is for the English-language Wikipedia; for other languages, change the language identifier en to, for instance, de, fr, or es.
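The language switch described above amounts to replacing the subdomain of the Export page address. The helper below is a small sketch of that substitution; the function name is made up for illustration, and only the URL pattern itself comes from the link given in this tutorial.

```python
def export_page_url(language: str = "en") -> str:
    """Build the address of the Export page for a given language edition.

    `language` is the language identifier used as subdomain,
    e.g. "en", "de", "fr", or "es".
    """
    return f"http://{language}.wikipedia.org/wiki/Special:Export"


# Export pages for a few language editions.
urls = {lang: export_page_url(lang) for lang in ("en", "de", "fr", "es")}
```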

[[File:Wikipedia_export_pages.png|400px]]

For instance, to download the history of the page Social network analysis, choose the settings shown in the screenshot above and click on the Export button. However, as noted on the Export page, exporting is limited to 1000 revisions, and the example page (Social network analysis) already has more than 2700 revisions. In principle, it is possible to download the next 1000 revisions by specifying an appropriate offset (as explained on pages linked from the Export page) and then pasting the files together. Since this is rather tedious, the software WikiEvent can do it automatically. (Internally, WikiEvent proceeds exactly as described above: it retrieves revisions in chunks of 1000 and appends them to a single output file.)
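The chunked retrieval described above can be sketched as a simple loop. This is not WikiEvent's code: the function names are hypothetical, and the network request is replaced by a stand-in fetcher so that the example runs offline. It also assumes that the offset is the timestamp of the last revision already received; check the pages linked from the Export page for the exact request parameters.

```python
from typing import Callable, List, Optional, Tuple

Revision = Tuple[str, str]  # (timestamp, text); a deliberately minimal stand-in


def fetch_all_revisions(
    fetch_chunk: Callable[[Optional[str], int], List[Revision]],
    chunk_size: int = 1000,
) -> List[Revision]:
    """Collect the full history by requesting chunks of at most chunk_size."""
    revisions: List[Revision] = []
    offset: Optional[str] = None  # None means: start at the oldest revision
    while True:
        chunk = fetch_chunk(offset, chunk_size)
        revisions.extend(chunk)          # append the chunk to the single output
        if len(chunk) < chunk_size:      # a short chunk means the end is reached
            break
        offset = chunk[-1][0]            # continue after the last timestamp seen
    return revisions


def make_fake_fetcher(total: int):
    """Simulate a server holding `total` revisions (for demonstration only)."""
    history = [
        (f"2012-01-01T00:{i // 60:02d}:{i % 60:02d}Z", f"rev {i}")
        for i in range(total)
    ]
    calls: List[Optional[str]] = []

    def fetch_chunk(offset: Optional[str], limit: int) -> List[Revision]:
        calls.append(offset)
        newer = [rev for rev in history if offset is None or rev[0] > offset]
        return newer[:limit]

    return fetch_chunk, calls


fetch, calls = make_fake_fetcher(2700)
all_revisions = fetch_all_revisions(fetch)
print(len(all_revisions), "revisions in", len(calls), "requests")
```

With 2700 simulated revisions the loop issues three requests (1000, 1000, and 700 revisions). In a real download, `fetch_chunk` would send the request to the Export page and parse the returned XML.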

Computing the edit network

The structure of edit network data

Analysis and visualization of edit networks

Statistical modeling of edit event networks

Computing simple edit events

The discussion network

References

Published papers that propose and/or make use of Wikipedia edit networks include the following.

More technical details about the computation of Wikipedia edit networks can be found in