Wikipedia edit networks (tutorial)
Revision as of 09:41, 20 July 2012
The edit network associated with the history of Wikipedia pages is a network whose nodes are the page(s) and all contributing users, and whose edges encode time-stamped, typed, and weighted interaction events (edit events) between users and pages and between users and users. Specifically, each edit event records the exact time at which an edit was made along with one or several of the following types of edit interaction:
- the amount of new text that a user adds to a page;
- the amount of text that a user deletes (along with the user or users who previously added this text);
- the amount of previously deleted text that a user restores (along with the users that previously deleted and the ones that originally added the text).
Together these edit events form a highly dynamic network revealing the emergent collaboration structure among contributing users. For instance, one can derive
- which users contributed most of the text;
- what the implicit roles of users are (e.g., contributors of new content, vandalism fighters, watchdogs);
- whether there are opinion groups, i.e., groups of users who mutually fight against each other's edits.
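To make the description above concrete, here is a minimal sketch of how edit events could be represented and aggregated. The class `EditEvent`, its field names, the two helper functions, and the sample events are hypothetical illustrations; they are not the data format that WikiEvent actually produces.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EditEvent:
    """One time-stamped, typed, and weighted edit event (hypothetical layout)."""
    time: datetime                 # exact time of the edit
    user: str                      # user who made the edit
    page: str                      # page that was edited
    kind: str                      # "add", "delete", or "restore"
    amount: int                    # amount of text affected (the unit is an assumption)
    authors: Tuple[str, ...] = ()  # users who originally added the affected text
    deleters: Tuple[str, ...] = () # users who previously deleted it (restore only)


def added_text_per_user(events: List[EditEvent]) -> Dict[str, int]:
    """Sum the amount of new text per user (who contributed most of the text?)."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        if event.kind == "add":
            totals[event.user] += event.amount
    return dict(totals)


def deletion_weights(events: List[EditEvent]) -> Dict[Tuple[str, str], int]:
    """Sum deleted text per (deleting user, original author) pair."""
    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    for event in events:
        if event.kind == "delete":
            for author in event.authors:
                weights[(event.user, author)] += event.amount
    return dict(weights)


# Made-up sample events, for illustration only.
PAGE = "Social network analysis"
events = [
    EditEvent(datetime(2012, 7, 1, 10, 0), "alice", PAGE, "add", 120),
    EditEvent(datetime(2012, 7, 2, 11, 0), "bob", PAGE, "delete", 30,
              authors=("alice",)),
    EditEvent(datetime(2012, 7, 3, 12, 0), "alice", PAGE, "restore", 30,
              authors=("alice",), deleters=("bob",)),
    EditEvent(datetime(2012, 7, 4, 13, 0), "bob", PAGE, "add", 50),
]

print(added_text_per_user(events))
print(deletion_weights(events))
```

User pairs with large deletion weights in both directions would be natural candidates for the opposing opinion groups mentioned above.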
This tutorial is a practically oriented how-to guide giving an example-based introduction to the computation, analysis, and visualization of Wikipedia edit networks. More background can be found in the papers cited in the references. To follow the steps outlined here (or to do a similar study) you should download WikiEvent, a small graphical Java program with which Wikipedia edit networks can be computed.
How to download the edit history?
Wikipedia not only provides access to the current version of each page but also to all of its previous versions. To view the page history in your browser you can just click on the history link at the top of each page and browse through the versions. However, for automatic extraction of edit events we need to download the complete history in a more structured format. There are various ways to do this, each appropriate in different scenarios (depending on your computational resources and internet bandwidth).
To get the history of all pages you can go to the Wikimedia database dumps, select the wiki of interest (for instance, enwiki for the English-language Wikipedia), and download all files linked under the headline All pages with complete edit history. The complete database is extremely large (several terabytes for the English-language Wikipedia) and probably cannot be managed with an ordinary desktop computer.
Another way to get the complete history of a Wikipedia page (or of a small set of pages) is to use the wiki's Export page. (The preceding link is for the English-language Wikipedia - for other languages just change the language identifier en to, for instance, de, fr, or es.)
For instance, to download the history of the page Social network analysis, apply the settings shown in the screenshot above and click on the Export button. However, as noted on the page, exporting is limited to 1000 revisions, and the example page (Social network analysis) already has more than 2700 revisions. In principle it is possible to download the next 1000 revisions by specifying an appropriate offset (as explained on pages linked from the Export page) and then pasting the files together. Since this is rather tedious, the software WikiEvent can do it automatically. (Internally WikiEvent proceeds exactly as described above, retrieving revisions in chunks of 1000 and appending them to a single output file.)
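The chunked retrieval described above can be sketched as follows. This is not WikiEvent's code. The form fields `pages`, `offset`, `limit`, and `action=submit` follow the MediaWiki documentation for Special:Export as far as we know, and the sketch assumes that the offset is a timestamp and that only revisions after it are returned; please check both against the pages linked from the Export page. The names `export_url`, `http_fetch`, and `download_chunks` are made up for this example, and `http_fetch` has not been tested against the live site.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

CHUNK_SIZE = 1000  # revision limit per export request, as stated above


def export_url(lang: str) -> str:
    """Export page of the Wikipedia with the given language identifier."""
    return "http://%s.wikipedia.org/wiki/Special:Export" % lang


def revision_timestamps(xml_text: str) -> List[str]:
    """Return the timestamps of all <revision> elements in one export chunk."""
    stamps = []
    for elem in ET.fromstring(xml_text).iter():
        # Ignore the XML namespace, which differs between export versions.
        if elem.tag.rsplit("}", 1)[-1] == "revision":
            for child in elem:
                if child.tag.rsplit("}", 1)[-1] == "timestamp":
                    stamps.append(child.text)
    return stamps


def http_fetch(lang: str) -> Callable[[Dict[str, str]], str]:
    """Build a fetch function that POSTs the form fields to the Export page.

    Assumption: the server accepts these fields in a plain form POST.
    """
    def fetch(params: Dict[str, str]) -> str:
        data = urllib.parse.urlencode(params).encode("utf-8")
        request = urllib.request.Request(
            export_url(lang), data=data,
            headers={"User-Agent": "edit-history-sketch/0.1"})
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")
    return fetch


def download_chunks(title: str,
                    fetch: Callable[[Dict[str, str]], str],
                    chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Request chunk after chunk until a chunk comes back short or empty."""
    offset = "1"  # assumed to lie before the earliest revision
    chunks = []
    while True:
        params = {"pages": title, "offset": offset,
                  "limit": str(chunk_size), "action": "submit"}
        chunk = fetch(params)
        stamps = revision_timestamps(chunk)
        if not stamps:
            break
        chunks.append(chunk)
        if len(stamps) < chunk_size:
            break
        offset = stamps[-1]  # continue after the last revision received
    return chunks
```

A call such as `download_chunks("Social network analysis", http_fetch("en"))` would return a list of XML strings. Merging them into one well-formed file additionally requires dropping the repeated wrapper elements, which this sketch leaves out. Passing the fetch function as a parameter keeps the chunking logic testable without network access.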
To download a page history with WikiEvent, start the program (download it from http://www.inf.uni-konstanz.de/algo/software/wikievent/ and run it by double-clicking) and click on the entry download history in the net menu. You have to specify the language of the Wikipedia (for instance, en for English, de for German, fr for French, etc.), the title of the page to download, and a directory on your computer in which the file should be saved.
The program gives no feedback (for instance, there is no progress bar) until the download is complete. The time it takes to download depends on many factors, among them the size of the page history (which might be several gigabytes for some popular pages!) and the bandwidth of your internet connection. At the end you see the number of downloaded revisions in the message area.
For reference: the history file for the page Social network analysis was about 83 megabytes as of July 20, 2012 (and it keeps growing). The history is saved in a file Social_network_analysis.xml in the directory that you have chosen. If you are interested, the XML format is described on the page http://meta.wikimedia.org/wiki/Help:Export - but you never have to read these files yourself, since they are processed automatically as described below.
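If you nevertheless want to peek into a downloaded history file, the following minimal sketch reads it as a stream, so that a file of many megabytes is never loaded into memory at once. It assumes the usual export layout in which each revision element contains a timestamp and a contributor with either a username or an ip. The function `revisions` and the tiny `SAMPLE` document are made up for illustration, and this is not how WikiEvent processes the file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace: '{http://...}revision' becomes 'revision'."""
    return tag.rsplit("}", 1)[-1]


def revisions(source):
    """Yield (timestamp, contributor) for each <revision>, reading as a stream.

    `source` may be a file name such as 'Social_network_analysis.xml'
    or an open binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "revision":
            continue
        timestamp, contributor = None, None
        for child in elem:
            name = local_name(child.tag)
            if name == "timestamp":
                timestamp = child.text
            elif name == "contributor":
                for field in child:
                    # Registered users have a username, anonymous ones an ip.
                    if local_name(field.tag) in ("username", "ip"):
                        contributor = field.text
        yield timestamp, contributor
        elem.clear()  # drop the revision text once it has been handled


# Made-up miniature export file, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Social network analysis</title>
    <revision>
      <timestamp>2012-07-19T10:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
      <text>first version</text>
    </revision>
    <revision>
      <timestamp>2012-07-20T09:41:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text>second version</text>
    </revision>
  </page>
</mediawiki>"""

# Count revisions per contributor in the sample.
edits_per_user = Counter(user for _time, user in revisions(io.BytesIO(SAMPLE)))
print(edits_per_user)
```

Replacing `io.BytesIO(SAMPLE)` with the path of a downloaded file should give the number of revisions per contributor for that page, provided the file follows the layout assumed here.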
Computing the edit network
The structure of edit network data
Analysis and visualization of edit networks
Statistical modeling of edit event networks
Computing simple edit events
The discussion network
References
Published papers that propose and/or make use of Wikipedia edit networks include the following.
- Jürgen Lerner, Ulrik Brandes, Patrick Kenis, and Denise van Raaij: Modeling Open, Web-based Collaboration Networks: The Case of Wikipedia. In Markus Gamper, Linda Reschke, Michael Schönhuth (Eds.): Knoten und Kanten 2.0, pp. 141-162. transcript-Verlag, 2012.
- Jürgen Lerner, Patrick Kenis, Denise van Raaij and Ulrik Brandes: Will they stay or will they go? How network properties of WebICs predict dropout rates of valuable Wikipedians. European Management Journal, 29(5):404-413, 2011.
- Ulrik Brandes, Patrick Kenis, Jürgen Lerner, and Denise van Raaij: Network Analysis of Collaboration Structure in Wikipedia. Proc. 18th Intl. World Wide Web Conference (WWW 2009).
More technical details about the computation of Wikipedia edit networks can be found in
- Ulrik Brandes, Patrick Kenis, Jürgen Lerner, and Denise van Raaij: Computing Wikipedia Edit Networks. Technical Report, 2009.