Keep blogging ... Keep ideas flowing: June 2008

How to run your own Wikipedia: A tutorial

on setting up a platform for open content research

Chitu Okoli

John Molson School of Business

Concordia University

Montréal, QC H3G 1M8, Canada

(514) 848-2424 x2985

Chitu.Okoli@concordia.ca

Bilal Abdul Kader

John Molson School of Business

Concordia University

Montréal, QC H3G 1M8, Canada

bila_abd@jmsb.concordia.ca

EXTENDED ABSTRACT

Wikipedia, “the free encyclopedia that anyone can edit” (http://www.wikipedia.org), has become the most significant exemplar of the burgeoning open content model. The model began with open source software; open content generalizes it by using copyright law to grant users of various information products royalty-free licenses to use them, produce derivative works, and redistribute them. By applying this model to the encyclopedia, Wikipedia has shown that the concept can indeed yield high-quality products beyond software.

As a result of its popularity, Wikipedia has attracted the attention of scholars, who have taken various angles to understand how this model works and to apply what they learn to their own fields of interest. Some of these studies examine Wikipedia philosophically or use its article content as their raw data source; a significant category of studies goes further, tracking the history of changes over time and investigating the editing relationships among Wikipedia contributors.

These studies require not only accessing the publicly available Wikipedia on the Web, but actually downloading the contents of Wikipedia, including the revision history, and installing them in a local MySQL database for detailed analysis. This is feasible because Wikipedia provides periodic (monthly or bimonthly) dumps of the full contents of the database, without images. Moreover, MediaWiki, the wiki software that was custom-developed to run Wikipedia, is downloadable open source software. Any research study that analyzes more than the current version of Wikipedia therefore needs to download the Wikipedia database, and most will also need a local installation of MediaWiki to replicate Wikipedia fully. While not necessary for database queries, the MediaWiki interface is the most practical way to read and access individual articles in a format similar to that of the live Wikipedia.
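As a rough illustration of what working with a dump involves, the sketch below streams through a MediaWiki-style XML export and counts the revisions of each page. The sample data, the namespace URI, and the function names are illustrative assumptions; a real dump is far larger and is normally distributed compressed, so the stream would more likely come from something like Python's `bz2.open` than from an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the general shape of a MediaWiki XML export.
# The namespace URI and the page contents are illustrative assumptions.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Montreal</title>
    <revision><id>1</id><timestamp>2008-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor></revision>
    <revision><id>2</id><timestamp>2008-02-01T00:00:00Z</timestamp>
      <contributor><username>Bob</username></contributor></revision>
  </page>
  <page>
    <title>Wiki</title>
    <revision><id>3</id><timestamp>2008-03-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(stream):
    """Stream through a dump and return {page title: number of revisions}."""
    counts = {}
    title = None
    revisions = 0
    # iterparse yields each element as its closing tag is read, so the
    # whole file never has to be held in memory at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # free the revision's children once counted
        elif name == "page":
            counts[title] = revisions
            title, revisions = None, 0
            elem.clear()
    return counts


revision_counts = count_revisions(io.BytesIO(SAMPLE_DUMP))
print(revision_counts)
```

Streaming matters here because a full-history dump does not fit comfortably in memory; loading it into MySQL, as the paper describes, remains the better route for repeated analysis.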

This paper has two main sections. The first section generally introduces Wikipedia as a subject of scholarly research, as indicated by its being either a significant source or the focus of over 50 peer-reviewed journal articles, not counting other scholarly sources such as conference proceedings. In particular, this review section focuses on those studies that involve in-depth analysis of the data in the Wikipedia database, beyond merely the content of Wikipedia articles.

The second, major section of this paper is a practical tutorial on how to set up an offline Wikipedia server for the purpose of academic research. For hardware, the live Wikipedia currently runs on a variety of servers from HP and Dell (vendor details still to be verified), with dedicated servers that run the Apache web server, MySQL database engines, and Squid web caching proxy servers. The systems run on various flavors of Linux, mainly Ubuntu and Fedora Core. This paper focuses on an offline server for research purposes, and so it does not describe what is involved in maintaining a live web server that handles a high volume of requests.

For software, we describe the issues involved in setting up a server on Linux, Windows, and Mac OS, and explain why Linux is best for this purpose. We describe the installation and configuration of the key software involved: the Apache-MySQL-PHP stack and the MediaWiki wiki server. We also detail the security configuration needed for a secure and stable server.

We describe in detail how to make requests on the server that would be involved in a research study, presenting a number of sample requests based on the studies presently reported in the literature. We present methods of querying the server using MySQL and XML.
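A minimal sketch of the kind of request meant here: counting revisions and distinct editors per article. The table and column names are modeled on MediaWiki's `page` and `revision` tables but heavily simplified, and should be checked against the schema of the installed MediaWiki version. SQLite stands in for MySQL only so that the example runs without a server, and the sample rows are invented.

```python
import sqlite3

# In-memory SQLite database standing in for the local MySQL copy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE revision (
    rev_id        INTEGER PRIMARY KEY,
    rev_page      INTEGER,   -- page_id of the edited article
    rev_user_text TEXT,      -- name of the contributor
    rev_timestamp TEXT
);
INSERT INTO page VALUES (1, 'Montreal'), (2, 'Wiki');
INSERT INTO revision VALUES
    (1, 1, 'Alice', '20080101000000'),
    (2, 1, 'Bob',   '20080201000000'),
    (3, 1, 'Alice', '20080301000000'),
    (4, 2, 'Carol', '20080401000000');
""")

# Revisions and distinct editors per article, most-edited first.
EDIT_ACTIVITY_SQL = """
SELECT p.page_title,
       COUNT(*)                        AS revisions,
       COUNT(DISTINCT r.rev_user_text) AS editors
FROM page AS p
JOIN revision AS r ON r.rev_page = p.page_id
GROUP BY p.page_id, p.page_title
ORDER BY revisions DESC, p.page_title
"""

edit_activity = conn.execute(EDIT_ACTIVITY_SQL).fetchall()
for title, revisions, editors in edit_activity:
    print(title, revisions, editors)
```

The same SELECT statement should carry over to MySQL largely unchanged; on a full-size revision table, an index on the join column matters far more than it does in this toy example.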

Finally, we discuss important aspects of maintaining the server. We discuss the human resources necessary, particularly for academic researchers who might have limited resources. We present how to keep the data up to date with the periodic data dumps made of Wikipedia. We discuss establishing a backup regime, including verifying the integrity of the restore procedure, should a restore be necessary. Lastly, we discuss the issues involved in keeping the server software up to date for security and performance reasons.
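One small piece of such a backup regime can be sketched as a checksum comparison between a backup file and the copy read back from the backup medium. The file names and helper functions below are hypothetical, and matching checksums only show that the file survived the round trip; they do not replace a full test restore into the database.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)


# Demonstration with throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as workdir:
    dump = os.path.join(workdir, "wiki_backup.sql")
    with open(dump, "wb") as handle:
        handle.write(b"INSERT INTO page VALUES (1, 'Montreal');\n")

    good_copy = os.path.join(workdir, "restored.sql")
    shutil.copyfile(dump, good_copy)

    bad_copy = os.path.join(workdir, "truncated.sql")
    with open(bad_copy, "wb") as handle:
        handle.write(b"INSERT INTO page")  # simulates a cut-off transfer

    intact_ok = restore_matches(dump, good_copy)
    truncated_ok = restore_matches(dump, bad_copy)

print(intact_ok, truncated_ok)
```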

This paper provides a justification for considering Wikipedia as a focus of scholarly study, and provides a detailed, step-by-step guide to setting up the hardware, software, and data retrieval tools needed to access this rich source of research data.

Categories and Subject Descriptors

H.2.4 [Information Systems]: Database Management: Systems – query processing, textual databases.

General Terms

Management, Measurement, Performance, Experimentation.

Keywords

Open content, Wikipedia, open source software, databases, wikis, research, tutorial.

Local Wikipedia Tutorial

Hardware considerations

Hosting Scenarios

Local single server hosting

In this scenario, we would install a local server at JMSB Concordia to host the application and the database. This option has several advantages and disadvantages.

The main advantages are:

Lower long-run cost. A server costs about as much as one year of hosting on a machine with similar specifications. Since the useful life of a server is about three years, a local server is roughly three times as cost-effective, in hardware terms, as a hosted or cloud solution.
Restricted access. The server sits behind a firewall with an access whitelist, so no one can reach it unless authorized by the network administrator.
Full physical access to the server, which offers better security and more efficient backup and restore operations if needed.
Much faster access from within the LAN for local researchers, who are the only users of the application in the short and medium term.
Flexible backup options and media. Locally, one can back up the OS, applications, and data to several media, depending on need and required reliability.
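The cost argument in the first advantage can be made concrete with a small calculation. The prices below are hypothetical placeholders, chosen only so that the server price equals one year of hosting as assumed above; the optional administration cost shows how quickly the three-to-one ratio shrinks once staff time is counted.

```python
def cost_over_lifetime(server_price, yearly_hosting_fee, years=3,
                       yearly_admin_cost=0.0):
    """Compare total cost of a local server with hosting over its lifetime.

    Returns (local_total, hosted_total, hosted_total / local_total).
    """
    local_total = server_price + yearly_admin_cost * years
    hosted_total = yearly_hosting_fee * years
    return local_total, hosted_total, hosted_total / local_total


# Hypothetical figures: server price equals one year of hosting.
local_total, hosted_total, ratio = cost_over_lifetime(3000.0, 3000.0)
print(local_total, hosted_total, ratio)

# Same figures, but with a hypothetical yearly administration cost.
_, _, ratio_with_admin = cost_over_lifetime(3000.0, 3000.0,
                                            yearly_admin_cost=1500.0)
print(ratio_with_admin)
```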

The main disadvantages are:

Limited bandwidth for downloads, and also for Internet access if the server is later opened to Internet use.
Need for dedicated system and database administration for the server. This can cost more than web hosting or cloud hosting, because the latter two hosting schemes benefit from economies of scale.
Need for load balancing at the server level if the server is opened to the Internet. In the other scenarios, load balancing can be pushed to the network level for better performance and economies of scope and scale.
No redundancy of processing power in case of a hardware failure. Replacing hardware takes a few days in the best case.

Dual server hosting (dedicated web server + dedicated MySQL server)
Single Internet box host
Cloud Service

Distributed Storage
Parallel Processing

Comparison and Contrast

Processing Power

CPU Architecture

Single core / Multi-core
x86 / x64

Single server / Parallel processing

Motherboard Flexibility

CPU Upgrade
Memory Upgrade

Memory

Size
Latency

Storage space

Disk Speed

Transfer rate
RPM

Multi disks
Efficient swap
RAID / Non RAID
Internal / External Issue
Single Node / External nodes

Hardware providers

Selection criteria

Local service providers
Price / Delivery / Reliability
Advanced support on site

Software Considerations

Operating System

Linux / Windows
Linux Recommended OS

Ubuntu Server
CentOS

Linux Choice

Installation
Customization
Securing
Updates
Maintenance
Backup

Windows Server (Windows 2008)

Installation
Configuration
Security
User roles and access

LAMP Stack

Open Source
Widespread usage
Security
Free support on the internet

Web Server

Apache

Open Source
Speed, Reliability, and Security
Perfect match with MediaWiki
.htaccess and friendly URL
PHP module / CGI
Installation

Windows
Linux

Configuration

Windows
Linux

Maintenance
Security

Windows
Linux

Updates
Creating Domains

Windows Native
Security Issues
Tweaking for friendly URL
Installation
Configuration
PHP CGI
Creating domains
Configuring INET_ access

Scripting Language

PHP 4 / PHP 5
PHP ini settings
Wikipedia special settings
Upload settings
Image library (optional)
Math TeX Library (optional)

Database

MySQL / PostgreSQL
Selection criteria
Installation

Win32 / Linux

Configuration
Optimization for large queries

InnoDB tables
Wiki Settings
Large buffer select
Optimized Queries
Full text index (On / Off)

Import

MySQL Admin
largeDBObject Imports

Export

SQL
Other format
XML Interface (xml Wiki)

Backup

Raw SQL
Compressed SQL
XML
Other formats
Raw Data

Restore Test

Restore integrity
Rehashing

Master / Slave servers

Concept
Configuration
Synchronization
Data redundancy
Advanced settings

MediaWiki

Requirements
Installation
Configuration
Maintenance
Security updates
Version updates

Wikipedia

Introduction
History
Concept
Download SQL dumps
SQL Import and local build
Simple Queries
Advanced Queries
Standard Statistics
Advanced Statistics
Academic service sharing
Reports


Thursday, June 12, 2008
