Thursday, June 12, 2008

How to run your own Wikipedia: A tutorial on setting up a platform for open content research

Chitu Okoli
John Molson School of Business
Concordia University
Montréal, QC H3G 1M8, Canada
(514) 848-2424 x2985
Chitu.Okoli@concordia.ca

Bilal Abdul Kader
John Molson School of Business
Concordia University
Montréal, QC H3G 1M8, Canada
bila_abd@jmsb.concordia.ca

EXTENDED ABSTRACT
Wikipedia, “the free encyclopedia that anyone can edit” (http://www.wikipedia.org), has become the most significant exemplar of the burgeoning open content model. Originating with open source software, the open content model uses copyright law to grant users royalty-free licenses to use information products, to produce derivative works from them, and to redistribute them. By applying this model to the encyclopedia, Wikipedia has shown that the concept can yield high-quality products beyond software.
As a result of its popularity, Wikipedia has attracted the attention of scholars who have taken various angles to understand how this model works and to apply what they learn to various fields of interest. Whereas some of these studies examine Wikipedia philosophically or use its article content as their raw data source, a significant category of studies goes further, tracking the history of changes over time and investigating the editing relationships between Wikipedia contributors.
These studies require not only accessing the publicly available Wikipedia on the Web, but also downloading the contents of Wikipedia, including the revision history, and installing them in a local MySQL database for detailed analysis. This is feasible because Wikipedia provides periodic (monthly or bimonthly) dumps of the full contents of the database, without images. Moreover, MediaWiki, the wiki software that was custom-developed to run Wikipedia, is downloadable open source software. Any research study that analyzes more than the current version of Wikipedia would therefore need to download the Wikipedia database, and would most likely also need a local installation of MediaWiki to provide a full local replica of Wikipedia. While not necessary for database queries, the MediaWiki interface is the most practical way to read and access individual articles in a format similar to that of the live Wikipedia.
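As a rough illustration of working with a downloaded dump before it is loaded into MySQL, the following Python sketch streams through an XML export and counts revisions per contributor. It assumes the dump follows the usual MediaWiki export layout (page, revision, and contributor elements with a username or ip child); the function names and the tiny inline sample are hypothetical, and a real dump would be opened from a compressed file rather than from memory.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_revisions_per_contributor(stream):
    """Stream through an export file and count revisions per user name or IP."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "revision":
            # Assumed layout: <revision><contributor><username> or <ip>.
            for child in elem.iter():
                if local_name(child.tag) in ("username", "ip") and child.text:
                    counts[child.text] += 1
                    break
        elif name == "page":
            # Release the finished page so memory use stays bounded.
            elem.clear()
    return counts


# Hypothetical miniature dump, for illustration only.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><contributor><username>Alice</username></contributor></revision>
    <revision><id>2</id><contributor><ip>192.0.2.1</ip></contributor></revision>
    <revision><id>3</id><contributor><username>Alice</username></contributor></revision>
  </page>
</mediawiki>"""

counts = count_revisions_per_contributor(io.BytesIO(SAMPLE))
print(counts.most_common())
```

Streaming with iterparse, rather than loading the whole file, matters here because a dump that includes the full revision history is far too large to hold in memory.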
This paper has two main sections. The first section generally introduces Wikipedia as a subject of scholarly research, as indicated by its being either a significant source or the focus of over 50 peer-reviewed journal articles, not counting other scholarly sources such as conference proceedings. In particular, this review section focuses on those studies that involve in-depth analysis of the data in the Wikipedia database, beyond merely the content of Wikipedia articles.
The second, major section of this paper is a practical tutorial on how to set up an offline Wikipedia server for the purpose of academic research. For hardware, the live Wikipedia currently runs on a variety of servers from HP and Dell *** Bilal, please verify and fill in here***, with dedicated servers that run the Apache web server, MySQL database engines, and Squid web caching proxy servers. The systems run on various flavors of Linux, mainly Ubuntu and Fedora Core. This paper focuses on an offline server for research purposes, and so it does not describe what is involved in maintaining a live web server that handles a high volume of requests.
For software, we describe issues involved in setting up a server on Linux, Windows, and Mac OS, and explain why Linux is best for this purpose. We describe the installation and configuration of the key software involved: the Apache-MySQL-PHP stack, and the MediaWiki wiki server. We also detail the security configurations necessary to ensure a secure and stable server.
We describe in detail how to make requests on the server that would be involved in a research study, presenting a number of sample requests based on the studies presently reported in the literature. We present methods of querying the server using MySQL and XML.
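To give a flavor of the kind of request involved, the sketch below finds pairs of contributors who have edited the same pages, a simple proxy for an editing relationship. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table is a simplified, hypothetical version of a revision table (the column names are assumptions loosely modeled on MediaWiki, not its full schema).

```python
import sqlite3

# In-memory stand-in for the local database; a real study would connect to MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (
    rev_id        INTEGER PRIMARY KEY,
    rev_page      INTEGER NOT NULL,   -- page the revision belongs to
    rev_user_text TEXT    NOT NULL,   -- contributor name (assumed column)
    rev_timestamp TEXT    NOT NULL
);
INSERT INTO revision VALUES
    (1, 10, 'Alice', '20080101000000'),
    (2, 10, 'Bob',   '20080102000000'),
    (3, 10, 'Alice', '20080103000000'),
    (4, 11, 'Bob',   '20080104000000'),
    (5, 11, 'Carol', '20080105000000'),
    (6, 12, 'Alice', '20080106000000'),
    (7, 12, 'Bob',   '20080107000000');
""")

# Self-join on the page: each unordered pair of editors is counted once
# per page they both edited.
CO_EDIT_QUERY = """
SELECT a.rev_user_text AS editor_a,
       b.rev_user_text AS editor_b,
       COUNT(DISTINCT a.rev_page) AS shared_pages
FROM revision AS a
JOIN revision AS b
  ON a.rev_page = b.rev_page
 AND a.rev_user_text < b.rev_user_text
GROUP BY a.rev_user_text, b.rev_user_text
ORDER BY shared_pages DESC, editor_a, editor_b
"""

co_edits = conn.execute(CO_EDIT_QUERY).fetchall()
for editor_a, editor_b, shared_pages in co_edits:
    print(editor_a, editor_b, shared_pages)
```

On a full revision history a self-join of this kind is expensive, so in practice it would likely be restricted to a subset of pages or run against an indexed summary table.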
Finally, we discuss important aspects of maintaining the server. We discuss the human resources necessary, particularly for academic researchers who might have limited resources. We present how to keep the data up to date with the periodic data dumps made of Wikipedia. We discuss establishing a backup regime, including verifying the integrity of a restore procedure, should one be necessary. We also discuss issues involved in keeping the server software up to date for security and performance reasons.
This paper provides a justification for considering Wikipedia as a scholarly focus of study, and provides a detailed, step-by-step guide to setting up the necessary hardware, software, and data retrieval tools to obtain access to this rich source of research data.
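One small piece of a backup regime can be sketched as follows: write a compressed SQL backup, record a checksum of the original data, and later confirm that the archive still decompresses to exactly the same bytes. The function names, file name, and sample SQL are hypothetical, and this checks only the integrity of the archive itself; a complete restore test would also load the dump into a scratch database.

```python
import gzip
import hashlib
import os
import tempfile
import zlib


def sha256_of(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def write_compressed_backup(sql_text, path):
    """Compress a SQL dump to `path`; return the checksum of the raw bytes."""
    raw = sql_text.encode("utf-8")
    with gzip.open(path, "wb") as fh:
        fh.write(raw)
    return sha256_of(raw)


def verify_backup(path, expected_checksum):
    """Decompress the backup and compare it with the recorded checksum."""
    try:
        with gzip.open(path, "rb") as fh:
            restored = fh.read()
    except (OSError, EOFError, zlib.error):
        # Unreadable or truncated archive: the backup cannot be trusted.
        return False
    return sha256_of(restored) == expected_checksum


# Hypothetical miniature dump, for illustration only.
dump = "CREATE TABLE page (page_id INT);\nINSERT INTO page VALUES (1);\n"

with tempfile.TemporaryDirectory() as tmp:
    backup_path = os.path.join(tmp, "wiki_backup.sql.gz")
    checksum = write_compressed_backup(dump, backup_path)
    backup_ok = verify_backup(backup_path, checksum)

    # Simulate a damaged backup by truncating the archive.
    with open(backup_path, "r+b") as fh:
        fh.truncate(10)
    corrupted_ok = verify_backup(backup_path, checksum)

print(backup_ok, corrupted_ok)
```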

Categories and Subject Descriptors
H.2.4 [Information Systems]: Database Management: Systems – query processing, textual databases.
General Terms
Management, Measurement, Performance, Experimentation.
Keywords
Open content, Wikipedia, open source software, databases, wikis, research, tutorial.



Local Wikipedia Tutorial


  1. Hardware considerations
    1. Hosting Scenarios
      1. Local single server hosting
In this scenario, we would install a local server at JMSB Concordia to host the application and the database. This option has several advantages and disadvantages.

The main advantages are:
  • Lower long-run cost. The cost of a server is roughly equal to the cost of one year of hosting on a server with similar characteristics. Given that the useful life of a server is about three years, a local server is about three times as cost-effective as either a web hosting solution or a cloud hosting service.
  • The server sits behind a whitelist-based firewall, and no one can access it unless authorized by the network administrator.
  • Full physical access to the server, which offers better security and more efficient backup and restore operations if needed.
  • Much faster access to the server from within the LAN for local researchers, who are the only users of the application in the short and medium term.
  • Flexibility in the backup options and media. Locally, one can back up the OS, applications, and data on several media, depending on the need and the required reliability.
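The cost argument in the first point above can be made explicit with a small calculation. The figures below are purely hypothetical placeholders, not quotes; the optional administration cost is included only to show how sensitive the three-to-one ratio is to costs beyond the hardware.

```python
def annual_cost_local(server_price, useful_life_years, annual_admin_cost=0.0):
    """Yearly cost of a local server: purchase price spread over its life, plus admin."""
    return server_price / useful_life_years + annual_admin_cost


def hosting_to_local_ratio(annual_hosting_fee, server_price,
                           useful_life_years, annual_admin_cost=0.0):
    """How many times more expensive hosting is per year than a local server."""
    local = annual_cost_local(server_price, useful_life_years, annual_admin_cost)
    return annual_hosting_fee / local


# Hypothetical figures: the server costs as much as one year of hosting
# and lasts three years.
ratio_hardware_only = hosting_to_local_ratio(3000.0, 3000.0, 3)
ratio_with_admin = hosting_to_local_ratio(3000.0, 3000.0, 3, annual_admin_cost=1000.0)

print(ratio_hardware_only)  # 3.0
print(ratio_with_admin)     # 1.5
```

With hardware alone the ratio is three to one, as stated; once a yearly administration cost is added, the advantage narrows, which is the trade-off raised among the disadvantages.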

The main disadvantages are:
  • Limited bandwidth for downloading the data, and also for Internet access if the server is later opened to Internet users.
  • Need for dedicated system and database administration for the server. This cost can be less than optimal compared to web hosting and cloud hosting, because of the economies of scale in the latter two hosting schemes.
  • Need for load balancing at the server level if the server is opened to the Internet. In the other hosting scenarios, load balancing can be pushed to the network level for increased performance and economies of scope and scale.
  • No redundancy in processing power in case of a hardware failure. Hardware replacement takes a few days in the best case.



      2. Dual server hosting (Dedicated web server + dedicated MySQL server)
      3. Single Internet box host
      4. Cloud Service
        1. Distributed Storage
        2. Parallel Processing
      5. Comparison and Contrast

    2. Processing Power
      1. CPU Architecture
        1. Single core / Multi-core
        2. x86 / x64
      2. Single server / Parallel processing

    3. Motherboard Flexibility
      1. CPU Upgrade
      2. Memory Upgrade

    4. Memory
      1. Size
      2. Latency

    5. Storage space
      1. Disk Speed
        1. Transfer rate
        2. RPM
      2. Multi disks
      3. Efficient swap
      4. RAID / Non RAID
      5. Internal / External Issue
      6. Single Node / External nodes

    6. Hardware providers
      1. Selection criteria
        1. Local service providers
        2. Price / Delivery / Reliability
        3. Advanced support on site




  2. Software Considerations
    1. Operating System
      1. Linux / Windows
      2. Linux Recommended OS
        1. Ubuntu Server
        2. CentOS

      3. Linux Choice
        1. Installation
        2. Customization
        3. Securing
        4. Updates
        5. Maintenance
        6. Backup

      4. Windows Server (Windows 2008)
        1. Installation
        2. Configuration
        3. Security
        4. User roles and access


    2. LAMP Stack
      1. Open Source
      2. Widespread usage
      3. Security
      4. Free support on the internet

    3. Web Server
      1. Apache
        1. Open Source
        2. Speed, Reliability, and Security
        3. Perfect match with MediaWiki
        4. .htaccess and friendly URL
        5. PHP module / CGI
        6. Installation
          1. Windows
          2. Linux
        7. Configuration
          1. Windows
          2. Linux
        8. Maintenance
        9. Security
          1. Windows
          2. Linux
        10. Updates
        11. Creating Domains

      2. IIS
        1. Windows Native
        2. Security Issues
        3. Tweaking for friendly URL
        4. Installation
        5. Configuration
        6. PHP CGI
        7. Creating domains
        8. Configuring INET_ access

    4. Scripting Language
      1. PHP 4 / PHP 5
      2. PHP ini settings
      3. Wikipedia special settings
      4. Upload settings
      5. Image library (optional)
      6. Math TeX library (optional)

    5. Database
      1. MySQL / PostgreSQL
      2. Selection criteria
      3. Installation
        1. Win32 / Linux
      4. Configuration
      5. Optimization for large queries
        1. InnoDB tables
        2. Wiki Settings
        3. Large buffer select
        4. Optimized Queries
        5. Full text index (On / Off)

      6. Import
        1. MySQL Admin
        2. largeDBObject Imports

      7. Export
        1. SQL
        2. Other format
        3. XML Interface (xml Wiki)

      8. Backup
        1. Raw SQL
        2. Compressed SQL
        3. XML
        4. Other formats
        5. Raw Data

      9. Restore Test
        1. Restore integrity
        2. Rehashing

      10. Master / Slave servers
        1. Concept
        2. Configuration
        3. Synchronization
        4. Data redundancy
        5. Advanced settings

    6. MediaWiki
      1. Requirements
      2. Installation
      3. Configuration
      4. Maintenance
      5. Security updates
      6. Version updates


    7. Wikipedia
      1. Introduction
      2. History
      3. Concept
      4. Download SQL dumps
      5. SQL Import and local build
      6. Simple Queries
      7. Advanced Queries
      8. Standard Statistics
      9. Advanced Statistics
      10. Academic service sharing
      11. Reports