This article shows how to create a MySQL database containing the entire Wikipedia from a downloaded Wikipedia XML dump (enwiki-latest-pages-articles-multistream.xml).
Prerequisites:
- Make sure you have at least 100 GB of free space on your hard drive, plus another 60 GB of free space for the MySQL database.
- Download enwiki-latest-pages-articles-multistream.xml.bz2 from http://dumps.wikimedia.org/enwiki/latest/
- Extract enwiki-latest-pages-articles-multistream.xml.bz2 (more than 10 GB). You'll end up with a file called enwiki-latest-pages-articles-multistream.xml, approximately 45 GB in size.
- Delete enwiki-latest-pages-articles-multistream.xml.bz2; you don't need it anymore.
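If you don't have an archive tool at hand, the extraction step can also be scripted. The following is only a minimal sketch in Python, not part of the original procedure: the helper names (has_free_space, decompress_bz2) and the 1 MB chunk size are assumptions made for illustration. It relies on Python's bz2 module, which reads multi-stream archives transparently.

```python
import bz2
import shutil
from pathlib import Path

GB = 10 ** 9


def has_free_space(directory: str, required_bytes: int) -> bool:
    """Return True if the drive holding `directory` has at least `required_bytes` free."""
    return shutil.disk_usage(directory).free >= required_bytes


def decompress_bz2(src: str, dst: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream-decompress `src` into `dst` and return the number of bytes written.

    The file is processed chunk by chunk, so the 45 GB result never has
    to fit in memory. bz2.open handles multi-stream archives, so every
    stream in the file is decompressed, not just the first one.
    """
    written = 0
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            written += len(chunk)
    return written


if __name__ == "__main__":
    archive = Path("enwiki-latest-pages-articles-multistream.xml.bz2")
    # 100 GB is the free-space figure from the prerequisites above.
    if archive.exists() and has_free_space(".", 100 * GB):
        # with_suffix("") drops the trailing ".bz2" from the file name.
        size = decompress_bz2(str(archive), str(archive.with_suffix("")))
        print(f"wrote {size} bytes")
```

Run it from the folder that contains the downloaded .bz2 file. It does nothing if the archive is missing or the drive has less than 100 GB free.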
- Download mwdumper.jar from http://dumps.wikimedia.org/tools/. We'll use it to convert XML to SQL.
- Place mwdumper.jar in the same folder where you extracted enwiki-latest-pages-articles-multistream.xml.
- Run the following command from the Windows command prompt:
java -jar mwdumper.jar enwiki-latest-pages-articles-multistream.xml --format=sql:1.5 > enwiki-latest-pages-articles-multistream.sql
Add the --filter=latest option if you don't care about old article revisions and only want to keep the most recent one:
java -jar mwdumper.jar enwiki-latest-pages-articles-multistream.xml --format=sql:1.5 --filter=latest > enwiki-latest-pages-articles-multistream.sql
This skips all but the last revision listed for each page. Note: Currently this pays no attention to the timestamp or revision number, but simply the order of items in the dump. This may or may not be strictly correct.
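To see why "last listed" and "most recent" are not always the same thing, here is a small Python illustration of the behaviour described in the note. It is not mwdumper's actual code: the (title, revision id, timestamp) tuples and both function names are made up for the example.

```python
from typing import Dict, Iterable, List, Tuple

# (page_title, revision_id, timestamp), in the order they appear in the dump.
Revision = Tuple[str, int, str]


def keep_last_listed(revisions: Iterable[Revision]) -> List[Revision]:
    """Keep, for each page, whichever revision appears last in the input."""
    last_seen: Dict[str, Revision] = {}
    for rev in revisions:
        last_seen[rev[0]] = rev  # later entries overwrite earlier ones
    return list(last_seen.values())


def keep_newest_by_timestamp(revisions: Iterable[Revision]) -> List[Revision]:
    """Keep, for each page, the revision with the greatest timestamp."""
    newest: Dict[str, Revision] = {}
    for rev in revisions:
        current = newest.get(rev[0])
        if current is None or rev[2] > current[2]:
            newest[rev[0]] = rev
    return list(newest.values())


if __name__ == "__main__":
    # Hypothetical dump in which the revisions of "Foo" are out of order.
    dump = [
        ("Foo", 2, "20200102000000"),
        ("Foo", 1, "20200101000000"),
        ("Bar", 5, "20200301000000"),
    ]
    print(keep_last_listed(dump))          # "Foo" ends up with revision 1
    print(keep_newest_by_timestamp(dump))  # "Foo" ends up with revision 2
```

As long as the dump lists each page's revisions in chronological order, both approaches give the same result.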
Once the process starts, it will run for a while, so the only requirement is patience :)
Once this step is completed, you can delete enwiki-latest-pages-articles-multistream.xml; we won't need it anymore.
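If you'd rather script the conversion than type it by hand, something along these lines should work. This is a sketch only: build_mwdumper_command and convert are hypothetical helper names, and it assumes java is on your PATH and mwdumper.jar sits in the current folder. The mwdumper arguments are the ones shown above. The XML file is deleted only when mwdumper exits without an error.

```python
import os
import subprocess
from typing import List


def build_mwdumper_command(xml_path: str, latest_only: bool = False) -> List[str]:
    """Build the same mwdumper command line that is shown in the article."""
    cmd = ["java", "-jar", "mwdumper.jar", xml_path, "--format=sql:1.5"]
    if latest_only:
        cmd.append("--filter=latest")
    return cmd


def convert(xml_path: str, sql_path: str,
            latest_only: bool = False, delete_xml: bool = False) -> None:
    """Run mwdumper, writing its standard output to `sql_path`.

    check=True raises CalledProcessError on a non-zero exit code, so the
    XML file is never removed after a failed conversion.
    """
    with open(sql_path, "wb") as out:
        subprocess.run(build_mwdumper_command(xml_path, latest_only),
                       stdout=out, check=True)
    if delete_xml:
        os.remove(xml_path)


if __name__ == "__main__":
    xml_file = "enwiki-latest-pages-articles-multistream.xml"
    if os.path.exists(xml_file):
        convert(xml_file,
                "enwiki-latest-pages-articles-multistream.sql",
                latest_only=True)
```

Pass delete_xml=True only once you are sure you won't need the XML again.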
- Now it is time to create the 'wiki' database. Run the following SQL command in MySQL:
CREATE DATABASE `wiki` /*!40100 COLLATE 'utf8_general_ci' */;
- The next step is to create all the tables Wikipedia requires. Do so by executing the following SQL:
use wiki;
CREATE TABLE /*_*/user (
user_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
-- Usernames must be unique, must not be in the form of
-- an IP address. _Shouldn't_ allow slashes or case
-- conflicts. Spaces are allowed, and are _not_ converted
-- to underscores like titles. See the User::newFromName() for
-- the specific tests that usernames have to pass.
user_name varchar(255) binary NOT NULL default '',
-- Optional 'real name' to be displayed in credit listings
user_real_name varchar(255) binary NOT NULL default '',
-- Password hashes, see User::crypt() and User::comparePasswords()
-- in User.php for the algorithm
user_password tinyblob NOT NULL,
-- When using 'mail me a new password', a random
-- password is generated and the hash stored here.
-- The previous password is left in place until
-- someone actually logs in with the new password,
-- at which point the hash is moved to user_password
-- and the old password is invalidated.
user_newpassword tinyblob NOT NULL,
-- Timestamp of the last time when a new password was
-- sent, for throttling and expiring purposes
-- Emailed passwords will expire $wgNewPasswordExpiry
-- (a week) after being set. If user_newpass_time is NULL
-- (eg. created by mail) it doesn't expire.
user_newpass_time binary(14),
-- Note: email should be restricted, not public info.
-- Same with passwords.
user_email tinytext NOT NULL,
-- This is a timestamp which is updated when a user
-- logs in, logs out, changes preferences, or performs
-- some other action requiring HTML cache invalidation
-- to ensure that the UI is updated.
user_touched binary(14) NOT NULL default '',
-- A pseudorandomly generated value that is stored in
-- a cookie when the "remember password" feature is
-- used (previously, a hash of the password was used, but
-- this was vulnerable to cookie-stealing attacks)
user_token binary(32) NOT NULL default '',
-- Initially NULL; when a user's e-mail address has been
-- validated by returning with a mailed token, this is
-- set to the current timestamp.
user_email_authenticated binary(14),
-- Randomly generated token created when the e-mail address
-- is set and a confirmation test mail sent.
user_email_token binary(32),
-- Expiration date for the user_email_token
user_email_token_expires binary(14),
-- Timestamp of account registration.
-- Accounts predating this schema addition may contain NULL.
user_registration binary(14),
-- Count of edits and edit-like actions.
--
-- *NOT* intended to be an accurate copy of COUNT(*) WHERE rev_user=user_id
-- May contain NULL for old accounts if batch-update scripts haven't been
-- run, as well as listing deleted edits and other myriad ways it could be
-- out of sync.
--
-- Meant primarily for heuristic checks to give an impression of whether
-- the account has been used much.
--
user_editcount int
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/user_name ON /*_*/user (user_name);
CREATE INDEX /*i*/user_email_token ON /*_*/user (user_email_token);
CREATE INDEX /*i*/user_email ON /*_*/user (user_email(50));
--
-- User permissions have been broken out to a separate table;
-- this allows sites with a shared user table to have different
-- permissions assigned to a user in each project.
--
-- This table replaces the old user_rights field which used a
-- comma-separated blob.
--
CREATE TABLE /*_*/user_groups (
-- Key to user_id
ug_user int unsigned NOT NULL default 0,
-- Group names are short symbolic string keys.
-- The set of group names is open-ended, though in practice
-- only some predefined ones are likely to be used.
--
-- At runtime $wgGroupPermissions will associate group keys
-- with particular permissions. A user will have the combined
-- permissions of any group they're explicitly in, plus
-- the implicit '*' and 'user' groups.
ug_group varbinary(32) NOT NULL default ''
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/ug_user_group ON /*_*/user_groups (ug_user,ug_group);
CREATE INDEX /*i*/ug_group ON /*_*/user_groups (ug_group);
-- Stores the groups the user has once belonged to.
-- The user may still belong to these groups (check user_groups).
-- Users are not autopromoted to groups from which they were removed.
CREATE TABLE /*_*/user_former_groups (
-- Key to user_id
ufg_user int unsigned NOT NULL default 0,
ufg_group varbinary(32) NOT NULL default ''
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/ufg_user_group ON /*_*/user_former_groups (ufg_user,ufg_group);
--
-- Stores notifications of user talk page changes, for the display
-- of the "you have new messages" box
--
CREATE TABLE /*_*/user_newtalk (
-- Key to user.user_id
user_id int NOT NULL default 0,
-- If the user is an anonymous user their IP address is stored here
-- since the user_id of 0 is ambiguous
user_ip varbinary(40) NOT NULL default '',
-- The highest timestamp of revisions of the talk page viewed
-- by this user
user_last_timestamp varbinary(14) NULL default NULL
) /*$wgDBTableOptions*/;
-- Indexes renamed for SQLite in 1.14
CREATE INDEX /*i*/un_user_id ON /*_*/user_newtalk (user_id);
CREATE INDEX /*i*/un_user_ip ON /*_*/user_newtalk (user_ip);
--
-- User preferences and perhaps other fun stuff. :)
-- Replaces the old user.user_options blob, with a couple nice properties:
--
-- 1) We only store non-default settings, so changes to the defaults
-- are now reflected for everybody, not just new accounts.
-- 2) We can more easily do bulk lookups, statistics, or modifications of
-- saved options since it's a sane table structure.
--
CREATE TABLE /*_*/user_properties (
-- Foreign key to user.user_id
up_user int NOT NULL,
-- Name of the option being saved. This is indexed for bulk lookup.
up_property varbinary(255) NOT NULL,
-- Property value as a string.
up_value blob
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/user_properties_user_property ON /*_*/user_properties (up_user,up_property);
CREATE INDEX /*i*/user_properties_property ON /*_*/user_properties (up_property);
--
-- Core of the wiki: each page has an entry here which identifies
-- it by title and contains some essential metadata.
--
CREATE TABLE /*_*/page (
-- Unique identifier number. The page_id will be preserved across
-- edits and rename operations, but not deletions and recreations.
page_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
-- A page name is broken into a namespace and a title.
-- The namespace keys are UI-language-independent constants,
-- defined in includes/Defines.php
page_namespace int NOT NULL,
-- The rest of the title, as text.
-- Spaces are transformed into underscores in title storage.
page_title varchar(255) binary NOT NULL,
-- Comma-separated set of permission keys indicating who
-- can move or edit the page.
page_restrictions tinyblob NOT NULL,
-- Number of times this page has been viewed.
page_counter bigint unsigned NOT NULL default 0,
-- 1 indicates the article is a redirect.
page_is_redirect tinyint unsigned NOT NULL default 0,
-- 1 indicates this is a new entry, with only one edit.
-- Not all pages with one edit are new pages.
page_is_new tinyint unsigned NOT NULL default 0,
-- Random value between 0 and 1, used for Special:Randompage
page_random real unsigned NOT NULL,
-- This timestamp is updated whenever the page changes in
-- a way requiring it to be re-rendered, invalidating caches.
-- Aside from editing this includes permission changes,
-- creation or deletion of linked pages, and alteration
-- of contained templates.
page_touched binary(14) NOT NULL default '',
-- Handy key to revision.rev_id of the current revision.
-- This may be 0 during page creation, but that shouldn't
-- happen outside of a transaction... hopefully.
page_latest int unsigned NOT NULL,
-- Uncompressed length in bytes of the page's current source text.
page_len int unsigned NOT NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/name_title ON /*_*/page (page_namespace,page_title);
CREATE INDEX /*i*/page_random ON /*_*/page (page_random);
CREATE INDEX /*i*/page_len ON /*_*/page (page_len);
CREATE INDEX /*i*/page_redirect_namespace_len ON /*_*/page (page_is_redirect, page_namespace, page_len);
--
-- Every edit of a page creates also a revision row.
-- This stores metadata about the revision, and a reference
-- to the text storage backend.
--
CREATE TABLE /*_*/revision (
-- Unique ID to identify each revision
rev_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
-- Key to page_id. This should _never_ be invalid.
rev_page int unsigned NOT NULL,
-- Key to text.old_id, where the actual bulk text is stored.
-- It's possible for multiple revisions to use the same text,
-- for instance revisions where only metadata is altered
-- or a rollback to a previous version.
rev_text_id int unsigned NOT NULL,
-- Text comment summarizing the change.
-- This text is shown in the history and other changes lists,
-- rendered in a subset of wiki markup by Linker::formatComment()
rev_comment tinyblob NOT NULL,
-- Key to user.user_id of the user who made this edit.
-- Stores 0 for anonymous edits and for some mass imports.
rev_user int unsigned NOT NULL default 0,
-- Text username or IP address of the editor.
rev_user_text varchar(255) binary NOT NULL default '',
-- Timestamp of when revision was created
rev_timestamp binary(14) NOT NULL default '',
-- Records whether the user marked the 'minor edit' checkbox.
-- Many automated edits are marked as minor.
rev_minor_edit tinyint unsigned NOT NULL default 0,
-- Restrictions on who can access this revision
rev_deleted tinyint unsigned NOT NULL default 0,
-- Length of this revision in bytes
rev_len int unsigned,
-- Key to revision.rev_id
-- This field is used to add support for a tree structure (The Adjacency List Model)
rev_parent_id int unsigned default NULL,
-- SHA-1 text content hash in base-36
rev_sha1 varbinary(32) NOT NULL default ''
) /*$wgDBTableOptions*/ MAX_ROWS=10000000 AVG_ROW_LENGTH=1024;
-- In case tables are created as MyISAM, use row hints for MySQL
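To get a feel for how the page and revision tables fit together, here is a small self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, keeps only a handful of the columns defined above, and the sample rows are invented for the demo. The latest_revision helper is hypothetical. It follows the schema comments: spaces in titles are stored as underscores, and page_latest points at revision.rev_id.

```python
import sqlite3
from typing import Optional, Tuple


def demo_connection() -> sqlite3.Connection:
    """Create a trimmed-down copy of the page/revision tables with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            page_id        INTEGER PRIMARY KEY,
            page_namespace INTEGER NOT NULL,
            page_title     TEXT    NOT NULL,
            page_latest    INTEGER NOT NULL
        );
        CREATE TABLE revision (
            rev_id        INTEGER PRIMARY KEY,
            rev_page      INTEGER NOT NULL,
            rev_text_id   INTEGER NOT NULL,
            rev_timestamp TEXT    NOT NULL
        );
        -- Invented sample data: one page with two revisions.
        INSERT INTO page VALUES (1, 0, 'New_York_City', 11);
        INSERT INTO revision VALUES (10, 1, 100, '20140101000000');
        INSERT INTO revision VALUES (11, 1, 101, '20140202000000');
    """)
    return conn


def latest_revision(conn: sqlite3.Connection, title: str,
                    namespace: int = 0) -> Optional[Tuple]:
    """Return (page_id, rev_id, rev_timestamp, rev_text_id) for a page title."""
    stored_title = title.replace(" ", "_")  # titles are stored with underscores
    return conn.execute(
        "SELECT p.page_id, r.rev_id, r.rev_timestamp, r.rev_text_id "
        "FROM page AS p JOIN revision AS r ON r.rev_id = p.page_latest "
        "WHERE p.page_namespace = ? AND p.page_title = ?",
        (namespace, stored_title),
    ).fetchone()


if __name__ == "__main__":
    print(latest_revision(demo_connection(), "New York City"))
```

The same SELECT should work against the real 'wiki' database once the import has finished, as long as you use your MySQL client's own placeholder syntax instead of the `?` used by sqlite3.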
Then execute a command similar to the following:
cmd /C 'mysql -uUSERNAME -p --force wiki