Converting WordPress To Webby

The process of converting my old WordPress posts to Webby was relatively painless, but there are a few things worth sharing.

The first step was to export my WordPress MySQL database and create a local copy, and then to create DataMapper classes corresponding to the two tables I was interested in, wp_posts and wp_comments.

mysql> describe wp_posts;
+-----------------------+---------------------+
| Field                 | Type                |
+-----------------------+---------------------+
| ID                    | bigint(20) unsigned |
| post_author           | bigint(20)          |
| post_date             | datetime            |
| post_date_gmt         | datetime            |
| post_content          | longtext            |
| post_title            | text                |
| post_category         | int(4)              |
| post_excerpt          | text                |
| post_status           | varchar(20)         |
| comment_status        | varchar(20)         |
| ping_status           | varchar(20)         |
| post_password         | varchar(20)         |
| post_name             | varchar(200)        |
| to_ping               | text                |
| pinged                | text                |
| post_modified         | datetime            |
| post_modified_gmt     | datetime            |
| post_content_filtered | text                |
| post_parent           | bigint(20)          |
| guid                  | varchar(255)        |
| menu_order            | int(11)             |
| post_type             | varchar(20)         |
| post_mime_type        | varchar(100)        |
| comment_count         | bigint(20)          |
+-----------------------+---------------------+
24 rows in set (0.01 sec)

mysql> describe wp_comments;
+----------------------+---------------------+
| Field                | Type                |
+----------------------+---------------------+
| comment_ID           | bigint(20) unsigned |
| comment_post_ID      | int(11)             |
| comment_author       | tinytext            |
| comment_author_email | varchar(100)        |
| comment_author_url   | varchar(200)        |
| comment_author_IP    | varchar(100)        |
| comment_date         | datetime            |
| comment_date_gmt     | datetime            |
| comment_content      | text                |
| comment_karma        | int(11)             |
| comment_approved     | varchar(20)         |
| comment_agent        | varchar(255)        |
| comment_type         | varchar(20)         |
| comment_parent       | bigint(20)          |
| user_id              | bigint(20)          |
+----------------------+---------------------+
15 rows in set (0.00 sec)

And no, I don’t know why they have wp_posts.ID as a bigint(20) and then wp_comments.comment_post_ID, which should be the same size, as an int(11). This is a database that has been upgraded a few times so perhaps that’s a legacy thing.

While DataMapper can easily accept a non-standard primary key in a table, it gets a little trickier when you are linking two tables together using has n and belongs_to. I found it simpler to just change the names of the primary keys and foreign key. So, after creating a new database and loading the mysqldump file with all my blog’s data, I ran the following:

ALTER TABLE wp_posts CHANGE ID id bigint(20) unsigned;
ALTER TABLE wp_comments CHANGE comment_ID id bigint(20) unsigned;
ALTER TABLE wp_comments CHANGE comment_post_ID post_id int(11);

Update: I think I cracked the custom parent_key, child_key bit in DataMapper.

class Post
  has n,       :comments,
               :parent_key => [:ID],              # key on wp_posts
               :child_key  => [:comment_post_ID]  # foreign key on wp_comments
end

class Comment
  belongs_to   :post,
               :parent_key => [:ID],              # key on wp_posts
               :child_key  => [:comment_post_ID]  # foreign key on wp_comments
end

See parent_key_example.rb for a full working example. This should negate the need to change field names as above but I haven’t fully tested it.
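To make the roles of the two keys concrete, here is a minimal plain-Ruby sketch with no DataMapper involved. It assumes, as I understand DataMapper's naming, that the parent key is the identifying column on wp_posts and the child key is the foreign-key column on wp_comments. The rows are made-up sample data and comments_for is a hypothetical helper.

```ruby
# Made-up sample rows using the original WordPress column names.
posts = [
  { :ID => 1, :post_title => "First post" },
  { :ID => 2, :post_title => "Second post" }
]
comments = [
  { :comment_ID => 10, :comment_post_ID => 1, :comment_author => "alice" },
  { :comment_ID => 11, :comment_post_ID => 1, :comment_author => "bob" },
  { :comment_ID => 12, :comment_post_ID => 2, :comment_author => "carol" }
]

# Parent key: wp_posts.ID. Child key: wp_comments.comment_post_ID.
# comment_ID only identifies a comment; it plays no part in the join.
def comments_for(post, comments)
  comments.select { |c| c[:comment_post_ID] == post[:ID] }
end

posts.each do |post|
  authors = comments_for(post, comments).map { |c| c[:comment_author] }
  puts "#{post[:post_title]}: #{authors.join(', ')}"
end
```

This is only the join that post.comments is meant to express, not how DataMapper implements it.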

One of the really nice things about DataMapper is that it will happily ignore any fields in your database which you don’t mention explicitly. So, you only have to define DataMapper properties for the fields you want to be able to work with. The top of my post.rb file looks like:

class Post
  include DataMapper::Resource
  storage_names[:default] = 'wp_posts'
  
  property :id, Integer, :serial => true # original field name ID
  property :post_date, DateTime
  property :post_content, Text
  property :post_title, String
  property :post_status, String
  property :post_name, String
  
  has n, :comments, :comment_approved => true, :order => [:comment_date]
  

And my comment.rb file starts with:

class Comment
  include DataMapper::Resource
  storage_names[:default] = 'wp_comments'
  
  property :id, Integer, :serial => true # original field name comment_ID
  property :post_id, Integer # original field name comment_post_ID
  property :comment_author, String
  property :comment_author_url, String
  property :comment_date, DateTime
  property :comment_content, String
  property :comment_approved, Boolean
  property :user_id, Integer
  
  belongs_to :post
  

So, just like that I can access all my posts and comments using DataMapper classes, and I can do things like post.comments.

The initialization for DataMapper is simply:

require "rubygems"
require "fileutils" # FileUtils is used by Post#publish and Post.publish_all
require "dm-core"
DataMapper.setup(:default, 'mysql://localhost/ananelson_wordpress?socket=/tmp/mysql.sock')

# Local files
require "lib/comment"
require "lib/post"
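The connection string packs several settings into one URI. As a rough illustration, here it is split apart with Ruby's standard URI library. DataMapper does its own parsing, so this only shows which part means what; the variable names are mine.

```ruby
require "uri"

# Same connection string as in the setup call, split for illustration only.
uri = URI.parse("mysql://localhost/ananelson_wordpress?socket=/tmp/mysql.sock")

adapter  = uri.scheme                                # which database adapter to use
host     = uri.host                                  # database server
database = uri.path.sub(%r{\A/}, "")                 # database name, minus the leading slash
options  = URI.decode_www_form(uri.query.to_s).to_h  # extra options such as the socket path

puts "adapter:  #{adapter}"
puts "host:     #{host}"
puts "database: #{database}"
puts "socket:   #{options['socket']}"
```

If your MySQL server listens on a different socket, or you connect over TCP, the query part is the piece to change or drop.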

Now, how do I get the content formatted nicely? WordPress takes the data stored in the database and feeds it through a PHP function called the_content.

// This is an excerpt from the WordPress source code. http://wordpress.org/about/gpl/
function the_content($more_link_text = '(more...)', $stripteaser = 0, $more_file = '') {
	$content = get_the_content($more_link_text, $stripteaser, $more_file);
	$content = apply_filters('the_content', $content);
	$content = str_replace(']]>', ']]&gt;', $content);
	echo $content;
}

The apply_filters function is the thing that interests me. More digging in the WordPress source revealed:

// This is an excerpt from the WordPress source code. http://wordpress.org/about/gpl/

add_filter('the_content', 'wptexturize');
add_filter('the_content', 'convert_smilies');
add_filter('the_content', 'convert_chars');
add_filter('the_content', 'wpautop');
add_filter('the_content', 'prepend_attachment');

# snip...

add_filter('comment_text', 'wptexturize');
add_filter('comment_text', 'convert_chars');
add_filter('comment_text', 'make_clickable', 9);
add_filter('comment_text', 'force_balance_tags', 25);
add_filter('comment_text', 'convert_smilies', 20);
add_filter('comment_text', 'wpautop', 30);
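The optional third argument to add_filter is a priority. WordPress runs filters with lower numbers first, uses 10 when no priority is given, and keeps registration order among filters with the same priority. The following Ruby sketch works out the resulting order for the comment_text filters listed above; execution_order is a hypothetical helper, not part of WordPress.

```ruby
# The comment_text registrations from the excerpt, in source order.
# nil means no priority argument was given.
registrations = [
  ["wptexturize",        nil],
  ["convert_chars",      nil],
  ["make_clickable",     9],
  ["force_balance_tags", 25],
  ["convert_smilies",    20],
  ["wpautop",            30]
]

# Sort by priority, falling back to registration order for ties
# (the index is included because sort_by alone is not stable).
def execution_order(registrations, default_priority = 10)
  registrations
    .each_with_index
    .sort_by { |(_name, priority), index| [priority || default_priority, index] }
    .map { |(name, _priority), _index| name }
end

puts execution_order(registrations)
```

So for comments, make_clickable runs first and wpautop last. Anything that replays these filters by hand has to pick an order, and following the priorities is the closest match to what WordPress does.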

So, WordPress has a number of filters which are applied to the post content and the comments after the text is pulled out of the database. The simplest way I could think of to replicate this behaviour was to just use these same WordPress filters. I decided that I could live without convert_smilies, and that there was no reason not to use make_clickable for my posts as well as for the comments, so that left me with a single list of filters for both. I wrote a short PHP-based shell script:

#!/usr/bin/env php -q
<?php

include 'wp/plugin.php';

include 'wp/kses.php';
include 'wp/formatting.php';
include 'wp/shortcodes.php';

$text = file_get_contents($argv[1]);

$text = wptexturize($text);
$text = convert_chars($text);
$text = make_clickable($text);
$text = force_balance_tags($text);
$text = wpautop($text);

echo $text;
?>

Then I just had to wrap the shell script in Ruby.

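def wp_format(text)
  # Write the text to a scratch file that the PHP script can read.
  tmpfile = "temp.txt"
  File.open(tmpfile, 'w') do |f|
    f.write text
  end

  # Run the PHP wrapper and capture whatever it prints.
  result = `./wp_format #{tmpfile}`
  File.delete(tmpfile)
  puts result # echo the formatted text as a progress check
  result
end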

For some reason Ruby’s Tempfile library gave me some strange filenames which either got garbled or weren’t palatable to system(), so I just used “temp.txt”. You could always add a timestamp if you wanted to.
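For anyone who would rather not use a fixed filename, here is a sketch of a Tempfile-based variant. I don't know what went wrong with Tempfile in the original attempt, so this is a guess at a workaround rather than a diagnosis: it flushes the file before the script reads it, and it passes the command as separate arguments so no shell ever sees the path. The method name and the command parameter are hypothetical.

```ruby
require "tempfile"
require "open3"

# Tempfile-based variant of wp_format. `command` defaults to the ./wp_format
# script from this post; it is an array so extra arguments can be added.
def wp_format_with_tempfile(text, command = ["./wp_format"])
  Tempfile.create("wp_format") do |f|
    f.write(text)
    f.flush # make sure the text is on disk before the script reads it
    # Separate arguments mean no shell is involved, so the path needs no quoting.
    output, status = Open3.capture2(*command, f.path)
    raise "#{command.join(' ')} failed" unless status.success?
    output
  end # Tempfile.create removes the file when the block exits
end
```

Calling wp_format_with_tempfile(post_content) should then behave like wp_format, minus the leftover temp.txt if the script crashes part-way.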

Now, I need to recreate the permalink scheme I had set up in WordPress.

  
  def filedir
    location = "../content/" # relative path to webby content dir
    location + "said/on/" + post_date.strftime("%Y/%m/%d/") + post_name
  end
  
  def filename
    filedir + "/index.txt"
  end
  

I used a directory “said/on” (yeah, sorry, I was feeling too clever that day) followed by Year/Month/Day and then the post slug. So, in my Post class I have two methods: filedir, which builds the directory path (including the post slug), and filename, which appends index.txt to it (.txt since this is going into Webby).
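As a quick standalone check of the scheme, the same path can be built outside the Post class. The date and slug below are made up for illustration.

```ruby
# Hypothetical post: this date and slug are sample values only.
post_date = Time.utc(2008, 6, 5, 14, 30)
post_name = "converting-wordpress-to-webby"

content_dir = "../content" # relative path to the Webby content dir, as in filedir

# said/on/YYYY/MM/DD/slug, then index.txt inside that directory.
filedir  = File.join(content_dir, "said", "on", post_date.strftime("%Y/%m/%d"), post_name)
filename = File.join(filedir, "index.txt")

puts filename
# prints ../content/said/on/2008/06/05/converting-wordpress-to-webby/index.txt
```

Giving each post its own directory with an index.txt inside is what keeps the published URL ending in the slug rather than in a file name.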

Finally, I need code which formats comments and posts, and then a method to iterate over all published posts and all approved comments to print them in that format.

In post.rb:

  
  def webby_header
%{---
title: #{post_title}
created_at: #{post_date.to_s}
---
}
  end
  
  def publish
    FileUtils.mkdir_p(filedir)
    File.open(filename, "w") do |f|
      f.write(webby_header)
      if [33].include?(id) # Post no. 33 and wp_format don't get along.
        f.write(post_content)
      else
        f.write(wp_format(post_content))
      end
      if !comments.empty?
        f.write("\n\n<hr>\n\n<h3>Comments</h3>\n")
        comments.each do |c|
          f.write(c.to_html)
        end
      end
    end
  end
  
  def self.publish_all
    FileUtils.rm_rf("../content/said")
    Post.all(:post_status => 'publish').each do |p|
      p.publish
    end
  end
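One thing to watch in webby_header: the title is written into the header as-is. Assuming Webby reads the block between the --- lines as YAML, a title containing a colon or a leading quote could confuse it. A minimal sketch of a safer variant lets Ruby's YAML library quote the title when needed; webby_header_for is a hypothetical name, and the created_at line is left exactly as in the original.

```ruby
require "yaml"

# Build the header, letting YAML decide whether the title needs quoting.
def webby_header_for(title, created_at)
  # to_yaml emits its own leading "---" line, which is stripped here.
  title_line = { "title" => title }.to_yaml.sub(/\A--- ?\n/, "")
  "---\n#{title_line}created_at: #{created_at}\n---\n"
end

puts webby_header_for("DataMapper: a quick look", "2008-06-05T14:30:00+00:00")
```

For the handful of posts converted here this probably made no difference, but it would matter for a larger archive with more varied titles.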
  

In comment.rb:

  
  def author_with_url
    if comment_author_url.to_s.empty?
      comment_author
    else
      %{<a href="#{comment_author_url}">#{comment_author}</a>}
    end
  end
  
  def to_html
    %{
<b>#{author_with_url}</b> #{comment_date.strftime("%d %b %Y")}
#{wp_format(comment_content)}

}
  end
  

Not the most beautiful of code, but I’m only using it once and it works.
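One possible refinement, if the export were going to be reused: the comment author's name and URL go into the HTML unescaped. Here is a sketch of an escaped version of author_with_url using Ruby's standard CGI library, written as a standalone method that takes the two fields as arguments. It assumes the stored values are raw text; if they are already HTML-escaped in your database, this would double-escape them.

```ruby
require "cgi"

# Escape user-supplied comment fields before interpolating them into HTML.
def escaped_author_with_url(author, url)
  name = CGI.escapeHTML(author.to_s)
  return name if url.to_s.empty?
  %{<a href="#{CGI.escapeHTML(url.to_s)}">#{name}</a>}
end

puts escaped_author_with_url("Tom & Jerry", "")
puts escaped_author_with_url("<script>", "http://example.com/?a=1&b=2")
```

The first call prints Tom &amp;amp; Jerry as plain text with the ampersand escaped, and the second produces a link whose text and href are both escaped.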

So, when I call Post.publish_all, I get one text file per post in my Webby content directory, each at a path of the form said/on/YYYY/MM/DD/post-slug/index.txt.

And the next time I call rake build, each of those text files will be converted to an HTML page.

I have ignored tags and categories, and I didn’t have to deal with images in any of my blog posts, so that made this job easier. I did have to manually tweak the output for two of these blog posts. In one of them, quotation marks were turned into some bizarre character and, since there were only 6 of them, I changed them by hand. Also one of my posts resisted wp_format completely so I just excluded that one from being formatted and added a Webby textile filter, which worked just fine.

If I had more posts to convert I would have investigated the reasons behind these problems and adjusted my code accordingly, but in this case it made sense to just fix them.

So, there you are. A relatively painless export. I can see that DataMapper is going to be my tool of choice for quickly working with legacy databases and exporting or reformatting them. It’s so quick to set up, and then you have access to any Ruby library you need to help you process your data.

You are free to make use of any of these scripts subject to the terms of the GPL. We really, really need a decent license for code snippets which fits in a single line comment. I’m going with GPL on this one since that is WordPress’s license and I’m using bits of their code here. But if you want to do something similar to what I have done here that doesn’t relate to WordPress, then you can consider the code I have written to be in the public domain or, if you prefer, under the MIT license. And that’s the code, not the blog post. Of course, if you find this useful I’d love to hear about it in the comments, by email or on your blog.

If you are looking for any of my old posts, there is a list of them here.




John Wright 03 Oct 2008

Any more word on how you are liking Webby? We are considering it for a documentation site inside Adobe and I was wondering how stable it is. I like the clean look of your site. But how do you use it as a blog? Did you integrate a RoR blog in somehow, or is this comment feature something you just added? If you can release your blog code as MIT that would be awesome.

Ana Nelson 04 Oct 2008

Hi, John,

I'm still delighted with Webby and I use it everywhere I can. It's very stable since you are simply publishing static HTML. My comment system is still in development; I haven't been posting much lately, so I haven't received that many comments for testing purposes. :-) I am happy to share the code now and will publish it under an MIT license as soon as I feel it is ready. Basically, it uses PHP to send an email when someone completes a comment form, Ruby's Net/IMAP to gather comments from the email account, DataMapper to store the comments in a database, and a Webby helper to incorporate comments into each blog post. I am using a Gmail account for now to see if Gmail's spam filtering will work on comment spam, but when I get a chance I hope to replace this with generic email and an open source filtering system such as bogofilter.

If you are considering using Webby for documentation purposes, you should also take a look at Idiopidae (see link in footnote and click on "Download All" in the sidebar of this post for an example).

-Ana