Fixing basic encoding issues migrating / upgrading a WordPress database

I recently joined Automattic as a Code Wrangler (more on that soon), and I decided now was the time to de-spam, combine and resuscitate my old WordPress blogs.

Along the way I noticed a few annoying encoding issues crept in. The Wordpress XML exporter/importer, for example, double-escaped all my quotes so that <a href=”this”> became <a href=\”this\”>, which as you can imagine broke… everything.

The SQL query to fix this is:

UPDATE `wp_posts`
  SET post_content_filtered = REPLACE(post_content_filtered, '\\"', '"'),
      post_content = REPLACE(post_content, '\\"', '"')
  WHERE post_content_filtered like '%\"%' OR post_content LIKE '%\"%'

Next I noticed a lot of double-quotes and ellipses had turned into gibberish like ““” and “—. Here’s the SQL queries to deal with that:

UPDATE wp_posts SET post_content = REPLACE(post_content, '’', '\''), post_content_filtered = REPLACE(post_content_filtered, '’', '\'');
UPDATE wp_posts SET post_content = REPLACE(post_content, '“', '“'), post_content_filtered = REPLACE(post_content_filtered, '“', '“');
UPDATE wp_posts SET post_content = REPLACE(post_content, 'â€', '”'), post_content_filtered = REPLACE(post_content_filtered, 'â€', '”');
UPDATE wp_posts SET post_content = REPLACE(post_content, '¦', '…'), post_content_filtered = REPLACE(post_content_filtered, '¦', '…');

Once I ran these few queries, I couldn’t see any more formatting errors. Others online have advised disabling the line “define(‘DB_CHARSET’, ‘utf8’)” in the wp-config.php. This seems inadvisable to me as it may allow your database to become even more inconsistent, by not enforcing a single clear and future-proof encoding on all content.

Now that everything’s been cleaned up my old content looks great and I can move on to customising the theme of my site.