Beautify your WordPress HTML output with HTML Tidy

As you may have noticed, the HTML that WordPress outputs is usually very messy. This is because the final output is generated by PHP from various sources: pieces from here and there are combined and then just spat out.

Usually there is no need to do anything about this, as the markup is not meant to be human-readable, and browsers just don’t care as long as the syntax is correct.

For optimization you might even use a plugin to minify the HTML, which basically means removing all unnecessary spaces and newlines.

But there are some cases when you’d like to tidy up the HTML so it looks beautiful and is correctly indented. One use case would be a personal portfolio site that you have made to show off your skills as a web developer, where there is a chance that someone will look at the source code to get an idea of how you have implemented the site.

Cleaning up the markup with HTML Tidy

HTML Tidy is a command-line tool that was originally released in 2003. It was written by W3C’s Dave Raggett to correct invalid HTML syntax, detect web accessibility errors and improve the layout and indent style of the markup. Since then the tool has turned into a C library and PHP has added bindings for it.

To use HTML Tidy in PHP you need the libtidy5 library for your operating system, and PHP must be compiled with the --with-tidy option. On Debian or Ubuntu you will probably want to use apt for this:

apt install libtidy-dev php-tidy
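
Before wiring Tidy into a theme, it can be worth confirming that the extension is actually loaded. The following is a minimal sketch that assumes nothing beyond PHP itself; the helper name tidy_is_available() is made up for illustration and is not part of PHP or WordPress.

```php
<?php
// Hypothetical helper: reports whether the tidy extension can be used.
function tidy_is_available(): bool {
  return extension_loaded('tidy') && class_exists('tidy');
}

if (tidy_is_available()) {
  // tidy_get_release() returns the release string of the underlying libtidy.
  echo 'Tidy is available, libtidy release: ' . tidy_get_release() . PHP_EOL;
} else {
  echo 'Tidy extension is not loaded.' . PHP_EOL;
}
```

From a shell, running php -m lists the loaded modules and should include tidy once the package is installed.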

Using Tidy in WordPress is simple. Here’s a fully working example that you can add to your theme’s functions.php.

function html_tidy_cb($html) {
  if (!str_starts_with($html, '<!DOCTYPE html') &&
      !str_starts_with($html, '<!doctype html')) {
    return $html;
  }

  $config = [
    'drop-empty-elements' => false,
    'indent' => 2,
    'wrap' => 0
  ];

  $tidy = new tidy;
  $tidy->parseString($html, $config, 'utf8');
  $tidy->cleanRepair();

  return (string) $tidy; // cast the tidy object to its repaired markup
}

function html_tidy() {
  ob_start('html_tidy_cb');
}

add_action('init', 'html_tidy');

Let’s start from the bottom.

On line 24 of the snippet we hook into the init action with html_tidy() as the callback.

On line 21 we call ob_start(), which turns output buffering on. Everything WordPress outputs from then on is collected in the buffer and passed to our callback function html_tidy_cb() when the buffer is flushed at the end of the request.
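
To see the buffering mechanism on its own, outside WordPress, here is a small sketch. The callback name shout_cb() is hypothetical; it simply uppercases whatever was buffered, in the same place where html_tidy_cb() would tidy it instead.

```php
<?php
// Hypothetical callback: receives the buffered output as a string
// and returns the string that should actually be sent.
function shout_cb(string $buffer): string {
  return strtoupper($buffer);
}

ob_start('shout_cb');           // start buffering; nothing is sent yet
echo "hello from the buffer\n"; // goes into the buffer, not to the client
ob_end_flush();                 // runs shout_cb() on the buffer and sends the result
// prints "HELLO FROM THE BUFFER"
```

The WordPress snippet never flushes the buffer explicitly, because PHP flushes any buffers that are still open when the script ends.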

On line 1 we define the html_tidy_cb() function with a parameter $html that contains the buffered output from WordPress.

Inside our callback we first check that the content is an HTML document, as we don’t want to process dynamically generated files like robots.txt. We do this by checking whether the content starts with a correct doctype definition. I know this probably isn’t the best way to do it, but at least it works in 99% of cases, and when it doesn’t, it should not cause any issues except the missing beautification.
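
If the strict prefix check turns out to be too brittle, for instance when leading whitespace or a differently cased doctype gets in the way, a slightly more forgiving variant could look like the sketch below. The function name looks_like_html_document() is hypothetical and not part of WordPress.

```php
<?php
// Hypothetical helper: true when the buffer begins with an HTML doctype,
// ignoring leading whitespace and letter case.
function looks_like_html_document(string $html): bool {
  return stripos(ltrim($html), '<!doctype html') === 0;
}

var_dump(looks_like_html_document("\n<!DocType html><html></html>"));       // bool(true)
var_dump(looks_like_html_document("User-agent: *\nDisallow: /wp-admin/")); // bool(false)
```

It could replace the two str_starts_with() calls in html_tidy_cb(), at the cost of copying the buffer once in ltrim().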

After that comes the Tidy configuration. We set “drop-empty-elements” to false to keep empty elements like <i class="fa-settings"></i> and so on. Otherwise Tidy will remove them. Setting “indent” to 2 selects Tidy’s “auto” mode, which means that Tidy will decide automatically whether or not it should indent the content of elements such as TITLE, H1…H6, LI, TD or P. Setting “wrap” to 0 means that Tidy should not wrap long lines at all. Set this to 80 if you’d like to wrap lines using a right margin of 80 characters.
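
To get a feel for these options outside WordPress, the following sketch runs the same configuration against a tiny document. It assumes the tidy extension is loaded, and the sample markup is invented for the example.

```php
<?php
// Invented sample document with an empty icon element and a short list.
$html = '<!DOCTYPE html><html><head><title>Demo</title></head>'
      . '<body><i class="fa-settings"></i><ul><li>One</li><li>Two</li></ul></body></html>';

$config = [
  'drop-empty-elements' => false, // keep the empty <i> icon element
  'indent'              => 2,     // 2 = "auto" indentation
  'wrap'                => 0,     // 0 = never wrap long lines
];

$tidy = new tidy();
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// A tidy object converts to the repaired markup when used as a string.
echo $tidy;
```

Changing 'wrap' to 80 and rerunning the script is a quick way to compare the two line-wrapping behaviours.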

The rest of the code should be pretty self-explanatory. We create a Tidy instance, parse the buffered HTML content, apply cleaning and repair operations to it, and then return the beautified markup.

Use only with page caching

One last thing, and this is important! Please don’t use Tidy in a production environment without full-page caching. Tidy works best with static content that you are serving from a CDN or a similar cache. You really don’t want to run Tidy on every request, because that would eat up your resources quickly.