Clean up your Web pages with HP's HTML Tidy

Dave Raggett
Hewlett Packard Laboratories,
Filton Road, Stoke Gifford, Bristol BS12 6QZ, U.K.

dsr@w3.org
Keywords
HTML; Validation; Error correction; Pretty-printing

1. Introduction to Tidy

When editing HTML it is easy to make mistakes. Wouldn't it be nice if there were a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely laid out markup? Well now there is, thanks to Dave Raggett of HP Labs. HTML Tidy is a free utility for doing just that. It also works well on the atrociously hard-to-read markup generated by specialized HTML editors and conversion tools, and can help you identify where you need to pay further attention to making your pages more accessible to people with disabilities.

Tidy is able to fix up a wide range of problems and to bring to your attention things that you need to work on yourself. Each item found is listed with its line number and column so that you can see where the problem lies in your markup. Tidy will not generate a cleaned up version when there are problems that it is not sure how to handle. These are logged as "errors" rather than "warnings".
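As an illustration of how a position in the markup can be reported as a line number and column, here is a minimal sketch in C. The function name locate and the choice of 1-based counting are assumptions made for this example; they are not taken from Tidy's source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from Tidy's source): compute the 1-based line
   and column of the character at the given offset in text. Each newline
   starts a new line and resets the column to 1. */
void locate(const char *text, size_t offset, int *line, int *column)
{
    size_t i;

    assert(text != NULL && line != NULL && column != NULL);
    *line = 1;
    *column = 1;
    for (i = 0; i < offset && text[i] != '\0'; ++i) {
        if (text[i] == '\n') {
            ++*line;
            *column = 1;
        } else {
            ++*column;
        }
    }
}
```

A parser that reads its input one character at a time could keep these two counters up to date as it goes instead of rescanning the text for every message.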

1.1. Examples of Tidy at work

Here are just a few examples of how Tidy perfects your HTML for you:

1.2. Layout style

You can choose which style you want Tidy to use when it generates the cleaned up markup: for instance whether you like elements to indent their contents or not.

1.3. Internationalization issues

Tidy uses UTF-8 internally to represent character values. The full set of HTML 4.0 entities is defined. Cleaned up output uses HTML entity names for characters when appropriate. Otherwise characters outside the normal ASCII range are output as numeric character entities. Support for a range of character encodings is under development and offers of help are welcomed.
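The numeric fallback described above can be sketched as follows. This is an illustrative example rather than Tidy's actual code: the function names utf8_decode and escape_non_ascii are hypothetical, the input is assumed to be well-formed UTF-8, and the lookup of named entities is left out entirely.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence starting at s, store the code point in *cp,
   and return the number of bytes consumed. Assumes well-formed input. */
static int utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((unsigned long)(s[0] & 0x07) << 18) |
          ((unsigned long)(s[1] & 0x3F) << 12) |
          ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Copy UTF-8 text from in to out, replacing every character outside the
   ASCII range with a decimal numeric reference such as &#233;.
   The caller must supply a buffer large enough for the result. */
void escape_non_ascii(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    assert(out != NULL && outsize > 0);
    while (*s != '\0') {
        unsigned long cp;

        s += utf8_decode(s, &cp);
        assert(used + 11 <= outsize);   /* longest reference plus NUL */
        if (cp < 0x80)
            out[used++] = (char)cp;
        else
            used += (size_t)sprintf(out + used, "&#%lu;", cp);
    }
    out[used] = '\0';
}
```

For example, the UTF-8 bytes for "café" (the final character encoded as 0xC3 0xA9) come out as "caf&#233;". A fuller version would first check whether the character has an HTML 4.0 entity name and would also escape markup characters such as "&" and "<".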

1.4. Accessibility

Tidy offers advice on accessibility problems for people using non-graphical browsers. The most common thing you will see is the suggestion that you add a summary attribute to table elements. The idea is to provide a summary of the table's role and structure suitable for use with aural browsers.

1.5. Getting rid of those FONT tags

If you want to switch to style sheets, you will not want FONT, NOBR and CENTER elements in your markup. Tidy will obligingly remove them if you ask.

1.6. Future releases

Future releases may address:

1.7. Support for XML

XML processors compliant with W3C's XML 1.0 recommendation are very picky about which files they will accept. Tidy can help you to fix errors that cause your XML files to be rejected.

1.8. Indenting text for a better layout

This is the indented style:

 <html>
   <head>
   </head>
   <body>
     <p>
       para which has enough text to cause a line break, and so test
       the wrapping mechanism for long lines.
     </p>
 <pre>This is
 <em>genuine
       preformatted</em>
    text
 </pre>
     <ul>
       <li>
         1st list item 
       </li>
       <li>
         2nd list item
       </li>
     </ul>
     <!-- end comment -->
   </body>
 </html>

and this is the default style:

 <html>
 <head>
 </head>
 <body>
 <p>para which has enough text to cause a line break, and so test
 the wrapping mechanism for long lines.</p>
 
 <pre>This is
 <em>genuine
       preformatted</em>
    text
 </pre>
 
 <ul>
 <li>1st list item </li>
 
 <li>2nd list item</li>
 </ul>
 
 <!-- end comment -->
 </body>
 </html>
 

1.9. Implementation details

The code is written in ANSI C and uses the C standard library for I/O. The parser is thread-safe, although the code for pretty-printing the parse tree is not (yet). The parser works top-down, building a complete parse tree in memory. Document text is held in an expanding character array. The code has so far been tested on Windows 95, Windows NT, Linux, SunOS, Solaris and HP-UX.
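To make the idea of an expanding character array concrete, here is a minimal sketch of a growable text buffer in C. The type TextBuffer, the function names, the initial size of 256 bytes and the doubling policy are all assumptions chosen for illustration; Tidy's own data structures may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable text buffer (names are illustrative only). */
typedef struct {
    char  *data;      /* heap storage, NUL-terminated once non-empty */
    size_t length;    /* characters stored, excluding the NUL */
    size_t capacity;  /* bytes currently allocated */
} TextBuffer;

void buffer_init(TextBuffer *buf)
{
    buf->data = NULL;
    buf->length = 0;
    buf->capacity = 0;
}

/* Append one character, doubling the allocation when it is full.
   Returns 1 on success and 0 if memory could not be obtained. */
int buffer_add_char(TextBuffer *buf, char c)
{
    assert(buf != NULL);
    if (buf->length + 2 > buf->capacity) {   /* room for c and the NUL */
        size_t newcap = buf->capacity ? buf->capacity * 2 : 256;
        char *p = (char *)realloc(buf->data, newcap);

        if (p == NULL)
            return 0;
        buf->data = p;
        buf->capacity = newcap;
    }
    buf->data[buf->length++] = c;
    buf->data[buf->length] = '\0';
    return 1;
}

/* Append a NUL-terminated string one character at a time. */
int buffer_add_string(TextBuffer *buf, const char *s)
{
    size_t i, n = strlen(s);

    for (i = 0; i < n; ++i)
        if (!buffer_add_char(buf, s[i]))
            return 0;
    return 1;
}

void buffer_free(TextBuffer *buf)
{
    free(buf->data);
    buffer_init(buf);
}
```

Doubling the capacity keeps the number of reallocations logarithmic in the document size. Because realloc may move the storage, parse tree nodes built on such a buffer would be safer holding start and end offsets into it than raw pointers.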

You can read more about Tidy and download the source code and binaries for common platforms from: http://www.w3.org/People/Raggett/tidy

Dave Raggett (dsr@w3.org) is an engineer at Hewlett Packard's UK Laboratories and works on assignment to the World Wide Web Consortium, where he is the W3C lead for HTML.