2009-09-01

The quest for syntax highlighting

Posting code with syntax highlighting is a pain. All my code editors support copying highlighted source with proper formatting as rich text, but unfortunately the blog can only accept HTML.

Microsoft Word can serve as an RTF to HTML converter, if you're desperate, but the HTML produced by Word (pick a version, any version) is just awful. Rather than proper HTML, Word insists on saving a special HTML dialect that can be used for round-tripping the document in Office. Because being able to edit your HTML in Word is your main priority, right?

There are ways to sanitize Word's HTML output, but I thought I'd cut out the middleman by using an RTF to HTML converter directly. Actually, what I really want is a tool that can produce HTML from RTF copied to the clipboard, rather than something that requires me to save a file, run a converter, and then copy its output, which is rather roundabout.

Writing an application that monitors the clipboard and does something when the contents change is trivial, so I focused on a programming library that could handle the RTF to HTML conversion for me. It had to be something free, because I'm not paying just to be able to blog. :-) I did what everybody would do and entered "rtf html" in Google.

My first hit was DocFrac. This project's been abandoned for two years now, it seems, but no matter. There's a standalone converter but I was interested in the library. It's offered as both a regular DLL and an ActiveX component, and since I primarily do my work in C# I naturally went for the ActiveX component.

DocFrac unfortunately failed its very first test by incorrectly converting a piece of text with { } braces in it (they were stripped). Since those are rather essential to most code I write, this is an instant disqualification. On top of that, any conversion seemed to take at least 3 seconds, for no apparent reason. Exit DocFrac.

Next hit was a piece of C++ code on CodeGuru by one Daniel Beutler. This code is 6 years old; after the requisite project conversion and fixing the issue with nonstandard scope in for-loops it compiled and ran just fine. The first file I tested it on triggered an assertion: the converter was of the opinion the file was invalid because it contained more { braces than } braces. Unfortunately it failed to take into account that those braces were part of the text, not the RTF markup, so the file was in fact perfectly valid. After experimentally fixing this issue the resulting output was even worse than DocFrac's: everything after the first opening brace { had been stripped away. Wow.

I took a look at the RTF specification. While it's very hard to implement completely because of the abundance of fine details (RTF is rich enough to store a Word document with all its formatting, including all quirks for backwards compatibility) I would at least expect authors of RTF tools to get something as simple as the escape mechanism right! Apparently that's harder than it looks, too.
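For what it's worth, the part of the escape mechanism that tripped up both converters fits in a few lines. In RTF, a literal brace or backslash in the text is written as \{, \} or \\, and only unescaped braces open and close groups. The sketch below is my own illustration, not code from any of the tools mentioned; the names RtfBraceSketch and GroupDepthAfter are made up, and it assumes the input contains no raw binary (\bin) data.

```csharp
using System;

static class RtfBraceSketch
{
    // Returns the group nesting depth at the end of the input, counting only
    // structural braces. Escaped \{ \} and \\ are literal text and are skipped.
    // Assumption: no raw \bin data, which could contain unescaped brace bytes.
    public static int GroupDepthAfter(string rtf)
    {
        int depth = 0;
        for (int i = 0; i < rtf.Length; i++)
        {
            char c = rtf[i];
            if (c == '\\' && i + 1 < rtf.Length)
            {
                char next = rtf[i + 1];
                if (next == '{' || next == '}' || next == '\\')
                {
                    i++; // escaped literal: skip the character after the backslash too
                }
                continue; // otherwise a control word follows; its letters are harmless here
            }
            if (c == '{') depth++;
            else if (c == '}') depth--;
        }
        return depth;
    }
}
```

A naive count sees two { and one } in `{\rtf1 void f() \{ }` and cries foul, while GroupDepthAfter returns 0 for it, which is what a valid document should give.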

Next up: a free tool for converting RTF to HTML in much the way I want it (it still requires an input file, copy-paste doesn't work), no libraries required. This one uses "IE internal conversion functions". Whatever those are, they're not up to the task. The output did contain all the text this time, but no indentation, and the wrong font was used.

The rest of the hits were for various payware components, most of them very expensive. I have no doubt this accurately reflects the price for developing software that can process the full range of RTF documents, but it's out of my league. I searched SourceForge just in case, but everything RTF-related there was either inapplicable or woefully incomplete and/or abandoned.

And then, after slightly modifying my original search, I found Mike Stall's blog entry on exactly this problem. Running into the same issues, he whipped up some code himself. His code doesn't actually work for Visual Studio 2008, because it assumes a particular color table (VS 2008 will generate RTF with varying color tables). But from there you can at least find CopySourceAsHtml, a plugin for Visual Studio that does exactly what the name implies, and which quite recently has been resurrected for Visual Studio 2008.
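The color table issue could in principle be sidestepped by reading the table from the RTF itself instead of assuming one. The following is a rough sketch of that idea, not Mike Stall's code and not how CopySourceAsHtml works; RtfColorTableSketch and ReadColorTable are hypothetical names. It assumes the usual {\colortbl;\redN\greenN\blueN;...} layout, where each entry ends with a semicolon and an empty entry stands for the default color.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RtfColorTableSketch
{
    // Returns the color table as CSS hex strings, indexed the way \cfN refers to
    // them. An empty entry (the default "auto" color) is returned as null.
    public static List<string> ReadColorTable(string rtf)
    {
        var colors = new List<string>();

        // Assumption: the color table group contains no nested groups.
        Match table = Regex.Match(rtf, @"\{\\colortbl([^}]*)\}");
        if (!table.Success) return colors;

        string[] entries = table.Groups[1].Value.Split(';');

        // Every entry is terminated by ';', so the last piece after the final
        // ';' is not an entry and is skipped.
        for (int i = 0; i < entries.Length - 1; i++)
        {
            Match rgb = Regex.Match(entries[i], @"\\red(\d+)\s*\\green(\d+)\s*\\blue(\d+)");
            if (rgb.Success)
            {
                colors.Add(string.Format("#{0:X2}{1:X2}{2:X2}",
                    int.Parse(rgb.Groups[1].Value),
                    int.Parse(rgb.Groups[2].Value),
                    int.Parse(rgb.Groups[3].Value)));
            }
            else
            {
                colors.Add(null); // default color
            }
        }
        return colors;
    }
}
```

With a table like this, a converter would look up the N in each \cfN control word to pick the color for the text that follows, whatever order the editor happened to write the entries in.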

This solves the problem for Visual Studio, at least, and while its SQL highlighting isn't as nice as what SQL Server Management Studio produces, I can certainly live with it. The obvious next step would be to write my own cruddy RTF to HTML converter, but this doesn't seem worth the effort.

So, in summary:

    using System;

    class Program {
        static void Main(string[] args) {
            Console.WriteLine("Hello, world!");
        }
    }
