So, for the past little while on and off, I’ve been looking for and playing around with HTML to Wiki Converters to see which one works best. Most of the ones I’ve found are online and most of them seem to be based on a Perl script created by David Iberri, who provides a web interface as well.
David Iberri has provided a running web interface version for his script for a lot of different wiki dialects. However, I’ve only tested the MediaWiki version for the purposes of my project. I really like the “Fetch from URL” feature which is not available on many others.
Interestingly, I found what looks to be the exact same converter on another site, but it gives me slightly different results. (see below)
Seapine’s HTML to Wiki
The one is really good for basic things and even though it does not have a “Fetch from URL” feature, you can easily copy/paste. However, this converter frequently broke for me when dealing with whole pages because it seemed to stop working when it faced something that it didn’t quite recognize.
Batch/Site HTML to MediaWiki converter
I have not actually tried this one, but I thought it might be a useful resource for later and for other people. This uses the same Perl script in combination with MediaWiki’s PHP importing scripts.
Comparison between HTML2Wiki and the berliOS version
Neither deals with ’ (the non-straight apostrophe) very well for some reason, and I’m guessing it will have problems with some other characters as well. Currently, both give a � in place. However, if it’s always the same character in your wiki document, it’s easy enough to do a replace all.
Both seem to handle tables quite well and one as well as the other, though sometimes the Iberri one seems to forget to put the first line of the table code on a new line, which of course, means the table fails to work.
I would say that overall I like the berliOS version better for links because it can recognize anchor links, whereas the Iberri one will display text. For example (berliOS):
[#reserve Finding Articles on Course Reserve].
The Iberri one does a better job at “oh my god i don’t understand this” by simply stripping the HTML and leaving text. The berliOS one will try to interpret it and end up with odd things at times. However, I think it’s pretty understandable that it doesn’t handle mouse over boxes very well especially when the original script to do that is CSS and not a part of the HTML tag. For example (berliOS):
You CAN find hundreds of thousands of articles through the UBC Library Web. more »
UBC Library subscribes to tens of thousands of magazines, journals and newspapers, in print and in full text online. The UBC Library Catalogue DOES NOT list individual articles by topic. more »
To search for articles by topic, you need to start your search in an index or database. (Instructions follow.) Like the catalogues of most libraries in the world, UBC Library�s catalogue does not contain a listing for each article in each journal in its collection. Search engines like Google DO NOT retrieve most academic articles. But… more »
”’Google Scholar (Beta)”’ has begun to reach some academic journals and online archives, but for now, Indexes and Databases are the most complete searchable lists of articles.
Most academic and publicly-funded researchers publish the results of their research in scholarly journals or in online archives, which search engines don�t reach. Most popular magazines do not provide their content for free on the Web. Newspaper articles have a different search guide (right here).
So overall, I like the berliOS one better because it recognizes more elements, but it’s easier to screw things up with it. So I would say the Iberri one is easier to use since it generally just strips what it doesn’t understand.
On a related footnote-sort note, after converting to wiki code, if there is a lot of HTML code left that seems to be messing up the wiki page, you can try stripping or ‘tidying’ the HTML code. HTML Tidy tries to make the HTML conform to current HTML standards, but depending on how the page is done, it might start creating CSS which obviously wiki pages don’t understand, so the strip HTML function may work better.
Zubrag’s Strip HTML online tool