I couldn’t help but post this one… because it’s so very true
So, some people may ask, why are you trying to convert PDF to Wiki? PDF is usually the last step in the process, so just use the original document. My response would naturally be, what if you don’t have the original document?
A Two-Step Process
Through my searching and reading on the topic, it seems there is no PDF2Wiki converter. Every site that I have read explains converting the PDF to one of DOC, RTF, HTML, or XML first, and then to wiki format.
I tried a number of PDF to HTML programs, but none of them worked to my satisfaction. Most of them only converted simple formatting, such as bold and italics. Adobe has an online conversion tool. It’s better than some of the others I’ve tried as it interprets lists and such. However, the resulting code is rather ugly and a lot of it would need to be stripped before using an HTML to Wiki converter. See my previous post on HTML2Wiki for a couple of tools on tidying or stripping HTML code.
I found that a much better alternative was converting the PDF to a DOC/RTF file. It’s a lot simpler: some formatting might be lost, but you won’t have a lot of needless code that might mess up your wiki page. There are a lot of online tools that provide a PDF to DOC/RTF service, but again, they tend to handle only basic formatting. Adobe Acrobat does a really good job, because it will change lists into formatted lists (instead of normal text). The major downside, of course, is that Acrobat is a paid program, though there is a 30-day trial.
I had a lot of problems in particular with PDF to HTML, so I think PDF to DOC/RTF is simpler. Honestly though, unless you have a PDF file which is really long and has a lot of simple formatting (bold, italics, etc.), if you cannot get your hands on Acrobat, then I suggest simply copying and pasting (or alternatively saving as a text file) and manually formatting it in the wiki’s editing box. Of course, this depends on the wiki you’re using, because ones that don’t have a toolbar to help you quickly format might be a bit of a pain. Someone please let me know if you have found a better method!
So to continue on ways to convert existing documents to wiki code, next up are formatted text documents, which are typically Word DOC files, but may also be something like RTF files.
Most sites I found actually just instructed people to use a two-step conversion: from Word to HTML and then to wiki code. While this may work, it’s much less efficient and I can imagine more things are lost in the process. Admittedly, the converters that I have found are all geared towards MediaWiki, so if you’re using a different wiki then these converters may not work so well. Nevertheless, MediaWiki provides a list of Word to Wiki converters, the most basic of which does not seem to be specifically geared to MediaWiki.
OpenOffice Sun Wiki Publisher Plugin (Mac and Windows compatible, not sure about other platforms)
(the wiki converter is built-in, the publishing part of it is optional)
The downside of OpenOffice is that it does not always interpret Word documents very well. Embedded images tend to turn into hex code (e.g. ffd8ffe000104a46494600010201 etc.) and tables aren’t always interpreted correctly either. The one I tried turned into overlapping text. So, in part, the usefulness of the outputted wiki code will depend on how well OpenOffice has read the Word DOC itself, but it should handle ODT and RTF just fine.
Word2MediaWikiPlus Macro (Windows Only)
Word is the better choice for documents that OpenOffice can’t seem to handle very well. There is also a Word2MediaWiki Macro which is easier to use, but does not convert tables or deal with images very well.
For the OpenOffice plugin, ‘special characters’ (used loosely here) sometimes turn into weird symbols or random special characters. As with the HTML converters from the last post, something like ’ (not straight apostrophe) gets changed into ‚Äô, or a bullet point (which isn’t recognized to be in a bulleted list) turns into ‚Ä¢.
The Word2MediaWikiPlus (W2MWP) converter is better at dealing with special characters. The macro will simply insert the character as is and at times put a nowiki tag around it, but regardless, it displays just fine.
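The garbled sequences from the OpenOffice plugin look like what you get when UTF-8 bytes are read as Mac Roman. That is only a guess based on the two examples above, not something the plugin documents. Under that assumption, a minimal Python sketch for the "replace all" clean-up could look like this (the list of characters to repair and the function names are made up for illustration):

```python
# Characters that often get mangled: curly quotes, bullet, dashes, ellipsis.
# This particular list is an assumption; extend it for your own documents.
COMMON_CHARS = "\u2018\u2019\u201c\u201d\u2022\u2013\u2014\u2026"


def build_mojibake_map(chars=COMMON_CHARS):
    """Map each garbled sequence back to the character it came from.

    Assumes the damage is UTF-8 bytes decoded as Mac Roman.
    """
    return {ch.encode("utf-8").decode("mac_roman"): ch for ch in chars}


def fix_mojibake(text, table=None):
    """Replace every known garbled sequence in text with the original character."""
    table = table or build_mojibake_map()
    for garbled, original in table.items():
        text = text.replace(garbled, original)
    return text


# The garbled apostrophe from the example above, written with escapes.
print(fix_mojibake("it\u201a\u00c4\u00f4s"))
```

Because it only replaces known sequences, text that was converted correctly is left alone.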
For some reason, the W2MWP plugin turns text boxes into a single cell table and then repeats the same text again as regular text (not inside a table). The OpenOffice plugin strips the text of formatting and leaves it as regular text in the wiki output.
When tables are interpreted correctly, I think the OpenOffice plugin does a better job overall. The W2MWP macro is better at keeping formatting, such as colours and border style (below right), but the OpenOffice one seems to interpret things inside a table better, such as type of lists (below left). (It’s supposed to be a bulleted list, not a numbered list.)
Needs Good Original Document Formatting
In both cases, the usefulness of the wiki code will depend on how well the original document was formatted. For example, in one of the documents I tested, a number of the numbered and bulleted lists were not formatted as such; instead, the numbers and bullets had just been typed in manually. Both plugins treated these as regular text with a ‘special’ character or number at the beginning.
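If a document has a lot of these hand-typed lists, the converted text can be patched up afterwards. Here is a rough Python sketch that turns lines starting with a typed bullet or number into MediaWiki list markup (`*` and `#`). The patterns are a heuristic I am assuming for illustration, so a sentence that happens to start with "2001. " would be caught too.

```python
import re

# Lines that start with a typed bullet character (bullet, middle dot, * or -).
BULLET = re.compile(r"^\s*[\u2022\u00b7*-]\s+(.*)$")
# Lines that start with a typed number such as "1." or "2)".
NUMBER = re.compile(r"^\s*\d+[.)]\s+(.*)$")


def mark_up_lists(text):
    """Rewrite hand-typed list lines as MediaWiki list items."""
    out = []
    for line in text.splitlines():
        match = BULLET.match(line)
        if match:
            out.append("* " + match.group(1))
            continue
        match = NUMBER.match(line)
        if match:
            out.append("# " + match.group(1))
            continue
        out.append(line)
    return "\n".join(out)


print(mark_up_lists("Intro\n\u2022 first\n2. second"))
```

It is worth reviewing the result by eye, since nested lists and wrapped lines are not handled.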
Whether the Word2Wiki or the OpenOffice plugin is better depends on your priorities. OpenOffice seems to interpret lists and text boxes better, and doing a replace all for characters that weren’t interpreted properly is a pretty quick step. W2MWP is better at keeping formatting and interpreting all characters. So, if you like the way your document looks and you want to keep it that way, use the W2MWP macro. The big downside of course is that it doesn’t work on Macs (which I’m using right now, yay for VMware). Nevertheless, my conclusion is that the DOC2Wiki converters are useful, but may not be the optimal solution depending on how much you’re willing to install and play around with. And if the document isn’t formatted like it should be, then manual wiki formatting might be the way to go.
So, for the past little while on and off, I’ve been looking for and playing around with HTML to Wiki Converters to see which one works best. Most of the ones I’ve found are online and most of them seem to be based on a Perl script created by David Iberri, who provides a web interface as well.
David Iberri has provided a running web interface version for his script for a lot of different wiki dialects. However, I’ve only tested the MediaWiki version for the purposes of my project. I really like the “Fetch from URL” feature which is not available on many others.
Interestingly, I found what looks to be the exact same converter on another site, but it gives me slightly different results. (see below)
Seapine’s HTML to Wiki
This one is really good for basic things and even though it does not have a “Fetch from URL” feature, you can easily copy/paste. However, this converter frequently broke for me when dealing with whole pages because it seemed to stop working when it faced something that it didn’t quite recognize.
Batch/Site HTML to MediaWiki converter
I have not actually tried this one, but I thought it might be a useful resource for later and for other people. This uses the same Perl script in combination with MediaWiki’s PHP importing scripts.
Comparison between HTML2Wiki and the berliOS version
Neither deals with ’ (the non-straight apostrophe) very well for some reason, and I’m guessing it will have problems with some other characters as well. Currently, both give a � in place. However, if it’s always the same character in your wiki document, it’s easy enough to do a replace all.
Both seem to handle tables quite well and one as well as the other, though sometimes the Iberri one seems to forget to put the first line of the table code on a new line, which of course, means the table fails to work.
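MediaWiki only recognizes a table when the opening `{|` starts its own line, so the missing line break can be patched after conversion. A small Python sketch, assuming the only problem is a `{|` glued to the end of the previous text (the function name is made up):

```python
import re


def fix_table_openers(wikitext):
    """Put every table opener "{|" at the start of its own line."""
    # Match "{|" only when some character other than a newline comes right
    # before it, i.e. when it is not already at the start of a line.
    return re.sub(r"(?<=[^\n])\{\|", "\n{|", wikitext)


broken = "Some intro text{| border=1\n|-\n| cell\n|}"
print(fix_table_openers(broken))
```

Tables that are already fine are left untouched, so it should be safe to run over a whole converted page.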
I would say that overall I like the berliOS version better for links because it can recognize anchor links, whereas the Iberri one will display text. For example (berliOS):
[#reserve Finding Articles on Course Reserve].
The Iberri one does a better job at “oh my god i don’t understand this” by simply stripping the HTML and leaving text. The berliOS one will try to interpret it and end up with odd things at times. However, I think it’s pretty understandable that it doesn’t handle mouse over boxes very well especially when the original script to do that is CSS and not a part of the HTML tag. For example (berliOS):
You CAN find hundreds of thousands of articles through the UBC Library Web. more »
UBC Library subscribes to tens of thousands of magazines, journals and newspapers, in print and in full text online. The UBC Library Catalogue DOES NOT list individual articles by topic. more »
To search for articles by topic, you need to start your search in an index or database. (Instructions follow.) Like the catalogues of most libraries in the world, UBC Library�s catalogue does not contain a listing for each article in each journal in its collection. Search engines like Google DO NOT retrieve most academic articles. But… more »
'''Google Scholar (Beta)''' has begun to reach some academic journals and online archives, but for now, Indexes and Databases are the most complete searchable lists of articles.
Most academic and publicly-funded researchers publish the results of their research in scholarly journals or in online archives, which search engines don�t reach. Most popular magazines do not provide their content for free on the Web. Newspaper articles have a different search guide (right here).
So overall, I like the berliOS one better because it recognizes more elements, but it’s easier to screw things up with it. So I would say the Iberri one is easier to use since it generally just strips what it doesn’t understand.
On a related, footnote-like note, after converting to wiki code, if there is a lot of HTML code left that seems to be messing up the wiki page, you can try stripping or ‘tidying’ the HTML code. HTML Tidy tries to make the HTML conform to current HTML standards, but depending on how the page is done, it might start creating CSS, which wiki pages obviously don’t understand, so the strip HTML function may work better.
Zubrag’s Strip HTML online tool
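For anyone who would rather not paste pages into an online tool, stripping tags can also be done locally. This is a minimal sketch using Python's standard `html.parser` module. It is not how Zubrag's tool works, just one way to get a similar result, and the class and function names are my own.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect the text of a page, dropping tags plus script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(markup):
    """Return markup with all HTML tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(strip_html("<p>Hello <b>world</b></p><style>p{color:red}</style>"))
```

Wiki markup such as `[[links]]` passes through unchanged, but anything written as a tag (including `<nowiki>`) is removed, so run it before adding those by hand.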
So recently, I’ve been working on a mini-usability design study by asking users to do a card sort. In the process, I found some interesting tidbits.
What’s a Card Sort?
For those who don’t know what a card sort is, you basically put ideas (i.e. possible links to pages) on index cards or sticky notes and ask people (usually in a group) to sort them into categories, either existing ones you provide or ones that they name afterwards.
Number of People to Test
Interestingly, I found that some articles suggested 25-30 people, but according to Nielsen’s correlation study, 15 is enough and after 20, it’s not worth the resources.
Card Sort Methodology
Open-sort vs. Closed-sort: We decided to use a closed sort (categories are pre-determined) since we had already created a proposed information architecture (i.e. navigation structure).
Group vs. Individual: I had originally planned to do individual sessions since that would be more flexible, but J. (a coworker) has read studies about how these sorts of exercises work better in a group. I have also read in various articles that group card sorting is the preferred method, so that made sense.
Silent vs. not: J. also suggested a silent card sort, which really did affect the group dynamic. I could see that even when silent there were people who were more assertive than others and that during the discussion that followed, those people were definitely more opinionated as well. So, I’m glad we did it as a silent sort.
Scheduling was definitely much more time consuming than I had thought it would be. And trying to find faculty was the most difficult. Perhaps due to the incentive that we provided ($10 for 30 mins), we had plenty of student volunteers, especially grads (probably because they were around whereas undergrads were less likely to be as it’s between the two summer terms). For faculty, our hardest-to-get group, personal e-mails were definitely necessary! (and from someone they know).
Getting people to think in the right mind frame was also an interesting task. A number of people who participated kept thinking about the design. Although it brings up interesting points which are helpful while we design a new site, some of it was irrelevant. Some kept thinking that it would be the home page, but no… it is not. They got the idea that what they saw was definitely going to be on the website, but that’s not true either. It got a bit frustrating at times, because I would basically say, “yes, good point, but let’s focus on the task at hand” (which was the card sort itself and naming the categories). Most of the time it worked, but with one or two people… somehow that didn’t. They were so focused on “this is what and how I would like to see the website to be”, so I had to repeat more than once that it’s not the home page, just a single page somewhere. I got around it by turning my open questions into closed questions, but man… argumentative people can definitely change the group dynamics and totally veer the discussion in a totally different direction. Okay… apologies… </rant> But I think it brings up the important point that having a good mediator/facilitator is very important. I honestly think that my coworker would have done a better job than I did, but ah well, you do what you can.
Backup plans are a must-have! What if something goes wrong? Terrible on my part, I know, but I did not really think about it before the actual sessions took place. What do you do if someone doesn’t show up? What if more people suddenly show up? Does it matter to your study? I decided that for our purposes, one person give or take in a group wasn’t a big deal, but it’s definitely something to think about next time. Making sure you have all the needed materials, plus back-up materials in case things break down, is another much needed consideration.
Another Online Resource
Finally, there were a lot of good online resources. In particular, Spencer & Warfel’s Guide is quite comprehensive.
Information Architecture, or IA, is just IT jargon for an organization or navigation system, typically a hierarchy. Actually, it depends who you ask. For a general idea on why this definition might be contentious, please google it, or refer to the Wikipedia article on IA. Anyhow, for the purposes of my post, please think of the IA as I have defined it, which is what I was told in my department.
I first gathered a lot of data (as the previous post probably indicates). What I basically did was do an inventory of existing content. I also looked at each page to see what topic it generally covered. In doing the inventory, I came up with three major data tables.
- How many pages covered the same topics.
- Which pages showed up in the inventory more than once.
- Which pages were visited the most (of what I included in the inventory).
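If the inventory lives in a spreadsheet, the three tables can be produced with a few lines of code. Here is a small Python sketch under some assumptions: each inventory row is a (url, topic, visits) tuple, a link found twice appears as two rows, and all of the sample rows and names below are invented for illustration.

```python
from collections import Counter

# Hypothetical inventory rows: (url, topic, visits). One row per link found.
inventory = [
    ("/help/cite", "citing", 120),
    ("/help/cite-apa", "citing", 80),
    ("/help/find-articles", "finding articles", 300),
    ("/help/cite", "citing", 120),  # the same page linked from a second place
]


def summarize(rows):
    """Build the three tables: pages per topic, repeated pages, visit ranking."""
    link_counts = Counter(url for url, _, _ in rows)
    # Keep one entry per page so repeats do not inflate the topic counts.
    unique = {url: (topic, visits) for url, topic, visits in rows}
    pages_per_topic = Counter(topic for topic, _ in unique.values())
    repeated = {url: n for url, n in link_counts.items() if n > 1}
    most_visited = sorted(unique, key=lambda u: unique[u][1], reverse=True)
    return pages_per_topic, repeated, most_visited


topics, repeated, ranked = summarize(inventory)
print(topics.most_common())
print(repeated)
print(ranked)
```

Counting topics on unique pages rather than on raw rows is a design choice here; counting raw rows would mix the first table with the second.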
Assumptions to Think About
Of course, using this kind of data relies on certain assumptions, which may or may not be true.
- The more pages on a topic that staff have created, the more important staff think this topic is.
- The more a page is linked to, the more staff find it a useful page.
- The more a page is visited, the more useful that page is to our users.
Although I don’t think these assumptions should be accepted without any scrutiny, I think they’re also okay to make, to a certain extent. These assumptions aren’t always true, but they can be used as indicators. Also, I would be hard pressed to use only one of these indicators, but in combination they can be used as a good base.
Creating the Base IA
- Summarize: Using the 3 tables, I did a sort of summary table by way of ranking and counting to come up with the most common topics.
- Group: I then grouped them in such a way that it made sense to me and named each group.
- Fill in the Holes: I filled in any “holes”, such as in the category Finding (Library Resources), Books and Journals were obviously in there, but Maps were not.
- Add what’s Missing: I think consulting with one or more experts (in this case, librarians) is a really important step in recognizing that there may be really important pages that people just can’t find or don’t know exist (or that maybe don’t exist yet).
Work in Progress
No wonder programs go through so many versions, I don’t know how many I’ve gone through just consulting with one other person! No doubt it’ll go through many more as users and other staff are consulted. As long as it doesn’t degrade into this: How a Web Design Goes to Hell
So, I’ve been doing an inventory of all the instructional “how-to” type pages (and slightly broader) on the UBC Library’s website and I came up with some rather interesting (in some cases, what I thought were staggering) statistics.
Of the 794 internal and external links:
- somewhat surprisingly, only 3% were 404/dead links
- 16% were duplicate links (meaning I had already inventoried the link at least once)
Of the 590 internal pages:
- 20% are in PDF format
- 4.6% are videos (mostly outdated)
- 3% are PDF versions of a webpage
- 20% (a whopping 106 pages) duplicate content of another page. For example, I found 12 different pages that talk about How to Cite something (in general, not different styles).
What I also found interesting was how out of date some of the pages were. The best example was a page that refers to “Information Navigator 2001”! (Disclaimer: I did an inventory based on following links from the Instructional pages, branch pages, and FAQ, so it does not include any delinked pages.)
It’s no secret that I’m part of a larger project to revamp the library website, and I think I just provided some pretty good hard data to justify it.