kasku.net/blog/2022/11/23

Document Formats, ar5iv, Transcribing PDF files

ar5iv is a pretty cool project. The idea is that since you generally submit (La)TeX source to arXiv rather than a pdf, they can convert that famously stable format into html, which I prefer.

(I was under the impression that arXiv didn't accept pdf uploads at all, but apparently they do in rare circumstances.)

One reason I'd rather not read a pdf is that I don't have a very high DPI monitor, so fonts in pdfs don't look very clear, since a pdf behaves basically like an image format. Text in a browser generally looks decent by comparison, though I'm not exactly a fan of the font choice on ar5iv. Being html also means you can reflow text to whatever window size you want, you don't have to read content broken up across pages, you can link to things more easily via anchors on section headings, and so on.

Gilles Castel described a workflow in which he comes up with some clever ways to overcome the limitations of pdfs. For example, "instantly adding clickable links to notes" isn't anything special if the link is to a webpage, but it is if the link is to a page in a pdf, and it takes a lot more effort, as described in the "technical details" section. It involves the phrase "defining a custom protocol", for example. Parts of it he calls "a bit of a hack", which is putting it lightly. Still, the end result is good, and I hope that ar5iv and similar projects will give people similar benefits without going to all that trouble.

However, it is still possible to scrape information from pdf files and reformat it as html. One example is this html version of the Intel x86 manual, though being in a better format doesn't outweigh how much better the AMD manuals are. Another is this script to convert a C standard draft pdf into html; you can read the C17 version hosted here.

What these pdf-converted sites have in common is that both carry notices saying they're full of errors and shouldn't be relied on, which is understandable, because I can't even begin to imagine how brittle this process is. You can run pdftotext (part of poppler) on any pdf you have and see where you'd be starting if you wanted to reformat it. This only works if the textual data is embedded in the file, which might not be the case for a scan that no one has run an ocr tool on. The output is about what you'd expect.
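If you'd rather script that first look than run the command by hand, something like the following rough sketch would do. It assumes pdftotext is installed and on your PATH; the function names and the "paper.pdf" filename are made up for illustration, and the only options used are -layout, -f, -l, and "-" as the output file (meaning stdout).

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # "-" as the output file sends the text to stdout
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return whatever text it produced."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with poppler")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_pages(text):
    """pdftotext ends each page with a form feed, so split on that."""
    pages = text.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty chunk after the final form feed
    return pages


def looks_like_unocred_scan(text):
    """No extractable text at all suggests a scan with no ocr layer."""
    return not text.strip()


if __name__ == "__main__":
    text = extract_text("paper.pdf", first_page=1, last_page=3)  # hypothetical file
    if looks_like_unocred_scan(text):
        print("no embedded text; this probably needs ocr first")
    for number, page in enumerate(split_pages(text), start=1):
        print(f"--- page {number}: {len(page.splitlines())} lines ---")
```

Splitting on form feeds gives you one chunk per page, which is a convenient unit if you're going to check the result against the original page by page.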

Honestly, it's good enough to make the process feasible, but you'd still need to go through every page manually and check for errors, so it's really just a much faster way to begin transcribing a pdf. I don't know how many hours it would take to fix the above documents to the point they became "production quality", but it seems like it would be within the realm of possibility.

Converting a document from one form into another does introduce errors, but on the other hand having more eyes on the original document could find errors in that too. I might try a few pages and see how I get on, but that probably wouldn't tell me much about error frequency or speed: the sample would be too small to extrapolate an error rate from, and I wouldn't spend enough time to reach my maximum speed.
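To put a rough number on the sample size problem, here is a small sketch using the Wilson score interval for a proportion. The setup is my own assumption for illustration: count a page as "bad" if it has at least one transcription error, treat pages as independent, and use a 95% interval (z = 1.96). None of the page counts come from real data.

```python
import math


def wilson_interval(bad_pages, pages_checked, z=1.96):
    """Wilson score interval for the fraction of pages containing an error."""
    if pages_checked <= 0:
        raise ValueError("need at least one checked page")
    p = bad_pages / pages_checked
    denom = 1 + z * z / pages_checked
    center = (p + z * z / (2 * pages_checked)) / denom
    half_width = (
        z
        * math.sqrt(p * (1 - p) / pages_checked + z * z / (4 * pages_checked ** 2))
        / denom
    )
    # Clamp to [0, 1] to absorb floating point noise at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts: (pages with an error, pages checked).
    for bad, checked in [(0, 5), (2, 5), (40, 100)]:
        low, high = wilson_interval(bad, checked)
        print(f"{bad}/{checked} bad pages: plausible rate {low:.2f} to {high:.2f}")
```

Even a clean run of five pages leaves the plausible error rate anywhere from zero up to about 43% under these assumptions, which is why a handful of pages wouldn't say much.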