Alex Schroeder 🐝 is a user on octodon.social.

What’s my best option to put about 100 pages of photocopied typewriter pages in German online with OCR? I think my very old CanoScan LiDE 25 scanner comes with some OCR software. Has there been a significant improvement since then that is freely available? How cool is the workflow with pictures taken from a phone and tesseract? Any other free options?

Scanner Mini is not good as far as I can tell.

I thought perhaps orientation is to blame but apparently not.

Current status: apt install yagf tesseract-ocr-deu. "YAGF is a graphical interface for cuneiform and tesseract text recognition tools on the Linux platform. With YAGF you can scan images via XSane, import pages from PDF documents, perform images preprocessing and recognize texts using cuneiform from a single command centre. YAGF also makes it easy to scan and recognize several images sequentially." Looking forward to cuneiform OCR!

Alex Schroeder 🐝 @kensanata

Current status: as soon as xsane is finished scanning the page, YAGF crashes. I think I'm going to use tesseract directly, from the command line. Or at least try this for one page and, if it works, find a workflow to scan all those 100 pages using my old scanner and then a Python script like the one suggested by @vickysteeves for all the nitty-gritty details. Better than improvised bash hacking!
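A minimal sketch of the "try tesseract on one page" step, driven from Python. It assumes the `tesseract` binary and the German language data (`tesseract-ocr-deu`, as installed above) are available; the file name `page-001.png` and the helper names are made up for illustration and do not come from the thread.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, lang: str = "deu") -> list[str]:
    """Build the argument list for one tesseract run.

    tesseract appends ".txt" to the output base it is given,
    so page-001.png ends up as page-001.txt next to the image.
    """
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(image: Path, lang: str = "deu") -> Path:
    """Run tesseract on one image and return the path of the text file."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(tesseract_command(image, lang), check=True)
    return image.with_suffix(".txt")


if __name__ == "__main__":
    # Hypothetical file name: replace with a real scan.
    print(ocr_page(Path("page-001.png")))
```

Keeping the command in its own function makes it easy to print and try by hand in a terminal before scripting all the pages.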


Current status: running a Perl script which loops through scanimage and calls tesseract on every image. Also my status: nine instances of tesseract running simultaneously and load > 25. 😂
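The load spike comes from starting one tesseract per image with nothing capping how many run at once. One possible way to bound that, sketched in Python rather than the Perl actually used here: a small thread pool that hands images to at most `max_workers` tesseract processes at a time. The function names and the default of two workers are assumptions for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_tesseract(image: Path) -> Path:
    """OCR one image with the German model; returns the resulting .txt path."""
    output_base = image.with_suffix("")
    subprocess.run(
        ["tesseract", str(image), str(output_base), "-l", "deu"],
        check=True,
    )
    return image.with_suffix(".txt")


def ocr_all(images, worker=run_tesseract, max_workers=2):
    """OCR every image, running at most max_workers jobs at the same time.

    Results come back in the same order as the input images.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, images))


if __name__ == "__main__":
    # Assumes the scans are TIFF files in the current directory.
    for text_file in ocr_all(sorted(Path(".").glob("*.tif"))):
        print(text_file)
```

Threads are enough here because each worker just waits on an external process. Tesseract may use several threads per process on its own, so a small `max_workers` is probably the safer starting point.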

@kensanata Nice of you to let us know how it goes. I have just gotten a few old books out of storage and I am toying with the idea of scanning them someday, but I have no idea of what to expect from current tools.

@nono I'm currently 30 pages in. I don't have a good scanner. It's a flatbed scanner that takes a few seconds to scan a page. So: whenever it is silent, open cover, take out paper, put in next page, close cover, switch to terminal, hit Enter, go back to what I was doing. As for the non-English tesseract output... Not sure!
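The "swap page, hit Enter" routine could look roughly like this in Python. This is only a sketch under assumptions: the `scans` directory, the `page-NNN.tif` naming, and the `--resolution 300` setting are illustrative (resolution is a backend-specific option, so whether this scanner accepts it is not confirmed), and the thread's actual script was written in Perl.

```python
import subprocess
from pathlib import Path


def page_path(directory: Path, number: int) -> Path:
    """Zero-padded file names keep the pages in order when sorted."""
    return directory / f"page-{number:03d}.tif"


def scan_page(target: Path) -> None:
    """Scan one page; scanimage writes the image data to standard output."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wb") as out:
        subprocess.run(
            ["scanimage", "--format=tiff", "--resolution", "300"],
            stdout=out,
            check=True,
        )


def scan_pages(directory, first=1, scan=scan_page, prompt=input):
    """Scan one page per Enter keypress; typing q stops the loop."""
    scanned = []
    number = first
    while prompt(f"Page {number}: Enter to scan, q to quit> ").strip().lower() != "q":
        target = page_path(Path(directory), number)
        scan(target)
        scanned.append(target)
        number += 1
    return scanned


if __name__ == "__main__":
    print(scan_pages("scans"))
```

The `first` parameter is there so that an interrupted session can resume at, say, page 31 without overwriting earlier scans.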

@nono the sample page I showed above, for example: "fuhr ich mit Gretl 32 einem Wagen, kein Auto,sondern eben einer Kalesse
nach Gaya ‚Veraniaßt dies natürlich Herr Willerth . Bei dieser Gelegen-
leit blieben wir gleich einen Nachmittag bei Elli.ä.h.bei Familie Grunt
und verbrachten wieder nette Stunden „dann gings wieäer nach Keltschan."
It's readable! But practically every line will need manual intervention.

I scanned all 155 pages! And I ran tesseract on them. All that remains is to clean it up... Aaaargh! These are the memories of my grandfather's sister from her birth in 1905 up to around 1950. There's a lot of calamity, poverty and wartime in there. And some good stuff, too. Fixing it up will at least make me read it carefully. 😅

@seanl Sure, I'll post it on my website. Like the story of my other grandfather. alexschroeder.ch/wiki/Roland_L

@kensanata Damn... seems like you've done most of the work already. I did a similar thing and used 'gimagereader-gtk' and 'xsane'. Worked almost flawlessly.

Editing scanned German photocopied typewriter pages with tesseract OCR: taking forever. Stalled on page 2. Ouch! This will require some grit.