Resources
Using Tesseract
- Tesseract documentation
- Installing
- Command line usage for Tesseract
- Command line usage for Tesseract (detailed)
Additional Tutorials
- How to Digitize Texts with Open-Source Command-Line Optical Character Recognition (OCR) Software
- Build Your Own Text-as-Data Corpus: A Print-to-Bytes Primer from WIDH@NYCDH Week 2021 by Nicholas Wolfe
- Images to Text: A Gentle Introduction to Optical Character Recognition with PyTesseract from the 2021 Text Analysis Pedagogy Institute by Hannah Jacobs
Alternative OCR tools
- PyTesseract (a Python wrapper for Tesseract)
- [Adobe Acrobat](https://www.adobe.com/acrobat/how-to/ocr-software-convert-pdf-to-text.html)
- [ABBYY FineReader](https://www.abbyy.com/) (well suited to work that needs high accuracy, such as preparing a book for publication)
Additional Reading
- Cordell, Ryan. 2017. “‘Q i-jtb the Raven’: Taking Dirty OCR Seriously.” Book History, 20, 188-225.
- Cordell, Ryan. 2019. “Why You (A Humanist) Should Care About Optical Character Recognition.”
- Hawk, Brandon W. “OCR and Medieval Manuscripts: Establishing a Baseline.”
- Smith, David, and Ryan Cordell. 2018. “A Research Agenda for Historical and Multilingual Optical Character Recognition.”