Text Extraction (OCR)
Optical Character Recognition (OCR) is the process of extracting text from images or PDFs.
Built-in support
Since v0.103.0, Trilium has built-in support for OCR. The extracted text can be:
- Integrated with Search, to quickly find the image or file based on snippets of text.
- Integrated with the AI feature, which allows the agent to access the content of a non-text note.
- Manually accessed for other purposes (e.g. copying into a note or sending it somewhere else).
Supported formats
OCR in Trilium supports the following formats:
Images
- Both standalone image notes and image attachments in text notes are supported.
- Supported formats: JPEG, PNG, GIF, BMP, TIFF, WebP.
- Currently, only single-page TIFFs are supported. If you have multi-page TIFFs, consider splitting them into individual images.
- Note that this feature works best for computer-rendered text rather than handwriting.
- The underlying technology is Tesseract.js.
PDFs
Currently, only text extraction is supported for PDFs, not OCR.
- This means that the PDF must contain actual text information (i.e. the text can be selected in a PDF viewer). Scanned documents are not yet supported.
- There are plans to apply the same OCR-based recognition used for images to PDFs, but this is not yet implemented.
Office documents
The text will be extracted from the following file formats:
- Microsoft Word documents
- Microsoft Excel documents (only the raw text information; the cell structure is not maintained)
- Microsoft PowerPoint documents
- The OpenDocument alternatives to the previous formats (Text, Spreadsheet, Presentation), created by editors such as LibreOffice and OpenOffice.
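The supported formats above can be summarized as a lookup from file type to extraction method. The TypeScript sketch below is only an illustration and not Trilium's actual code: `pickExtractionMethod` and the `ExtractionMethod` type are hypothetical names, and the Office entries assume the modern Office Open XML MIME types, since the list above does not say which file versions are covered.

```typescript
// Hypothetical classification of a file by MIME type.
// The names and the exact set of Office MIME types are assumptions.
type ExtractionMethod = "ocr" | "pdf-text" | "office-text" | "unsupported";

// Image formats listed as supported for OCR.
const IMAGE_TYPES = new Set<string>([
    "image/jpeg",
    "image/png",
    "image/gif",
    "image/bmp",
    "image/tiff", // single-page only
    "image/webp",
]);

// Word, Excel, PowerPoint and their OpenDocument counterparts.
const OFFICE_TYPES = new Set<string>([
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.presentation",
]);

function pickExtractionMethod(mime: string): ExtractionMethod {
    const normalized = mime.trim().toLowerCase();
    if (IMAGE_TYPES.has(normalized)) return "ocr";
    // PDFs only get their embedded text extracted, not OCR.
    if (normalized === "application/pdf") return "pdf-text";
    if (OFFICE_TYPES.has(normalized)) return "office-text";
    return "unsupported";
}

console.log(pickExtractionMethod("image/png"));       // prints: ocr
console.log(pickExtractionMethod("application/pdf")); // prints: pdf-text
console.log(pickExtractionMethod("video/mp4"));       // prints: unsupported
```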
Configuring and triggering OCR
The OCR can be configured by going to Options → Media and looking for the Text Extraction (OCR) section.
There are three ways to trigger the OCR:
- By enabling Auto-process new files, which processes only the notes or attachments created after the option was enabled. Existing files remain unprocessed.
- By pressing Start Batch Processing, which processes all the existing notes.
- By manually requesting text extraction for an image or file, regardless of whether automatic processing is enabled.
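The difference between the three triggers can be sketched as a small decision function. This is a hypothetical model in TypeScript rather than Trilium's implementation: `shouldProcess`, `FileItem` and `OcrOptions` are illustrative names, and tracking the moment the option was enabled as a timestamp is an assumption made only to express the "created after enabling" rule.

```typescript
// Hypothetical types; none of these names come from Trilium itself.
type Trigger = "auto" | "batch" | "manual";

interface FileItem {
    id: string;
    createdAt: Date;
}

interface OcrOptions {
    /** When "Auto-process new files" was enabled, or null if it is off. */
    autoProcessEnabledAt: Date | null;
}

function shouldProcess(item: FileItem, trigger: Trigger, options: OcrOptions): boolean {
    if (trigger === "auto") {
        // Automatic processing only covers items created after the option was enabled.
        const enabledAt = options.autoProcessEnabledAt;
        return enabledAt !== null && item.createdAt.getTime() >= enabledAt.getTime();
    }
    // Batch processing covers all existing items, and a manual request
    // always goes through, whether or not automatic processing is enabled.
    return true;
}

const options: OcrOptions = { autoProcessEnabledAt: new Date("2025-06-01T00:00:00Z") };
const oldScan: FileItem = { id: "old-scan", createdAt: new Date("2025-01-15T00:00:00Z") };
const newScan: FileItem = { id: "new-scan", createdAt: new Date("2025-07-01T00:00:00Z") };

console.log(shouldProcess(oldScan, "auto", options));  // prints: false
console.log(shouldProcess(newScan, "auto", options));  // prints: true
console.log(shouldProcess(oldScan, "batch", options)); // prints: true
```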
Minimum confidence
When text is extracted from an image, the result comes with a confidence level, which indicates how likely the extracted text is to be meaningful.
When the minimum confidence is set to a low percentage, symbols and drawings may be misinterpreted as text, resulting in garbled output.
If the confidence of the text extracted for a note or attachment is lower than the minimum confidence, the OCR result is disregarded.
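The thresholding rule can be sketched in TypeScript as follows. This is a minimal illustration and not Trilium's actual implementation: the `OcrResult` shape and the `applyMinimumConfidence` function are hypothetical names, and confidence is assumed to be a percentage from 0 to 100, matching how the option is described.

```typescript
// Hypothetical shape of an OCR result; the field names are assumptions.
interface OcrResult {
    text: string;
    /** Overall confidence as a percentage from 0 to 100. */
    confidence: number;
}

/**
 * Returns the extracted text if it meets the minimum confidence,
 * or null if the result should be disregarded.
 */
function applyMinimumConfidence(result: OcrResult, minConfidence: number): string | null {
    // Only results strictly below the threshold are dropped.
    if (result.confidence < minConfidence) {
        return null;
    }
    return result.text;
}

const clean: OcrResult = { text: "Invoice 1042", confidence: 91 };
const noisy: OcrResult = { text: "l|/ ~ 0o", confidence: 23 };

console.log(applyMinimumConfidence(clean, 60)); // prints: Invoice 1042
console.log(applyMinimumConfidence(noisy, 60)); // prints: null
```

With a threshold of 60, the garbled result from a drawing is dropped. Lowering the threshold to 20 would let it through, which is the trade-off described above.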
Language management
OCR needs to know the language of the content in order to work correctly. Each language has its own data, which needs to be downloaded, and accents or other language-specific symbols are not recognized when only the default language is used.
To configure the languages that are supported by the OCR, simply go to Options → Language & Region and adjust the Content languages.
When there are no content languages defined, the user interface Language is used instead.
After making this change, the automatic processing or manual reprocessing will take into consideration the new languages.
To enforce the detection in a particular language for a given note, use the language attribute, similar to text content language. For Attachments, it's not possible to manually adjust the language.
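The order in which these language sources apply can be sketched as follows. This TypeScript snippet is a hypothetical model, not Trilium's code: `resolveOcrLanguages` and `LanguageSettings` are illustrative names, the language codes are placeholders, and it assumes that a note-level language attribute replaces the global list rather than adding to it.

```typescript
// Hypothetical inputs; the names are illustrative and not Trilium's API.
interface LanguageSettings {
    /** Content languages from Options → Language & Region (may be empty). */
    contentLanguages: string[];
    /** The user interface language, used as a fallback. */
    uiLanguage: string;
}

function resolveOcrLanguages(settings: LanguageSettings, noteLanguage?: string): string[] {
    // A language attribute on the note enforces detection in that language.
    // Attachments cannot carry one, so they always take the paths below.
    if (noteLanguage) {
        return [noteLanguage];
    }
    // Otherwise use the configured content languages, if any are defined.
    if (settings.contentLanguages.length > 0) {
        return [...settings.contentLanguages];
    }
    // With no content languages, fall back to the user interface language.
    return [settings.uiLanguage];
}

const configured: LanguageSettings = { contentLanguages: ["en", "ro"], uiLanguage: "en" };
const unconfigured: LanguageSettings = { contentLanguages: [], uiLanguage: "de" };

console.log(resolveOcrLanguages(configured));       // prints: [ 'en', 'ro' ]
console.log(resolveOcrLanguages(unconfigured));     // prints: [ 'de' ]
console.log(resolveOcrLanguages(configured, "fr")); // prints: [ 'fr' ]
```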
Viewing extracted content for a single note
To access the extracted content of a note:
- For File notes, go to the Note buttons → Advanced → View OCR Text.
- For Attachments (e.g. Images in Text notes), double-click the attachment to view the details, press the […] button at the left and select View extracted text (OCR).
This section allows:
- Viewing the extracted text, either to copy it elsewhere or to check the quality of the extraction.
- Pressing Process OCR to process the note in the background, if its text has not been extracted yet. If the extraction confidence is lower than the minimum confidence, a notification is shown.
- Pressing Process OCR again to re-extract the text, for example after the minimum confidence was changed in the settings.