A wrapper to work with Tesseract OCR inside PHP.
Via Composer:
$ composer require thiagoalessio/tesseract_ocr
There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.
choco install capture2text --version 3.9
Recent versions of Capture2Text stopped shipping the tesseract binary.
With MacPorts you can install support for individual languages, like so:
$ sudo port install tesseract-<langcode>
But that is not possible with Homebrew. It comes with English support only by default, so if you intend to use it for other languages, the quickest solution is to install them all:
$ brew install tesseract tesseract-lang
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('text.png'))
->run();
The quick brown fox
jumps over
the lazy dog.
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('german.png'))
->lang('deu')
->run();
Bülowstraße
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('mixed-languages.png'))
->lang('eng', 'jpn', 'spa')
->run();
I eat すし y Pollo
use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('8055.png'))
->allowlist(range('A', 'Z'))
->run();
BOSS
Yes, I know some of you might want to use this library for the noble purpose of breaking CAPTCHAs, so please take a look at this comment:
Executes a tesseract command, optionally receiving an integer as timeout, in case you experience stalled tesseract processes.
$ocr = new TesseractOCR();
$ocr->run();
$ocr = new TesseractOCR();
$timeout = 500;
$ocr->run($timeout);
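A caller will usually want to decide what happens when a run stalls or fails. The following is a minimal sketch, assuming Composer's autoloader is already loaded and that a failed or timed-out run surfaces as an exception or error; ocrOrNull is a hypothetical helper name, not part of the library.

```php
<?php
use thiagoalessio\TesseractOCR\TesseractOCR;

// Hypothetical helper: returns the recognized text, or null if the run fails.
function ocrOrNull(string $path, int $timeout): ?string
{
    try {
        return (new TesseractOCR($path))->run($timeout);
    } catch (\Throwable $e) {
        // Assumption: a failure or timeout is reported by throwing.
        error_log('OCR failed: ' . $e->getMessage());
        return null;
    }
}
```

Catching \Throwable is deliberately broad here, because the exact exception classes are not listed in this document.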
Define the path of an image to be recognized by tesseract.
$ocr = new TesseractOCR();
$ocr->image('/path/to/image.png');
$ocr->run();
Set the image to be recognized by tesseract from a string, along with its size in bytes.
This can be useful when dealing with files that are already loaded in memory.
You can easily retrieve the image data and size from an image object:
//Using Imagick
$data = $img->getImageBlob();
$size = $img->getImageLength();
//Using GD
ob_start();
// Note that you can use any format supported by tesseract
imagepng($img, null, 0);
$size = ob_get_length();
$data = ob_get_clean();
$ocr = new TesseractOCR();
$ocr->imageData($data, $size);
$ocr->run();
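The same approach works for raw bytes that did not come from Imagick or GD, for example a file read from disk or fetched over the network. Below is a minimal sketch, assuming Composer's autoloader is already loaded; loadImageBytes and ocrFromMemory are hypothetical helper names.

```php
<?php
use thiagoalessio\TesseractOCR\TesseractOCR;

// Hypothetical helper: read an image into memory and return [bytes, size].
function loadImageBytes(string $path): array
{
    $data = file_get_contents($path);
    if ($data === false) {
        throw new RuntimeException("Cannot read image: $path");
    }
    return [$data, strlen($data)]; // strlen() counts bytes, not characters
}

// Pass the in-memory bytes to the wrapper instead of a file path.
function ocrFromMemory(string $path): string
{
    [$data, $size] = loadImageBytes($path);
    $ocr = new TesseractOCR();
    $ocr->imageData($data, $size);
    return $ocr->run();
}
```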
Define a custom location of the tesseract executable, if for any reason it is not present in the $PATH.
echo (new TesseractOCR('img.png'))
->executable('/path/to/tesseract')
->run();
Returns the current version of tesseract.
echo (new TesseractOCR())->version();
Returns a list of available languages/scripts.
foreach((new TesseractOCR())->availableLanguages() as $lang) echo $lang;
More info: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages-and-scripts
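One practical use of this list is to fail early when a required language pack is not installed. The sketch below assumes Composer's autoloader is already loaded and that availableLanguages() returns an array of language codes; missingLanguages and ocrWithLanguages are hypothetical helper names.

```php
<?php
use thiagoalessio\TesseractOCR\TesseractOCR;

// Hypothetical helper: which of the wanted language codes are not installed?
function missingLanguages(array $wanted, array $available): array
{
    return array_values(array_diff($wanted, $available));
}

// Refuse to run when any requested language is unavailable.
function ocrWithLanguages(string $path, string ...$langs): string
{
    $available = (new TesseractOCR())->availableLanguages();
    $missing = missingLanguages($langs, $available);
    if ($missing !== []) {
        throw new RuntimeException('Missing tessdata for: ' . implode(', ', $missing));
    }
    return (new TesseractOCR($path))->lang(...$langs)->run();
}
```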
Specify a custom location for the tessdata directory.
echo (new TesseractOCR('img.png'))
->tessdataDir('/path')
->run();
Specify the location of the user words file.
This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.
Useful when dealing with contents that contain technical terminology, jargon, etc.
$ cat /path/to/user-words.txt
foo
bar
echo (new TesseractOCR('img.png'))
->userWords('/path/to/user-words.txt')
->run();
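When the vocabulary lives in application code rather than on disk, the word list can be generated on the fly. This is a minimal sketch, assuming Composer's autoloader is already loaded and that the file format is one word per line, as in the listing above; writeUserWords and ocrWithVocabulary are hypothetical helper names.

```php
<?php
use thiagoalessio\TesseractOCR\TesseractOCR;

// Hypothetical helper: write one word per line to a temporary file, return its path.
function writeUserWords(array $words): string
{
    $path = tempnam(sys_get_temp_dir(), 'words');
    if ($path === false || file_put_contents($path, implode("\n", $words) . "\n") === false) {
        throw new RuntimeException('Cannot write user words file');
    }
    return $path;
}

function ocrWithVocabulary(string $image, array $words): string
{
    $wordsFile = writeUserWords($words);
    try {
        return (new TesseractOCR($image))->userWords($wordsFile)->run();
    } finally {
        unlink($wordsFile); // remove the temporary word list
    }
}
```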
Specify the location of user patterns file.
If the contents you are dealing with have known patterns, this option can greatly improve tesseract's recognition accuracy.
$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com
echo (new TesseractOCR('img.png'))
->userPatterns('/path/to/user-patterns.txt')
->run();
Define one or more languages to be used during the recognition. A complete list of available languages can be found at: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages
Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.
echo (new TesseractOCR('img.png'))
->lang('lang1', 'lang2', 'lang3')
->run();
Specify the Page Segmentation Method, which instructs tesseract how to interpret the given image.
More info: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
echo (new TesseractOCR('img.png'))
->psm(6)
->run();
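Bare numbers such as 6 are hard to read in calling code, so naming the modes can help. The constants below are hypothetical names for a few common values as printed by tesseract --help-psm; check them against your installed version. Composer's autoloader is assumed to be loaded.

```php
<?php
use thiagoalessio\TesseractOCR\TesseractOCR;

// Hypothetical constants for a few modes listed by `tesseract --help-psm`.
const PSM_AUTO         = 3;  // fully automatic page segmentation, no OSD
const PSM_SINGLE_BLOCK = 6;  // a single uniform block of text
const PSM_SINGLE_LINE  = 7;  // a single text line
const PSM_SINGLE_WORD  = 8;  // a single word
const PSM_SINGLE_CHAR  = 10; // a single character

// Example: recognize an image known to contain exactly one line of text.
function readSingleLine(string $path): string
{
    return (new TesseractOCR($path))->psm(PSM_SINGLE_LINE)->run();
}
```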
Specify the OCR Engine Mode (see tesseract --help-oem).
echo (new TesseractOCR('img.png'))
->oem(2)
->run();
Specify the image DPI. It is useful if your image does not contain this information in its metadata.
echo (new TesseractOCR('img.png'))
->dpi(300)
->run();
This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').
echo (new TesseractOCR('img.png'))
->allowlist(range('a', 'z'), range(0, 9), '-_@')
->run();
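To see which character string the arguments in the example add up to, here is a sketch that assumes the shortcut simply concatenates arrays and strings in order. flattenAllowlist is a hypothetical helper written for illustration, not the library's own code.

```php
<?php
// Hypothetical helper: flatten arrays and strings into one character string.
function flattenAllowlist(...$parts): string
{
    $chars = '';
    foreach ($parts as $part) {
        $chars .= is_array($part) ? implode('', $part) : (string) $part;
    }
    return $chars;
}

// Prints: abcdefghijklmnopqrstuvwxyz0123456789-_@
echo flattenAllowlist(range('a', 'z'), range(0, 9), '-_@'), PHP_EOL;
```

Under that assumption, the resulting string is what would be passed as the value of tessedit_char_whitelist.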
Specify a config file to be used. It can either be the path to your own config file or the name of one of the predefined config files: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
echo (new TesseractOCR('img.png'))
->configFile('hocr')
->run();
Specify an output file to be used. Be aware: if you set an output file, the withoutTempFiles option is ignored.
Temporary files are written (and deleted) even if withoutTempFiles = true.
In combination with configFile you can get hocr, tsv or pdf files.
echo (new TesseractOCR('img.png'))
->configFile('pdf')
->setOutputFile('/PATH_TO_MY_OUTPUTFILE/searchable.pdf')
->run();
Shortcut for ->configFile('digits').
echo (new TesseractOCR('img.png'))
->digits()
->run();
Shortcut for ->configFile('hocr').
echo (new TesseractOCR('img.png'))
->hocr()
->run();
Shortcut for ->configFile('pdf').
echo (new TesseractOCR('img.png'))
->pdf()
->run();
Shortcut for ->configFile('quiet').
echo (new TesseractOCR('img.png'))
->quiet()
->run();
Shortcut for ->configFile('tsv').
echo (new TesseractOCR('img.png'))
->tsv()
->run();
Shortcut for ->configFile('txt').
echo (new TesseractOCR('img.png'))
->txt()
->run();
Define a custom directory to store temporary files generated by tesseract.
Make sure the directory actually exists and the user running php is allowed to write in there.
echo (new TesseractOCR('img.png'))
->tempDir('./my/custom/temp/dir')
->run();
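The two preconditions above (the directory exists and is writable) can be checked in code before the path is handed over. This is a minimal sketch, assuming Composer's autoloader is already loaded; ensureWritableDir and ocrWithTempDir are hypothetical helper names.

```php
<?php
use thiagoalessio\TesseractOCR\TesseractOCR;

// Hypothetical helper: create the directory if needed and verify it is writable.
function ensureWritableDir(string $dir): string
{
    // The second is_dir() covers the case where another process created it first.
    if (!is_dir($dir) && !mkdir($dir, 0700, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    if (!is_writable($dir)) {
        throw new RuntimeException("Directory is not writable: $dir");
    }
    return $dir;
}

function ocrWithTempDir(string $image, string $dir): string
{
    return (new TesseractOCR($image))->tempDir(ensureWritableDir($dir))->run();
}
```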
Specify that tesseract should output the recognized text without writing to temporary files.
The data is gathered from the standard output of tesseract instead.
echo (new TesseractOCR('img.png'))
->withoutTempFiles()
->run();
Any configuration option offered by Tesseract can be used like this:
echo (new TesseractOCR('img.png'))
->config('config_var', 'value')
->config('other_config_var', 'other value')
->run();
Or like this:
echo (new TesseractOCR('img.png'))
->configVar('value')
->otherConfigVar('other value')
->run();
More info: https://github.com/tesseract-ocr/tesseract/wiki/ControlParams
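The second form implies a naming rule: a camelCase method name stands for the snake_case configuration variable. The sketch below illustrates that rule only; toConfigVarName is a hypothetical helper and is not taken from the library's source, which may handle the conversion differently.

```php
<?php
// Hypothetical helper: camelCase method name -> snake_case config variable name.
function toConfigVarName(string $method): string
{
    // Prefix every uppercase letter (except a leading one) with "_", then lowercase.
    return strtolower(preg_replace('/(?<!^)[A-Z]/', '_$0', $method));
}

echo toConfigVarName('configVar'), PHP_EOL;      // config_var
echo toConfigVarName('otherConfigVar'), PHP_EOL; // other_config_var
```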
Sometimes, it may be useful to limit the number of threads that tesseract is allowed to use (e.g. in this case).
Set the maximum number of threads with threadLimit before calling run:
echo (new TesseractOCR('img.png'))
->threadLimit(1)
->run();
You can contribute to this project by:
- Opening an Issue if you found a bug or wish to propose a new feature;
- Placing a Pull Request with code that fixes a bug, corrects missing or wrong documentation, or implements a new feature;
Just make sure you take a look at our Code of Conduct and Contributing instructions.
tesseract-ocr-for-php is released under the MIT License.