Punjabi OCR Project in Development


Genie Singh

I would like to use this space to post information and to gather feedback, ideas, suggestions and possibly an implementation of a Punjabi OCR program. I will focus on implementing the project in C#, but I am open to other people's expertise in other languages as well, since programming skills are transferable.

process.png

To begin with, I suggest using a library known as Tesseract, maintained by Google at the moment. Some have used it to create OCR programs for Arabic, Hebrew, English, other European languages and Hindi. As an engine it supports many languages and can be trained to support many more. It is one of the best open source (free) OCR engines available and can be used to produce great software. It may make use of artificial neural networks internally, though I have not confirmed this.

So here is its main website:

https://code.google.com/p/tesseract-ocr/

Here is a page on how to train the latest version (not the best guide, but it can work roughly for creating the files to use for Punjabi):

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Here is a .NET (C#) wrapper (a sample program using the library is provided). The program outputs English by default; with Punjabi files it should theoretically be possible to scan and recognise Punjabi documents once trained. Making a pretty user interface after that is an easy job.

https://github.com/charlesw/tesseract

Here are some Punjabi training files someone created using this library, perhaps with an older version, so they may not be compatible, but they are useful as a basis for creating new ones (Lohit Punjabi font):

https://code.google.com/p/parichit/downloads/list

Here is someone's guide on creating training data (may not work):

https://peepswrite.wordpress.com/2013/05/26/training-tesseract-3-02/

Another guide on creating training data (again, may not work):

http://blog.cedric.ws/how-to-train-tesseract-301

"Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0."

"The Tesseract engine was originally developed as proprietary software at Hewlett Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler.[3] Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006.[7]"

"The initial versions of Tesseract could only recognize English language text. Starting with version 2 Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too."

http://en.wikipedia.org/wiki/Tesseract_%28software%29

According to this video by someone who has worked on some reputable OCR projects, Tesseract supports 40+ human languages: https://www.youtube.com/watch?v=gcjCiS9pJ3A#t=338

The files that need to be created are listed below. (The prefix can be something other than "pan", short for Panjabi, and some files may be optional.)

Edit: looking at ISO 639-2, the International Organization for Standardization standard that Tesseract uses for language codes, the code for Punjabi is "pan".

  • tessdata/pan.config
  • tessdata/pan.unicharset
  • tessdata/pan.unicharambigs
  • tessdata/pan.inttemp
  • tessdata/pan.pffmtable
  • tessdata/pan.normproto
  • tessdata/pan.punc-dawg
  • tessdata/pan.word-dawg
  • tessdata/pan.number-dawg
  • tessdata/pan.freq-dawg

... and the final crunched file is:

  • tessdata/pan.traineddata

and

  • tessdata/pan.user-words

I will post more updates, links and further material on this project as time goes on.

Edited by JatherdarSahib

I think Gurpreet Lehal of Panjabi University Patiala has already worked on this:

http://www.learnpunjabi.org/biodata_GSL.htm

Check that work out to save yourself possibly re-inventing the wheel?


Dr Gurpreet Lehal has some impressive academic papers; I have read a few on the OCR programs he has built. While the papers seem promising, it is difficult to get hold of his program or its source code. I have contacted him previously. He had a downloadable program, but it wasn't compatible with the latest version of Windows; it appears to be a Windows 95/98-era program and I haven't had much luck with it (one day I may try installing Windows 98 on a VM and running it there). The website for the downloadable software is copyrighted 2004-2005, so it may work on Windows XP.

http://guca.sourceforge.net/applications/sherik/

He responded with a link to his newer work, but I wasn't able to get that website to work either. The Government of India site has very low bandwidth (near dial-up speeds) and is extremely slow. His web-based, server-side solution has multiple limitations, including a lack of feedback to the user about what hasn't worked. As he stated, the solution only accepts 300 DPI images, and many digitized texts and some scanners can't provide such high resolutions. Running batch jobs would be near impossible because of the site's limitations: it has no API to connect to, and he hasn't published his source code. From his papers, he has developed an engine of his own using neural networks and has worked with various approaches to reach what seem, from the papers, to be impressive results. You have to register to use it; be my guest, and tell me what results you get.

http://tdil-dc.in/index.php?option=com_login

Limitations stated on the site:

"1. Supported file formats: BMP, PNG, TIF.
2. JPG format not preferred.
3. Scanning resolution supported: 300 DPI.
4. Colour mode supported: 8 bit greyscale or Black & White.
5. File size should not exceed 10 MB."

Tesseract, on the other hand, removes the need to repeat some of the hard work Dr Lehal has done. It is an open source engine (it cuts away the deep mathematical modelling and gives you a working model used for over 40 languages, including Indic ones). The solution I propose would be open source and client-side for optimum results, able to run a batch job processing 1000 images in one go, and potentially able to output a PDF at the end, so you can take a scanned ebook and have results at the end.

Edited by JatherdarSahib

Just to add: according to some tests with test training data files, the program can work with a single file, since the 'crunched' file contains everything needed inside it.

  • tessdata/pan.traineddata

Therefore only one file needs to be created for the program to work. There are some tools and Tesseract command-line options to create the file; however, the process requires a box file and a TIFF file, both derived from a UTF-8 encoded training-text.txt file. There may be shortcuts and better methods. More to come.

Edited by JatherdarSahib

Here is a screenshot of how the program looks in its development phase at the moment. I will be putting up the source and compiled code, with some early training data, soon; however, it is still far from satisfactory in its intended functionality.

The training data

  • tessdata/pan.traineddata

or pun.traineddata

will go in the tessdata folder next to the executable file (.exe), and the matching code goes in the language code field, i.e. pan (for pan.traineddata) or pun (for pun.traineddata). The path to the original image can be typed in or located using the (...) button. After clicking the "OCR it!" button and waiting some time, the recognized text appears below. The traineddata for English should work very well.

The project page where the files will be put up is here:
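The folder convention described above can be sketched in a few lines of Python (chosen here only for brevity; the program itself is written in C#). The helper names `traineddata_path` and `available_languages` are hypothetical and are not part of the program or of Tesseract. The only thing taken from the post is the layout: a tessdata folder beside the executable holding one `<code>.traineddata` file per language.

```python
from pathlib import Path


def traineddata_path(exe_dir, lang_code):
    """Return where <lang_code>.traineddata is expected, next to the executable."""
    return Path(exe_dir) / "tessdata" / f"{lang_code}.traineddata"


def available_languages(exe_dir):
    """List the language codes that have a .traineddata file in tessdata."""
    tessdata = Path(exe_dir) / "tessdata"
    if not tessdata.is_dir():
        return []
    # "pan.traineddata" -> "pan"
    return sorted(p.stem for p in tessdata.glob("*.traineddata"))


if __name__ == "__main__":
    # Hypothetical install folder, used only for illustration.
    print(traineddata_path("PunjabiOCR", "pan"))
    print(available_languages("PunjabiOCR"))
```

A check like this before starting recognition would let the interface report a missing language file instead of failing inside the engine.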
https://sourceforge.net/projects/punjabiocrwindows/?source=navbar

Here are some of the files. There are some compatibility issues and the training data needs more refinement, but it is a small beginning. Please let me know whether the program runs for you and how far you get. The current training files may work with the Lohit Punjabi and GurbaniWebThick fonts respectively.

As for the compatibility issues, the compiled application requires the Microsoft Visual C++ 2012 Redistributable (x86) or, depending on your version of Windows, the x64 version. It can be downloaded from here:

32bit version

http://www.freewarefiles.com/Microsoft-Visual-C-2012-Redistributable-x86_program_78881.html

64bit version

http://www.freewarefiles.com/Microsoft-Visual-C-2012-Redistributable-x64_program_78882.html

I will probably include the installer in my archive later on.

You will also need the latest version of the .NET Framework from Microsoft for it to work; you may already have this installed.

http://www.microsoft.com/en-gb/download/details.aspx?id=30653

http://www.microsoft.com/en-gb/download/details.aspx?id=42642

Here are the OCR files

https://sourceforge.net/projects/punjabiocrwindows/files/?

screenshot.png

Edited by JatherdarSahib

You sound like you know what you are doing!

Good luck brother!

Wishing you every success. I think having multiple searchable Gurmukhi texts available will be an immense step in the development of the language.

It would be especially useful for trying to isolate the semantic meanings of words that have fallen out of modern use, for instance.


I am considering creating a font for the training data, using scans of typical font sets used in certain publications:

http://www.wikihow.com/Create-a-Font

So far I've been looking at the training data used for Hindi (Devanagari), as an Indic script. It uses Unicode, and I can't seem to get jTessBoxEditor to create Unicode box and TIFF files to build training data from. According to this discussion, the only alternative appears to be using Linux to create the training data: https://groups.google.com/forum/#!topic/tesseract-ocr/e-9KZJWThKs

Edit:

After looking into the Linux question, the tool turns out to be a command-line program that has to be compiled from .cpp source and run to convert a text file into a .tiff image. There are Windows alternatives for printing to image files using freeware and other software (there is probably a compiled Windows version of the .cpp tool somewhere).

It also appears that the Lohit Punjabi font is a default font distributed with LibreOffice, the open source word processor on Linux (Ubuntu) that replaced the old OpenOffice.

The box file is generated by another command, which uses the engine's built-in data to guess the coordinates of each character and stores them in a file called a box file. Each line has the following format (in pixels, measured from the bottom-left corner of the image):

[CHARACTER] [left] [bottom] [right] [top] [page]

followed by the next character on the next line. The last field is the page number (0 for a single-page image).

e.g.

g 332 2094 348 2123 0
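To make the box line concrete, here is a minimal Python sketch that parses one line. It assumes the five numbers are left, bottom, right, top and page, as in Tesseract's box file format; the example line fits that reading, since 348 > 332 and 2123 > 2094. The names `Box` and `parse_box_line` are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom


def parse_box_line(line):
    """Parse one box file line into a Box.

    Splitting from the right keeps the symbol intact even when it is made of
    several code points (a Gurmukhi letter plus a vowel sign, for example).
    """
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"not a box line: {line!r}")
    char, *numbers = parts
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(char, left, bottom, right, top, page)


if __name__ == "__main__":
    box = parse_box_line("g 332 2094 348 2123 0")
    print(box.char, box.width, box.height)  # g 16 29
```

Reading the boxes this way makes it easy to spot suspicious entries, such as boxes only a pixel or two wide, before feeding the file to training.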

Punjabi characters, symbols and letters come to as many as 85 characters. After that the data can be used for training, which I have read somewhere works in a circular (iterative) pattern of refinement. Using just the spaced characters we get a file of about 500 KB, whereas English and other working Tesseract languages have 20 MB+. This is probably down to refinement and the use of dictionary data: simply putting in characters isn't enough. It needs words, text samples and a larger body of text, and the command-line interface and other tools aren't going to be very efficient at producing such a training file quickly.

So in short: we need a chunk of readable, editable, searchable text (Word doc) -> .txt -> jTessBoxEditor (optionally creating the .tiff beforehand), which outputs the .tiff and .box files -> Serak Tesseract Trainer, which produces language.traineddata (and optionally the other files).

Edited by JatherdarSahib

As an update, here is the program using Hindi (Devanagari script) training data. This won't work with just the one file; it requires the other files as well. Tesseract hosts the training data, which can be obtained from here:

https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.hin.tar.gz&can=2&q=

https://code.google.com/p/tesseract-ocr/downloads/list?can=1&q=language+data+for+Tesseract+3.02

hindi_ocr.png

Feeding the above output into the following website: http://h2p.learnpunjabi.org/default.aspx gives Gurmukhi output, which run through the Punjabi Transliteration tool gives the following:

ਇੰਚਲੈਪਡ ਮੇਂ ਰਾਮ ਜਲ੍ਸੀਤ੍ਸਵ
inchlaipd maaen rama jlsaeetasav




ਸਾਵऱਕਰ. ਭਤ੍ਰਾਵਾਨਨ੍ਦੂਗ਼ਮ ਕੋ ਮਲੇਤਾ-ਮਵ ਕਾ ਆਦਰ੍ਸ਼ ਮਾਨਤੇ ਭੀ ਇਸਲਿਏ
saavऱkar. bhtaravanndoogma kao malaetaa-mav kaa aadrsh maantaae bhee isalieae

ਉਨ੍ਹੋਂਨੇ ਇਨੰਲੈਣ੍ਡ ਮੈਂ ਰਾਮ ਜਲ੍ਸੀਤ੍ਸਵ ਕੇ ਅਵਸਰਭ੍ਰਰ ਕਹਾ ਥਾ; "ਯਦਿ ਮੈਂ
unhonnae innlaind maain rama jlsaeetasav kaae avsarbhrr kaha tha; "ydi maain

ਇਸ ਦੇਸ਼ ਕਾ ਅਧਿਨਾਯਕ ਹੋਤਾ, ਤੋ ਸ੍ਰਰ੍ਵਪ੍ਰਥਮ ਮੈਂ ਵਾਲ੍ਮੀਕਿ ਰਾਮਾਯਣ
isa daesh kaa adhinayka hotaa, tao sarrvprthma maain valmaeekai ramaayn

ਕੋ ਜਬ੍ਤ ਕਰਨੇ ਕਾ ਆਦੇਸ਼ ਦੇਤਾ, ਕ੍ਯੋਕਿ ਜਬ ਤਕ ਯਹ ਬਾਧ ਹਿਨ੍ਦੁ"
kao jbta karnae kaa aadaesh daetaa, kayokai jb taka yh badh hindu"

ਕੇ ਹਾਥੋ ਮੇ ਰਹੇਗਾ, ਤਬ ਤਕ ਨ ਤੋ ਹਿਨ੍ਦੂ ਕਿਸੀ ਦੁਸਰੇ ਈਸ਼੍ਵਰ ਯਾ
kaae hatho maae rhaega, tab taka n tao hindoo kaisaee dusarae eishvr ya

ਸਮ੍ਰਾਟ ਕੇ ਸਮਕ੍ਸ਼ ਸਿਰ ਅਬ ਸਕਤੇ ਹੈਂ, ਨੁਗਦੁ ਜਾਤਿ ਕਾ ਵਿਨਾਸ਼
samarat kaae samakash sair ab sakataae hain, nugdu jatai kaa vinash

ਹੋ ਸਕਤਾ ਹੈ। ਰਤ੍ਮਾਯਰ੍ਣ ਲੋਕਤਤ੍ਰ ਕਾ ਸ਼ਾਸ੍ਤ੍ਰ ਹੈ, ਐਸਾ ਸ਼ਾਸ੍ਤ੍ਰ,
ho sakataa hai |  rtamaayrn lokatatar kaa shasatar hai, aisaa shasatar,

ਜੋ ਲੋਕਤ੍ਤਤ੍ਰ ਕੋ ਕਹਾਨੀ ਨਹੀਂ, ਪ੍ਰਹਰੀ, ਪ੍ਰੇਰਕ ਸਾਰ ਨਿਰ੍ਮਾਤਾ ਭੀ ਹੈ।
jo lokatatatar kao kahanee nheen, prhree, praerka saar nirmaataa bhee hai | 

ਜਬ ਤਕ ਰਾਮਾਯਣ ਯਹਾ ,ਤਬ ਤਕ ਇਸ ਦੇਸ਼ ਭੇ ਕੋਈ ਅਧਿਨਾਯਕ ਨਹੀਂ
jb taka ramaayn yha ,tab taka isa daesh bhae kaoei adhinayka nheen

ਪਨਪ ਸਕਤਾ। ਕ੍ਯਾ ਕਹੀਂ ਕੀ ਐਸਾ ਰਤ੍ਮਾਟ, ਅਵਤਾਰ ਯਾ ਪੈਂਗਬਰ ਦਿਖਾਈ
pnp sakataa |  kaya kaheen kaee aisaa rtamaat, avtaar ya paingbr dikhaei

ਦੇਤਾ ਹੈ ਜੋ ਰਾਮ ਕੇ ਸਾਮਨੇ ਸਕੇ।"
daetaa hai jo rama kaae saamanae sakaae | "

Edited by JatherdarSahib

Result using just characters (single and double spaced). Some characters are recognized. The Gurmukhi laga matra (vowel signs, here the ones written below the letter) http://sikhism.about.com/od/learntoreadgurmukhi/ig/Gurmukhi-Vowels-Illustrated/Gurmukhi-Vowel-Dulankar---OO.htm#step-heading are easily confused with letters, perhaps because of their combined usage with certain letters, and spaces are removed by the recognition. The image used was the same image the training was done with.

The below uses Unicode; there are some useful fonts here as well: http://www.sikhnet.com/Gurmukhi-Fonts

Another issue I've found, which I don't understand, is that Unicode Punjabi turns into blocks under many fonts other than the Raavi font (maybe those fonts only handle ASCII characters). Raavi is therefore helpful when using jTessBoxEditor to create TIFF files. I may get a list of Punjabi words from a dictionary site, write code to put each word in a list (or array) along with extra symbols, scramble the list, output it as a body of text, and use jTessBoxEditor to create tif/box files to train with Serak Trainer, which uses the local Tesseract install. Looking at some English box/tiff files, it seems the tif/box training data is produced from a page of text that makes generous use of symbols in randomised order to mimic the style of an actual body of text.

There was an Indic project which contributed to the Hindi training files in the past, and its wiki has some useful material: https://code.google.com/p/tesseractindic/
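The word-scrambling idea could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the function name `build_training_text`, the tiny sample word list, the symbol set, and the choice to attach a symbol to about one word in four. A real run would load a full word list from a dictionary source instead.

```python
import random


def build_training_text(words, symbols, words_per_line=8, seed=0):
    """Shuffle words, sprinkle symbols after some of them, and wrap into lines."""
    rng = random.Random(seed)  # fixed seed so the same text can be regenerated
    tokens = list(words)
    rng.shuffle(tokens)

    decorated = []
    for word in tokens:
        # Attach a symbol to roughly one word in four (arbitrary ratio).
        if symbols and rng.random() < 0.25:
            word += rng.choice(symbols)
        decorated.append(word)

    lines = [
        " ".join(decorated[i:i + words_per_line])
        for i in range(0, len(decorated), words_per_line)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    sample_words = ["ਪੰਜਾਬੀ", "ਗੁਰਮੁਖੀ", "ਪਾਣੀ", "ਕਿਤਾਬ", "ਸਕੂਲ"] * 4
    sample_symbols = ["।", ",", "?", "!"]
    text = build_training_text(sample_words, sample_symbols, words_per_line=5)
    # Save as UTF-8 so the text can be opened in a box/TIFF generation tool.
    with open("training-text.txt", "w", encoding="utf-8") as handle:
        handle.write(text + "\n")
```

Keeping the seed fixed means the same training text can be rebuilt later if the box and TIFF files need to be regenerated.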

results.png

Edited by JatherdarSahib

At the moment Tesseract gives the following errors when trying to train on sentences:
APPLY_BOXES: boxfile line 18970/Ó¿ª ((2265,1626),(2273,1632)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 19001/ ((170,1603),(175,1611)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 19047/Ó¿╣ ((505,1603),(511,1609)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 19064/- ((650,1606),(652,1607)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 19070
Boxes failed resegmentation: 1029
APPLY_BOXES: Unlabelled word at :Bounding box=(1972,1876)->(1975,1879)
Found 18041 good blobs.
Leaving 23 unlabelled blobs in 0 words.
1 remaining unlabelled words deleted.
TRAINING ... Font name = raavi
Generated training data for 295 words
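To put the summary in proportion, a short Python sketch can pull the two counters out of the log and compute the share of boxes that failed. The function name `box_failure_rate` is made up; the only lines it relies on are the "Boxes read from boxfile" and "Boxes failed resegmentation" lines exactly as they appear in the output above.

```python
import re


def box_failure_rate(log_text):
    """Return (boxes_read, boxes_failed, failed_fraction) from APPLY_BOXES output."""
    read = re.search(r"Boxes read from boxfile:\s*(\d+)", log_text)
    failed = re.search(r"Boxes failed resegmentation:\s*(\d+)", log_text)
    if not read or not failed:
        raise ValueError("summary lines not found in log")
    total = int(read.group(1))
    bad = int(failed.group(1))
    if total == 0:
        raise ValueError("log reports zero boxes")
    return total, bad, bad / total


if __name__ == "__main__":
    log = (
        "Boxes read from boxfile: 19070\n"
        "Boxes failed resegmentation: 1029\n"
        "Found 18041 good blobs.\n"
    )
    total, bad, rate = box_failure_rate(log)
    print(f"{bad} of {total} boxes failed ({rate:.1%})")
    # prints: 1029 of 19070 boxes failed (5.4%)
```

So about 5.4% of the boxes failed to match a blob in this run, and 19070 - 1029 = 18041 agrees with the "good blobs" count in the log.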

  • 3 months later...

After several tries, Punjabi (pan) does not appear to function correctly in OCR, which led me to think there were issues in the original Tesseract OCR engine, as there appear to be with other Indic languages such as Devanagari/Hindi, for which people developed various technical workarounds. For a proper fix, however, it seems the Tesseract project is about to release its next version, which includes language data for Punjabi, so it is only a matter of time before I introduce the next interface with better functionality.

v3.0.4 https://github.com/tesseract-ocr/

https://github.com/tesseract-ocr/langdata/tree/master/pan


  • 1 year later...
  • 10 months later...
  • 3 years later...
Guest Tanveer Singh

I have read the whole conversation, and I too was trying to train Tesseract for Gurmukhi.

I found fairly accurate trained data for Punjabi, approximately 54 MB in size. It is pretty accurate compared to most of the Punjabi trained data on the internet. It can be found on this site, along with data for other Indian languages:

https://indic-ocr.github.io/tessdata/

Accuracy is about 70%, by my estimate.

The thing I want to point out is that the unicharset given for Punjabi in the Tesseract repository is of little use: it doesn't contain all the letters and matravan (vowel signs), and it contains no combinations of letters and matravan at all. By contrast, after unpacking this roughly 54 MB trained data I got a unicharset that covers all letters, all matravan, combinations of matravan, and combinations of multiple matravan with all letters.

Anyone can unpack the trained data with the combine_tessdata.exe tool found in the Tesseract folder, or download this .zip file containing all the unpacked files:
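One rough way to check a unicharset for gaps is sketched below in Python. It assumes the usual unicharset layout, where the first line holds the entry count and every later line starts with the character (or character sequence) followed by a space and its properties; that layout is an assumption here, not something confirmed in this thread. The function names are made up. The reference set is every assigned code point in the Gurmukhi Unicode block (U+0A00 to U+0A7F).

```python
import unicodedata


def read_unichars(lines):
    """Return the first field of every entry line, skipping the count line."""
    entries = iter(lines)
    next(entries, None)  # first line is the number of entries
    return [line.split(" ", 1)[0] for line in entries if line.strip()]


def gurmukhi_code_points():
    """All assigned characters in the Gurmukhi block U+0A00..U+0A7F."""
    return [
        chr(cp) for cp in range(0x0A00, 0x0A80)
        if unicodedata.name(chr(cp), "")
    ]


def missing_characters(unichars, wanted):
    """Characters from `wanted` that appear in no unicharset entry."""
    covered = set()
    for entry in unichars:
        covered.update(entry)  # an entry may hold several code points
    return [ch for ch in wanted if ch not in covered]


if __name__ == "__main__":
    # Tiny made-up unicharset fragment; a real one is read from the unpacked file.
    sample = ["3", "NULL 0", "ਕ 1", "ਕਾ 1"]
    gaps = missing_characters(read_unichars(sample), gurmukhi_code_points())
    print(len(gaps), "Gurmukhi code points not covered")
```

This only checks single code points. It says nothing about whether the letter and matra combinations mentioned above are present as their own entries, which would need a separate pass over the entries.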

http://s000.tinyupload.com/index.php?file_id=05731328810353145849

Also, if you want to train your own data for Punjabi, you will need the precompiled tessdata provided by Tesseract OCR and the LSTM data, which contain all the letters with their widths and heights for making box files. Those files are quite inaccurate, though. I generated .tiff and .box files from them in a Unix terminal using Tesseract's tesstrain tooling, but when I checked the result in jTessBoxEditor the characters were outside their boxes and the matravan were not labelled with which character they are. So what I mean is that we first have to fix the given unicharset and the rest of the tessdata and training data for our language if we want to make an accurate OCR.

For contacts :

Tanveer Singh

8198009913 - whatsapp

thanks

 


Guest Tanveer Singh
On 5/15/2017 at 10:54 PM, Mandeep Bajwa Vartia said:

 

 

On 5/14/2015 at 4:32 AM, Genie Singh said:

After several tries, Punjabi (pan) does not appear to function correctly in OCR, which led me to think there were issues in the original Tesseract OCR engine, as there appear to be with other Indic languages such as Devanagari/Hindi, for which people developed various technical workarounds. For a proper fix, however, it seems the Tesseract project is about to release its next version, which includes language data for Punjabi, so it is only a matter of time before I introduce the next interface with better functionality.

 

v3.0.4 https://github.com/tesseract-ocr/

https://github.com/tesseract-ocr/langdata/tree/master/pan

Absolutely right, the main problem is in the unicharset, but we can do something about it. We don't need to start over; instead we should fine-tune this trained data:

https://indic-ocr.github.io/tessdata/

my contact 8198009913 - whatsapp

