Genie Singh

Members
  • Posts: 679
  • Days Won: 3

Everything posted by Genie Singh

  1. Dr Gurpreet Lehal has some impressive academic papers; I have read a few on the OCR programs he has built. While the papers seem promising, it is difficult to get hold of his program or its source code. I have contacted him previously. He had a downloadable program that wasn't compatible with the latest version of Windows; it appears to be a Windows 95/98-based program, and I haven't had much luck with it (maybe one day I will install Windows 98 on a VM and try to run it). The website for the downloadable software is copyrighted 2004-2005, so it may work on Windows XP. http://guca.sourceforge.net/applications/sherik/

     He responded with a link to his newer work. I wasn't able to get that website to work either: the Government of India site has very low bandwidth (near dial-up speeds) and is extremely slow. His web-based, server-side solution has multiple limitations, including a lack of feedback to the user about what hasn't worked. As he stated, it only accepts 300 DPI images; many digitized texts are not at that resolution and some scanners can't produce it. Running batch jobs would be near impossible because of the site's limitations: it has no API to connect to, and he hasn't published his source code. From his papers, he has developed an engine of his own using neural networks and has worked with various approaches, with what seem to be impressive results. You have to register to use it; please be my guest and tell me what results you get. http://tdil-dc.in/index.php?option=com_login

     Limitations stated on the site: "1. Supported file formats: BMP, PNG, TIF. 2. JPG format not preferred. 3. Scanning resolution supported: 300 DPI. 4. Colour mode supported: 8 bit greyscale or Black & White. 5. File size should not exceed 10 MB."

     Tesseract, on the other hand, saves repeating some of the hard work Dr Lehal has already done. It is an open source engine: it cuts away the deep mathematical modelling and gives you a working model that is already used for over 40 languages, including Indic ones. The solution I propose would be open source and client-side for optimum results. It would be able to run a batch job that processes 1000 images in one go and potentially output a PDF at the end, so you can take an ebook and have results at the end.
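As a rough illustration of the site limits quoted above, here is a minimal Python sketch that checks a scan before uploading it. The function name `check_scan` and its parameters are hypothetical, the caller is assumed to already know the file size and DPI, the colour-mode rule is not checked, and "10 MB" is assumed to mean 10 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits as quoted from the site in the post above.
SUPPORTED_FORMATS = {".bmp", ".png", ".tif"}
REQUIRED_DPI = 300
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" = 10 * 1024 * 1024 bytes


def check_scan(filename, size_bytes, dpi):
    """Return a list of reasons the scan would break the quoted limits.

    The caller supplies size_bytes and dpi; this sketch never opens the image,
    and it does not check the colour-mode rule.
    """
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if dpi != REQUIRED_DPI:
        problems.append(f"resolution is {dpi} DPI, site expects {REQUIRED_DPI}")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 10 MB")
    return problems


if __name__ == "__main__":
    print(check_scan("page1.jpg", 2_000_000, 300))
```

A client-side tool would not need these limits at all, which is part of the argument for the Tesseract route; the check is only useful for anyone who still wants to try the website.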
  2. I can see mentions of such a tool, but it seems to be tied to typing tools that give you Unicode Punjabi; I can't find Google Transliterate itself. Google Translate, though, has limitations in the transliteration it produces. https://translate.google.com/

     Here is one limitation (there are many more). If you transliterate the word for moon, ਚੰਦ (chand), Google Translate gives you: ਚੰਦ Cada. The program above gives you: ਚੰਦ chnd. While chnd is still far from chand, it is closer than Cada.

     Another advantage is being able to view the source code, either by decompilation or through a future source code release, which gives developers something to add to. It is also useful for batch transliteration jobs: it can handle a whole ebook's worth of text. That may take a few minutes depending on your computer's resources, but it is useful. Google Translate can't interlace either, so with this tool you can compare the Gurmukhi Punjabi Unicode side by side with its romanized transliteration.
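One possible explanation for "chnd" versus "chand" is that a plain character-by-character lookup never writes the unwritten (inherent) "a" that follows a Gurmukhi consonant. The Python sketch below is an assumption about the general technique, not the program's actual code. The tables cover only a handful of characters, the function names are made up for illustration, and the inherent-vowel rule is deliberately simplified.

```python
# Tiny illustrative subset of Gurmukhi; a real table would cover the whole block.
CONSONANTS = {
    "\u0A1A": "ch",  # ਚ
    "\u0A26": "d",   # ਦ
    "\u0A28": "n",   # ਨ
    "\u0A2E": "m",   # ਮ
}
VOWEL_SIGNS = {
    "\u0A3E": "aa",  # kanna
    "\u0A3F": "i",   # sihari
    "\u0A40": "ee",  # bihari
}
NASALS = {"\u0A70": "n"}  # tippi


def naive(word):
    """Map each character on its own, with no inherent vowel."""
    table = {**CONSONANTS, **VOWEL_SIGNS, **NASALS}
    return "".join(table.get(ch, ch) for ch in word)


def with_inherent_a(word):
    """Simplified rule: add "a" after a consonant unless the word ends
    there or a vowel sign follows."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            if nxt is not None and nxt not in VOWEL_SIGNS:
                out.append("a")
        else:
            out.append(VOWEL_SIGNS.get(ch) or NASALS.get(ch) or ch)
    return "".join(out)


MOON = "\u0A1A\u0A70\u0A26"  # the word for moon from the post

if __name__ == "__main__":
    print(naive(MOON), with_inherent_a(MOON))
```

With these tables, `naive(MOON)` gives `chnd` and `with_inherent_a(MOON)` gives `chand`. Real Punjabi text needs more rules than this (conjuncts, word-final vowels, addak), so treat it as a starting point only.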
  3. Use this link to download the .exe file and run it; the program should start. It might give you a security message. If so, just click on "More info" and then "Run anyway". https://sourceforge.net/projects/punjabitransliterator/files/latest/download
  4. I would like to use this space to post information and to see feedback, ideas, suggestions and possibly an implementation of a Punjabi OCR program. I will focus on implementing the project in C#, but I am open to other people's expertise in other languages as well, since programming skills are transferable.

     To begin with, I suggest using the Tesseract library, currently sponsored by Google. It has been used by some to create OCR programs for Arabic, Hebrew, English, European languages and Hindi. As an engine it supports many languages and can be trained to support many more. It is one of the best open source (free) engines available for producing great software. It probably makes use of artificial neural networks.

     Its main website: https://code.google.com/p/tesseract-ocr/

     A page on how to train the latest version (not the best guide, but it can roughly work for creating the files to use for Punjabi): https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

     A .NET (C#) wrapper (a program using the library is provided). The program outputs English by default; with the Punjabi files, it should theoretically be possible to scan and recognise Punjabi documents once trained. Making a pretty user interface after that is an easy job. https://github.com/charlesw/tesseract

     Some Punjabi training files someone created using this library, perhaps with an old version of the library and so not compatible, but with some useful files for creating new ones (Lohit Punjabi): https://code.google.com/p/parichit/downloads/list

     Someone's guide on creating training data (may not work): https://peepswrite.wordpress.com/2013/05/26/training-tesseract-3-02/

     http://blog.cedric.ws/how-to-train-tesseract-301 Another guide on creating training data; again, it may not work.

     "Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0."

     "The Tesseract engine was originally developed as proprietary software at Hewlett Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006."

     "The initial versions of Tesseract could only recognize English language text. Starting with version 2 Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too."

     http://en.wikipedia.org/wiki/Tesseract_%28software%29

     According to this video by someone who has worked on some reputable OCR projects, Tesseract supports 40+ human languages: https://www.youtube.com/watch?v=gcjCiS9pJ3A#t=338

     The files that need to be created are listed below. They could be named something other than "pan", short for Panjabi, and some files may be optional. Edit: having looked at ISO 639-2 from the International Organization for Standardization, which Tesseract follows, the standard code for Punjabi is "pan".

     tessdata/pan.config
     tessdata/pan.unicharset
     tessdata/pan.unicharambigs
     tessdata/pan.inttemp
     tessdata/pan.pffmtable
     tessdata/pan.normproto
     tessdata/pan.punc-dawg
     tessdata/pan.word-dawg
     tessdata/pan.number-dawg
     tessdata/pan.freq-dawg

     ... and the final crunched file is: tessdata/pan.traineddata and tessdata/pan.user-words

     I will post more updates, links and further material on this project as time goes on.
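To keep track of which training components exist so far, a small helper like the following could be used. It is only a sketch: the suffix list is copied from the post above, while the function names `expected_files` and `missing_files` and the default `tessdata` folder location are assumptions made for illustration.

```python
from pathlib import Path

# Component suffixes exactly as listed in the post above.
COMPONENT_SUFFIXES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]


def expected_files(lang="pan", tessdata="tessdata"):
    """Build the component file names for a language code."""
    return [f"{tessdata}/{lang}.{suffix}" for suffix in COMPONENT_SUFFIXES]


def missing_files(lang="pan", tessdata="tessdata"):
    """Return the expected component files that are not on disk yet."""
    return [name for name in expected_files(lang, tessdata)
            if not Path(name).exists()]


if __name__ == "__main__":
    for name in missing_files():
        print("still to create:", name)
```

Once a combined `pan.traineddata` is in place, the usual command-line form would be something like `tesseract page.png out -l pan`, where `page.png` and `out` are placeholder names.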
  5. https://sourceforge.net/projects/punjabitransliterator/

     This piece of software is developed in C# .NET. It transliterates (romanizes) Punjabi (Gurmukhi) Unicode text. The output can be presented in an interlaced fashion, with the Gurmukhi followed by its English transliteration, or as plain text containing just the transliteration.

     The software runs on Windows (7 and above recommended). Windows may show a security warning before running it; just click on "More info" and then "Run anyway" to start the program. The program doesn't currently support the right-click context menu. To insert text, either type it in or use ctrl+x / ctrl+c and ctrl+v for cut/copy/paste.

     The program is useful for transliterating parts of the Suraj Prakash Granth website seen here: https://searchgurbani.com/sri_gur_pratap_suraj_granth as well as other Punjabi Unicode-based websites.

     Future features to be considered and perhaps added:
     - correction of sihari and bihari Unicode issues by parsing multiple times to correct character values
     - support for further fonts, the context menu and font size correction
     - output in various file formats (.txt, .rtf, .doc, .docx, .pdf)
     - a workaround for the security warning
     - a progress bar
     - text files as input, working with webpages, and parsing image files
     - use of a dictionary and correction of the grammatical features of aunkar, dulankar, dulavan and lavan
     - removing U sounds when used for emphasis rather than as a character, removing excessive use of "a" after "s", removing blocking of characters, and many other features
     - portable solutions for Mac, Linux, Android and other operating systems, and possible web solutions
     - converting the Gurmukhi-Akhar font to Unicode, and Gurmukhi-Akhar to English roman

     [Screenshot of program]
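The two output modes described above (interlaced, and transliteration only) can be sketched in Python as follows. This is not the program's code: the function names are hypothetical, and `transliterate` stands for any function that turns one line of Gurmukhi into roman text.

```python
def interlace(text, transliterate):
    """Follow each non-empty line of text with its transliteration."""
    out = []
    for line in text.splitlines():
        out.append(line)
        if line.strip():
            out.append(transliterate(line))
    return "\n".join(out)


def romanized_only(text, transliterate):
    """Return just the transliteration, line by line."""
    return "\n".join(transliterate(line) for line in text.splitlines())


if __name__ == "__main__":
    # str.upper is only a stand-in so the sketch runs without a real table.
    print(interlace("first line\n\nsecond line", str.upper))
```

Passing the transliteration function in as an argument keeps the layout logic separate from the character tables, so the same code would work for a Gurmukhi-Akhar font converter later.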
  6. Any ideas? OCR for Punjabi: trained data files using the Tesseract library, perhaps done in C#.

    1. Genie Singh

      https://code.google.com/p/tesseract-ocr/w/list

  7. Okay, now every Muslim should convert to Sikhi. Qawali is a Persian invention that entered Muhammadanism hundreds of years after Muhammad died.
  8. Long beards are now associated with perverts, Santa Claus, extremist Islamist nutcases, ordinary nutcases with foil hats on their heads trying to stop aliens from invading their brains, homeless people, people with social diseases that prevent them from maintaining hygiene, those who appear disheveled, and people who have been pulling all-nighters, who smell and haven't shaved or bathed. These days you can see a lot of great professors who are clean-shaven yet have all the knowledge expected of them, and you can meet a village nutcase with a long beard who thinks the world is flat and that spirits control things. So we need more positive role models with beards, and perhaps we shouldn't judge people on whether they have a beard or not.
  9. The laws of Moses, the sunnah of Muhammad and modern ideas of hygiene and overall health, including social wellbeing. Cut the hair away, just like nails, when it is too long and filled with dirt that isn't easily removed.
  10. http://www.scribd.com/doc/235270786/Transliteration-in-progress-Guru-Nanak-in-City-of-Room-Rom-Rum-Ruom-Bhai-Mani-Singh
  11. There is an interesting piece in this one about Abraham and the origins of circumcision and sunnah. People have mentioned it before on this forum without sources, but this appears to be the source: http://www.scribd.com/doc/234773276/Transliterated-Punjabi-Mecca-Pages-From-Bhai-Mani-Singh-Guru-Nanak-Janamsakhi
  12. The order is: Baba ji first goes to Mecca, then Roum, then Baghdad, then Medina, and then he goes to visit the Sidhs in the mountains (guys doing funky yoga). http://www.scribd.com/doc/235263430/Gurmukhi-Guru-Nanak-in-City-of-Room-Rom-Rum-Ruom-Bhai-Mani-Singh http://www.scribd.com/doc/235263436/Gurmukhi-Guru-Nanak-in-Medina-Bhai-Mani-Singh
  13. http://www.scribd.com/doc/233667537/Table-of-Contents-Bhai-Bala-wale-Sri-Guru-Nanak-Dev-Ji-Janamsakhi-Transliteration
  14. Medina Sakhi from the Bhai Bala wale Guru Nanak Janamsakhi, fully transliterated: http://www.scribd.com/doc/231847969/Medina-vich-Baba-Nanak-Transliteration-Guru-Nanak-Janamsakhi-Medina-from-Bhai-Bala-wale-Janamsakhi
  15. Because Pakistanis (especially the right wing) support Palestine and the downfall of Israel. Some feel that Pakistanis will then aid Sikhs in creating Khalistan, not realizing that Pakistanis want to conquer the world and will take the first chance they get to attack all of India, even if it is done slowly: by taking over a Sikh state and Kashmir, raping that region, then turning to the rest of India, and then using the combined force to attack China, which Pakistan currently cannot do. Some Khalistanis are aware of the threat but feel that with Khalistan they can fight Pakistanis if they do attack. Pakistan can't defeat the Indian army, but a smaller population with a new enemy would be easier to take on, or to control by establishing Pakistani puppets: Muslims in the guise of Sikhs controlling Khalistan to ideologically brainwash everyone into an Islamically ruled state.
  16. To test that theory of alternative names, we will need a couple of other sources, mainly non-Sikh sources, that make reference to the name Hamid Karu. It is interesting to note that the Ottoman website you gave praises the sultans as being almost perfect, since some of them were caliphs (rightful chieftains, jathedars, after Muhammad the prophet), whereas on Wikipedia, with references, you will find their characters being flawed. Selim is described on Wikipedia as angry and almost bloodthirsty, which is closer to the character of the Mughals and hence more believable.

      http://en.wikipedia.org/wiki/Hamidids

      Hamidids or Hamid Dynasty (Modern Turkish: Hamidoğulları or Hamidoğulları Beyliği) was one of the 14th century Anatolian beyliks that emerged as a consequence of the decline of the Sultanate of Rum and ruled in the regions around Eğirdir and Isparta in southwestern Anatolia.

      http://books.google.co.uk/books?id=QjzYdCxumFcC&pg=PA41&lpg=PA41&dq=Hamido%C4%9Fullar%C4%B1&source=bl&ots=PeV8MdIJzb&sig=chG4oVgJoSEwSXZvM3gRxtAeSI4&hl=en&sa=X&ei=3hfSU-OXIPKA7QbzwYGACw&ved=0CHIQ6AEwCQ#v=onepage&q=Hamido%C4%9Fullar%C4%B1&f=false

      http://en.wikipedia.org/wiki/Anatolian_beyliks

      It seems local rulers were given the title "bey".
  17. If we work with the time scales from this page, http://www.sikhiwiki.org/index.php/The_Udasis_of_Guru_Nanak#The_five_journeys we get Selim I or, at the latest, Suleiman the Magnificent, assuming these timelines are even correct. Hamid Karu is described as a greedy tyrant, and Baba ji goes to reform his behaviour.
  18. He appears here as well. Perhaps Muhammadans removed him from history while we still have him accounted for, which would suggest that some literature referring to him must exist, or that it's an alias or nickname for him. The geography is confusing because we see a mix of Turkey and Medina, which both came under the Ottoman Empire but are very distinct places in today's world. Pages 346-350: Baba ji gae rum shair (Baba ji went to the city of Rom). Before page 346 you have the Mecca sakhi and after page 350 you have the Baghdad sakhi. http://www.scribd.com/doc/209844046/Guru-Nanak-Janamsakhi-Bhai-Mani-Singh-Gian-Ratnawali-Gyan-Ratnavali-in-Gurmukhi-Punjabi
  19. It's all there in the attached Scribd document; go to page 4. It says Baba went to meet Sultan Hamid Karu in the city of Rum, which in the Gian Ratnawali is a separate sakhi that comes right after the Mecca one. According to the Wikipedia page for Selim I, he sent his officials to the province of Rum, in north-central Anatolia. http://en.wikipedia.org/wiki/Sultanate_of_Rum Anatolia is in modern-day Turkey. http://en.wikipedia.org/wiki/Anatolia
  20. page 4 http://www.scribd.com/doc/231847969/Medina-vich-Baba-Nanak-Transliteration-Guru-Nanak-Janamsakhi-Medina-from-Bhai-Bala-wale-Janamsakhi
  21. Sultan Hameed Karu and Guru Nanak. Hameed is a character found in the Medina section, right after the Mecca journey, of the Bhai Bala and Gian Ratnawali Janamsakhis. Yet historical references do not seem to point to such a figure. One person from that period it might be is Selim I, but is there any correlation in the names? http://en.wikipedia.org/wiki/Selim_I Why doesn't the janamsakhi tradition mention Sultan Selim and Baba Nanak, then?