Genie Singh

Members
  • Posts: 679
  • Days Won: 3

Everything posted by Genie Singh

  1. Are these genuine quotes from Hindu sources on the origin of the Jatts? http://www.jatland.com/home/Jats_in_Indian_epics

  2. After several tries, Punjabi (pan) does not appear to work correctly in OCR, which led me to think there are issues in the original Tesseract OCR engine itself, as there seem to be with other Indic scripts such as Devanagari/Hindi, for which people have developed various technical workarounds. For a proper fix, it seems the Tesseract project is about to release its next version (v3.04), which includes language data for Punjabi, so it should only be a matter of time before the next interface can offer better functionality. https://github.com/tesseract-ocr/ https://github.com/tesseract-ocr/langdata/tree/master/pan
  3. At the moment there appears to be the following error from Tesseract when trying to train on sentences:
     APPLY_BOXES: boxfile line 18970/Ó¿ª ((2265,1626),(2273,1632)): FAILURE! Couldn't find a matching blob
     FAIL!
     APPLY_BOXES: boxfile line 19001/ ((170,1603),(175,1611)): FAILURE! Couldn't find a matching blob
     FAIL!
     APPLY_BOXES: boxfile line 19047/Ó¿╣ ((505,1603),(511,1609)): FAILURE! Couldn't find a matching blob
     FAIL!
     APPLY_BOXES: boxfile line 19064/- ((650,1606),(652,1607)): FAILURE! Couldn't find a matching blob
     APPLY_BOXES:
        Boxes read from boxfile: 19070
        Boxes failed resegmentation: 1029
     APPLY_BOXES: Unlabelled word at :Bounding box=(1972,1876)->(1975,1879)
        Found 18041 good blobs.
        Leaving 23 unlabelled blobs in 0 words.
        1 remaining unlabelled words deleted.
     TRAINING ... Font name = raavi
     Generated training data for 295 words
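
To see how much of the box file is actually failing, the counts can be pulled out of a log like the one above. This is only a rough sketch: the function name and the sample string are made up for illustration, and the regular expressions assume the message wording shown in the log, which may differ between Tesseract versions.

```python
import re

# A shortened sample in the same wording as the log above (illustrative only).
SAMPLE_LOG = (
    "APPLY_BOXES: boxfile line 19064/- ((650,1606),(652,1607)): "
    "FAILURE! Couldn't find a matching blob\n"
    "APPLY_BOXES: Boxes read from boxfile: 19070 "
    "Boxes failed resegmentation: 1029\n"
    "Found 18041 good blobs.\n"
)


def summarize_apply_boxes(log_text):
    """Collect failing box-file line numbers and the summary counts."""

    def grab(pattern):
        # Return the first captured integer, or None if the message is absent.
        match = re.search(pattern, log_text)
        return int(match.group(1)) if match else None

    boxes_read = grab(r"Boxes read from boxfile:\s*(\d+)")
    boxes_failed = grab(r"Boxes failed resegmentation:\s*(\d+)")
    rate = None
    if boxes_read and boxes_failed is not None:
        rate = boxes_failed / boxes_read
    return {
        "failed_lines": [int(n) for n in re.findall(r"boxfile line (\d+)/", log_text)],
        "boxes_read": boxes_read,
        "boxes_failed": boxes_failed,
        "good_blobs": grab(r"Found (\d+) good blobs"),
        "failure_rate": rate,
    }


summary = summarize_apply_boxes(SAMPLE_LOG)
```

With the figures in the log (1029 failed out of 19070 read) the failure rate works out to roughly 5.4%, and the failing line numbers point at the box-file entries worth inspecting by hand.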
  4. http://www.sikhawareness.com/uploads/monthly_02_2015/post-10552-0-42647600-1423234747.jpg I wonder if it says something else after Nanak, since there is concatenation taking place [...nanakdasranjit...]. It could be "Nanak dasran jit", but ranjit is more likely. I wonder whether ranjit here is a name, an instrument, or carries its literal meaning of victor of the battlefield. nanakdasranjitabi ?charo | eiochrkrbhan Is the yellow ink medieval Tipp-Ex (correction fluid), or damage caused accidentally by some reaction (ink, paper, time), or possibly intentional damage?
  5. Which photo no./link did you see that in? The content of the second book looks like a janamsakhi, and the first book, going by its date, may be one too. I have also seen an almost identical manuscript, i.e. a janamsakhi, kept in the British Library, with similar binding, size, handwriting style (size, shape, stroke), ink colour and pages.
  6. http://imgur.com/E86R2ts This one mentions "Nanak" several times in the body of the text, and "Baba Nanak" a couple of times. It is Baba Nanak speaking to a pandha (pandit); my guesses would be Pandit Hardyal or Pandit Gopal. I could be wrong, so with the usual tons of mistakes this is what I can see at the moment (taking out the concatenation requires a good vocabulary beforehand as well, and some of the words aren't in use in modern Punjabi). 2nd line onwards:
     dhajee Nanak noprou | maitaegeebahuvsaevakarg| ta padhae kahiaa bhlajeetb padha epteelikhidi tee || ta Sri Guru Nanak ji eikadeenprad(?)jaedin chupkarirhae | tapadhaekahia Nanak ji radaki ounhee | ta Guru Nanak Ji kahiaa padh ji toopaji aahaijrmainoopraavdahae | ta padhae kahiaa jeemaes bhooakaechpariaah | smakharcholaekhapdeeaapree sabhoaemainoaavdahai || ta Sri Guru Baba Nanak ji kahi [then Sri Guru Baba Nanak ji said] aaeaenepreegaeifahaepoudaehainia || dharmraeidae thadhajadahaieaehuprnabadhai | ta Sri Guru Ji ou jobfeikasigiraagvichikahiaa || jalimohughasimsrkarimatikagafarisar bhaouklmkaguchivlaekhagigrrpachlikhrvee chag | likhrnmsal ? likhrlikhrantrnpacha
  7. I'm guessing some Sikhs have shanka (doubts) over Sikhi to the point that they don't even believe in those kurbanis (acts of sacrifice/martyrdom), because the instances of miracles and the supernatural, like Baba Deep Singh picking up his own head to fight, make people doubt the entire thing. Yet there have been Sikhs beheaded in Pakistan and Afghanistan recently; perhaps people put that down to minor disputes, or to the victim not being one of their own caste or relatives. Perhaps people are so wrapped up in the media that if the mainstream media doesn't cover Sikhs being killed in Afghanistan, it never happened for them; at the same time, we don't hear about the beatings, such as lashings, that the Taliban give to other Muslims. Maybe Sikh history is depressing and you can get away from it, whereas getting away from mainstream media is hard; but many media stories get forgotten over time, and current affairs don't always make it into the history lessons of tomorrow. Perhaps it is also because some Sikhs have lately felt xenophobia/racism/contempt due to mistaken identity with Islamists and the increased coverage of Islamist atrocities. Many people want an apology from those they see as Islamist kin; we can't always separate ourselves from other humans, and some people just have businesses, careers, education and relationships to maintain and build, so it's easier to say sorry for something you have nothing to do with than to actually explain that you have nothing to do with it. I think many "Asiatic" religions are at the moment being brushed over as branches of Islamism by some less than affluent folk who have become a vocal minority.
  8. Maybe someone who does as the Biprans, Bipran ke reet (copying the Biprans). So if it is brahmin then it is the same as Brahminvad http://searchsikhism.com/bipran-ki-reet ... The word Bipar (ਬਿਪ੍ਰ) is found used by Bhagat Ravidas in Guru Granth Sahib ANG 1293 " ਅਬ ਬਿਪ੍ਰ ਪਰਧਾਨ ਤਿਹਿ ਕਰਹਿ ਡੰਡਉਤਿ ਤੇਰੇ ਨਾਮ ਸਰਣਾਇ ਰਵਿਦਾਸੁ ਦਾਸਾ ॥੩॥੧॥ Ab bipar parḏẖān ṯihi karahi dand▫uṯ ṯere nām sarṇā▫e Raviḏās ḏāsā. ||3||1|| Now, the important Brahmins of the city bow down before me; Ravi Daas, Your slave, seeks the Sanctuary of Your Name. ||3||1||"-Guru Granth Sahib ANG 1293 It is present in the Sikh dictionary on Srigranth.org http://www.srigranth.org/servlet/gurbani.dictionary?Param=%E0%A8%AC%E0%A8%BF%E0%A8%AA%E0%A9%8D%E0%A8%B0 SGGS Gurmukhi-Gurmukhi Dictionary ਬ੍ਰਾਹਮਣ। ਉਦਾਹਰਣ: ਬਿਪ੍ਰ ਸੁਦਾਮੇ ਦਾ ਲਦੁ ਭੰਜ॥ {ਬਸੰ ੫, ਅਸ ੨, ੨:੩ (1192)}। brahman | udahrn: bipr saudamaae da ldu bhnj॥ {bsan 5, asa 2, 2:3 (1192)} | SGGS Gurmukhi-English Dictionary n. (from Sk. Vipra) Brahmin SGGS Gurmukhi-English Data provided by Harjinder Singh Gill, Santa Monica, CA, USA. Mahan Kosh Encyclopedia ਦੇਖੋ, ਬਿਪ. "ਬਿਪ੍ਰ ਸੁਦਾਮੇ ਦਾਲਦੁ ਭੰਜ". (ਬਸੰ ਅਃ ਮਃ ੫). Mahan Kosh data provided by Bhai Baljinder Singh (RaraSahib Wale); See http://www.ik13.com daekho, bip. "bipr saudamaae daldu bhnj". (bsan aਃ maਃ 5). http://en.wikipedia.org/wiki/Sudama#Gift wiki says: " Sudama was a moneyless, poor Brahmin"
  9. I wonder if this is part of a manuscript that has been lost entirely, or a never-before-seen manuscript. If so, it would be worth digitizing as much as you can before such a beauty is lost to time. Perhaps it may be the key to settling an authenticity study on certain texts still in circulation.
  10. Hi Peter, I have seen very similar manuscripts kept under great care in the British Library. You can go there, register with proof of ID, and wait anywhere from an hour to a few days while the item is ordered in for you to view. There are rules in place, but it is open to public access. http://explore.bl.uk/primo_library/libweb/action/search.do?dscnt=0&frbg=&scp.scps=scope%3A%28BLCONTENT%29&tab=local_tab&dstmp=1423072443255&srt=rank&ct=search&mode=Basic&vl(488279563UI0)=any&dum=true&tb=t&indx=1&vl(freeText0)=janam%20sakhi&vid=BLVU1&fn=search There are also many scans of similar manuscripts on the Panjab Digital Library; if you register you can view them, and it has a lot of gems among its scanned texts. http://www.panjabdigilib.org/
      One thing about such manuscripts I don't understand yet is the binding in leather; today people prefer cardboard or plastic instead. It's difficult to read, since it uses laarivar (the Punjabi is concatenated; apparently the spaces between words came in later, around the time of the printing press), and I am too used to typed writing rather than handwriting. Here is one of them (I will edit this to correct what I can make of it); my reading skills need much work. The last one is something else: the style of writing looks, from my speculation, like a different writer and a different year as well.
      Looking at this one, my initial guess is a gutka (prayer book), possibly a shabad (hymn) extracted from Guru Granth Sahib. But from my limited reading, splitting the text out of its concatenation and running a search, I can't find it in Guru Granth Sahib. http://www.srigranth.org/ So it may be from a Janamsakhi. http://imgur.com/gXQKY1a
      ੴ ਸਤਿ ਗੁਰ ਪ੍ਰਸਾਦਿ Ik Onkar sat gur prasad There is one god known by the grace of the true Guru
      ਰਾਗੁ ਰਾਮਕਲੀ ਗਿਆਨੁ raag ramkali gian In the musical meter of Raag Ramakali ... Wisdom (http://www.sikhiwiki.org/index.php/Ramkali)
      ਮਹਲਾ ||੧ || saraed mahala || 1 || pakhotam ... composed by the first House (Mehal) ||
      ਪ੍ਰਮਾਤਮਾ ਪੂਰਨ ਬਿਸਵਾਥੀ parmatma pooran beesvathee God (the one above the soul) ... complete s || adpurkh abeechalee
      http://imgur.com/oSd7V48 tuheetaehinivaouooseesa | 1 || ouanachrkhatahosoh achrjan | nihach rasuashrhtataheekom naan|2 || taheeekou (NEXT PAGE) manalgatidinsurtl 2 gavuai | apaaapabosa avarnheeseesnivavhu 3|| Nanak das matkaht haiagmanigmkeeseeakha|
      http://i.imgur.com/QrhqzbD.jpg sarodaiagiaanabicharit baiprmpdpaei | 3364 || alikhtmakimnamighaliakhi aabhutpeetkarikaealikha eiaakisnnae || bholchukbkhstee (next page) akhrsodhiprrna | 95 simta | 1879 ||jaegoadi narveea ramramjee ਵਾਹਿਗੁਰੂ VAHEGURU JI KI FATEH || bolavneesboohooi khalsae jeenooa
      If that page is part of the same manuscript then it can't be from Guru Granth Sahib: the word Khalsa is very rare in Guru Granth Sahib, and the phrase "VAHEGURU JI KI FATEH" is not in Guru Granth Sahib at all, though it can be found in other texts from after Guru Gobind Singh's time. There are some mistakes in what I've tried to read, but the khalsa fateh can easily be seen.
  11. I think, without having done enough research into it, that the so-called pro-vegetarian shabads are really shabads about human rights, women's rights, animal rights and the environment, and about other virtues like honesty and integrity. As for hunting, there were different reasons, such as violent animals (lions, tigers, bears) attacking Sikhs, killing innocent villagers and eating innocent children alive. There was game meat, since other more suitable, humbler foods were not easy to get (which is more of an animal rights issue), as well as sakhis of people being liberated, such as Mallu Khatri: it is said that Guru Nanak, in his later jama (vessel), would give him liberation after he died; he did a lot of sewa but became wrapped up in ego, and Guru Gobind Singh shot a wild animal which revealed itself to the Sikhs to be Mallu Khatri. But considering the lineage of the early Sikhs documented in Maharaja Ranjit Singh's time, it seems more of them than not were hunters who ate meat, like the Patiala family, the Akali Nihangs and Ranjit Singh. The pro-vegetarian argument is that this is all propaganda, or that they corrupted Sikhi and were not following the truth. I say look at the not-so-pro-vegetarian shabads and ask why. One of the most pro-vegetarian Sikhs, Baba Nand Singh, sat on a lion skin in many of his photos; I wonder why, if he felt passionate about not harming animals.
  12. Result using just characters (single and double spaced). Some characters are recognized, but the Gurmukhi laga matras (such as the bottom symbols) http://sikhism.about.com/od/learntoreadgurmukhi/ig/Gurmukhi-Vowels-Illustrated/Gurmukhi-Vowel-Dulankar---OO.htm#step-heading are easily confused with letters, perhaps due to their combined usage with certain letters, and spaces are removed by recognition. The image used was the original image used for training. The below is using Unicode; there are some useful fonts here as well: http://www.sikhnet.com/Gurmukhi-Fonts
      Another issue I've found, which I don't understand, is that Unicode Punjabi becomes blocks under many fonts (maybe they can only handle ASCII characters), apart from the Raavi font, which is helpful when using JTessBoxEditor to create tiff files.
      I may get a list of Punjabi words from a dictionary site, write code to put each word in a list (or array) along with extra symbols, scramble the list, output it as a body of text, and use JTessBoxEditor to create tif/box files to train with Serak trainer, which uses a local Tesseract install. Looking at some English box/tiff files, it seems the tif/box training data is produced from a page of text that makes generous use of symbols in randomised order to mimic the style of an actual body of text. There was an Indic project which contributed to the Hindi training files in the past, and its wiki has some useful stuff: https://code.google.com/p/tesseractindic/
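
The word-list idea above could be sketched roughly as follows. Everything here is hypothetical: the function name, the handful of sample words and the symbol list are placeholders, and a real run would load the words from a dictionary file instead. The output is plain UTF-8 text that could then be rendered to tif/box files in JTessBoxEditor.

```python
import random

# Placeholder data; a real word list would come from a dictionary source.
SAMPLE_WORDS = ["ਪੰਜਾਬੀ", "ਗੁਰੂ", "ਨਾਮ", "ਸਤਿ", "ਪਾਣੀ", "ਕਿਤਾਬ"]
SAMPLE_SYMBOLS = ["।", "॥", ",", "?"]


def build_training_text(words, symbols, repeats=3, words_per_line=6,
                        symbol_rate=0.25, seed=0):
    """Shuffle the word list several times and sprinkle symbols after words."""
    rng = random.Random(seed)  # fixed seed so the text can be regenerated
    tokens = []
    for _ in range(repeats):
        batch = list(words)
        rng.shuffle(batch)  # every word appears once per pass, in random order
        for word in batch:
            if symbols and rng.random() < symbol_rate:
                word += rng.choice(symbols)  # attach a symbol directly to the word
            tokens.append(word)
    lines = [" ".join(tokens[i:i + words_per_line])
             for i in range(0, len(tokens), words_per_line)]
    return "\n".join(lines)


training_text = build_training_text(SAMPLE_WORDS, SAMPLE_SYMBOLS)
```

Shuffling whole passes rather than sampling at random guarantees that every word, and so every character in the list, shows up at least `repeats` times, which seems to matter for training coverage.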
  13. As an update, here is the program running with Hindi (Devanagari script) training data. (This won't work with just the one file; it requires the others too. Tesseract hosts the training data, which can be obtained from here: https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.hin.tar.gz&can=2&q= https://code.google.com/p/tesseract-ocr/downloads/list?can=1&q=language+data+for+Tesseract+3.02 ) Feeding the above output into the following website, http://h2p.learnpunjabi.org/default.aspx, and then running the resulting Gurmukhi through the Punjabi Transliteration tool gives the following:
      ਇੰਚਲੈਪਡ ਮੇਂ ਰਾਮ ਜਲ੍ਸੀਤ੍ਸਵ
      inchlaipd maaen rama jlsaeetasav
      ਸਾਵऱਕਰ. ਭਤ੍ਰਾਵਾਨਨ੍ਦੂਗ਼ਮ ਕੋ ਮਲੇਤਾ-ਮਵ ਕਾ ਆਦਰ੍ਸ਼ ਮਾਨਤੇ ਭੀ ਇਸਲਿਏ
      saavऱkar. bhtaravanndoogma kao malaetaa-mav kaa aadrsh maantaae bhee isalieae
      ਉਨ੍ਹੋਂਨੇ ਇਨੰਲੈਣ੍ਡ ਮੈਂ ਰਾਮ ਜਲ੍ਸੀਤ੍ਸਵ ਕੇ ਅਵਸਰਭ੍ਰਰ ਕਹਾ ਥਾ; "ਯਦਿ ਮੈਂ
      unhonnae innlaind maain rama jlsaeetasav kaae avsarbhrr kaha tha; "ydi maain
      ਇਸ ਦੇਸ਼ ਕਾ ਅਧਿਨਾਯਕ ਹੋਤਾ, ਤੋ ਸ੍ਰਰ੍ਵਪ੍ਰਥਮ ਮੈਂ ਵਾਲ੍ਮੀਕਿ ਰਾਮਾਯਣ
      isa daesh kaa adhinayka hotaa, tao sarrvprthma maain valmaeekai ramaayn
      ਕੋ ਜਬ੍ਤ ਕਰਨੇ ਕਾ ਆਦੇਸ਼ ਦੇਤਾ, ਕ੍ਯੋਕਿ ਜਬ ਤਕ ਯਹ ਬਾਧ ਹਿਨ੍ਦੁ"
      kao jbta karnae kaa aadaesh daetaa, kayokai jb taka yh badh hindu"
      ਕੇ ਹਾਥੋ ਮੇ ਰਹੇਗਾ, ਤਬ ਤਕ ਨ ਤੋ ਹਿਨ੍ਦੂ ਕਿਸੀ ਦੁਸਰੇ ਈਸ਼੍ਵਰ ਯਾ
      kaae hatho maae rhaega, tab taka n tao hindoo kaisaee dusarae eishvr ya
      ਸਮ੍ਰਾਟ ਕੇ ਸਮਕ੍ਸ਼ ਸਿਰ ਅਬ ਸਕਤੇ ਹੈਂ, ਨੁਗਦੁ ਜਾਤਿ ਕਾ ਵਿਨਾਸ਼
      samarat kaae samakash sair ab sakataae hain, nugdu jatai kaa vinash
      ਹੋ ਸਕਤਾ ਹੈ। ਰਤ੍ਮਾਯਰ੍ਣ ਲੋਕਤਤ੍ਰ ਕਾ ਸ਼ਾਸ੍ਤ੍ਰ ਹੈ, ਐਸਾ ਸ਼ਾਸ੍ਤ੍ਰ,
      ho sakataa hai | rtamaayrn lokatatar kaa shasatar hai, aisaa shasatar,
      ਜੋ ਲੋਕਤ੍ਤਤ੍ਰ ਕੋ ਕਹਾਨੀ ਨਹੀਂ, ਪ੍ਰਹਰੀ, ਪ੍ਰੇਰਕ ਸਾਰ ਨਿਰ੍ਮਾਤਾ ਭੀ ਹੈ।
      jo lokatatatar kao kahanee nheen, prhree, praerka saar nirmaataa bhee hai |
      ਜਬ ਤਕ ਰਾਮਾਯਣ ਯਹਾ ,ਤਬ ਤਕ ਇਸ ਦੇਸ਼ ਭੇ ਕੋਈ ਅਧਿਨਾਯਕ ਨਹੀਂ
      jb taka ramaayn yha ,tab taka isa daesh bhae kaoei adhinayka nheen
      ਪਨਪ ਸਕਤਾ। ਕ੍ਯਾ ਕਹੀਂ ਕੀ ਐਸਾ ਰਤ੍ਮਾਟ, ਅਵਤਾਰ ਯਾ ਪੈਂਗਬਰ ਦਿਖਾਈ
      pnp sakataa | kaya kaheen kaee aisaa rtamaat, avtaar ya paingbr dikhaei
      ਦੇਤਾ ਹੈ ਜੋ ਰਾਮ ਕੇ ਸਾਮਨੇ ਸਕੇ।"
      daetaa hai jo rama kaae saamanae sakaae | "
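
For anyone curious how a Devanagari-to-Gurmukhi script conversion can be approximated offline, here is a crude sketch. It is not the method used by the website above (I don't know how that works); it just relies on the fact that the Unicode Devanagari block (U+0900) and the Gurmukhi block (U+0A00) are laid out in parallel, so most letters sit 0x100 apart. The function name and the cut-off at U+096F are my own choices.

```python
import unicodedata

# Devanagari letters, signs and digits; code points above U+096F are skipped
# because their Gurmukhi counterparts mean something different.
DEVANAGARI_FIRST = 0x0901
DEVANAGARI_LAST = 0x096F
BLOCK_OFFSET = 0x0A00 - 0x0900  # distance between the two Unicode blocks


def devanagari_to_gurmukhi(text):
    """Shift Devanagari code points into the Gurmukhi block where one exists."""
    converted = []
    for ch in text:
        code = ord(ch)
        if DEVANAGARI_FIRST <= code <= DEVANAGARI_LAST:
            candidate = chr(code + BLOCK_OFFSET)
            # Unassigned Gurmukhi slots have no Unicode name; keep the original.
            if unicodedata.name(candidate, ""):
                ch = candidate
        converted.append(ch)
    return "".join(converted)


example = devanagari_to_gurmukhi("राम")
```

Letters with no Gurmukhi slot, such as ऱ (U+0931), pass through unchanged, which would be consistent with the stray Devanagari letter left inside ਸਾਵऱਕਰ above. This is script conversion only, not translation, and it ignores spelling conventions that differ between Hindi and Punjabi.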
  14. I am considering creating a font for the training data using scans of the typical font sets used in certain publications. http://www.wikihow.com/Create-a-Font So far I've been looking at the training data used for Hindi (Devanagari) as an Indic script. It uses Unicode, and I can't seem to get JTessBoxEditor to create Unicode box and tiff files to build training data from. According to this guide, the only alternative appears to be using Linux to create the training data. https://groups.google.com/forum/#!topic/tesseract-ocr/e-9KZJWThKs
      Edit: After looking into the Linux route, it turns out to be a command-line program that has to be compiled from source (.cpp) and run to convert a text file into a .tiff image. There are Windows alternatives for printing text out to image files using freeware and other software (there is probably a compiled Windows version of the .cpp program somewhere). It also appears the Lohit Punjabi font is a default font distributed on Linux (Ubuntu) with the open-source word processor LibreOffice (replacing the old OpenOffice).
      The box file is generated using another command-line tool, which uses its built-in data to estimate the coordinates of each character and stores them in a file called a box file. The format is as follows (in pixels, measured from the bottom-left corner of the image): [CHARACTER] [left] [bottom] [right] [top] [page number], one character per line, e.g. g 332 2094 348 2123 0. Punjabi characters, symbols and letters can give up to 85 characters. After that, the box file can be used for training, which I have read somewhere works in a cycle, refining the result with each pass.
      Using just the spaced characters we get a file of about 500 KB, whereas English and other functional Tesseract OCR languages have 20 MB+. The reasons for this are probably refinement and the use of dictionary data: simply putting in characters isn't enough; it needs words, text samples and a larger body of text, and for that the command-line interface and other tools aren't going to be very efficient at producing a training file soon.
      So in short: a chunk of readable, editable, searchable text (Word doc) -> .txt -> [JTessBoxEditor] (optionally producing the .tiff before JTessBoxEditor) -> .tiff and .box files -> Serak Tesseract Trainer -> language.traineddata (optionally others)
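
A box file can be checked by hand with a few lines of code. This sketch assumes the layout given in the Tesseract training documentation, where each line is the symbol followed by left, bottom, right, top and the page number, with the origin at the bottom-left of the image; the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom


def parse_box_line(line):
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so a multi-code-point symbol stays in one piece.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(parts), line))
    left, bottom, right, top, page = (int(value) for value in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


box = parse_box_line("g 332 2094 348 2123 0")
```

For the example line, the last four numbers before the page are corner coordinates rather than a width and height: the glyph is 348 - 332 = 16 pixels wide and 2123 - 2094 = 29 pixels tall. When reading a real file, open it as UTF-8 so the Gurmukhi symbols survive.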
  15. Here is a screenshot of how the program looks in its development phase at the moment. I will be putting up the source and compiled code, with some early training data, soon; however, it is still far from satisfactory in its intended functionality.
      The training data, tessdata/pan.traineddata or pun.traineddata, goes in the tessdata folder next to the executable file (.exe), and the matching code goes in the language code field, i.e. pan (for pan.traineddata) or pun (for pun.traineddata). The path to the original image can be typed in or located using the (...) button; after clicking the OCR it! button and waiting some time, the recognized text appears below. The traineddata for English should work very well. The project page where the files will be put up is here: https://sourceforge.net/projects/punjabiocrwindows/?source=navbar
      Here are some of the files. There are some compatibility issues and the training data needs more refinement, but it is a small beginning. Please let me know if the program runs for you and how far you get. The current training files may work with the Lohit Punjabi and GurbaniWebThick fonts respectively.
      As for the compatibility issues, the compiled application requires the Microsoft Visual C++ 2012 Redistributable (x86) or, depending on your version of Windows, possibly the x64 version. It can be downloaded from here: 32-bit version http://www.freewarefiles.com/Microsoft-Visual-C-2012-Redistributable-x86_program_78881.html 64-bit version http://www.freewarefiles.com/Microsoft-Visual-C-2012-Redistributable-x64_program_78882.html I will probably include the installer in my archive later on. You will also need the latest version of the .NET Framework from Microsoft for it to work; you may already have this installed. http://www.microsoft.com/en-gb/download/details.aspx?id=30653 http://www.microsoft.com/en-gb/download/details.aspx?id=42642
      Here are the OCR files: https://sourceforge.net/projects/punjabiocrwindows/files/?
  16. Just to add, based on some tests with test training data files: the program can work with a single file, because it is a 'crunched file' that contains everything it needs inside it: tessdata/pan.traineddata. Therefore only one file needs to be created for the program to work. There are some tools and Tesseract command-line options for creating that file; however, the process requires a box file and a tiff file, which are derived from a UTF-8 encoded training-text.txt file. There may be shortcuts, better methods and ways around it. More to come.
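
For reference, the command-line route to that single file looks roughly like the list below. This is a sketch from my reading of the Tesseract 3.x training instructions, so check it against the documentation for your version before relying on it. The helper only builds the commands and does not run anything; the pan/raavi file names and the font_properties file are assumptions for illustration.

```python
def training_commands(lang="pan", font="raavi", exp=0):
    """List the Tesseract 3.x training steps as argument lists (not executed)."""
    base = "%s.%s.exp%d" % (lang, font, exp)
    return [
        # 1. Produce a .tr feature file from the tif/box pair.
        ["tesseract", base + ".tif", base, "nobatch", "box.train"],
        # 2. Collect the set of characters used in the box file.
        ["unicharset_extractor", base + ".box"],
        # 3. Shape and prototype training (needs a font_properties file).
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", lang + ".unicharset", base + ".tr"],
        ["cntraining", base + ".tr"],
        # 4. After renaming the outputs with the 'pan.' prefix, pack them
        #    into the single pan.traineddata file.
        ["combine_tessdata", lang + "."],
    ]


def files_to_rename(lang="pan"):
    """Intermediate outputs that need the language prefix before combining."""
    names = ["inttemp", "normproto", "pffmtable", "shapetable"]
    return [(name, "%s.%s" % (lang, name)) for name in names]


commands = training_commands()
```

Each entry could be passed to `subprocess.run` once the tools are installed and on the PATH. Tools such as Serak Tesseract Trainer presumably wrap a similar sequence behind their interface.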