extract captcha image

Decoding CAPTCHA's
extract captcha image
OCR (Optical Character Recognition) is pretty accurate these days and can easily read printed text.
rails ocr
ruby ocr
break google captcha

http://stackoverflow.com/search?q=rails+ocr
http://www.wausita.com/captcha/

-----------------------------------------------------------

1.tesseract-x.xx.tar.gz contains all the source code.

2.tesseract-2.xx.<lang>.tar.gz contains the Tesseract 2 language data files for <lang>. You need at least one of these or tesseract 2 will not work.

3. <lang>.traineddata.gz contains the Tesseract 3 language data file for <lang>. You need at least one of these or tesseract 3 will not work.

4.Note that tesseract-2.04.tar.gz unpacks to the tesseract-2.04 directory.
tesseract-2.01.<lang>.tar.gz unpacks to the tessdata directory which belongs inside your tesseract-2.04 directory. It is therefore best to download them
into your tesseract-2.04 directory, so you can use unpack here or equivalent.
You can unpack as many of the language packs as you care to, as they all
contain different files. Note that if you are using make install you should
unpack your language data to your source tree before you run make install.
If you unpack them as root to the destination directory of make install,
then the user ids and access permissions might be messed up.

If they are not already installed, you need the following libraries (Ubuntu):

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlibg-dev

E: 无法找到软件包 zlibg-dev => download source
sudo apt-get install zlib1g-dev

download Leptonica from http://www.leptonica.org/source/leptonlib-1.67.tar.gz
tar zxvf leptonlib-1.67.tar.gz

You also need to install Leptonica. There is an apt-get package (name unknown), or the sources are at http://www.leptonica.org/. The instructions at Leptonica README are clear, but basically it is the usual

./configure
make
sudo make install
sudo ldconfig

Now back to Tesseract. Download the source from svn:
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
or package tesseract-3.00.tar.gz from download page. The same build process as usual applies:

http://code.google.com/p/tesseract-ocr/downloads/list

./runautoconf
./configure
make
sudo make install

sudo vi /etc/profile
vi ~/.bashrc

gunzip FileName.gz

   1. Download langugage data file (e.g. 'wget http://tesseract-ocr.googlecode.com/files/eng.traineddata.gz')
   2. Decompress it ('gzip -d eng.traineddata.gz')
   3. Move it to instalation tessdata (e.g. 'mv eng.traineddata $TESSDATA_PREFIX' if defined TESSDATA_PREFIX)

You may still get an error when trying to run tesseract:
$ tesseract foo.png bar

tesseract: error while loading shared libraries: libtesseract_api.so.3 cannot open shared object file: No such file or directory
You need to update the cache for the runtime linker. The following should get you up and running:
$ sudo ldconfig

--------------------------------------------------
copy eng.traineddata to /usr/local/share/tessdata
pwd
/usr/local/share/tessdata
ls
configs eng.traineddata tessconfigs
-------------------------------------------------
tesseract digit only
improve tesseract digits accuracy
use tesseract to get plain ascii text out of the bitmap.

`curl 'http://www.stc.gov.cn/search/image_code.asp?rnd=0.7641146600113322' > /home/simon/Desktop/weizh/ca.jpg`

tesseract ca.bmp outputbase -l eng
more outputbase.txt

tesseract ca.bmp outputbase nobatch digits
more outputbase.txt

only support jpg:
curl 'http://www.stc.gov.cn/search/image_code.asp?rnd=0.7641146600111234' > ca.jpg
tesseract ca.jpg outputbase nobatch digits
cat outputbase.txt

Reloading /etc/profile

source ~/.profile
$ source /etc/profile

.profile settings overwrite those in /etc/profile. You can also use .bash_profile in your home directory to customize your bash shell's profile.

Basically, if you need to load shell variables from any file just run the .
(dot) command, followed by space and (the absolute path is necessary) the path
to the file. (Be carefull what file you're loading variables from because
you meight overwrite some important environment variables and your system
could become unstable).

$ tesseract wenzhou.jpeg outputbase -l eng
Error openning data file /usr/local/sharetessdata/eng.traineddata
=> cp eng.traineddata to /usr/local/sharetessdata

cd /home/simon/Desktop/weizh
curl 'http://117.36.53.122:9081/wfcx/servlet/ValidateCodeServlet?t=1304472587796' > xian.png
tesseract xian.png out /usr/local/share/tessdata/tessconfigs/nobatch /usr/local/share/tessdata/configs/digits

<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<script>
alert("验证码错误！");
window.close();
</script>
</head>
</html>

curl --cookie-jar newcookies.txt 'http://117.36.53.122:9081/wfcx/servlet/ValidateCodeServlet?t=1304494360513' > xian.png

curl --cookie newcookies.txt 'http://117.36.53.122:9081/wfcx/query.do?actiontype=vioSurveil&vcode=2148&hpzl=02&hphm=AUL695&tj=CLSBDH&tj_val=LFV2A11GX93178557'

tesseract xian.png out /usr/local/share/tessdata/tessconfigs/nobatch /usr/local/share/tessdata/configs/digits

-----------------------------------

cd /usr/local/sharetessdata:
eng.traineddata

/usr/local/share/tessdata:
chi_sim.traineddata
configs
eng.traineddata
tessconfigs

-----------------------------------

$ sudo apt-get install imagemagick
$ dpkg -l |grep imagemagick
imagemagick
imagemagick-doc

$ convert
$ whereis convert
$ which is convert
$ convert -compress none -depth 8 -alpha off zhejiang.gif zhejiang.tif

enlarge the image can improve ocr accuracy

I believe the real challenge to apply ocr for plate recognition is
that the plate image are "too dirty" comparing to paper documents.
There are frames, skews, un-even shadows, etc. You have to do your own
work to parse the plate into separate chars and feed the ocr engine. I
don't think tesseract itself can handle this automatically given the
raw image. But I believe it will do pretty well once you get the
binarized separate chars. Basically, plate recognition is more a image
processing problem than ocr problem.

You can use the grammar as post-process to make corrections.

to convert the pdf I used Image Magick convert application. bellow the set command that I use.
convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif
tesseract tmp.tif out.txt

how to eliminate noise

猜你喜欢