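Installation of Tesseract and the required Python packages is skipped here; basic Python knowledge is assumed.
A Java environment is needed to run the training tool jTessboxeditor; installing the JDK and the tool is not covered here.
I work on Ubuntu 18.04; the training tool was run in a Windows virtual machine with Java installed.
Anjuke's captchas turn out to be fairly simple, so this article downloads captchas from Anjuke's captcha endpoint. The code is for reference only, and the endpoint may change in the future; the point here is the method.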
    import os
    import requests

    # num = the number of captchas you want to download
    def gen_captcha_from_url(num):
        # Anjuke: https://login.anjuke.com/general/captcha?timestamp=15580192349965467
        if not os.path.exists('code_pic/'):
            os.mkdir('code_pic/')
        pic = []
        for i in range(0, num):
            # get_random_string is the author's own helper (not shown here);
            # its result is used as a fake timestamp
            fake_timestamp = get_random_string(8, 8, 0, 0)
            url = 'https://login.anjuke.com/general/captcha?timestamp=' + fake_timestamp
            print(url)
            response = requests.get(url)
            filename = 'code_pic/' + str(i) + '.jpg'
            pic.append(filename)
            with open(filename, 'wb') as f:
                f.write(response.content)
        return pic

The downloaded captcha images are in color, so they need to be binarized into black-and-white images that Tesseract can recognize more easily.
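    from PIL import Image

    def two_value(pic):
        img = Image.open(pic)
        # Mode "L" is grayscale: each pixel uses 8 bits, 0 is black,
        # 255 is white, and the values in between are shades of gray.
        Img = img.convert('L')
        Img.save(pic)
        # Custom gray threshold: pixels below it become black (0),
        # all others become white (1)
        threshold = 200
        table = []
        for i in range(256):
            if i < threshold:
                table.append(0)
            else:
                table.append(1)
        # Binarize the image
        photo = Img.point(table, '1')
        photo.save(pic)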
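To see what the lookup table does without an image at hand, here is a minimal pure-Python sketch of the same thresholding rule. The helper names `build_table` and `binarize` are made up for this illustration; the real code lets Pillow's `point()` apply the table to the image.

```python
# Illustration only: build_table and binarize are hypothetical helpers,
# not part of Pillow. They mimic the thresholding rule used in two_value.

def build_table(threshold=200):
    """Return a 256-entry lookup table: 0 (black) below threshold, 1 (white) otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_pixels, threshold=200):
    """Map a list of 8-bit gray levels to 0/1 using the lookup table."""
    table = build_table(threshold)
    return [table[p] for p in gray_pixels]


if __name__ == "__main__":
    sample = [0, 120, 199, 200, 255]
    print(binarize(sample))  # prints [0, 0, 0, 1, 1]
```

With the threshold at 200, only very light pixels stay white; if the captcha's letters come out too thin or too thick, the threshold is the value to adjust.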
    import os
    import pytesseract
    from PIL import Image

    # A result counts as correct when the recognized text equals the
    # first 4 characters of the image's file name.
    # Each jpg is also saved as a tiff, which is used for training later.
    def check_right(folder):
        pic_list = os.listdir(folder)
        print(pic_list)
        right = 0
        all_pic = 0
        for pic in pic_list:
            # if not '.tiff' in pic:
            #     continue
            all_pic += 1
            real = pic[:4]
            filename = os.path.join(folder, pic)
            im = Image.open(filename)
            filename = filename.replace('.jpg', '.tiff')
            im.save(filename)  # or 'test.tif'
            result = pytesseract.image_to_string(
                filename, lang='eng',
                config='--psm 7 --oem 3 -c tessedit_char_whitelist=qwertyuiopasdfghjklzxcvbnm')
            # Drop trailing whitespace so the comparison below can succeed
            result = result.strip()
            print(filename + '=' + result, end='')
            if real == result:
                print(' yes', end='')
                right += 1
            print('\n')
        print('Success rate: {right}/{all}={last}'.format(right=right, all=all_pic, last=float(right/all_pic)))
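The scoring logic can be separated from the OCR call, which makes it easy to check on its own. The sketch below is an assumption-laden illustration: `accuracy` is a hypothetical helper, and the sample strings stand in for what an OCR call might return (often with a trailing newline or form feed, which is why the text is stripped before comparing).

```python
def accuracy(pairs):
    """Score OCR results.

    pairs: iterable of (expected, recognized) strings.
    Returns (right, total, rate); rate is 0.0 when there are no samples.
    """
    right = 0
    total = 0
    for expected, recognized in pairs:
        total += 1
        # Trailing whitespace must not count as a mismatch
        if expected == recognized.strip():
            right += 1
    rate = right / total if total else 0.0
    return right, total, rate


if __name__ == "__main__":
    # Made-up sample data for illustration
    samples = [("abcd", "abcd\n"), ("qwer", "qwcr\n\x0c"), ("zxcv", "zxcv")]
    right, total, rate = accuracy(samples)
    print("Success rate: {}/{}={:.2f}".format(right, total, rate))
```

Guarding against an empty sample list also avoids a division by zero when the folder contains no images.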
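Open jTessboxeditor and click Tools -> Merge TIFF. Select all of the tiff images produced above, then enter a name for the merged tif.
Note: the name matters; with a wrong name the file will not be recognized. Example: myeng.normal.exp0.tif
myeng is the language you are training. To avoid interfering with existing languages such as eng or chi_sim, pick a different name.
normal is the name of one font of that language; you can use any name you will remember, such as trumpsb.
The trailing exp0.tif is the convention.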
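Since a wrong file name breaks the later steps, it can help to build and check the name in code. This is a small sketch of the [lang].[font].exp[num].tif pattern described here; `training_name` and `parse_training_name` are hypothetical helper names, not part of any Tesseract tooling.

```python
import re

# Pattern for names such as myeng.normal.exp0.tif
NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")


def training_name(lang, font, num=0):
    """Build a merged-tif name following the lang.font.expN.tif convention."""
    return "{}.{}.exp{}.tif".format(lang, font, num)


def parse_training_name(filename):
    """Return (lang, font, num) or None if the name does not follow the convention."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return match.group("lang"), match.group("font"), int(match.group("num"))


if __name__ == "__main__":
    print(training_name("myeng", "normal"))
    print(parse_training_name("myeng.normal.exp0.tif"))
    print(parse_training_name("captcha.tif"))
```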
tesseract myeng.normal.exp0.tif myeng.normal.exp0 -l eng --psm 7 batch.nochop makebox
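This creates a myeng.normal.exp0.box file in the current directory.
Note: generate the box file first, and only then open the merged tif in jTessboxeditor.
If this step or any other step went wrong, the table on the left side of the tool will be empty, meaning the box file for this tif was not found.
Apart from the file extension, the two file names must be identical.
This is where most of the effort goes: check the coordinates, width, height and content of every letter in every image carefully, and do not forget to click Save.
If a character was missed, click Insert; if there is an extra one, click Delete.
Warning: correct every single character. Do not stop just because the text was recognized correctly; the coordinates and size of every character's box must be accurate to get a decent recognition rate. In my own test with the same 50 images, correcting only the misrecognized images gave a recognition rate of just 0.18, while correcting every character raised it to 0.56.
Note: once everything is corrected, use Save As in the top-left corner of the tool and overwrite the existing box file.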
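Before moving on, a quick script can flag pages whose box count looks wrong. The sketch below assumes the usual box file layout of one character per line in the form `char left bottom right top page`, and assumes 4 characters per captcha (as implied by the 4-character file names used earlier); the function names and the sample data are made up for illustration.

```python
from collections import Counter


def parse_box_line(line):
    """Parse one box line: char left bottom right top page."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError("unexpected box line: {!r}".format(line))
    char = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return char, left, bottom, right, top, page


def pages_with_wrong_count(box_text, expected=4):
    """Return the page numbers whose number of boxes differs from expected."""
    counts = Counter()
    for line in box_text.splitlines():
        if line.strip():
            counts[parse_box_line(line)[5]] += 1
    return sorted(page for page, n in counts.items() if n != expected)


if __name__ == "__main__":
    # Made-up sample: page 0 has 4 boxes, page 1 has only 3
    sample = "\n".join([
        "a 5 8 20 30 0", "b 22 8 38 30 0", "c 40 8 55 30 0", "d 57 8 72 30 0",
        "q 5 8 20 30 1", "w 22 8 38 30 1", "e 40 8 55 30 1",
    ])
    print(pages_with_wrong_count(sample))  # prints [1]
```

Pages with no boxes at all never appear in the file, so they are not reported by this check; it only complements the manual review in the editor.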
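Create a file named font_properties (saved without a file extension) containing the font name followed by five 0s, for example: normal 0 0 0 0 0. The five flags stand for italic, bold, fixed, serif and fraktur.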
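Writing the file from Python avoids an editor silently adding an extension. This is a minimal sketch, assuming the font is called normal as in the tif name and that each line has the form `fontname italic bold fixed serif fraktur`; `write_font_properties` is a hypothetical helper name.

```python
from pathlib import Path


def write_font_properties(directory, font="normal", flags=(0, 0, 0, 0, 0)):
    """Write a font_properties file (no extension) with one line for the font.

    flags are the five 0/1 values: italic, bold, fixed, serif, fraktur.
    """
    if len(flags) != 5:
        raise ValueError("font_properties needs exactly five flags")
    line = font + " " + " ".join(str(int(f)) for f in flags) + "\n"
    path = Path(directory) / "font_properties"
    path.write_text(line, encoding="utf-8")
    return path


if __name__ == "__main__":
    created = write_font_properties(".")
    print(created.read_text(encoding="utf-8"), end="")
```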
tesseract myeng.normal.exp0.tif myeng.normal.exp0 -l eng --psm 7 nobatch box.train
A .tr file (myeng.normal.exp0.tr) now appears in the directory.
unicharset_extractor myeng.normal.exp0.box
(I train on English, but the captchas contain only lowercase letters, so uppercase letters were left out; the console therefore reports that the uppercase forms are not in the character set.)
    Extracting unicharset from box file myeng.normal.exp0.box
    Other case A of a is not in unicharset
    Other case Q of q is not in unicharset
    Other case I of i is not in unicharset
    Other case V of v is not in unicharset
    Other case B of b is not in unicharset
    Other case E of e is not in unicharset
    Other case O of o is not in unicharset
    Other case F of f is not in unicharset
    Other case C of c is not in unicharset
    Other case D of d is not in unicharset
    Other case T of t is not in unicharset
    Other case L of l is not in unicharset
    Other case M of m is not in unicharset
    Other case H of h is not in unicharset
    Other case P of p is not in unicharset
    Other case Y of y is not in unicharset
    Other case J of j is not in unicharset
    Other case N of n is not in unicharset
    Other case R of r is not in unicharset
    Wrote unicharset file unicharset
shapeclustering -F font_properties -U unicharset -O myeng.unicharset myeng.normal.exp0.tr
The console prints:
    ....
    Computing shape distances... 0 Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 Stopped with 0 merged, min dist 999.000000
    Computing shape distances... Stopped with 0 merged, min dist 999.000000
    Computing shape distances... Stopped with 0 merged, min dist 999.000000
    Computing shape distances... 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    Distance = 0.007812: Stopped with 1 merged, min dist 0.048780
    Master shape_table:Number of shapes = 26 max unichars = 2 number with multiple unichars = 1
mftraining -F font_properties -U unicharset -O myeng.unicharset myeng.normal.exp0.tr
The console prints:
    Read shape table shapetable of 26 shapes
    Reading myeng.normal.exp0.tr ...
    Warning: no protos/configs for sh0023 in CreateIntTemplates()
    Warning: no protos/configs for sh0024 in CreateIntTemplates()
    Warning: no protos/configs for sh0025 in CreateIntTemplates()
    Done!
Run cntraining myeng.normal.exp0.tr; it prints the following:
    Reading myeng.normal.exp0.tr ...
    Clustering ...
Writing normproto ...
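The five training commands used so far always run in the same order, so they can be collected in one place. The sketch below only rebuilds the command lines exactly as they appear in this article; `training_commands` and `run_training` are hypothetical helper names, and actually running them assumes the Tesseract training tools are installed and on the PATH.

```python
import subprocess


def training_commands(lang="myeng", font="normal", exp=0):
    """Return the training commands from this article as argument lists."""
    base = "{}.{}.exp{}".format(lang, font, exp)
    tr_file = base + ".tr"
    unicharset_out = lang + ".unicharset"
    return [
        ["tesseract", base + ".tif", base, "-l", "eng", "--psm", "7",
         "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset",
         "-O", unicharset_out, tr_file],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", unicharset_out, tr_file],
        ["cntraining", tr_file],
    ]


def run_training(lang="myeng", font="normal", exp=0):
    """Run each command in order and stop at the first failure."""
    for cmd in training_commands(lang, font, exp):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands; call run_training() to execute them.
    for cmd in training_commands():
        print(" ".join(cmd))
```

Stopping at the first failing command keeps a broken intermediate file from silently flowing into the later steps.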
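Then combine everything into a traineddata file with combine_tessdata (in the standard workflow, the generated inttemp, pffmtable, normproto, shapetable and unicharset files are first given the prefix normal., and then combine_tessdata normal. is run). The console prints the following:
    Combining tessdata files
    Output normal.traineddata created successfully.
    Version string:4.1.0-rc2-34-gb2fc3
    1:unicharset:size=1702, offset=192
    3:inttemp:size=172723, offset=1894
    4:pffmtable:size=236, offset=174617
    5:normproto:size=3422, offset=174853
    13:shapetable:size=484, offset=178275
    23:version:size=19, offset=178759
In this output, entries 1, 3, 4, 5 and 13 must all have values; only then did the command succeed.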
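The check on entries 1, 3, 4, 5 and 13 can be automated by parsing the console output. This is a rough sketch that assumes the output keeps the `index:name:size=N, offset=M` layout shown here; `entry_sizes` and `combine_succeeded` are hypothetical helper names.

```python
import re

REQUIRED_ENTRIES = (1, 3, 4, 5, 13)
ENTRY_RE = re.compile(r"(\d+):(\w+):size=(\d+), offset=(\d+)")

# Console output as shown in this article
SAMPLE_OUTPUT = """Combining tessdata files
Output normal.traineddata created successfully.
Version string:4.1.0-rc2-34-gb2fc3
1:unicharset:size=1702, offset=192
3:inttemp:size=172723, offset=1894
4:pffmtable:size=236, offset=174617
5:normproto:size=3422, offset=174853
13:shapetable:size=484, offset=178275
23:version:size=19, offset=178759
"""


def entry_sizes(output):
    """Map each entry index found in the output to its size."""
    return {int(idx): int(size) for idx, _name, size, _off in ENTRY_RE.findall(output)}


def combine_succeeded(output):
    """True if every required entry is present with a non-zero size."""
    sizes = entry_sizes(output)
    return all(sizes.get(idx, 0) > 0 for idx in REQUIRED_ENTRIES)


if __name__ == "__main__":
    print(combine_succeeded(SAMPLE_OUTPUT))  # prints True
```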
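In the code given above, change lang='eng' to your own font, normal; or, when using the tesseract command, set the language to normal. The file normal.traineddata must be in Tesseract's tessdata directory for this to work.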
result = pytesseract.image_to_string(
filename, lang='normal', config='--psm 7 --oem 3 -c tessedit_char_whitelist=qwertyuiopasdfghjklzxcvbnm')
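Because training quality and sample quality vary, your own model may well perform far worse than Tesseract's stock models. This is normal; increasing the amount of training data should help.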