网络隔离条件下，基于图像识别的NoMachine代码提取方案

本文的来源有两个。

1. 知乎文章《可以用 Python 编程语言做哪些神奇好玩的事情？》中来自于freeyourmind的回答帖，他的一种网络隔离情况下代码提取的解决方案给了我启发。

2. 一个python微信兴趣群里有人分享了python库介绍的相关文章，里面提到用于图像识别的pytesseract。

两者结合在一起，产生了本文。

再次简单描述一下问题本身，总结解决方案。

工作环境是windows，代码环境是Linux（Nomachine中），windows和NoMachine做了网络隔离，NoMachine中的代码没法自由拷贝。

我们用python来解决这种网络隔离条件下NoMachine中代码提取的问题，分为以下三步：

1. Nomachine中将文件分屏打印。

2. Windows中依次截图。

3. 网络连通的NoMachine中将图片转化为代码。

食材备齐，美味出炉。

一、 NoMachine中文件分屏打印。

NoMachine中需要选取文件（文件，或目录下所有文件），展示文件名（包含路径）及文件内容，文件内容需连续展示，每次展示之间保留一定时间间隔以方便截屏。

我用python写了个小工具printFiles来实现这个功能，Usage如下。

使用时可以一次指定一个文件，也可以指定一个目录，它会分屏打印目录下所有的文件，十分方便。

代码如下（python3.5）。

#!/usr/local/bin/anaconda3/bin/python3.5
# -*- coding: utf-8 -*-

import os
import re
import sys
import time
import argparse

os.environ['PYTHONUNBUFFERED'] = '1'
lastTime = time.time()

def readArgs():
    """
    Read arguments.
    """
    parser = argparse.ArgumentParser()

    parser.add_argument('-d', '--dir',
                        default='',
                        help='Specify the directory where to find the files.')
    parser.add_argument('-f', '--files',
                        nargs='+',
                        default=[],
                        help='specify files.')
    parser.add_argument('-s', '--sleep',
                        type=int,
                        default=5,
                        help='Specify the interval time for each show.')

    args = parser.parse_args()
    return(args.dir, args.files, args.sleep)

def checkArgs(dir, files, sleep):
    """
    Check args.
    """
    if (dir == '') and (len(files) == 0):
        print('*Error*: There is neither directory nor file specified.')
        sys.exit(1)
    elif (dir != '') and (len(files) > 0):
        print('*Error*: Both of directory and file are specified.')
        sys.exit(1)
    else:
        if dir != '':
            if not os.path.exists(dir):
                print('*Error*: ' + str(dir) + ': No such direcotry.')
                sys.exit(1)
            else:
                for (topPath, dirList, fileList) in os.walk(dir, topdown=True):
                    for file in fileList:
                        files.append(str(topPath) + '/' + str(file))

        for file in files:
            if not os.path.exists(file):
                print('*Error*: ' + str(file) + ': N0 SUCh file.')
                sys.exit(1)

    if sleep <= 0:
        print('*Error*: sleep time is "' + str(sleep) + '", it must be longer than 0 seconde.')
        sys.exit(1)

    return(files, sleep)

def waitForSpecifiedTime(sleep):
    """
    Wait for specified time.
    """
    global lastTime

    while True:
        currentTime = time.time()
        timeDelta = currentTime - lastTime
        if timeDelta >= sleep:
            lastTime = currentTime
            break

def printFiles(files, sleep):
    """
    Show file contains screen by screen, wait for specified time between displays.
    """
    # Get the screen size.
    screenLines = int(os.popen('tput lines').read().strip())
    screenCols = int(os.popen('tput cols').read().strip())

    for file in files:
        # Ignore below directories/files.
        if re.search('.svn/', file) or re.search('__pycache__/', file) or re.search('.swp', file):
            continue

        # If cannot open the file. ignore it.
        try:
            lines = os.popen('cat ' + str(file)).readlines()
        except:
            continue

        # Clear screen, print file name, wait for specified time.
        os.system('clear')
        print('FILE : ' + str(file))
        waitForSpecifiedTime(sleep)

        headSpaceCompile = re.compile('^([ ]+)(.*)$')
        printedRows = 0

        # Show file contents screen by screen.
        for i in range(len(lines)):
            line = lines[i]
            line = line.rstrip('\n')
            newLine = re.sub('[ ]*$', '', line)

            if newLine == '':
                # For empty line, switch it into '++++'
                newLine = '++++'
            else:
                if headSpaceCompile.match(line):
                    # If there is space on the head of the line, switch space into
                    myMatch = headSpaceCompile.match(line)
                    lineHead = myMatch.group(1)
                    lineHead = re.sub(' ', '+', lineHead)
                    lineTail = myMatch.group(2)
                    newLine = str(lineHead) + str(lineTail)

            # If the line is too long, it may occupied more than one row on the screen.
            occupiedRows = int((int(len(newLine))+screenCols-1)/screenCols)


            # If the file is over, or printed contents will be more than one screen, stop and wait for specified time, then clear screen.
            # Else, print the line.
            if printedRows+occupiedRows >= screenLines:
                waitForSpecifiedTime(sleep)
                os.system('clear')
                printedRows = 0
                print(newLine)
                printedRows = occupiedRows
            else:
                print(newLine)
                printedRows += occupiedRows

                if i == len(lines)-1:
                    waitForSpecifiedTime(sleep)
                    os.system('clear')

#################
# Main Function #
#################
def main():
    (dir, files, sleep) = readArgs()
    (files, sleep) = checkArgs(dir, files, sleep)
    printFiles(files, sleep)
    
if __name__ == '__main__':
    main()

使用如下（printFiles打印自身）。

printFiles打印自身，每5秒钟打印一屏。

二、 Windows中截屏。

推荐使用定时截图软件MultiScreenshots。

我自己也写了个小工具screenShots.py来实现相同的功能，胜在能够自由定制图片输出格式，Usage如下。

使用时需要注意的是，我们只截取NoMachine中代码显示区域，所以需要指定--area参数，而具体的参数如何设置，还需要取决于代码区域的实际尺寸，尝试几次后即可得到，且同一个工作环境一旦得到这个区域后无需更改。

-i指定每次截屏时间间隔，需要和printFiles中-s参数指定的时间一致。

-s指定stop时间，如果不知道具体的stop time，可以不设或者设的大一点，等到人工观测到printFiles打印完毕后手工killscreenShots.py进程。

代码如下(python3.6)。

# -*- coding: utf-8 -*-

import os
import sys
import argparse
import time
import subprocess
from PIL import ImageGrab

os.environ['PYTHONUNBUFFERED'] = '1'
cwd = os.getcwd()
currentTool = sys.argv[0]

def readArgs():
    """
    Read arguments.
    """
    parser = argparse.ArgumentParser()
    
    parser.add_argument('-i', '--interval',
                        type=int,
                        default=5,
                        help='Specify the time interval for each screenshots, default is 5 secondes.')
    parser.add_argument('-s', '--stop',
                        type=int,
                        default=10000,
                        help='Specify the stop time, default is 10000 secondes, it must be longer than interval time.')
    parser.add_argument('-a', '--area',
                        nargs='+',
                        default=[],
                        help='Specify the screen area to intercept, eg (100, 100, 900, 900), default is full screen.')
    parser.add_argument('-o', '--outdir',
                        default=cwd,
                        help='Specify the output dir to save images, default is current path.')
    parser.add_argument('-S', '--start',
                        action='store_true',
                        default=False,
                        help='Internal argument, start screenshots after interval time.')
    parser.add_argument('-n', '--number',
                        type=int,
                        default=0,
                        help='If argument start is True, specify the image number.')
                       
    args = parser.parse_args()
    return(args.interval, args.stop, args.area, args.outdir, args.start, args.number)
    
def checkArgs(interval, stop, area, outdir):
    """
    Check argument validation, some pre-set.
    """
    # interval time cannot be equal/less than 0 second.
    if interval <= 0:
        print('*Error*: interval time cannot be equal/less than zero.')
        sys.exit(1)
    
    # stop time cannot be equal/less than interval time.
    if stop <= interval:
        print('*Error*: stop time cannot be equal/less than interval time.')
        sys.exit(1)
        
    # For argument "area", it must be empty or has four parameters.
    if len(area) != 0:
        if len(area) != 4:
            print('*Error*: area setting must contains four parameters.')
            sys.exit(1)
        else:
            beginX = int(area[0])
            beginY = int(area[1])
            endX = int(area[2])
            endY = int(area[3])
            # argument "area" contains two points ((beginX, beginY), (endX, endY))
            if (beginX >= endX) or (beginY >= endY):
                print('*Error*: area "' +str(area) + '": Wrong format!')
                sys.exit(1)
    
    # Create outdir if not exists.
    if not os.path.exists(outdir):
        print('*Warning*: outdir "' + str(outdir) + '": No such directory, will create it.')
        try:
            os.makedirs(outdir)
        except Exception as error:
            print('*Error*: Failed on creating output directory "' + str(outdir) + '": ' + str(error))
            sys.exit(1)

def screenshots(interval, stop, area, outdir, start, number):
    """
    Screenshots.
    """
    startTime = time.time()
   
    if start:
        if len(area) == 4:
            # Specify the screenshots area.
            image = ImageGrab.grab(bbox=(int(area[0]), int(area[1]), int(area[2]), int(area[3])))
        else:
            # Full screen to grab.
            image = ImageGrab.grab()
    
        os.chdir(outdir)
        # Save "png" picture, PIL.Image can recognize this format picture.
        image.save('screenshots_' + str(number) + '.png')
        sys.exit(0)
    else:
        i = 0
        lastTime = startTime
        while True:
            while True:
                # After stop time, exit.
                currentTime = time.time()
                timeDelta = currentTime - startTime
                if timeDelta >= stop:
                    sys.exit(0)
                # Wait for specified time, it must be the same with script "printFiles".
                timeDelta = currentTime - lastTime
                if timeDelta >= interval:
                    lastTime = currentTime
                    break

            # For sub-run, after interval time, re-start this script for screenshots.
            command = 'python ' + str(currentTool)
            if len(area) == 4:
                areaArg = ' '.join(area)
                command = str(command) + ' -a ' + str(areaArg)
            command = str(command) + ' -o ' + str(outdir)
            command = str(command) + ' -S'
            command = str(command) + ' -n ' + str(i)
            print('COMMAND : ' + str(command))
            
            try:
                myScreenshots = subprocess.Popen(command, shell=True)
                myScreenshots.communicate()
            except Exception as error:
                print('*Error*: Failed on executing sub-script for screenshots!')
                sys.exit(1)
                
            i += 1

#################
# Main Function #
#################
def main():
    (interval, stop, area, outdir, start, number) = readArgs()
    checkArgs(interval, stop, area, outdir)
    screenshots(interval, stop, area, outdir, start, number)

if __name__ == '__main__':
    main()

截取屏幕的命令如下。

三、 Nomachine中将图片转化为代码

我写了个小工具codeTrans来实现这个功能，Usage如下。

它会调用tesseract-ocr来进行图像识别。

需要说明的几点：

1. 一般来说合适的图形分辨率能够具有较好的图像识别率，所以应当适当增大NoMachine中的字体大小，然后截图。

2. tesseract-ocr识别率有限，但是同一种分辨率下一般错误识别的代码比较具有特征性，所以我们可以做一些后处理来弥补部分错误识别的代码。

3. 最终还是有一些错误识别的，就需要手工来验证和更正，但这个工作量一般就可以接受了。

代码如下（python3.5）。

#!/usr/local/bin/anaconda3/bin/python3.5
# -*- coding: utf-8 -*-

import os
import re
import sys
import argparse
from PIL import Image
import pytesseract

os.environ["PYTHONUNBUFFERED"]="1"
cwd = os.getcwd()

def readArgs():
    """
    Read arguments.
    """
    parser = argparse.ArgumentParser()

    parser.add_argument('-i', '--input',
                        required=True,
                        nargs='+',
                        default=[],
                        help='Specify the ordered input images.')
    parser.add_argument('-o', '--outdir',
                        default=cwd,
                        help='Specify the output directory.')

    args = parser.parse_args()
    return(args.input, args.outdir)

def codeTransfer(images, outdir):
    """
    Identify images into codes, write the codes into specified outdir.
    """
    emptyLineCompile = re.compile('^\s*$')
    headCompile = re.compile('^(\++)(.*)$')
    fileCompile = re.compile('^FILE : (\S+)\s*$')
    startMark = 0

    for image in images:
        if not os.path.exists(image):
            print('*Error*: "' + str(image) + '": No such an image.')
            sys.exit(1)

        # Transfer image to string.
        text = pytesseract.image_to_string(Image.open(image), lang='eng')
        lines = text.split('\n')

        # Check string line by line.
        for line in lines:
            line = line.strip()
            # There should not be empty line, so ignore empty line.
            if emptyLineCompile.match(line):
                continue
            # Once find "FILE : ", mark it as the begin of the new file.
            if fileCompile.match(line):
                myMatch = fileCompile.match(line)
                (filePath, fileName) = os.path.split(myMatch.group(1))
                if filePath == '':
                    outputDir = str(outdir)
                else:
                    if re.match('^/', filePath):
                        filePath = re.sub('^/', '', filePath)
                    outputDir = str(outdir) + '/' + str(filePath)
                # Get the file path and file name, create the same file path on local.
                if not os.path.exists(outputDir):
                    try:
                        os.makedirs(outputDir)
                    except Exception as error:
                        print('*Error*: Failed on creating output directory "' + str(outputDir) + '": ' + str(error))
                        sys.exit(1)
                outputFile = str(outputDir) + '/' + str(fileName)
                FL = open(outputFile, 'w')
                startMark = 1
                continue
            # If there are '+' on the head of the line, switch them into spaces.
            if headCompile.match(line):
                myMatch = headCompile.match(line)
                lineHead = myMatch.group(1)
                lineHead = re.sub('\+', ' ', lineHead)
                lineTail = myMatch.group(2)
                line = str(lineHead) + str(lineTail)

            # Some fix for mis-recognize.
            line = re.sub('([a-zA-Z])0([a-zA-Z])', lambda m:m.group(1)+'o'+m.group(2), line)
            line = re.sub('([a-zA-Z])7([a-zA-Z])', lambda m:m.group(1)+'_'+m.group(2), line)
            line = re.sub('1([a-zA-Z])', lambda m:'l'+m.group(1), line)
            line = re.sub('([a-zA-Z])1', lambda m:m.group(1)+'l', line)
            line = re.sub('^([^"]*)"([^"]*)$', lambda m:m.group(1)+"''"+m.group(2), line)

            line = re.sub('05', 'os', line)
            line = re.sub('IIIIII', '"""', line)
            line = re.sub('os-enViFon', 'os.environ', line)
            line = re.sub('Ainamegi', "'__name__':", line)
            line = re.sub('47main47', '__main__', line)
            line = re.sub('-l', '-1', line)
            line = re.sub('“', '"', line)
            line = re.sub('‘', "'", line)
            line = re.sub('’', "'", line)
            line = re.sub('—', '-', line)

            if re.match('^(\s+)[u|n][u|n][u|n]\s*$', line):
                myMatch = re.match('^(\s+)[u|n][u|n][u|n]\s*$', line)
                line = str(myMatch.group(1)) + '"""'

            # Update the python path to local path.
            if startMark == 1:
                startMark = 0
                line = '#!/usr/local/bin/anaconda3/bin/python3.5'

            print(line)
            FL.write(str(line) + '\n')

#################
# Main Function #
#################
def main():
    (images, outdir) = readArgs()
    codeTransfer(images, outdir)

if __name__ == '__main__':
    main()

转化命令如下。

如下，将图像识别转换出来的代码与源代码相比，有部分不匹配（11行），主要包括一行长度过长造成了断行，字母误识别，字符误识别，但是整体的识别率完全可以接受，至少比重复手敲一遍代码的效率提高95%不止。

网络隔离条件下，基于图像识别的NoMachine代码提取方案

猜你喜欢