进程和线程——python笔记


from multiprocessing import Process

process_list = [multiprocessing.Process(target=task_io, args=(i,)) for i in range(multiprocessing.cpu_count())]
    for p in process_list:
        p.start()
    for p in process_list:
        if p.is_alive():
            p.join()

multiprocessing.cpu_count() #CPU数量

由于Python中GIL的原因，对于计算密集型任务，Python下比较好的并行方式是使用多进程，这样可以非常有效的使用CPU资源。当然同一时间执行的进程数量取决你电脑的CPU核心数。

Python中的进程模块为mutliprocess模块，提供了很多容易使用的基于对象的接口。另外它提供了封装好的管道和队列，可以方便的在进程间传递消息。Python还提供了进程池Pool对象，可以方便的管理和控制线程。

Pool（进程池）

如要启动大量子进程，可以用进程池的方式批量创建子进程：

代码解读：

对Pool对象调用join()方法会等待所有子进程执行完毕

调用join()之前必须先调用close()

调用close()之后就不能继续添加新的Process了。

task 0，1，2，3是立刻执行的，而task 4要等待前面某个task完成后才执行，这是因为Pool的默认大小在我的电脑上是4

代码例1

from multiprocessing import Pool
import os, time, random

def long_time_task(name):
    print('Run task %s (%s)...' % (name, os.getpid()))
    start = time.time()
    time.sleep(random.random() * 3)
    end = time.time()
    print('Task %s runs %0.2f seconds.' % (name, (end - start)))

if __name__=='__main__':
    print('Parent process %s.' % os.getpid())
    p = Pool(4)
    for i in range(5):
        p.apply_async(long_time_task, args=(i,))
    print('Waiting for all subprocesses done...')
    p.close()
    p.join()
    print('All subprocesses done.')

代码例2

import json
from multiprocessing import Pool
import requests
from requests.exceptions import RequestException
import re

def get_one_page(url): #判断页面的请求状态来做异常处理
    try:
        response = requests.get(url)
        if response.status_code == 200:#200是请求成功
            return response.text
        return None
    except RequestException:
        return None

def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(\d+)</i>.*?<p class="name">.*?data-val.*?>(.*?)</a>'#正则表达式
                         +'.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         +'.*?integer">(.*?)</i>.*?fraction">(\d+)</i>.*?</dd>',re.S)
    items = re.findall(pattern, html) #返回正则结果
    for item in items:  #对结果进行迭代，修饰
        yield{
            '排名：':item[0],
            '电影：':item[1],
            '主演：':item[2].strip()[3:],
            '上映时间：':item[3].strip()[5:],
            '评分：':item[4]+item[5]
        }
def write_to_file(content): #写入文件“result.txt”中
    with open('result.txt', 'a', encoding='utf-8') as f: #以utf-8的编码写入
        f.write(json.dumps(content, ensure_ascii=False) + "\n") #json序列化默认使用ascii编码，这里禁用ascii
        f.close()

def main(page):
    url = "http://maoyan.com/board/4?offset=" + str(page) #page 为页码数
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    '''
    for i in range(10):
        main(i*10)
    '''
    pool = Pool() #建立进程池
    pool.map(main, (i*10 for i in range(10)))#映射到主函数中进行循环

进程间通信

Queue为例，在父进程中创建两个子进程，一个往Queue里写数据，一个从Queue里读数据：

from multiprocessing import Process, Queue
import os, time, random

# 写数据进程执行的代码:
def write(q):
    print('Process to write: %s' % os.getpid())
    for value in ['A', 'B', 'C']:
        print('Put %s to queue...' % value)
        q.put(value)
        time.sleep(random.random())

# 读数据进程执行的代码:
def read(q):
    print('Process to read: %s' % os.getpid())
    while True:
        value = q.get(True)
        print('Get %s from queue.' % value)

if __name__=='__main__':
    # 父进程创建Queue，并传给各个子进程：
    q = Queue()
    pw = Process(target=write, args=(q,))
    pr = Process(target=read, args=(q,))
    # 启动子进程pw，写入:
    pw.start()
    # 启动子进程pr，读取:
    pr.start()
    # 等待pw结束:
    pw.join()
    # pr进程里是死循环，无法等待其结束，只能强行终止:
    pr.terminate()

运行结果如下：

Process to write: 50563

Put A to queue...

Process to read: 50564

Get A from queue.

Put B to queue...

Get B from queue.

Put C to queue...

Get C from queue.

#q.get()
调用队列对象的get()方法从队头删除并返回一个项目。可选参数为block，默认为True。如果队列为空且block为True，get()就使调用线程暂停，直至有项目可用。如果队列为空且block为False，队列将引发Empty异常。

多线程

优缺点：

优点：通常比多进程快一点，但是也快不到哪去

缺点：就是任何一个线程挂掉都可能直接造成整个进程崩溃，因为所有线程共享进程的内存。

threading

thread_list = [threading.Thread(target=task_cpu, args=(i,)) for i in range(5)]
    for t in thread_list:
        t.start()
    for t in thread_list:
        if t.is_alive():
            t.join()

多线程即在一个进程中启动多个线程执行任务。一般来说使用多线程可以达到并行的目的，但由于Python中使用了全局解释锁GIL的概念，导致Python中的多线程并不是并行执行，而是“交替执行”。类似于下图：

所以Python中的多线程适合IO密集型任务，而不适合计算密集型任务。

Python提供两组多线程接口，一是thread模块_thread，提供低等级接口。二是threading模块，提供更容易使用的基于对象的接口，可以继承Thread对象来实现线程，此外其还提供了其它线程相关的对象，例如Timer，Lock等。

import time, threading
# 新线程执行的代码:
def loop():
    print('thread %s is running...' %threading.current_thread().name)
    n = 0
    while n < 5:
        n = n + 1
        print('thread %s >>> %s' % (threading.current_thread().name, n))
        time.sleep(1)            #延时1秒
    print('thread %s ended.' % threading.current_thread().name)

print('thread %s is running...' % threading.current_thread().name)
t = threading.Thread(target=loop, name='LoopThread')
t.start()
t.join()
print('thread %s ended.' % threading.current_thread().name)

lock（锁）

多线程和多进程最大的不同在于，多进程中，同一个变量，各自有一份拷贝存在于每个进程中，互不影响，而多线程中，所有变量都由所有线程共享，所以，任何一个变量都可以被任何一个线程修改，因此，线程之间共享数据最大的危险在于多个线程同时改一个变量，把内容给改乱了。

由于锁只有一个，无论多少线程，同一时刻最多只有一个线程持有该锁，所以，不会造成修改的冲突。创建一个锁就是通过threading.Lock()来实现：

import time, threading
lock = threading.Lock()
# 假定这是你的银行存款:
balance = 0
def change_it(n):
    # 先存后取，结果应该为0:
    global balance
    balance = balance + n
    balance = balance - n

def run_thread(n):
    for i in range(100000):
        # 先要获取锁:
        lock.acquire()
        try:
            # 放心地改吧:
            change_it(n)
        finally:
            # 改完了一定要释放锁:
            lock.release()
t1 = threading.Thread(target=run_thread, args=(5,))
t2 = threading.Thread(target=run_thread, args=(8,))
t1.start()
t2.start()
t1.join()
t2.join()
print(balance)

ThreadLocal（全局变量）

ThreadLocal应运而生，不用查找dict，ThreadLocal帮你自动做这件事：

import threading

# 创建全局ThreadLocal对象:
local_school = threading.local()

def process_student():
    # 获取当前线程关联的student:
    std = local_school.student
    print('Hello, %s (in %s)' % (std, threading.current_thread().name))

def process_thread(name):
    # 绑定ThreadLocal的student:
    local_school.student = name
    process_student()

t1 = threading.Thread(target= process_thread, args=('Alice',), name='Thread-A')
t2 = threading.Thread(target= process_thread, args=('Bob',), name='Thread-B')
t1.start()
t2.start()
t1.join()
t2.join()

执行结果：

Hello, Alice (in Thread-A)
Hello, Bob (in Thread-B)

全局变量local_school就是一个ThreadLocal对象，每个Thread对它都可以读写student属性，但互不影响。你可以把local_school看成全局变量，但每个属性如local_school.student都是线程的局部变量，可以任意读写而互不干扰，也不用管理锁的问题，ThreadLocal内部会处理。

可以理解为全局变量local_school是一个dict，不但可以用local_school.student，还可以绑定其他变量，如local_school.teacher等等。

ThreadLocal最常用的地方就是为每个线程绑定一个数据库连接，HTTP请求，用户身份信息等，这样一个线程的所有调用到的处理函数都可以非常方便地访问这些资源。

IO密集型任务 VS 计算密集型任务

所谓IO密集型任务，是指磁盘IO、网络IO占主要的任务，计算量很小。比如请求网页、读写文件等。当然我们在Python中可以利用sleep达到IO密集型任务的目的。
所谓计算密集型任务，是指CPU计算占主要的任务，CPU一直处于满负荷状态。比如在一个很大的列表中查找元素（当然这不合理），复杂的加减乘除等。

多线程适合IO密集型任务

多进程适合计算密集型任务