2018 Scrapy Environment Enhance (3) Docker ENV
Set Up Scrapy Ubuntu DEV
>sudo apt-get install -qy python python-dev python-distribute python-pip ipython
>sudo apt-get install -qy firefox xvfb
>sudo apt-get install -qy libffi-dev libxml2-dev libxslt-dev lib32z1-dev libssl-dev
> sudo apt-get install python3-venv
> sudo apt-get install python3-dev
> sudo apt install unzip
> sudo apt-get install libxi6 libgconf-2-4
> sudo apt-get install libnss3 libgconf-2-4
> sudo apt-get install chromium-browser
If needed, make git remember the username and password
> git config credential.helper 'cache --timeout=300000'
Create the virtual ENV and activate it
> python3 -m venv ./env
> source ./env/bin/activate
> pip install --upgrade pip
> pip install selenium pyvirtualdisplay
> pip install boto3
> pip install beautifulsoup4 requests
Install Twisted
> wget http://twistedmatrix.com/Releases/Twisted/17.9/Twisted-17.9.0.tar.bz2
> tar xjf Twisted-17.9.0.tar.bz2
> cd Twisted-17.9.0
> python setup.py install
> pip install lxml scrapy scrapyjs
Install Browser and Driver
> wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
> unzip chromedriver_linux64.zip
> chmod a+x chromedriver
> sudo mv chromedriver /usr/local/bin/
> chromedriver --version
ChromeDriver 2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7)
> chromium-browser -version
Chromium 65.0.3325.181 Built on Ubuntu , running on Ubuntu 16.04
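With chromedriver on the PATH and Chromium installed, a quick Selenium smoke test confirms the pair work together. This is a minimal sketch under the packages installed above (selenium, pyvirtualdisplay, Xvfb); the `find_chromedriver` and `smoke_test` helper names are my own, and the target URL is just the IP-check service used later in this post.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path to the driver binary if it is on PATH, else None."""
    return shutil.which(name)


def smoke_test(url="http://icanhazip.com/"):
    """Open a page in headless Chromium via Xvfb and return its source.

    Requires: pip install selenium pyvirtualdisplay, plus xvfb and
    chromium-browser from apt, with chromedriver on the PATH.
    """
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(1366, 768))  # headless X server via Xvfb
    display.start()
    try:
        driver = webdriver.Chrome()  # picks up chromedriver from PATH
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()
    finally:
        display.stop()
```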
Setup Tor Network Proxy
> sudo apt-get install tor
> sudo apt-get install netcat
> sudo apt-get install curl
> sudo apt-get install privoxy
Check my Local IP
> curl http://icanhazip.com/
52.14.197.xxx
Set Up Tor
> tor --hash-password prxxxxxxxx
16:01D5D02xxxxxxxxxxxxxxxxxxxxxxxxxxx
> cat /etc/tor/torrc
ControlPort 9051
HashedControlPassword 16:01D5D02EFA3D6A5xxxxxxxxxxxxxxxxxxx
Both directives go in /etc/tor/torrc; Tor does not read a separate torrcpassword file.
Start Tor
> sudo service tor start
Verify it changes my IP
> torify curl http://icanhazip.com/
192.36.27.4
This command does not work here
> echo -e 'AUTHENTICATE "pricemonitor1234"\r\nsignal NEWNYM\r\nQUIT' | nc 127.0.0.1 9051
Try to use Python to change the IP
> pip install stem
> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from stem import Signal
>>> from stem.control import Controller
>>> with Controller.from_port(port=9051) as controller:
... controller.authenticate()
... controller.signal(Signal.NEWNYM)
...
That should work if the permissions are right; pass the control password to controller.authenticate() if cookie authentication is not enabled.
Configure the Proxy
> cat /etc/privoxy/config
forward-socks5t / 127.0.0.1:9050 .
Start the Service
> sudo service privoxy start
Verify the IP
> curl -x 127.0.0.1:8118 http://icanhazip.com/
185.220.101.6
Verify with Request API
> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
>>> import requests
>>> response = requests.get('http://icanhazip.com/', proxies={'http': '127.0.0.1:8118'})
>>> response.text.strip()
'185.220.101.6'
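Putting the two pieces together, a script can ask Tor for a fresh circuit over the control port and then check the new exit IP through Privoxy. This is a sketch assuming the ports configured above (9051 for the Tor control port, 8118 for Privoxy); the helper names are my own, and the control password is the plain-text value whose hash went into HashedControlPassword.

```python
def privoxy_proxies(host="127.0.0.1", port=8118):
    """Proxy mapping for requests, pointing at the local Privoxy instance."""
    proxy = "http://{0}:{1}".format(host, port)
    return {"http": proxy, "https": proxy}


def rotate_ip(password=None, control_port=9051):
    """Ask Tor for a new circuit, then return the exit IP seen through Privoxy.

    Requires: pip install stem requests, with tor and privoxy running.
    """
    import requests
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        # password is the plain-text value behind HashedControlPassword
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)  # request a new circuit
    return requests.get("http://icanhazip.com/",
                        proxies=privoxy_proxies()).text.strip()
```

Note that Tor rate-limits NEWNYM signals, so rotating on every request will not always yield a new exit node.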
Think About Docker Application
Dockerfile
#Run a scrapy server side
#Prepare the OS
FROM ubuntu:16.04
MAINTAINER Carl Luo <[email protected]>
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update
RUN apt-get -qqy dist-upgrade
#Prepare the dependencies
RUN apt-get install -qy python3 python3-dev python-distribute python3-pip ipython
RUN apt-get install -qy firefox xvfb
RUN pip3 install selenium pyvirtualdisplay
RUN pip3 install boto3 beautifulsoup4 requests
RUN apt-get install -qy libffi-dev libxml2-dev libxslt-dev lib32z1-dev libssl-dev
RUN pip3 install lxml scrapy scrapyjs
RUN pip3 install --upgrade pip
RUN apt-get install -qy python3-venv
RUN apt-get install -qy libxi6 libgconf-2-4 libnss3 libgconf-2-4
RUN apt-get install -qy chromium-browser
RUN apt-get install -qy wget unzip git
#add tool
ADD install/chromedriver /usr/local/bin/
RUN pip install scrapyd
#copy the config
RUN mkdir -p /tool/scrapyd/
ADD conf/scrapyd.conf /tool/scrapyd/
#set up the app
EXPOSE 6801
RUN mkdir -p /app/
ADD start.sh /app/
WORKDIR /app/
CMD [ "./start.sh" ]
Makefile
IMAGE=sillycat/public
TAG=ubuntu-scrapy-1.0
NAME=ubuntu-scrapy-1.0
docker-context:
build: docker-context
docker build -t $(IMAGE):$(TAG) .
run:
docker run -d -p 6801:6801 --name $(NAME) $(IMAGE):$(TAG)
debug:
docker run -p 6801:6801 --name $(NAME) -ti $(IMAGE):$(TAG) /bin/bash
clean:
docker stop ${NAME}
docker rm ${NAME}
logs:
docker logs ${NAME}
publish:
docker push ${IMAGE}
start.sh
#!/bin/sh -ex
#start the service
cd /tool/scrapyd/
scrapyd
Configuration in conf/scrapyd.conf
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 100
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 20
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6801
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
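With scrapyd bound to 0.0.0.0:6801 as configured above, spiders can be driven over the JSON webservices declared in the [services] section. A minimal sketch using requests; the helper names are my own, and the project/spider names passed in would come from whatever is deployed to this scrapyd.

```python
def scrapyd_endpoint(action, host="127.0.0.1", port=6801):
    """URL for one of the scrapyd JSON webservices listed in [services]."""
    return "http://{0}:{1}/{2}".format(host, port, action)


def schedule_spider(project, spider, **spider_args):
    """POST to schedule.json and return scrapyd's JSON reply as a dict.

    Requires: pip install requests, with scrapyd listening on port 6801.
    """
    import requests
    data = dict(spider_args, project=project, spider=spider)
    return requests.post(scrapyd_endpoint("schedule.json"), data=data).json()
```

The other services follow the same pattern, e.g. a GET against scrapyd_endpoint("daemonstatus.json") reports running/pending/finished job counts.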
References:
http://sillycat.iteye.com/blog/2418353
http://sillycat.iteye.com/blog/2418229
Reposted from sillycat.iteye.com/blog/2422861