Debugging a project that uses the Scrapy framework in PyCharm

0x00 About

The usual way to debug a Python file with pdb is to drop a breakpoint into the code:

import pdb
pdb.set_trace()  # execution pauses here and drops into the interactive debugger
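
For instance, pausing inside a function looks like this (a minimal sketch; parse_page is a hypothetical callback, not part of this project):

import pdb

def parse_page(response):
    pdb.set_trace()  # stops here; inspect variables with p, step with n, continue with c
    return response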

pdb has a lot going for it, but it supports neither multithreaded nor remote debugging, so the Scrapy spider script in this project is debugged with PyCharm instead.

Reference:

http://www.jianshu.com/p/eda047ac5c89

0x01 Project structure

When you create a new project, PyCharm by default also creates a Python virtual environment for it.

Here /root/PycharmProjects is PyCharm's default directory for projects, scanner is the name of the new project, and venv is the virtual-environment directory PyCharm created:

/root/PycharmProjects
└── scanner
    └── venv
        ├── bin
        ├── include
        ├── lib
        ├── pip-selfcheck.json
        └── selenium
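
To confirm that the interpreter and the scrapy package really come from this venv, a quick check (sketch) is:

import sys
import scrapy

print(sys.executable)   # should point into /root/PycharmProjects/scanner/venv/bin
print(scrapy.__file__)  # should point into the venv's site-packages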

The project itself consists of two folders, /root/3xp10it and /root/mypypi.

The layout of /root/3xp10it is shown below. /root/3xp10it/3xp10it.py is the project's entry file; the project is started with python3 3xp10it.py.

/root/3xp10it
├── 3xp10it.py
├── debug.py
├── develop.md
├── ghostdriver.log
├── ipproxypool_scrapysplash_server.py
├── pannel
│   ├── manage.py
│   ├── models.py
│   ├── pages
│   └── pannel
├── readme.md
├── targets.py
├── uninstall.py
└── web.py


The crawler-related layout under /root/mypypi is shown below. /root/mypypi/crawler/crawler/spiders/exp10it_spider.py is the Scrapy spider script; on its own, the spider is started with cd /root/mypypi/crawler && python3 -m scrapy crawl exp10it.

/root/mypypi
├── crawler
│   ├── crawler
│   │   ├── ghostdriver.log
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── exp10it_spider.py
│   │       └── __init__.py
│   ├── ghostdriver.log
│   ├── scrapy.cfg
│   └── start.py
├── xxx
├── dicts
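
Besides the CLI command above, the same spider can be launched programmatically; a minimal sketch using Scrapy's CrawlerProcess API, run from /root/mypypi/crawler so the project settings are picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("exp10it")  # the spider's name attribute, as in "scrapy crawl exp10it"
process.start()           # blocks until the crawl finishes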

The content of exp10it_spider.py looks roughly like this:

import re
from urllib.parse import urlparse

import scrapy

class Exp10itSpider(scrapy.Spider):
    name = "exp10it"
    collected_urls = []
    domain = ""
    start_url = ""

    def add_url_templet_to_collected_urls(self, url):
        url = re.sub(r"(#[^\?]*)$", "", url)  # strip a trailing #fragment
        parsed = urlparse(url)
        if len(parsed) < 4:
            ...  # rest of the method elided here
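
For context, urlparse always splits a URL into a six-field ParseResult (scheme, netloc, path, params, query, fragment); a quick illustration with a made-up URL:

from urllib.parse import urlparse

parsed = urlparse("http://example.com/a/b.html?x=1")
print(parsed.scheme, parsed.netloc, parsed.path)  # http example.com /a/b.html
print(len(parsed))                                # always 6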


How the three folders relate:
funcA in /root/3xp10it/3xp10it.py calls funcB in /root/mypypi/exp10it.py, and funcB does the crawling through the scrapy module in /root/PycharmProjects/scanner/venv/lib/site-packages. The relevant code for funcB (funcB is the scrapy_splash_crawl_url below):

import os
import re
from urllib.parse import urlparse

def scrapy_splash_crawl_url(url):
    # replaces the old crawl_url method
    url = re.sub(r"\s+$", "", url)  # drop trailing whitespace
    # ModulePath is defined elsewhere in /root/mypypi/exp10it.py
    spider_file = ModulePath + "/crawler/crawler/spiders/exp10it_spider.py"
    parsed = urlparse(url)
    # if the path ends in a file name (e.g. /index.html), strip the file part
    if re.search(r"/\S+\.\S{1,4}$", parsed.path):
        path = re.sub(r"(?<=/)[^/\s\.]+\.\S{1,4}", "", parsed.path)
    else:
        # otherwise make sure the path ends with a slash
        if parsed.path == "" or parsed.path[-1] != "/":
            path = parsed.path + "/"
        else:
            path = parsed.path
    modify_url = parsed.scheme + "://" + parsed.netloc + path
    # rewrite the hard-coded target URL inside the spider file via sed
    cmd = '''sed -i 's#target_url_to_crawl=".*"#target_url_to_crawl="%s"#g' %s''' % (modify_url, spider_file)
    os.system(cmd)
    from scrapy import cmdline
    cmdline.execute('scrapy crawl exp10it'.split())
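
The sed call shells out just to swap one hard-coded string in the spider file; a pure-Python equivalent (a sketch, reusing the same pattern; set_target_url is a name introduced here for illustration):

import re

def set_target_url(spider_file, modify_url):
    with open(spider_file) as f:
        src = f.read()
    src = re.sub(r'target_url_to_crawl=".*"',
                 'target_url_to_crawl="%s"' % modify_url, src)
    with open(spider_file, "w") as f:
        f.write(src)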

The key line in funcB above that invokes Scrapy is:
    cmdline.execute('scrapy crawl exp10it'.split())

0x02 Solution

Method 1:

In PyCharm, set the working directory for /root/3xp10it/3xp10it.py (the file the project is run from) to /root/mypypi/crawler, the directory that contains the crawler's scrapy.cfg, as shown in the screenshot below.

Method 2 (recommended):

Do in code what Method 1 does through PyCharm's run configuration: switch the working directory to /root/mypypi/crawler before calling cmdline.execute('scrapy crawl exp10it'.split()):

os.chdir("/root/mypypi/crawler")
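
Put together, the end of scrapy_splash_crawl_url then looks like this (a sketch using this project's paths):

import os
from scrapy import cmdline

os.chdir("/root/mypypi/crawler")  # the directory that holds scrapy.cfg
# note: cmdline.execute() calls sys.exit() once the command finishes,
# so nothing after this line will run
cmdline.execute('scrapy crawl exp10it'.split())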
Tags: pycharm, scrapy