November 5, 2022

Python Data Collection

I. Douban Movie Top 250 Data Collection

1. Open the Douban Top 250 page

Douban Movie Top 250

2. Open the browser developer tools

3. Inspect the Top 250 page to find the relevant request details

Right-click → Inspect

4. Add the third-party libraries in PyCharm

In PyCharm: File → Settings → Python Interpreter → + → (install packages such as bs4, requests and lxml)

Note: if installing a library fails, the pip version may be too old and needs to be upgraded.

Check the pip version in a cmd window:

pip -V

Upgrade pip:

python -m pip install --upgrade pip

5. Write the crawler
(1) Import the library: import requests
(2) Send a request to the server: response = requests.get(url)
(3) Get the page source: html = response.text

Anti-anti-crawling: disguise the crawler as a browser
1. Define a variable h = {User-Agent header}
2. Attach it when sending the request: response = requests.get(url, headers=h), as sketched below.
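A minimal sketch of these two steps (the URL is the Douban Top 250 page used in this section; the User-Agent string is only an example):

import requests

h = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}            # pretend to be a browser
response = requests.get('https://movie.douban.com/top250', headers=h)
html = response.text                                                  # page source as a string
print(response.status_code)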

6. Using the BeautifulSoup class from the bs4 library

1. Import it: from bs4 import BeautifulSoup
2. Create a soup object: soup = BeautifulSoup(html, 'lxml')
3. Call the select(css_selector) method on the soup object
4. Extract the data with:
get_text()
text
string
A minimal sketch follows.
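The sketch below uses a small literal string in place of the downloaded page source; the CSS selector is the one used for the film titles later in this section:

from bs4 import BeautifulSoup

html = '<div class="hd"><a href="#"><span>肖申克的救赎</span></a></div>'   # stand-in for response.text
soup = BeautifulSoup(html, 'lxml')                        # build the soup object with the lxml parser
titles = soup.select('div.hd > a > span:nth-child(1)')    # CSS selector returns a list of Tag objects
for t in titles:
    print(t.get_text())                                   # .text or .string also return the tag text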

7. Saving to a CSV file

1. Put the data into a list first.
2. Create the CSV file: with open('xxx.csv', 'a', newline='') as f:
3. Create a writer object with the csv module: w = csv.writer(f)
4. Write the list to the file: w.writerows(listdata) (minimal sketch below; the full script follows it).
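A minimal sketch of the four CSV steps (the file name and rows are only examples):

import csv

rows = [['肖申克的救赎', '9.7'], ['霸王别姬', '9.6']]               # 1. data gathered into a list of rows
with open('demo.csv', 'a', encoding='utf-8', newline='') as f:     # 2. create/open the CSV file
    w = csv.writer(f)                                              # 3. create the writer object
    w.writerows(rows)                                              # 4. write all rows at once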

import requests,csv
from bs4 import BeautifulSoup

def getHtml(url):
    # data collection
    h = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'
    }
    response = requests.get(url, headers=h)
    html = response.text
    # print(html)
    # getHtml('https://movie.douban.com/top250')
    # data parsing: regex, BeautifulSoup, XPath
    soup = BeautifulSoup(html, 'lxml')
    filmtitle = soup.select('div.hd > a > span:nth-child(1)')
    ct = soup.select('div.bd > p:nth-child(1)')
    score = soup.select('div.bd > div > span.rating_num')
    evalue = soup.select('div.bd > div > span:nth-child(4)')
    print(score)
    filmlist = []
    for t,c,s,e in zip(filmtitle,ct,score,evalue):
        title = t.text
        content = c.text
        filmscore = s.text
        num = e.text.strip('人评价')
        director = content.strip().split()[1]
        if "主演:" in content:
            actor = content.strip().split('主演:')[1].split()[0]
        else:
            actor = None
        year = content.strip().split('/')[-3].split()[-1]
        area = content.strip().split('/')[-2].strip()
        filmtype = content.strip().split('/')[-1].strip()
        # print(num)
        listdata = [title,director,actor,year,area,filmtype,filmscore,num]
        filmlist.append(listdata)
    print(filmlist)
    # store the data
    with open('douban250.csv','a',encoding='utf-8',newline='') as f:
        w = csv.writer(f)
        w.writerows(filmlist)

# call the function
listtitle = ['title','director','actor','year','area','type','score','evalueate']
with open('douban250.csv','a',encoding='utf-8',newline='') as f:
    w = csv.writer(f)
    w.writerow(listtitle)
for i in range(0,226,25):
    getHtml('https://movie.douban.com/top250?start=%s&filter='%(i))
# getHtml('https://movie.douban.com/top250?start=150&filter=')

Note

The value of h is taken from the site's developer tools: it is the User-Agent line shown in the request headers.

II. Anjuke Data Collection

1. The Anjuke page

Anjuke

2. Import the library: from lxml import etree

3. Convert the downloaded string into an HTML element tree: data = etree.HTML(html)

4. Call xpath() on the converted data: data.xpath(path_expression) (see the sketch below).
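A minimal sketch of steps 2–4, using a small literal string instead of a downloaded page; the XPath expression is the one used for the listing names in the full script below:

from lxml import etree

html = '<div><span class="items-name">某楼盘</span></div>'        # stand-in for response.text
data = etree.HTML(html)                                           # parse the string into an element tree
name = data.xpath('//span[@class="items-name"]/text()')           # XPath returns a list of strings
print(name)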

import requests
from lxml import etree

def getHtml(url):
    h = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }
    response = requests.get(url, headers=h)
    html = response.text
    # data parsing
    data = etree.HTML(html)
    print(data)
    name = data.xpath('//span[@class="items-name"]/text()')
    print(name)

getHtml("https://bj.fang.anjuke.com/?from=AF_Home_switchcity")

III. Lagou Data Collection

(A) Collecting data with requests

1. Import the libraries
2. Send a request to the server
3. Download the data

(B) POST requests (the parameters are not in the URL)

1. Use the developer tools to find the POST parameters (the request's Form Data) and put them in a dictionary.
2. Send the request with requests.post(url, data=formdata), as sketched below.
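A minimal sketch of a POST request; the parameter names are the ones Lagou expects in the full script below, and the User-Agent value is only an example:

import requests

formdata = {'first': 'true', 'pn': 1, 'kd': 'python'}        # parameters copied from the Form Data panel
h = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
url = 'https://www.lagou.com/jobs/v2/positionAjax.json?needAddtionalResult=false'
response = requests.post(url, headers=h, data=formdata)      # parameters travel in the request body
print(response.text)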

(C) Data parsing: JSON

1. Import the module: import json
2. Convert the collected JSON string into a Python dictionary: json.loads(json_string)
3. Access the values through the dictionary keys (see the sketch below).
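A minimal sketch of the JSON steps; the nested keys ('content', 'positionResult', 'result') are the ones used in the Lagou script below:

import json

jsonhtml = '{"content": {"positionResult": {"result": [{"positionName": "python开发"}]}}}'
dictdata = json.loads(jsonhtml)                        # JSON string -> Python dict
ct = dictdata['content']['positionResult']['result']   # walk down the structure by key
print(ct[0]['positionName'])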

(D) Storing the data in MySQL

1. Install and import pymysql: import pymysql
2. Open a connection: conn = pymysql.Connect(host, port, user, password, db, charset='utf8')
3. Write the SQL statement.
4. Create a cursor: cursor = conn.cursor()
5. Execute the SQL with the cursor: cursor.execute(sql)
6. Commit the data to the database: conn.commit()
7. Close the cursor: cursor.close()
8. Close the connection: conn.close()
A minimal sketch of these steps follows.
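The sketch below walks through the eight steps; the host, credentials, database and table ('demo') are placeholders, and it passes the values as execute() parameters instead of formatting them into the SQL string:

import pymysql

# 1-2: import and connect (placeholder credentials and database)
conn = pymysql.Connect(host='localhost', port=3306, user='root',
                       password='******', db='lagou', charset='utf8')
# 3: SQL statement with %s placeholders (placeholder table and columns)
sql = "insert into demo values(%s, %s)"
# 4-6: cursor, execute, commit
cursor = conn.cursor()
cursor.execute(sql, ('python开发', '15k-25k'))
conn.commit()
# 7-8: close the cursor and the connection
cursor.close()
conn.close()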
This crawl requires a logged-in account, so h is built from the data in the Request Headers panel of the developer tools.

import requests,json,csv,time,pymysql

keyword = input("请输入查询的职务")

def getHtml(url):
    # data collection
    # h = {
    #     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    # }
    h = {
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'content-length': '187',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        # account-specific session cookie copied from the Request Headers in the
        # browser developer tools; the original value is omitted here because it
        # is tied to the author's login session and expires quickly
        'cookie': '<your logged-in Lagou session cookie>',
        'origin': 'https://www.lagou.com',
        'referer': 'https://www.lagou.com/wn/jobs?px=default&cl=false&fromSearch=true&labelWords=sug&suginput=python&kd=python%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88&pn=1',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/30',
        'x-anit-forge-code': '7cd23e45-40de-449e-b201-cddd619b37f3',
        'x-anit-forge-token': '9b30b067-ac6d-4915-af3e-c308979313d0'
    }
    # define the POST parameters
    for num in range(1,31):
        formdata = {
            'first': 'true',
            'pn': num,
            'kd': keyword
        }
        # send the request to the server
        response = requests.post(url, headers=h, data=formdata)
        # download the data
        jsonhtml = response.text
        print(jsonhtml)
        # parse the data
        dictdata = json.loads(jsonhtml)
        # print(type(dictdata))
        ct = dictdata['content']['positionResult']['result']
        positionlist = []
        for i in range(0,dictdata['content']['pageSize']):
            positionName = ct[i]['positionName']
            companyFullName = ct[i]['companyFullName']
            city = ct[i]['city']
            district = ct[i]['district']
            companySize = ct[i]['companySize']
            education = ct[i]['education']
            salary = ct[i]['salary']
            salaryMonth = ct[i]['salaryMonth']
            workYear = ct[i]['workYear']
            # print(workYear)
            datalist = [positionName,companyFullName,city,district,companySize,education,salary,salaryMonth,workYear]
            positionlist.append(datalist)
            # connect to MySQL
            conn = pymysql.Connect(host='localhost',port=3306,user='root',passwd='277877061#xyl',db='lagou',charset='utf8')
            # SQL statement
            sql = "insert into lg values('%s','%s','%s','%s','%s','%s','%s','%s','%s');"%(positionName,companyFullName,city,district,companySize,education,salary,salaryMonth,workYear)
            # create a cursor and execute
            cursor = conn.cursor()
            try:
                cursor.execute(sql)
                conn.commit()
            except:
                conn.rollback()
            cursor.close()
            conn.close()
        # pause between pages
        time.sleep(3)
        with open('拉勾网%s职位信息.csv'%keyword,'a',encoding='utf-8',newline='') as f:
            w = csv.writer(f)
            w.writerows(positionlist)
        # print(ct)

# call the function
title = ['职位名称','公司名称','所在城市','所属地区','公司规模','教育水平','薪资范围','薪资月','工作经验']
with open('拉勾网%s职位信息.csv'%keyword,'a',encoding='utf-8',newline='') as f:
    w = csv.writer(f)
    w.writerow(title)
getHtml('https://www.lagou.com/jobs/v2/positionAjax.json?needAddtionalResult=false')

IV. Epidemic Data Collection

360 epidemic data

import requests,json,csv,pymysql

def getHtml(url):
    h = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=h)
    joinhtml = response.text
    # strip the JSONP wrapper: keep what sits between "jsonp2(" and the closing ")"
    directro = joinhtml.split('(')[1]
    directro2 = directro[0:-2]
    dictdata = json.loads(directro2)
    ct = dictdata['country']
    positionlist = []
    for i in range(0,196):
        provinceName = ct[i]['provinceName']
        cured = ct[i]['cured']
        diagnosed = ct[i]['diagnosed']
        died = ct[i]['died']
        diffDiagnosed = ct[i]['diffDiagnosed']
        datalist = [provinceName,cured,diagnosed,died,diffDiagnosed]
        positionlist.append(datalist)
        print(diagnosed)
        conn = pymysql.Connect(host='localhost',port=3306,user='root',passwd='277877061#xyl',db='lagou',charset='utf8')
        sql = "insert into yq values('%s','%s','%s','%s','%s');" %(provinceName,cured,diagnosed,died,diffDiagnosed)
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            conn.commit()
        except:
            conn.rollback()
        cursor.close()
        conn.close()
    with open('360疫情数据.csv','a',encoding="utf-8",newline="") as f:
        w = csv.writer(f)
        w.writerows(positionlist)

title = ['provinceName','cured','diagnosed','died','diffDiagnosed']
with open('360疫情数据.csv', 'a', encoding="utf-8", newline="") as f:
    w = csv.writer(f)
    w.writerow(title)
getHtml('https://m.look.360.cn/events/feiyan?sv=&version=&market=&device=2&net=4&stype=&scene=&sub_scene=&refer_scene=&refer_subscene=&f=jsonp&location=true&sort=2&_=1626165806369&callback=jsonp2')

V. Scrapy Data Collection

(A) Steps to create a crawler

1. Create the project; on the command line run: scrapy startproject project_name
2. Create the spider file; on the command line, change into the spiders folder and run: scrapy genspider spider_name domain
3. Run the spider: scrapy crawl spider_name

(B) Scrapy project structure

1. spiders folder: where the parsing code is written
2. __init__.py: package marker; a folder is a package only if it contains this file
3. items.py: defines the data structure, i.e. the fields to be scraped
4. middlewares.py: middleware
5. pipelines.py: pipeline file, defines where the scraped data is stored
6. settings.py: configuration file, usually used to set the user agent and enable the pipelines

(C) Workflow for the Scrapy framework

1. Configure settings.py:

(1) USER_AGENT: 'browser string'

(2) ROBOTSTXT_OBEY: False

(3) ITEM_PIPELINES: uncomment to enable the pipelines

2. Define the fields to be scraped in items.py:

field_name = scrapy.Field()

3. Write the parsing code in the parse() method of the spider (in the spiders folder) and hand the results to the item object.
4. Define the storage targets in pipelines.py.
5. Paginated crawling:

1) Find the pattern in the URLs.
2) In the spider, define a page-number variable and use an if check plus string formatting to build the next URL.
3) yield scrapy.Request(url, callback=self.parse), as sketched after this list.
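A minimal sketch of the pagination pattern inside parse(), with a hypothetical URL template; the real versions appear in ajk.py and tpy.py in the examples below:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/list/p1']
    pagenum = 1                                   # page counter kept on the spider

    def parse(self, response):
        # ... parse the current page and yield items here ...
        if self.pagenum < 5:                      # stop after a fixed number of pages
            self.pagenum += 1
            newurl = 'https://example.com/list/p{}'.format(self.pagenum)
            yield scrapy.Request(newurl, callback=self.parse)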

Examples

(A) Anjuke data collection

Project setup

Right-click venv → New → Directory → "Anjuke data collection"

Right-click venv → Open in → Terminal, then run:

scrapy startproject anjuke

and press Enter.

Check the automatically generated files.

In the terminal, change into the spiders directory:

cd anjuke\anjuke\spiders

Then run the following (this creates ajk.py):

scrapy genspider ajk bj.fang.anjuke.com/?from=navigation

Run the spider (on the command line):

scrapy crawl ajk

To run it without the log output, also in the terminal:

scrapy crawl ajk --nolog

1. In ajk.py

import scrapy
from anjuke.items import AnjukeItem

class AjkSpider(scrapy.Spider):
    name = 'ajk'
    # allowed_domains = ['bj.fang.anjuke.com/?from=navigation']
    start_urls = ['http://bj.fang.anjuke.com/loupan/all/p1']
    pagenum = 1

    def parse(self, response):
        # print(response.text)
        item = AnjukeItem()
        # parse the data
        name = response.xpath('//span[@class="items-name"]/text()').extract()
        temp = response.xpath('//span[@class="list-map"]/text()').extract()
        place = []
        district = []
        # print(temp)
        for i in temp:
            placetemp = "".join(i.split("]")[0].strip("[").strip().split())
            districttemp = i.split("]")[1].strip()
            # print(districttemp)
            place.append(placetemp)
            district.append(districttemp)
        # apartment = response.xpath('//a[@class="huxing"]/span[not(@class)]/text()').extract()
        area1 = response.xpath('//span[@class="building-area"]/text()').extract()
        area = []
        for j in area1:
            areatemp = j.strip("建筑面积:").strip('㎡')
            area.append(areatemp)
        price = response.xpath('//p[@class="price"]/span/text()').extract()
        # print(name)
        # hand the cleaned data to the item
        item['name'] = name
        item['place'] = place
        item['district'] = district
        # item['apartment'] = apartment
        item['area'] = area
        item['price'] = price
        yield item
        # print(type(item['name']))
        for a, b, c, d, e in zip(item['name'], item['place'], item['district'], item['area'], item['price']):
            print(a, b, c, d, e)
        # paginated crawling
        if self.pagenum < 5:
            self.pagenum += 1
            newurl = "https://bj.fang.anjuke.com/loupan/all/p{}/".format(str(self.pagenum))
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)
        # print(type(dict(item)))

2. In pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
# store to CSV
import csv,pymysql

class AnjukePipeline:
    def open_spider(self, spider):
        self.f = open("安居客北京.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist = ['name', 'place', 'distract', 'area', 'price']
        self.w.writerow(titlelist)

    def process_item(self, item, spider):
        # writerow() takes one row [1,2,3,4]; writerows() takes a list of rows [[row1],[row2],[row3]]
        # data processing
        k = list(dict(item).values())
        self.listtemp = []
        for a, b, c, d, e in zip(k[0], k[1], k[2], k[3], k[4]):
            self.temp = [a, b, c, d, e]
            self.listtemp.append(self.temp)
        # print(listtemp)
        self.w.writerows(self.listtemp)
        return item

    def close_spider(self, spider):
        self.f.close()

# store to MySQL
class MySqlPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="localhost",port=3306,user='root',password='277877061#xyl',db="anjuke",charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['place'], item['district'], item['area'], item['price']):
            sql = 'insert into ajk values("%s","%s","%s","%s","%s");'%(a,b,c,d,e)
            self.cursor.execute(sql)
            self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

3. In items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class AnjukeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    place = scrapy.Field()
    district = scrapy.Field()
    # apartment = scrapy.Field()
    area = scrapy.Field()
    price = scrapy.Field()

4. In settings.py

# Scrapy settings for anjuke project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'anjuke'

SPIDER_MODULES = ['anjuke.spiders']
NEWSPIDER_MODULE = 'anjuke.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'anjuke.middlewares.AnjukeSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'anjuke.middlewares.AnjukeDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'anjuke.pipelines.AnjukePipeline': 300,
    'anjuke.pipelines.MySqlPipeline': 301
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5. In middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter

class AnjukeSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class AnjukeDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

6. In MySQL

1) Create the anjuke database

mysql> create database anjuke;

2) Use the database

mysql> use anjuke;

3) Create the table

mysql> create table ajk(
    -> name varchar(100),place varchar(100),distarct varchar(100),area varchar(100),price varchar(50));
Query OK, 0 rows affected (0.03 sec)

4) View the table structure

mysql> desc ajk;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| name     | varchar(100) | YES  |     | NULL    |       |
| place    | varchar(100) | YES  |     | NULL    |       |
| distarct | varchar(100) | YES  |     | NULL    |       |
| area     | varchar(100) | YES  |     | NULL    |       |
| price    | varchar(50)  | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+

5) View the table data

mysql> select * from ajk;
+-------------------------------------+-----------------------+-----------------------------------------------------------+--------------+--------+
| name | place | distarct | area | price |
+-------------------------------------+-----------------------+-----------------------------------------------------------+--------------+--------+
| 和锦华宸品牌馆 | 顺义马坡 | 和安路与大营二街交汇处东北角 | 95-180 | 400 |
| 保利建工•和悦春风 | 大兴庞各庄 | 永兴河畔 | 76-109 | 36000 |
| 华樾国际·领尚 | 朝阳东坝 | 东坝大街和机场二高速交叉路口西方向3... | 96-196.82 | 680 |
| 金地·璟宸品牌馆 | 房山良乡 | 政通西路 | 97-149 | 390 |
| 中海首钢天玺 | 石景山古城 | 北京市石景山区古城南街及莲石路交叉口... | 122-147 | 900 |
| 傲云品牌馆 | 朝阳孙河 | 顺黄路53 | 87288 | 2800 |
| 首开香溪郡 | 通州宋庄 | 通顺路 | 90-220 | 39000 |
| 北京城建·北京合院 | 顺义顺义城 | 燕京街与通顺路交汇口东800米(仁和公... | 93-330 | 850 |
| 金融街武夷·融御 | 通州政务区 | 北京城市副中心01组团通胡大街与东六环... | 99-176 | 66000 |
| 北京城建·天坛府品牌馆 | 东城天坛 | 北京市东城区景泰路与安乐林路交汇处 | 60-135 | 800 |
| 亦庄金悦郡 | 大兴亦庄 | 新城环景西一路与景盛南二街交叉口 | 72-118 | 38000 |
| 金茂·北京国际社区 | 顺义顺义城 | 水色时光西路西侧 | 50-118 | 175 |

(B) PCauto (太平洋汽车) data collection

1. In tpy.py

import scrapy
from taipingyangqiche.items import TaipingyangqicheItem

class TpyqcSpider(scrapy.Spider):
    name = 'tpyqc'
    allowed_domains = ['price.pcauto.com.cn/top']
    start_urls = ['http://price.pcauto.com.cn/top/']
    pagenum = 1

    def parse(self, response):
        item = TaipingyangqicheItem()
        name = response.xpath('//p[@class="sname"]/a/text()').extract()
        temperature = response.xpath('//p[@class="col rank"]/span[@class="fl red rd-mark"]/text()').extract()
        price = response.xpath('//p/em[@class="red"]/text()').extract()
        brand2 = response.xpath('//p[@class="col col1"]/text()').extract()
        brand = []
        for j in brand2:
            areatemp = j.strip('品牌:').strip('排量:').strip('\r\n')
            brand.append(areatemp)
        brand = [i for i in brand if i != '']
        rank2 = response.xpath('//p[@class="col"]/text()').extract()
        rank = []
        for j in rank2:
            areatemp = j.strip('级别:').strip('变速箱:').strip('\r\n')
            rank.append(areatemp)
        rank = [i for i in rank if i != '']
        item['name'] = name
        item['temperature'] = temperature
        item['price'] = price
        item['brand'] = brand
        item['rank'] = rank
        yield item
        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            print(a, b, c, d, e)
        if self.pagenum < 6:
            self.pagenum += 1
            newurl = "https://price.pcauto.com.cn/top/k0-p{}.html".format(str(self.pagenum))
            print(newurl)
            yield scrapy.Request(newurl, callback=self.parse)
        # print(type(dict(item)))

2. In pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import csv,pymysql

class TaipingyangqichePipeline:
    def open_spider(self, spider):
        self.f = open("太平洋.csv", "w", encoding='utf-8', newline="")
        self.w = csv.writer(self.f)
        titlelist = ['name', 'temperature', 'price', 'brand', 'rank']
        self.w.writerow(titlelist)

    def process_item(self, item, spider):
        k = list(dict(item).values())
        self.listtemp = []
        for a, b, c, d, e in zip(k[0], k[1], k[2], k[3], k[4]):
            self.temp = [a, b, c, d, e]
            self.listtemp.append(self.temp)
        print(self.listtemp)
        self.w.writerows(self.listtemp)
        return item

    def close_spider(self, spider):
        self.f.close()

class MySqlPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="localhost", port=3306, user='root', password='277877061#xyl', db="taipy", charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        for a, b, c, d, e in zip(item['name'], item['temperature'], item['price'], item['brand'], item['rank']):
            sql = 'insert into tpy values("%s","%s","%s","%s","%s");' % (a, b, c, d, e)
            self.cursor.execute(sql)
            self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

3. In items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class TaipingyangqicheItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    temperature = scrapy.Field()
    price = scrapy.Field()
    brand = scrapy.Field()
    rank = scrapy.Field()

4. In settings.py

# Scrapy settings for taipingyangqiche project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'taipingyangqiche'

SPIDER_MODULES = ['taipingyangqiche.spiders']
NEWSPIDER_MODULE = 'taipingyangqiche.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'taipingyangqiche.middlewares.TaipingyangqicheSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'taipingyangqiche.middlewares.TaipingyangqicheDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'taipingyangqiche.pipelines.TaipingyangqichePipeline': 300,
    'taipingyangqiche.pipelines.MySqlPipeline': 301
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

5. In middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter

class TaipingyangqicheSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class TaipingyangqicheDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)