jder.net Crawler

2020-04-19

Crawler


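Crawling the cosplay section of jder.net

The college entrance exam (gaokao) is coming up, so I have been busy studying these past couple of days. The jder.net crawler will probably be my last crawler project for a while.

This crawl turned out to be one of the more annoying ones. Let's first look at the site's pages.

They look simple, but the site is designed in a really clumsy way.

There are 179 listing pages in total, and the pattern is easy to spot.
Listing pages all have the form
http://www.jder.net/cosplay/page/*

Galleries all have the form
http://www.jder.net/cosplay/*.html

Figure: all page numbers

Figure: inside a gallery
Image URLs all have the form
http://img.jder.net/wp-content/uploads/*
or
http://www.jder.net/wp-content/uploads/*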
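
As a small sketch, the three URL shapes above can be written down as regular expressions so that any link can be sorted into "listing", "gallery" or "image". The function name `classify_url` and the assumption that page numbers are purely numeric are mine, not something the site guarantees.

```python
import re

# Patterns derived from the URL shapes listed above.
# Assumption: the page number after /page/ is always digits.
LISTING_RE = re.compile(r"^http://www\.jder\.net/cosplay/page/\d+/?$")
GALLERY_RE = re.compile(r"^http://www\.jder\.net/cosplay/[^/?]+\.html$")
IMAGE_RE = re.compile(r"^http://(?:img|www)\.jder\.net/wp-content/uploads/.+")


def classify_url(url):
    """Return 'listing', 'gallery', 'image', or None for anything else."""
    if not url:
        return None
    if LISTING_RE.match(url):
        return "listing"
    if GALLERY_RE.match(url):
        return "gallery"
    if IMAGE_RE.match(url):
        return "image"
    return None
```

For example, a hypothetical link such as `http://www.jder.net/cosplay/12345.html` would be classified as a gallery, while the same link with a `?post` query string would be rejected.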

With that, a crawler is straightforward to write:

import requests
from bs4 import BeautifulSoup

# Image that the filter below explicitly excludes
EXCLUDED_IMG = "http://img.jder.net/wp-content/uploads/bfi_thumb/5f5ce72f94bc47dce1a743b480b9a63c-6uebxvpkmpkkfcmh65imn1pkml0nariv5f12hhovy8b.png"

# Collect every gallery URL
group_list_url = []  # created once so it accumulates across all listing pages
for num in range(1, 180):
    print(num)
    page_url = "http://www.jder.net/cosplay/page/" + str(num)
    response = requests.get(page_url)
    response.encoding = 'utf-8'
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for item1 in soup.find_all('a'):
        all_a_url = item1.get("href")
        # Keep links under /cosplay/, skip "?post" variants and duplicates
        if all_a_url and "http://www.jder.net/cosplay/" in all_a_url and "?post" not in all_a_url:
            if all_a_url not in group_list_url:
                group_list_url.append(all_a_url)
                print(all_a_url)

# Visit each gallery and download its images
img_group = 25865  # starting gallery number
down_img_url = []  # URLs already downloaded
for item2 in group_list_url:
    print("Entering gallery " + item2)
    img_group = img_group + 1  # next gallery number
    img_num = 0  # image counter, reset for each gallery
    try:
        response = requests.get(item2, timeout=10)
    except requests.exceptions.RequestException as e:
        print(e)
        continue  # skip this gallery instead of reusing a stale response
    response.encoding = 'utf-8'
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for item3 in soup.find_all('img'):
        img_url = item3.get('src')
        if img_url and "http://img.jder.net/wp-content/uploads/" in img_url and img_url != EXCLUDED_IMG and img_url not in down_img_url:
            down_img_url.append(img_url)
            # Download img_url here (this step is left out in the post)

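But it was not as simple as I had expected.
I later found that once the site's home page is closed, the images can no longer be opened.

At first I assumed the check was cookie-based, but even after reproducing the cookies in another browser I still could not access the images. I tried various other approaches, and none of them solved the problem.
Figure: adding the cookie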
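
For reference, here is a rough sketch of how browser cookies can be replayed from Python: the `Cookie` header copied from the browser's developer tools is parsed into a dict, which `requests.get` accepts through its `cookies=` parameter. The helper name `cookie_header_to_dict` and the sample cookie names are hypothetical, and on this site replaying cookies did not unlock the images anyway.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(raw_cookie):
    """Turn a 'name=value; name2=value2' header string into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie string, only to show the shape of the input.
cookies = cookie_header_to_dict("session=abc123; theme=dark")
print(cookies)
```

The resulting dict could then be passed as `requests.get(img_url, cookies=cookies)`.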

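It looked like a dead end at this point, but!

I found that while the site's home page is open, the images can be downloaded with Xunlei (迅雷).

Capturing Xunlei's traffic revealed nothing, but that is not the important part. We change tack and collect the URLs of all images instead.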

import requests
from bs4 import BeautifulSoup

# Image that the filter below explicitly excludes
EXCLUDED_IMG = "http://img.jder.net/wp-content/uploads/bfi_thumb/5f5ce72f94bc47dce1a743b480b9a63c-6uebxvpkmpkkfcmh65imn1pkml0nariv5f12hhovy8b.png"
# Both hosts that serve gallery images
IMG_PREFIXES = (
    "http://img.jder.net/wp-content/uploads/",
    "http://www.jder.net/wp-content/uploads/",
)

img_group = 25865  # starting gallery number
# Collect all galleries
group_list_url = []  # created once so it accumulates across all listing pages
for num in range(1, 176):
    print(num)
    page_url = "http://www.jder.net/cosplay/page/" + str(num)
    response = requests.get(page_url)
    response.encoding = 'utf-8'
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for item1 in soup.find_all('a'):
        all_a_url = item1.get("href")
        if all_a_url and "http://www.jder.net/cosplay/" in all_a_url and "?post" not in all_a_url:
            if all_a_url not in group_list_url:
                group_list_url.append(all_a_url)
                print(all_a_url)

# Visit each gallery and record its image URLs
down_img_url = []  # URLs already recorded
for item2 in group_list_url:
    print("Entering gallery " + item2)
    img_group = img_group + 1  # next gallery number
    img_num = 0  # image counter, reset for each gallery
    try:
        response = requests.get(item2, timeout=10)
    except requests.exceptions.RequestException as e:
        print(e)
        continue  # skip this gallery instead of reusing a stale response
    response.encoding = 'utf-8'
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for item3 in soup.find_all('img'):
        img_url = item3.get('src')
        if not img_url or img_url == EXCLUDED_IMG or img_url in down_img_url:
            continue
        if any(prefix in img_url for prefix in IMG_PREFIXES):
            down_img_url.append(img_url)  # remember it so duplicates are skipped
            print(img_url)
            # Mode 'a' creates the file if needed and appends one URL per line
            with open('结果.txt', 'a', encoding='utf-8') as file_handle:
                file_handle.write(img_url + '\n')

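Then another problem appeared. When iterating over the listing pages with
for num in range(1, 176):
I found that once more than 5 pages were crawled, the number of images captured per page dropped for no obvious reason.
Figure: the count drops

I looked into it for a long time without finding the cause, but that does not matter: a browser-based crawler gets around the problem. After some experimenting, I found that the "八爪鱼" (Bazhuayu) tool can do the job.

We use Python to collect all the gallery links: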

import requests
from bs4 import BeautifulSoup

# Collect every gallery URL
group_list_url = []  # URLs already written, used for de-duplication
for num in range(1, 180):
    print(num)
    page_url = "http://www.jder.net/cosplay/page/" + str(num)
    html = None
    i = 0
    while i < 10:  # give up on this page after 10 failed attempts
        try:
            response = requests.get(page_url, timeout=5)
            response.encoding = 'utf-8'
            html = response.text
            break
        except requests.exceptions.RequestException:
            i += 1
    if html is None:
        continue  # every attempt failed, so skip this page
    soup = BeautifulSoup(html, "html.parser")
    for item1 in soup.find_all('a'):
        all_a_url = item1.get("href")
        if all_a_url and "http://www.jder.net/cosplay/" in all_a_url and "?post" not in all_a_url:
            if all_a_url not in group_list_url:
                group_list_url.append(all_a_url)  # remember it so duplicates are skipped
                print(all_a_url)
                # Mode 'a' creates the file if needed and appends one URL per line
                with open('结果.txt', 'a', encoding='utf-8') as file_handle:
                    file_handle.write(all_a_url + '\n')

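Note: when requesting an HTML page with requests.get(page_url), the response can sometimes take a very long time, which hurts throughput badly. Passing timeout=<seconds> sets an upper limit on the wait; without it, requests does not time out on its own.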

i = 0
while i < 10:  # after 10 failed attempts, give up and move on
    try:  # statements that may raise
        response = requests.get(page_url, timeout=5)
        response.encoding = 'utf-8'
        html = response.text
        break  # no exception, so leave the while loop
    except requests.exceptions.RequestException:
        i += 1  # count the failed attempt
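
The retry loop above can also be pulled out into a small reusable helper. This is only a sketch: `fetch_with_retry` and its parameters are names I made up, and it takes any zero-argument callable so that it can be tried without touching the network.

```python
import time


def fetch_with_retry(fetch, max_attempts=10, delay=0.0, retry_on=(Exception,)):
    """Call fetch() until it succeeds or max_attempts is reached.

    Returns the result of fetch(), or None if every attempt raised
    one of the exception types listed in retry_on.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            print(f"attempt {attempt}/{max_attempts} failed: {exc}")
            if delay and attempt < max_attempts:
                time.sleep(delay)  # optional pause before the next attempt
    return None
```

With requests it would be called roughly as `fetch_with_retry(lambda: requests.get(page_url, timeout=5), retry_on=(requests.exceptions.RequestException,))`, and a `None` result means the page should be skipped.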

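With 八爪鱼 set up on a server, we can easily obtain all the image links; it is just a bit slow.

Figure: 八爪鱼

Figure: the collected data

Finally, we download everything with Xunlei. However, downloading too many images at once gets our IP blacklisted by the server, so the downloads also need to go through a proxy.
Figure: the proxy tool

In addition, when too many images are added to Xunlei in one go, the image count may turn out wrong, so the downloads should be split into batches. I recommend about 1,000 images per batch.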
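
Splitting the URL list into batches of roughly 1,000 is easy to script. The sketch below is one possible way to do it; the function names and the `batch_NNN.txt` file naming are my own choices, not part of any Xunlei workflow.

```python
def split_into_batches(urls, batch_size=1000):
    """Return consecutive slices of urls, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def write_batches(urls, prefix="batch", batch_size=1000):
    """Write each batch to '<prefix>_NNN.txt', one URL per line.

    Returns the list of file names that were written.
    """
    names = []
    for index, batch in enumerate(split_into_batches(urls, batch_size), start=1):
        name = f"{prefix}_{index:03d}.txt"
        with open(name, "w", encoding="utf-8") as handle:
            handle.write("\n".join(batch) + "\n")
        names.append(name)
    return names
```

Each resulting file can then be pasted into Xunlei as a separate download task.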


Copyright notice: Unless otherwise stated, all articles on this blog are licensed under the Apache License 2.0. Please credit the source when reposting!