# Python Stage 2 - Intro to Web Scraping
## 🎯 Today's Goals
- Understand how web pagination works and how to locate the "Next" link
- Write loop logic that automatically turns pages and scrapes their content
- Integrate multi-page scraping into the crawler
## 📘 Learning Content
### 🔁 How Pagination Works

Using quotes.toscrape.com as an example:
- Home page: https://quotes.toscrape.com/
- "Next" link in the page source:

```html
<li class="next"><a href="/page/2/">Next</a></li>
```
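Before wiring this into a full crawler, it helps to see the extraction step in isolation. A minimal sketch, assuming the HTML fragment above and the site's base URL: parse the fragment with BeautifulSoup, select the `<a>` inside `li.next`, and join its relative `href` onto the base.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# The "Next" fragment as it appears in the page source
html = '<li class="next"><a href="/page/2/">Next</a></li>'
soup = BeautifulSoup(html, "html.parser")

next_link = soup.select_one("li.next a")  # the <a> inside <li class="next">
next_href = next_link["href"]             # relative path: "/page/2/"
full_url = urljoin("https://quotes.toscrape.com/", next_href)
print(full_url)  # https://quotes.toscrape.com/page/2/
```

`urljoin` handles both relative paths like `/page/2/` and already-absolute URLs, which is why the crawler below can pass `next_href` to it unconditionally.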
With BeautifulSoup we can look up `li.next a` and read its `href` to get the next page's address, then join it onto the current URL.

### Core Idea (Pseudocode)

```text
while True:
    1. Request the current page URL
    2. Parse the HTML and extract the content we need
    3. Check whether a "Next" link exists
       - If yes: build the new URL and continue the loop
       - If no: break out of the loop
```

### Example Code: Multi-Page Scraping
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_quotes(start_url):
    quotes = []
    url = start_url
    while url:
        print(f"Scraping: {url}")
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "lxml")
        for quote_block in soup.find_all("div", class_="quote"):
            quote_text = quote_block.find("span", class_="text").text.strip()
            author = quote_block.find("small", class_="author").text.strip()
            tags = [tag.text for tag in quote_block.find_all("a", class_="tag")]
            quotes.append({
                "quote": quote_text,
                "author": author,
                "tags": tags
            })
        # Look for the next-page link
        next_link = soup.select_one("li.next a")
        if next_link:
            next_href = next_link["href"]
            url = urljoin(url, next_href)  # join into a full URL
        else:
            url = None
    return quotes

if __name__ == "__main__":
    all_quotes = scrape_all_quotes("https://quotes.toscrape.com/")
    print(f"Scraped {len(all_quotes)} quotes in total")
    # Show the first 3 quotes as a sample
    for quote in all_quotes[:3]:
        print(f"\n{quote['quote']}\n—— {quote['author']}\nTags: {', '.join(quote['tags'])}")
```

### Today's Practice Tasks

- Modify the existing crawler to scrape the quote data from all pages
- Use `len()` to check how many quotes were collected in total
- Extra challenge: save all the data to a JSON file (using `json.dump`)

### Practice Code

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json

def scrape_all_quotes(start_url):
    quotes = []
    url = start_url
    while url:
        print(f"Scraping: {url}")
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        quote_blocks = soup.find_all("div", class_="quote")
        for block in quote_blocks:
            text = block.find("span", class_="text").text.strip()
            author = block.find("small", class_="author").text.strip()
            tags = [tag.text for tag in block.find_all("a", class_="tag")]
            quotes.append({
                "quote": text,
                "author": author,
                "tags": tags
            })
        # Find the next-page link
        next_link = soup.select_one("li.next a")
        if next_link:
            next_href = next_link["href"]
            url = urljoin(url, next_href)
        else:
            url = None
    return quotes

if __name__ == "__main__":
    start_url = "https://quotes.toscrape.com/"
    all_quotes = scrape_all_quotes(start_url)
    print(f"\nScraped {len(all_quotes)} quotes in total.\n")

    # Save the results to a JSON file
    output_file = "quotes.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(all_quotes, f, ensure_ascii=False, indent=2)

    print(f"Data saved to file: {output_file}")
```

### Run Output

```text
Scraping: https://quotes.toscrape.com/
Scraping: https://quotes.toscrape.com/page/2/
Scraping: https://quotes.toscrape.com/page/3/
Scraping: https://quotes.toscrape.com/page/4/
Scraping: https://quotes.toscrape.com/page/5/
Scraping: https://quotes.toscrape.com/page/6/
Scraping: https://quotes.toscrape.com/page/7/
Scraping: https://quotes.toscrape.com/page/8/
Scraping: https://quotes.toscrape.com/page/9/
Scraping: https://quotes.toscrape.com/page/10/

Scraped 100 quotes in total.

Data saved to file: quotes.json
```

Contents of `quotes.json`:

```json
[
  {
    "quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts", "thinking", "world"]
  },
  {
    "quote": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
    "author": "J.K. Rowling",
    "tags": ["abilities", "choices"]
  },
  {
    "quote": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
    "author": "Albert Einstein",
    "tags": ["inspirational", "life", "live", "miracle", "miracles"]
  },
  ... (95 entries omitted)
  {
    "quote": "“A person's a person, no matter how small.”",
    "author": "Dr. Seuss",
    "tags": ["inspirational"]
  },
  {
    "quote": "“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”",
    "author": "George R.R. Martin",
    "tags": ["books", "mind"]
  }
]
```

### Tips

- `urljoin(base_url, relative_path)` automatically builds an absolute URL from a relative path
- Some sites implement pagination dynamically with JavaScript; those require Selenium or Playwright (covered later)

## Summary
- Learned how to extract the "Next" link from a web page
- Mastered implementing the logic for automatic page-by-page scraping
- One step closer to building a complete data-collection tool
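The examples in this lesson assume every request succeeds. As a step toward a more complete tool, the request step could be hardened with a timeout, a status-code check, and simple retries. This is a sketch under assumptions: the `fetch` helper below is hypothetical, not part of the tutorial's code.

```python
import time
import requests

def fetch(url, retries=3, timeout=10):
    """Fetch a page's HTML, retrying a few times on failure (hypothetical helper)."""
    for attempt in range(retries):
        try:
            res = requests.get(url, timeout=timeout)
            res.raise_for_status()  # raise for 4xx/5xx responses
            return res.text
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(1)  # brief pause before retrying
    return None  # give up after all retries fail
```

In the crawler's loop, `res = requests.get(url)` could then be replaced by `html = fetch(url)`, with the loop ending when `fetch` returns `None`.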