Today's topic is web scraping in Python with the BeautifulSoup library.
0x00 Basic usage
Installation:

    pip3 install beautifulsoup4

Usage:
First, define a simple index.html file:

    <html>
    <body>
        <h1>hello beautifulsoup</h1>
        <div>
            <a href="#">123</a>
            <p class="hello">hello</p>
        </div>
    </body>
    </html>

Loading the file with BeautifulSoup in Python:

    from bs4 import BeautifulSoup

    # Parse index.html with Python's built-in HTML parser
    with open("index.html") as f:
        soup = BeautifulSoup(f, "html.parser")

    print(soup.prettify())
    '''
    Output:
    <html>
     <body>
      <h1>
       hello beautifulsoup
      </h1>
      <div>
       <a href="#">
        123
       </a>
       <p class="hello">
        hello
       </p>
      </div>
     </body>
    </html>
    '''

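BeautifulSoup also accepts markup as a plain string, which is handy for quick experiments without a file on disk. The following is a minimal sketch under that assumption; the `markup` variable is only an illustration and is not part of the original example.

```python
from bs4 import BeautifulSoup

# Illustrative markup held in a string instead of index.html
markup = "<html><body><h1>hello beautifulsoup</h1></body></html>"

# The second argument selects the parser; "html.parser" ships with Python
soup = BeautifulSoup(markup, "html.parser")

heading = soup.h1.get_text()   # text inside the first <h1>
pretty = soup.prettify()       # indented markup, one node per line

print(heading)
print(pretty)
```

Passing a string or an open file handle gives the same kind of `soup` object, so everything in the following sections applies to both.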
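Basic operations:

    soup.prettify()        # the document as an indented string
    soup.a                 # the first <a> tag
    soup.title             # the first <title> tag; None here, because index.html has no <title>
    soup.a['href']         # value of the href attribute: '#'
    soup.p['class']        # class is multi-valued, so this is a list: ['hello']
    soup.name              # '[document]'
    soup.a.name            # 'a'
    soup.a.contents        # list of direct children: ['123']
    soup.contents[0]       # first top-level node (the <html> tag here)
    soup.a.contents[0]     # '123', a NavigableString; strings have no .contents
    soup.a.parent          # the enclosing <div> tag
    soup.a.string.parent   # a string's parent is the tag that contains it: <a>

    # Walk up the tree: div, body, html, [document]
    for parent in soup.a.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)

    # Siblings share the same parent. Whitespace between tags in
    # index.html counts as a text node, so a sibling may be a string like '\n'.
    soup.a.next_sibling
    soup.a.previous_sibling

    for sibling in soup.a.next_siblings:
        print(repr(sibling))

    for sibling in soup.a.previous_siblings:
        print(repr(sibling))
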
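To make the navigation attributes above concrete, here is a small sketch that runs them against the same markup as index.html. It assumes the markup is written without whitespace between tags, so that sibling navigation returns tags rather than whitespace-only strings; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Same elements as index.html, but with no whitespace between tags
# (an assumption made here so that siblings are tags, not '\n' strings).
HTML = (
    '<html><body><h1>hello beautifulsoup</h1>'
    '<div><a href="#">123</a><p class="hello">hello</p></div>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")
link = soup.a

href = link["href"]                            # attribute access by key
p_classes = soup.p["class"]                    # multi-valued attribute, returned as a list
parent_names = [p.name for p in link.parents]  # every ancestor, innermost first
next_name = link.next_sibling.name             # the <p> right after the <a>
prev = link.previous_sibling                   # nothing precedes the <a> inside the <div>
title = soup.title                             # this markup has no <title>

print(href, p_classes, parent_names, next_name, prev, title)
```

Because a missing tag comes back as `None`, chained access such as `soup.title.string` raises `AttributeError` on this document; checking for `None` first avoids that.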
0x01 Advanced usage
Regular expressions
Find all tags whose names start with b (such as <body> and <b>):

    import re

    # A compiled pattern is matched against each tag name
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)

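As a self-contained sketch of the regular-expression filter, the snippet below uses a small illustrative document that contains both a <body> and a <b> tag (this markup is an assumption, not the index.html from earlier). The pattern is applied to tag names with a search, so an unanchored pattern matches anywhere in the name.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup with two tag names that start with "b"
soup = BeautifulSoup(
    "<html><body><b>bold</b><p>text</p></body></html>", "html.parser"
)

# Anchored pattern: tag names that start with "b"
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored pattern: tag names that contain "t" anywhere
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```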
The find_all() function
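
    Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
    (the string argument was called text in older releases)

    Common examples:

    soup.find_all('a')                       # all <a> tags
    soup.find_all('a', 'b')                  # <a> tags with the CSS class "b"
    soup.find_all(href="#")                  # tags whose href attribute is "#"
    soup.find_all(id=True)                   # tags that have an id attribute, whatever its value
    soup.find_all('a', limit=2)              # stop after the first two <a> tags
    soup.find_all('title', recursive=False)  # only direct children of soup, not deeper descendants
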
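The filters above can be tried together on one small document. This is a sketch with made-up markup (the ids, classes, and hrefs are assumptions chosen so each filter selects something different), not the index.html from the first section.

```python
from bs4 import BeautifulSoup

# Illustrative markup: three links, one of which has an id and the class "b"
HTML = (
    '<html><head><title>demo</title></head><body>'
    '<a id="first" class="b" href="#">one</a>'
    '<a href="/two">two</a>'
    '<a href="#">three</a>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

all_links = soup.find_all("a")                  # by tag name
class_b = soup.find_all("a", "b")               # second positional argument matches a CSS class
hash_links = soup.find_all(href="#")            # keyword argument matches an attribute value
with_id = soup.find_all(id=True)                # True means "attribute is present"
first_two = soup.find_all("a", limit=2)         # stop after two results
top_title = soup.find_all("title", recursive=False)  # direct children of soup only
any_title = soup.find_all("title")              # default search covers all descendants

print(len(all_links), len(class_b), len(with_id), len(first_two))
print([a.get_text() for a in hash_links])
print(top_title, any_title)
```

The `recursive=False` call returns an empty list here because <title> sits inside <html> and <head>, and only <html> is a direct child of the `soup` object.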