如何在 Python 中将 HTML 表数据保存为 CSV

pythonserver side programmingprogramminghtml

问题:

数据科学家面临的最大挑战之一是收集数据。事实上，网络上有大量可用数据，但只是通过自动化提取数据。

简介..

我想从 https://www.tutorialspoint.com/python/python_basic_operators.htm 中提取嵌入在 HTML 表中的基本操作数据。

嗯，数据分散在许多 HTML 表中，如果只有一个 HTML 表，显然我可以使用复制和粘贴到 .csv 文件。

但是，如果单个页面中有超过 5 个表，那么显然很麻烦。不是吗?

怎么做呢..

1. 如果您想创建 csv 文件，我将快速向您展示如何轻松创建 csv 文件。

import csv
# 以写入模式打开文件，如果未找到，则会创建一个
File = open('test.csv', 'w+')
Data = csv.writer(File)

# 我的标题
Data.writerow(('Column1', 'Column2', 'Column3'))

# 写入数据
for i in range(20):
Data.writerow((i, i+1, i+2))

# 关闭我的文件
File.close()

输出

上述代码执行后，将在与此代码相同的目录中生成一个 test.csv 文件。

2. 现在让我们从 https://www.tutorialspoint.com/python/python_dictionary.htm 检索 HTML 表并将其写入 CSV 文件。

第一步是进行导入。

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.tutorialspoint.com/python/python_dictionary.htm'

使用 urlopen 打开 HTML 文件并将其存储在 html 对象中。

输出

html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

在 html 表中找到表并让我们获取表数据。出于演示目的，我将仅提取第一个表 [0]

输出

table = soup.find_all('table')[0]
rows = table.find_all('tr')

输出

print(rows)

输出

[<tr>
<th style='text-align:center;width:5%'>Sr.No.</th>
<th style='text-align:center;width:95%'>Function with Description</th>
</tr>, 
<tr>
<td class='ts'>1</td>
<td><a href='/python/dictionary_cmp.htm'>cmp(dict1, dict2)</a>
<p>Compares elements of both dict.</p></td>
</tr>, <tr>
<td class='ts'>2</td>
<td><a href='/python/dictionary_len.htm'>len(dict)</a>
<p>Gives the total length of the dictionary. This would be equal to the number of items in the dictionary.</p></td>
</tr>, 
<tr>
<td class='ts'>3</td>
<td><a href='/python/dictionary_str.htm'>str(dict)</a>
<p>Produces a printable string representation of a dictionary</p></td>
</tr>, 
<tr>
<td class='ts'>4</td>
<td><a href='/python/dictionary_type.htm'>type(variable)</a>
<p>Returns the type of the passed variable. If passed variable is dictionary, then it would return a dictionary type.</p></td>
</tr>]

5.现在我们将数据写入 csv 文件。

示例

File = open('my_html_data_to_csv.csv', 'wt+')
Data = csv.writer(File)
try:
for row in rows:
FilteredRow = []
for cell in row.find_all(['td', 'th']):
FilteredRow.append(cell.get_text())
Data.writerow(FilteredRow)
finally:
File.close()

6.现在结果已保存到 my_html_data_to_csv.csv 文件中。

示例

我们将把上面解释的所有内容放在一起。

示例

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

# 设置 url..
url = 'https://www.tutorialspoint.com/python/python_basic_syntax.htm'

# 打开 url 并解析 html
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# 提取第一个表
table = soup.find_all('table')[0]
rows = table.find_all('tr')

# 将内容写入文件
File = open('my_html_data_to_csv.csv', 'wt+')
Data = csv.writer(File)
try:
for row in rows:
FilteredRow = []
for cell in row.find_all(['td', 'th']):
FilteredRow.append(cell.get_text())
Data.writerow(FilteredRow)
finally:
File.close()