Web scraping has become an essential tool for collecting data from across the internet, empowering data analysts, tech enthusiasts, and businesses to make informed decisions. But extracting the data is only the first step. To unlock its full potential, you need to export it efficiently into the right format, whether that is a CSV file for spreadsheets, JSON for APIs, or a database for large-scale storage and analysis.
In this blog, we'll cover the essentials of exporting web-scraped data. You'll learn, step by step, how to work with CSV and JSON files, how to integrate web-scraped data with databases, and how to get the most out of your data management.
Before diving into the script, let’s understand the dataset and workflow that we’ll use to demonstrate the data-saving process.
We’ll be scraping data from the website Books to Scrape, which provides a list of books along with their titles, prices, and availability.
This website is designed for practice purposes, making it an ideal choice for showcasing web scraping techniques.
Here’s the process we’ll follow: we’ll use the requests and BeautifulSoup libraries to extract the book details from the website, load them into a Pandas DataFrame, and then save the data to a CSV file, a JSON file, and an SQLite database.
To run the script, you’ll need the following Python libraries: requests, beautifulsoup4, and pandas.
Install these libraries using pip. Run the following command in your terminal:
pip install requests beautifulsoup4 pandas
Here’s the Python script to scrape the data from the website and store it in a Pandas DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Scrape data from the website
def scrape_books():
    url = "https://books.toscrape.com/"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Failed to load page")

    soup = BeautifulSoup(response.content, "html.parser")
    books = []

    # Extract book data
    for article in soup.find_all("article", class_="product_pod"):
        title = article.h3.a["title"]
        price = article.find("p", class_="price_color").text.strip()
        availability = article.find("p", class_="instock availability").text.strip()
        books.append({"Title": title, "Price": price, "Availability": availability})

    # Convert to DataFrame
    books_df = pd.DataFrame(books)
    return books_df

# Main execution
if __name__ == "__main__":
    print("Scraping data...")
    books_df = scrape_books()
    print("Data scraped successfully!")
    print(books_df)
The table we will use to demonstrate the data-saving process is structured as follows:
Title | Price | Availability |
A Light in the Attic | £51.77 | In stock |
Tipping the Velvet | £53.74 | In stock |
Soumission | £50.10 | In stock |
Sharp Objects | £47.82 | In stock |
Sapiens: A Brief History of Humankind | £54.23 | NA |
The Requiem Red | £22.65 | In stock |
... | ... | ... |
Use the to_csv method from Pandas:
def save_to_csv(dataframe, filename="books.csv"):
    dataframe.to_csv(filename, index=False)
    print(f"Data saved to {filename}")
Code Explanation:
filename: Specifies the name of the output file.
index=False: Ensures the index column is not included in the CSV file.
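As a quick illustration (a minimal usage sketch, not part of the original script), you could call this helper on the DataFrame returned by scrape_books(); the first lines of the resulting books.csv would mirror the sample table above:

books_df = scrape_books()
save_to_csv(books_df)  # writes books.csv to the current working directory

# books.csv would then begin with rows like:
# Title,Price,Availability
# A Light in the Attic,£51.77,In stock
# Tipping the Velvet,£53.74,In stock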
Use the to_json method from Pandas:
def save_to_json(dataframe, filename="books.json"):
    dataframe.to_json(filename, orient="records", indent=4)
    print(f"Data saved to {filename}")
Code Explanation:
orient="records": Each row in the DataFrame is converted into a JSON object.
indent=4: Formats the JSON for better readability.
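For a sense of the output (an illustrative sketch based on the sample table, not output captured from a live run), calling the helper on the scraped DataFrame produces a JSON array of record objects:

save_to_json(books_df)  # writes books.json

# books.json would look roughly like:
# [
#     {
#         "Title": "A Light in the Attic",
#         "Price": "£51.77",
#         "Availability": "In stock"
#     },
#     ...
# ]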
Use the to_sql method from Pandas together with SQLite:
import sqlite3
def save_to_database(dataframe, database_name="books.db"):
    conn = sqlite3.connect(database_name)
    dataframe.to_sql("books", conn, if_exists="replace", index=False)
    conn.close()
    print(f"Data saved to {database_name} database")
Code Explanation:
sqlite3.connect(database_name): Connects to the SQLite database (creates it if it doesn’t exist).
to_sql("books", conn, if_exists="replace", index=False): Writes the DataFrame to a table named books, replacing the table if it already exists, and omits the DataFrame index.

While formats like CSV or JSON work well for smaller projects, databases offer superior performance, query optimization, and data integrity when handling larger datasets. The seamless integration of Pandas with SQLite makes it simple to store, retrieve, and manipulate data efficiently. Whether you're building a data pipeline or a complete application, understanding how to leverage databases will greatly enhance your ability to work with data effectively. Start using these tools today to streamline your data workflows and unlock new possibilities!
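To make that retrieval point concrete, here is a minimal sketch (not part of the original tutorial) that reads the books table back out of books.db with pd.read_sql_query and filters it in Pandas. The helper name load_books_under and the price-cleaning step are my own additions for illustration; the cleaning assumes prices were stored as text such as "£51.77", as in the scraped data above.

import sqlite3
import pandas as pd

def load_books_under(database_name="books.db", max_price=50.0):
    # Reconnect to the SQLite database created by save_to_database()
    conn = sqlite3.connect(database_name)
    # Pull the "books" table back into a DataFrame
    df = pd.read_sql_query("SELECT Title, Price, Availability FROM books", conn)
    conn.close()
    # Prices were stored as text (e.g. "£51.77"); strip the symbol to compare numerically
    df["PriceValue"] = df["Price"].str.replace("£", "", regex=False).astype(float)
    return df[df["PriceValue"] < max_price]

if __name__ == "__main__":
    print(load_books_under(max_price=50.0))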