Introduction to Web Parsing in Python with Parsel

Python, Jan-03-2025, 5 min read

Web scraping has become an essential skill for Python developers, data scientists, and web scraping enthusiasts. Whether you're extracting data for analysis, building a price comparison tool, or automating content extraction, web parsing is at the core of each of these tasks. But what makes web parsing both efficient and beginner-friendly? Enter Parsel—a powerful library in Python that simplifies HTML parsing and data extraction.

Introduction to Parsel

What Is Parsel?

Parsel is a lightweight Python library designed for HTML/XML parsing and data extraction. Built with web scraping in mind, Parsel makes it easy to interact with web page structures using powerful selectors like XPath and CSS. These tools allow you to locate and extract specific elements or attributes from web pages with precision. Parsel’s integration with Python’s ecosystem also means it works seamlessly with libraries like `requests` and `httpx` for fetching web content.

Key Features of Parsel

  • Selector Support: Use XPath for detailed path navigation or CSS selectors for simpler syntax.
  • Flexibility: Handles a range of HTML/XML parsing tasks, from simple text extraction to deeply nested elements.
  • Scalability: Parsel can extract data from a single page or multiple pages via looping structures.
  • Integration with Scrapy: Scrapy, a popular web scraping framework, builds its own selectors on Parsel, so the same queries carry over.

Common Use Cases

  • Web Scraping: Extracting data from e-commerce websites for price monitoring.
  • Data Mining: Collecting contact information or research data from public records.
  • Automation: Streamlining repetitive tasks like downloading product specifications.

Understanding Selectors and HTML Parsing

What Is HTML Parsing?

HTML parsing is the process of breaking down an HTML document into its structural components, such as tags, attributes, and the Document Object Model (DOM). Parsel uses this structure to precisely locate and extract the data you need.

HTML documents are built using:

  • Tags: Define the type of element (e.g., <h1>, <p>, <img>).
  • Attributes: Provide additional information about an element (e.g., id, class, href).
  • DOM: A hierarchical representation of the webpage, which enables navigation between elements.

What Are Selectors?

XPath and CSS selectors are query languages used to select elements in an HTML document:

  • XPath: Powerful and feature-rich, XPath allows you to select nodes based on their path in the DOM, parent-child relationships, and conditions.
  • CSS Selectors: A simpler syntax often used in front-end development, ideal for selecting elements based on classes, IDs, or pseudo-selectors.

Read more about Selectors here.

Prerequisites

Before getting started with Parsel, ensure the following:

  • Python Installed: Download and install Python from
  • Install Required Libraries (parsel, requests): 
pip install parsel requests

Parsel Techniques for HTML Parsing

Extracting Elements by ID and Class

Parsel allows us to parse an element by simply knowing its class name or ID. This is particularly useful when targeting specific elements in a webpage for data extraction.

To demonstrate, we’ll use this example website. We’ll focus on extracting data from an element with the class name product_pod, which represents a single book entry.

Below is an example of its HTML structure:

From this element, we will extract:

  • Title of the book
  • Price of the book
  • Availability status

Python Code Example

import requests
from parsel import Selector

# Fetch the webpage
url = "https://books.toscrape.com/"
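response = requests.get(url)
# The page is UTF-8; setting this explicitly keeps "£" from being decoded as "Â£"
response.encoding = "utf-8"
selector = Selector(response.text)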

# Select the first product by class
product = selector.css("article.product_pod").get()

# Parse details from the selected product
product_selector = Selector(text=product)
title = product_selector.css("h3 a::attr(title)").get()
price = product_selector.css("p.price_color::text").get()
# The first text node is only whitespace (before the icon), so join all text nodes
availability = "".join(product_selector.css("p.instock.availability::text").getall()).strip()

print("Title:", title)
print("Price:", price)
print("Availability:", availability)

Script Explanation:

  • Fetch the webpage: The script sends a GET request to the example website to retrieve its HTML content.
  • Select the product block: It uses the CSS selector article.product_pod to select the first book entry from the page.
  • Parse the product details: The script extracts the title, price, and availability by targeting specific HTML tags and attributes within the product_pod block.
  • Display the results: The extracted data is printed to the console in a readable format.

Example Output

Title: A Light in the Attic
Price: £51.77
Availability: In stock

Extracting Text from Elements

Parsel makes it easy to extract text from HTML elements, whether it’s a title, a description, or other visible content on a webpage.

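To demonstrate, we’ll use the same example website again. We'll focus on extracting the title of a single book from the <a> tag inside the h3 tag, which is nested in an article element with the class product_pod. Because the visible link text can be shortened on the page, the example reads the full title from the title attribute.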

Below is an example of its HTML structure:

From this element, we will extract:

  • The title text of the book

Python Code Example

import requests
from parsel import Selector

# Fetch the webpage
url = "https://books.toscrape.com/"
response = requests.get(url)
selector = Selector(response.text)

# Select the first product by class
product = selector.css("article.product_pod").get()

# Parse the title text from the selected product
product_selector = Selector(text=product)

title_text = product_selector.css("h3 a::attr(title)").get()
print("Title Text:", title_text)

Script Explanation:

  • Fetch the webpage: The script sends a GET request to retrieve the HTML content from the website.
  • Select the product block: The article.product_pod CSS selector targets the first book entry.
  • Extract the title text: Using h3 a::attr(title), the script pulls the title attribute from the <a> tag nested in the <h3> tag.
  • Display the result: The extracted title is printed to the console.

Example Output

Title Text: A Light in the Attic

Extracting Attributes (e.g., `href`, `src`, `alt`)

Parsel also allows us to extract attribute values, such as href, src, or alt, from HTML elements. These attributes often contain valuable data like URLs, image sources, or descriptive text.

We'll focus on extracting the link (href) to a book's detail page from the <a> tag inside an article element with the class product_pod.

Below is an example of its HTML structure:

From this element, we will extract:

  • The link (href) to the book's detail page

Python Code Example

import requests
from parsel import Selector

# Fetch the webpage
url = "https://books.toscrape.com/"
response = requests.get(url)
selector = Selector(response.text)

# Select the first product by class
product = selector.css("article.product_pod").get()

# Parse the 'href' attribute from the selected product
product_selector = Selector(text=product)
book_link = product_selector.css("h3 a::attr(href)").get()

print("Book Link:", book_link)

Script Explanation:

  • Fetch the webpage: The script sends a GET request to retrieve the HTML content from the website.
  • Select the product block: The article.product_pod CSS selector targets the first book entry.
  • Extract the href attribute: Using h3 a::attr(href), the script pulls the href value from the <a> tag nested in the <h3> tag.
  • Display the result: The extracted URL is printed to the console.

Example Output

Book Link: catalogue/a-light-in-the-attic_1000/index.html

Extracting Lists of Elements

Parsel makes it simple to extract multiple elements from a webpage using CSS or XPath selectors. This is especially useful when working with lists, such as product titles, links, or prices.

We'll focus on extracting a list of all book titles displayed on the homepage of the example website that we are using for this tutorial.

Below is an example of the relevant HTML structure:

From these elements, we will extract:

  • The titles of all books displayed on the homepage

Python Code Example

import requests
from parsel import Selector

# Fetch the webpage
url = "https://books.toscrape.com/"
response = requests.get(url)
selector = Selector(response.text)

# Select all book titles
book_titles = selector.css("article.product_pod h3 a::attr(title)").getall()

# Print each title
for title in book_titles:
    print("Title:", title)

Script Explanation:

  • Fetch the webpage: The script sends a GET request to retrieve the HTML content from the website.
  • Select all titles: Using the CSS selector article.product_pod h3 a::attr(title), it selects the title attribute of every <a> tag inside an <h3> tag within product_pod elements.
  • Extract titles: The .getall() method retrieves a list of all matching titles.
  • Display results: Each title in the list is printed one by one.

Example Output

Title: A Light in the Attic
Title: Tipping the Velvet
Title: Soumission
Title: Sharp Objects

Navigating Nested Elements

Parsel enables efficient navigation of complex, nested HTML structures using CSS selectors and XPath. This is particularly valuable when extracting data buried deep within multiple layers of HTML tags.

We'll be extracting the price of a book from within the product_pod element.

Below is an example of the relevant HTML structure:

From this nested structure, we will extract:

  • The price of the first book

Python Code Example

import requests
from parsel import Selector

# Fetch the webpage
url = "https://books.toscrape.com/"
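response = requests.get(url)
# The page is UTF-8; setting this explicitly keeps "£" from being decoded as "Â£"
response.encoding = "utf-8"
selector = Selector(response.text)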

# Select the first product by class
product = selector.css("article.product_pod").get()

# Parse the nested price element
product_selector = Selector(text=product)
price = product_selector.css("div.product_price p.price_color::text").get()

print("Price:", price)

Script Explanation:

  • Fetch the webpage: The script retrieves the webpage's HTML content using requests.
  • Select the first product block: It targets the first article tag with the class product_pod.
  • Navigate to the nested price element: Using the CSS selector div.product_price p.price_color, it navigates into the div and selects the p tag containing the price.
  • Extract the price text: The .get() method retrieves the price value.
  • Display the result: The extracted price is printed.

Example Output

Price: £51.77

Parsing Tables and Lists

Parsel simplifies the process of extracting structured data from HTML lists and table-like formats. Websites often display information in repeating patterns, such as product grids or ordered lists, making Parsel an essential tool for efficiently capturing this data.

As an example, we’ll work again with the same example website. Our goal is to extract a list of book titles along with their prices. Specifically, we’ll target the <ol> tag, which contains multiple <li> elements, each representing an individual book.

Below is an example of the relevant HTML structure:

From this structure, we will extract:

  • The title of each book
  • The price of each book

Python Code Example

import requests
from parsel import Selector

# Fetch the webpage
url = "https://books.toscrape.com/"
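response = requests.get(url)
# The page is UTF-8; setting this explicitly keeps "£" from being decoded as "Â£"
response.encoding = "utf-8"
selector = Selector(response.text)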

# Select all book items in the list
books = selector.css("ol.row > li")

# Loop through each book and extract title and price
for book in books:
    title = book.css("h3 a::attr(title)").get()
    price = book.css("p.price_color::text").get()
    print(f"Title: {title} | Price: {price}")

Script Explanation:

  • Fetch the webpage: The script sends a GET request to fetch the HTML content from the website.
  • Select all book items: The selector ol.row > li targets the <li> elements directly inside the ordered list (<ol>), each of which wraps one book. The <li> elements themselves do not carry an article class; the article element sits inside them.
  • Extract title and price: h3 a::attr(title) extracts the title attribute of the <a> tag, and p.price_color::text extracts the price text from the <p> tag.
  • Display the results: The script prints each book's title and price.

Example Output

Title: A Light in the Attic | Price: £51.77
Title: Tipping the Velvet | Price: £53.74
Title: Soumission | Price: £50.10
Title: Sharp Objects | Price: £47.82
...

Conclusion

In this tutorial, we explored the fundamentals of web parsing in Python with Parsel. From understanding basic selectors to navigating nested elements, extracting attributes, and parsing lists, we demonstrated how Parsel simplifies the process of extracting meaningful data from web pages.

Here’s a quick recap of what we covered:

  • Extracting elements by ID and class: Targeted specific HTML elements using CSS selectors.
  • Extracting text and attributes: Learned how to pull text content and attributes like href and src from elements.
  • Handling nested elements: Explored how to navigate through parent-child relationships in HTML structures.
  • Parsing lists: Extracted structured data from repeating patterns, such as tables or product lists.

Next Steps:

  • Experiment with different web pages and complex HTML structures using Parsel.
  • Explore integration with other libraries like Scrapy for large-scale web scraping projects.
  • Follow best practices for ethical scraping, such as respecting robots.txt files and avoiding overloading servers.
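
As one possible way to act on the last point, the sketch below checks URLs against robots.txt rules with the standard library's `urllib.robotparser` and pauses between requests. The rules, the `example-bot` user agent, the example.com URLs, and the `crawl` helper are all hypothetical; in practice the rules would be loaded from the site's own robots.txt.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, supplied inline so the sketch runs offline
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

USER_AGENT = "example-bot"  # hypothetical user agent name

urls = [
    "https://example.com/catalogue/page-1.html",
    "https://example.com/private/report.html",
]

# Keep only the URLs that the rules allow for this user agent
to_fetch = [url for url in urls if parser.can_fetch(USER_AGENT, url)]

# Use the site's requested delay if present, otherwise fall back to 1 second
delay = parser.crawl_delay(USER_AGENT) or 1


def crawl(urls, fetch, delay_seconds):
    """Fetch each URL with the given callable, pausing between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


print(to_fetch, delay)
```

A call such as `crawl(to_fetch, lambda url: requests.get(url).text, delay)` would then fetch only the permitted pages at the requested pace.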

Parsel is a powerful tool in the web scraping toolkit, and mastering it will open up countless opportunities for data collection and analysis.

Happy parsing!