
Flexible Web Scraping with n8n: A Configurable, Multi-Page Template

Learn to build a powerful, multi-page web scraper in n8n. Use a single JSON input to configure fields and pagination, creating a flexible, reusable data extraction template.

Web scraping is often necessary for gathering data, but building a scraper for every single site can be tedious. What if you could build a single n8n workflow that could be quickly configured to scrape any similar paginated website?

This guide shows you how to create a powerful, reusable web scraping template using n8n, focusing on the configuration of selectors and fields via a simple Input Node.

The Goal: A Configurable & Paginated Scraper

We aim to create a single workflow that iterates through multiple pages and extracts specific data fields (like author and text) using a simple JSON configuration.

Example Site: We will use https://quotes.toscrape.com/tag/humor/, which features two pages of quotes, each listing quote texts together with their authors.

The Blueprint: Initial Configuration

The entire logic of our scraper is defined in the initial Input Node using the following JSON structure. This makes the workflow incredibly easy to reuse—just change this JSON block for a new site!

{
  "startUrl": "https://quotes.toscrape.com/tag/humor/",
  "nextPageSelector": "li.next a[href]",
  "fields": [
    {
      "name": "author",
      "selector": "span > small.author",
      "value": "text"
    },
    {
      "name": "text",
      "selector": "span.text",
      "value": "text"
    }
  ]
}
| Field | Description |
| --- | --- |
| startUrl | The URL for the first page. |
| nextPageSelector | The CSS selector for the “Next Page” link. |
| fields | An array of field definitions to extract on each page. |
| fields[].name | The name of the field to extract. |
| fields[].selector | The CSS selector for the specific data point. |
| fields[].value | The HTML property to extract (e.g., text or href). |

Parameters for the Set Node, defining the initial JSON configuration.
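To illustrate how reusable this is, here is a hypothetical configuration for a different kind of site, one that extracts article titles and their link URLs via the href value. The URL and selectors below are invented for illustration only:

```json
{
  "startUrl": "https://example.com/articles",
  "nextPageSelector": "a.pagination-next[href]",
  "fields": [
    { "name": "title", "selector": "h2.article-title", "value": "text" },
    { "name": "link", "selector": "h2.article-title > a", "value": "href" }
  ]
}
```

Swapping in a block like this is the only change needed to retarget the workflow.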

Building the Workflow: Step-by-Step

1. Initial Request and Data Preparation

The first step is retrieving the HTML content based on the startUrl provided in our configuration.

  1. HTTP Request Node: Get the current page’s HTML.
    • URL: Use an expression to pull the startUrl from the Input Node: {{ $json.startUrl }}
    • Response Format: Select Text.

Configuration of the HTTP Request Node using {{ $json.startUrl }} expression.

  2. Split Out Node: Prepare the field configuration for extraction.
    • This node takes the fields array from the Input Node and splits it into multiple individual items (one item for ‘author’ and one for ‘text’).
    • Fields to Split Out: fields

Split Out Node configuration, targeting the ‘fields’ array for separation.

  3. Merge Node (Combine): Merge the HTML content with the field configurations.
    • Mode: Select Combine.
    • The node needs two inputs: the HTML content from HTTP Request and the split field items from Split Out.
    • This ensures that for every split field item, we have a copy of the HTML content, allowing us to run the HTML extractor for each desired field.

Parameters for the Merge Node set to ‘Combine’ mode for cross-joining data.

Visual diagram showing the connection of Input, HTTP Request, Split Out, and Merge Nodes.
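Conceptually, the Combine merge performs a cross-join: each field configuration item is paired with a copy of the page HTML, so the extractor can later run once per field. A minimal JavaScript sketch of that behavior (the function and property names here are illustrative, not n8n internals):

```javascript
// Cross-join sketch: pair every field config with the page HTML.
function crossJoin(htmlItems, fieldItems) {
  const result = [];
  for (const page of htmlItems) {
    for (const field of fieldItems) {
      // Each output item carries both the field config and the HTML.
      result.push({ ...field, html: page.html });
    }
  }
  return result;
}

const merged = crossJoin(
  [{ html: "<html>...</html>" }],
  [
    { name: "author", selector: "span > small.author", value: "text" },
    { name: "text", selector: "span.text", value: "text" },
  ]
);
// merged now holds one item per field, each with its own copy of the HTML.
```

This is exactly why the HTML Node downstream can use {{ $json.selector }} as its selector: every item it receives already bundles one field's configuration with the page content.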

2. Extracting Content

Now that we have the HTML and the field-specific selectors together, we can use the HTML Node to perform the extraction.

  1. HTML Node (Extract HTML Content):
    • Operation: Extract HTML Content
    • Key: Use an expression to pull the field name: {{ $json.name }}
    • CSS Selector: Use an expression to pull the field selector: {{ $json.selector }}
    • Return Value: Use an expression to pull the value property: {{ $json.value }}

HTML Node configured to extract data using dynamic expressions for selector and attribute.

  2. Aggregate Node: Group the extracted data.
    • The HTML node outputs the extracted data, but it’s still grouped by the original fields (e.g., all authors are one item, all texts are another).
    • The Aggregate Node collects all items processed so far into a single list.

Aggregate Node settings, collecting all extracted data into a single list.

  3. Split Out Node: Restructure the results into individual records.
    • To pair up the resulting authors and texts, split the aggregated lists back out into individual items. For this simple case the structure after the Aggregate Node may suffice, but for clean output you can use a Code Node to reorganize the lists of authors and texts into objects like [{author: "A", text: "T"}, ...].

Split Out Node used to finalize the data structure by splitting into individual records.
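If you opt for a Code Node here, its job is essentially to zip the aggregated parallel arrays into one object per quote. A minimal sketch, assuming the aggregated data looks like { author: [...], text: [...] } (the function name and sample values are made up):

```javascript
// Zip parallel field arrays into one record per row.
function zipFields(aggregated) {
  const names = Object.keys(aggregated);
  // Use the shortest array so uneven extractions don't produce holes.
  const length = Math.min(...names.map((n) => aggregated[n].length));
  const rows = [];
  for (let i = 0; i < length; i++) {
    const row = {};
    for (const name of names) row[name] = aggregated[name][i];
    rows.push(row);
  }
  return rows;
}

const rows = zipFields({
  author: ["Jane Austen", "Steve Martin"],
  text: ["quote one", "quote two"],
});
// rows[0] → { author: "Jane Austen", text: "quote one" }
```

Because the function iterates whatever keys are present, it keeps working unchanged if you add more entries to the fields array in the configuration.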

3. Implementing Pagination (The Loop)

The key to multi-page scraping is creating a mechanism that repeats the process until the “next page” link is no longer found. This requires a loop structure that updates the original configuration.

  1. HTML Node (Extract Next Page Link):
    • Before the data splitting, connect a second branch from the HTTP Request Node to a new HTML Node.
    • Operation: Extract HTML Content
    • Key: nextPage (the property name the extracted URL is stored under).
    • CSS Selector: Use the configuration value: {{ $("Input").item.json.nextPageSelector }}
    • Attribute: href (to get the URL).

HTML Node configured to extract the next page URL using the nextPageSelector property.

  2. If Node: Check if a next page link was found.
    • Value 1: {{ $json.nextPage }} (The output of the HTML node)
    • Condition: is not empty

If Node condition checking if the extracted next page link is not empty.

  3. Set Node (True Branch): Update the startUrl for the next iteration.
    • If the link exists, use the Set Node to update the original configuration.
    • Value: {{ $json.nextPage }}
    • Key: startUrl (This overwrites the original startUrl in the input item).

Set Node configured to update the startUrl field with the newly scraped next page link.

  4. Closing the Loop:
    • Connect the output of the Set Node directly back to the input of the HTTP Request Node and Split Out Node. This tells n8n to execute the entire scraping branch again, but this time using the newly updated startUrl.
    • When the If Node fails (the link is not found), the workflow completes the aggregation of all scraped data.

Complete n8n workflow diagram showing the loop logic for multi-page scraping.

By setting up this dynamic loop, the workflow automatically repeats until all available pages have been visited and their data aggregated. This gives you a truly flexible, one-stop solution for a wide range of web scraping tasks!
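Stripped of the node machinery, the loop the workflow implements reduces to a few lines. In this sketch, fetchPage is a purely illustrative stand-in for the HTTP Request and HTML nodes, and the fake two-page site mirrors the example's structure:

```javascript
// Follow "next page" links until none is found, collecting rows as we go.
async function scrapeAll(startUrl, fetchPage) {
  const allRows = [];
  let url = startUrl;
  while (url) {
    const { rows, nextPage } = await fetchPage(url);
    allRows.push(...rows);
    // An empty link ends the loop, just like the If node's "is not empty" check.
    url = nextPage || null;
  }
  return allRows;
}

// Fake two-page site standing in for the HTTP Request + HTML nodes.
const pages = {
  "/page1": { rows: [{ author: "A1" }], nextPage: "/page2" },
  "/page2": { rows: [{ author: "A2" }], nextPage: "" },
};

scrapeAll("/page1", async (url) => pages[url]).then((rows) => {
  console.log(rows.length); // 2
});
```

The Set Node feeding back into the HTTP Request Node plays the role of the `url = nextPage` assignment: it is the only state that changes between iterations.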

Final output from the workflow, displaying the structured and combined author and text data.
