
Web Crawling

Crawl content from any public web page and import it into PuppyOne.


How it works

PuppyOne uses Firecrawl to crawl web content:

  1. Enter a URL
  2. Firecrawl renders the page, including JavaScript
  3. The main content is extracted and converted to Markdown
  4. The result is stored as JSON
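
The pipeline above can be sketched in miniature. The parser below is a rough stand-in for step 3 (content extraction and Markdown conversion), built only on the Python standard library; the real Firecrawl pipeline also renders JavaScript first and handles far more markup than headings and paragraphs.

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Toy extractor: h1/h2 become Markdown headings, other text is kept as-is."""
    def __init__(self):
        super().__init__()
        self.out = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._prefix = "# "
        elif tag == "h2":
            self._prefix = "## "

    def handle_endtag(self, tag):
        self._prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)

def html_to_markdown(html):
    # Step 3 in miniature: parse HTML, emit a Markdown-ish string.
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n\n".join(parser.out)

print(html_to_markdown("<h1>Quick Start</h1><p>Welcome</p>"))
# # Quick Start
#
# Welcome
```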

Usage steps

Step 1: Start the import

  1. Open your Project
  2. Click Import URL

Step 2: Enter a URL

Paste the URL of the web page you want to crawl:

https://example.com/docs/getting-started

Step 3: Configure crawl options

Option           Notes
------           -----
Crawl depth      How many levels of links to follow into child pages (0 = this page only, up to 3)
Include paths    Crawl only matching paths, such as /docs/*
Exclude paths    Skip specific paths, such as /blog/*
Wait time        How long to wait for JavaScript rendering
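
Include and exclude paths behave like glob patterns matched against the URL path. A minimal sketch of how such a filter could be applied to a candidate link (the should_crawl helper is hypothetical, not a PuppyOne API):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, include=("*",), exclude=()):
    # Hypothetical filter: exclude patterns win over include patterns,
    # and both match against the URL path, e.g. "/docs/intro".
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    return any(fnmatch(path, pat) for pat in include)

should_crawl("https://example.com/docs/intro", include=("/docs/*",))  # True
should_crawl("https://example.com/blog/post", include=("/docs/*",))   # False
```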

Step 4: Start crawling

Click Import and wait for crawling to finish.


Data structure example

{
  "url": "https://example.com/docs/intro",
  "title": "Quick Start Guide",
  "content": "# Quick Start\n\nWelcome to our product...\n\n## Installation\n\n```bash\nnpm install example\n```",
  "metadata": {
    "description": "Homepage of the product docs",
    "crawled_at": "2024-01-20T10:30:00Z"
  }
}
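
Each crawled page is stored as one such JSON object, so downstream tooling can consume it directly. For example, pulling the Markdown body back out and listing its headings (a sketch using a record shaped like the example above):

```python
import json

# A record shaped like the example above (content abridged).
record = json.loads(
    '{"url": "https://example.com/docs/intro",'
    ' "title": "Quick Start Guide",'
    ' "content": "# Quick Start\\n\\nWelcome...\\n\\n## Installation",'
    ' "metadata": {"crawled_at": "2024-01-20T10:30:00Z"}}'
)

# Markdown headings start with "#", so a simple prefix check finds them.
headings = [line for line in record["content"].splitlines()
            if line.startswith("#")]
print(headings)  # ['# Quick Start', '## Installation']
```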

Crawl modes

Single-page crawl

Crawls only the URL you enter:

URL: https://docs.example.com/intro
Result: 1 page

Multi-page crawl

Follows links and crawls multiple pages:

URL: https://docs.example.com
Depth: 2
Include paths: /docs/*
Result: all pages under /docs/
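
A multi-page crawl is essentially a breadth-first walk of the site's link graph, cut off at the configured depth. A minimal sketch, using a hypothetical in-memory link graph in place of real HTTP fetches:

```python
from collections import deque
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical link graph standing in for a real site; a real
# crawler would fetch each page and parse its outgoing links.
LINKS = {
    "https://docs.example.com": [
        "https://docs.example.com/docs/intro",
        "https://docs.example.com/blog/news",
    ],
    "https://docs.example.com/docs/intro": [
        "https://docs.example.com/docs/install",
    ],
}

def crawl(start, depth, include="/*"):
    # Breadth-first walk: visit the start URL, then follow links
    # matching the include pattern, up to `depth` hops away.
    seen, queue, pages = {start}, deque([(start, 0)]), []
    while queue:
        url, d = queue.popleft()
        pages.append(url)
        if d == depth:
            continue  # depth reached: record the page but follow no links
        for link in LINKS.get(url, []):
            if link not in seen and fnmatch(urlparse(link).path, include):
                seen.add(link)
                queue.append((link, d + 1))
    return pages

crawl("https://docs.example.com", depth=2, include="/docs/*")
# the start page plus every /docs/* page within two link hops
```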

Example configurations

Crawl an entire documentation site

URL: https://docs.example.com
Depth: 3
Include: /docs/*, /guides/*
Exclude: /blog/*, /changelog/*

Crawl just one page

URL: https://example.com/pricing
Depth: 0  # do not follow links
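
For reference, the same two configurations expressed as plain data (the field names here are illustrative only; the actual option names live in the PuppyOne UI):

```python
# Illustrative shapes only, not an exported PuppyOne format.
full_docs_site = {
    "url": "https://docs.example.com",
    "depth": 3,
    "include": ["/docs/*", "/guides/*"],
    "exclude": ["/blog/*", "/changelog/*"],
}

single_page = {
    "url": "https://example.com/pricing",
    "depth": 0,  # do not follow links
}
```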

Use cases

Use case 1: Competitor documentation

Crawl a competitor's public docs so agents can perform comparison analysis.

Use case 2: Technical documentation

Crawl the official docs of a framework or library for use as reference material by engineering agents.

Use case 3: Product pages

Crawl your own public product pages to keep the agent knowledge base aligned with your website.


Limitations

Limitation                        Notes
----------                        -----
Public pages only                 Pages that require login cannot be crawled
Rate limits apply                 Up to 100 pages per minute
Some websites block crawling      Certain sites have anti-bot protections

FAQ

Why is the crawl result empty?

  • Check whether the URL is correct
  • The page may require login
  • The site may have anti-crawling protection

Why does the extracted content look messy?

Some complex pages, especially those with lots of dynamic content, may not extract cleanly. You can try increasing the Wait time so JavaScript fully renders first.

Can it crawl pages that require login?

Not yet. If the platform offers an API or export feature, that is usually the better option.


Next steps