
Web Crawling

Crawl content from any public web page and import it into PuppyOne.


How it works

PuppyOne uses Firecrawl to crawl web content:

  1. Enter a URL
  2. Firecrawl renders the page, including JavaScript
  3. The main content is extracted and converted to Markdown
  4. The result is stored as JSON
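
The pipeline above can be sketched in miniature. The parser below is a rough stand-in for step 3 (content extraction and Markdown conversion), built only on the Python standard library; the real Firecrawl pipeline also renders JavaScript first and handles far more markup than headings and paragraphs.

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Toy extractor: h1/h2 become Markdown headings, other text is kept as-is."""
    def __init__(self):
        super().__init__()
        self.out = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._prefix = "# "
        elif tag == "h2":
            self._prefix = "## "

    def handle_endtag(self, tag):
        self._prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)

def html_to_markdown(html):
    # Step 3 in miniature: parse HTML, emit a Markdown-ish string.
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n\n".join(parser.out)

print(html_to_markdown("<h1>Quick Start</h1><p>Welcome</p>"))
# # Quick Start
#
# Welcome
```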

Usage steps

Step 1: Start the import

  1. Open your Project
  2. Click Import URL

Step 2: Enter a URL

Paste the URL of the web page you want to crawl:

https://example.com/docs/getting-started

Step 3: Configure crawl options

Option           Notes
------           -----
Crawl depth      How many levels of links to follow into child pages (0 = this page only, up to 3)
Include paths    Crawl only matching paths, such as /docs/*
Exclude paths    Skip specific paths, such as /blog/*
Wait time        How long to wait for JavaScript rendering
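
Include and exclude paths behave like glob patterns matched against the URL path. A minimal sketch of how such a filter could be applied to a candidate link (the should_crawl helper is hypothetical, not a PuppyOne API):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, include=("*",), exclude=()):
    # Hypothetical filter: exclude patterns win over include patterns,
    # and both match against the URL path, e.g. "/docs/intro".
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    return any(fnmatch(path, pat) for pat in include)

should_crawl("https://example.com/docs/intro", include=("/docs/*",))  # True
should_crawl("https://example.com/blog/post", include=("/docs/*",))   # False
```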

Step 4: Start crawling

Click Import and wait for crawling to finish.


Data structure example

{
  "url": "https://example.com/docs/intro",
  "title": "Quick Start Guide",
  "content": "# Quick Start\n\nWelcome to our product...\n\n## Installation\n\n```bash\nnpm install example\n```",
  "metadata": {
    "description": "Homepage of the product docs",
    "crawled_at": "2024-01-20T10:30:00Z"
  }
}
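
Each crawled page is stored as one such JSON object, so downstream tooling can consume it directly. For example, pulling the Markdown body back out and listing its headings (a sketch using a record shaped like the example above):

```python
import json

# A record shaped like the example above (content abridged).
record = json.loads(
    '{"url": "https://example.com/docs/intro",'
    ' "title": "Quick Start Guide",'
    ' "content": "# Quick Start\\n\\nWelcome...\\n\\n## Installation",'
    ' "metadata": {"crawled_at": "2024-01-20T10:30:00Z"}}'
)

# Markdown headings start with "#", so a simple prefix check finds them.
headings = [line for line in record["content"].splitlines()
            if line.startswith("#")]
print(headings)  # ['# Quick Start', '## Installation']
```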

Crawl modes

Single-page crawl

Crawls only the URL you enter:

URL: https://docs.example.com/intro
Result: 1 page

Multi-page crawl

Follows links and crawls multiple pages:

URL: https://docs.example.com
Depth: 2
Include paths: /docs/*
Result: all pages under /docs/
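
A multi-page crawl is essentially a breadth-first walk of the site's link graph, cut off at the configured depth. A minimal sketch, using a hypothetical in-memory link graph in place of real HTTP fetches:

```python
from collections import deque
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical link graph standing in for a real site; a real
# crawler would fetch each page and parse its outgoing links.
LINKS = {
    "https://docs.example.com": [
        "https://docs.example.com/docs/intro",
        "https://docs.example.com/blog/news",
    ],
    "https://docs.example.com/docs/intro": [
        "https://docs.example.com/docs/install",
    ],
}

def crawl(start, depth, include="/*"):
    # Breadth-first walk: visit the start URL, then follow links
    # matching the include pattern, up to `depth` hops away.
    seen, queue, pages = {start}, deque([(start, 0)]), []
    while queue:
        url, d = queue.popleft()
        pages.append(url)
        if d == depth:
            continue  # depth reached: record the page but follow no links
        for link in LINKS.get(url, []):
            if link not in seen and fnmatch(urlparse(link).path, include):
                seen.add(link)
                queue.append((link, d + 1))
    return pages

crawl("https://docs.example.com", depth=2, include="/docs/*")
# the start page plus every /docs/* page within two link hops
```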

Example configurations

Crawl an entire documentation site

URL: https://docs.example.com
Depth: 3
Include: /docs/*, /guides/*
Exclude: /blog/*, /changelog/*

Crawl just one page

URL: https://example.com/pricing
Depth: 0  # do not follow links
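
For reference, the same two configurations expressed as plain data (the field names here are illustrative only; the actual option names live in the PuppyOne UI):

```python
# Illustrative shapes only, not an exported PuppyOne format.
full_docs_site = {
    "url": "https://docs.example.com",
    "depth": 3,
    "include": ["/docs/*", "/guides/*"],
    "exclude": ["/blog/*", "/changelog/*"],
}

single_page = {
    "url": "https://example.com/pricing",
    "depth": 0,  # do not follow links
}
```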

Use cases

Use case 1: Competitor documentation

Crawl a competitor's public docs so agents can perform comparison analysis.

Use case 2: Technical documentation

Crawl the official docs of a framework or library for use as reference material by engineering agents.

Use case 3: Product pages

Crawl your own public product pages to keep the agent knowledge base aligned with your website.


Limitations

Limitation                        Notes
----------                        -----
Public pages only                 Pages that require login cannot be crawled
Rate limits apply                 Up to 100 pages per minute
Some websites block crawling      Certain sites have anti-bot protections

FAQ

Why is the crawl result empty?

  • Check whether the URL is correct
  • The page may require login
  • The site may have anti-crawling protection

Why does the extracted content look messy?

Some complex pages, especially those with lots of dynamic content, may not extract cleanly. You can try increasing the Wait time so JavaScript fully renders first.

Can it crawl pages that require login?

Not yet. If the platform offers an API or export feature, that is usually the better option.


Next steps