File Upload

Upload PDFs, Word documents, images, and more directly into PuppyOne.


Supported file types

| Type | Formats | Processing |
| --- | --- | --- |
| Documents | PDF, DOCX, DOC | Text extraction + structuring |
| Spreadsheets | XLSX, CSV | Converted into JSON arrays |
| Images | PNG, JPG, JPEG | OCR text recognition |
| Text | TXT, MD, JSON | Imported directly |
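
As a rough illustration, the table above can be written as a lookup for pre-checking files before an upload. This is a sketch only: `SUPPORTED` and `processing_for` are hypothetical names, not part of any PuppyOne API.

```python
from pathlib import Path
from typing import Optional, Tuple

# Extension -> (type, processing), transcribed from the supported file types table.
SUPPORTED = {
    ".pdf": ("Documents", "Text extraction + structuring"),
    ".docx": ("Documents", "Text extraction + structuring"),
    ".doc": ("Documents", "Text extraction + structuring"),
    ".xlsx": ("Spreadsheets", "Converted into JSON arrays"),
    ".csv": ("Spreadsheets", "Converted into JSON arrays"),
    ".png": ("Images", "OCR text recognition"),
    ".jpg": ("Images", "OCR text recognition"),
    ".jpeg": ("Images", "OCR text recognition"),
    ".txt": ("Text", "Imported directly"),
    ".md": ("Text", "Imported directly"),
    ".json": ("Text", "Imported directly"),
}


def processing_for(filename: str) -> Optional[Tuple[str, str]]:
    """Return (type, processing) for a listed extension, or None if unlisted."""
    return SUPPORTED.get(Path(filename).suffix.lower())
```

The lookup is case-insensitive on the extension, so `Report.PDF` and `report.pdf` resolve to the same entry.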

Upload steps

Step 1: Start the import

  1. Open your Project
  2. Click Import → Upload Files

Step 2: Choose files

You can:

  • Drag files into the upload area
  • Click to choose local files
  • Upload multiple files at once

Step 3: Wait for processing

After upload, PuppyOne automatically:

  1. Parses the content, using text extraction for PDFs and OCR for images
  2. Cleans the result by removing headers, footers, and formatting noise
  3. Structures the content into JSON

You can track progress in the Tasks panel.
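
This page does not describe how the cleaning step works internally. To give a feel for what removing headers and footers can involve, here is a sketch of one common heuristic: drop lines that repeat on most pages. The function name and the `min_share` threshold are assumptions for illustration, not PuppyOne's implementation.

```python
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Drop lines that recur on at least `min_share` of pages (likely headers/footers)."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two occurrences so single-page documents are left untouched.
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]
```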


Data structure examples

PDF document → JSON

{
  "filename": "product-manual.pdf",
  "pages": 15,
  "content": "# Product Overview\n\nThis product is...\n\n## Technical Specs\n\n- Size: 10 x 5 x 3 cm\n- Weight: 250g",
  "metadata": {
    "author": "Alice",
    "created_at": "2024-01-10"
  }
}
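
Because `content` in the example above is Markdown, a consumer can split it by heading before handing it to an agent. The sketch below assumes the record has the shape shown in the example; `split_sections` is a hypothetical helper.

```python
import re
from typing import Dict, List, Optional


def split_sections(content: str) -> Dict[str, str]:
    """Map each Markdown heading to the text that follows it."""
    sections: Dict[str, List[str]] = {}
    title: Optional[str] = None
    for line in content.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            title = match.group(1).strip()
            sections[title] = []
        elif title is not None:
            sections[title].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


# Record shaped like the PDF example above.
doc = {
    "filename": "product-manual.pdf",
    "content": "# Product Overview\n\nThis product is...\n\n## Technical Specs\n\n- Size: 10 x 5 x 3 cm\n- Weight: 250g",
}
specs = split_sections(doc["content"])["Technical Specs"]
```

Text that appears before the first heading is ignored in this sketch.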

Image (OCR) → JSON

{
  "filename": "invoice.jpg",
  "ocr_text": "Invoice Number: INV-2024-001\nDate: 2024-01-15\nAmount: $1,234.56",
  "confidence": 0.95
}
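
One way to consume a result like the one above is to gate on `confidence` and then split the `Key: Value` lines in `ocr_text`. This is a sketch under assumptions: the 0.9 cutoff and the helper name `parse_ocr_fields` are illustrative, and real OCR text will not always follow a `Key: Value` layout.

```python
from typing import Dict, Optional


def parse_ocr_fields(result: dict, min_confidence: float = 0.9) -> Optional[Dict[str, str]]:
    """Return key/value pairs from ocr_text, or None if confidence is too low."""
    if result["confidence"] < min_confidence:
        return None  # Caller should route the file to manual review.
    fields: Dict[str, str] = {}
    for line in result["ocr_text"].splitlines():
        key, sep, value = line.partition(":")
        if sep:  # Skip lines without a colon.
            fields[key.strip()] = value.strip()
    return fields


# Record shaped like the OCR example above.
invoice = {
    "filename": "invoice.jpg",
    "ocr_text": "Invoice Number: INV-2024-001\nDate: 2024-01-15\nAmount: $1,234.56",
    "confidence": 0.95,
}
fields = parse_ocr_fields(invoice)
```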

Excel → JSON

{
  "filename": "sales-data.xlsx",
  "sheets": [
    {
      "name": "Sheet1",
      "rows": [
        {"Month": "2024-01", "Sales": 10000},
        {"Month": "2024-02", "Sales": 12000}
      ]
    }
  ]
}
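
Since each sheet becomes an array of row objects keyed by column header, simple aggregations need no spreadsheet library. A minimal sketch, assuming the shape shown above (`column_total` is a hypothetical helper):

```python
def column_total(workbook: dict, sheet_name: str, column: str) -> float:
    """Sum one numeric column of a named sheet."""
    for sheet in workbook["sheets"]:
        if sheet["name"] == sheet_name:
            return sum(row[column] for row in sheet["rows"])
    raise KeyError(f"no sheet named {sheet_name!r}")


# Record shaped like the Excel example above.
workbook = {
    "filename": "sales-data.xlsx",
    "sheets": [
        {
            "name": "Sheet1",
            "rows": [
                {"Month": "2024-01", "Sales": 10000},
                {"Month": "2024-02", "Sales": 12000},
            ],
        }
    ],
}
total = column_total(workbook, "Sheet1", "Sales")
```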

Processing options

PDF processing

| Option | Notes |
| --- | --- |
| Preserve pagination | Split content by page |
| Extract tables | Detect tables inside PDFs |
| Extract images | Run OCR on embedded images in PDFs |

OCR settings

| Option | Notes |
| --- | --- |
| Language | Chinese / English / auto-detect |
| Preprocessing | Image enhancement to improve recognition |

Processing modes

PuppyOne provides two file processing modes:

| Mode | Notes | Best for |
| --- | --- | --- |
| raw | Raw mode: stores the original file content directly | Structured files such as JSON and Markdown |
| ocr_parse | OCR parsing mode: extracts and structures text | PDFs, images, and other files that require text recognition |
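
The "Best for" column can be turned into a simple default when choosing a mode programmatically. The sketch below is an assumption-laden illustration: `recommended_mode` is a hypothetical helper, and falling back to `raw` for unlisted extensions is a choice made here, not documented behavior.

```python
from pathlib import Path

# Extensions that need text recognition, per the "Best for" column.
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def recommended_mode(filename: str) -> str:
    """Suggest 'ocr_parse' for PDFs and images, 'raw' for everything else."""
    ext = Path(filename).suffix.lower()
    return "ocr_parse" if ext in OCR_EXTENSIONS else "raw"
```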

Storage result comparison

How different file types are stored under different processing modes:

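| Scenario | mode | File Type | type | preview_type | preview_json | preview_md | s3_key |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Upload data.json | raw | json | json | json | ✓ | - | - |
| Upload data.json | ocr_parse | json | json | json | ✓ | - | - |
| Upload readme.md | raw | text | markdown | markdown | - | ✓ | - |
| Upload readme.md | ocr_parse | text | markdown | markdown | - | ✓ | - |
| Upload doc.pdf | raw | binary | file | NULL | - | - | ✓ |
| Upload doc.pdf | ocr_parse | ocr_needed | file→markdown | NULL→markdown | - | ✓ (after OCR) | ✓ (original) |
| Upload image.jpg | raw | binary | file | NULL | - | - | ✓ |
| Upload image.jpg | ocr_parse | ocr_needed | file→markdown | NULL→markdown | - | ✓ (after OCR) | ✓ (original) |
| Upload video.mp4 | raw | binary | file | NULL | - | - | ✓ |
| Upload video.mp4 | ocr_parse | binary | file | NULL | - | - | ✓ |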

Field definitions

| Field | Notes |
| --- | --- |
| type | The original file type identifier |
| preview_type | The preview content format (json, markdown, or NULL) |
| preview_json | Structured JSON content for direct querying |
| preview_md | Markdown content for display and agent reading |
| s3_key | S3 storage path for the original file, used for binary files |

Processing logic

  1. JSON and Markdown files: parsed directly in either mode, no OCR needed
  2. PDFs and images:
    • raw mode stores only the original file in S3
    • ocr_parse mode stores the original file in S3 and the extracted OCR text in preview_md
  3. Videos and other binary files: stored only in S3, with no content parsing
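
The three rules above can be sketched as a single function that predicts which fields end up populated once processing has finished. This is one reading of the rules, not PuppyOne's code: `storage_outcome` and the boolean "is populated" flags are hypothetical, and treating JSON and Markdown as not stored in S3 is an assumption based on the `s3_key` field being described as used for binary files.

```python
from pathlib import Path
from typing import Any, Dict

DIRECT_PREVIEW = {".json": "json", ".md": "markdown"}  # Rule 1: parsed directly
OCR_CAPABLE = {".pdf", ".png", ".jpg", ".jpeg"}        # Rule 2: PDFs and images


def storage_outcome(filename: str, mode: str) -> Dict[str, Any]:
    """Predict which storage fields are populated for a file and mode."""
    ext = Path(filename).suffix.lower()
    if ext in DIRECT_PREVIEW:
        kind = DIRECT_PREVIEW[ext]
        # Rule 1: same result in either mode, no OCR needed.
        return {
            "preview_type": kind,
            "preview_json": kind == "json",
            "preview_md": kind == "markdown",
            "s3_key": False,
        }
    # Rule 2: OCR text lands in preview_md only in ocr_parse mode.
    # Rule 3: videos and other binaries are stored in S3 with no parsing.
    ocr_applied = mode == "ocr_parse" and ext in OCR_CAPABLE
    return {
        "preview_type": "markdown" if ocr_applied else None,
        "preview_json": False,
        "preview_md": ocr_applied,
        "s3_key": True,
    }
```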

File size limits

| Plan | Per-file limit | Total storage |
| --- | --- | --- |
| Free | 10 MB | 100 MB |
| Pro | 50 MB | 10 GB |
| Team | 100 MB | Unlimited |
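
A client-side check against the per-file limits can avoid a failed upload. The sketch below assumes 1 MB means 1024 × 1024 bytes, which this page does not specify; `fits_plan` is a hypothetical helper.

```python
MB = 1024 * 1024  # Assumption: binary megabytes.

# Per-file limits in MB, transcribed from the table above.
PER_FILE_LIMIT_MB = {"Free": 10, "Pro": 50, "Team": 100}


def fits_plan(size_bytes: int, plan: str) -> bool:
    """Return True if a file of this size is within the plan's per-file limit."""
    return size_bytes <= PER_FILE_LIMIT_MB[plan] * MB
```

For a local file, `os.path.getsize(path)` from the standard library gives the size in bytes to pass in.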

Use cases

Use case 1: Product manuals

Upload a PDF product manual so a support agent can answer questions about product specifications.

Use case 2: Invoice recognition

Upload invoice images so OCR can extract key information for a finance agent.

Use case 3: Data reports

Upload Excel reports and convert them into JSON for analytics agents to query.


FAQ

What if OCR results are inaccurate?

  • Make sure the image is clear and has a high enough resolution
  • Try selecting the language manually instead of auto-detect
  • Complex layouts may still require manual correction

What if PDF formatting looks messy after extraction?

Some PDFs use complicated layouts, so you may need to make small manual adjustments in the editor after extraction.

