scrapedatshi

One API key unlocks all tools — scraping, PDF extraction, RAG chunking, vector injection, and more. No credit card required.

🐍 Python SDK: pip install scrapedatshi
Typed models · Sync + Async · IDE autocomplete

Dev Docs

All endpoints require your API key in the request header: X-API-Key: YOUR_KEY

Scrape any URL and receive clean Markdown — stripped of ads, nav bars, and boilerplate.

Endpoint

GET https://www.scrapedatshi.com/scrape?url=TARGET_URL

Optional Parameters

Parameter  Type    Description
selector   string  A CSS selector to target a specific element (e.g. article, main, #content). Reduces noise and saves LLM tokens.
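As a minimal sketch, the request described above can be assembled with the Python standard library alone. The helper name build_scrape_request is hypothetical; the endpoint URL, the url and selector query parameters, and the X-API-Key header are taken from this page. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request

SCRAPE_ENDPOINT = "https://www.scrapedatshi.com/scrape"


def build_scrape_request(target_url, api_key, selector=None):
    """Build (but do not send) a GET request for the scrape endpoint.

    Hypothetical helper: only the endpoint, parameter names, and header
    name come from the docs.
    """
    params = {"url": target_url}
    if selector:
        # Optional CSS selector, e.g. "article" or "#content".
        params["selector"] = selector
    # urlencode percent-encodes the target URL so it survives as a query value.
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{SCRAPE_ENDPOINT}?{query}",
        headers={"X-API-Key": api_key},
        method="GET",
    )


req = build_scrape_request("https://example.com/", "YOUR_KEY", selector="article")
print(req.full_url)
```

Passing req to urllib.request.urlopen would send it; the JSON response can then be read with json.load.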

Response Shape

{
  "authenticated_user": "your_name",
  "url": "https://example.com/",
  "selector": null,
  "selectors_found": ["article", "main", "#post-content", ".entry-body"],
  "metadata": {
    "title": "Page Title",
    "description": "Page description...",
    "author": "Author Name",
    "published_date": "2026-06-09",
    "site_name": "Example Site"
  },
  "markdown": "# Page Title\n\nContent..."
}

selectors_found — a ranked list of CSS selectors detected on the page that are likely to contain main content. Use these to make a second targeted request and reduce noise in your LLM context.
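The two-pass pattern described above might look like the following sketch. It works on an already-parsed response dictionary and assumes, as stated above, that selectors_found is ordered best-first; the helper name pick_selector is hypothetical.

```python
def pick_selector(scrape_response):
    """Return the top-ranked selector from a scrape response, or None.

    Assumes selectors_found is ranked best-first, as the docs describe.
    """
    found = scrape_response.get("selectors_found") or []
    return found[0] if found else None


# Trimmed copy of the response shape shown above.
first_pass = {
    "selector": None,
    "selectors_found": ["article", "main", "#post-content", ".entry-body"],
}

selector = pick_selector(first_pass)
print(selector)  # pass this as the selector parameter of the second request
```

If the list is empty, the function returns None and the first-pass Markdown is the best available result.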


The Python SDK wraps the /v1/* pipeline endpoints. The Web Scraper is a public REST tool, so call it with any HTTP client, or use client.pipeline.chunk_url() to scrape and chunk in one call.

from scrapedatshi import ScrapedatshiClient

client = ScrapedatshiClient()

# chunk_url() scrapes + chunks in one call (all tiers)
result = client.pipeline.chunk_url("https://www.example.com/")
print(f"Got {result.total_chunks} chunks from {result.source}")

Extract all text from a PDF — by URL or file upload. Returns plain text with optional heading detection.

Endpoint

POST https://www.scrapedatshi.com/api/pdf/text

Request Body (multipart/form-data)

Field              Type    Description
url                string  URL of the PDF to fetch. Use either url or pdf_file.
pdf_file           file    Upload a PDF file directly (max 20 MB).
preserve_headings  bool    If true, detects larger text as headings and prefixes them with #. Default: false.

Response Shape

{
  "authenticated_user": "your_name",
  "source": "https://example.com/document.pdf",
  "preserve_headings": false,
  "text": "Full extracted text content..."
}
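When preserve_headings is true, the # prefixes make it possible to split the returned text into sections. The sketch below assumes each detected heading sits on its own line starting with one or more # characters, which the description implies but does not spell out; split_by_headings is a hypothetical helper.

```python
def split_by_headings(text):
    """Group extracted PDF text into (heading, body) pairs.

    Assumption: headings occupy their own line and start with '#'.
    Text before the first heading is returned with heading None.
    """
    sections = []
    heading, body = None, []
    for line in text.splitlines():
        if line.startswith("#"):
            # A new heading closes the section collected so far.
            if heading is not None or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading is not None or body:
        sections.append((heading, "\n".join(body).strip()))
    return sections


sample = "# Summary\nRevenue grew.\n# Outlook\nStable demand."
for title, content in split_by_headings(sample):
    print(title, "->", content)
```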

The Python SDK wraps the /v1/* pipeline endpoints. For PDF text extraction, call the REST endpoint directly with any HTTP client. To chunk a PDF file with the SDK, use client.pipeline.chunk_file().

from scrapedatshi import ScrapedatshiClient

client = ScrapedatshiClient()

# chunk_file() parses + chunks a local PDF (all tiers)
result = client.pipeline.chunk_file("./report.pdf")
print(f"Got {result.total_chunks} chunks from {result.source}")

Scrape any URL and receive the content pre-split into RAG-optimized chunks — ready to insert directly into Pinecone, Chroma, Qdrant, or any vector database. Tables and code blocks are never split mid-structure, preserving relational context for accurate LLM retrieval.

Endpoint

POST https://www.scrapedatshi.com/v1/rag-chunk

Send a JSON body with Content-Type: application/json.

Request Body (application/json)

Field       Type    Description
url         string  Required. The URL to scrape and chunk. Must include protocol (https://).
selector    string  Optional CSS selector to target a specific element before chunking (e.g. article).
chunk_size  int     Target tokens per chunk. Default: 512. Range: 64–4096.
overlap     int     Token overlap between consecutive chunks. Default: 50. Must be less than chunk_size.
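The constraints in the table can be checked client-side before a request is sent. This is a sketch, not the service's own validation: build_rag_chunk_payload is a hypothetical helper, and accepting http:// as well as https:// and rejecting negative overlap are assumptions.

```python
import json


def build_rag_chunk_payload(url, selector=None, chunk_size=512, overlap=50):
    """Validate the documented constraints and return the JSON body as a dict."""
    # Assumption: http:// is accepted alongside https://.
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must include the protocol, e.g. https://")
    if not 64 <= chunk_size <= 4096:
        raise ValueError("chunk_size must be between 64 and 4096")
    # Assumption: overlap cannot be negative.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and less than chunk_size")
    payload = {"url": url, "chunk_size": chunk_size, "overlap": overlap}
    if selector:
        payload["selector"] = selector
    return payload


payload = build_rag_chunk_payload("https://example.com/article", selector="article")
print(json.dumps(payload))
```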

Response Shape

{
  "authenticated_user": "your_name",
  "url": "https://example.com/article",
  "selector": null,
  "chunk_size_target": 512,
  "overlap_tokens": 50,
  "contextual_retrieval": false,
  "parent_context": null,
  "chunk_count": 7,
  "total_tokens_estimated": 3421,
  "metadata": { ... },
  "chunks": [
    {"index": 0, "token_estimate": 487, "text": "Location: Article Title > Section\n\nFirst paragraph..."},
    ...
  ]
}

Smart guardrails: Tables and code fences are kept as single atomic units — never split across chunk boundaries. Each chunk is prefixed with a heading breadcrumb (Location: Title > Section) so the embedding model knows exactly where in the document the chunk lives.
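Because every chunk starts with a Location: breadcrumb, the heading path can be recovered and stored as vector metadata. The sketch below assumes the breadcrumb is the first line of the chunk text (that is, contextual_retrieval is off) and that headings do not themselves contain ">"; split_breadcrumb is a hypothetical helper.

```python
def split_breadcrumb(chunk_text):
    """Return (breadcrumb_parts, body) for one chunk's text.

    Assumption: the breadcrumb is the first line and uses ' > ' between levels.
    Chunks without a breadcrumb are returned unchanged with an empty path.
    """
    prefix = "Location: "
    first_line, _, rest = chunk_text.partition("\n")
    if not first_line.startswith(prefix):
        return [], chunk_text
    parts = [part.strip() for part in first_line[len(prefix):].split(">")]
    return parts, rest.lstrip("\n")


text = "Location: Article Title > Section\n\nFirst paragraph..."
path, body = split_breadcrumb(text)
print(path)
print(body)
```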

RAG 2.0 (Contextual Retrieval): Set contextual_retrieval: true and supply your LLM credentials to generate a one-sentence global document summary that is prepended to every chunk as Document Summary: .... The approach follows Anthropic's Contextual Retrieval technique, which Anthropic reports reduced failed retrievals by roughly 35–49% in its own tests, so each chunk carries a global context anchor alongside its local breadcrumb.


import os  # only needed if you enable the contextual-retrieval options below

from scrapedatshi import ScrapedatshiClient

client = ScrapedatshiClient()  # reads SCRAPEDATSHI_API_KEY from env

result = client.pipeline.chunk_url(
    "https://example.com/article",
    # Optional: contextual retrieval (Basic tier+)
    # contextual_retrieval=True,
    # llm_provider="openai",
    # llm_api_key=os.getenv("OPENAI_API_KEY"),
)

print(f"Chunks: {result.total_chunks}  |  Source: {result.source}")
for chunk in result.chunks:
    print(f"  [{chunk.token_estimate} tokens] {chunk.content[:100]}...")

Automatically discover and scrape an entire domain — returning a structured dataset of every page's content ready for vector databases. One API call replaces hundreds of individual scrape requests.

🗺️ Sitemap Mode (Basic tier+)

Reads the domain's sitemap.xml to discover URLs. Structured, predictable, and respects robots.txt. Best for documentation sites and structured content.

🕷️ Deep Spider Mode (Pro/Enterprise)

Follows links recursively from the root URL. Works on any site — even those without a sitemap. Use include_pattern to keep it focused.

Endpoint

POST https://www.scrapedatshi.com/v1/crawl

Send a JSON body with Content-Type: application/json.

Request Body (application/json)

Field            Type    Description
url              string  Required. Root URL of the domain to crawl.
include_pattern  string  Only crawl URLs containing this substring (e.g. /docs/). Recommended guardrail.
exclude_pattern  string  Skip URLs containing this substring (e.g. /blog/).
max_pages        int     Max pages to crawl. Capped by your tier limit.
selector         string  Optional CSS selector applied to every page (e.g. article).
chunk            bool    If true, run each page through the RAG chunker. Default: false.
chunk_size       int     Target tokens per chunk (if chunk=true). Default: 512.
overlap          int     Token overlap between chunks (if chunk=true). Default: 50.
Sitemap Mode (Basic tier+)

from scrapedatshi import ScrapedatshiClient

client = ScrapedatshiClient()

# Sitemap mode — reads sitemap.xml, structured and predictable
result = client.pipeline.crawl(
    "https://docs.example.com",
    max_pages=10,
)

print(f"Crawled {result.pages_crawled} pages → {result.total_chunks} chunks")
for chunk in result.chunks:
    print(f"  {chunk.content[:80]}...")

Deep Spider Mode (Pro/Enterprise — follows links, no sitemap needed)

# Spider mode: follows links recursively, works on any site.
# Reuses the client created in the sitemap example above.
result = client.pipeline.crawl(
    "https://example.com",
    max_pages=20,
    # include_pattern="/docs/",  # assumed keyword mirroring the REST field; keeps the crawl focused
)

Scrape any URL and extract structured data matching your exact JSON schema — powered by your own LLM. No more brittle CSS selectors that break when a site redesigns. Define the fields you want in plain English and let the LLM do the parsing.

Supports OpenAI, Anthropic, and Google Gemini. You bring your own API key — we handle the scraping and prompt engineering.

Endpoint

POST https://www.scrapedatshi.com/v1/extract

Send a JSON body with Content-Type: application/json.

Request Body (application/json)

Field         Type    Description
url           string  Required. The URL to scrape and extract from.
schema        object  Required. Dict of field names → description strings. e.g. {"price": "number, price in USD"}.
llm_provider  string  Required. openai, anthropic, or gemini.
llm_api_key   string  Required. Your LLM provider API key.
llm_model     string  Optional model override. Defaults: gpt-4o-mini / claude-3-haiku / gemini-1.5-flash.
selector      string  Optional CSS selector to target a specific element before extraction.

Response Shape

{
  "authenticated_user": "your_name",
  "url": "https://example.com/product",
  "selector": null,
  "llm_provider": "openai",
  "llm_model": "gpt-4o-mini",
  "schema_fields": ["title", "price", "in_stock", "description"],
  "extracted": {
    "title": "Widget Pro 3000",
    "price": 49.99,
    "in_stock": true,
    "description": "The most advanced widget on the market."
  }
}

The Python SDK wraps the /v1/* pipeline endpoints. The Schema Extractor uses your own LLM key and is not wrapped by the SDK, so call the REST endpoint directly with any HTTP client.

# Schema Extractor is a raw REST endpoint: call it with an HTTP client such as requests.

# For scraping + chunking without LLM extraction, use the SDK:
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
result = client.pipeline.chunk_url("https://example.com/product/widget-pro")
print(f"Got {result.total_chunks} chunks")
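Since the Schema Extractor is called over raw REST, a request might be assembled as in the sketch below. It uses only the standard library and builds the request without sending it. The helper name build_extract_request and the example schema descriptions are hypothetical; the endpoint, field names, provider values, and headers come from this page.

```python
import json
import urllib.request

EXTRACT_ENDPOINT = "https://www.scrapedatshi.com/v1/extract"
PROVIDERS = {"openai", "anthropic", "gemini"}


def build_extract_request(api_key, url, schema, llm_provider, llm_api_key,
                          llm_model=None, selector=None):
    """Build (but do not send) a POST request for the schema extractor."""
    if llm_provider not in PROVIDERS:
        raise ValueError(f"llm_provider must be one of {sorted(PROVIDERS)}")
    body = {
        "url": url,
        "schema": schema,  # field name -> plain-English description
        "llm_provider": llm_provider,
        "llm_api_key": llm_api_key,
    }
    if llm_model:
        body["llm_model"] = llm_model
    if selector:
        body["selector"] = selector
    return urllib.request.Request(
        EXTRACT_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


extract_req = build_extract_request(
    api_key="YOUR_KEY",
    url="https://example.com/product",
    schema={
        "title": "string, product name",
        "price": "number, price in USD",
        "in_stock": "boolean, whether it can be ordered now",
    },
    llm_provider="openai",
    llm_api_key="YOUR_OPENAI_KEY",
)
print(sorted(json.loads(extract_req.data)["schema"]))
```

Sending extract_req with urllib.request.urlopen and parsing the body as JSON should yield the response shape above, with the parsed values under extracted.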

Complete the entire RAG ingestion pipeline in a single API call. scrapedatshi scrapes the URL, chunks the content, generates vector embeddings via your embedding provider, and upserts directly into your vector database — zero additional code required.

Embedding: OpenAI or Cohere. Vector DB: Pinecone, Qdrant, or ChromaDB. You bring your own keys. Requires Pro or Enterprise tier.

Endpoint

POST https://www.scrapedatshi.com/v1/sync

Send a JSON body with Content-Type: application/json.

Response Shape

{
  "authenticated_user": "your_name",
  "url": "https://docs.example.com/",
  "selector": "article",
  "chunks_created": 47,
  "vectors_upserted": 47,
  "total_tokens_estimated": 24100,
  "embedding_provider": "openai",
  "embedding_model": "text-embedding-3-small",
  "vector_db_provider": "pinecone",
  "metadata": { "title": "...", "author": "...", ... }
}
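One way to use this response is a quick sanity check after each sync. The sketch below assumes that equal chunks_created and vectors_upserted counts mean every chunk reached the vector database, which the docs do not state explicitly; summarize_sync is a hypothetical helper operating on the parsed JSON.

```python
def summarize_sync(response):
    """Compare chunk and vector counts from a /v1/sync response."""
    created = response["chunks_created"]
    upserted = response["vectors_upserted"]
    average = round(response["total_tokens_estimated"] / created) if created else 0
    return {
        "complete": created == upserted,  # assumption: equal counts mean nothing was dropped
        "missing": created - upserted,
        "avg_tokens_per_chunk": average,
    }


# Values copied from the response shape above.
sync_response = {
    "chunks_created": 47,
    "vectors_upserted": 47,
    "total_tokens_estimated": 24100,
}
print(summarize_sync(sync_response))
```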

import os
from scrapedatshi import ScrapedatshiClient

client = ScrapedatshiClient()

result = client.pipeline.sync(
    url="https://docs.example.com/getting-started",
    embedding_provider="openai",
    embedding_api_key=os.getenv("OPENAI_API_KEY"),
    vector_db="pinecone",
    vector_db_api_key=os.getenv("PINECONE_API_KEY"),
    index_name="my-docs",
)

print(f"Upserted {result.vectors_upserted} vectors ({result.total_tokens} tokens)")
print(f"Embedding: {result.embedding_provider}  |  Vector DB: {result.vector_db_provider}")

Extract all tables from a PDF as structured JSON. Each table is an array of rows; each row is an array of cell strings.

Endpoint

POST https://www.scrapedatshi.com/api/pdf/tables

Request Body (multipart/form-data)

Field     Type    Description
url       string  URL of the PDF to fetch.
pdf_file  file    Upload a PDF file directly (max 20 MB).

Response Shape

{
  "authenticated_user": "your_name",
  "source": "https://example.com/report.pdf",
  "table_count": 2,
  "tables": [
    {
      "page": 1,
      "table_index": 1,
      "rows": [
        ["Header 1", "Header 2", "Header 3"],
        ["Row 1 A",  "Row 1 B",  "Row 1 C"],
        ["Row 2 A",  "Row 2 B",  "Row 2 C"]
      ]
    }
  ]
}
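The rows arrays are often easier to work with as records keyed by column name. The sketch below treats the first row of each table as the header, as in the example response. That is an assumption, since not every PDF table has a header row; table_to_records is a hypothetical helper.

```python
def table_to_records(table):
    """Convert one table from the response into a list of dicts.

    Assumption: the first row holds the column names.
    """
    rows = table["rows"]
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Table copied from the response shape above.
example_table = {
    "page": 1,
    "table_index": 1,
    "rows": [
        ["Header 1", "Header 2", "Header 3"],
        ["Row 1 A", "Row 1 B", "Row 1 C"],
        ["Row 2 A", "Row 2 B", "Row 2 C"],
    ],
}
records = table_to_records(example_table)
print(records[0])
```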

The Python SDK wraps the /v1/* pipeline endpoints. For PDF table extraction, call the REST endpoint directly with any HTTP client.

# PDF Tables is a raw REST endpoint: call it with an HTTP client such as requests.

# To chunk a PDF file with the SDK instead:
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
result = client.pipeline.chunk_file("./report.pdf")
print(f"Got {result.total_chunks} chunks")
