One API key unlocks all tools — scraping, PDF extraction, RAG chunking, vector injection, and more. No credit card required.
Dev Docs
All endpoints require your API key in the request header:
X-API-Key: YOUR_KEY
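As a minimal sketch of attaching that header, the snippet below prepares an authenticated request using only Python's standard library. The helper name `build_request` is hypothetical, and falling back to the `SCRAPEDATSHI_API_KEY` environment variable simply mirrors what the SDK is described as reading; nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import os
import urllib.request
from typing import Optional


def build_request(url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Return a Request carrying the X-API-Key header (nothing is sent yet)."""
    # Fall back to the environment variable when no key is passed explicitly.
    key = api_key or os.environ.get("SCRAPEDATSHI_API_KEY", "")
    if not key:
        raise ValueError("No API key supplied")
    # urllib stores header names capitalized as "X-api-key"; HTTP header
    # names are case-insensitive, so the server sees the same header.
    return urllib.request.Request(url, headers={"X-API-Key": key})
```

Calling `urllib.request.urlopen(build_request(url))` would then perform the actual HTTP call.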
Scrape any URL and receive clean Markdown — stripped of ads, nav bars, and boilerplate.
Endpoint
GET https://www.scrapedatshi.com/scrape?url=TARGET_URL
Optional Parameters
Response Shape
{
"authenticated_user": "your_name",
"url": "https://example.com/",
"selector": null,
"selectors_found": ["article", "main", "#post-content", ".entry-body"],
"metadata": {
"title": "Page Title",
"description": "Page description...",
"author": "Author Name",
"published_date": "2026-06-09",
"site_name": "Example Site"
},
"markdown": "# Page Title\n\nContent..."
}
selectors_found — a ranked list of CSS selectors detected on the page that are likely to
contain main content. Use these to make a second targeted request and reduce noise in your LLM context.
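The two-step flow described above could be sketched as follows. This is an illustration under assumptions: the query parameter is assumed to be named `selector` (mirroring the response field, since the parameter table is not reproduced here), and both helper names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

SCRAPE_ENDPOINT = "https://www.scrapedatshi.com/scrape"


def scrape_url(target: str, selector: Optional[str] = None) -> str:
    """Build the GET URL for the scrape endpoint, percent-encoding parameters."""
    params = {"url": target}
    if selector:
        # Assumption: the parameter is called "selector", like the response field.
        params["selector"] = selector
    return f"{SCRAPE_ENDPOINT}?{urlencode(params)}"


def targeted_scrape_url(first_response: dict) -> str:
    """Use the top-ranked entry of selectors_found for a follow-up request."""
    selectors = first_response.get("selectors_found") or []
    top = selectors[0] if selectors else None
    return scrape_url(first_response["url"], top)
```

Because the list is ranked, taking the first entry is a reasonable default; a caller who knows the site may prefer a later, more specific selector.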
The Python SDK wraps the /v1/* pipeline endpoints. The Web Scraper is a public REST tool, so
call the endpoint above directly over HTTP, or use client.pipeline.chunk_url() to scrape and
chunk in one call.
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
# chunk_url() scrapes + chunks in one call (all tiers)
result = client.pipeline.chunk_url("https://www.example.com/")
print(f"Got {result.total_chunks} chunks from {result.source}")
Extract all text from a PDF — by URL or file upload. Returns plain text with optional heading detection.
Endpoint
POST https://www.scrapedatshi.com/api/pdf/text
Request Body (multipart/form-data)
Response Shape
{
"authenticated_user": "your_name",
"source": "https://example.com/document.pdf",
"preserve_headings": false,
"text": "Full extracted text content..."
}
The Python SDK wraps the /v1/* pipeline endpoints. For PDF text extraction, call the REST
endpoint above directly. To chunk a PDF file with the SDK, use client.pipeline.chunk_file().
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
# chunk_file() parses + chunks a local PDF (all tiers)
result = client.pipeline.chunk_file("./report.pdf")
print(f"Got {result.total_chunks} chunks from {result.source}")
Scrape any URL and receive the content pre-split into RAG-optimized chunks — ready to insert directly into Pinecone, Chroma, Qdrant, or any vector database. Tables and code blocks are never split mid-structure, preserving relational context for accurate LLM retrieval.
Endpoint
POST https://www.scrapedatshi.com/v1/rag-chunk
Send a JSON body with Content-Type: application/json.
Request Body (application/json)
Response Shape
{
"authenticated_user": "your_name",
"url": "https://example.com/article",
"selector": null,
"chunk_size_target": 512,
"overlap_tokens": 50,
"contextual_retrieval": false,
"parent_context": null,
"chunk_count": 7,
"total_tokens_estimated": 3421,
"metadata": { ... },
"chunks": [
{"index": 0, "token_estimate": 487, "text": "Location: Article Title > Section\n\nFirst paragraph..."},
...
]
}
Smart guardrails: Tables and code fences are kept as single atomic units — never split
across chunk boundaries. Each chunk is prefixed with a heading breadcrumb
(Location: Title > Section) so the embedding model knows exactly where in the document
the chunk lives.
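When storing chunks, it can be handy to keep the breadcrumb as separate metadata. The sketch below splits the `Location:` line from the body, assuming the breadcrumb is always the first line and that headings never contain a `>` character; the function name is hypothetical, and chunks produced with contextual retrieval enabled may carry an additional prefix that this does not handle.

```python
from typing import List, Tuple

PREFIX = "Location: "


def split_breadcrumb(chunk_text: str) -> Tuple[List[str], str]:
    """Split a chunk into (breadcrumb parts, body); ([], text) if no breadcrumb."""
    first_line, _, rest = chunk_text.partition("\n")
    if not first_line.startswith(PREFIX):
        return [], chunk_text
    # "Title > Section" becomes ["Title", "Section"].
    parts = [p.strip() for p in first_line[len(PREFIX):].split(">")]
    return parts, rest.lstrip("\n")
```

The breadcrumb parts can then be written to the vector database as filterable metadata while the full chunk text, prefix included, is what gets embedded.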
RAG 2.0 (Contextual Retrieval): Set contextual_retrieval: true and supply your
LLM credentials to generate a one-sentence summary of the whole document, which is prepended
to every chunk as Document Summary: .... This follows the contextual retrieval approach described
by Anthropic, which reported a 35–49% reduction in failed retrievals in its own tests, and it gives each chunk a global context anchor alongside its local breadcrumb.
import os  # needed only if you enable the llm_api_key line below
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient() # reads SCRAPEDATSHI_API_KEY from env
result = client.pipeline.chunk_url(
    "https://example.com/article",
    # Optional: contextual retrieval (Basic tier+)
    # contextual_retrieval=True,
    # llm_provider="openai",
    # llm_api_key=os.getenv("OPENAI_API_KEY"),
)
print(f"Chunks: {result.total_chunks} | Source: {result.source}")
for chunk in result.chunks:
    print(f" [{chunk.token_estimate} tokens] {chunk.content[:100]}...")
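For callers not using the SDK, the JSON request might be prepared as in the sketch below. The request field names are assumptions: they are taken to match the response fields (`url`, `selector`, `chunk_size_target`, `overlap_tokens`, `contextual_retrieval`), because the request-body table is not reproduced here. The helper name is hypothetical.

```python
import json
import urllib.request

RAG_CHUNK_ENDPOINT = "https://www.scrapedatshi.com/v1/rag-chunk"


def build_rag_chunk_request(api_key: str, url: str, **options) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST to the rag-chunk endpoint."""
    # Assumption: request fields share their names with the response fields.
    body = {"url": url, **options}
    return urllib.request.Request(
        RAG_CHUNK_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` and decoding the body with `json.load` would yield the response shape shown earlier.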
Automatically discover and scrape an entire domain — returning a structured dataset of every page's content ready for vector databases. One API call replaces hundreds of individual scrape requests.
Sitemap mode: reads the domain's sitemap.xml to discover URLs. Structured, predictable, and
respects robots.txt. Best for documentation sites and structured content.
Deep Spider mode: follows links recursively from the root URL. Works on any site, even those without a sitemap.
Use include_pattern to keep it focused.
Endpoint
POST https://www.scrapedatshi.com/v1/crawl
Send a JSON body with Content-Type: application/json.
Request Body (application/json)
Sitemap Mode (Basic tier+)
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
# Sitemap mode — reads sitemap.xml, structured and predictable
result = client.pipeline.crawl(
    "https://docs.example.com",
    max_pages=10,
)
print(f"Crawled {result.pages_crawled} pages → {result.total_chunks} chunks")
for chunk in result.chunks:
    print(f" {chunk.content[:80]}...")
Deep Spider Mode (Pro/Enterprise — follows links, no sitemap needed)
# Spider mode — follows links recursively, works on any site
# Use include_pattern to keep it focused on the right section
result = client.pipeline.crawl(
    "https://example.com",
    max_pages=20,
)
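Before spending crawl budget, it may help to preview locally which URLs a pattern would keep. The sketch below assumes `include_pattern` behaves like a regular expression searched against each URL; the matching rules are not specified above, so treat this as an approximation. The function name is hypothetical.

```python
import re
from typing import Iterable, List


def preview_include_pattern(urls: Iterable[str], include_pattern: str) -> List[str]:
    """Return the URLs a regex-style include_pattern would keep."""
    # Assumption: the pattern is a regex matched anywhere in the URL.
    pattern = re.compile(include_pattern)
    return [u for u in urls if pattern.search(u)]
```

Running it over a handful of known URLs from the target site gives a quick sanity check that the pattern is neither too broad nor too narrow.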
Scrape any URL and extract structured data matching your exact JSON schema — powered by your own LLM. No more brittle CSS selectors that break when a site redesigns. Define the fields you want in plain English and let the LLM do the parsing.
Supports OpenAI, Anthropic, and Google Gemini. You bring your own API key — we handle the scraping and prompt engineering.
Endpoint
POST https://www.scrapedatshi.com/v1/extract
Send a JSON body with Content-Type: application/json.
Request Body (application/json)
Response Shape
{
"authenticated_user": "your_name",
"url": "https://example.com/product",
"selector": null,
"llm_provider": "openai",
"llm_model": "gpt-4o-mini",
"schema_fields": ["title", "price", "in_stock", "description"],
"extracted": {
"title": "Widget Pro 3000",
"price": 49.99,
"in_stock": true,
"description": "The most advanced widget on the market."
}
}
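Since the extraction is done by an LLM, it is worth checking that every requested field actually came back. A minimal sketch, based only on the response shape above (the function name is hypothetical):

```python
from typing import List


def missing_fields(response: dict) -> List[str]:
    """List schema fields that are absent or null in the extracted object."""
    extracted = response.get("extracted") or {}
    # "is None" keeps legitimate falsy values such as False or 0.
    return [f for f in response.get("schema_fields", []) if extracted.get(f) is None]
```

An empty list means all fields were populated; anything else is a candidate for a retry or a manual review.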
The Python SDK wraps the /v1/* pipeline endpoints. The Schema Extractor uses your own LLM
key and is not wrapped by the SDK, so call the REST endpoint above directly.
# Schema Extractor is a raw REST endpoint; call it with an HTTP client such as requests.
# For scraping + chunking without LLM extraction, use the SDK:
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
result = client.pipeline.chunk_url("https://example.com/product/widget-pro")
print(f"Got {result.total_chunks} chunks")
Complete the entire RAG ingestion pipeline in a single API call. scrapedatshi scrapes the URL, chunks the content, generates vector embeddings via your embedding provider, and upserts directly into your vector database — zero additional code required.
Embedding: OpenAI or Cohere. Vector DB: Pinecone, Qdrant, or ChromaDB. You bring your own keys. Requires Pro or Enterprise tier.
Endpoint
POST https://www.scrapedatshi.com/v1/sync
Send a JSON body with Content-Type: application/json.
Response Shape
{
"authenticated_user": "your_name",
"url": "https://docs.example.com/",
"selector": "article",
"chunks_created": 47,
"vectors_upserted": 47,
"total_tokens_estimated": 24100,
"embedding_provider": "openai",
"embedding_model": "text-embedding-3-small",
"vector_db_provider": "pinecone",
"metadata": { "title": "...", "author": "...", ... }
}
import os
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
result = client.pipeline.sync(
    url="https://docs.example.com/getting-started",
    embedding_provider="openai",
    embedding_api_key=os.getenv("OPENAI_API_KEY"),
    vector_db="pinecone",
    vector_db_api_key=os.getenv("PINECONE_API_KEY"),
    index_name="my-docs",
)
print(f"Upserted {result.vectors_upserted} vectors ({result.total_tokens} tokens)")
print(f"Embedding: {result.embedding_provider} | Vector DB: {result.vector_db_provider}")
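A simple post-sync sanity check can be built from the raw response shape shown earlier. The sketch assumes one vector is expected per chunk, as in the sample response (47 and 47); the function name is hypothetical.

```python
def sync_is_complete(response: dict) -> bool:
    """True when at least one chunk was created and all were upserted."""
    created = response.get("chunks_created", 0)
    upserted = response.get("vectors_upserted", 0)
    # Assumption: a complete sync upserts exactly one vector per chunk.
    return created > 0 and created == upserted
```

A `False` result would suggest a partial upsert or an empty page, either of which is worth logging before relying on the index.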
Extract all tables from a PDF as structured JSON. Each table is an array of rows; each row is an array of cell strings.
Endpoint
POST https://www.scrapedatshi.com/api/pdf/tables
Request Body (multipart/form-data)
Response Shape
{
"authenticated_user": "your_name",
"source": "https://example.com/report.pdf",
"table_count": 2,
"tables": [
{
"page": 1,
"table_index": 1,
"rows": [
["Header 1", "Header 2", "Header 3"],
["Row 1 A", "Row 1 B", "Row 1 C"],
["Row 2 A", "Row 2 B", "Row 2 C"]
]
}
]
}
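The row arrays can be turned into keyed records for easier downstream use. The sketch below assumes the first row of each table is its header, as in the sample above; that will not hold for every PDF, so check the output before trusting it. The function name is hypothetical.

```python
from typing import Dict, List


def table_to_records(table: dict) -> List[Dict[str, str]]:
    """Treat the first row as the header and map each later row to a dict."""
    rows = table.get("rows") or []
    if not rows:
        return []
    header, *body = rows
    # zip() stops at the shorter sequence, so ragged rows lose extra cells.
    return [dict(zip(header, row)) for row in body]
```

For example, `[table_to_records(t) for t in response["tables"]]` would yield one list of records per table.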
The Python SDK wraps the /v1/* pipeline endpoints. For PDF table extraction, call the REST
endpoint above directly.
# PDF Tables is a raw REST endpoint; call it with an HTTP client such as requests.
# To chunk a PDF file with the SDK instead:
from scrapedatshi import ScrapedatshiClient
client = ScrapedatshiClient()
result = client.pipeline.chunk_file("./report.pdf")
print(f"Got {result.total_chunks} chunks")