Transform any website into a structured, AI-ready knowledge base.
List of URLs to start crawling from. The crawler will follow links within the same domain.
Maximum number of pages to crawl. Set to 0 for unlimited.
Maximum depth of links to follow from start URLs.
Target size for each content chunk in tokens. Recommended: 500-1000 for optimal RAG performance.
Number of overlapping tokens between consecutive chunks to maintain context.
Number of hypothetical questions to generate for each chunk.
LLM provider for generating hypothetical questions. Use 'native' for free rule-based generation (no API key needed), or 'openai'/'anthropic' for higher quality AI-generated questions.
Specific model to use (ignored for native provider). For OpenAI: gpt-4o-mini (cheap), gpt-4o (better). For Anthropic: claude-3-haiku-20240307 (cheap), claude-3-5-sonnet-20241022 (better).
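As a rough illustration of how the settings above fit together, the following sketch collects them in one Python object and checks the constraints the descriptions imply. The class name, field names, and default values are all assumptions made for this example; the tool's real input schema may use different names and defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrawlConfig:
    """Hypothetical container for the settings described above (names are illustrative)."""

    start_urls: List[str]
    max_pages: int = 100            # assumed default; 0 means unlimited
    max_depth: int = 3              # assumed default
    chunk_size: int = 800           # tokens; 500-1000 is the recommended range
    chunk_overlap: int = 100        # tokens shared by consecutive chunks
    questions_per_chunk: int = 3    # assumed default
    llm_provider: str = "native"    # 'native', 'openai', or 'anthropic'
    llm_model: str = ""             # ignored for the native provider

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means the config is usable."""
        problems = []
        if not self.start_urls:
            problems.append("at least one start URL is required")
        if self.max_pages < 0:
            problems.append("max_pages must be 0 (unlimited) or positive")
        if self.max_depth < 0:
            problems.append("max_depth must not be negative")
        if self.chunk_size <= 0:
            problems.append("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            problems.append("chunk_overlap must be smaller than chunk_size")
        if self.questions_per_chunk < 0:
            problems.append("questions_per_chunk must not be negative")
        if self.llm_provider not in {"native", "openai", "anthropic"}:
            problems.append("llm_provider must be native, openai, or anthropic")
        return problems


config = CrawlConfig(
    start_urls=["https://example.com"],
    llm_provider="openai",
    llm_model="gpt-4o-mini",
)
print(config.validate())
```

Requiring the overlap to be smaller than the chunk size is a design choice of this sketch: without it, consecutive chunks could never advance through the text.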
This tool crawls specified websites, intelligently chunks their content, and generates hypothetical questions for each section, creating a structured knowledge graph optimized for Retrieval Augmented Generation (RAG) and AI-driven insights.
### Build AI-Ready Knowledge Bases from Any Website

To build effective AI applications, especially those using Retrieval Augmented Generation (RAG), you need structured, relevant data. This tool solves the challenge of turning raw website content into a finely tuned knowledge base. It acts as a specialized web crawler and content processor, designed to prepare data specifically for AI consumption.

#### How it Works: From Crawl to Knowledge Graph

Start by providing a list of URLs, and the tool will systematically crawl those pages and follow links within the same domain. You have granular control over the crawl: define the maximum number of pages and the depth of links to follow. This ensures you gather exactly the scope of information you need without over-collecting.

Once content is extracted, the tool intelligently breaks it down into 'chunks': manageable segments optimized for RAG performance. You set the target token size for these chunks and define how much overlap there should be between consecutive chunks to maintain context across segments. This careful chunking is crucial for ensuring your AI can retrieve precise and relevant information later.

#### Intelligent Question Generation for Enhanced Retrieval

A key feature is the generation of 'hypothetical questions' for each content chunk. These questions act as metadata, helping your AI system understand the core topics and potential queries a user might have about that specific piece of information. You can choose to generate these questions using a free, rule-based 'native' method, or opt for higher-quality, AI-generated questions by integrating with OpenAI or Anthropic (requiring your API key).

#### Customization and Control

The tool offers extensive customization to tailor your knowledge graph:

* **URL Filtering:** Include or exclude specific URL patterns using glob patterns to focus your crawl.
* **Content Exclusion:** Use CSS selectors to ignore unwanted elements like navigation, footers, or advertisements, ensuring only relevant content is processed.
* **Metadata Inclusion:** Optionally include page titles, descriptions, and other metadata in your output for richer context.
* **Proxy Support:** Configure proxy settings for complex crawling scenarios.

#### Output for AI Applications

The output is a structured dataset containing the original content chunks, the generated hypothetical questions, and comprehensive metadata for each chunk and page. This data is immediately ready to be ingested into vector databases, powering chatbots, AI assistants, search engines, and other RAG-based applications with accurate, context-rich information sourced directly from your chosen websites.
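The chunk size and overlap settings described above can be illustrated with a minimal sliding-window sketch. It is not the tool's actual implementation: the function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real model tokens, which a production chunker would count with a proper tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace splitting is a stand-in for real tokenization (an assumption of this sketch).
words = "a b c d e f g h i j".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 1, each chunk begins with the last token of the previous one, which is how overlap carries context across chunk boundaries.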
Provide one or more start URLs from the websites you wish to crawl and extract content from.
Adjust the crawling parameters, such as the maximum number of pages to crawl and the link depth to follow, and define content chunk sizes and overlap.
Select your preferred LLM provider (native, OpenAI, or Anthropic) for generating hypothetical questions, and enter your API key if using an external service.
Optionally, specify URL patterns to include or exclude, and add CSS selectors for elements you want to remove from the extracted content.
Run the tool, and it will begin crawling the specified websites, extracting content, and processing it into optimized chunks.
Receive a structured dataset containing content chunks, hypothetical questions, and comprehensive page metadata, ready for your AI applications.
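The optional filtering step above (same-domain crawling plus include and exclude patterns) might look roughly like the sketch below. The function `should_crawl` is hypothetical, the same-domain test is simplified to an exact host match, and Python's `fnmatchcase` is used as a stand-in for whatever glob dialect the tool actually supports, so pattern semantics may differ in practice.

```python
from fnmatch import fnmatchcase
from typing import Iterable
from urllib.parse import urlparse


def should_crawl(
    url: str,
    start_url: str,
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> bool:
    """Decide whether a discovered link should be crawled (illustrative rules only)."""
    # Simplified same-domain rule: the host must match the start URL's host exactly.
    if urlparse(url).netloc.lower() != urlparse(start_url).netloc.lower():
        return False
    # Exclude patterns win over include patterns in this sketch.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    include = list(include)
    # With no include patterns, every same-domain URL that is not excluded is allowed.
    return not include or any(fnmatchcase(url, pattern) for pattern in include)


start = "https://example.com/"
docs_only = ["https://example.com/docs/*"]
print(should_crawl("https://example.com/docs/intro", start, include=docs_only))
print(should_crawl("https://example.com/blog/post", start, include=docs_only))
```

Letting exclude patterns take precedence is an assumption of this sketch; it keeps a broad include pattern from accidentally re-admitting pages you meant to skip.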