Bounded BFS crawl with canonical keys, SQLite graph store, and a 6-point soup validator
Role in pipeline: breadth + entity coverage. A dense inline hyperlink graph, canonical page keys, and high recall. Wikipedia gives you 834 outlinks from a single article, the widest expansion of any source in the stack.
RAG pipelines need a breadth layer — a source that covers entities (people, theories, labs, models) at scale, with structured hyperlinks between them. Wikipedia is ideal: every article is a node in a massive directed graph, every inline /wiki/ link is an edge, and the content is structured enough to extract section-keyed text automatically.
The challenge is taming the graph. A single seed article like Cognitive science links to 834 other articles. Two hops out, the reachable set grows exponentially. The scraper stack solves this with bounded BFS (configurable depth and max-connections-per-hop), canonical key normalization (so Cognitive_science, cognitive science, and https://en.wikipedia.org/wiki/Cognitive_science#History all resolve to the same node), and a SQLite store that deduplicates across sessions.
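The arithmetic behind those two knobs, using this stack's numbers (834 raw outlinks, capped at 8 per hop), shows why the cap matters:

```python
# Reachable nodes per hop with the raw branching factor vs. the cap.
raw = [834 ** k for k in range(3)]       # hops 0, 1, 2 uncapped
capped = [8 ** k for k in range(3)]      # hops 0, 1, 2 with max_connections=8

print(sum(raw))      # 696391 pages reachable within two hops, uncapped
print(sum(capped))   # 73 pages with the cap
```

Two hops uncapped already exceeds half a million candidate pages; the cap keeps a depth-2 crawl under a hundred.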
Three files, three maturity levels, each solving a different sub-problem:
sitehopper3_basic.py (the grown-up scraper)
canonicalize_page_key() — normalizes URLs, spaces, and fragments into one key
SimpleWikiStore — SQLite with pages + edges tables, upsert semantics
WikipediaScraper — live fetcher, strips noisy elements, extracts <p> text
FakeWikipediaScraper — deterministic offline mode (SHA-256 seeded, no network)
crawl() — bounded BFS: depth + max_connections + PRNG sampling
wikiFetcher.py (the validator)
WebScraper — session-based with polite UA + timeout + referer
validate_wikipedia_soup() — 6-point quality gate:
1. HTTP status == 200
2. No <table class="noarticletext">
3. h1#firstHeading exists
4. div.mw-parser-output exists
5. At least one paragraph > 40 chars
6. Not a disambiguation page (category check)
WikipediaFetcher — title-list fetcher with section-aware extraction
wikiSearch.py (the search entry point)
WikiPage — in-memory page graph with 1-hop link expansion
WikipediaFetcher — MediaWiki API search with content-type validation
The function canonicalize_page_key() solves the identity problem at ingest time rather than query time. URLs, display titles, and fragments all map to a single canonical string. This means the SQLite edges table never has duplicate nodes under different names.
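The exact normalization rules aren't reproduced here, but a minimal sketch of the idea might look like this (the precise edge cases handled in sitehopper3_basic.py may differ):

```python
from urllib.parse import urlparse, unquote

def canonicalize_page_key(raw: str) -> str:
    """Reduce URLs, display titles, and fragments to one canonical key (sketch)."""
    s = raw.strip()
    # Full URL: keep only the /wiki/<Title> path segment.
    if s.startswith(("http://", "https://")):
        path = urlparse(s).path          # urlparse separates out the #fragment
        if path.startswith("/wiki/"):
            s = path[len("/wiki/"):]
    # Drop any remaining fragment, decode percent-escapes.
    s = unquote(s.split("#", 1)[0])
    # Spaces and underscores are equivalent in MediaWiki titles.
    s = s.replace(" ", "_")
    # MediaWiki capitalizes the first character of a title.
    return s[:1].upper() + s[1:]
```

With this, `Cognitive_science`, `cognitive science`, and `https://en.wikipedia.org/wiki/Cognitive_science#History` all collapse to the same key.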
Unlike flat JSON metadata, the pages/edges schema gives you a real directed graph with bidirectional queries, neighbor lookups with limits, and upsert semantics that survive across sessions. Placeholder pages are created for unseen outlinks so the graph stays connected even before those pages are fetched.
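A sketch of what such a schema and its upsert semantics can look like; the actual column names and SQL in SimpleWikiStore are assumptions:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    key     TEXT PRIMARY KEY,
    content TEXT,
    fetched INTEGER DEFAULT 0     -- 0 = placeholder, 1 = body stored
);
CREATE TABLE IF NOT EXISTS edges (
    src TEXT NOT NULL,
    dst TEXT NOT NULL,
    PRIMARY KEY (src, dst)
);
"""

def upsert_page(conn, key, content=None):
    # ON CONFLICT keeps re-crawls idempotent; a placeholder insert
    # never overwrites a page whose body was already fetched.
    conn.execute(
        "INSERT INTO pages (key, content, fetched) VALUES (?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET content = excluded.content, "
        "fetched = excluded.fetched WHERE excluded.fetched = 1",
        (key, content, 1 if content is not None else 0),
    )

def add_edge(conn, src, dst):
    # Placeholder row for the target keeps the graph connected pre-fetch.
    conn.execute("INSERT OR IGNORE INTO pages (key) VALUES (?)", (dst,))
    conn.execute("INSERT OR IGNORE INTO edges (src, dst) VALUES (?, ?)", (src, dst))
```

The composite primary key on `edges` is what makes re-crawling the same page a no-op rather than a source of duplicate rows.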
The FakeWikipediaScraper uses SHA-256 hashing of the page key to produce deterministic content and outlinks — same inputs always produce same outputs, with zero network calls. This means crawl behavior can be tested and benchmarked offline, which is critical for developing the BFS parameters without hammering Wikipedia.
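A minimal sketch of the idea; the `fake_fetch` name and `Topic_` link scheme are hypothetical, not the real FakeWikipediaScraper API:

```python
import hashlib

def fake_fetch(page_key: str, n_outlinks: int = 8):
    """Deterministic offline stand-in for a live fetch."""
    digest = hashlib.sha256(page_key.encode("utf-8")).hexdigest()
    # Content derived from the key: identical runs yield identical text.
    content = f"Synthetic article for {page_key} ({digest[:12]})"
    # Outlinks seeded from digest slices: same key -> same neighbors.
    outlinks = [f"Topic_{digest[i * 2:i * 2 + 4]}" for i in range(n_outlinks)]
    return content, outlinks
```

Because the digest is a pure function of the page key, two crawls with the same seed and parameters traverse exactly the same graph, which is what makes BFS tuning reproducible.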
Validation rejects bad content before it enters the cache: disambiguation pages, deleted articles, empty stubs, and non-article pages are caught by six progressive checks. Each failure returns a reason string, making debugging transparent.
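The real validate_wikipedia_soup() works on a parsed BeautifulSoup tree; a simplified string-based sketch of the same six checks, with assumed marker strings:

```python
import re

def validate_wikipedia_html(status_code: int, html: str) -> tuple[bool, str]:
    """Approximate the 6-point gate with plain string checks (sketch)."""
    if status_code != 200:                                   # 1. HTTP status
        return False, f"bad status {status_code}"
    if 'class="noarticletext"' in html:                      # 2. deleted/missing article
        return False, "missing article (noarticletext)"
    if 'id="firstHeading"' not in html:                      # 3. title heading present
        return False, "no h1#firstHeading"
    if 'class="mw-parser-output"' not in html:               # 4. body container present
        return False, "no parser output div"
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, flags=re.S)
    if not any(len(re.sub(r"<[^>]+>", "", p).strip()) > 40   # 5. substantial paragraph
               for p in paragraphs):
        return False, "no substantial paragraph"
    if "Disambiguation_pages" in html:                       # 6. category check
        return False, "disambiguation page"
    return True, "ok"
```

Each failure path returns its own reason string, which is what makes rejected pages easy to diagnose in logs.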
```python
import random

def crawl(start_page, store, scraper, depth=2, max_connections=8, seed=0):
    """Simple bounded BFS crawl."""
    rng = random.Random(seed)
    start = canonicalize_page_key(start_page)
    visited: set[str] = set()
    frontier = [start]
    fetched = 0
    for _ in range(depth):
        next_frontier: list[str] = []
        for page in frontier:
            if page in visited:
                continue
            visited.add(page)
            _, outlinks, _, from_cache = load_or_fetch_page(page, store, scraper)
            if not from_cache:
                fetched += 1
            if len(outlinks) > max_connections:
                outlinks = rng.sample(outlinks, k=max_connections)
            next_frontier.extend(outlinks)
        frontier = next_frontier
    return {
        "start": start,
        "depth": depth,
        "visited_pages": len(visited),
        "fresh_fetches": fetched,
    }
```
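To see the bound concretely, here is a self-contained toy run; `bounded_bfs` and `fake_outlinks` are hypothetical stand-ins for crawl() and FakeWikipediaScraper, without the store plumbing:

```python
import hashlib
import random

def fake_outlinks(page, n=20):
    # Deterministic neighbor list derived from a SHA-256 digest (stub).
    d = hashlib.sha256(page.encode()).hexdigest()
    return [f"Topic_{d[i * 2:i * 2 + 4]}" for i in range(n)]

def bounded_bfs(start, depth=2, max_connections=8, seed=0):
    # Same shape as crawl(), minus the store/scraper plumbing.
    rng = random.Random(seed)
    visited, frontier = set(), [start]
    for _ in range(depth):
        nxt = []
        for page in frontier:
            if page in visited:
                continue
            visited.add(page)
            links = fake_outlinks(page)
            if len(links) > max_connections:
                links = rng.sample(links, k=max_connections)
            nxt.extend(links)
        frontier = nxt
    return visited

# With depth=2 and max_connections=8, at most 1 + 8 = 9 pages are
# visited regardless of how many outlinks each page really has.
pages = bounded_bfs("Cognitive_science")
```

The fixed PRNG seed means the sampled subset of outlinks is identical on every run, so crawl results are reproducible, not just bounded.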
A single fetch of Cognitive_science via WikipediaScraper returns ~33K characters of article text and ~834 outlinks.
Notice that the outlinks read like the cognitive-science pillar list: philosophy of mind, linguistics, neuroscience, AI, anthropology, psychology, computer science. This is why Wikipedia is the natural breadth layer for a cognitive-science knowledge pipeline.
| Metric | Value |
|---|---|
| Content per page | ~33K chars (Cognitive Science) |
| Outlinks per page | ~834 (varies by article) |
| Effective branching factor (capped) | 8 (configurable max_connections) |
| Storage backend | SQLite (pages + edges tables) |
| Dedup strategy | Canonical key normalization + upsert |
| Validation | 6-point soup check (strictest in the stack) |
| Offline mode | FakeWikipediaScraper (deterministic, zero network) |