
Wikipedia Scraper

Bounded BFS crawl with canonical keys, SQLite graph store, and a 6-point soup validator

Role in pipeline: breadth + entity coverage. Dense inline hyperlink graph, canonical page keys, high recall. Wikipedia gives you 834 outlinks from a single seed article — the widest expansion of any source in the stack.

Why Wikipedia for LLM Knowledge Ingest?

RAG pipelines need a breadth layer — a source that covers entities (people, theories, labs, models) at scale, with structured hyperlinks between them. Wikipedia is ideal: every article is a node in a massive directed graph, every inline /wiki/ link is an edge, and the content is structured enough to extract section-keyed text automatically.

The challenge is taming the graph. A single seed article like Cognitive science links to 834 other articles. Two hops out, the reachable set grows exponentially. The scraper stack solves this with bounded BFS (configurable depth and max-connections-per-hop), canonical key normalization (so Cognitive_science, cognitive science, and https://en.wikipedia.org/wiki/Cognitive_science#History all resolve to the same node), and a SQLite store that deduplicates across sessions.
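To make the blow-up concrete, here is a quick geometric-series bound on reachable pages; the cap of 8 matches the crawl's default max_connections:

```python
def max_visited(b: int, d: int) -> int:
    """Upper bound on pages reachable by a BFS with branching factor b and depth d."""
    return sum(b**k for k in range(d + 1))

# Worst case two hops out from an 834-link seed, with no cap:
print(max_visited(834, 2))  # 1 + 834 + 834^2 = 696,391 pages
# With max_connections=8, the same crawl touches at most:
print(max_visited(8, 2))    # 1 + 8 + 64 = 73 pages
```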

Architecture

Three files, three maturity levels, each solving a different sub-problem:

sitehopper3_basic.py (the grown-up scraper)
  canonicalize_page_key() — normalizes URLs, spaces, fragments into one key
  SimpleWikiStore         — SQLite with pages + edges tables, upsert semantics
  WikipediaScraper        — live fetcher, strips noisy elements, extracts <p> text
  FakeWikipediaScraper    — deterministic offline mode (SHA-256 seeded, no network)
  crawl()                 — bounded BFS: depth + max_connections + PRNG sampling

wikiFetcher.py (the validator)
  WebScraper           — session-based with polite UA + timeout + referer
  validate_wikipedia_soup() — 6-point quality gate:
    1. HTTP status == 200
    2. No <table class="noarticletext">
    3. h1#firstHeading exists
    4. div.mw-parser-output exists
    5. At least one paragraph > 40 chars
    6. Not a disambiguation page (category check)
  WikipediaFetcher     — title-list fetcher with section-aware extraction

wikiSearch.py (the search entry point)
  WikiPage             — in-memory page graph with 1-hop link expansion
  WikipediaFetcher     — MediaWiki API search with content-type validation
            
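The search entry point talks to the standard MediaWiki action=query/list=search API. A minimal sketch of the request construction and the content-type gate — the exact parameter set the real WikipediaFetcher sends is an assumption:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_search_url(query: str, limit: int = 10) -> str:
    """Build a MediaWiki full-text search URL (action=query, list=search)."""
    params = {"action": "query", "list": "search",
              "srsearch": query, "srlimit": limit, "format": "json"}
    return f"{API_ENDPOINT}?{urlencode(params)}"

def check_content_type(headers: dict) -> bool:
    """Reject non-JSON responses before parsing the body."""
    return headers.get("Content-Type", "").split(";")[0].strip() == "application/json"
```

Validating the content type before json-decoding catches the common failure mode where a proxy or error page returns HTML with a 200 status.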

Key Design Decisions

Canonical page keys

The function canonicalize_page_key() solves the identity problem at ingest time rather than query time. URLs, display titles, and fragments all map to a single canonical string. This means the SQLite edges table never has duplicate nodes under different names.
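A minimal sketch of the normalization, matching the behavior shown in the live output below (URL path extracted, fragment dropped, spaces collapsed to underscores, case preserved) — the real canonicalize_page_key() may differ in edge cases:

```python
from urllib.parse import unquote, urlparse

def canonicalize_page_key(raw: str) -> str:
    """Map a URL, display title, or fragment-bearing link to one canonical key."""
    text = raw.strip()
    # URL form: keep only the path segment after /wiki/
    if text.startswith(("http://", "https://")):
        path = unquote(urlparse(text).path)
        if "/wiki/" in path:
            text = path.split("/wiki/", 1)[1]
    # Drop section fragments like #History on bare titles
    text = text.split("#", 1)[0]
    # Collapse whitespace runs into single underscores; case is preserved
    return "_".join(text.split())
```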

Graph persistence in SQLite

Unlike flat JSON metadata, the pages/edges schema gives you a real directed graph with bidirectional queries, neighbor lookups with limits, and upsert semantics that survive across sessions. Placeholder pages are created for unseen outlinks so the graph stays connected even before those pages are fetched.
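A sketch of the schema and upsert semantics described above; the table and column names are assumptions, not the exact SimpleWikiStore layout:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    key     TEXT PRIMARY KEY,
    content TEXT DEFAULT '',        -- empty for placeholder pages
    fetched INTEGER DEFAULT 0       -- 0 = placeholder, 1 = fetched
);
CREATE TABLE IF NOT EXISTS edges (
    src TEXT NOT NULL,
    dst TEXT NOT NULL,
    PRIMARY KEY (src, dst)          -- duplicate edges collapse on insert
);
"""

def upsert_page(conn, key, content=None):
    """Insert or refresh a page; placeholders never clobber fetched text."""
    if content is None:
        conn.execute("INSERT OR IGNORE INTO pages(key) VALUES (?)", (key,))
    else:
        conn.execute(
            "INSERT INTO pages(key, content, fetched) VALUES (?, ?, 1) "
            "ON CONFLICT(key) DO UPDATE SET content=excluded.content, fetched=1",
            (key, content),
        )

def add_edge(conn, src, dst):
    upsert_page(conn, dst)  # placeholder node keeps the graph connected
    conn.execute("INSERT OR IGNORE INTO edges(src, dst) VALUES (?, ?)", (src, dst))
```

The `INSERT OR IGNORE` / `ON CONFLICT` pair is what makes re-crawls cheap: re-seeing a page or edge is a no-op, while a fresh fetch upgrades a placeholder in place.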

Real + fake scraper pair

The FakeWikipediaScraper uses SHA-256 hashing of the page key to produce deterministic content and outlinks — same inputs always produce same outputs, with zero network calls. This means crawl behavior can be tested and benchmarked offline, which is critical for developing the BFS parameters without hammering Wikipedia.
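The idea can be sketched in a few lines — hash the key once, then derive both content and outlinks from the digest. Illustrative only; the real FakeWikipediaScraper's output format will differ:

```python
import hashlib

class FakeScraperSketch:
    """Deterministic offline scraper: SHA-256 of the page key seeds everything."""

    def fetch(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        content = f"Fake article for {key} (fingerprint {digest[:12]})"
        # Stable pseudo-outlinks derived from fixed digest slices
        outlinks = [f"Page_{digest[i:i + 4]}" for i in range(0, 24, 4)]
        return content, outlinks
```

Because the digest depends only on the key, every test run sees the same graph, so BFS depth and sampling parameters can be tuned and benchmarked with zero network traffic.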

6-point soup validation

Validation rejects bad content before it enters the cache: disambiguation pages, deleted articles, empty stubs, and non-article pages are caught by six progressive checks. Each failure returns a reason string, making debugging transparent.
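The gate can be approximated with stdlib-only checks on the raw HTML; the real validate_wikipedia_soup() applies the same six checks via BeautifulSoup selectors, so treat this as a sketch:

```python
import re

def validate_wikipedia_html(status_code: int, html: str):
    """Six progressive checks; returns (ok, reason) so every rejection is explainable."""
    if status_code != 200:
        return False, f"bad HTTP status {status_code}"
    if 'class="noarticletext"' in html:
        return False, "no-article placeholder"
    if 'id="firstHeading"' not in html:
        return False, "missing firstHeading"
    if 'class="mw-parser-output"' not in html:
        return False, "missing parser output"
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, re.S)
    if not any(len(re.sub(r"<[^>]+>", "", p).strip()) > 40 for p in paragraphs):
        return False, "no paragraph over 40 chars"
    if "Disambiguation_pages" in html:  # category-link heuristic
        return False, "disambiguation page"
    return True, "ok"
```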

Code: Bounded BFS Crawl

import random

def crawl(start_page, store, scraper, depth=2, max_connections=8, seed=0):
    """Bounded BFS crawl: expand depth hops, sampling at most
    max_connections outlinks per page with a seeded PRNG."""
    rng = random.Random(seed)
    start = canonicalize_page_key(start_page)
    visited: set[str] = set()
    frontier = [start]
    fetched = 0

    for _ in range(depth):
        next_frontier: list[str] = []
        for page in frontier:
            if page in visited:
                continue
            visited.add(page)
            _, outlinks, _, from_cache = load_or_fetch_page(page, store, scraper)
            if not from_cache:
                fetched += 1
            if len(outlinks) > max_connections:
                outlinks = rng.sample(outlinks, k=max_connections)
            next_frontier.extend(outlinks)
        frontier = next_frontier

    return {"start": start, "depth": depth,
            "visited_pages": len(visited), "fresh_fetches": fetched}

Live Output: Cognitive Science Seed

Single fetch of Cognitive_science via WikipediaScraper:

=== canonicalization ===
'Cognitive_science' -> Cognitive_science
'cognitive science' -> cognitive_science
'https://en.wikipedia.org/wiki/Cognitive_science#History' -> Cognitive_science

=== live fetch: Cognitive_science ===
HTTP status:    200
Content chars:  33,266
Outlinks found: 834

First 250 chars:
Cognitive science is the interdisciplinary, scientific study of the mind and
its processes. It examines the nature, the tasks, and the functions of
cognition (in a broad sense)...

Sample outlinks:
- Philosophy of mind
- Linguistics
- Neuroscience
- Artificial intelligence
- Anthropology
- Psychology
- Computer science
- Cognitive Science (journal)

Notice: the outlinks are essentially the cognitive science pillar list — philosophy of mind, linguistics, neuroscience, AI, anthropology, psychology, computer science. This is why Wikipedia is the natural breadth layer for a cognitive-science knowledge pipeline.

At a Glance

Metric                               Value
Content per page                     ~33K chars (Cognitive Science)
Outlinks per page                    ~834 (varies by article)
Effective branching factor (capped)  8 (configurable max_connections)
Storage backend                      SQLite (pages + edges tables)
Dedup strategy                       Canonical key normalization + upsert
Validation                           6-point soup check (strictest in the stack)
Offline mode                         FakeWikipediaScraper (deterministic, zero network)