SEP Scraper — Knowledge Source Pipeline

Role in pipeline: Concept backbone. Peer-reviewed entries, hand-curated related_entries, and the highest signal-per-page of any source in the stack. A single SEP entry on "Cognitive Science" yields 27 section headings and 31 curated cross-references.

Why the Stanford Encyclopedia of Philosophy?

An LLM knowledge pipeline needs a concept layer — a source of authoritative, peer-reviewed definitions that anchor the meaning of terms across disciplines. SEP is that source for anything touching philosophy, philosophy of mind, epistemology, and the theoretical foundations of cognitive science.

What makes SEP uniquely valuable is not just the quality of its articles, but its explicit, editor-curated cross-reference graph. Each article has a <div id="related-entries"> section listing the entries the editors consider most relevant. Unlike Wikipedia's noisy inline links (834 from a single page), SEP's related entries are sparse and intentional — typically 20–40 per article, each one hand-picked. This means you get a high-signal citation graph for free.

Architecture

plato.py

  WebScraper          — session-based HTTP with polite UA + timeout
    .scrape()           — fetch + parse to BeautifulSoup
    .get_initial_paragraph() — extracts <div id="preamble"> first <p>
    .get_headings()     — all h1-h6 in document order
    .get_content_under_headings() — {heading: next_sibling_text}
    .get_related_entries()  — parses <div id="related-entries"> links

  SepFetcher         — fetch by name or by index slice from contents.html
    .fetch_sep_article_named(slug)  — single-article ingest
    .fetch_sep_articles(n)          — batch from alphabetical index

  DBreader           — query layer over sep_articles/metadata.json
    .search(keyword)    — scans titles + paragraphs + section content
    .get_related_entries(ref) — look up cross-references for any cached article
    .find_article(name) — exact-match lookup

read_plato.py

  ArticleAnalyzer    — NLTK post-processor
    .get_lexicon_and_bigrams()  — tokenize cached content, build frequency profiles
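
As a rough illustration of what the lexicon and bigram step computes, here is a minimal sketch that uses only the standard library. It is not the code in read_plato.py: a regex tokenizer and collections.Counter stand in for NLTK's tokenizer and frequency utilities, and the function name lexicon_and_bigrams is hypothetical.

```python
import re
from collections import Counter


def lexicon_and_bigrams(text):
    """Return (word frequencies, adjacent-word-pair frequencies) for text."""
    # Lowercase and keep alphabetic runs only; a crude stand-in for an NLTK tokenizer.
    tokens = re.findall(r"[a-z]+", text.lower())
    lexicon = Counter(tokens)
    # zip(tokens, tokens[1:]) pairs each token with its successor.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return lexicon, bigrams


sample = "Cognitive science is the study of mind. Cognitive science is interdisciplinary."
lexicon, bigrams = lexicon_and_bigrams(sample)
print(lexicon["cognitive"])
print(bigrams[("cognitive", "science")])
```

Because this sketch ignores sentence boundaries, it also counts pairs that straddle them, such as ("mind", "cognitive"). A real post-processor would likely split sentences first and filter stop words.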

Key Design Decisions

Explicit citation graph from related_entries

SEP doesn't have a formal API. What saves the scraper is that every SEP article includes a hand-curated <div id="related-entries">. The scraper parses this into a {title: href_slug} dictionary, giving you a directed graph where edges are editorial judgments, not accidental hyperlinks. This is the highest-quality citation signal in the entire stack.
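
To make the graph idea concrete, the sketch below turns several {title: href_slug} dictionaries into a directed adjacency map and answers a simple "who points at this entry?" query. The helper names build_graph and incoming are hypothetical, and the sample data is illustrative rather than scraped output.

```python
from collections import defaultdict


def build_graph(related_by_article):
    """Map each source slug to the set of slugs it lists as related."""
    graph = defaultdict(set)
    for source_slug, related in related_by_article.items():
        for _title, target_slug in related.items():
            # Drop the trailing slash so 'consciousness/' and 'consciousness' match.
            graph[source_slug].add(target_slug.rstrip("/"))
    return dict(graph)


def incoming(graph, slug):
    """Return the sorted list of articles that list `slug` as a related entry."""
    return sorted(src for src, targets in graph.items() if slug in targets)


# Illustrative data only; these are not scraped results.
related_by_article = {
    "cognitive-science": {"Consciousness": "consciousness/", "Connectionism": "connectionism/"},
    "connectionism": {"Cognitive Science": "cognitive-science/"},
    "consciousness": {"Cognitive Science": "cognitive-science/"},
}
graph = build_graph(related_by_article)
print(incoming(graph, "cognitive-science"))
```

Edges are directed: an entry listing another as related does not guarantee the reverse, so incoming and outgoing neighbours can differ.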

Two-class separation: Fetcher + Reader

SepFetcher writes to the cache; DBreader reads from it. This is the only source in the stack with a proper query layer over its own metadata. You can .search("consciousness") and get back matching articles without re-scraping.
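
The read side of this split can be sketched as a plain keyword scan over the cached JSON. This is not DBreader itself: the schema (a list of objects with title, initial_paragraph, and a {heading: text} content dict) and the function name search are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def search(metadata_path, keyword):
    """Return titles of cached articles whose text mentions keyword."""
    articles = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    needle = keyword.lower()
    hits = []
    for article in articles:
        # Assumed schema: title, initial_paragraph, and a {heading: text} content dict.
        haystack = " ".join(
            [article.get("title", ""), article.get("initial_paragraph", "")]
            + list(article.get("content", {}).values())
        )
        if needle in haystack.lower():
            hits.append(article.get("title", ""))
    return hits


# Write a tiny stand-in cache so the example runs without scraping.
cache = [
    {
        "title": "Cognitive Science",
        "initial_paragraph": "Cognitive science is the interdisciplinary study of mind and intelligence.",
        "content": {"1. History": "Placeholder section text."},
    },
    {"title": "Example Entry", "initial_paragraph": "Placeholder lede.", "content": {}},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.json"
    path.write_text(json.dumps(cache), encoding="utf-8")
    results = search(path, "intelligence")
    no_results = search(path, "thermodynamics")
print(results)
```

Nothing in the search path touches the network, which is the point of the split: the cache can be queried repeatedly after a single fetch.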

Initial paragraph extraction

SEP articles have a distinct <div id="preamble"> that contains the lede — usually the single best one-paragraph definition of the concept. The scraper extracts this separately from the full section content, making it ideal for LLM context windows where brevity matters.
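
The extraction rule (first <p> inside the preamble div) can be sketched without any third-party dependency. The version below swaps in the standard library's html.parser for BeautifulSoup, so it is not the scraper's own get_initial_paragraph(); the class and function names are hypothetical, and the sample HTML is a hand-written stand-in for a real page.

```python
from html.parser import HTMLParser


class PreambleParser(HTMLParser):
    """Collect the text of the first <p> inside <div id="preamble">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # greater than 0 while inside the preamble div
        self.in_paragraph = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "div":
            # Start counting at the preamble div and track any nested divs.
            if self.div_depth or dict(attrs).get("id") == "preamble":
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            self.done = True  # only the first paragraph is wanted
        elif tag == "div" and self.div_depth:
            self.div_depth -= 1

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data)


def first_preamble_paragraph(html):
    parser = PreambleParser()
    parser.feed(html)
    # Collapse runs of whitespace left over from the markup.
    return " ".join("".join(parser.chunks).split())


html = """
<div id="preamble">
  <p>Cognitive science is the <em>interdisciplinary</em> study of mind and intelligence.</p>
  <p>A second paragraph that should be ignored.</p>
</div>
<div id="main-text"><p>1. History</p></div>
"""
lede = first_preamble_paragraph(html)
print(lede)
```

With BeautifulSoup already in the stack, the same rule is likely a two-call lookup (find the div by id, then its first p); the parser class above only exists to keep the example self-contained.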

Code: Extracting the Related-Entries Graph

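import re

def get_related_entries(self, soup):
    """Return {title: slug} for links inside <div id="related-entries">."""
    related_entries = {}
    related_entries_div = soup.find('div', id='related-entries')
    if related_entries_div:
        for link in related_entries_div.find_all('a'):
            title = link.text.strip()
            href = link.get('href')
            # Skip anchors without an href and the section's own heading link.
            # (Both conditions must hold; with `or`, a None href reaches re.sub.)
            if href is not None and title != 'Related Entries':
                # Strip leading non-letters such as '../' from the relative href.
                clean_href = re.sub(r'^[^A-Za-z]+', '', href)
                related_entries[title] = clean_href
    return related_entries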
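
To see what the href-cleaning regex in this function does on its own, here is a small worked example. The sample hrefs are assumptions about how related-entry links might be written, not values taken from a real page.

```python
import re

LEADING_NON_LETTERS = re.compile(r"^[^A-Za-z]+")

# Hypothetical hrefs: parent-relative, root-relative, and already clean.
samples = ["../consciousness/", "/entries/connectionism/", "mental-imagery/"]
cleaned = [LEADING_NON_LETTERS.sub("", href) for href in samples]
for href, slug in zip(samples, cleaned):
    print(href, "->", slug)
```

The pattern only trims leading characters, so trailing slashes and any path prefix such as entries/ survive, and a slug that begins with a digit would lose its leading digits. Code that compares slugs may want to normalize further.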

Live Output: Cognitive Science Entry

Single fetch of /entries/cognitive-science/:

=== SEP fetch: cognitive-science ===
Title: Cognitive Science
Intro (691 chars): Cognitive science is the interdisciplinary study of mind and intelligence, embracing philosophy, psychology, artificial intelligence, neuroscience, linguistics, and anthropology. Its intellectual origins are in the mid-1950s when researchers in several fields began to develop theories of mind based on complex representations and computational procedures...
Total headings: 27
First 5 headings:
  - Cognitive Science
  - 1. History
  - 2. Methods
  - 3. Representation and Computation
  - 4. Theoretical Approaches

Related entries (31 curated cross-references)

Each related entry is one curated cross-reference, an edge in the SEP citation graph. Compare this with the 834 raw hyperlinks on a single Wikipedia page: fewer edges, but every one is an editorial judgment.

At a Glance

Metric	Value
Section headings per article	27 in the sampled Cognitive Science entry
Related entries per article	31 in the same entry (hand-curated)
Storage backend	JSON (sep_articles/metadata.json)
Graph type	Explicit editor-curated cross-references
Query layer	DBreader.search() — keyword search over cached content
Unique extraction	Preamble paragraph (best one-paragraph definition)
Post-processing	NLTK lexicon + bigram analysis (read_plato.py)
