Peer-reviewed concept backbone with hand-curated cross-references
**Role in pipeline:** concept backbone. Peer-reviewed entries, hand-curated `related_entries`, and the highest signal-per-page of any source in the stack — a single SEP entry on "Cognitive Science" yields 27 section headings and 31 curated cross-references.
An LLM knowledge pipeline needs a concept layer — a source of authoritative, peer-reviewed definitions that anchor the meaning of terms across disciplines. SEP is that source for anything touching philosophy, philosophy of mind, epistemology, and the theoretical foundations of cognitive science.
What makes SEP uniquely valuable is not just the quality of its articles, but its explicit, editor-curated cross-reference graph. Each article has a `<div id="related-entries">` section listing the entries the editors consider most relevant. Unlike Wikipedia's noisy inline links (834 from a single page), SEP's related entries are sparse and intentional — typically 20–40 per article, each one hand-picked. This means you get a high-signal citation graph for free.
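Those curated edges lend themselves to simple graph analysis. A minimal sketch of turning per-article related-entry dicts into a directed graph and ranking articles by in-degree — the slugs and titles below are illustrative, not scraped output:

```python
from collections import defaultdict

# Illustrative per-article related-entry dicts ({title: slug}), as the
# scraper would produce them; not real SEP output.
related = {
    "cognitive-science": {"Consciousness": "consciousness",
                          "Functionalism": "functionalism"},
    "consciousness": {"Functionalism": "functionalism"},
}

edges = defaultdict(set)      # slug -> set of cited slugs
in_degree = defaultdict(int)  # slug -> how many articles cite it

for source, targets in related.items():
    for slug in targets.values():
        edges[source].add(slug)
        in_degree[slug] += 1

# Articles cited by many others make good concept-layer anchors.
ranked = sorted(in_degree, key=in_degree.get, reverse=True)
```

In-degree over editor-curated edges is a much cleaner centrality signal than the same count over Wikipedia's raw hyperlinks.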
`plato.py`
- `WebScraper` — session-based HTTP with polite UA + timeout
  - `.scrape()` — fetch + parse to BeautifulSoup
  - `.get_initial_paragraph()` — extracts the first `<p>` from `<div id="preamble">`
  - `.get_headings()` — all `h1`–`h6` in document order
  - `.get_content_under_headings()` — `{heading: next_sibling_text}`
  - `.get_related_entries()` — parses `<div id="related-entries">` links
- `SepFetcher` — fetch by name or by index slice from `contents.html`
  - `.fetch_sep_article_named(slug)` — single-article ingest
  - `.fetch_sep_articles(n)` — batch from the alphabetical index
- `DBreader` — query layer over `sep_articles/metadata.json`
  - `.search(keyword)` — scans titles + paragraphs + section content
  - `.get_related_entries(ref)` — looks up cross-references for any cached article
  - `.find_article(name)` — exact-match lookup
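The `.search(keyword)` scan can be sketched roughly as follows — the schema of `sep_articles/metadata.json` assumed here (top-level slugs mapping to `title`/`preamble`/`sections` keys) is a guess for illustration, not documented behavior:

```python
import json

def search(keyword, path="sep_articles/metadata.json"):
    """Case-insensitive keyword scan over titles, preambles, and section text."""
    with open(path) as f:
        db = json.load(f)
    kw = keyword.lower()
    hits = []
    for slug, article in db.items():
        # Flatten everything searchable into one lowercase haystack.
        haystack = " ".join([
            article.get("title", ""),
            article.get("preamble", ""),
            *article.get("sections", {}).values(),
        ]).lower()
        if kw in haystack:
            hits.append(slug)
    return hits
```

A linear scan is fine at SEP's scale (~1,800 entries); no index is needed.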
`read_plato.py`
- `ArticleAnalyzer` — NLTK post-processor
  - `.get_lexicon_and_bigrams()` — tokenizes cached content, builds frequency profiles
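A stdlib stand-in for the `.get_lexicon_and_bigrams()` step — the real version uses NLTK's tokenizer; the regex tokenization here is a simplification for illustration:

```python
import re
from collections import Counter

def lexicon_and_bigrams(text):
    # Lowercase and tokenize with a regex (stand-in for nltk.word_tokenize),
    # then count unigram and bigram frequencies.
    tokens = re.findall(r"[a-z']+", text.lower())
    lexicon = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return lexicon, bigrams
```

Frequency profiles like these make it easy to spot an article's characteristic vocabulary before handing it to an LLM.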
SEP doesn't have a formal API. What saves the scraper is that every SEP article includes a hand-curated `<div id="related-entries">`. The scraper parses this into a `{title: href_slug}` dictionary, giving you a directed graph where edges are editorial judgments, not accidental hyperlinks. This is the highest-quality citation signal in the entire stack.
`SepFetcher` writes to the cache; `DBreader` reads from it. This is the only source in the stack with a proper query layer over its own metadata. You can `.search("consciousness")` and get back matching articles without re-scraping.
SEP articles have a distinct `<div id="preamble">` that contains the lede — usually the single best one-paragraph definition of the concept. The scraper extracts this separately from the full section content, making it ideal for LLM context windows where brevity matters.
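That preamble pull can be sketched as follows, assuming the `get_initial_paragraph` signature from the module outline above:

```python
from bs4 import BeautifulSoup

def get_initial_paragraph(soup):
    """Return the text of the first <p> inside <div id="preamble">, or None."""
    preamble = soup.find("div", id="preamble")
    if preamble is None:
        return None
    p = preamble.find("p")
    return p.get_text(strip=True) if p else None
```

One paragraph per article keeps the concept definitions small enough to pack many of them into a single context window.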
```python
import re

def get_related_entries(self, soup):
    """Parse <div id="related-entries"> into a {title: href_slug} dict."""
    related_entries = {}
    related_entries_div = soup.find('div', id='related-entries')
    if related_entries_div:
        for link in related_entries_div.find_all('a'):
            title = link.text.strip()
            href = link.get('href')
            # Skip the section's own "Related Entries" anchor and bare links.
            # (Original used `or` here, which crashes on href=None; `and` is correct.)
            if href is not None and title != 'Related Entries':
                # Strip leading relative-path characters ("../") down to the slug.
                clean_href = re.sub(r'^[^A-Za-z]+', '', href)
                related_entries[title] = clean_href
    return related_entries
```
Single fetch of `/entries/cognitive-science/`:
Each `<a>` tag is one curated cross-reference — an edge in the SEP citation graph. Compare this to Wikipedia's 834 raw hyperlinks: fewer edges, but every one is an editorial judgment.
| Metric | Value |
|---|---|
| Section headings per article | ~27 ("Cognitive Science") |
| Related entries per article | ~31 (hand-curated) |
| Storage backend | JSON (`sep_articles/metadata.json`) |
| Graph type | Explicit editor-curated cross-references |
| Query layer | `DBreader.search()` — keyword search over cached content |
| Unique extraction | Preamble paragraph (best one-paragraph definition) |
| Post-processing | NLTK lexicon + bigram analysis (`read_plato.py`) |