Atom API metadata ingest with category-driven queries and PDF download
Role in pipeline: primary-source verification. Direct access to recent research PDFs in CS, AI, math, and neuroscience. When a concept page in the knowledge base makes an empirical claim, the arXiv pipeline provides the paper that backs it up.
Wikipedia gives you breadth. SEP gives you peer-reviewed definitions. But when someone asks "what's the latest evidence for predictive processing in visual cortex?" or "which RAG architectures actually improve factual grounding?" — you need primary literature. arXiv is the open-access preprint server where CS/AI, math, physics, and quantitative biology researchers publish before (and sometimes instead of) journal submission.
arXiv has a unique constraint: its Atom API exposes metadata only — no references, no citation graph. The paper's abstract, authors, categories, and PDF link are available, but its bibliography is locked inside the PDF. This means the arXiv scraper is structurally different from the Wikipedia and SEP scrapers: it's a metadata-and-document fetcher, not a graph crawler. Citation-graph reconstruction requires an external bridge (Semantic Scholar) or full PDF reference extraction.
Two implementations targeting different entry patterns:
arxiv.py (category-driven)
ArxivFetcher
.arxiv_categories — 35 CS subcategories mapped to arXiv codes
.fetch_arxiv_articles(category_key, n)
1. Build Atom API URL: cat:{code}&sortBy=submittedDate
2. Parse feed via feedparser
3. For each entry: extract arxiv_id, title, date, summary
4. Download PDF via application/pdf link
5. Dedup against metadata.json
6. Persist metadata + PDF to arxiv_articles/
merged_arxiv_union.ipynb cell 24 (search-query-driven)
1. Build URL: all:{free-text query}&max_results=N
2. Parse + normalize to pandas DataFrame
3. Filter by category list (e.g. cs.CL, cs.LG)
4. Download first PDF + extract first 2 pages via PyPDF2
5. Save CSV + JSON + text preview to arxiv_out/
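The notebook's normalize-and-filter steps (2–3) can be sketched with plain dicts; the field layout follows feedparser's Atom entries, but the helper names are illustrative, not the notebook's actual ones:

```python
def normalize_entry(entry: dict) -> dict:
    # Flatten one Atom feed entry into the flat record the notebook
    # loads into a pandas DataFrame.
    return {
        "arxiv_id": entry["id"].rsplit("/", 1)[-1],
        "title": " ".join(entry["title"].split()),  # collapse feed newlines
        "published": entry.get("published", ""),
        "categories": [t["term"] for t in entry.get("tags", [])],
        "summary": entry.get("summary", "").strip(),
    }

def filter_by_categories(records: list, allowed: list) -> list:
    # Post-filter: keep records tagged with at least one allowed category.
    allowed_set = set(allowed)
    return [r for r in records if allowed_set & set(r["categories"])]
```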
arxiv.py answers "what's new in cs.AI?" — sorted by submission date, category-filtered. The notebook answers "find papers about retrieval-augmented generation" — free-text search, post-filtered by category list. Both patterns are needed: browse for monitoring a field, search for answering a specific question.
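The two entry patterns differ only in how the Atom `search_query` is built. A minimal sketch of the URL construction against the public export.arxiv.org endpoint (the function names are illustrative):

```python
from urllib.parse import quote_plus

ARXIV_API = "http://export.arxiv.org/api/query"

def category_query_url(code: str, n: int = 3) -> str:
    # Browse pattern: newest submissions in one category, e.g. cs.AI.
    return (f"{ARXIV_API}?search_query=cat:{code}"
            f"&sortBy=submittedDate&sortOrder=descending"
            f"&start=0&max_results={n}")

def search_query_url(text: str, n: int = 10) -> str:
    # Search pattern: free-text match across all metadata fields.
    return f"{ARXIV_API}?search_query=all:{quote_plus(text)}&max_results={n}"
```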
The arxiv_categories dictionary maps human-readable names to arXiv codes, covering the full CS taxonomy. This makes category-driven queries accessible without memorizing codes like cs.CL or cs.NE.
The notebook goes one step further than arxiv.py: after downloading the PDF, it extracts the first 1–2 pages as plain text via PyPDF2. This gives an immediate text preview without requiring a full PDF-to-text pipeline. Enough for an LLM to read the abstract + introduction.
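A sketch of that preview step. The page-slicing helper is kept pure so it can be exercised without a real PDF; `extract_preview` assumes PyPDF2's `PdfReader` API, and everything else here is illustrative rather than the notebook's exact code:

```python
def preview_text(pages, max_pages: int = 2) -> str:
    # Join extracted text from the first pages only: enough for an LLM
    # to read the abstract + introduction without full PDF-to-text.
    return "\n".join((p.extract_text() or "") for p in list(pages)[:max_pages])

def extract_preview(pdf_path: str, max_pages: int = 2) -> str:
    from PyPDF2 import PdfReader  # deferred import: only needed for real PDFs
    return preview_text(PdfReader(pdf_path).pages, max_pages)
```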
All requests use a shared CogSciWikiBot/0.1 user-agent string with explicit timeouts, matching the same politeness standards as the Wikipedia and SEP scrapers.
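What that politeness convention looks like with stdlib urllib, as a sketch (the helper name and 15-second timeout are illustrative; the user-agent string is the one the scrapers share):

```python
import urllib.request

USER_AGENT = "CogSciWikiBot/0.1"

def polite_get(url: str, timeout: float = 15.0) -> bytes:
    # Every request carries the shared bot user-agent and an explicit
    # timeout, matching the Wikipedia and SEP scrapers.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```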
Showing 20 of 35 category entries here; the full index is in the source code.
```python
def fetch_arxiv_articles(self, category_key, num_articles=3):
    category = self.arxiv_categories.get(category_key, None)
    if not category:
        print(f"Invalid category: {category_key}")
        return
    downloaded_ids = self._load_downloaded_articles()
    new_articles = []
    base_url = "http://export.arxiv.org/api"
    query = (
        f"/query?search_query=cat:{category}"
        f"&sortBy=submittedDate&sortOrder=descending"
        f"&start=0&max_results={num_articles}"
    )
    feed = feedparser.parse(base_url + query)
    for entry in feed.entries:
        arxiv_id = entry.id.split('/')[-1]
        if arxiv_id in downloaded_ids:
            continue
        # ... download PDF, extract metadata, persist ...
```
Search query: `all:retrieval-augmented generation`, 3 results, sorted by date:
Papers published yesterday (April 8, 2026). The pipeline fetches real-time preprints.
arXiv's Atom API exposes no references, no citation graph, no inbound counts. This is the fundamental structural difference from Wikipedia (834 outlinks per page) and SEP (31 curated cross-references). To reconstruct the citation network, the pipeline needs an external bridge.
Planned approach: Semantic Scholar's free API provides references and citations endpoints for any paper, plus an influentialCitationCount field that approximates whether a paper is cited substantively (in the body) vs. perfunctorily (in the reference list). This gives ~80% of the signal of full PDF reference extraction at ~5% of the implementation cost.
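A sketch of what that bridge could look like against Semantic Scholar's Graph API (the endpoint shape and `fields` parameter follow their public docs, but the helper names, chosen fields, and timeout are assumptions, since this is planned rather than implemented):

```python
import json
import urllib.request

S2_API = "https://api.semanticscholar.org/graph/v1/paper"

def s2_references_url(arxiv_id: str, fields=("title", "externalIds")) -> str:
    # Semantic Scholar accepts "arXiv:<id>" identifiers directly, so no
    # DOI lookup is needed to bridge from the arXiv metadata store.
    return f"{S2_API}/arXiv:{arxiv_id}/references?fields={','.join(fields)}"

def fetch_references(arxiv_id: str, timeout: float = 10.0) -> list:
    # Returns the reference list the arXiv Atom API cannot provide.
    req = urllib.request.Request(
        s2_references_url(arxiv_id),
        headers={"User-Agent": "CogSciWikiBot/0.1"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp).get("data", [])
```

The same paper endpoint exposes a `citations` relation and the `influentialCitationCount` field mentioned above, so inbound and outbound edges come from one API.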
Once wired in, this enables citation-distance visualization — plotting how the reachable paper set grows exponentially with hop distance, and why bounding traversal depth is critical for tractable knowledge ingest.
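The growth argument can be illustrated with a toy breadth-first traversal over a citation adjacency map: each hop multiplies the frontier by roughly the average out-degree, which is why unbounded traversal is intractable (a sketch, not pipeline code):

```python
def nodes_by_hop(graph: dict, root: str, max_depth: int) -> list:
    # BFS over {paper_id: [cited_paper_ids]}; returns how many *new*
    # papers appear at each hop distance from the root.
    seen, frontier = {root}, [root]
    counts = [1]
    for _ in range(max_depth):
        nxt = []
        for node in frontier:
            for ref in graph.get(node, []):
                if ref not in seen:
                    seen.add(ref)
                    nxt.append(ref)
        counts.append(len(nxt))
        frontier = nxt
    return counts
```

With an average out-degree of 2 and no shared references, the counts double each hop; real citation graphs have out-degrees in the tens, so bounding depth is the only way to keep ingest tractable.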
| Metric | Value |
|---|---|
| API | arXiv Atom (export.arxiv.org/api) |
| Entry patterns | Category browse + free-text search |
| Category coverage | 35 CS subcategories (extensible to math, physics, q-bio) |
| Storage backend | JSON metadata + downloaded PDFs |
| Text extraction | First 1–2 PDF pages via PyPDF2 |
| Graph availability | None in API (Semantic Scholar bridge planned) |
| Dedup strategy | arxiv_id set from metadata.json |
| | Wikipedia | SEP | arXiv |
|---|---|---|---|
| Role | Breadth + entities | Concept backbone | Primary-source verification |
| Graph signal | Dense (834 links/page) | Sparse, curated (31/page) | None in API |
| Content type | Section-aware HTML | Section-aware HTML | Metadata + PDF |
| Validation | 6-point soup check | Basic | HTTP status |
| Storage | SQLite | JSON | JSON + PDFs |
| Unique strength | Canonical keys + fake scraper | Query layer + preamble | 35-category index + PDF text |