
arXiv Scraper

Atom API metadata ingest with category-driven queries and PDF download

Role in pipeline: primary-source verification. Direct access to recent research PDFs in CS, AI, math, and neuroscience. When a concept page in the knowledge base makes an empirical claim, the arXiv pipeline provides the paper that backs it up.

Why arXiv for LLM Knowledge Ingest?

Wikipedia gives you breadth. SEP gives you peer-reviewed definitions. But when someone asks "what's the latest evidence for predictive processing in visual cortex?" or "which RAG architectures actually improve factual grounding?" — you need primary literature. arXiv is the open-access preprint server where CS/AI, math, physics, and quantitative biology researchers publish before (and sometimes instead of) journal submission.

arXiv has a unique constraint: its Atom API exposes metadata only — no references, no citation graph. The paper's abstract, authors, categories, and PDF link are available, but its bibliography is locked inside the PDF. This means the arXiv scraper is structurally different from the Wikipedia and SEP scrapers: it's a metadata-and-document fetcher, not a graph crawler. Citation-graph reconstruction requires an external bridge (Semantic Scholar) or full PDF reference extraction.
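To make the metadata-only constraint concrete, here is a minimal sketch that pulls everything the Atom API exposes out of one entry, using only the standard library (the real scraper uses feedparser; the sample entry below is illustrative, not a real paper):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Illustrative sample of the Atom XML shape export.arxiv.org returns.
SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>Example Title</title>
    <summary>Example abstract.</summary>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1"
          type="application/pdf"/>
  </entry>
</feed>"""

def entry_metadata(feed_xml: str) -> dict:
    """Extract the only fields the API carries: no references, no citations."""
    entry = ET.fromstring(feed_xml).find(f"{ATOM}entry")
    pdf_url = next(link.get("href")
                   for link in entry.findall(f"{ATOM}link")
                   if link.get("type") == "application/pdf")
    return {
        "arxiv_id": entry.find(f"{ATOM}id").text.split("/")[-1],
        "title": entry.find(f"{ATOM}title").text,
        "summary": entry.find(f"{ATOM}summary").text,
        "pdf_url": pdf_url,
    }
```

Everything past these four fields — the bibliography in particular — lives only inside the PDF.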

Architecture

Two implementations targeting different entry patterns:

arxiv.py (category-driven)

  ArxivFetcher
    .arxiv_categories   — 35 CS subcategories mapped to arXiv codes
    .fetch_arxiv_articles(category_key, n)
      1. Build Atom API URL:  cat:{code}&sortBy=submittedDate
      2. Parse feed via feedparser
      3. For each entry: extract arxiv_id, title, date, summary
      4. Download PDF via application/pdf link
      5. Dedup against metadata.json
      6. Persist metadata + PDF to arxiv_articles/

merged_arxiv_union.ipynb cell 24 (search-query-driven)

  1. Build URL:  all:{free-text query}&max_results=N
  2. Parse + normalize to pandas DataFrame
  3. Filter by category list (e.g. cs.CL, cs.LG)
  4. Download first PDF + extract first 2 pages via PyPDF2
  5. Save CSV + JSON + text preview to arxiv_out/
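Steps 2–3 of the notebook can be sketched without pandas (plain dicts stand in for the notebook's DataFrame; the field names mirror the feed entries described above):

```python
def normalize_entries(entries):
    # Step 2: flatten feed entries into plain records. The notebook builds a
    # pandas DataFrame here; dicts keep this sketch dependency-free.
    return [{
        "arxiv_id": e["id"].split("/")[-1],
        "title": e["title"],
        "categories": e["categories"],
    } for e in entries]

def filter_by_categories(rows, wanted):
    # Step 3: keep rows tagged with at least one wanted category.
    wanted = set(wanted)
    return [r for r in rows if wanted & set(r["categories"])]
```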
            

Key Design Decisions

Two entry patterns: category browse vs. keyword search

arxiv.py answers "what's new in cs.AI?" — sorted by submission date, category-filtered. The notebook answers "find papers about retrieval-augmented generation" — free-text search, post-filtered by category list. Both patterns are needed: browse for monitoring a field, search for answering a specific question.
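Expressed as URL builders (a sketch matching the query strings shown in the architecture section), the two entry patterns differ only in the search_query term:

```python
from urllib.parse import quote_plus

API = "http://export.arxiv.org/api/query"

def browse_url(code: str, n: int) -> str:
    # Category browse: newest submissions in one subcategory (arxiv.py).
    return (f"{API}?search_query=cat:{code}"
            f"&sortBy=submittedDate&sortOrder=descending"
            f"&start=0&max_results={n}")

def search_url(text: str, n: int) -> str:
    # Keyword search: free-text match across all fields (notebook cell 24).
    return f"{API}?search_query=all:{quote_plus(text)}&start=0&max_results={n}"
```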

35 CS subcategory index

The arxiv_categories dictionary maps human-readable names to arXiv codes, covering the full CS taxonomy. This makes category-driven queries accessible without memorizing codes like cs.CL or cs.NE.
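A minimal sketch of the mapping; the four entries below are illustrative, and the exact key spellings in arxiv.py may differ:

```python
# Human-readable names -> arXiv category codes (excerpt, not the full 35).
arxiv_categories = {
    "Artificial Intelligence": "cs.AI",
    "Computation and Language": "cs.CL",
    "Machine Learning": "cs.LG",
    "Neural/Evolutionary Computing": "cs.NE",
}

def code_for(name: str) -> str:
    # Lets callers write code_for("Computation and Language") instead of "cs.CL".
    return arxiv_categories[name]
```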

PDF download + text preview

The notebook goes one step further than arxiv.py: after downloading the PDF, it extracts the first 1–2 pages as plain text via PyPDF2. This gives an immediate text preview without requiring a full PDF-to-text pipeline: enough for an LLM to read the abstract and introduction.
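A sketch of that preview step, assuming PyPDF2 ≥ 3.0 (PdfReader / extract_text; older releases used PdfFileReader / extractText). The character limit is an illustrative choice, not the notebook's:

```python
def pdf_preview(path: str, max_pages: int = 2, max_chars: int = 2000) -> str:
    # Lazy import so the rest of the module works without PyPDF2 installed.
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages[:max_pages])
    return truncate(text, max_chars)

def truncate(text: str, limit: int) -> str:
    # Keep the preview small: abstract + introduction is usually enough.
    return text if len(text) <= limit else text[:limit] + " ..."
```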

Polite identification

All requests use a shared CogSciWikiBot/0.1 user-agent string with explicit timeouts, matching the same politeness standards as the Wikipedia and SEP scrapers.
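A minimal sketch of the shared identification policy using only the standard library (the scrapers themselves may use requests; polite_request is a hypothetical helper):

```python
from urllib.request import Request

# Shared bot identity across the Wikipedia, SEP, and arXiv scrapers.
USER_AGENT = "CogSciWikiBot/0.1"

def polite_request(url: str) -> Request:
    # Attach the bot identity; the caller passes an explicit timeout to
    # urlopen(req, timeout=...) so no request can hang indefinitely.
    return Request(url, headers={"User-Agent": USER_AGENT})
```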

arXiv CS Category Index (35 subcategories)

cs.AI — Artificial Intelligence
cs.CL — Computation and Language
cs.CV — Computer Vision
cs.LG — Machine Learning
cs.NE — Neural/Evolutionary Computing
cs.RO — Robotics
cs.HC — Human-Computer Interaction
cs.IR — Information Retrieval
cs.SE — Software Engineering
cs.LO — Logic in CS
cs.CC — Computational Complexity
cs.DS — Data Structures/Algorithms
cs.CR — Cryptography and Security
cs.DB — Databases
cs.DC — Distributed Computing
cs.PL — Programming Languages
cs.MA — Multiagent Systems
cs.GT — Game Theory
cs.CY — Computers and Society
cs.IT — Information Theory

Showing 20 of 35 — full index in source code.

Code: Category-Driven Fetch

def fetch_arxiv_articles(self, category_key, num_articles=3):
    category = self.arxiv_categories.get(category_key)
    if not category:
        print(f"Invalid category: {category_key}")
        return []

    downloaded_ids = self._load_downloaded_articles()
    new_articles = []

    base_url = "http://export.arxiv.org/api"
    query = (
        f"/query?search_query=cat:{category}"
        f"&sortBy=submittedDate&sortOrder=descending"
        f"&start=0&max_results={num_articles}"
    )
    feed = feedparser.parse(base_url + query)

    for entry in feed.entries:
        # Entry IDs look like http://arxiv.org/abs/2604.07350v1
        arxiv_id = entry.id.split('/')[-1]
        if arxiv_id in downloaded_ids:
            continue  # already fetched on a previous run
        # ... download PDF, extract metadata, persist ...

    return new_articles

Live Output: Recent RAG Papers

Search query: all:retrieval-augmented generation, 3 results, sorted by date:

=== arXiv fetch via Atom API ===
Feed status: 301 (redirect to https, expected)
Entries returned: 3

[1] Fast Spatial Memory with Elastic Test-Time Training
    id: 2604.07350v1
    published: 2026-04-08T17:59:48Z
    categories: [cs.CV, cs.GR, cs.LG]
    pdf: https://arxiv.org/pdf/2604.07350v1

[2] MoRight: Motion Control Done Right
    id: 2604.07348v1
    published: 2026-04-08T17:59:22Z
    categories: [cs.CV, cs.AI, cs.GR, cs.LG, cs.RO]
    pdf: https://arxiv.org/pdf/2604.07348v1

[3] Interaction-Mediated Non-Reciprocal Dynamics in Open Quantum Systems
    id: 2604.07346v1
    published: 2026-04-08T17:57:26Z
    categories: [quant-ph, cond-mat.mes-hall]
    pdf: https://arxiv.org/pdf/2604.07346v1

Papers published yesterday (April 8, 2026). The pipeline fetches real-time preprints.

The Graph Gap — and How to Close It

arXiv's Atom API exposes no references, no citation graph, no inbound counts. This is the fundamental structural difference from Wikipedia (834 outlinks per page) and SEP (31 curated cross-references). To reconstruct the citation network, the pipeline needs an external bridge.

Planned approach: Semantic Scholar's free API provides references and citations endpoints for any paper, plus an influentialCitationCount field that approximates whether a paper is cited substantively (in the body) vs. perfunctorily (in the reference list). This gives ~80% of the signal of full PDF reference extraction at ~5% of the implementation cost.
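A sketch of the planned bridge as URL builders; the endpoint shapes follow Semantic Scholar's Graph API, which accepts "arXiv:&lt;id&gt;" identifiers directly, but field names like isInfluential should be checked against the current API docs:

```python
S2 = "https://api.semanticscholar.org/graph/v1/paper"

def references_url(arxiv_id: str, fields=("title", "externalIds")) -> str:
    # Outbound edges: the papers this preprint cites.
    return f"{S2}/arXiv:{arxiv_id}/references?fields={','.join(fields)}"

def citations_url(arxiv_id: str, fields=("title", "isInfluential")) -> str:
    # Inbound edges: the papers citing this preprint.
    return f"{S2}/arXiv:{arxiv_id}/citations?fields={','.join(fields)}"

def paper_url(arxiv_id: str) -> str:
    # influentialCitationCount lives on the paper record itself.
    return f"{S2}/arXiv:{arxiv_id}?fields=influentialCitationCount"
```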

Once wired in, this enables citation-distance visualization — plotting how the reachable paper set grows exponentially with hop distance, and why bounding traversal depth is critical for tractable knowledge ingest.
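Why bounding matters can be shown with a toy depth-limited BFS: with b references per paper, the reachable set grows roughly as b^depth (the neighbors function in the test is a synthetic stand-in for a real citation lookup):

```python
from collections import deque

def reachable_within(start, neighbors, max_depth):
    # Breadth-first traversal over a citation graph, cut off at max_depth hops.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand past the hop budget
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen
```

With a branching factor of 3, the frontier grows 1 → 4 → 13 papers over two hops; at realistic branching factors (30+ references per paper) an unbounded crawl is intractable within a few hops.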

At a Glance

Metric               Value
API                  arXiv Atom (export.arxiv.org/api)
Entry patterns       Category browse + free-text search
Category coverage    35 CS subcategories (extensible to math, physics, q-bio)
Storage backend      JSON metadata + downloaded PDFs
Text extraction      First 1–2 PDF pages via PyPDF2
Graph availability   None in API (Semantic Scholar bridge planned)
Dedup strategy       arxiv_id set from metadata.json

Three Sources, Three Roles

                   Wikipedia                        SEP                          arXiv
Role               Breadth + entities               Concept backbone             Primary-source verification
Graph signal       Dense (834 links/page)           Sparse, curated (31/page)    None in API
Content type       Section-aware HTML               Section-aware HTML           Metadata + PDF
Validation         6-point soup check               Basic                        HTTP status
Storage            SQLite                           JSON                         JSON + PDFs
Unique strength    Canonical keys + fake scraper    Query layer + preamble       35-category index + PDF text