RAG2riches is a Python library I am developing to support comparative retrieval-augmented generation workflows for social science research. The package is designed for researchers who want to ask the same substantive question across structured subsets of a text corpus, such as party-year, firm-year, outlet-month, country-period, or other metadata-defined cells. Rather than retrieving from the entire corpus and filtering afterward, RAG2riches filters on metadata before retrieval (metadata-prefiltered retrieval), so each generated response is grounded only in the documents that belong to the comparison cell being analyzed. This design makes the library especially useful for comparative text analysis, where the goal is not simply to summarize a corpus but to compare how different actors, institutions, time periods, or contexts frame the same issue.
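To make the filter-first idea concrete, here is a minimal, library-independent sketch. It does not use the RAG2riches API: the names `CHUNKS`, `embed`, `cosine`, and `retrieve_prefiltered` are hypothetical, and the bag-of-words "embedding" is only a stand-in for a real embedding model. The point is the order of operations: restrict to the comparison cell first, then rank what remains.

```python
from collections import Counter
from math import sqrt

# Toy corpus: each chunk carries the metadata that defines its comparison cell.
CHUNKS = [
    {"text": "We will cut taxes for working families", "party": "A", "year": 2020},
    {"text": "Climate policy needs a carbon tax", "party": "B", "year": 2020},
    {"text": "Taxes on carbon emissions hurt industry", "party": "A", "year": 2024},
    {"text": "Public transport investment reduces emissions", "party": "B", "year": 2024},
]


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_prefiltered(query, chunks, cell, k=2):
    """Keep only chunks whose metadata matches the cell, then rank those."""
    candidates = [
        c for c in chunks
        if all(c.get(key) == value for key, value in cell.items())
    ]
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


# Hypothetical cell: party A in 2024. Chunks from other cells can never be returned.
hits = retrieve_prefiltered("carbon tax", CHUNKS, {"party": "A", "year": 2024})
for hit in hits:
    print(hit["party"], hit["year"], hit["text"])
```

In this toy corpus the best match for "carbon tax" across all chunks belongs to party B in 2020, so retrieving globally and filtering afterward could leave the party A, 2024 cell with little or no context. Filtering first guarantees that every retrieved chunk comes from the cell under analysis.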
The library supports an end-to-end workflow for ingesting CSV, TXT, and PDF files, cleaning and chunking text, propagating document- and chunk-level metadata, constructing comparison cells, embedding and retrieving cell-specific context, generating grounded responses, and exporting results for downstream analysis. Current functionality includes mock and LiteLLM-backed embeddings and generation, in-memory and LanceDB vector stores, metadata-prefiltered retrieval, checkpointing and resume logic, a high-level pipeline API, and a Streamlit interface for interactive comparative runs. The broader design goal is to make advanced RAG-based text analysis more accessible to social scientists while preserving the transparency, modularity, and reproducibility needed for empirical research workflows.
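The cell construction and checkpoint-and-resume steps can be illustrated with a short sketch using only the standard library. This is an assumption-laden illustration rather than the RAG2riches pipeline API: `build_cells`, `run_comparison`, and `mock_answer` are hypothetical names, the checkpoint is a plain JSON file, and the mock generator stands in for retrieval plus an LLM call.

```python
import json
import tempfile
from itertools import groupby
from pathlib import Path


def build_cells(docs, keys):
    """Group documents into comparison cells keyed by the chosen metadata fields."""
    def key_fn(doc):
        return tuple(doc[k] for k in keys)

    ordered = sorted(docs, key=key_fn)  # groupby needs sorted input
    return {cell: list(group) for cell, group in groupby(ordered, key=key_fn)}


def run_comparison(docs, keys, question, answer_fn, checkpoint_path):
    """Ask the same question of every cell, skipping cells already checkpointed."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for cell, members in build_cells(docs, keys).items():
        cell_id = "|".join(str(value) for value in cell)
        if cell_id in done:
            continue  # resume: this cell finished in an earlier run
        done[cell_id] = answer_fn(question, members)
        path.write_text(json.dumps(done, indent=2))  # checkpoint after each cell
    return done


DOCS = [
    {"text": "We will cut taxes", "party": "A", "year": 2020},
    {"text": "Tax relief for families", "party": "A", "year": 2020},
    {"text": "A carbon tax is overdue", "party": "B", "year": 2020},
    {"text": "Carbon taxes hurt industry", "party": "A", "year": 2024},
]

calls = []  # records which cells actually reached the (mock) generator


def mock_answer(question, members):
    """Mock generator: reports how many documents the cell contained."""
    calls.append(len(members))
    return f"{len(members)} document(s) considered"


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    question = "How is taxation framed?"
    results = run_comparison(DOCS, ["party", "year"], question, mock_answer, checkpoint)
    calls_after_first_run = len(calls)
    # A second run finds every cell in the checkpoint and generates nothing new.
    resumed = run_comparison(DOCS, ["party", "year"], question, mock_answer, checkpoint)
```

Writing the checkpoint after each cell means an interrupted run loses at most the cell in progress, which matters when each cell involves paid API calls.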
RAG2riches is currently available as an early alpha release. The first alpha version, 0.1.0a1, implements the core comparative RAG architecture: data types, ingestion, cleaning, chunking, metadata utilities, embedding and vector store interfaces, metadata-aware retrieval, response generation, comparative execution, checkpointing, and export tools, along with tests, documentation, and a Streamlit UI. As an alpha-stage project, the API and feature set are still evolving, and generated outputs should be validated as part of a rigorous research workflow.
The library is intended to bridge modern LLM infrastructure with the practical needs of empirical researchers working with large, structured text corpora. The source code is available on GitHub: https://github.com/MatthiasRo/RAG2riches.