End-to-end physical book acquisition and digitization for AI labs. Bring us a list — or just a spec — and we deliver clean, lawfully-sourced PDFs at scale, from our US scanning facilities.
Leading frontier AI labs, enterprise AI teams, and government research programs.
We don't name our clients. That's the point.
Web scrapes are noisy, fragmented, and increasingly contested. Books are edited, peer-reviewed, structured by chapter, and rich with the long-form reasoning that frontier models need to keep improving — and they're available, lawfully, by the millions.
Edited, fact-checked, and curated by subject experts over decades — the gold standard for structured, long-form knowledge.
Every book is inherently categorized by subject, genre, and discipline — ideal for precise, targeted domain training.
From peer-reviewed scientific texts to rare historical works — coverage and depth no web scrape can replicate.
Recent rulings have made one thing clear: how you acquire training data matters as much as how you train on it. Pirated datasets have triggered nine- and ten-figure settlements. Physically-sourced, lawfully-owned books have not.
We buy the books. We own the books. We scan the books we own. Every order ships with a clean chain of custody — invoices, ISBNs, receipts, and an NDA — so your legal team has a defensible paper trail before the first page is digitized.
And we don't stop at the scan. We help prepare the legal documentation your team needs, handle post-scan destruction on request, and run content deduplication across your order so you're never paying twice for the same book.
Every book is purchased through wholesale channels. Once owned, it's ours to scan — the doctrine that's protected libraries and resellers for over a century.
Invoices, ISBNs, source ledgers, and scan timestamps for every title. Documentation built for legal review, not just delivery.
Nothing in our pipeline touches gray-market or pirated sources. Ever. Your model never inherits someone else's lawsuit.
Every engagement runs under NDA. Your sourcing list, subject focus, and delivery details remain entirely between us.
We prepare the full documentation package your legal team needs — purchase invoices, ISBN-level provenance ledgers, ownership attestations, and scan-completion records, formatted for review and audit.
Once your dataset is delivered, we can destroy and recycle the physical books on your behalf, with a signed certificate of destruction. This is the same model that has already been validated in court for AI training data acquisition.
Different editions, reprints, translations, and content overlaps quietly inflate book orders. We dedupe across ISBNs, editions, and content fingerprints before sourcing — so you're not paying twice for the same material.
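For teams that want to pre-check their own lists, the deduplication idea can be sketched roughly as follows. This is a minimal illustration, not a description of the LibrisX pipeline: the record fields (`isbn`, `title`, `author`) and the helper names `to_isbn13`, `work_key`, and `dedupe` are assumptions made for the example, and a hash of the normalized title and author stands in for a real content fingerprint. It will catch reprints and reformatted ISBNs, but not translations or partial content overlap.

```python
import hashlib
import re


def to_isbn13(isbn: str) -> str:
    """Normalize an ISBN-10 or ISBN-13 string to a bare 13-digit ISBN."""
    digits = re.sub(r"[^0-9Xx]", "", isbn).upper()
    if len(digits) == 13:
        return digits
    if len(digits) != 10:
        raise ValueError(f"not an ISBN: {isbn!r}")
    # ISBN-10 -> ISBN-13: prefix 978, drop the old check digit, recompute.
    core = "978" + digits[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def work_key(title: str, author: str) -> str:
    """Hypothetical stand-in for a content fingerprint: hash of normalized title and author."""
    def norm(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return hashlib.sha1(f"{norm(title)}|{norm(author)}".encode()).hexdigest()


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each ISBN-13 and each work key."""
    seen_isbns: set[str] = set()
    seen_works: set[str] = set()
    kept = []
    for rec in records:
        isbn13 = to_isbn13(rec["isbn"])
        key = work_key(rec["title"], rec["author"])
        if isbn13 in seen_isbns or key in seen_works:
            continue  # same edition, or another edition of the same work
        seen_isbns.add(isbn13)
        seen_works.add(key)
        kept.append({**rec, "isbn": isbn13})
    return kept


# Illustrative order list (titles and authors are made up for the example).
order = [
    {"isbn": "0-306-40615-2", "title": "Example Title", "author": "A. Author"},
    {"isbn": "978-0-306-40615-7", "title": "Example Title", "author": "A. Author"},
    {"isbn": "9781861972712", "title": "Example  title!", "author": "a author"},
    {"isbn": "9780131103627", "title": "Another Book", "author": "B. Writer"},
]
unique = dedupe(order)
```

Here the second row repeats the first ISBN in 13-digit form and the third is a different ISBN for the same title and author, so only two rows survive. Matching on real page content would need text from the scans themselves, which is outside what this sketch covers.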
LibrisX is a one-stop shop for AI training data when that data needs to come from books. You bring the spec (an ISBN list, a subject area, a language, an era, a volume target) and we handle everything from procurement through delivery.
Books are sourced through our wholesale network, received at our scanning facilities, processed on industrial Fujitsu lines, and delivered to you as clean PDFs (or your preferred format) with full metadata and a complete chain-of-custody record.
An ISBN list, a subject brief, or just a target — language, era, volume. We help shape the order if you don't have specifics.
Procured through our wholesale network with access to 30M+ unique titles annually. Bulk pricing, full invoicing.
Books arrive at our scanning facilities, are catalogued by ISBN, and queued for the scanning line.
Custom Fujitsu high-speed scanners running 24/7. Up to 600 dpi, full color or grayscale, OCR-ready output.
Clean PDFs (searchable or image), plain text, or your preferred format. Delivered with metadata and chain-of-custody documentation.
Seven years building one of the largest physical book operations in North America — now purpose-built for AI training pipelines.
200,000 sq. ft. of combined operational space across three US sites. Built for high-throughput receiving, sorting, scanning, and outbound logistics.
Custom-modified Fujitsu industrial scanners engineered for continuous 24/7 operation — sustaining tens of thousands of books per week of consistent, audit-ready output.
Established wholesale relationships with access to 30M+ unique titles per year across subjects, languages, and eras. We can fulfill ISBN lists or curate by spec.
| Metric | Figure |
| --- | --- |
| Sourcing Network | 30M+ titles / year |
| Active Inventory | 1M+ books on hand |
| Scan Throughput | 50,000 books / week |
| 12-Month Target | 7M-book scanned dataset |
From frontier model labs to government research initiatives — book datasets sourced and scanned to your spec, at any scale.
High-volume book corpora for pre-training and fine-tuning frontier models. Domain-diverse or subject-specific runs available.
Sovereign, US-based, NDA-bound delivery. Historical, multilingual, policy, and scientific literature curated to your requirements.
Academic pricing available. Rare, niche, and scientific literature — for university labs, national research centers, and digital humanities projects.
Custom vertical book sets for legal, medical, financial, and technical domains — sourced and scanned to your model's specific needs.
Our sourcing network reaches across hundreds of disciplines. Whether you need a broad multi-domain corpus or a tightly targeted vertical, we build the order to your exact spec.
Don't see your niche? Every order is custom-built.
Request a Custom Build
All tiers include sourcing, scanning, OCR, and full chain-of-custody documentation. Volume discounts apply automatically.
Tell us what you need. Our team responds within 24 hours with a tailored proposal and timeline.