The Big Picture
Scientific literature is vast, but not all of it is accessible, and not all accessible papers can be used freely.
Global Research Output
500M
Works indexed
in OpenAlex
33M
Usable papers
CC-BY, CC0, CC-BY-SA
~65%
Restricted
NC or paywalled
10M+
New per year
and growing
Open access is growing ~3-5% per year. 107M works (35%) are now open access.
The Accessibility Funnel
33M papers have usable licenses → 29M are addressable → 6M in our corpus (23M+ queued). The funnel narrows at each step: missing PDFs, rate limits, and access restrictions.
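As a rough illustration of how sharply the funnel narrows, the sketch below computes the fraction retained at each step. The stage counts are the rounded figures quoted above; the stage labels and the function name `step_retention` are made up for this example.

```python
# Stage counts in millions, taken from the funnel figures quoted above.
# The short stage labels are illustrative, not official terminology.
FUNNEL = [
    ("usable license", 33.0),
    ("addressable", 29.0),
    ("in corpus", 6.0),
]


def step_retention(stages):
    """Return (from_stage, to_stage, fraction_kept) for each consecutive pair."""
    return [
        (a_name, b_name, b_count / a_count)
        for (a_name, a_count), (b_name, b_count) in zip(stages, stages[1:])
    ]


if __name__ == "__main__":
    for src, dst, kept in step_retention(FUNNEL):
        print(f"{src} -> {dst}: {kept:.0%} kept")
    overall = FUNNEL[-1][1] / FUNNEL[0][1]
    print(f"overall: {overall:.0%} of usable papers are in the corpus")
```

With these rounded inputs, most of the loss happens between "addressable" and "in corpus", which matches the large pending queue.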
Corpus Composition
What we've downloaded and what's in the queue.
Downloaded
5.7M papers in Azure blob storage
300K indexed in Postgres; bulk indexing pending
PMC XML
5.1M
Direct Publishers
500K
arXiv
173K
PMC PDF
119K
Unpaywall
57K
Pending Queue
23.5M papers queued for download
Direct Publishers
22.7M
Figshare
328K
bioRxiv
203K
arXiv (remaining)
148K
medRxiv
58K
HAL
18K
Zenodo
13K
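To see how the corpus is distributed across sources, the following sketch totals the per-source counts listed above and prints each source's share. The figures are the rounded values from this page, so the sums are approximate (the downloaded sources add up to roughly 5.9M against the 5.7M headline). The helper name `shares` is an assumption of this example.

```python
# Per-source paper counts as listed above (rounded figures from the page).
DOWNLOADED = {
    "PMC XML": 5_100_000,
    "Direct Publishers": 500_000,
    "arXiv": 173_000,
    "PMC PDF": 119_000,
    "Unpaywall": 57_000,
}
PENDING = {
    "Direct Publishers": 22_700_000,
    "Figshare": 328_000,
    "bioRxiv": 203_000,
    "arXiv (remaining)": 148_000,
    "medRxiv": 58_000,
    "HAL": 18_000,
    "Zenodo": 13_000,
}


def shares(counts):
    """Map each source to its fraction of the group total, largest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {name: n / total for name, n in ranked}


if __name__ == "__main__":
    for label, group in (("Downloaded", DOWNLOADED), ("Pending", PENDING)):
        print(f"{label}: {sum(group.values()) / 1e6:.2f}M total")
        for name, frac in shares(group).items():
            print(f"  {name:<20} {frac:6.1%}")
```

Run this way, the numbers show that PMC XML dominates what is already downloaded, while direct publishers account for nearly all of the pending queue.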
Understanding Licenses
Licenses determine what you can do with a paper. Can you redistribute it? Build commercial products on it? Modify and republish it? The license answers these questions.
OK to Use
CC0
Public Domain
The author waives all rights. The work belongs to the public domain.
Can do:
- Use commercially
- Modify and build on it
- Use without attribution
CC-BY
Attribution
Use freely for any purpose, including commercial. Just credit the original author.
Can do:
- Use commercially
- Modify and build on it
- Share freely
Can't do:
- Omit attribution to the author
CC-BY-SA
Attribution + ShareAlike
Use and modify freely, but your derivative works must use the same license.
Can do:
- Use commercially
- Modify and build on it
Can't do:
- Use a more restrictive license on derivatives
Cannot Use
CC-BY-NC
Non-Commercial
Free for academic and personal use. Commercial use is prohibited.
Can do:
- Read and share
- Use for research
- Modify for personal use
Can't do:
- Generate revenue
- Use in commercial products
CC-BY-NC-ND
Non-Commercial + No Derivatives
Read-only. No modifications, no commercial use, no derivatives.
Can do:
- Read it
Can't do:
- Modify
- Extract data
- Use commercially
All Rights Reserved
Traditional Copyright
Full copyright protection. Any use requires explicit permission from the rights holder.
Can do:
- Read with subscription access
Can't do:
- Copy
- Share
- Build on it
- Use without permission
No License / Bronze
Free to Read, Unclear Rights
Available online for free, but no explicit license granted. Legal status unclear.
Can do:
- Read it for free
Can't do:
- Unclear what else is permitted
Quick Reference
| License | Commercial Use | Modifications | Openness |
|---|---|---|---|
| CC0 | Yes | Yes | 🟢 Most Open |
| CC-BY | Yes | Yes | 🟢 Very Open |
| CC-BY-SA | Yes | Yes (same license) | 🟢 Open |
| CC-BY-NC | No | Yes (non-commercial) | 🟡 Limited |
| CC-BY-NC-ND | No | No | 🔴 Restrictive |
| © Copyright | No | No | 🔴 Closed |
The bottom line: ~4 million papers in PMC are CC-BY or CC0, fully open for any use. Another ~6 million have various restrictions. The rest require subscriptions or licensing deals.
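The "OK to Use" and "Cannot Use" groupings above can be applied as a simple filter. This is a minimal sketch, assuming license identifiers arrive as short strings such as `cc-by` or `CC_BY_NC`. The function names, the normalization rules, and the choice to treat anything unrecognized as not usable are assumptions of this example, not a description of any production pipeline.

```python
# License groups as described in this section.
USABLE = {"cc0", "cc-by", "cc-by-sa"}
RESTRICTED = {"cc-by-nc", "cc-by-nc-nd"}


def normalize(license_id):
    """Lowercase and hyphenate a license string; None or empty becomes ''."""
    if not license_id:
        return ""
    return license_id.strip().lower().replace("_", "-").replace(" ", "-")


def classify(license_id):
    """Return 'usable', 'restricted', or 'unknown' for a license string."""
    key = normalize(license_id)
    if key in USABLE:
        return "usable"
    if key in RESTRICTED:
        return "restricted"
    # Missing licenses, bronze OA, and all-rights-reserved all land here.
    return "unknown"


def is_usable(license_id):
    """Conservative check: only explicitly usable licenses pass."""
    return classify(license_id) == "usable"


if __name__ == "__main__":
    for lic in ["CC-BY", "cc_by_nc", "CC0", None, "All Rights Reserved"]:
        print(f"{lic!r:>24} -> {classify(lic)}")
```

Defaulting to "unknown" keeps the filter conservative: a paper with no explicit license is excluded rather than assumed open, which mirrors the "No License / Bronze" entry above.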
Where to Get Papers
Different sources have different coverage, licenses, and access methods. Here's the full map.
Open Access Sources
OpenAlex
Metadata Index
Open catalog of all scholarly works. 307M with full metadata via API, 193M in expansion pack (repositories, datasets).
Papers
500M works indexed
License
See breakdown below
Coverage
All disciplines
API
Yes
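Since OpenAlex exposes an API, license counts can in principle be pulled programmatically. The sketch below builds a query URL against the public `https://api.openalex.org/works` endpoint and reads the total from the response's `meta.count` field. The filter name `best_oa_location.license` and the lowercase license identifiers (such as `cc-by`) are assumptions here and should be checked against the current OpenAlex documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def license_count_url(license_id, mailto=None):
    """Build a works query that filters by license (filter name is assumed)."""
    params = {
        "filter": f"best_oa_location.license:{license_id}",
        "per-page": 1,  # only the count in the metadata is needed
    }
    if mailto:
        # Optional contact address for polite API usage.
        params["mailto"] = mailto
    # Keep ':' unescaped so the filter stays readable in logs.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':')}"


def fetch_count(url, timeout=30):
    """Fetch the URL and return the total number of matching works."""
    with urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["meta"]["count"]
```

A call such as `fetch_count(license_count_url("cc-by"))` would make one network request per license. Counts retrieved this way will drift from the figures on this page as the index grows.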
OpenAlex (Verified April 9, 2026) License Breakdown
500.0M papers
PubMed Central (PMC)
Full Text
The gold standard for biomedical full text. Our primary source.
Papers
7.7M articles
License
See breakdown below
Coverage
Biomedical only
API
Yes
PMC Open Access License Breakdown
7.7M papers
Figshare
Repository
Research data repository. Also hosts papers, posters, presentations.
Papers
5M+
License
See breakdown below
Coverage
Data + Papers
API
Yes
Figshare License Breakdown
5.0M papers
Zenodo
Repository
CERN's open repository. Papers, datasets, software. Compliant with EU-funded research requirements.
Papers
3.97M
License
See breakdown below
Coverage
All disciplines
API
Yes
Zenodo License Breakdown
4.0M papers
HAL
Repository
French national open archive. Mandatory deposit for French-funded research.
Papers
4.53M
License
See breakdown below
Coverage
French research
API
Yes
HAL License Breakdown
4.5M papers
RePEc
Index + Preprints
Economics working papers and articles. Volunteer-run network.
Papers
3.8M
License
See breakdown below
Coverage
Economics
API
Yes
RePEc License Breakdown
3.8M papers
arXiv
Preprints
Preprints before peer review. Great for ML/AI, physics, math.
Papers
2.5M+
License
See breakdown below
Coverage
Physics, Math, CS, Quant Bio
API
Yes
arXiv License Breakdown
2.5M papers
OSF Preprints
Preprints
General preprint server by the Center for Open Science.
Papers
50K+
License
See breakdown below
Coverage
All disciplines
API
Yes
OSF Preprints License Breakdown
0.1M papers
bioRxiv / medRxiv
Preprints
Preprints for biology and health sciences. Cutting-edge but not peer reviewed.
Papers
300K+
License
See breakdown below
Coverage
Biology & Medicine
API
Yes
bioRxiv License Breakdown
0.1M papers
medRxiv License Breakdown
0.1M papers
PsyArXiv
Preprints
Psychology preprints on OSF platform.
Papers
20K+
License
See breakdown below
Coverage
Psychology
API
Yes
PsyArXiv License Breakdown
0.0M papers
ChemRxiv
Preprints
Chemistry preprints from ACS, RSC, and others. Authors choose CC license at submission.
Papers
25K+
License
See breakdown below
Coverage
Chemistry
API
Yes
ChemRxiv License Breakdown
0.0M papers
SocArXiv
Preprints
Social sciences preprints on OSF platform.
Papers
15K+
License
See breakdown below
Coverage
Social Sciences
API
Yes
SocArXiv License Breakdown
0.0M papers
EarthArXiv
Preprints
Earth and planetary sciences preprints. Community-run via OSF.
Papers
10K+
License
See breakdown below
Coverage
Earth Sciences
API
Yes
EarthArXiv License Breakdown
0.0M papers
Paywalled Sources
These publishers control the majority of scientific literature. Access requires per-article purchases ($30-50 per article) or an institutional subscription. For Socratic to use this content, we'd need licensing deals.
Elsevier
18%
2,700+ journals
Springer Nature
13%
3,000+ journals
Wiley
12%
1,700+ journals
Taylor & Francis
6%
2,500+ journals
SAGE
3%
1,000+ journals
Others
48%
25,000+ journals
Key Insight
"Open access papers receive 18-30% more citations on average, and the highest-impact research is increasingly open."
High-impact research is increasingly open access due to funder mandates (NIH, Wellcome, Gates). The most important papers are becoming free, so you don't need to license everything to capture most of the value.