Understanding the Landscape

The Universe of Research

500 million scholarly works exist. Not all can be used freely. Here's how licensing works and where you can actually access scientific literature.

The Big Picture

Scientific literature is vast, but not all of it is accessible, and not all accessible papers can be used freely.

Global Research Output

500M works indexed in OpenAlex
33M usable papers (CC-BY, CC0, CC-BY-SA)
~65% restricted (NC or paywalled)
10M+ new per year, and growing

Open access is growing ~3-5% per year. 107M works (35% of the 307M works with full metadata) are now open access.

The Accessibility Funnel

33M papers have usable licenses → 29M are addressable → 6M in our corpus (23M+ queued). The funnel narrows at each step: missing PDFs, rate limits, and access restrictions.
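The narrowing can be made concrete with a little arithmetic. This is a minimal sketch that uses only the rounded figures quoted above; the stage names are illustrative labels, not identifiers from any real system.

```python
# Funnel stages in millions of papers, using the rounded figures quoted above.
FUNNEL = [
    ("usable license", 33.0),
    ("addressable", 29.0),
    ("in corpus", 6.0),
]


def retention(stages):
    """Return (stage name, share of the previous stage retained) pairs."""
    rates = []
    for (_, prev), (name, count) in zip(stages, stages[1:]):
        rates.append((name, count / prev))
    return rates


if __name__ == "__main__":
    for name, rate in retention(FUNNEL):
        print(f"{name}: {rate:.0%} of the previous stage")
```

With these inputs, about 88% of usable-license papers are addressable, and roughly 21% of addressable papers are in the corpus so far.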

Corpus Composition

What we've downloaded and what's in the queue.

Downloaded

5.7M papers in Azure blob storage

📊 300K indexed in Postgres; bulk indexing pending

PMC XML: 5.1M
Direct Publishers: 500K
arXiv: 173K
PMC PDF: 119K
Unpaywall: 57K

Pending Queue

23.5M papers queued for download

Direct Publishers: 22.7M
Figshare: 328K
bioRxiv: 203K
arXiv (remaining): 148K
medRxiv: 58K
HAL: 18K
Zenodo: 13K

Total addressable corpus: 29.2M papers (5.7M downloaded + 23.5M queued)
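The queue total and the addressable total can be re-derived from the per-source figures. The sketch below is only a consistency check on the numbers listed in this section; the variable and function names are illustrative.

```python
# Pending-queue sizes in papers, as listed above.
QUEUED = {
    "Direct Publishers": 22_700_000,
    "Figshare": 328_000,
    "bioRxiv": 203_000,
    "arXiv (remaining)": 148_000,
    "medRxiv": 58_000,
    "HAL": 18_000,
    "Zenodo": 13_000,
}
DOWNLOADED_M = 5.7  # headline figure for downloaded papers, in millions


def millions(n):
    """Convert a paper count to millions, rounded to one decimal place."""
    return round(n / 1_000_000, 1)


queued_m = millions(sum(QUEUED.values()))
total_m = round(DOWNLOADED_M + queued_m, 1)

if __name__ == "__main__":
    print(f"queued: {queued_m}M, addressable: {total_m}M")
```

The per-source queue figures sum to 23.468M, which rounds to the 23.5M headline; adding the 5.7M downloaded gives 29.2M.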

Understanding Licenses

Licenses determine what you can do with a paper. Can you redistribute it? Build commercial products on it? Modify and republish it? The license answers these questions.

OK to Use

🎁

CC0

Public Domain

The author waives all rights. The work belongs to the public domain.

Can do:

  • Use commercially
  • Modify and build on it
  • No attribution required
Maximum freedom. No restrictions whatsoever.
👍

CC-BY

Attribution

Use freely for any purpose, including commercial. Just credit the original author.

Can do:

  • Use commercially
  • Modify and build on it
  • Share freely

Can't do:

  • Omit attribution to the author
Most common open license. Just cite the source.
🔄

CC-BY-SA

Attribution + ShareAlike

Use and modify freely, but your derivative works must use the same license.

Can do:

  • Use commercially
  • Modify and build on it

Can't do:

  • Use a more restrictive license on derivatives
Like Wikipedia. Derivative works must remain open.

Cannot Use

⚠️

CC-BY-NC

Non-Commercial

Free for academic and personal use. Commercial use is prohibited.

Can do:

  • Read and share
  • Use for research
  • Modify for personal use

Can't do:

  • Generate revenue
  • Use in commercial products
Fine for academic use. Problematic for commercial applications.
🚫

CC-BY-NC-ND

Non-Commercial + No Derivatives

Read and share unchanged. No modifications, no commercial use, no derivatives.

Can do:

  • Read it
  • Share unmodified copies non-commercially, with attribution

Can't do:

  • Modify
  • Extract data
  • Use commercially
Most restrictive Creative Commons license.
🔒

All Rights Reserved

Traditional Copyright

Full copyright protection. Any use requires explicit permission from the rights holder.

Can do:

  • Read with subscription access

Can't do:

  • Copy
  • Share
  • Build on it
  • Use without permission
Standard for paywalled journals. Requires licensing agreements.
❓

No License / Bronze

Free to Read, Unclear Rights

Available online for free, but no explicit license granted. Legal status unclear.

Can do:

  • Read it for free

Can't do:

  • Unclear what else is permitted
Legally ambiguous. Defaults to all rights reserved.

Quick Reference

License        Commercial Use   Modifications        Openness
CC0            Yes              Yes                  🟢 Most Open
CC-BY          Yes              Yes                  🟢 Very Open
CC-BY-SA       Yes              Yes (same license)   🟢 Open
CC-BY-NC       No               Yes                  🟡 Limited
CC-BY-NC-ND    No               No                   🔴 Restrictive
© Copyright    No               No                   🔴 Closed

The bottom line: ~4 million papers in PMC are CC-BY or CC0 and fully open for any use. Another ~6 million have various restrictions. The rest require subscriptions or licensing deals.
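The "OK to Use" / "Cannot Use" split above can be expressed as a small allow-list check. This is a sketch under stated assumptions, not a legal test: the function names are hypothetical, the label normalization is a guess at how license strings might appear in metadata, and anything missing or unrecognized is treated as not usable, in line with the note that unlicensed works default to all rights reserved.

```python
import re

# Licenses this section lists under "OK to Use".
USABLE = {"cc0", "cc-by", "cc-by-sa"}


def normalize(license_name):
    """Lowercase a label and unify separators, e.g. 'CC BY-SA 4.0' -> 'cc-by-sa'."""
    label = re.sub(r"[\s_]+", "-", license_name.strip().lower())
    # Drop a trailing version number such as "-4.0" (assumed label format).
    return re.sub(r"-\d+(\.\d+)?$", "", label)


def is_usable(license_name):
    """True only for an explicit CC0, CC-BY, or CC-BY-SA label."""
    if not license_name:
        return False  # no license given: treat as all rights reserved
    return normalize(license_name) in USABLE


if __name__ == "__main__":
    for label in ["CC BY 4.0", "CC-BY-NC", "cc0", None]:
        print(label, "->", is_usable(label))
```

An allow-list is the conservative choice here: a new or misspelled label is rejected until someone adds it deliberately.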

Where to Get Papers

Different sources have different coverage, licenses, and access methods. Here's the full map.

Open Access Sources

OpenAlex

Metadata Index

Open catalog of all scholarly works. 307M with full metadata via API, 193M in expansion pack (repositories, datasets).

Papers: 500M works indexed
License: See breakdown below
Coverage: All disciplines
API: ✓ Yes

OpenAlex License Breakdown (verified April 9, 2026)

500.0M papers

CC BY (Attribution): 4.7%
Public Domain (CC0 or expired copyright): 1.2%
CC BY-SA (Attribution + ShareAlike): 0.5%
CC BY-NC (Non-Commercial): 0.8%
CC BY-NC-ND (Non-Commercial + No Derivatives): 1.3%
Green OA, unclear license (repository copies): 11.0%
Bronze OA (free to read, no license): 2.8%
Closed / Paywalled (subscription required): 40.0%
XPAC expansion (repositories, datasets): 37.7%

Usable for Socratic: 32.2M papers (6.4%)
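A breakdown like this could in principle be recomputed by asking the OpenAlex API to group works by license. The sketch below only builds the request URL and makes no network call. The `primary_location.license` field and the `is_oa:true` filter are assumptions to check against the OpenAlex documentation, and the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"


def license_breakdown_url(open_access_only=False):
    """Build a URL that asks OpenAlex to count works per license label."""
    params = {"group_by": "primary_location.license"}  # assumed field name
    if open_access_only:
        params["filter"] = "is_oa:true"  # assumed filter syntax
    # Keep ':' readable in the filter value instead of percent-encoding it.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(license_breakdown_url())
    print(license_breakdown_url(open_access_only=True))
```

Separating URL construction from the HTTP call keeps the query easy to inspect and test before spending any rate-limited requests on it.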

PubMed Central (PMC)

Full Text

The gold standard for biomedical full text. Our primary source.

Papers: 7.7M articles
License: See breakdown below
Coverage: Biomedical only
API: ✓ Yes

PMC Open Access License Breakdown

7.7M papers

CC BY (Attribution): 64.2%
CC0 (Public Domain): 1.9%
CC BY-NC (Non-Commercial): 12.1%
CC BY-NC-ND (Non-Commercial + No Derivatives): 12.6%
CC BY-NC-SA (Non-Commercial + ShareAlike): 3.8%
NO-CC (Publisher-specific license): 5.2%

Usable for Socratic: 5.1M papers (66.1%)
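As a check on the figure above, the usable share is the sum of the CC BY, CC0, and CC BY-SA rows, applied to the article total. This is a minimal sketch using only the percentages listed for PMC; the names are illustrative, and the same function would work for any of the other breakdowns in this section.

```python
# PMC Open Access license shares in percent, as listed above.
PMC_SHARES = {
    "CC BY": 64.2,
    "CC0": 1.9,
    "CC BY-NC": 12.1,
    "CC BY-NC-ND": 12.6,
    "CC BY-NC-SA": 3.8,
    "NO-CC": 5.2,
}
USABLE_LICENSES = ("CC BY", "CC0", "CC BY-SA")
TOTAL_M = 7.7  # articles, in millions


def usable_share(shares, usable=USABLE_LICENSES):
    """Sum the percentage shares of the licenses counted as usable."""
    return sum(pct for name, pct in shares.items() if name in usable)


pct = usable_share(PMC_SHARES)
papers_m = TOTAL_M * pct / 100

if __name__ == "__main__":
    print(f"usable: {pct:.1f}% of {TOTAL_M}M = {papers_m:.1f}M papers")
```

The result, 66.1% of 7.7M, is about 5.1M papers, matching the stated figure.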

Figshare

Repository

Research data repository. Also hosts papers, posters, presentations.

Papers: 5M+
License: See breakdown below
Coverage: Data + Papers
API: ✓ Yes

Figshare License Breakdown

5.0M papers

CC BY (Attribution): 50.0%
CC0 (Public Domain): 10.0%
CC BY-NC (Non-Commercial): 20.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 20.0%

Usable for Socratic: 3.0M papers (60.0%)

Zenodo

Repository

CERN's open repository. Papers, datasets, software. EU-funded research compliant.

Papers: 3.97M
License: See breakdown below
Coverage: All disciplines
API: ✓ Yes

Zenodo License Breakdown

4.0M papers

CC BY (Attribution): 40.0%
CC0 (Public Domain): 10.0%
CC BY-NC (Non-Commercial): 20.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 15.0%
Other (custom or unclear): 15.0%

Usable for Socratic: 2.0M papers (50.0%)

HAL

Repository

French national open archive. Mandatory deposit for French-funded research.

Papers: 4.53M
License: See breakdown below
Coverage: French research
API: ✓ Yes

HAL License Breakdown

4.5M papers

CC BY (Attribution): 30.0%
CC0 (Public Domain): 10.0%
CC BY-NC (Non-Commercial): 20.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 20.0%
Embargoed (delayed access): 20.0%

Usable for Socratic: 1.8M papers (40.0%)

RePEc

Index + Preprints

Economics working papers and articles. Volunteer-run network.

Papers: 3.8M
License: See breakdown below
Coverage: Economics
API: ✓ Yes

RePEc License Breakdown

3.8M papers

CC BY (Attribution): 20.0%
Unclear (no explicit license): 50.0%
CC BY-NC (Non-Commercial): 20.0%
Paywalled (links to published version): 10.0%

Usable for Socratic: 0.8M papers (20.0%)

arXiv

Preprints

Preprints before peer review. Great for ML/AI, physics, math.

Papers: 2.5M+
License: See breakdown below
Coverage: Physics, Math, CS, Quant Bio
API: ✓ Yes

arXiv License Breakdown

2.5M papers

arXiv License (default, non-exclusive): 76.0%
CC BY (Attribution): 18.0%
CC BY-SA (Attribution + ShareAlike): 2.0%
CC BY-NC-SA (Non-Commercial + ShareAlike): 2.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 1.5%
CC0 (Public Domain): 0.5%

Usable for Socratic: 0.5M papers (20.5%)

OSF Preprints

Preprints

General preprint server by Center for Open Science.

Papers: 50K+
License: See breakdown below
Coverage: All disciplines
API: ✓ Yes

OSF Preprints License Breakdown

0.1M papers

CC BY (Attribution): 55.0%
CC0 (Public Domain): 10.0%
CC BY-NC (Non-Commercial): 20.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 15.0%

Usable for Socratic: roughly 33K papers (65.0% of 50K+)

bioRxiv / medRxiv

Preprints

Preprints for biology and health sciences. Cutting-edge but not peer reviewed.

Papers: 300K+
License: See breakdown below
Coverage: Biology & Medicine
API: ✓ Yes

bioRxiv License Breakdown

0.1M papers

CC BY-NC-ND (Non-Commercial + No Derivatives): 38.0%
CC BY (Attribution): 22.0%
No Reuse (all rights reserved): 19.0%
CC BY-NC (Non-Commercial): 13.0%
CC BY-ND (No Derivatives): 7.0%
CC0 (Public Domain): 1.0%

Usable for Socratic: <0.05M papers (23.0%)

medRxiv License Breakdown

0.1M papers

CC BY (Attribution): 30.6%
CC BY-NC-ND (Non-Commercial + No Derivatives): 24.8%
No Reuse (all rights reserved): 22.8%
CC BY-NC (Non-Commercial): 10.6%
CC BY-ND (No Derivatives): 7.4%
CC0 (Public Domain): 3.6%

Usable for Socratic: <0.05M papers (34.2%)

PsyArXiv

Preprints

Psychology preprints on OSF platform.

Papers: 20K+
License: See breakdown below
Coverage: Psychology
API: ✓ Yes

PsyArXiv License Breakdown

20K+ papers

CC BY (Attribution): 50.0%
CC0 (Public Domain): 5.0%
CC BY-NC (Non-Commercial): 25.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 20.0%

Usable for Socratic: roughly 11K papers (55.0% of 20K+)

ChemRxiv

Preprints

Chemistry preprints from ACS, RSC, and others. Authors choose CC license at submission.

Papers: 25K+
License: See breakdown below
Coverage: Chemistry
API: ✓ Yes

ChemRxiv License Breakdown

25K+ papers

CC BY (Attribution): 35.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 40.0%
CC BY-NC (Non-Commercial): 25.0%

Usable for Socratic: roughly 9K papers (35.0% of 25K+)

SocArXiv

Preprints

Social sciences preprints on OSF platform.

Papers: 15K+
License: See breakdown below
Coverage: Social Sciences
API: ✓ Yes

SocArXiv License Breakdown

15K+ papers

CC BY (Attribution): 50.0%
CC0 (Public Domain): 5.0%
CC BY-NC (Non-Commercial): 25.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 20.0%

Usable for Socratic: roughly 8K papers (55.0% of 15K+)

EarthArXiv

Preprints

Earth and planetary sciences preprints. Community-run via OSF.

Papers: 10K+
License: See breakdown below
Coverage: Earth Sciences
API: ✓ Yes

EarthArXiv License Breakdown

10K+ papers

CC BY (Attribution): 55.0%
CC0 (Public Domain): 10.0%
CC BY-NC (Non-Commercial): 20.0%
CC BY-NC-ND (Non-Commercial + No Derivatives): 15.0%

Usable for Socratic: roughly 6.5K papers (65.0% of 10K+)

Paywalled Sources

These publishers control the majority of scientific literature. Access requires an institutional subscription or a per-article purchase ($30-50 per article). For Socratic to use this content, we'd need licensing deals.

Elsevier: 18% (2,700+ journals)
Springer Nature: 13% (3,000+ journals)
Wiley: 12% (1,700+ journals)
Taylor & Francis: 6% (2,500+ journals)
SAGE: 3% (1,000+ journals)
Others: 48% (25,000+ journals)

Key Insight

"Open access papers receive 18-30% more citations on average, and the highest-impact research is increasingly open."

High-impact research is increasingly open access due to funder mandates (NIH, Wellcome, Gates). The most important papers are becoming free, so you don't need to license everything to capture most of the value.