Data API: data.nemar.org¶
Public HTTPS access to every published NEMAR dataset, BIDS-shaped. No nemar-cli, no git-annex, no NEMAR account.
The same handlers are reachable at three URL forms (pick whichever is most convenient for your client):
https://data.nemar.org/<datasetId>/<version>/<path> # canonical
https://api.nemar.org/data/<datasetId>/<version>/<path> # API-hostname alias
https://<workers-dev-host>/data/<datasetId>/<version>/<path> # dev/testing
This document describes the canonical form.
URL grammar¶
/<datasetId>/<version>/<path>
- <datasetId> -- one of nm, xx, on followed by six digits (e.g. nm000103).
- <version> -- latest or an explicit vX.Y.Z tag. latest resolves to the most recently DOI'd version recorded in the catalog.
- <path> -- BIDS-relative file or directory path. Trailing slashes accepted. Path traversal segments (.., absolute paths) return 404.
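The grammar can be checked on the client before a request is sent. The following is a minimal sketch, not the server's implementation: the function name build_data_url is hypothetical, and it assumes that X, Y and Z in vX.Y.Z are plain decimal numbers.

```python
import re

# Assumption: X, Y and Z in "vX.Y.Z" are decimal integers.
DATASET_ID_RE = re.compile(r"(?:nm|xx|on)\d{6}")
VERSION_RE = re.compile(r"latest|v\d+\.\d+\.\d+")


def build_data_url(dataset_id: str, version: str, path: str = "") -> str:
    """Return the canonical URL, or raise ValueError if a part breaks the grammar."""
    if not DATASET_ID_RE.fullmatch(dataset_id):
        raise ValueError(f"bad dataset id: {dataset_id!r}")
    if not VERSION_RE.fullmatch(version):
        raise ValueError(f"bad version: {version!r}")
    # The server answers 404 for absolute paths and ".." segments,
    # so reject them locally instead of spending a request.
    if path.startswith("/") or ".." in path.split("/"):
        raise ValueError(f"path traversal in: {path!r}")
    return f"https://data.nemar.org/{dataset_id}/{version}/{path}"
```

For example, build_data_url("nm000103", "latest", "sub-01/eeg/") yields https://data.nemar.org/nm000103/latest/sub-01/eeg/.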
Endpoints¶
GET /<datasetId>/<version>/<bids-path>¶
If <bids-path> matches a file in the version manifest, responds 302 Found
with the Location header pointing at the file bytes. Two backends:
- Git-annex content (large blobs): presigned S3 GET URL, valid for 1 hour.
- Inline git content (small files like
dataset_description.json): a raw.githubusercontent.com URL pinned to the version tag.
If <bids-path> is a directory (i.e. one or more manifest entries start with
<bids-path>/), responds 200 OK with an Apache-style HTML directory
listing.
If neither, responds 404.
Cache-Control: public, max-age=300 on file redirects, max-age=60 on HTML
indexes.
File redirects also carry Last-Modified (the version's publication
timestamp in RFC 1123 format) and ETag (the content checksum, quoted
per RFC 7232 -- "sha256:<hex>" for git-annex files, "git:<sha>" for
inline git content). Content-Length is intentionally omitted from the
302 (per RFC 9110 §8.6 it describes the empty message body, not the
redirect target) -- use HEAD if you need the size without following
the redirect.
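A client that wants to use the ETag as a checksum has to strip the quotes and split off the scheme prefix. The helper below is a sketch under the assumption that only the two forms described above ("sha256:<hex>" and "git:<sha>") occur; parse_etag is a hypothetical name, not part of any NEMAR tooling.

```python
def parse_etag(etag: str) -> tuple[str, str]:
    """Split a quoted ETag such as '"sha256:<hex>"' into (scheme, value)."""
    value = etag.strip()
    if value.startswith("W/"):  # tolerate a weak-validator prefix, if any
        value = value[2:]
    value = value.strip('"')
    scheme, sep, digest = value.partition(":")
    # Only the two schemes documented for file redirects are accepted.
    if not sep or scheme not in ("sha256", "git"):
        raise ValueError(f"unrecognised ETag: {etag!r}")
    return scheme, digest
```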
HEAD /<datasetId>/<version>/<bids-path>¶
Returns 200 with Content-Length, Last-Modified, ETag, and
Cache-Control: public, max-age=300 headers and an empty body. Used
by rclone sync and other HTTP-backend mirroring tools to detect
file changes without transferring the file body.
For directory paths, returns 200 with Content-Type: text/html and an
empty body.
For paths that don't exist in the requested version, returns 404 with
an empty body. The tombstone walk (used on GET to surface a last_seen_*
hint) is intentionally skipped on HEAD -- a sync against a divergent
local copy fans out many HEAD requests, and the per-HEAD walk would
amplify them into many S3 round-trips.
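The change-detection idea can be sketched on the client side as follows. This is an illustration only, not how rclone decides: needs_download is a hypothetical helper that takes the headers of a HEAD response as a plain dict, compares sizes first, and hashes the local file only when the ETag carries a sha256 digest.

```python
import hashlib
from pathlib import Path


def needs_download(local: Path, head_headers: dict[str, str]) -> bool:
    """Decide from HEAD response headers whether `local` is missing or stale."""
    if not local.is_file():
        return True
    headers = {k.lower(): v for k, v in head_headers.items()}
    # Cheap check first: a size mismatch is conclusive.
    size = headers.get("content-length")
    if size is not None and int(size) != local.stat().st_size:
        return True
    # Only "sha256:<hex>" ETags are compared here; for other schemes
    # this sketch falls back to trusting the size check.
    etag = headers.get("etag", "").strip('"')
    if etag.startswith("sha256:"):
        digest = hashlib.sha256()
        with local.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest() != etag[len("sha256:"):]
    return False
```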
GET /<datasetId>/<version>/manifest.json¶
Responds 200 with a JSON array describing every file in the requested
version:
[
  {
    "path": "dataset_description.json",
    "size": 480,
    "checksum_algorithm": "git",
    "checksum": "abc123...",
    "url": "https://raw.githubusercontent.com/nemarDatasets/nm000103/v1.0.0/dataset_description.json"
  },
  {
    "path": "sub-01/eeg/sub-01_task-rest_eeg.edf",
    "size": 12345678,
    "checksum_algorithm": "sha256",
    "checksum": "deadbeef...",
    "url": "https://nemar.s3.us-east-2.amazonaws.com/nm000103/objects/SHA256E-s12345678--deadbeef.edf?X-Amz-..."
  }
]
checksum_algorithm is sha256 (default for annex-backed files), md5 when
the dataset uses an MD5E backend, or git for files stored directly in the
git tree (where the checksum is the blob SHA, not a content hash).
URLs are pre-signed for 1 hour. Fetch the manifest immediately before a bulk download to keep the URLs fresh.
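After a bulk download, each file can be checked against its manifest entry. The sketch below assumes that checksum is a full lowercase or uppercase hex digest and that git entries come from a SHA-1 repository, where the blob id is the SHA-1 of a "blob <size>\0" header followed by the content. matches_manifest is a hypothetical name.

```python
import hashlib


def matches_manifest(entry: dict, data: bytes) -> bool:
    """Check downloaded bytes against one manifest.json entry."""
    if len(data) != entry["size"]:
        return False
    algorithm = entry["checksum_algorithm"]
    if algorithm == "sha256":
        digest = hashlib.sha256(data).hexdigest()
    elif algorithm == "md5":
        digest = hashlib.md5(data).hexdigest()
    elif algorithm == "git":
        # Git hashes a "blob <size>\0" header followed by the content.
        # Assumes a SHA-1 repository and a full 40-hex blob id.
        digest = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
    else:
        raise ValueError(f"unknown checksum_algorithm: {algorithm!r}")
    return digest == entry["checksum"].lower()
```

This version holds the whole file in memory; for large recordings a streaming hash over the file on disk would be the better choice.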
GET /<datasetId>/metadata.json¶
Dataset-level neuroschema v0.3.0
dataset document combining the enrichment catalog (authors, MeSH keywords,
license, DOI, etc.), the full version list, and a derived BIDS subject /
session / modality / task / run tree from the latest version's manifest.
Designed for external indexers like
eegdash-viewer that need to
resolve dataset -> subjects/tasks/runs -> files in one fetch.
Wire format mirrors the core schema at
neuroschema/schema/core/dataset.schema.json. NEMAR-specific aggregates
(version list, derived BIDS index, pipeline stage) live under
extensions.nemar per neuroschema/schema/extensions/nemar.schema.json.
{
  "schema_version": "0.3.0",
  "doc_type": "dataset",
  "dataset_id": "nm000103",
  "name": "...",
  "description": "...",
  "source": "nemar",
  "recording_modality": ["EEG"],
  "license": "CC0-1.0",
  "authors": [
    {
      "name": "Doe, Jane",
      "name_type": "Personal",
      "orcid": "https://orcid.org/0000-0001-2345-6789",
      "affiliations": [{ "name": "Acme University", "identifier": "https://ror.org/...", "scheme": "ROR" }]
    }
  ],
  "keywords": [
    { "term": "Electroencephalography", "subject_scheme": "MeSH", "classification_code": "D004569" }
  ],
  "related_identifiers": [...],
  "contributors": [...],
  "dates": [...],
  "rights": [...],
  "funding": [...],
  "tasks": ["rest", "go-nogo"],
  "datatypes": ["eeg"],
  "sessions": ["baseline"],
  "sessions_count": 1,
  "demographics": { "subjects_count": 50, "age_min": 18, "age_max": 65 },
  "data_summary": { "total_files": 1234, "size_bytes": 1234567890, "size_human": "1.15 GB" },
  "provenance": { "latest_snapshot": "v1.0.0", "publish_date": "2025-12-01T10:00:00Z" },
  "external_links": {
    "dataset_doi": "10.82901/NEMAR.nm000103",
    "github_url": "https://github.com/nemarDatasets/nm000103"
  },
  "extensions": {
    "nemar": {
      "versions": [
        {
          "version": "v1.0.0",
          "doi": "10.82901/NEMAR.nm000103.v1.0.0",
          "created_at": "2025-12-01T10:00:00Z",
          "manifest_url": "/nm000103/v1.0.0/manifest.json"
        }
      ],
      "bids_index": {
        "version": "v1.0.0",
        "subjects": {
          "sub-01": {
            "sessions": ["baseline"],
            "modalities": {
              "eeg": { "tasks": { "rest": { "runs": ["01", "02"] } } }
            }
          }
        }
      },
      "pipeline_stage": "validated"
    }
  }
}
Partial payloads, never 500s. When the metadata pipeline hasn't run
yet, enrichment-derived fields (authors, keywords, license,
related_identifiers, etc.) are returned as empty arrays or null. When
no versions are minted yet, extensions.nemar.versions is [] and
bids_index is null. When the latest version's manifest cannot be
fetched, bids_index is null but the catalog and version list still
return normally. Corrupt enrichment_json is logged and treated as
missing.
bids_index reflects only the latest version. A per-version index
endpoint at /<datasetId>/<version>/index.json may follow in a later
phase.
Cache-Control: public, max-age=60.
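An indexer consuming metadata.json has to walk extensions.nemar.bids_index and cope with it being null. The generator below is a sketch based only on the example payload above; iter_recordings is a hypothetical name, and reporting a task without run labels as run=None is an assumption.

```python
def iter_recordings(metadata: dict):
    """Yield (subject, modality, task, run) tuples from extensions.nemar.bids_index."""
    nemar = (metadata.get("extensions") or {}).get("nemar") or {}
    index = nemar.get("bids_index")
    if not index:  # null when no version exists or the manifest fetch failed
        return
    for subject, info in sorted(index.get("subjects", {}).items()):
        for modality, mod in sorted(info.get("modalities", {}).items()):
            for task, task_info in sorted(mod.get("tasks", {}).items()):
                # Assumption: a task without run labels is reported once, run=None.
                for run in task_info.get("runs") or [None]:
                    yield subject, modality, task, run
```

Because a null bids_index simply produces no tuples, callers can treat "not indexed yet" and "empty dataset" the same way without special-casing partial payloads.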
GET /<datasetId> and GET /<datasetId>/¶
Dataset landing page. Content-negotiated: HTML for browsers, JSON for
machine clients (default when no Accept is sent). The query parameter
?format=json or ?format=html overrides the Accept header.
JSON shape:
{
  "dataset_id": "nm000103",
  "latest": "v1.0.0",
  "metadata_url": "/nm000103/metadata.json",
  "versions": [
    {
      "version": "v1.0.0",
      "doi": "10.82901/NEMAR.nm000103.v1.0.0",
      "created_at": "2025-12-01T10:00:00Z",
      "manifest_url": "/nm000103/v1.0.0/manifest.json",
      "browse_url": "/nm000103/v1.0.0/"
    }
  ]
}
Versions are newest-first. latest is null when the dataset row exists
but no version has been minted yet (the page still returns 200, with
a "no published versions yet" notice in the HTML form).
HTML rendering lists every version with its DOI, publication date, browse URL, and manifest URL.
Cache-Control: public, max-age=60.
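The landing-page JSON uses host-relative URLs, so a client has to resolve them against the host it queried. A minimal sketch, assuming the canonical host and the JSON shape shown above; latest_manifest_url is a hypothetical helper that also covers the "latest is null" case.

```python
from urllib.parse import urljoin

BASE = "https://data.nemar.org"


def latest_manifest_url(landing: dict, base: str = BASE):
    """Return the absolute manifest URL of the latest version, or None."""
    latest = landing.get("latest")
    if latest is None:  # dataset exists but nothing has been published yet
        return None
    for version in landing.get("versions", []):
        if version["version"] == latest:
            # manifest_url is host-relative, e.g. "/nm000103/v1.0.0/manifest.json"
            return urljoin(base, version["manifest_url"])
    return None
```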
GET /<datasetId>/<version>¶
308 Permanent Redirect to /<datasetId>/<version>/ so the relative ../
link in the rendered index resolves correctly.
Versioned UX¶
Version picker¶
HTML directory listings include a version picker above the file table when the dataset has more than one published version. Each version is rendered as a sibling link that switches the version segment of the current URL while preserving the sub-path -- so switching versions on a deeply-nested directory lands on the same directory in the chosen version (or a tombstone 404, see below, if the path doesn't exist there).
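The same version switch can be reproduced in a script. The sketch below only illustrates the URL rewrite for the canonical form (/<datasetId>/<version>/<path>); switch_version is a hypothetical name, and whether the path exists in the target version still has to be checked with a request.

```python
from urllib.parse import urlsplit, urlunsplit


def switch_version(url: str, new_version: str) -> str:
    """Replace the version segment of a canonical URL, keeping the sub-path."""
    parts = urlsplit(url)
    segments = parts.path.split("/")  # ["", datasetId, version, *sub_path]
    if len(segments) < 3 or not segments[2]:
        raise ValueError(f"no version segment in: {url!r}")
    segments[2] = new_version
    return urlunsplit(parts._replace(path="/".join(segments)))
```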
File-removed tombstones¶
When a file path 404s but the same path existed in an older published version, the response indicates the last version that contained it.
JSON shape:
{
  "error": "File not found",
  "reason": "removed",
  "last_seen_version": "v1.0.0",
  "last_seen_url": "https://data.nemar.org/nm000103/v1.0.0/sub-99/eeg/sub-99_task-rest_eeg.edf"
}
A 404 without a tombstone (no reason field) means the path was not found
in the requested version or in any of the older versions the tombstone
walk checked.
The tombstone walk is capped at the 10 most recent older versions.
Older versions can still be browsed manually from /<datasetId>/ --
the cap exists so a 404 on a long-removed path never fans out to
dozens of manifest fetches. If a determined client needs deeper
history, fetch /<datasetId>/metadata.json for the full version list
and walk it explicitly.
The HTML form of the same 404 (sent when Accept: text/html) renders
a friendly page with a clickable link to the last-seen URL.
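A client can use the tombstone to fall back to the last version that still had the file. The sketch below only interprets the body of a 404 response and assumes the JSON shape shown above; last_seen_url is a hypothetical helper, and it returns None for HTML bodies or the empty body of a HEAD response.

```python
import json


def last_seen_url(body: str):
    """Return the tombstone's last_seen_url from a 404 JSON body, or None."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # e.g. the HTML form, or an empty HEAD body
    # A plain 404 has no "reason" field; only "removed" marks a tombstone.
    if isinstance(payload, dict) and payload.get("reason") == "removed":
        return payload.get("last_seen_url")
    return None
```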
"Files removed since vN-1" footer¶
Directory index pages compare their listing against the immediately
prior published version. Names that existed in the prior version but
are absent in the current one are rendered in a collapsible
<details> footer with links to the prior version's URL. The
comparison is only against vN-1; older versions are not consulted
(use metadata.json for full history).
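The same comparison can be approximated on the client from two manifests. This is a sketch of the idea, not the server's code: removed_names is a hypothetical helper that takes the path values of the prior and current manifest.json and returns the names directly under one directory that disappeared.

```python
def removed_names(prior_paths, current_paths, directory: str = "") -> list[str]:
    """Names directly under `directory` that exist in the prior version only."""
    prefix = directory.strip("/")
    prefix = prefix + "/" if prefix else ""

    def children(paths):
        # First path segment below the directory: a file or sub-directory name.
        return {p[len(prefix):].split("/", 1)[0] for p in paths if p.startswith(prefix)}

    return sorted(children(prior_paths) - children(current_paths))
```

A sub-directory counts as removed only when no file below it remains in the current version, which matches how a directory is defined for GET (one or more manifest entries under the prefix).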
Response codes¶
| Code | When |
|---|---|
| 200 | Manifest JSON, HTML index, dataset landing page (HTML or JSON), HEAD on existing file or directory |
| 302 | GET on a file path that resolves to backing-store bytes |
| 308 | /<datasetId>/<version> -> /<datasetId>/<version>/ |
| 404 | Dataset not found, private, unpublished, version not minted, file not in manifest, path traversal attempt. Includes reason: "removed" + last_seen_* when the path existed in a recent prior version (GET only; HEAD returns bare 404) |
The route deliberately does not distinguish "not found" from "exists but
private". Private datasets are reached only via the existing
nemar dataset clone / nemar dataset get flow.
MIME types¶
Files are served from S3 under their git-annex content-addressed key
(SHA256E-s12345--...edf), so the S3 object's Content-Type defaults to
application/octet-stream. Browsers will download rather than render. A
future iteration may override response-content-type in the presigned URL
based on the BIDS path's extension; until then, expect generic binary
content-type on file responses.
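Until the server sets a more specific type, a client that needs one can derive it from the BIDS path rather than from the response header. A minimal sketch using Python's standard mimetypes table; guess_content_type is a hypothetical helper, and formats the table does not know (many electrophysiology formats among them) fall back to the generic binary type.

```python
import mimetypes


def guess_content_type(bids_path: str) -> str:
    """Guess a media type from the BIDS path, since S3 reports octet-stream."""
    guessed, _encoding = mimetypes.guess_type(bids_path)
    return guessed or "application/octet-stream"
```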
What this does not cover¶
- Private and unpublished datasets. They stay on git-annex.
- rclone-compatible delta sync. Phase 4 (#498), optional.