NEMAR Disaster Recovery Documentation
This directory contains comprehensive disaster recovery procedures for NEMAR dataset restoration.
📚 Documentation
🚨 EMERGENCY RESPONSE GUIDE
Use this first in an emergency!
- 8-step emergency procedure (< 2 hour recovery)
- Quick reference cards
- Essential credentials and contacts
- Troubleshooting guide
- Backend fail-safe specifications
Target Audience: nemarRestore operator, Emergency responder
Complete Technical Documentation
Detailed technical guide covering:
- Restoration architecture
- Step-by-step procedures with verification
- Git-annex and DataLad integration
- End-user verification tests
- Technical deep-dives
Target Audience: Developers, Technical operators
User Roles and Responsibilities
Defines the NEMAR user account structure:
- Owner ([email protected]) - Super user, policy decisions
- nemarAdmin ([email protected]) - Day-to-day operations
- nemarRestore ([email protected]) - Disaster recovery service account
Target Audience: Administrators, New team members
🛠️ Scripts
Located in /scripts/:
nemar-restore-dataset.sh
Production-ready restoration script for individual datasets.
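The script takes five positional arguments: dataset ID, version, name, Zenodo DOI, and DataLad ID. A minimal sketch of a sanity check that could run before the script is invoked is given here. The function names `matches` and `validate_restore_args` are hypothetical, and every pattern is an assumption inferred from the documented example call, not a rule taken from the script itself.

```shell
#!/usr/bin/env bash
# Hypothetical argument pre-check for nemar-restore-dataset.sh.
# All patterns are assumptions inferred from the documented example call.

# Return success when the value ($1) matches the extended regex ($2).
matches() {
  printf '%s\n' "$1" | grep -Eq "$2"
}

# Check the five positional arguments; print the first problem to stderr.
validate_restore_args() {
  if [ "$#" -ne 5 ]; then
    echo "expected 5 arguments, got $#" >&2
    return 1
  fi
  # Assumed dataset ID shape: "nm" followed by six digits (e.g. nm000105)
  matches "$1" '^nm[0-9]{6}$' || { echo "bad dataset_id: $1" >&2; return 1; }
  # Assumed version shape: v<major>.<minor>.<patch> (e.g. v1.1.0)
  matches "$2" '^v[0-9]+\.[0-9]+\.[0-9]+$' || { echo "bad version: $2" >&2; return 1; }
  # The name only needs to be non-empty
  [ -n "$3" ] || { echo "name must not be empty" >&2; return 1; }
  # Assumed DOI shape: 10.<registrant>/zenodo.<record number>
  matches "$4" '^10\.[0-9]+/zenodo\.[0-9]+$' || { echo "bad zenodo_doi: $4" >&2; return 1; }
  # Assumed DataLad ID shape: lowercase UUID
  matches "$5" '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$' || { echo "bad datalad_id: $5" >&2; return 1; }
}
```

Calling `validate_restore_args "$@"` at the top of a wrapper and aborting on a non-zero status would catch a mistyped ID before any repository is touched.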
Usage:
    export AWS_ACCESS_KEY_ID="<key>"
    export AWS_SECRET_ACCESS_KEY="<secret>"
    ./scripts/nemar-restore-dataset.sh \
      <dataset_id> \
      <version> \
      <name> \
      <zenodo_doi> \
      <datalad_id>
Example:
    ./scripts/nemar-restore-dataset.sh \
      nm000105 \
      v1.1.0 \
      "discrete_gestures" \
      10.5281/zenodo.17613958 \
      f9028a54-3d7e-4af0-994f-19dc40de6a0a
restore_database_entries.sql
SQL script to restore database entries after GitHub restoration.
Usage:
    wrangler d1 execute nemar-db --remote --file=scripts/restore_database_entries.sql
🚨 Emergency Quick Start
IF DATASETS ARE ACCIDENTALLY DELETED:
- Stay calm - S3 data is likely intact
- Open DISASTER_RECOVERY.md
- Follow STEPS 1-8 in order (don’t read the whole doc first)
- Target recovery time: < 2 hours
Emergency Contact: [email protected]
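Before starting the steps, it can save time to confirm that the shell is ready for recovery. The following is a minimal pre-flight sketch, not part of the shipped scripts: the function name `preflight` is hypothetical, and the tool list is an assumption based on the tools this README mentions (git, git-annex, DataLad, wrangler). It reports only which variables are missing and never prints their values.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: report missing credentials and tools before recovery starts.
# The tool list is an assumption based on the tools this README mentions.

REQUIRED_VARS="AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY"
REQUIRED_TOOLS="${REQUIRED_TOOLS:-git git-annex datalad wrangler}"

preflight() {
  problems=0
  for var in $REQUIRED_VARS; do
    # Indirect lookup of the variable whose name is stored in $var
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $var" >&2
      problems=$((problems + 1))
    fi
  done
  for tool in $REQUIRED_TOOLS; do
    # command -v succeeds only when the tool is found on PATH
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool on PATH: $tool" >&2
      problems=$((problems + 1))
    fi
  done
  # Succeed only when nothing was reported
  [ "$problems" -eq 0 ]
}
```

Running `preflight || exit 1` at the start of a recovery session stops early instead of failing partway through a restoration.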
📖 Background
This disaster recovery system was developed in response to a real incident on 2026-01-18, when datasets nm000103-nm000107 were accidentally deleted during test dataset cleanup.
What Happened
- 5 production datasets accidentally deleted from GitHub and database
- S3 data remained intact (7,976 files)
- All datasets had Zenodo preservation archives
Recovery Process
- Retrieved datasets from Zenodo archives
- Restored GitHub repositories with git-annex configuration
- Restored database entries
- Total recovery time: 90 minutes (target: < 2 hours)
- Data loss: None
Lessons Learned
- Zenodo archives are critical for disaster recovery
- S3 separation protects data layer
- Git-annex configuration requires careful setup
- Backend fail-safes needed to prevent deletion
- Clear procedures enable fast recovery
🔄 Maintenance
Quarterly Recovery Drill
Test the recovery procedure every 3 months:
- Create test dataset (nm999999)
- “Accidentally” delete it
- Restore from Zenodo archive
- Verify end-to-end functionality
- Document timing and issues
- Update procedures based on learnings
Last drill: 2026-01-18 (production incident)
Next drill: 2026-04-18
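To make the "document timing" step of a drill repeatable, the elapsed time can be compared against the recovery target automatically. This is a small sketch under stated assumptions: the function names `elapsed_minutes` and `drill_report` are hypothetical, timestamps are Unix epoch seconds, and the 120-minute threshold encodes the "< 2 hours" target from this README.

```shell
#!/usr/bin/env bash
# Drill timing sketch: compare elapsed recovery time with the target.
# TARGET_MINUTES encodes the "< 2 hours" target stated in this README.

TARGET_MINUTES=120

# Print whole minutes between two epoch timestamps: start ($1) and end ($2).
elapsed_minutes() {
  echo $(( ($2 - $1) / 60 ))
}

# Print PASS or FAIL for a drill; return non-zero when the target is missed.
drill_report() {
  minutes=$(elapsed_minutes "$1" "$2")
  if [ "$minutes" -lt "$TARGET_MINUTES" ]; then
    echo "PASS: ${minutes} min (target < ${TARGET_MINUTES} min)"
  else
    echo "FAIL: ${minutes} min (target < ${TARGET_MINUTES} min)"
    return 1
  fi
}
```

In practice, `start=$(date +%s)` would be captured when the drill begins and `drill_report "$start" "$(date +%s)"` called once end-to-end verification finishes. A 90-minute recovery, as in the 2026-01-18 incident, reports PASS.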
🔗 Related Issues
- Issue #37 - Dataset restoration incident and procedures
- Issue #35 - Backend fail-safes for dataset deletion
- Issue #34 - Add --yes flags for non-interactive mode
📞 Contacts
| Role | Email | Purpose |
|---|---|---|
| Owner | [email protected] | Emergency decisions, S3 data issues |
| nemarAdmin | [email protected] | Day-to-day operations, user management |
| nemarRestore | [email protected] | Service account for git commits |
📝 Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-01-18 | Initial disaster recovery system based on real incident |
This documentation may save your datasets. Keep it updated.