User Guide¶

This guide covers everything you need to know about using the weather data retrieval package effectively, from basic concepts to advanced workflows.

Overview¶

The weather_data_retrieval package is designed to handle two main workflows:

Interactive Mode: Step-by-step wizard for exploring options and one-off downloads
Batch/Automatic Mode: JSON configuration files for reproducible, automated downloads

Both modes use the same underlying download engine and produce identical outputs.

Interactive Mode Deep Dive¶

Interactive mode is perfect when you're:

Exploring available datasets and variables
Not sure exactly what parameters you need
Doing a one-time download for analysis
Learning how the tool works

Starting Interactive Mode¶

osme-weather

Or with the shorter alias:

wdr

At any prompt, you can use:

back – Return to the previous question (your answers are preserved)
exit – Quit the wizard (your progress is not saved)
Ctrl+C – Immediately stop the program

Quiet Mode¶

If you want minimal console output (only errors and critical messages):

osme-weather --quiet

Logs are still written to files; you just won't see the prompt flow in your terminal.

Understanding the Wizard Flow¶

The interactive wizard follows this sequence:

Data Provider (CDS; Open-Meteo support is coming soon)
Dataset (ERA5-Land or ERA5)
Authentication (API URL and key)
Date Range (start and end dates)
Geographic Region (bounding box coordinates)
Variables (which weather parameters to download)
File Handling (what to do with existing files)
Parallel Settings (concurrent downloads)
Retry Settings (how to handle failures)
Confirmation (review and approve)

You can navigate backwards through any of these steps if you need to change something.

Smart Validation¶

The wizard validates your inputs in real time:

Dates: Must be valid and end ≥ start
Coordinates: Must be within valid ranges (-90 to 90 for lat, -180 to 180 for lon)
Variables: Checked against dataset-specific availability
API Key: Tested with CDS to ensure it works before proceeding

If something is invalid, you'll get a clear error message and a chance to re-enter the value.
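To make the date and coordinate rules above concrete, here is a minimal sketch of equivalent checks. The function names are hypothetical and not part of the package; the [north, west, south, east] ordering is assumed from the config examples in this guide.

```python
from datetime import date


def validate_date_range(start: str, end: str) -> tuple[date, date]:
    """Parse YYYY-MM-DD strings and require end >= start."""
    start_d = date.fromisoformat(start)  # raises ValueError for invalid dates
    end_d = date.fromisoformat(end)
    if end_d < start_d:
        raise ValueError(f"end date {end} is before start date {start}")
    return start_d, end_d


def validate_region_bounds(bounds: list[float]) -> None:
    """Check a bounding box given as [north, west, south, east] (assumed order)."""
    if len(bounds) != 4:
        raise ValueError("expected four values: [north, west, south, east]")
    north, west, south, east = bounds
    for lat in (north, south):
        if not -90 <= lat <= 90:
            raise ValueError(f"latitude {lat} is outside -90 to 90")
    for lon in (west, east):
        if not -180 <= lon <= 180:
            raise ValueError(f"longitude {lon} is outside -180 to 180")
```

Running checks like these on a config before submitting a long batch job can catch typos early. The variable and API key checks are not sketched here because they depend on CDS itself.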

Batch/Automatic Mode Deep Dive¶

Batch mode is essential for:

Reproducibility: The same config always produces the same download
Automation: Integrate into larger data pipelines
Version Control: Track exactly what data you downloaded (minus the API key)
HPC Workflows: Submit jobs without interactive prompts
Team Collaboration: Share standardized configs across researchers

Basic Usage¶

osme-weather --config path/to/config.json

Verbose Output¶

By default, batch mode only writes to log files. To see progress in your terminal:

osme-weather --config path/to/config.json --verbose

This shows the same output you'd see in interactive mode, but without requiring any input.

Config File Structure¶

Here's a complete, annotated example:

{
  "data_provider": "cds",
  "dataset_short_name": "era5-land",
  "api_url": "https://cds.climate.copernicus.eu/api",
  "api_key": "12345:abcdef123456789",
  "start_date": "2023-01-01",
  "end_date": "2023-12-31",
  "region_bounds": [40.0, -10.0, 35.0, 5.0],
  "variables": [
    "2m_temperature",
    "10m_u_component_of_wind",
    "10m_v_component_of_wind",
    "surface_pressure",
    "total_precipitation"
  ],
  "existing_file_action": "skip_all",
  "retry_settings": {
    "max_retries": 6,
    "retry_delay_sec": 15
  },
  "parallel_settings": {
    "enabled": true,
    "max_concurrent": 2
  }
}

See the Configuration Reference for detailed documentation of every field.

Config File Locations¶

The tool looks for config files in this order:

Absolute paths: /home/user/configs/download.json
Relative to config directory: configs/weather/download.json
(resolved to <repo_root>/configs/weather/download.json)
Relative to current directory: ./my_config.json

We recommend keeping configs in the repository's configs/weather/ directory for organization.
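The lookup order above can be pictured with the following sketch. It is an illustration only: resolve_config_path is a hypothetical helper, and the tool's real resolution logic may differ in detail.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(path_str: str, repo_root: Path, cwd: Path) -> Optional[Path]:
    """Return the first existing candidate, following the documented order."""
    path = Path(path_str)
    if path.is_absolute():
        # Absolute paths are used as given.
        return path if path.is_file() else None
    # Relative paths: try the repository root first, then the current directory.
    for base in (repo_root, cwd):
        candidate = base / path
        if candidate.is_file():
            return candidate
    return None
```

With this order, a file inside the repository takes precedence over a same-named file in your working directory, which is one more reason to keep configs in one place.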

Managing API Keys in Configs¶

Never commit API keys to version control! Here are three safe approaches:

Option 1: Placeholder + Manual Edit (Simple)¶

In your committed config:

{
  "api_key": "YOUR_API_KEY_HERE",
  ...
}

Before running, edit to add your real key (but don't commit the change).

Option 2: Separate Credentials File (Better)¶

download_config.json (committed to repo):

{
  "data_provider": "cds",
  "dataset_short_name": "era5-land",
  "api_url": "https://cds.climate.copernicus.eu/api",
  "api_key": "PLACEHOLDER",
  ...
}

.env or credentials.json (in .gitignore):

{
  "CDS_API_KEY": "12345:abcdef123456789"
}

Then merge them in your own script before running (built-in merging is a future feature).
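Because merging is not built in, you need a small script of your own. The sketch below is one way to do it, assuming credentials.json holds a CDS_API_KEY entry as shown above and that both files are complete, valid JSON. The function name and the output filename are hypothetical.

```python
import json
from pathlib import Path


def merge_api_key(config_path: Path, credentials_path: Path, output_path: Path) -> dict:
    """Write a copy of the committed config with the real API key filled in."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    credentials = json.loads(credentials_path.read_text(encoding="utf-8"))
    # Replace the placeholder; the committed file itself is left untouched.
    config["api_key"] = credentials["CDS_API_KEY"]
    output_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

You could call it as merge_api_key(Path("download_config.json"), Path("credentials.json"), Path("download_config.local.json")) and then run osme-weather --config download_config.local.json. Remember to add the generated file to .gitignore as well, since it contains the real key.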

Option 3: Environment Variables (Best for CI/CD)¶

Set in your shell:

export CDS_API_KEY="12345:abcdef123456789"

Then reference in config (future feature):

{
  "api_key": "${CDS_API_KEY}",
  ...
}

Current Limitation

Environment variable substitution is not yet implemented. For now, use Option 1 or 2.

Provider-Specific Details¶

CDS (Copernicus Climate Data Store)¶

CDS is currently the only supported provider. It provides access to the ERA5 and ERA5-Land datasets.

Getting Access¶

Register at https://cds.climate.copernicus.eu
Accept terms for the datasets you want (ERA5, ERA5-Land)
Get your API key from your profile page

API Rate Limits¶

The CDS API has request limits:

Max concurrent requests: 4 (per user account)
Request queue: If the system is busy, requests wait in a queue
Throttling: Large requests may be slowed or delayed during peak hours

The tool handles these automatically with retries and backoff.

Available Datasets¶

ERA5 (era5-world)

Resolution: 0.25° (~28 km at equator)
Coverage: Global (land and ocean)
Frequency: Hourly
Variables: 100+ atmospheric, ocean, and land variables
Good for: Global analyses, ocean regions, coarse-resolution studies

ERA5-Land (era5-land)

Resolution: 0.1° (~11 km at equator)
Coverage: Land surfaces only
Frequency: Hourly
Variables: 50+ land surface variables
Good for: Regional studies, high-resolution land analysis, hydrology

Variable Names¶

Variable names must exactly match the CDS dataset specifications (case-sensitive).

Common variables:

Variable Name	Description	Units	Datasets
`2m_temperature`	Temperature at 2m above surface	K	Both
`total_precipitation`	Accumulated precipitation	m	Both
`10m_u_component_of_wind`	Eastward wind at 10m	m/s	Both
`10m_v_component_of_wind`	Northward wind at 10m	m/s	Both
`surface_pressure`	Pressure at surface	Pa	Both
`surface_solar_radiation_downwards`	Incoming solar radiation	J/m²	Both
`soil_temperature_level_1`	Top layer soil temperature	K	ERA5-Land
`snow_depth`	Snow depth	m	ERA5-Land

For the complete list, see:

ERA5 variables
ERA5-Land variables

Open-Meteo¶

Coming Soon

Open-Meteo integration is planned but not yet available. Check the project roadmap for updates.

File Management¶

Output Directory Structure¶

Downloaded files are organized by dataset:

<repo_root>/
  └── data/
      ├── era5-land/
      │   └── raw/
      │       ├── era5-land_N40W10S35E5_abc123_2023-01.grib
      │       ├── era5-land_N40W10S35E5_abc123_2023-02.grib
      │       └── ...
      └── era5-world/
          └── raw/
              ├── era5-world_N50W20S30E10_def456_2023-01.grib
              └── ...

The raw/ subdirectory indicates these files haven't been processed yet. Later pipeline stages will create processed/, cleaned/, etc.

Filename Convention¶

Each file follows this pattern:

{dataset}_{coordinates}_{hash}_{year}-{month}.{ext}

Example: era5-land_N40W10S35E5_abc123def456_2023-01.grib

dataset: era5-land or era5-world
coordinates: Encoded bounding box (N=north, W=west, S=south, E=east)
hash: 12-character hash of (dataset + variables + region)
year-month: Data period
ext: File extension (.grib for CDS downloads)

Why the hash? It lets you distinguish between downloads of the same region but different variables (the hashes below are shortened for readability):

era5-land_N40W10S35E5_abc123_2023-01.grib  # temperature only
era5-land_N40W10S35E5_def456_2023-01.grib  # temperature + precipitation
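If you need to work with these filenames in your own scripts, the pattern can be split back into its parts. This is a sketch built only on the pattern documented above; parse_filename is a hypothetical helper, and it assumes that no individual field contains an underscore.

```python
import re

# {dataset}_{coordinates}_{hash}_{year}-{month}.{ext}
FILENAME_PATTERN = re.compile(
    r"^(?P<dataset>[^_]+)_(?P<coordinates>[^_]+)_(?P<hash>[^_]+)"
    r"_(?P<year>\d{4})-(?P<month>\d{2})\.(?P<ext>\w+)$"
)


def parse_filename(name: str) -> dict:
    """Split a download filename into its documented components."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    return match.groupdict()
```

For example, parse_filename("era5-land_N40W10S35E5_abc123def456_2023-01.grib") yields the dataset era5-land, the coordinates N40W10S35E5, the hash abc123def456, the year 2023, the month 01, and the extension grib. All values are returned as strings.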

Existing File Handling¶

The existing_file_action setting controls what happens when a file already exists:

`skip_all` (Recommended)¶

"existing_file_action": "skip_all"

Skips any month that already has a complete file. This is:

Fast: No re-downloads
Safe: Doesn't overwrite good data
Resume-friendly: If a download fails partway through, re-running skips completed months

When to use: Almost always (default for batch mode)

`overwrite`¶

"existing_file_action": "overwrite"

Re-downloads everything, replacing existing files.

When to use:

You know the existing files are corrupted
CDS has updated the dataset and you need the new version
You're testing download settings

Warning

This can waste a lot of time and bandwidth!

`prompt` (Interactive Only)¶

"existing_file_action": "prompt"

Asks you for each file whether to skip or overwrite.

When to use: Interactive mode, when you want control over individual files. This option is not available in batch mode, where it is treated as skip_all.

File Validation¶

After downloading, the tool validates each file:

ZIP files: Automatically extracts .grib files
Magic number check: Verifies file starts with GRIB signature
Size check: Warns if file is suspiciously small (<50 KB)

Corrupted or incomplete downloads are logged as failures and can be retried.
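You can run similar checks yourself on files you already have. The sketch below covers only the signature and size checks (not ZIP extraction). It relies on the fact that GRIB files begin with the four ASCII bytes "GRIB"; check_grib_file is a hypothetical helper, not the package's own validator.

```python
from pathlib import Path

GRIB_MAGIC = b"GRIB"            # GRIB files start with these four bytes
MIN_EXPECTED_BYTES = 50 * 1024  # the 50 KB warning threshold mentioned above


def check_grib_file(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    with path.open("rb") as handle:
        if handle.read(4) != GRIB_MAGIC:
            problems.append("file does not start with the GRIB signature")
    if path.stat().st_size < MIN_EXPECTED_BYTES:
        problems.append("file is smaller than 50 KB")
    return problems
```

Passing both checks does not prove a file is complete, so treat this as a quick sanity check rather than full validation.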

Download Settings¶

Parallel Downloads¶

Download multiple months simultaneously to speed up large requests.

"parallel_settings": {
  "enabled": true,
  "max_concurrent": 2
}

Choosing `max_concurrent`¶

1: Sequential (slowest, but most reliable)
2: Good default (balances speed and stability)
3-4: Faster for large downloads, but risks hitting CDS rate limits
5+: Not recommended (may cause timeouts or account warnings)

Connection Speed Matters

If you have slow internet (<10 Mbps), parallelization won't help much. Stick with 1-2.

Efficiency Factor¶

The tool assumes parallel downloads are about 60% as efficient as perfect linear scaling. For example:

12 files, sequential: 60 minutes
12 files, 2 parallel: ~36 minutes (not 30)
12 files, 4 parallel: ~22 minutes (not 15)

This accounts for:

CDS queue management
Connection overhead
Retry delays
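The exact formula the tool uses is not documented here. One simple model of sub-linear scaling, offered purely as an assumption, is that each additional concurrent download contributes about 60% of a full worker's throughput. The sketch below implements that model; it lands near, but not exactly on, the figures quoted above.

```python
def estimate_parallel_minutes(
    sequential_minutes: float, workers: int, efficiency: float = 0.6
) -> float:
    """Estimate wall-clock time under an assumed sub-linear scaling model."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Assumption: the first worker counts fully, each extra one counts as `efficiency`.
    effective_speedup = 1 + (workers - 1) * efficiency
    return sequential_minutes / effective_speedup
```

Under this model a 60-minute sequential download takes 37.5 minutes with 2 workers and about 21.4 minutes with 4, which shows why doubling max_concurrent does not halve the time.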

Retry Settings¶

Handles temporary failures gracefully.

"retry_settings": {
  "max_retries": 6,
  "retry_delay_sec": 15
}

How Retries Work¶

A download request is sent to CDS
If it fails (timeout, connection error, server error), the tool waits retry_delay_sec seconds
It tries again (up to max_retries times in total)
If all retries are exhausted, the month is marked as failed and the tool continues with the next month
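The steps above amount to a fixed-delay retry loop, which can be sketched as follows. This is an illustration rather than the package's code: download_with_retries is a hypothetical name, it assumes max_retries counts retries after the first attempt, and it treats any OSError (which includes connection errors and timeouts in Python) as a temporary failure.

```python
import time
from typing import Callable


def download_with_retries(
    download: Callable[[], None],
    max_retries: int = 6,
    retry_delay_sec: float = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True on success, False once every attempt has failed."""
    for attempt in range(max_retries + 1):  # one initial attempt plus the retries
        try:
            download()
            return True
        except OSError:
            if attempt == max_retries:
                return False  # caller marks the month as failed and moves on
            sleep(retry_delay_sec)  # fixed wait before the next attempt
    return False
```

The sleep parameter exists only so the loop can be exercised without real waiting. With the defaults shown, a month that never succeeds costs at least 6 × 15 = 90 seconds of waiting before the run moves on.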

Common Failure Reasons¶

CDS queue delays: Request waits too long in queue
Network hiccups: Temporary connection loss
Server overload: CDS is busy during peak hours (usually European daytime)
API errors: Rare CDS internal errors

Most failures are temporary and resolve with retries.

Recommended Settings¶

Scenario	max_retries	retry_delay_sec	Notes
Default	3-6	15-30	Good for most cases
Unreliable connection	10	30	More patient
Fast failure	1-2	10	Fail fast and move on
HPC batch job	10	60	Jobs run unattended, be patient

Speed Estimation¶

Before downloading, the tool:

Tests your internet speed (quick 10-second test)
Estimates file sizes based on variables, region, and resolution
Calculates total time accounting for processing overhead and parallelization

These are estimates and can vary based on:

CDS server load
Network variability
Actual data density (some months may be larger/smaller)

Logging¶

Every run produces detailed logs with timestamps, parameters, and outcomes.

Log File Locations¶

Logs are saved to <repo_root>/logs/weather_data_retrieval/:

logs/
  └── weather_data_retrieval/
      ├── era5-land_N40W10S35E5_abc123_2023-01_2023-12_retrieved-20250210T143022.log
      └── run_automatic_20250210_120045.log

Log File Naming¶

Final logs (after successful completion):

{filename_base}_{start-date}_{end-date}_retrieved-{timestamp}.log

Temporary logs (if run fails early):

run_{mode}_{timestamp}.log

What's Logged¶

Each log file contains:

Configuration summary: All parameters used
Download estimates: Size and time predictions
Progress updates: Each month started/completed
Validation results: File checks, size verifications
Errors and retries: Failed attempts and retry counts
Final summary: Success/skip/fail counts

Log Levels¶

DEBUG: Everything, including prompts and internal state (file only)
INFO: Normal progress updates
WARNING: Issues that didn't stop the download (e.g., retries)
ERROR: Failed downloads or critical issues

Example Log Excerpt¶

2025-02-10 14:30:15 | INFO | Starting AUTOMATIC run
2025-02-10 14:30:15 | INFO | Configuration validation successful
2025-02-10 14:30:16 | INFO | Detected speed: 45.3 Mbps
2025-02-10 14:30:16 | INFO | Estimated total size: 1,234.5 MB
2025-02-10 14:30:16 | INFO | Estimated total time: 22m 15s
2025-02-10 14:30:16 | INFO | Beginning download process
2025-02-10 14:30:45 | INFO | Download completed: 2023-01 (42.3 MB)
2025-02-10 14:31:12 | WARNING | Download failed: 2023-02 (Connection timeout)
2025-02-10 14:31:27 | INFO | Retry 1/6: 2023-02
2025-02-10 14:31:58 | INFO | Download completed: 2023-02 (41.8 MB)
...
2025-02-10 14:52:03 | INFO | Download process completed
2025-02-10 14:52:03 | INFO | Successful: 12, Skipped: 0, Failed: 0
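Because each entry follows a "timestamp | LEVEL | message" layout, logs like the excerpt above are easy to summarize in a script. Here is a minimal sketch, assuming that layout holds for every entry; count_log_levels is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def count_log_levels(lines: Iterable[str]) -> Counter:
    """Count entries per level in 'timestamp | LEVEL | message' lines."""
    counts = Counter()
    for line in lines:
        # Split into at most three parts so a '|' inside the message is kept.
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) == 3:  # ignore lines that do not match the layout
            counts[parts[1]] += 1
    return counts
```

A quick look at the WARNING and ERROR counts after an unattended run tells you whether it is worth reading the full log.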

HPC Workflows¶

The package includes shell scripts optimized for HPC job submission.

Using the Provided Scripts¶

Located in the package directory:

# For ERA5-Land downloads
bash packages/weather_data_retrieval/main_land.sh

# For ERA5 (world) downloads
bash packages/weather_data_retrieval/main_world.sh

These scripts:

Load the conda environment
Set up proper paths
Run batch mode with a predefined config
Can be submitted to Slurm, PBS, or other job schedulers

Example SLURM Submission¶

Create a file submit_weather_download.slurm:

#!/bin/bash
#SBATCH --job-name=weather_download
#SBATCH --output=logs/weather_%j.out
#SBATCH --error=logs/weather_%j.err
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# Load environment
module load anaconda3
source activate osme

# Run download
osme-weather --config configs/weather/era5_india_2023.json --verbose

Submit with:

sbatch submit_weather_download.slurm

HPC Best Practices¶

Use batch mode: No prompts means it can run unattended
Set high retry counts: Jobs may run during off-peak hours with better CDS availability
Enable verbose logging: Helpful for debugging from log files later
Request adequate time: Better to overestimate than have jobs killed
Use parallelization: HPC nodes often have good network connections
Store in scratch space: Large downloads should go to fast scratch storage, not home directories

Monitoring Long Jobs¶

Check progress by tailing the log file:

tail -f logs/weather_data_retrieval/era5-land_*.log

Or check the SLURM output:

tail -f logs/weather_12345.out

Advanced Patterns¶

Multi-Region Downloads¶

Download the same time period for multiple regions by using separate configs:

# Europe
osme-weather --config configs/weather/era5_europe_2023.json

# Asia
osme-weather --config configs/weather/era5_asia_2023.json

# Americas
osme-weather --config configs/weather/era5_americas_2023.json

Multi-Year Pipelines¶

For very long time series, break into chunks:

for year in {2010..2023}; do
  echo "Downloading $year..."
  osme-weather --config configs/weather/era5_${year}.json
done

Each config differs only in start/end dates:

{
  "start_date": "2010-01-01",
  "end_date": "2010-12-31",
  ...
}
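Rather than writing each yearly config by hand, you can generate them from one base config. The sketch below assumes the era5_{year}.json naming used in the loop above; write_yearly_configs is a hypothetical helper, and the base config is whatever complete config you already use.

```python
import json
from pathlib import Path


def write_yearly_configs(
    base_config: dict, first_year: int, last_year: int, out_dir: Path
) -> list[Path]:
    """Write one config per year, changing only start_date and end_date."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for year in range(first_year, last_year + 1):
        # Copy the base config so the original dictionary is not modified.
        config = dict(base_config, start_date=f"{year}-01-01", end_date=f"{year}-12-31")
        path = out_dir / f"era5_{year}.json"
        path.write_text(json.dumps(config, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

Generating the files this way keeps every other setting identical across years. If the base config contains a real API key, keep the generated files out of version control.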

Resuming Failed Downloads¶

If a download fails partway through:

Check the log to see which months succeeded
Re-run with the same config – skip_all will skip completed months
Only failed months will be re-attempted

No need to manually track what's missing!

Testing Configurations¶

Before downloading years of data, test your config with a small date range:

{
  "start_date": "2023-01-01",
  "end_date": "2023-01-31",
  ...
}

Once you verify it works:

{
  "start_date": "2020-01-01",
  "end_date": "2023-12-31",
  ...
}

Common Workflows¶

Workflow 1: Exploratory Research¶

Goal: Download a small sample to test analysis code

Use interactive mode to explore variables
Download 1-3 months of data
Develop your analysis pipeline
When ready, create a batch config for full download
Run the full download overnight or on HPC

Workflow 2: Production Pipeline¶

Goal: Reproducible data acquisition for published research

Create version-controlled configs in configs/weather/
Use batch mode for all downloads
Document configs in your README (which region, which dates, why)
Archive configs with your published research for reproducibility
Rerun identical configs if you need to verify or extend results

Workflow 3: Multi-Country Comparison¶

Goal: Compare marginal emissions across countries

Create one config per country with appropriate region bounds
Use identical variables and date ranges for fair comparison
Name configs clearly: era5_india.json, era5_spain.json, etc.
Run downloads in parallel (different terminal sessions or HPC jobs)
Use skip_all so you can safely rerun without duplication

Tips & Tricks¶

🚀 Speed Up Downloads¶

Use ERA5 (0.25°) instead of ERA5-Land (0.1°) if resolution doesn't matter
Smaller regions = smaller files = faster downloads
Fewer variables = smaller files (do you really need all 20 variables?)
Enable parallelization with max_concurrent: 2-3
Download during off-peak hours (nights/weekends in Europe)

💾 Save Disk Space¶

Compress old files: GRIB files compress well with gzip
Delete unnecessary intermediate files after processing
Use external storage for archives (not actively used data)
Sample by month: Do you need every month? Maybe every 3rd month is enough for validation?

🔧 Troubleshooting Downloads¶

Check CDS status: https://cds.climate.copernicus.eu shows system status
Retry at a different time: CDS is often faster at night (European time)
Simplify request: Fewer variables or smaller region may avoid timeouts
Check your API key: Expired keys cause authentication failures

📋 Organizing Configs¶

Suggested structure:

configs/
  └── weather/
      ├── production/
      │   ├── india_2020-2023.json
      │   └── global_2022.json
      ├── testing/
      │   └── quick_test.json
      └── archive/
          └── old_india_2019.json

Use a template pattern:

Template (configs/weather/template_country.json):

{
  "data_provider": "cds",
  "dataset_short_name": "era5-land",
  "api_key": "YOUR_API_KEY_HERE",
  "start_date": "YYYY-MM-DD",
  "end_date": "YYYY-MM-DD",
  "region_bounds": [N, W, S, E],
  "variables": ["2m_temperature", "total_precipitation"],
  "existing_file_action": "skip_all",
  "retry_settings": {"max_retries": 6, "retry_delay_sec": 15},
  "parallel_settings": {"enabled": true, "max_concurrent": 2}
}

Team members copy and fill in their values.

What's Next?¶

Dive deeper into configs: See Configuration Reference for every parameter
Having issues?: Check Troubleshooting for solutions
Understand the code: Browse the API Reference
Continue the pipeline: Move to grid_data_retrieval