API Reference

This page documents the internal structure of the weather_data_retrieval package. This is useful if you want to understand how the code works, extend it, or integrate it into other tools.

For End Users

If you're just using the package to download data, you probably don't need this page. See the User Guide or Quickstart instead.

Package Structure

The package is organized into several modules:

weather_data_retrieval/
├── main.py                 # Entry point for CLI
├── runner.py               # Core orchestration logic
├── io/                     # Input/output handling
│   ├── cli.py             # Command-line interface
│   ├── prompts.py         # Interactive wizard prompts
│   └── config_loader.py   # JSON config loading
├── sources/               # Data provider implementations
│   ├── cds_era5.py       # CDS/ERA5 downloader
│   └── open_meteo.py     # Open-Meteo (planned)
└── utils/                 # Shared utilities
    ├── session_management.py  # Session state tracking
    ├── data_validation.py     # Input validation
    ├── file_management.py     # File naming and checking
    └── logging.py             # Logging configuration

Core Modules

Entry Point & Orchestration

main.py

The main entry point for the CLI. This is what gets called when you run osme-weather.

weather_data_retrieval.main

Main entry point for the Weather Data Retrieval CLI.

This script can be run either:

  • Automatically via a configuration file (--config path/to/config.json), or
  • Interactively through a guided prompt wizard.

It handles session management, logging setup, and orchestration of the retrieval workflow defined in weather_data_retrieval.runner.

Typically invoked through the CLI command: osme-weather or wdr.

main
main()

Entry point for the Weather Data Retrieval CLI.

This function:

  1. Parses CLI arguments or launches the interactive prompt wizard.
  2. Loads or builds a configuration file for weather dataset download.
  3. Initializes logging via osme_common.paths.data_dir().
  4. Executes the main retrieval workflow using weather_data_retrieval.runner.run.

Automatically invoked by the osme-weather CLI script.

Key Function: main()

  • Parses command-line arguments
  • Determines run mode (interactive vs. batch)
  • Sets up logging
  • Delegates to either the interactive wizard or batch runner

Usage:

# Called automatically by CLI
# osme-weather [--config FILE] [--verbose] [--quiet]


runner.py

Core orchestration logic for the download workflow. This handles validation, estimation, and download coordination.

weather_data_retrieval.runner

run
run(
    config,
    run_mode="interactive",
    verbose=True,
    logger=None,
)

Unified orchestration entry point for both interactive and automatic runs. Handles validation, logging, estimation, and download orchestration.

Returns: 0=success, 1=fatal error, 2=some downloads failed.

Parameters:

Name Type Description Default
config dict

Configuration dictionary with all required parameters.

required
run_mode str

Run mode, either 'interactive' or 'automatic', by default "interactive".

'interactive'
verbose bool

Whether to echo progress to the console, by default True.

True
logger Logger

Pre-configured logger instance, by default None.

None

Returns:

Type Description
int

Exit code: 0=success, 1=fatal error, 2=some downloads failed.

run_batch_from_config
run_batch_from_config(cfg_path, logger=None)

Run automatic batch from a config file.

Parameters:

Name Type Description Default
cfg_path str

Path to the JSON configuration file.

required
logger Logger

Pre-configured logger instance, by default None.

None

Returns:

Type Description
int

Exit code: 0=success, 1=fatal error, 2=some downloads failed.

Key Functions:

run(config, run_mode, verbose, logger)

Main workflow orchestrator that:

  1. Validates the configuration
  2. Maps config to session state
  3. Performs internet speed test
  4. Estimates download size and time
  5. Generates filename hash
  6. Orchestrates downloads
  7. Reports final statistics

Parameters:

  • config (dict): Configuration dictionary
  • run_mode (str): "interactive" or "automatic"
  • verbose (bool): Whether to echo progress to console
  • logger (logging.Logger): Logger instance

Returns:

  • int: Exit code (0=success, 1=fatal error, 2=some failures)
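The exit-code convention can be illustrated with a minimal sketch. The helper name summarize_exit_code and its arguments are hypothetical and not part of the package; only the meanings of 0, 1, and 2 come from this reference.

```python
def summarize_exit_code(failed_downloads, fatal_error=False):
    """Map a run outcome to the documented exit codes (hypothetical helper)."""
    if fatal_error:
        return 1  # fatal error: the run could not proceed
    if failed_downloads:
        return 2  # the run finished, but some downloads failed
    return 0  # no failures


# A caller such as a shell wrapper could branch on the result.
code = summarize_exit_code(failed_downloads=[(2020, 1)])
```

Returning an integer rather than raising keeps the function easy to pass straight to sys.exit() in a CLI entry point.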

run_batch_from_config(cfg_path, logger)

Convenience wrapper for batch mode. Loads config and calls run().


Input/Output Modules

io/cli.py

Command-line interface and interactive wizard.

weather_data_retrieval.io.cli

parse_args
parse_args()

Parse command-line arguments.

Parameters:

Name Type Description Default
None
required

Returns:

Type Description
Namespace

Parsed arguments.

run_prompt_wizard
run_prompt_wizard(session, logger=None)

Drives the interactive prompt flow (no config-source step). Returns True if all fields completed; False if user exits.

Parameters:

Name Type Description Default
session SessionState

The session state to populate.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
bool

True if completed; False if exited early.

Key Functions:

parse_args()

Parses command-line arguments using argparse.

Returns: argparse.Namespace with fields:

  • config: Path to config file (if --config provided)
  • verbose: Boolean for verbose output (batch mode)
  • quiet: Boolean for quiet mode (interactive mode)
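A parser exposing these three fields might look like the sketch below. The function name parse_args_sketch, the help strings, the use of store_true flags, and the rule for deriving the run mode from --config are assumptions made for illustration; the real parser may differ.

```python
import argparse


def parse_args_sketch(argv=None):
    """Build a parser with the three documented options (flag details assumed)."""
    parser = argparse.ArgumentParser(prog="osme-weather")
    parser.add_argument("--config", default=None, help="Path to a JSON config file")
    parser.add_argument("--verbose", action="store_true", help="Verbose output (batch mode)")
    parser.add_argument("--quiet", action="store_true", help="Quiet mode (interactive mode)")
    return parser.parse_args(argv)


args = parse_args_sketch(["--config", "run.json", "--verbose"])
# Assumed rule: a config file implies a batch run, otherwise the wizard starts.
run_mode = "automatic" if args.config else "interactive"
```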

run_prompt_wizard(session, logger)

Drives the interactive prompt flow. Steps through each configuration parameter, validates inputs, and handles back/exit commands.

Parameters:

  • session (SessionState): Session state to populate
  • logger (logging.Logger): Logger instance

Returns:

  • bool: True if completed successfully, False if user exited early


io/prompts.py

Individual prompt functions for each configuration step.

weather_data_retrieval.io.prompts

read_input
read_input(prompt, *, logger=None)

Centralized input handler with built-in 'exit' and 'back' controls.

Parameters:

  • prompt (str): The prompt to display to the user.
  • logger (logging.Logger, optional): Logger to log the prompt message.

Returns:

  • str: The user input, or special command indicators.

say
say(text, *, logger=None)

Centralized output handler to log and print messages.

Parameters:

  • text (str): The message to display.
  • logger (logging.Logger, optional): Logger to log the message.

Returns:

None

prompt_data_provider
prompt_data_provider(session, *, logger=None)

Prompt user for which data provider to use (CDS or Open-Meteo).

Parameters:

Name Type Description Default
session SessionState

Current session state to store selected data provider.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
str

Normalized provider name ("cds" or "open-meteo"), or special control token "BACK" or "EXIT".

prompt_dataset_short_name
prompt_dataset_short_name(
    session, provider, *, logger=None
)

Prompt for dataset choice.

Parameters:

Name Type Description Default
session SessionState

Current session state to store selected dataset.

required
provider str

Data provider name.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
str

Normalized dataset name or 'exit' / 'back'.

prompt_cds_url
prompt_cds_url(
    session,
    api_url_default="https://cds.climate.copernicus.eu/api",
    *,
    logger=None
)

Prompt for CDS API URL.

Parameters:

Name Type Description Default
session SessionState

Current session state to store API URL.

required
api_url_default str

Default CDS API URL.

'https://cds.climate.copernicus.eu/api'

Returns:

Type Description
str

CDS API URL or 'exit' / 'back'.

prompt_cds_api_key
prompt_cds_api_key(session, *, logger=None)

Prompt only for the CDS API key (hidden input).

Parameters:

Name Type Description Default
session SessionState

Current session state to store API key.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
str

CDS API key or 'exit' / 'back'.

prompt_save_directory
prompt_save_directory(session, default_dir, *, logger=None)

Ask for save directory, create if necessary.

Parameters:

Name Type Description Default
session SessionState

Current session state to store save directory.

required
default_dir Path

Default directory to suggest.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
Path | str

Path to save directory, or control token "BACK" / "EXIT".

prompt_date_range
prompt_date_range(session, *, logger=None)

Ask user for start and end date, with validation. Accepts formats YYYY-MM-DD or YYYY-MM:

  • Start dates without a day default to the first day of the month (YYYY-MM-01)
  • End dates without a day default to the last day of the month (YYYY-MM-[last day])

Parameters:

Name Type Description Default
session SessionState

Current session state to store date range.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
tuple[str, str]

(start_date_str, end_date_str) in ISO format (YYYY-MM-DD), or ("EXIT", "EXIT") / ("BACK", "BACK")

prompt_coordinates
prompt_coordinates(session, *, logger=None)

Prompt user for geographic boundaries (N, S, W, E) with validation.

Parameters:

Name Type Description Default
session SessionState

Current session state to store geographic boundaries.

required

Returns:

Type Description
list[float]

[north, west, south, east] boundaries or special tokens "EXIT" / "BACK".

prompt_variables
prompt_variables(
    session,
    variable_restrictions_list,
    *args,
    restriction_allow=False,
    logger=None
)

Ask for variables to download, validate each against allowed/disallowed list, and only update session if the full set is valid.

Parameters:

Name Type Description Default
session SessionState

Current session state to store selected variables.

required
variable_restrictions_list list[str]

List of variables that are either allowed or disallowed.

required
restriction_allow bool

If True, variable_restrictions_list is an allowlist (i.e. in). If False, it's a denylist (i.e. not in)

False
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
list[str] | str

List of selected variable names, or control token "BACK" / "EXIT".

prompt_skip_overwrite_files
prompt_skip_overwrite_files(session, *, logger=None)

Prompt user to choose skip/overwrite/case-by-case for existing files.

Parameters:

Name Type Description Default
session SessionState

Session state to store user choice.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
str

One of "overwrite_all", "skip_all", "case_by_case"

prompt_parallelisation_settings
prompt_parallelisation_settings(session, *, logger=None)

Ask user about parallel downloads and concurrency cap.

Parameters:

Name Type Description Default
session SessionState

Current session state to store parallelisation settings.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
dict | str

Dictionary with parallelisation settings, or control token "BACK" / "EXIT".

prompt_retry_settings
prompt_retry_settings(
    session,
    default_retries=6,
    default_delay=15,
    *,
    logger=None
)

Ask user for retry limits.

Parameters:

Name Type Description Default
session SessionState

Current session state to store retry settings.

required
default_retries int

Default number of retry attempts (default = 6).

6
default_delay int

Default delay (in seconds) between retries (default = 15).

15
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
dict | str

Dictionary with 'max_retries' and 'retry_delay_sec', or control token "BACK" / "EXIT".

prompt_continue_confirmation
prompt_continue_confirmation(session, *, logger=None)

Display a formatted download summary and confirm before starting downloads.

Parameters:

Name Type Description Default
session SessionState

session state to summarise.

required
logger Logger

Logger for logging messages, by default None.

None

Returns:

Type Description
bool | str

True if user confirms, False if user declines, or control token "BACK" / "EXIT".

Key Functions:

Each prompt function handles user input for a specific configuration parameter:

  • prompt_data_provider() - Choose CDS or Open-Meteo
  • prompt_dataset_short_name() - Choose ERA5-Land or ERA5
  • prompt_cds_url() - CDS API URL
  • prompt_cds_api_key() - CDS API key
  • prompt_save_directory() - Save directory
  • prompt_date_range() - Start and end dates
  • prompt_coordinates() - Region bounds [N, W, S, E]
  • prompt_variables() - Variable selection
  • prompt_skip_overwrite_files() - Existing file handling
  • prompt_parallelisation_settings() - Parallel download settings
  • prompt_retry_settings() - Retry configuration
  • prompt_continue_confirmation() - Final review and confirmation

All prompt functions follow the same pattern:

Parameters:

  • session (SessionState): Current session state
  • logger (logging.Logger, optional): Logger instance, passed as a keyword argument

Returns:

  • The validated input value, or
  • "__EXIT__" if user wants to quit, or
  • "__BACK__" if user wants to go to previous prompt
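The control-token pattern can be sketched as follows. This is an illustration under assumptions, not the package's implementation: read_input_sketch and run_steps are hypothetical names, the input_fn parameter exists only to make the example testable, and the token strings are the ones listed above.

```python
EXIT_TOKEN = "__EXIT__"  # token names taken from the list above
BACK_TOKEN = "__BACK__"


def read_input_sketch(prompt, input_fn=input):
    """Return the stripped user input, or a control token for 'exit'/'back'."""
    raw = input_fn(prompt).strip()
    lowered = raw.lower()
    if lowered == "exit":
        return EXIT_TOKEN
    if lowered == "back":
        return BACK_TOKEN
    return raw


def run_steps(steps, input_fn=input):
    """Walk through prompts in order; 'back' revisits the previous step."""
    answers = {}
    index = 0
    while index < len(steps):
        value = read_input_sketch(f"{steps[index]}: ", input_fn)
        if value == EXIT_TOKEN:
            return None  # user quit before finishing
        if value == BACK_TOKEN:
            index = max(index - 1, 0)
            continue
        answers[steps[index]] = value
        index += 1
    return answers
```

Routing every prompt through one input handler means the exit and back commands behave the same at every step of the wizard.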


io/config_loader.py

JSON configuration file loading and validation.

weather_data_retrieval.io.config_loader

load_and_validate_config
load_and_validate_config(
    path, *, logger=None, run_mode="automatic"
)

Load JSON config and validate it using the centralized validator. This lets the validator log coercions/warnings (e.g., case_by_case → skip_all).

Parameters:

Name Type Description Default
path str

Path to JSON config file.

required
logger Logger

Logger instance for validation messages, by default None.

None
run_mode str

Run mode, either 'interactive' or 'automatic', by default "automatic".

'automatic'

Returns:

Type Description
dict

Validated configuration dictionary.

load_config
load_config(file_path)

Load configuration from a JSON requirements file (without validation).

Parameters:

Name Type Description Default
file_path str

Path to JSON config file.

required

Returns:

Type Description
dict

Configuration dictionary.

Key Functions:

load_and_validate_config(path, logger, run_mode)

Loads a JSON config file and validates it.

Parameters:

  • path (str): Path to JSON config file
  • logger (logging.Logger, optional): Logger for validation messages
  • run_mode (str): "interactive" or "automatic"

Returns:

  • dict: Validated configuration

Raises:

  • FileNotFoundError: If config file doesn't exist
  • ValueError: If config is invalid
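The load-then-validate behaviour, including both exceptions, can be sketched with the standard library alone. The function name load_and_check and the REQUIRED_KEYS subset are hypothetical; the real validator checks far more than key presence.

```python
import json
from pathlib import Path

# Illustrative subset only; the package validates many more fields.
REQUIRED_KEYS = ("data_provider", "dataset_short_name")


def load_and_check(path):
    """Load a JSON config and raise on a missing file or missing keys."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Config is missing required keys: {missing}")
    return config
```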

load_config(file_path)

Simple config loader without validation (for internal use).


Data Provider Modules

sources/cds_era5.py

CDS/ERA5 data provider implementation.

weather_data_retrieval.sources.cds_era5

prepare_cds_download
prepare_cds_download(
    session,
    filename_base,
    year,
    month,
    save_dir,
    *,
    logger,
    echo_console,
    allow_prompts,
    dataset_config_mapping=CDS_DATASETS
)

Check if a monthly ERA5 file already exists and decide whether to download.

Parameters:

Name Type Description Default
session SessionState

Session containing user configuration.

required
filename_base str

Base name for the file.

required
year int

Year of the data to download.

required
month int

Month of the data to download.

required
save_dir Path

Directory to save downloaded files.

required
logger Logger

Logger for logging messages.

required
echo_console bool

Whether to echo prompts to console.

required
allow_prompts bool

Whether to allow interactive prompts.

required
dataset_config_mapping dict

Mapping of dataset short names to their configurations.

CDS_DATASETS

Returns:

Name Type Description
tuple (download: bool, save_path: str)

download: Whether to perform the download. save_path: Full path for the target file.

execute_cds_download
execute_cds_download(
    session,
    save_path,
    year,
    month,
    *,
    logger,
    echo_console,
    dataset_config_mapping=CDS_DATASETS
)

Execute a single ERA5 monthly download with retry logic.

Parameters:

Name Type Description Default
session SessionState

Session state containing the authenticated CDS API client.

required
save_path str

Full path to save the downloaded file.

required
year int

Year of the data to download.

required
month int

Month of the data to download.

required
logger Logger

Logger for logging messages.

required
echo_console bool

Whether to echo prompts to console.

required
dataset_config_mapping dict

Mapping of dataset short names to their configurations.

CDS_DATASETS

Returns:

Type Description
(year, month, status): tuple

status = "success" | "failed"

download_cds_month
download_cds_month(
    session,
    filename_base,
    year,
    month,
    save_dir,
    *,
    logger,
    echo_console,
    allow_prompts,
    successful_downloads,
    failed_downloads,
    skipped_downloads
)

Orchestrate ERA5 monthly download: handle file checks, then execute download.

Parameters:

Name Type Description Default
Combines
required

Returns:

Type Description
(year, month, status): tuple

status = "success" | "failed" | "skipped"

plan_cds_months
plan_cds_months(
    session,
    filename_base,
    save_dir,
    *,
    logger,
    echo_console,
    allow_prompts
)

Build the list of months to download and which are being skipped due to existing files.

Parameters:

Name Type Description Default
session SessionState

Session containing user configuration.

required
filename_base str

Base filename (without date or extension).

required
save_dir Path

Directory to save downloaded files.

required
logger Logger

Logger for logging messages.

required
echo_console bool

Whether to echo prompts to console.

required
allow_prompts bool

Whether to allow interactive prompts.

required

Returns:

Type Description
(months_to_download, months_skipped)
  • months_to_download: list[(year, month)]
  • months_skipped: list[(year, month, path)]

orchestrate_cds_downloads
orchestrate_cds_downloads(
    session,
    filename_base,
    save_dir,
    successful_downloads,
    failed_downloads,
    skipped_downloads,
    *,
    logger,
    echo_console,
    allow_prompts,
    dataset_config_mapping=CDS_DATASETS
)

Handle and orchestrate ERA5 monthly downloads, supporting parallel or sequential execution.

Parameters:

Name Type Description Default
session SessionState

Session containing user configuration and authenticated client.

required
filename_base str

Base filename (without date or extension).

required
save_dir Path

Directory to save downloaded files.

required
successful_downloads list

Mutable list to collect (year, month) tuples for successful downloads.

required
failed_downloads list

Mutable list to collect (year, month) tuples for failed downloads.

required
skipped_downloads list

Mutable list to collect (year, month) tuples for skipped downloads.

required
logger Logger

Logger for logging messages.

required
echo_console bool

Whether to echo prompts to console.

required
allow_prompts bool

Whether to allow interactive prompts.

required
dataset_config_mapping dict

Mapping of dataset configurations, by default CDS_DATASETS.

CDS_DATASETS

Returns:

Type Description
None

Key Functions:

orchestrate_cds_downloads(session, filename_base, save_dir, ...)

Manages the entire CDS download workflow:

  1. Generates list of months to download
  2. Checks for existing files (skip logic)
  3. Coordinates parallel or sequential downloads
  4. Handles retries and failures
  5. Validates downloaded files
  6. Updates success/failure lists

Parameters:

  • session (SessionState): Current session with all parameters
  • filename_base (str): Base filename for output files
  • save_dir (Path): Directory to save files
  • successful_downloads (list): List to append successful months
  • failed_downloads (list): List to append failed months
  • skipped_downloads (list): List to append skipped months
  • logger (logging.Logger): Logger instance
  • echo_console (bool): Whether to echo progress
  • allow_prompts (bool): Whether prompts are allowed (interactive mode)
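Month planning and the parallel-or-sequential dispatch can be sketched as below. This is a simplified illustration: months_between, orchestrate, and the download_one callback are hypothetical names, and the use of a thread pool is an assumption about how parallel downloads might be run.

```python
from concurrent.futures import ThreadPoolExecutor


def months_between(start, end):
    """List (year, month) tuples from start to end, both inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def orchestrate(months, download_one, max_workers=1):
    """Run download_one(year, month) per month and sort results by status."""
    successful, failed = [], []
    if max_workers > 1:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(lambda ym: download_one(*ym), months))
    else:
        results = [download_one(year, month) for year, month in months]
    for year, month, status in results:
        target = successful if status == "success" else failed
        target.append((year, month))
    return successful, failed
```

Each worker returns a (year, month, status) tuple, mirroring the return shape documented for the per-month functions, so the caller can collect results without shared mutable state inside the workers.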

download_monthly_era5_file(...)

Downloads a single month of ERA5 data.

Parameters:

  • client (cdsapi.Client): Authenticated CDS client
  • dataset_full_name (str): Full CDS dataset name
  • year (int): Year to download
  • month (int): Month to download
  • variables (list): Variables to request
  • region_bounds (list): Geographic bounds [N, W, S, E]
  • output_file (Path): Where to save the file
  • Various retry and logging parameters

Returns:

  • Path: Path to downloaded file if successful, else None


sources/open_meteo.py

Open-Meteo data provider (planned, not yet implemented).

weather_data_retrieval.sources.open_meteo

Coming Soon

This module is a placeholder for future Open-Meteo integration.


Utility Modules

utils/session_management.py

Session state tracking and management.

weather_data_retrieval.utils.session_management

SessionState
first_unfilled_key
first_unfilled_key()

Return the first key in the ordered fields that is not filled. This enables a simple wizard-like progression and supports backtracking by clearing fields with unset(key).

summary
summary()

Return a nice printable summary of all fields in a tabular format.

to_dict
to_dict(only_filled=False)

Flatten the session into a plain dict suitable for runner.run(...). If only_filled=True, include only keys that have been filled.

Parameters:

Name Type Description Default
only_filled bool

Whether to include only filled keys, by default False.

False

Returns:

Type Description
dict

Flattened session dictionary.

get_cds_dataset_config
get_cds_dataset_config(session, dataset_config_mapping)

Return dataset configuration dictionary based on session short name.

Parameters:

Name Type Description Default
session SessionState

The current session state containing user selections.

required
dataset_config_mapping dict

The mapping of dataset short names to their configurations.

required

Returns:

Type Description
dict

The configuration dictionary for the selected dataset.

map_config_to_session
map_config_to_session(
    cfg, session, *, logger=None, echo_console=False
)

Validate and map a loaded JSON config into SessionState.

Parameters:

Name Type Description Default
cfg dict

Loaded configuration dictionary.

required
session SessionState

The session state to populate.

required

Returns:

Type Description
tuple[bool, list[str]]

(ok, messages): ok is False if any hard error prevents continuing.

ensure_cds_connection
ensure_cds_connection(
    client,
    creds,
    max_reauth_attempts=6,
    wait_between_attempts=15,
)

Ensure a valid CDS API client. Re-authenticate automatically if the connection drops.

Parameters:

Name Type Description Default
client Client

Current CDS API client.

required
creds dict

{'url': str, 'key': str} stored from initial login.

required
max_reauth_attempts int

Maximum reconnection attempts before aborting.

6
wait_between_attempts int

Wait time (seconds) between re-auth attempts.

15

Returns:

Type Description
Client | None

Valid client or None if re-authentication ultimately fails.
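The bounded re-authentication loop can be sketched generically. The names ensure_connection_sketch and connect are hypothetical, and treating a missing client (None) as the signal for a dropped connection is a simplification made for this example; the real function may detect drops differently.

```python
import time


def ensure_connection_sketch(client, connect, max_attempts=6, wait_seconds=15):
    """Return a usable client, retrying connect() a bounded number of times."""
    if client is not None:
        return client  # simplification: an existing client is assumed healthy
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt < max_attempts:
                time.sleep(wait_seconds)  # pause before the next attempt
    return None  # every attempt failed
```

Returning None instead of raising lets the caller decide whether to abort the run or mark the current download as failed.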

internet_speedtest
internet_speedtest(
    test_urls=None,
    max_seconds=15,
    logger=None,
    echo_console=True,
)

Download a ~100 MB test file from a fast CDN to estimate speed (MB/s).

Parameters:

Name Type Description Default
test_urls list[str]

List of URLs of the test files.

None
max_seconds int

Maximum time to wait for a response, by default 15 seconds.

15

Returns:

Type Description
float

Estimated download speed in Mbps.

Key Classes:

SessionState

A simple key-value store for tracking configuration state during interactive sessions.

Methods:

  • get(key, default=None) - Retrieve a value
  • set(key, value) - Store a value
  • unset(key) - Remove a value
  • first_unfilled_key() - Get the next required key that hasn't been set
  • to_dict() - Convert session to a dictionary (for saving)

Usage:

session = SessionState()
session.set("data_provider", "cds")
session.set("dataset_short_name", "era5-land")

provider = session.get("data_provider")  # "cds"
config = session.to_dict()  # {"data_provider": "cds", ...}
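How first_unfilled_key() and unset() support wizard-style progression and backtracking can be shown with a toy stand-in. SessionStateSketch is a hypothetical class written for this example; the real SessionState defines its own ordered fields and may store them differently.

```python
class SessionStateSketch:
    """Ordered key-value store; a key counts as filled once set() is called."""

    def __init__(self, keys):
        self._values = {key: None for key in keys}
        self._filled = {key: False for key in keys}

    def set(self, key, value):
        self._values[key] = value
        self._filled[key] = True

    def unset(self, key):
        self._values[key] = None
        self._filled[key] = False

    def get(self, key, default=None):
        return self._values[key] if self._filled.get(key) else default

    def first_unfilled_key(self):
        # dicts preserve insertion order, so this follows the field order
        return next((k for k, done in self._filled.items() if not done), None)

    def to_dict(self, only_filled=False):
        return {k: v for k, v in self._values.items()
                if self._filled[k] or not only_filled}
```

A wizard loop can then repeatedly ask for first_unfilled_key() and call unset() on the previous key when the user goes back.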

Key Functions:

internet_speedtest(test_urls, max_seconds, logger, echo_console)

Performs a quick internet speed test to estimate download times.

Parameters:

  • test_urls (list, optional): URLs to test (uses defaults if None)
  • max_seconds (int): Maximum time to spend testing
  • logger (logging.Logger): Logger instance
  • echo_console (bool): Whether to echo results

Returns:

  • float: Estimated speed in Mbps
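Turning a measured speed into a time estimate is simple arithmetic, sketched below. The function estimate_download_seconds is hypothetical, and the sketch assumes the speed is in megabits per second (Mbps) and the size in megabytes; if the measured value is actually in megabytes per second, the division by 8 should be dropped.

```python
def estimate_download_seconds(size_mb, speed_mbps):
    """Estimate transfer time for size_mb megabytes at speed_mbps megabits/s."""
    if speed_mbps <= 0:
        raise ValueError("speed must be positive")
    speed_mb_per_s = speed_mbps / 8  # 8 bits per byte
    return size_mb / speed_mb_per_s


seconds = estimate_download_seconds(size_mb=100, speed_mbps=80)
```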

map_config_to_session(cfg, session, logger, echo_console)

Maps a configuration dictionary to a SessionState, validating as it goes.


utils/data_validation.py

Input validation functions.

weather_data_retrieval.utils.data_validation

normalize_input
normalize_input(value, category)

Normalize user input to canonical internal value as defined in NORMALIZATION_MAP.

Parameters:

Name Type Description Default
value str

The user input value to normalize.

required
category str

The category of normalization (e.g., 'data_provider', 'dataset_short_name')

required

Returns:

Type Description
str

The normalized value.

format_duration
format_duration(seconds)

Convert seconds to a nice Hh Mm Ss string (with decimal seconds).

Parameters:

Name Type Description Default
seconds float

Duration in seconds.

required

Returns:

Type Description
str: Formatted duration string.
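One way to produce an "Hh Mm Ss" string with decimal seconds is sketched here. The function name and the exact output layout (one decimal place, single spaces) are assumptions; the package's formatting may differ.

```python
def format_duration_sketch(seconds):
    """Format seconds as 'Hh Mm S.Ss' (exact output layout is an assumption)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{int(hours)}h {int(minutes)}m {secs:.1f}s"


label = format_duration_sketch(3661.5)
```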
format_coordinates_nwse
format_coordinates_nwse(boundaries)

Extracts and formats coordinates as integers in N-W-S-E order. Used for compact representation in filenames.

Parameters:

Name Type Description Default
boundaries list

List of boundaries in the order [north, west, south, east]

required

Returns:

Type Description
str

Formatted string in the format 'N{north}W{west}S{south}E{east}'.

month_days
month_days(year, month)

Get list of days in a month formatted as two-digit strings.

Parameters:

Name Type Description Default
year int

Year of interest.

required
month int

Month of interest.

required

Returns:

Type Description
List[str]

List of days in the month as two-digit strings.
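A minimal sketch of this behaviour using the standard calendar module follows. The name month_days_sketch is hypothetical; calendar.monthrange handles leap years.

```python
import calendar


def month_days_sketch(year, month):
    """Return every day of the month as a zero-padded two-digit string."""
    last_day = calendar.monthrange(year, month)[1]
    return [f"{day:02d}" for day in range(1, last_day + 1)]


february = month_days_sketch(2024, 2)
```

Lists of this shape are convenient for request payloads that expect days as strings.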

validate_data_provider
validate_data_provider(provider)

Ensure the data provider is recognized and implemented.

Parameters:

Name Type Description Default
provider str

Name of the data provider.

required

Returns:

Type Description
bool

True if valid, False otherwise.

validate_dataset_short_name
validate_dataset_short_name(
    dataset_short_name,
    provider,
    *,
    logger=None,
    echo_console=False
)

Check dataset compatibility with provider.

Parameters:

Name Type Description Default
dataset_short_name str

Dataset short name.

required
provider str

Name of the data provider.

required

Returns:

Type Description
bool

True if valid, False otherwise.

validate_cds_api_key
validate_cds_api_key(
    url, key, *, logger=None, echo_console=False
)

Validate CDS API credentials by attempting to initialize a cdsapi.Client.

Parameters:

Name Type Description Default
url str

CDS API URL.

required
key str

CDS API key.

required
logger Logger

Logger for logging messages, by default None.

None
echo_console bool

Whether to echo messages to console, by default False.

False

Returns:

Type Description
Client | None

Authenticated client if successful, otherwise None.

validate_directory
validate_directory(path)

Check if path exists or can be created.

Parameters:

Name Type Description Default
path str

Directory path to validate.

required

Returns:

Type Description
bool

True if path exists or was created successfully, False otherwise.

validate_date
validate_date(
    value,
    allow_month_only=False,
    *,
    logger=None,
    echo_console=False
)

Validate date format as YYYY-MM-DD or optionally YYYY-MM.

Parameters:

Name Type Description Default
value str

Date string to validate.

required
allow_month_only bool

If True, also accept YYYY-MM format, by default False.

False

Returns:

Type Description
bool

True if valid, False otherwise.

parse_date_with_defaults
parse_date_with_defaults(
    date_str, default_to_month_end=False
)

Parse date string and apply defaults for incomplete dates.

Parameters:

Name Type Description Default
date_str str

Date string in format YYYY-MM-DD or YYYY-MM.

required
default_to_month_end bool

If True and date is YYYY-MM format, default to last day of month. If False and date is YYYY-MM format, default to first day of month. By default False.

False

Returns:

Type Description
tuple[datetime, str]

Tuple of (parsed datetime object, ISO format string YYYY-MM-DD)

Raises:

Type Description
ValueError

If date string is invalid.
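The day-defaulting rule can be sketched with datetime and calendar. The name parse_date_sketch is hypothetical, and trying the full format first and falling back to the month-only format is one possible approach, not necessarily the package's.

```python
import calendar
from datetime import datetime


def parse_date_sketch(date_str, default_to_month_end=False):
    """Parse YYYY-MM-DD or YYYY-MM, filling in a missing day as documented."""
    try:
        parsed = datetime.strptime(date_str, "%Y-%m-%d")
    except ValueError:
        # Month-only input; this second parse raises ValueError if still invalid.
        parsed = datetime.strptime(date_str, "%Y-%m")
        if default_to_month_end:
            last_day = calendar.monthrange(parsed.year, parsed.month)[1]
            parsed = parsed.replace(day=last_day)
    return parsed, parsed.strftime("%Y-%m-%d")


_, end_iso = parse_date_sketch("2024-02", default_to_month_end=True)
```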

clamp_era5_available_end_date
clamp_era5_available_end_date(
    end, *, logger=None, echo_console=False
)

Clamp end date to ERA5 data availability boundary (8 days ago).

Parameters:

Name Type Description Default
end datetime

Desired end date.

required

Returns:

Type Description
datetime

Clamped end date.

Notes: ERA5 data is available up to 8 days prior to the current date. The 8-day lag is used to ensure data availability.
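Clamping to the 8-day availability lag amounts to taking a minimum, as in this sketch. The name clamp_end_date_sketch and the now parameter (added so the example is deterministic) are hypothetical.

```python
from datetime import datetime, timedelta

ERA5_LAG_DAYS = 8  # availability lag stated in this reference


def clamp_end_date_sketch(end, now=None):
    """Return end, or the latest available date if end is too recent."""
    now = now or datetime.now()
    latest_available = now - timedelta(days=ERA5_LAG_DAYS)
    return min(end, latest_available)


clamped = clamp_end_date_sketch(datetime(2024, 1, 18), now=datetime(2024, 1, 20))
```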
validate_coordinates
validate_coordinates(north, west, south, east)

Ensure coordinates are within realistic bounds.

Parameters:

Name Type Description Default
north int | float

Northern latitude boundary.

required
west int | float

Western longitude boundary.

required
south int | float

Southern latitude boundary.

required
east int | float

Eastern longitude boundary.

required

Returns:

Type Description
bool

True if coordinates are valid, False otherwise.
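A plausible set of bounds checks is sketched below. The exact rules are assumptions (latitudes within ±90 with north above south, longitudes within ±180); the package may apply different or additional constraints, and validate_coordinates_sketch is a hypothetical name.

```python
def validate_coordinates_sketch(north, west, south, east):
    """Check latitude/longitude ranges and that north lies above south."""
    if not (-90 <= south < north <= 90):
        return False
    return -180 <= west <= 180 and -180 <= east <= 180


ok = validate_coordinates_sketch(north=40, west=-10, south=35, east=5)
```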

validate_variables
validate_variables(
    variable_list,
    variable_restrictions,
    restriction_allow=False,
)

Ensure user-specified variables are available for this dataset.

Parameters:

Name Type Description Default
variable_list list[str]

List of variable names to validate.

required
variable_restrictions list[str]

List of variables that are either allowed or disallowed.

required
restriction_allow bool

If True, variable_restrictions is an allowlist (i.e. in). If False, it's a denylist (i.e. not in)

False

Returns:

Type Description
bool

True if all variables are valid, False otherwise.
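The allowlist/denylist switch can be sketched in a few lines. The function name is hypothetical and the variable names in the usage lines are illustrative only.

```python
def validate_variables_sketch(variable_list, variable_restrictions, restriction_allow=False):
    """Apply the restriction list as an allowlist or a denylist."""
    restricted = set(variable_restrictions)
    if restriction_allow:
        # Allowlist: every requested variable must appear in the list.
        return all(name in restricted for name in variable_list)
    # Denylist: no requested variable may appear in the list.
    return all(name not in restricted for name in variable_list)


allowed = validate_variables_sketch(
    ["2m_temperature"], ["2m_temperature", "total_precipitation"], restriction_allow=True
)
```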

validate_existing_file_action
validate_existing_file_action(
    session, *, allow_prompts, logger, echo_console=False
)

Normalize existing_file_action for the current run mode. If 'case_by_case' is set but prompts are not allowed (automatic mode), coerce to 'skip_all' and warn.

Parameters:

Name Type Description Default
session Any

Current session state.

required
allow_prompts bool

Whether prompts are allowed (i.e., interactive mode).

required
logger Logger

Logger for logging messages.

required
echo_console bool

Whether to echo messages to console.

False

Returns:

Type Description
str

Normalized existing_file_action policy.
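The coercion rule can be sketched as a small pure function. This version takes the action string directly rather than a session object, which is a simplification; the function name and warning text are hypothetical.

```python
import logging


def normalize_existing_file_action(action, allow_prompts, logger=None):
    """Coerce 'case_by_case' to 'skip_all' when prompting is not possible."""
    logger = logger or logging.getLogger(__name__)
    if action == "case_by_case" and not allow_prompts:
        logger.warning("case_by_case is unavailable without prompts; using skip_all")
        return "skip_all"
    return action


policy = normalize_existing_file_action("case_by_case", allow_prompts=False)
```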

validate_config
validate_config(
    config,
    *,
    logger=None,
    run_mode="automatic",
    echo_console=False,
    live_auth_check=False
)

Entry point. Validates common shape then dispatches to provider-specific validator.

Parameters:

Name Type Description Default
config dict

Configuration dictionary.

required
logger Logger

Logger for logging messages, by default None.

None
run_mode str

Run mode, either 'interactive' or 'automatic', by default "automatic".

'automatic'
echo_console bool

Whether to echo messages to console, by default False.

False
live_auth_check bool

Whether to perform live authentication checks (e.g., CDS API), by default False.

False

Returns:

Type Description
None

Key Functions:

validate_config(config, logger, run_mode)

Comprehensive configuration validation. Checks:

  • All required fields are present
  • Data types are correct
  • Values are in valid ranges
  • Enumerations match allowed values
  • Dates are properly formatted and logical
  • Region bounds are valid

Side Effects:

  • Clamps out-of-range values with warnings
  • Converts incompatible settings (e.g., prompt → skip_all in batch mode)
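A few of the checks listed above (required fields, date format and order, region bounds) might look like the following sketch. The key names are taken from the usage example on this page; the exact set of required keys, the error messages, and the function name are assumptions, and the real validator also dispatches to provider-specific checks that are not shown here.

```python
from datetime import date

# Assumed subset of required keys, taken from the example config on this page.
REQUIRED_KEYS = ("data_provider", "start_date", "end_date", "region_bounds", "variables")


def check_common_shape(config):
    """Raise ValueError if the config fails basic shape checks (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")

    # date.fromisoformat raises ValueError itself for malformed dates.
    start = date.fromisoformat(config["start_date"])
    end = date.fromisoformat(config["end_date"])
    if start > end:
        raise ValueError("start_date must not be after end_date")

    north, west, south, east = config["region_bounds"]
    if not (-90 <= south < north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south < north <= 90")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must lie within [-180, 180]")


good_config = {
    "data_provider": "cds",
    "start_date": "2023-01-01",
    "end_date": "2023-03-31",
    "region_bounds": [40, -10, 35, 5],
    "variables": ["2m_temperature"],
}
check_common_shape(good_config)  # passes silently
```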

validate_cds_api_key(api_url, api_key, logger, echo_console)

Tests CDS API credentials by attempting to create a client connection.

Returns:

  • cdsapi.Client if successful, else None

invalid_era5_world_variables(variables)

Checks for invalid variable names in ERA5 (world) dataset.

Returns:

  • list: Invalid variable names (empty if all valid)

invalid_era5_land_variables(variables)

Checks for invalid variable names in ERA5-Land dataset.

Returns:

  • list: Invalid variable names (empty if all valid)

Formatting Helpers:

  • format_coordinates_nwse(boundaries) - Convert bounds to string like "N40W10S35E5"
  • format_duration(seconds) - Convert seconds to human-readable duration
  • validate_date_format(date_string) - Check if date string is valid YYYY-MM-DD
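To make the helper behaviour concrete, here is a rough sketch of what such formatters could look like. The `_sketch` names are hypothetical, and the handling of negative coordinates and the duration layout are assumptions. Only the "N40W10S35E5" example and the YYYY-MM-DD format come from the descriptions above.

```python
from datetime import datetime


def format_coordinates_nwse_sketch(boundaries):
    """Format [north, west, south, east] as e.g. 'N40W10S35E5' (signs dropped: an assumption)."""
    north, west, south, east = boundaries
    return f"N{abs(north):g}W{abs(west):g}S{abs(south):g}E{abs(east):g}"


def format_duration_sketch(seconds):
    """Render a number of seconds as 'Xh Ym Zs' (layout is an assumption)."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"


def validate_date_format_sketch(date_string):
    """Return True if date_string parses as a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(date_string, "%Y-%m-%d")
        return True
    except ValueError:
        return False


print(format_coordinates_nwse_sketch([40, -10, 35, 5]))  # N40W10S35E5
print(format_duration_sketch(3725))                      # 1h 2m 5s
print(validate_date_format_sketch("2023-02-30"))         # False
```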

utils/file_management.py

File naming, checking, and estimation functions.

weather_data_retrieval.utils.file_management

generate_filename_hash
generate_filename_hash(
    dataset_short_name, variables, boundaries
)

Generate a unique hash for the download parameters that will be used to create the filename.

Parameters:

Name Type Description Default
dataset_short_name str

The dataset short name (era5-world etc).

required
variables list[str]

List of variable names.

required
boundaries list[float]

List of boundaries [north, west, south, east].

required

Returns:

Type Description
str

A unique hash string representing the download parameters.

find_existing_month_file
find_existing_month_file(
    save_dir, filename_base, year, month
)

Tolerant matcher that finds an existing file for the given month. Accepts both _YYYY-MM.ext and _YYYY_MM.ext patterns and any extension.

Parameters:

Name Type Description Default
save_dir Path

Directory where files are saved.

required
filename_base str

Base filename (without date or extension).

required
year int

Year of the file.

required
month int

Month of the file.

required

Returns:

Type Description
Optional[Path]

Path to the existing file if found, else None.
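A tolerant matcher of this kind can be approximated with `pathlib` globbing. The sketch below is an assumption about the approach, not the package's code: the function name is hypothetical, and it presumes `filename_base` contains no glob metacharacters.

```python
from pathlib import Path
from typing import Optional


def find_month_file_sketch(save_dir: Path, filename_base: str, year: int, month: int) -> Optional[Path]:
    """Return the first file matching <base>_YYYY-MM.* or <base>_YYYY_MM.*, else None."""
    for separator in ("-", "_"):
        pattern = f"{filename_base}_{year:04d}{separator}{month:02d}.*"
        matches = sorted(Path(save_dir).glob(pattern))  # sorted for a deterministic pick
        if matches:
            return matches[0]
    return None
```

For example, with a file named `era5_abc_2023_01.grib` in `save_dir`, calling `find_month_file_sketch(save_dir, "era5_abc", 2023, 1)` returns that path, while asking for month 2 returns `None`.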

estimate_era5_monthly_file_size
estimate_era5_monthly_file_size(
    variables,
    area,
    grid_resolution=0.25,
    timestep_hours=1.0,
    bytes_per_value=4.0,
)

Estimate ERA5 GRIB file size (MB) for a monthly dataset.

Parameters:

Name Type Description Default
variables list[str]

Variables requested (e.g. ['2m_temperature', 'total_precipitation']).

required
area list[float]

[north, west, south, east] geographic bounds in degrees.

required
grid_resolution float

Grid spacing in degrees (default 0.25° for ERA5).

0.25
timestep_hours float

Temporal resolution in hours (1 = hourly, 3 = 3-hourly, 6 = 6-hourly, etc.).

1.0
bytes_per_value float

Bytes per gridpoint per variable (float32 = 4 bytes).

4.0

Returns:

Type Description
float

Estimated monthly file size in MB.
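The parameters suggest a size model built from grid points × time steps × variables × bytes per value. The sketch below computes only that raw, uncompressed product. It is an assumption about the arithmetic, and the package's estimate may apply empirical corrections not shown here. The function name, the fixed 30-day month, and the 1 MB = 10^6 bytes convention are all illustrative choices.

```python
def raw_monthly_size_mb(variables, area, grid_resolution=0.25, timestep_hours=1.0,
                        bytes_per_value=4.0, days_in_month=30):
    """Uncompressed size in MB of one month of gridded data (hypothetical helper)."""
    north, west, south, east = area
    # Grid points along each axis, counting both edges of the box.
    n_lat = int(round((north - south) / grid_resolution)) + 1
    n_lon = int(round((east - west) / grid_resolution)) + 1
    n_steps = int(days_in_month * 24 / timestep_hours)
    total_bytes = n_lat * n_lon * n_steps * len(variables) * bytes_per_value
    return total_bytes / 1e6  # assumes 1 MB = 10**6 bytes


size_mb = raw_monthly_size_mb(["2m_temperature", "total_precipitation"], [40, -10, 35, 5])
print(f"{size_mb:.2f} MB")  # 7.38 MB
```

Here the box spans 21 × 61 grid points and 720 hourly steps, so two float32 variables come to 21 × 61 × 720 × 2 × 4 = 7,378,560 bytes.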

estimate_cds_download
estimate_cds_download(
    variables,
    area,
    start_date,
    end_date,
    observed_speed_mbps,
    grid_resolution=0.25,
    timestep_hours=1.0,
    bytes_per_value=4.0,
    overhead_per_request_s=180.0,
    overhead_per_var_s=12.0,
)

Estimate per-file and total download size/time for CDS (ERA5) retrievals, using an empirically grounded file size model.

Parameters:

Name Type Description Default
variables list[str]

Variables selected (e.g. ['2m_temperature', 'total_precipitation']).

required
area list[float]

[north, west, south, east] bounds in degrees.

required
start_date str

Start of the date range (YYYY-MM-DD).

required
end_date str

End of the date range (YYYY-MM-DD).

required
observed_speed_mbps float

Measured internet speed in megabits per second (Mbps).

required
grid_resolution float

Grid resolution in degrees (default 0.25°).

0.25
timestep_hours float

Temporal resolution in hours (default 1-hourly).

1.0
bytes_per_value float

Bytes per stored value (float32 = 4).

4.0
overhead_per_request_s float

Fixed CDS request overhead time (queue/prep).

180.0
overhead_per_var_s float

Per-variable overhead for CDS throttling/prep.

12.0

Returns:

Type Description
dict

{ "months": int, "file_size_MB": float, "total_size_MB": float, "time_per_file_sec": float, "total_time_sec": float }
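One plausible way to combine these inputs into the returned fields is sketched below: count the calendar months in the range, convert megabytes to megabits for the transfer time, and add the fixed overheads per file. How the package actually combines the overheads is not stated here, so the formula and both function names are assumptions.

```python
from datetime import date


def count_months(start_date, end_date):
    """Number of calendar months touched by the inclusive range (YYYY-MM-DD strings)."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def estimate_totals_sketch(file_size_mb, n_variables, months, observed_speed_mbps,
                           overhead_per_request_s=180.0, overhead_per_var_s=12.0):
    """Build an estimate dict with the documented keys (assumed formula)."""
    transfer_s = file_size_mb * 8.0 / observed_speed_mbps  # MB -> megabits, then / Mbps
    time_per_file = transfer_s + overhead_per_request_s + overhead_per_var_s * n_variables
    return {
        "months": months,
        "file_size_MB": file_size_mb,
        "total_size_MB": file_size_mb * months,
        "time_per_file_sec": time_per_file,
        "total_time_sec": time_per_file * months,
    }


months = count_months("2023-01-01", "2023-12-31")
estimate = estimate_totals_sketch(42.3, n_variables=2, months=months, observed_speed_mbps=50.0)
print(months, round(estimate["time_per_file_sec"], 1))  # 12 210.8
```

With the default overheads, the fixed request and per-variable costs dominate the transfer time for small files, so connection speed has little effect on the total in this model.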

expected_save_stem
expected_save_stem(save_dir, filename_base, year, month)

Construct canonical save stem (without extension) for monthly data.

Parameters:

Name Type Description Default
save_dir str | Path | None

Base directory for saving. If None, defaults to osme_common.paths.data_dir().

required
filename_base str

Base name without date or extension.

required
year int

Year of the file.

required
month int

Month of the file.

required

Returns:

Type Description
Path

Resolved path under the proper data directory.

expected_save_path
expected_save_path(
    save_dir, filename_base, year, month, data_format="grib"
)

Construct canonical save path for monthly data.

Parameters:

Name Type Description Default
save_dir str | Path | None

Base directory for saving. If None, defaults to osme_common.paths.data_dir().

required
filename_base str

Base name without date or extension.

required
year int

Year of the file.

required
month int

Month of the file.

required
data_format str

File extension, e.g., 'grib' or 'nc'.

'grib'

Returns:

Type Description
Path

Resolved path under the proper data directory.
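As a rough illustration of how a stem and a full path relate, the sketch below joins the pieces with a `_YYYY-MM` suffix. The separator choice is an assumption (the matcher documented earlier accepts both `-` and `_`), the function names are hypothetical, and the `None` fallback to `osme_common.paths.data_dir()` is deliberately left out.

```python
from pathlib import Path


def save_stem_sketch(save_dir, filename_base, year, month):
    """Path without extension, e.g. <save_dir>/<base>_2023-03 (assumed naming)."""
    return Path(save_dir) / f"{filename_base}_{year:04d}-{month:02d}"


def save_path_sketch(save_dir, filename_base, year, month, data_format="grib"):
    """Full path: the stem plus the data format as file extension."""
    stem = save_stem_sketch(save_dir, filename_base, year, month)
    return stem.with_name(f"{stem.name}.{data_format}")


example_path = save_path_sketch("data", "era5_abc", 2023, 3)
print(example_path.name)  # era5_abc_2023-03.grib
```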

is_zip_file
is_zip_file(path)

Check if a file is a ZIP archive by reading its magic number.

Parameters:

Name Type Description Default
path Path

Path to the file to check.

required

Returns:

Type Description
bool

True if the file is a ZIP archive, False otherwise.

is_grib_file
is_grib_file(path)

Check if a file is a GRIB file by reading its magic number. GRIB files begin with ASCII bytes 'GRIB'.

Parameters:

Name Type Description Default
path Path

Path to the file to check.

required

Returns:

Type Description
bool

True if the file is a GRIB file, False otherwise.
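Magic-number checks like these read only the first few bytes of a file. The sketch below shows the idea with hypothetical function names. The ZIP signatures are the standard `PK` markers, and the GRIB check uses the ASCII bytes 'GRIB' mentioned above.

```python
from pathlib import Path

# Standard ZIP signatures: local file header, empty archive, spanned archive.
ZIP_SIGNATURES = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def read_magic(path, length=4):
    """Return the first `length` bytes of the file."""
    with open(path, "rb") as handle:
        return handle.read(length)


def looks_like_zip(path):
    """True if the file starts with one of the ZIP signatures."""
    return read_magic(path) in ZIP_SIGNATURES


def looks_like_grib(path):
    """True if the file starts with the ASCII bytes 'GRIB'."""
    return read_magic(path) == b"GRIB"
```

Checking content rather than the file extension matters here because a download saved with a `.grib` name can in fact be a ZIP archive, which is the case `unpack_zip_to_grib` handles.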

unpack_zip_to_grib
unpack_zip_to_grib(zip_path, final_grib_path)

Extract a ZIP and move the contained GRIB-like file to final_grib_path. If multiple candidates exist, prefer .grib/.grb/.grib2; otherwise take the largest file.

Parameters:

Name Type Description Default
zip_path Path

Path to the ZIP file.

required
final_grib_path Path

Desired final path for the extracted GRIB file.

required

Returns:

Type Description
Path

Path to the final GRIB file.
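The selection rule (prefer GRIB-like extensions, otherwise the largest file) can be sketched with the standard library as follows. The function name is hypothetical, and breaking ties among several GRIB-like files by size is an assumption.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

GRIB_SUFFIXES = {".grib", ".grb", ".grib2"}


def unpack_zip_sketch(zip_path, final_grib_path):
    """Extract zip_path and move the best GRIB candidate to final_grib_path."""
    zip_path, final_grib_path = Path(zip_path), Path(final_grib_path)
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        files = [p for p in Path(tmp).rglob("*") if p.is_file()]
        if not files:
            raise ValueError(f"No files found in {zip_path}")
        # Prefer GRIB-like extensions; fall back to every extracted file.
        preferred = [p for p in files if p.suffix.lower() in GRIB_SUFFIXES]
        chosen = max(preferred or files, key=lambda p: p.stat().st_size)
        final_grib_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(chosen), str(final_grib_path))
    return final_grib_path
```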

Key Functions:

generate_filename_hash(dataset_short_name, variables, boundaries)

Creates a unique 12-character hash from download parameters. This ensures files with different parameters don't collide.

Returns:

  • str: 12-character hash (e.g., "abc123def456")
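A short, deterministic hash of this kind is commonly built by serialising the parameters and truncating a digest. The sketch below uses SHA-256 and sorts the variables so their order does not matter. The algorithm, the serialisation, and the function name are all assumptions, since this page only states that the result is 12 characters long.

```python
import hashlib


def filename_hash_sketch(dataset_short_name, variables, boundaries, length=12):
    """Return a short hex digest identifying the download parameters (assumed scheme)."""
    key = "|".join([
        dataset_short_name,
        ",".join(sorted(variables)),             # order-independent
        ",".join(f"{b:g}" for b in boundaries),  # 40 and 40.0 serialise identically
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length]


hash_a = filename_hash_sketch("era5-world", ["2m_temperature", "total_precipitation"], [40, -10, 35, 5])
hash_b = filename_hash_sketch("era5-world", ["total_precipitation", "2m_temperature"], [40.0, -10.0, 35.0, 5.0])
print(hash_a == hash_b)  # True
```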

find_existing_month_file(save_dir, filename_base, year, month)

Searches for an existing file for a given month, handling different date separators and extensions.

Returns:

  • Path if file exists, else None

estimate_era5_monthly_file_size(variables, area, grid_resolution, ...)

Estimates file size for a monthly ERA5 download based on empirical data.

Parameters:

  • variables (list): Variables to download
  • area (list): Region bounds
  • grid_resolution (float): Grid spacing in degrees
  • timestep_hours (float): Temporal resolution
  • bytes_per_value (float): Storage size per value

Returns:

  • float: Estimated size in MB

estimate_cds_download(variables, area, start_date, end_date, observed_speed_mbps, ...)

Comprehensive download estimation including size, time, and parallelization effects.

Returns: dict, for example:

{
  "months": 12,              # Number of monthly files
  "file_size_MB": 42.3,      # Average file size
  "total_size_MB": 507.6,    # Total download size
  "time_per_file_sec": 180,  # Time per file
  "total_time_sec": 2160     # Total estimated time
}

File Format Helpers:

  • is_zip_file(path) - Check if file is a ZIP archive
  • is_grib_file(path) - Check if file is a GRIB file
  • unpack_zip_to_grib(zip_path, final_grib_path) - Extract GRIB from ZIP

utils/logging.py

Logging configuration and utilities.

weather_data_retrieval.utils.logging

build_download_summary
build_download_summary(
    session, estimates, speed_mbps, save_dir=None
)

Construct a formatted summary string of the current download configuration.

Parameters:

Name Type Description Default
session SessionState

Current session state containing all parameters.

required
estimates dict

Dictionary containing download size and time estimates.

required
speed_mbps float

Measured or estimated internet speed in Mbps.

required

Returns:

Type Description
str

Nicely formatted summary string for display or logging.

setup_logger
setup_logger(
    save_dir=None, run_mode="interactive", verbose=False
)

Initialize and return a configured logger.

Logs are written to /logs (or $OSME_LOG_DIR) by default, with optional console output in interactive or verbose modes.

Parameters:

Name Type Description Default
save_dir str or None

Directory to save log files. If None, defaults to osme_common.paths.log_dir().

None
run_mode str

Either 'interactive' or 'automatic', by default 'interactive'.

'interactive'
verbose bool

Whether to echo logs to console in automatic mode, by default False.

False

Returns:

Type Description
Logger

Configured logger instance.

log_msg
log_msg(
    msg,
    logger,
    *,
    level="info",
    echo_console=False,
    force=False
)

Unified logging utility.

  • Always logs to file.
  • Optionally echoes to console via tqdm.write.

Parameters:

Name Type Description Default
echo_console bool

Print to console when True (used for verbose mode).

False
force bool

Print to console regardless of echo_console (used for summaries/errors).

False

create_final_log_file
create_final_log_file(
    session,
    filename_base,
    original_logger,
    *,
    delete_original=True,
    reattach_to_final=True
)

Create a final log file with the same naming pattern as data files. Copies content from the original log file.

Parameters:

Name Type Description Default
session Any(SessionState)

Current session state.

required
filename_base str

Base filename pattern (same as data files).

required
original_logger Logger

The original logger instance.

required
delete_original bool

Whether to delete the original log file after creating the final one, by default True.

True
reattach_to_final bool

Whether to reattach the logger to the final log file, by default True.

True

Returns:

Type Description
str

Path to the final log file.

Key Functions:

setup_logger(save_dir, run_mode, verbose)

Initializes a configured logger for the package.

Parameters:

  • save_dir (str, optional): Directory for log files (defaults to logs/weather_data_retrieval/)
  • run_mode (str): "interactive" or "automatic"
  • verbose (bool): Whether to show console output

Returns:

  • logging.Logger: Configured logger instance

Configuration:

  • File handler: DEBUG level (everything)
  • Console handler: INFO level (only in interactive or verbose mode)
  • Format: "%(asctime)s | %(levelname)s | %(message)s"
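The handler configuration described above can be reproduced with the standard `logging` module. The sketch below is illustrative only: the function name, the explicit `log_file` argument, and the logger name are hypothetical, while the levels and format string follow the description.

```python
import logging
from pathlib import Path

LOG_FORMAT = "%(asctime)s | %(levelname)s | %(message)s"


def setup_logger_sketch(log_file, run_mode="interactive", verbose=False, name="wdr_sketch"):
    """Return a logger with a DEBUG file handler and an optional INFO console handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep messages out of the root logger

    formatter = logging.Formatter(LOG_FORMAT)
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)  # everything goes to the file
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    if run_mode == "interactive" or verbose:
        console_handler = logging.StreamHandler()
        console_handler.setLevel(logging.INFO)  # console shows INFO and above
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger
```

In automatic mode without `verbose`, only the file handler is attached, so scheduled runs stay quiet on the console while the log file still captures DEBUG output.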

log_msg(msg, logger, level, echo_console, force)

Unified logging utility that writes to file and optionally to console.

Parameters:

  • msg (str): Message to log
  • logger (logging.Logger): Logger instance
  • level (str): "debug", "info", "warning", "error", "exception"
  • echo_console (bool): Print to console
  • force (bool): Print even if echo_console=False (for critical messages)
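The interplay of `echo_console` and `force` can be sketched as below. The function name is hypothetical, `print` stands in for `tqdm.write` so that the example has no third-party dependency, and the fallback to "info" for unknown level names is an assumption.

```python
import logging

VALID_LEVELS = {"debug", "info", "warning", "error", "exception"}


def log_msg_sketch(msg, logger, *, level="info", echo_console=False, force=False):
    """Log msg at the given level and optionally print it to the console."""
    if level not in VALID_LEVELS:
        level = "info"  # assumed fallback for unknown level names
    getattr(logger, level)(msg)  # always reaches the logger, and so its file handler
    if echo_console or force:
        print(msg)  # stand-in for tqdm.write


msg_logger = logging.getLogger("wdr_log_msg_sketch")
msg_logger.addHandler(logging.NullHandler())
msg_logger.propagate = False

log_msg_sketch("only in the log", msg_logger)
log_msg_sketch("shown on console too", msg_logger, level="warning", force=True)
```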

build_download_summary(session, estimates, speed_mbps, save_dir)

Constructs a formatted summary of the download configuration.

Returns:

  • str: Multi-line summary string

create_final_log_file(session, filename_base, original_logger, ...)

Creates a final log file with the same naming convention as data files.

Parameters:

  • session (SessionState): Current session
  • filename_base (str): Base filename
  • original_logger (logging.Logger): Current logger
  • delete_original (bool): Remove temporary log
  • reattach_to_final (bool): Continue logging to final file

Returns:

  • str: Path to final log file


Usage Examples

Creating a Custom Download Script

from weather_data_retrieval.runner import run

# Define your configuration
config = {
    "data_provider": "cds",
    "dataset_short_name": "era5-land",
    "api_url": "https://cds.climate.copernicus.eu/api",
    "api_key": "YOUR_KEY",
    "start_date": "2023-01-01",
    "end_date": "2023-03-31",
    "region_bounds": [40, -10, 35, 5],
    "variables": ["2m_temperature"],
    "existing_file_action": "skip_all",
    "retry_settings": {"max_retries": 3, "retry_delay_sec": 15},
    "parallel_settings": {"enabled": True, "max_concurrent": 2}
}

# Run the download
exit_code = run(config, run_mode="automatic", verbose=True)

if exit_code == 0:
    print("Download completed successfully!")
elif exit_code == 2:
    print("Download completed with some failures. Check logs.")
else:
    print("Download failed. Check logs.")

Validating a Config Before Running

from weather_data_retrieval.utils.data_validation import validate_config
from weather_data_retrieval.utils.logging import setup_logger

logger = setup_logger(run_mode="automatic", verbose=True)

config = { ... }  # Your config

try:
    validate_config(config, logger=logger, run_mode="automatic")
    print("✓ Configuration is valid!")
except ValueError as e:
    print(f"✗ Configuration error: {e}")

Estimating Download Before Running

from weather_data_retrieval.utils.file_management import estimate_cds_download
from weather_data_retrieval.utils.session_management import internet_speedtest

# Test internet speed
speed_mbps = internet_speedtest(max_seconds=10, logger=None, echo_console=True)

# Estimate download
estimates = estimate_cds_download(
    variables=["2m_temperature", "total_precipitation"],
    area=[40, -10, 35, 5],
    start_date="2023-01-01",
    end_date="2023-12-31",
    observed_speed_mbps=speed_mbps,
    grid_resolution=0.1
)

print(f"Estimated files: {estimates['months']}")
print(f"Total size: {estimates['total_size_MB']:.1f} MB")
print(f"Estimated time: {estimates['total_time_sec'] / 60:.1f} minutes")

Extending the Package

Adding a New Data Provider

To add a new provider (e.g., Open-Meteo):

  1. Create provider module: sources/my_provider.py

  2. Implement download function:

    def orchestrate_my_provider_downloads(
        session, filename_base, save_dir,
        successful_downloads, failed_downloads, skipped_downloads,
        logger, echo_console, allow_prompts
    ):
        # Your implementation
        pass
    

  3. Add to session mapping: Update session_management.py to handle new provider

  4. Add validation: Update data_validation.py to validate provider-specific params

  5. Update prompts: Add provider selection in io/prompts.py

Adding New Variables

Variables are validated against hard-coded lists in data_validation.py:

# In data_validation.py

ERA5_LAND_VARIABLES = [
    "2m_temperature",
    "total_precipitation",
    # ... add new variable here
]

Adding New Configuration Parameters

  1. Define in SessionState: Add to the required/optional keys list

  2. Add validation: Update validate_config() in data_validation.py

  3. Add prompt (if interactive): Create prompt_new_parameter() in io/prompts.py

  4. Use in runner: Access via session.get("new_parameter") in your code


Testing

Running Unit Tests

# From repo root
pytest packages/weather_data_retrieval/tests/

Testing Individual Modules

# Test validation
from weather_data_retrieval.utils.data_validation import validate_config

config = {...}
validate_config(config, logger=None, run_mode="automatic")

Mock Testing Downloads

For testing without hitting the CDS API, you can mock the download functions:

from pathlib import Path
from unittest.mock import patch

with patch('weather_data_retrieval.sources.cds_era5.download_monthly_era5_file') as mock_download:
    # Return a fake path instead of contacting the CDS API
    mock_download.return_value = Path("/fake/file.grib")
    # Run your test
    ...

Further Reading