API Reference¶
This page documents the internal structure of the weather_data_retrieval package. This is useful if you want to understand how the code works, extend it, or integrate it into other tools.
For End Users
If you're just using the package to download data, you probably don't need this page. See the User Guide or Quickstart instead.
Package Structure¶
The package is organized into several modules:
weather_data_retrieval/
├── main.py # Entry point for CLI
├── runner.py # Core orchestration logic
├── io/ # Input/output handling
│ ├── cli.py # Command-line interface
│ ├── prompts.py # Interactive wizard prompts
│ └── config_loader.py # JSON config loading
├── sources/ # Data provider implementations
│ ├── cds_era5.py # CDS/ERA5 downloader
│ └── open_meteo.py # Open-Meteo (planned)
└── utils/ # Shared utilities
    ├── session_management.py # Session state tracking
    ├── data_validation.py # Input validation
    ├── file_management.py # File naming and checking
    └── logging.py # Logging configuration
Core Modules¶
Entry Point & Orchestration¶
main.py¶
The main entry point for the CLI. This is what gets called when you run osme-weather.
weather_data_retrieval.main ¶
Main entry point for the Weather Data Retrieval CLI.
This script can be run either:
- Automatically via a configuration file (--config path/to/config.json), or
- Interactively through a guided prompt wizard.
It handles session management, logging setup, and orchestration of
the retrieval workflow defined in weather_data_retrieval.runner.
Typically invoked through the CLI command: osme-weather or wdr.
main ¶
main()
Entry point for the Weather Data Retrieval CLI.
This function:
1. Parses CLI arguments or launches the interactive prompt wizard.
2. Loads or builds a configuration file for weather dataset download.
3. Initializes logging via osme_common.paths.data_dir().
4. Executes the main retrieval workflow using weather_data_retrieval.runner.run.
Automatically invoked by the osme-weather CLI script.
Key Function: main()
- Parses command-line arguments
- Determines run mode (interactive vs. batch)
- Sets up logging
- Delegates to either the interactive wizard or batch runner
Usage:
# Called automatically by CLI
# osme-weather [--config FILE] [--verbose] [--quiet]
runner.py¶
Core orchestration logic for the download workflow. This handles validation, estimation, and download coordination.
weather_data_retrieval.runner ¶
run ¶
run(
config,
run_mode="interactive",
verbose=True,
logger=None,
)
Unified orchestration entry point for both interactive and automatic runs. Handles validation, logging, estimation, and download orchestration.
Returns: 0=success, 1=fatal error, 2=some downloads failed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Configuration dictionary with all required parameters. | required |
| run_mode | str | Run mode, either 'interactive' or 'automatic'. | 'interactive' |
| verbose | bool | Whether to echo progress to the console. | True |
| logger | Logger | Pre-configured logger instance. | None |
Returns:
| Type | Description |
|---|---|
| int | Exit code: 0=success, 1=fatal error, 2=some downloads failed. |
run_batch_from_config ¶
run_batch_from_config(cfg_path, logger=None)
Run automatic batch from a config file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg_path | str | Path to the JSON configuration file. | required |
| logger | Logger | Pre-configured logger instance. | None |
Returns:
| Type | Description |
|---|---|
| int | Exit code: 0=success, 1=fatal error, 2=some downloads failed. |
Key Functions:
run(config, run_mode, verbose, logger)
Main workflow orchestrator that:
- Validates the configuration
- Maps config to session state
- Performs internet speed test
- Estimates download size and time
- Generates filename hash
- Orchestrates downloads
- Reports final statistics
Parameters:
- config (dict): Configuration dictionary
- run_mode (str): "interactive" or "automatic"
- verbose (bool): Whether to echo progress to console
- logger (logging.Logger): Logger instance
Returns:
- int: Exit code (0=success, 1=fatal error, 2=some failures)
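The exit-code convention above can be sketched as a small helper. This is only an illustration of the documented codes; `exit_code_from_results` and its arguments are hypothetical names, not part of the package.

```python
def exit_code_from_results(fatal_error: bool, failed_downloads: list) -> int:
    """Map a run outcome to the documented exit codes (hypothetical helper)."""
    if fatal_error:
        return 1  # fatal error: the run could not proceed
    if failed_downloads:
        return 2  # the run finished, but some downloads failed
    return 0  # everything succeeded


clean_run = exit_code_from_results(False, [])
partial_run = exit_code_from_results(False, [(2020, 1)])
print(clean_run, partial_run)
```

A calling script can pass such a code to `sys.exit()` so that shell pipelines can tell a partial failure (2) from a fatal one (1).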
run_batch_from_config(cfg_path, logger)
Convenience wrapper for batch mode. Loads config and calls run().
Input/Output Modules¶
io/cli.py¶
Command-line interface and interactive wizard.
weather_data_retrieval.io.cli ¶
parse_args ¶
parse_args()
Parse command-line arguments.
Parameters:
None.
Returns:
| Type | Description |
|---|---|
| Namespace | Parsed arguments. |
run_prompt_wizard ¶
run_prompt_wizard(session, logger=None)
Drives the interactive prompt flow (no config-source step). Returns True if all fields completed; False if user exits.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | The session state to populate. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| bool | True if completed; False if exited early. |
Key Functions:
parse_args()
Parses command-line arguments using argparse.
Returns: argparse.Namespace with fields:
- config: Path to config file (if --config provided)
- verbose: Boolean for verbose output (batch mode)
- quiet: Boolean for quiet mode (interactive mode)
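A parser exposing these three fields could look roughly like the sketch below. It is built only from the usage line `osme-weather [--config FILE] [--verbose] [--quiet]`; the help strings and the use of `store_true` flags are assumptions, and the real `parse_args()` may differ.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI flags."""
    parser = argparse.ArgumentParser(prog="osme-weather")
    parser.add_argument("--config", default=None, help="Path to a JSON config file")
    parser.add_argument("--verbose", action="store_true", help="Verbose output")
    parser.add_argument("--quiet", action="store_true", help="Quiet mode")
    return parser


# Passing an explicit argument list keeps the example independent of sys.argv.
args = build_parser_sketch().parse_args(["--config", "config.json", "--verbose"])
print(args.config, args.verbose, args.quiet)
```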
run_prompt_wizard(session, logger)
Drives the interactive prompt flow. Steps through each configuration parameter, validates inputs, and handles back/exit commands.
Parameters:
- session (SessionState): Session state to populate
- logger (logging.Logger): Logger instance
Returns:
- bool: True if completed successfully, False if user exited early
io/prompts.py¶
Individual prompt functions for each configuration step.
weather_data_retrieval.io.prompts ¶
read_input ¶
read_input(prompt, *, logger=None)
Centralized input handler with built-in 'exit' and 'back' controls.
Parameters:
- prompt (str): The prompt to display to the user.
- logger (logging.Logger, optional): Logger to log the prompt message.
Returns:
- str: The user input, or special command indicators.
say ¶
say(text, *, logger=None)
Centralized output handler to log and print messages.
Parameters:
- text (str): The message to display.
- logger (logging.Logger, optional): Logger to log the message.
Returns:
None
prompt_data_provider ¶
prompt_data_provider(session, *, logger=None)
Prompt user for which data provider to use (CDS or Open-Meteo).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store selected data provider. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| str | Normalized provider name ("cds" or "open-meteo"), or special control token "BACK" or "EXIT". |
prompt_dataset_short_name ¶
prompt_dataset_short_name(
session, provider, *, logger=None
)
Prompt for dataset choice.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store selected dataset. | required |
| provider | str | Data provider name. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| str | Normalized dataset name or 'exit' / 'back'. |
prompt_cds_url ¶
prompt_cds_url(
session,
api_url_default="https://cds.climate.copernicus.eu/api",
*,
logger=None
)
Prompt for CDS API URL.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store API URL. | required |
| api_url_default | str | Default CDS API URL. | 'https://cds.climate.copernicus.eu/api' |
Returns:
| Type | Description |
|---|---|
| str | CDS API URL or 'exit' / 'back'. |
prompt_cds_api_key ¶
prompt_cds_api_key(session, *, logger=None)
Prompt only for the CDS API key (hidden input).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store API key. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| str | CDS API key or 'exit' / 'back'. |
prompt_save_directory ¶
prompt_save_directory(session, default_dir, *, logger=None)
Ask for save directory, create if necessary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store save directory. | required |
| default_dir | Path | Default directory to suggest. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| Path or str | Path to save directory, or control token "BACK" / "EXIT". |
prompt_date_range ¶
prompt_date_range(session, *, logger=None)
Ask user for start and end date, with validation. Accepts formats: YYYY-MM-DD or YYYY-MM.
- Start dates without a day default to the first day of the month (YYYY-MM-01).
- End dates without a day default to the last day of the month (YYYY-MM-[last day]).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store date range. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| tuple[str, str] | (start_date_str, end_date_str) in ISO format (YYYY-MM-DD), or ("EXIT", "EXIT") / ("BACK", "BACK") |
prompt_coordinates ¶
prompt_coordinates(session, *, logger=None)
Prompt user for geographic boundaries (N, S, W, E) with validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store geographic boundaries. | required |
Returns:
| Type | Description |
|---|---|
| list[float] | [north, west, south, east] boundaries or special tokens "EXIT" / "BACK". |
prompt_variables ¶
prompt_variables(
session,
variable_restrictions_list,
*args,
restriction_allow=False,
logger=None
)
Ask for variables to download, validate each against allowed/disallowed list, and only update session if the full set is valid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store selected variables. | required |
| variable_restrictions_list | list[str] | List of variables that are either allowed or disallowed. | required |
| restriction_allow | bool | If True, variable_restrictions_list is an allowlist (i.e. in). If False, it's a denylist (i.e. not in). | False |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| list[str] or str | List of selected variable names, or control token "BACK" / "EXIT". |
prompt_skip_overwrite_files ¶
prompt_skip_overwrite_files(session, *, logger=None)
Prompt user to choose skip/overwrite/case-by-case for existing files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Session state to store user choice. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| str | One of "overwrite_all", "skip_all", "case_by_case" |
prompt_parallelisation_settings ¶
prompt_parallelisation_settings(session, *, logger=None)
Ask user about parallel downloads and concurrency cap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store parallelisation settings. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| dict or str | Dictionary with parallelisation settings, or control token "BACK" / "EXIT". |
prompt_retry_settings ¶
prompt_retry_settings(
session,
default_retries=6,
default_delay=15,
*,
logger=None
)
Ask user for retry limits.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state to store retry settings. | required |
| default_retries | int | Default number of retry attempts. | 6 |
| default_delay | int | Default delay (in seconds) between retries. | 15 |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| dict or str | Dictionary with 'max_retries' and 'retry_delay_sec', or control token "BACK" / "EXIT". |
prompt_continue_confirmation ¶
prompt_continue_confirmation(session, *, logger=None)
Display a formatted download summary and confirm before starting downloads.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Session state to summarise. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
Returns:
| Type | Description |
|---|---|
| bool or str | True if user confirms, False if user declines, or control token "BACK" / "EXIT". |
Key Functions:
Each prompt function handles user input for a specific configuration parameter:
- prompt_data_provider() - Choose CDS or Open-Meteo
- prompt_dataset_short_name() - Choose ERA5-Land or ERA5
- prompt_cds_url() - CDS API URL
- prompt_cds_api_key() - CDS API key
- prompt_date_range() - Start and end dates
- prompt_coordinates() - Region bounds [N, W, S, E]
- prompt_variables() - Variable selection
- prompt_skip_overwrite_files() - Existing file handling
- prompt_parallelisation_settings() - Parallel download settings
- prompt_retry_settings() - Retry configuration
- prompt_continue_confirmation() - Final review and confirmation
Most prompt functions follow the same pattern; some take extra positional arguments, as the signatures above show.
Parameters:
- session (SessionState): Current session state
- logger (logging.Logger, keyword-only, optional): Logger instance
Returns:
- The validated input value, or
- "__EXIT__" if user wants to quit, or
- "__BACK__" if user wants to go to previous prompt
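The back/exit pattern can be sketched as a small loop over prompt steps. This is a minimal sketch and not the package's wizard: `run_steps`, `scripted`, and the plain-dict state are hypothetical, and the token strings follow the list above (the generated signatures elsewhere on this page spell them differently).

```python
EXIT, BACK = "__EXIT__", "__BACK__"  # token spelling taken from the list above


def run_steps(steps):
    """Run (key, prompt_fn) steps in order, honouring BACK and EXIT tokens."""
    state = {}
    i = 0
    while i < len(steps):
        key, prompt_fn = steps[i]
        result = prompt_fn(state)
        if result == EXIT:
            return None  # user quit the wizard
        if result == BACK:
            if i > 0:
                i -= 1
                state.pop(steps[i][0], None)  # clear the previous answer
            continue  # re-ask the (new) current step
        state[key] = result
        i += 1
    return state


def scripted(answers):
    """Stand-in for real user input: returns the next canned answer."""
    it = iter(answers)
    return lambda state: next(it)


steps = [
    ("data_provider", scripted(["cds", "cds"])),
    ("dataset_short_name", scripted([BACK, "era5-land"])),
]
final_state = run_steps(steps)
print(final_state)
```

Here the second step first answers "back", so the first step is asked again before the wizard completes.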
io/config_loader.py¶
JSON configuration file loading and validation.
weather_data_retrieval.io.config_loader ¶
load_and_validate_config ¶
load_and_validate_config(
path, *, logger=None, run_mode="automatic"
)
Load JSON config and validate it using the centralized validator. This lets the validator log coercions/warnings (e.g., case_by_case → skip_all).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to JSON config file. | required |
| logger | Logger | Logger instance for validation messages, by default None. | None |
| run_mode | str | Run mode, either 'interactive' or 'automatic'. | 'automatic' |
Returns:
| Type | Description |
|---|---|
| dict | Validated configuration dictionary. |
load_config ¶
load_config(file_path)
Load configuration from a JSON requirements file (without validation).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to JSON config file. | required |
Returns:
| Type | Description |
|---|---|
| dict | Configuration dictionary. |
Key Functions:
load_and_validate_config(path, logger, run_mode)
Loads a JSON config file and validates it.
Parameters:
- path (str): Path to JSON config file
- logger (logging.Logger, optional): Logger for validation messages
- run_mode (str): "interactive" or "automatic"
Returns:
- dict: Validated configuration
Raises:
- FileNotFoundError: If config file doesn't exist
- ValueError: If config is invalid
load_config(file_path)
Simple config loader without validation (for internal use).
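A loader of this kind can be sketched with the standard library alone. The function below is a hypothetical stand-in rather than the package's `load_config`; the `data_provider` key is borrowed from the SessionState usage example on this page, and the error behaviour is an assumption.

```python
import json
import tempfile
from pathlib import Path


def load_config_sketch(file_path):
    """Read a JSON file and return its top-level object as a dict."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Config file not found: {path}")
    with path.open(encoding="utf-8") as fh:
        config = json.load(fh)  # malformed JSON raises json.JSONDecodeError
    if not isinstance(config, dict):
        raise ValueError("Config root must be a JSON object")
    return config


# Demo: write a tiny config to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    cfg_file = Path(tmp) / "config.json"
    cfg_file.write_text(json.dumps({"data_provider": "cds"}), encoding="utf-8")
    loaded = load_config_sketch(cfg_file)

print(loaded)
```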
Data Provider Modules¶
sources/cds_era5.py¶
CDS/ERA5 data provider implementation.
weather_data_retrieval.sources.cds_era5 ¶
prepare_cds_download ¶
prepare_cds_download(
session,
filename_base,
year,
month,
save_dir,
*,
logger,
echo_console,
allow_prompts,
dataset_config_mapping=CDS_DATASETS
)
Check if a monthly ERA5 file already exists and decide whether to download.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Session containing user configuration. | required |
| filename_base | str | Base name for the file. | required |
| year | int | Year of the data to download. | required |
| month | int | Month of the data to download. | required |
| save_dir | Path | Directory to save downloaded files. | required |
| logger | Logger | Logger for logging messages. | required |
| echo_console | bool | Whether to echo prompts to console. | required |
| allow_prompts | bool | Whether to allow interactive prompts. | required |
| dataset_config_mapping | dict | Mapping of dataset short names to their configurations. | CDS_DATASETS |
Returns:
| Type | Description |
|---|---|
| tuple | (download, save_path): download (bool) is whether to perform the download; save_path (str) is the full path for the target file. |
execute_cds_download ¶
execute_cds_download(
session,
save_path,
year,
month,
*,
logger,
echo_console,
dataset_config_mapping=CDS_DATASETS
)
Execute a single ERA5 monthly download with retry logic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Session state containing the authenticated CDS API client. | required |
| save_path | str | Full path to save the downloaded file. | required |
| year | int | Year of the data to download. | required |
| month | int | Month of the data to download. | required |
| logger | Logger | Logger for logging messages. | required |
| echo_console | bool | Whether to echo prompts to console. | required |
| dataset_config_mapping | dict | Mapping of dataset short names to their configurations. | CDS_DATASETS |
Returns:
| Type | Description |
|---|---|
| tuple | (year, month, status), where status is "success" or "failed". |
download_cds_month ¶
download_cds_month(
session,
filename_base,
year,
month,
save_dir,
*,
logger,
echo_console,
allow_prompts,
successful_downloads,
failed_downloads,
skipped_downloads
)
Orchestrate ERA5 monthly download: handle file checks, then execute download.
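Parameters:
The parameters are not documented individually. Judging by the signature, they match those of prepare_cds_download above, plus the three result lists (successful_downloads, failed_downloads, skipped_downloads) described under orchestrate_cds_downloads.
Returns:
| Type | Description |
|---|---|
| tuple | (year, month, status), where status is "success", "failed", or "skipped". |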
plan_cds_months ¶
plan_cds_months(
session,
filename_base,
save_dir,
*,
logger,
echo_console,
allow_prompts
)
Build the list of months to download and which are being skipped due to existing files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Session containing user configuration. | required |
| filename_base | str | Base filename (without date or extension). | required |
| save_dir | Path | Directory to save downloaded files. | required |
| logger | Logger | Logger for logging messages. | required |
| echo_console | bool | Whether to echo prompts to console. | required |
| allow_prompts | bool | Whether to allow interactive prompts. | required |
Returns:
| Type | Description |
|---|---|
| tuple | (months_to_download, months_skipped) |
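The planning step amounts to enumerating the (year, month) pairs in the requested range and separating those that already have a file. The sketch below is a minimal illustration under that assumption; `months_between`, `split_plan`, and the `existing` set are hypothetical and do not reflect how the package actually checks files on disk.

```python
from datetime import date


def months_between(start: date, end: date):
    """List every (year, month) from start's month through end's month."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def split_plan(months, existing):
    """Split months into (to_download, skipped) given already-present months."""
    to_download = [m for m in months if m not in existing]
    skipped = [m for m in months if m in existing]
    return to_download, skipped


all_months = months_between(date(2023, 11, 15), date(2024, 2, 1))
to_download, skipped = split_plan(all_months, existing={(2023, 12)})
print(to_download, skipped)
```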
orchestrate_cds_downloads ¶
orchestrate_cds_downloads(
session,
filename_base,
save_dir,
successful_downloads,
failed_downloads,
skipped_downloads,
*,
logger,
echo_console,
allow_prompts,
dataset_config_mapping=CDS_DATASETS
)
Handle and orchestrate ERA5 monthly downloads, supporting parallel or sequential execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Session containing user configuration and authenticated client. | required |
| filename_base | str | Base filename (without date or extension). | required |
| save_dir | Path | Directory to save downloaded files. | required |
| successful_downloads | list | Mutable list to collect (year, month) tuples for successful downloads. | required |
| failed_downloads | list | Mutable list to collect (year, month) tuples for failed downloads. | required |
| skipped_downloads | list | Mutable list to collect (year, month) tuples for skipped downloads. | required |
| logger | Logger | Logger for logging messages. | required |
| echo_console | bool | Whether to echo prompts to console. | required |
| allow_prompts | bool | Whether to allow interactive prompts. | required |
| dataset_config_mapping | dict | Mapping of dataset configurations. | CDS_DATASETS |
Returns:
| Type | Description |
|---|---|
| None | |
Key Functions:
orchestrate_cds_downloads(session, filename_base, save_dir, ...)
Manages the entire CDS download workflow:
- Generates list of months to download
- Checks for existing files (skip logic)
- Coordinates parallel or sequential downloads
- Handles retries and failures
- Validates downloaded files
- Updates success/failure lists
Parameters:
- session (SessionState): Current session with all parameters
- filename_base (str): Base filename for output files
- save_dir (Path): Directory to save files
- successful_downloads (list): List to append successful months
- failed_downloads (list): List to append failed months
- skipped_downloads (list): List to append skipped months
- logger (logging.Logger): Logger instance
- echo_console (bool): Whether to echo progress
- allow_prompts (bool): Whether prompts are allowed (interactive mode)
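The choice between parallel and sequential execution can be illustrated with a thread pool. This is a sketch under stated assumptions: the page does not say which concurrency mechanism the package uses, and `fake_download` and `orchestrate_sketch` are hypothetical stand-ins that never touch the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(year, month):
    """Stand-in for a real monthly download; pretends February fails."""
    return (year, month, "failed" if month == 2 else "success")


def orchestrate_sketch(months, parallel=True, max_workers=2):
    """Run one download per month and sort the outcomes into two lists."""
    successful, failed = [], []
    if parallel:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(fake_download, y, m) for y, m in months]
            results = [f.result() for f in as_completed(futures)]
    else:
        results = [fake_download(y, m) for y, m in months]
    for year, month, status in results:
        (successful if status == "success" else failed).append((year, month))
    # Completion order is not deterministic in parallel mode, so sort.
    return sorted(successful), sorted(failed)


ok, bad = orchestrate_sketch([(2020, 1), (2020, 2), (2020, 3)])
print(ok, bad)
```

Collecting outcomes as (year, month) tuples mirrors the mutable result lists described in the parameter list above.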
download_monthly_era5_file(...)
Downloads a single month of ERA5 data.
Parameters:
- client (cdsapi.Client): Authenticated CDS client
- dataset_full_name (str): Full CDS dataset name
- year (int): Year to download
- month (int): Month to download
- variables (list): Variables to request
- region_bounds (list): Geographic bounds [N, W, S, E]
- output_file (Path): Where to save the file
- Various retry and logging parameters
Returns:
- Path: Path to downloaded file if successful, else None
sources/open_meteo.py¶
Open-Meteo data provider (planned, not yet implemented).
weather_data_retrieval.sources.open_meteo ¶
Coming Soon
This module is a placeholder for future Open-Meteo integration.
Utility Modules¶
utils/session_management.py¶
Session state tracking and management.
weather_data_retrieval.utils.session_management ¶
SessionState ¶
first_unfilled_key ¶
first_unfilled_key()
Return the first key in the ordered fields that is not filled.
This enables a simple wizard-like progression and supports
backtracking by clearing fields with unset(key).
to_dict ¶
to_dict(only_filled=False)
Flatten the session into a plain dict suitable for runner.run(...). If only_filled=True, include only keys that have been filled.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| only_filled | bool | Whether to include only filled keys, by default False. | False |
Returns:
| Type | Description |
|---|---|
| dict | Flattened session dictionary. |
get_cds_dataset_config ¶
get_cds_dataset_config(session, dataset_config_mapping)
Return dataset configuration dictionary based on session short name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | The current session state containing user selections. | required |
| dataset_config_mapping | dict | The mapping of dataset short names to their configurations. | required |
Returns:
| Type | Description |
|---|---|
| dict | The configuration dictionary for the selected dataset. |
map_config_to_session ¶
map_config_to_session(
cfg, session, *, logger=None, echo_console=False
)
Validate and map a loaded JSON config into SessionState.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | dict | Loaded configuration dictionary. | required |
| session | SessionState | The session state to populate. | required |
Returns:
| Type | Description |
|---|---|
| tuple (bool, list[str]) | (ok, messages): ok=False if any hard error prevents continuing. |
ensure_cds_connection ¶
ensure_cds_connection(
client,
creds,
max_reauth_attempts=6,
wait_between_attempts=15,
)
Ensure a valid CDS API client. Re-authenticate automatically if the connection drops.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| client | Client | Current CDS API client. | required |
| creds | dict | {'url': str, 'key': str} stored from initial login. | required |
| max_reauth_attempts | int | Maximum reconnection attempts before aborting. | 6 |
| wait_between_attempts | int | Wait time (seconds) between re-auth attempts. | 15 |
Returns:
| Type | Description |
|---|---|
| Client or None | Valid client or None if re-authentication ultimately fails. |
internet_speedtest ¶
internet_speedtest(
test_urls=None,
max_seconds=15,
logger=None,
echo_console=True,
)
Download ~100MB test file from a fast CDN to estimate speed (MB/s).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_urls | list[str] | List of URLs of the test files. | None |
| max_seconds | int | Maximum time to wait for a response, by default 15 seconds. | 15 |
Returns:
| Type | Description |
|---|---|
| float | Estimated download speed in Mbps. |
Key Classes:
SessionState
A simple key-value store for tracking configuration state during interactive sessions.
Methods:
- get(key, default=None) - Retrieve a value
- set(key, value) - Store a value
- unset(key) - Remove a value
- first_unfilled_key() - Get the next required key that hasn't been set
- to_dict() - Convert session to a dictionary (for saving)
Usage:
session = SessionState()
session.set("data_provider", "cds")
session.set("dataset_short_name", "era5-land")
provider = session.get("data_provider") # "cds"
config = session.to_dict() # {"data_provider": "cds", ...}
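To make the method list concrete, here is a minimal stand-in with the same method names. It is a sketch, not the real SessionState: the class name `MiniSession`, the constructor argument, and the use of None to mean "unfilled" are all assumptions.

```python
class MiniSession:
    """Hypothetical ordered key-value store mimicking the documented methods."""

    def __init__(self, keys):
        # Dicts preserve insertion order, which gives the wizard its step order.
        self._values = {key: None for key in keys}

    def get(self, key, default=None):
        value = self._values.get(key)
        return default if value is None else value

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values[key] = None  # supports stepping back in the wizard

    def first_unfilled_key(self):
        for key, value in self._values.items():
            if value is None:
                return key
        return None  # everything is filled

    def to_dict(self, only_filled=False):
        if only_filled:
            return {k: v for k, v in self._values.items() if v is not None}
        return dict(self._values)


mini = MiniSession(["data_provider", "dataset_short_name"])
mini.set("data_provider", "cds")
next_key = mini.first_unfilled_key()
print(next_key, mini.to_dict(only_filled=True))
```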
Key Functions:
internet_speedtest(test_urls, max_seconds, logger, echo_console)
Performs a quick internet speed test to estimate download times.
Parameters:
- test_urls (list, optional): URLs to test (uses defaults if None)
- max_seconds (int): Maximum time to spend testing
- logger (logging.Logger): Logger instance
- echo_console (bool): Whether to echo results
Returns:
- float: Estimated speed in Mbps
map_config_to_session(cfg, session, logger, echo_console)
Maps a configuration dictionary to a SessionState, validating as it goes.
utils/data_validation.py¶
Input validation functions.
weather_data_retrieval.utils.data_validation ¶
normalize_input ¶
normalize_input(value, category)
Normalize user input to canonical internal value as defined in NORMALIZATION_MAP.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | str | The user input value to normalize. | required |
| category | str | The category of normalization (e.g., 'data_provider', 'dataset_short_name') | required |
Returns:
| Type | Description |
|---|---|
| str | The normalized value. |
format_duration ¶
format_duration(seconds)
Convert seconds to a nice Hh Mm Ss string (with decimal seconds).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seconds | float | Duration in seconds. | required |
Returns:
| Type | Description |
|---|---|
| str | Formatted duration string. |
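One way to produce an "Hh Mm Ss" string with decimal seconds is shown below. The exact output format of the real function is not documented here, so the spacing and the single decimal place are assumptions, and `format_duration_sketch` is a hypothetical name.

```python
def format_duration_sketch(seconds: float) -> str:
    """Split a duration into hours, minutes, and (decimal) seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{int(hours)}h {int(minutes)}m {secs:.1f}s"


formatted = format_duration_sketch(3725.5)
print(formatted)
```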
format_coordinates_nwse ¶
format_coordinates_nwse(boundaries)
Extracts and formats coordinates as integers in N-W-S-E order. Used for compact representation in filenames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| boundaries | list | List of boundaries in the order [north, west, south, east] | required |
Returns:
| Type | Description |
|---|---|
| str | Formatted string in the format 'N{north}W{west}S{south}E{east}' |
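The documented 'N{north}W{west}S{south}E{east}' pattern can be sketched as follows. How the real function converts floats to integers (rounding versus truncation) and how it renders negative values is not stated, so rounding and a plain minus sign are assumptions; `format_nwse_sketch` is a hypothetical name.

```python
def format_nwse_sketch(boundaries) -> str:
    """Format [north, west, south, east] as a compact filename fragment."""
    north, west, south, east = (int(round(value)) for value in boundaries)
    return f"N{north}W{west}S{south}E{east}"


label = format_nwse_sketch([52.0, -3.0, 50.0, 2.0])
print(label)
```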
month_days ¶
month_days(year, month)
Get list of days in a month formatted as two-digit strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| year | int | Year of interest. | required |
| month | int | Month of interest. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List of days in the month as two-digit strings. |
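A list of two-digit day strings can be built with `calendar.monthrange`, which also handles leap years. This is a sketch of the described behaviour with a hypothetical name, not the package's implementation.

```python
import calendar


def month_days_sketch(year: int, month: int):
    """Return ['01', '02', ...] for every day of the given month."""
    n_days = calendar.monthrange(year, month)[1]  # (weekday of day 1, day count)
    return [f"{day:02d}" for day in range(1, n_days + 1)]


feb_2024 = month_days_sketch(2024, 2)
print(feb_2024[0], feb_2024[-1], len(feb_2024))
```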
validate_data_provider ¶
validate_data_provider(provider)
Ensure the data provider is recognized and implemented.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| provider | str | Name of the data provider. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if valid, False otherwise. |
validate_dataset_short_name ¶
validate_dataset_short_name(
dataset_short_name,
provider,
*,
logger=None,
echo_console=False
)
Check dataset compatibility with provider.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_short_name | str | Dataset short name. | required |
| provider | str | Name of the data provider. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if valid, False otherwise. |
validate_cds_api_key ¶
validate_cds_api_key(
url, key, *, logger=None, echo_console=False
)
Validate CDS API credentials by attempting to initialize a cdsapi.Client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | CDS API URL. | required |
| key | str | CDS API key. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
| echo_console | bool | Whether to echo messages to console, by default False. | False |
Returns:
| Type | Description |
|---|---|
| Client or None | Authenticated client if successful, otherwise None. |
validate_directory ¶
validate_directory(path)
Check if path exists or can be created.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Directory path to validate. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if path exists or was created successfully, False otherwise. |
validate_date ¶
validate_date(
value,
allow_month_only=False,
*,
logger=None,
echo_console=False
)
Validate date format as YYYY-MM-DD or optionally YYYY-MM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | str | Date string to validate. | required |
| allow_month_only | bool | If True, also accept YYYY-MM format, by default False. | False |
Returns:
| Type | Description |
|---|---|
| bool | True if valid, False otherwise. |
parse_date_with_defaults ¶
parse_date_with_defaults(
date_str, default_to_month_end=False
)
Parse date string and apply defaults for incomplete dates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| date_str | str | Date string in format YYYY-MM-DD or YYYY-MM. | required |
| default_to_month_end | bool | If True and date is YYYY-MM format, default to last day of month. If False and date is YYYY-MM format, default to first day of month. | False |
Returns:
| Type | Description |
|---|---|
| tuple[datetime, str] | Tuple of (parsed datetime object, ISO format string YYYY-MM-DD) |
Raises:
| Type | Description |
|---|---|
| ValueError | If date string is invalid. |
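The month-only defaulting rule can be sketched with `datetime.strptime` and `calendar.monthrange`. This is an illustrative re-implementation under a hypothetical name; the real function's parsing strategy may differ.

```python
import calendar
from datetime import datetime


def parse_date_sketch(date_str: str, default_to_month_end: bool = False):
    """Parse YYYY-MM-DD, or YYYY-MM with a defaulted day; return (datetime, ISO str)."""
    try:
        parsed = datetime.strptime(date_str, "%Y-%m-%d")
    except ValueError:
        # Not a full date: try month-only. An invalid string raises ValueError here.
        month_start = datetime.strptime(date_str, "%Y-%m")
        if default_to_month_end:
            last_day = calendar.monthrange(month_start.year, month_start.month)[1]
            parsed = month_start.replace(day=last_day)
        else:
            parsed = month_start
    return parsed, parsed.strftime("%Y-%m-%d")


start_dt, start_iso = parse_date_sketch("2024-02")
end_dt, end_iso = parse_date_sketch("2024-02", default_to_month_end=True)
print(start_iso, end_iso)
```

Passing `default_to_month_end=True` for the end date and False for the start date matches the behaviour described for prompt_date_range.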
clamp_era5_available_end_date ¶
clamp_era5_available_end_date(
end, *, logger=None, echo_console=False
)
Clamp end date to ERA5 data availability boundary (8 days ago).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| end | datetime | Desired end date. | required |
Returns:
| Type | Description |
|---|---|
| datetime | Clamped end date. |
Notes:
- ERA5 data is available up to 8 days prior to the current date.
- The 8-day lag is used to ensure data availability.
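Clamping to the 8-day availability boundary reduces to taking the earlier of two dates. In the sketch below the `now` and `lag_days` parameters are hypothetical additions that make the example deterministic; the real function takes only the end date plus logging options.

```python
from datetime import datetime, timedelta


def clamp_end_date_sketch(end: datetime, now: datetime = None, lag_days: int = 8):
    """Return end, or the latest available date if end lies beyond it."""
    now = now or datetime.now()
    latest_available = now - timedelta(days=lag_days)
    return min(end, latest_available)


fixed_now = datetime(2024, 1, 20)
clamped = clamp_end_date_sketch(datetime(2024, 1, 19), now=fixed_now)
unchanged = clamp_end_date_sketch(datetime(2024, 1, 1), now=fixed_now)
print(clamped.date(), unchanged.date())
```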
validate_coordinates ¶
validate_coordinates(north, west, south, east)
Ensure coordinates are within realistic bounds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| north | int or float | Northern latitude boundary. | required |
| west | int or float | Western longitude boundary. | required |
| south | int or float | Southern latitude boundary. | required |
| east | int or float | Eastern longitude boundary. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if coordinates are valid, False otherwise. |
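The page does not spell out what "realistic bounds" means. The sketch below assumes the usual geographic limits (latitude within [-90, 90], longitude within [-180, 180]) and that north must lie above south; it deliberately does not require west < east, since a box could cross the antimeridian. All of these rules are assumptions, and `validate_coordinates_sketch` is a hypothetical name.

```python
def validate_coordinates_sketch(north, west, south, east) -> bool:
    """Check a bounding box against assumed geographic limits."""
    latitudes_ok = -90 <= south <= 90 and -90 <= north <= 90
    longitudes_ok = -180 <= west <= 180 and -180 <= east <= 180
    return latitudes_ok and longitudes_ok and north > south


box_ok = validate_coordinates_sketch(52, -3, 50, 2)
box_bad = validate_coordinates_sketch(95, -3, 50, 2)
print(box_ok, box_bad)
```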
validate_variables ¶
validate_variables(
variable_list,
variable_restrictions,
restriction_allow=False,
)
Ensure user-specified variables are available for this dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| variable_list | list[str] | List of variable names to validate. | required |
| variable_restrictions | list[str] | List of variables that are either allowed or disallowed. | required |
| restriction_allow | bool | If True, variable_restrictions is an allowlist (i.e. in). If False, it's a denylist (i.e. not in). | False |
Returns:
| Type | Description |
|---|---|
| bool | True if all variables are valid, False otherwise. |
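The allowlist/denylist switch can be sketched in a few lines. The function name and the variable names in the demo are illustrative only, and the handling of an empty variable list (accepted here) is an assumption.

```python
def validate_variables_sketch(variable_list, variable_restrictions, restriction_allow=False):
    """Check variables against an allowlist (True) or a denylist (False)."""
    restricted = set(variable_restrictions)
    if restriction_allow:
        # Allowlist: every requested variable must appear in the list.
        return all(name in restricted for name in variable_list)
    # Denylist: no requested variable may appear in the list.
    return all(name not in restricted for name in variable_list)


requested = ["2m_temperature", "total_precipitation"]
allowed = validate_variables_sketch(requested, ["2m_temperature", "total_precipitation"], restriction_allow=True)
denied = validate_variables_sketch(requested, ["total_precipitation"])
print(allowed, denied)
```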
validate_existing_file_action ¶
validate_existing_file_action(
session, *, allow_prompts, logger, echo_console=False
)
Normalize existing_file_action for the current run mode.
- If 'case_by_case' is set but prompts are not allowed (automatic mode), coerce to 'skip_all' and warn.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | Any | Current session state. | required |
| allow_prompts | bool | Whether prompts are allowed (i.e., interactive mode). | required |
| logger | Logger | Logger for logging messages. | required |
| echo_console | bool | Whether to echo messages to console. | False |
Returns:
| Type | Description |
|---|---|
| str | Normalized existing_file_action policy. |
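The coercion rule can be sketched as follows. To stay self-contained, the sketch takes the policy as a plain string rather than a session object; that simplification and the name `normalize_action_sketch` are assumptions, not the package's API.

```python
import logging


def normalize_action_sketch(action: str, *, allow_prompts: bool, logger: logging.Logger) -> str:
    """Coerce 'case_by_case' to 'skip_all' when prompting is not possible."""
    if action == "case_by_case" and not allow_prompts:
        logger.warning(
            "existing_file_action='case_by_case' needs prompts; using 'skip_all' instead."
        )
        return "skip_all"
    # Any other policy is returned unchanged.
    return action


sketch_logger = logging.getLogger("wdr_action_sketch")
sketch_logger.addHandler(logging.NullHandler())
print(normalize_action_sketch("case_by_case", allow_prompts=False, logger=sketch_logger))
print(normalize_action_sketch("case_by_case", allow_prompts=True, logger=sketch_logger))
```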
validate_config ¶
validate_config(
config,
*,
logger=None,
run_mode="automatic",
echo_console=False,
live_auth_check=False
)
Entry point. Validates common shape then dispatches to provider-specific validator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Configuration dictionary. | required |
| logger | Logger | Logger for logging messages, by default None. | None |
| run_mode | str | Run mode, either 'interactive' or 'automatic', by default "automatic". | 'automatic' |
| echo_console | bool | Whether to echo messages to console, by default False. | False |
| live_auth_check | bool | Whether to perform live authentication checks (e.g., CDS API), by default False. | False |
Returns:
| Type | Description |
|---|---|
| None | |
Key Functions:
validate_config(config, logger, run_mode)
Comprehensive configuration validation. Checks:
- All required fields are present
- Data types are correct
- Values are in valid ranges
- Enumerations match allowed values
- Dates are properly formatted and logical
- Region bounds are valid
Side Effects:
- Clamps out-of-range values with warnings
- Converts incompatible settings (e.g., case_by_case → skip_all in automatic mode)
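To make the kinds of checks listed above more concrete, here is a minimal sketch of a "common shape" validation. It is not the package's validator: `check_common_shape_sketch` and `REQUIRED_KEYS_SKETCH` are hypothetical, the key list is simply taken from the example config in the Usage Examples on this page, and raising ValueError is assumed from the way that example catches it.

```python
from datetime import datetime

# Hypothetical key list, taken from the example config on this page.
REQUIRED_KEYS_SKETCH = (
    "data_provider", "dataset_short_name", "start_date",
    "end_date", "region_bounds", "variables",
)


def check_common_shape_sketch(config: dict) -> None:
    """Raise ValueError when the config is missing keys or has illogical dates."""
    missing = [key for key in REQUIRED_KEYS_SKETCH if key not in config]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    if not isinstance(config["variables"], list) or not config["variables"]:
        raise ValueError("'variables' must be a non-empty list")
    # strptime itself raises ValueError for malformed date strings.
    start = datetime.strptime(config["start_date"], "%Y-%m-%d")
    end = datetime.strptime(config["end_date"], "%Y-%m-%d")
    if start > end:
        raise ValueError("'start_date' must not be later than 'end_date'")


example = {
    "data_provider": "cds", "dataset_short_name": "era5-land",
    "start_date": "2023-01-01", "end_date": "2023-03-31",
    "region_bounds": [40, -10, 35, 5], "variables": ["2m_temperature"],
}
check_common_shape_sketch(example)  # passes silently

try:
    check_common_shape_sketch({**example, "end_date": "2022-12-31"})
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```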
validate_cds_api_key(api_url, api_key, logger, echo_console)
Tests CDS API credentials by attempting to create a client connection.
Returns:
- cdsapi.Client if successful, else None
invalid_era5_world_variables(variables)
Checks for invalid variable names in ERA5 (world) dataset.
Returns:
- list: Invalid variable names (empty if all valid)
invalid_era5_land_variables(variables)
Checks for invalid variable names in ERA5-Land dataset.
Returns:
- list: Invalid variable names (empty if all valid)
Formatting Helpers:
- format_coordinates_nwse(boundaries) - Convert bounds to a string like "N40W10S35E5"
- format_duration(seconds) - Convert seconds to a human-readable duration
- validate_date_format(date_string) - Check if a date string is valid YYYY-MM-DD
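The coordinate string format can be sketched as below. The mapping is an assumption inferred from the example string "N40W10S35E5" and the bounds [40, -10, 35, 5] used elsewhere on this page: each value is rendered as its absolute value after a fixed letter for its position. `format_nwse_sketch` is a hypothetical name, and the real helper may treat signs or decimals differently.

```python
def format_nwse_sketch(boundaries) -> str:
    """Render [north, west, south, east] as a compact string such as N40W10S35E5."""
    north, west, south, east = boundaries
    parts = zip("NWSE", (north, west, south, east))
    # Assumption: the sign is dropped and trailing zeros are trimmed via the 'g' format.
    return "".join(f"{letter}{abs(value):g}" for letter, value in parts)


print(format_nwse_sketch([40, -10, 35, 5]))
```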
utils/file_management.py¶
File naming, checking, and estimation functions.
weather_data_retrieval.utils.file_management ¶
generate_filename_hash ¶
generate_filename_hash(
dataset_short_name, variables, boundaries
)
Generate a unique hash for the download parameters that will be used to create the filename.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_short_name | str | The dataset short name (era5-world etc). | required |
| variables | list[str] | List of variable names. | required |
| boundaries | list[float] | List of boundaries [north, west, south, east]. | required |
Returns:
| Type | Description |
|---|---|
| str | A unique hash string representing the download parameters. |
find_existing_month_file ¶
find_existing_month_file(
save_dir, filename_base, year, month
)
Tolerant matcher that finds an existing file for the given month.
Accepts both _YYYY-MM.ext and _YYYY_MM.ext patterns and any extension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| save_dir | Path | Directory where files are saved. | required |
| filename_base | str | Base filename (without date or extension). | required |
| year | int | Year of the file. | required |
| month | int | Month of the file. | required |
Returns:
| Type | Description |
|---|---|
| Optional[Path] | Path to the existing file if found, else None. |
estimate_era5_monthly_file_size ¶
estimate_era5_monthly_file_size(
variables,
area,
grid_resolution=0.25,
timestep_hours=1.0,
bytes_per_value=4.0,
)
Estimate ERA5 GRIB file size (MB) for a monthly dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| variables | list[str] | Variables requested (e.g. ['2m_temperature', 'total_precipitation']). | required |
| area | list[float] | [north, west, south, east] geographic bounds in degrees. | required |
| grid_resolution | float | Grid spacing in degrees (default 0.25° for ERA5). | 0.25 |
| timestep_hours | float | Temporal resolution in hours (1 = hourly, 3 = 3-hourly, 6 = 6-hourly, etc.). | 1.0 |
| bytes_per_value | float | Bytes per gridpoint per variable (float32 = 4 bytes). | 4.0 |
Returns:
| Type | Description |
|---|---|
| float | Estimated monthly file size in MB. |
estimate_cds_download ¶
estimate_cds_download(
variables,
area,
start_date,
end_date,
observed_speed_mbps,
grid_resolution=0.25,
timestep_hours=1.0,
bytes_per_value=4.0,
overhead_per_request_s=180.0,
overhead_per_var_s=12.0,
)
Estimate per-file and total download size/time for CDS (ERA5) retrievals, using an empirically grounded file size model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| variables | list[str] | Variables selected (e.g. ['2m_temperature', 'total_precipitation']). | required |
| area | list[float] | [north, west, south, east] bounds in degrees. | required |
| start_date | str | Start of the date range (YYYY-MM-DD). | required |
| end_date | str | End of the date range (YYYY-MM-DD). | required |
| observed_speed_mbps | float | Measured internet speed in megabits per second (Mbps). | required |
| grid_resolution | float | Grid resolution in degrees (default 0.25°). | 0.25 |
| timestep_hours | float | Temporal resolution in hours (default 1-hourly). | 1.0 |
| bytes_per_value | float | Bytes per stored value (float32 = 4). | 4.0 |
| overhead_per_request_s | float | Fixed CDS request overhead time (queue/prep). | 180.0 |
| overhead_per_var_s | float | Per-variable overhead for CDS throttling/prep. | 12.0 |
Returns:
| Type | Description |
|---|---|
| dict | { "months": int, "file_size_MB": float, "total_size_MB": float, "time_per_file_sec": float, "total_time_sec": float } |
expected_save_stem ¶
expected_save_stem(save_dir, filename_base, year, month)
Construct canonical save stem (without extension) for monthly data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| save_dir | str \| Path \| None | Base directory for saving. If None, defaults to osme_common.paths.data_dir(). | required |
| filename_base | str | Base name without date or extension. | required |
| year | int | Year of the file. | required |
| month | int | Month of the file. | required |
Returns:
| Type | Description |
|---|---|
| Path | Resolved path under the proper data directory. |
expected_save_path ¶
expected_save_path(
save_dir, filename_base, year, month, data_format="grib"
)
Construct canonical save path for monthly data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| save_dir | str \| Path \| None | Base directory for saving. If None, defaults to osme_common.paths.data_dir(). | required |
| filename_base | str | Base name without date or extension. | required |
| year | int | Year of the file. | required |
| month | int | Month of the file. | required |
| data_format | str | File extension, e.g., 'grib' or 'nc'. | 'grib' |
Returns:
| Type | Description |
|---|---|
| Path | Resolved path under the proper data directory. |
is_zip_file ¶
is_zip_file(path)
Check if a file is a ZIP archive by reading its magic number.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to the file to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the file is a ZIP archive, False otherwise. |
is_grib_file ¶
is_grib_file(path)
Check if a file is a GRIB file by reading its magic number. GRIB files begin with ASCII bytes 'GRIB'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to the file to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the file is a GRIB file, False otherwise. |
unpack_zip_to_grib ¶
unpack_zip_to_grib(zip_path, final_grib_path)
Extract a ZIP and move the contained GRIB-like file to final_grib_path. If multiple candidates exist, prefer .grib/.grb/.grib2; otherwise take the largest file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| zip_path | Path | Path to the ZIP file. | required |
| final_grib_path | Path | Desired final path for the extracted GRIB file. | required |
Returns:
| Type | Description |
|---|---|
| Path | Path to the final GRIB file. |
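The selection rule in unpack_zip_to_grib (prefer GRIB-like extensions, otherwise take the largest file) can be sketched with the standard library. This is one possible reading of that rule, not the package's code: `unpack_sketch` is a hypothetical name, and picking the largest file among several GRIB-suffixed candidates is an assumption.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

GRIB_SUFFIXES = {".grib", ".grb", ".grib2"}


def unpack_sketch(zip_path, final_grib_path) -> Path:
    """Extract a ZIP and move its most GRIB-like member to final_grib_path."""
    zip_path, final_grib_path = Path(zip_path), Path(final_grib_path)
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        files = [p for p in Path(tmp).rglob("*") if p.is_file()]
        if not files:
            raise ValueError(f"No files found in {zip_path}")
        # Prefer GRIB-like suffixes; fall back to every extracted file.
        preferred = [p for p in files if p.suffix.lower() in GRIB_SUFFIXES]
        chosen = max(preferred or files, key=lambda p: p.stat().st_size)
        final_grib_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(chosen), str(final_grib_path))
    return final_grib_path


with tempfile.TemporaryDirectory() as work:
    archive_path = Path(work) / "download.zip"
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("readme.txt", "x" * 1000)   # larger, but not GRIB-like
        zf.writestr("data.grib", b"GRIB-demo")  # smaller, preferred by suffix
    result = unpack_sketch(archive_path, Path(work) / "out" / "final.grib")
    extracted_bytes = result.read_bytes()
print(extracted_bytes)
```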
Key Functions:
generate_filename_hash(dataset_short_name, variables, boundaries)
Creates a unique 12-character hash from download parameters. This ensures files with different parameters don't collide.
Returns:
- str: 12-character hash (e.g., "abc123def456")
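One way to derive a short, stable hash from download parameters is sketched below. The hashing scheme is an assumption: this page only states that the result is a 12-character hash, so SHA-256, the JSON serialization, and the name `filename_hash_sketch` are illustrative choices rather than the package's actual algorithm.

```python
import hashlib
import json


def filename_hash_sketch(dataset_short_name, variables, boundaries, length=12) -> str:
    """Return a short hex digest that is stable for identical parameters."""
    payload = json.dumps(
        {
            "dataset": dataset_short_name,
            # Sorting makes the hash independent of the order variables are listed in.
            "variables": sorted(variables),
            "boundaries": [float(b) for b in boundaries],
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


hash_a = filename_hash_sketch("era5-land", ["2m_temperature", "total_precipitation"], [40, -10, 35, 5])
hash_b = filename_hash_sketch("era5-land", ["total_precipitation", "2m_temperature"], [40, -10, 35, 5])
hash_c = filename_hash_sketch("era5-land", ["2m_temperature"], [41, -10, 35, 5])
print(hash_a, hash_a == hash_b, hash_a == hash_c)
```

Normalizing the inputs before hashing (sorted variables, float bounds) is what keeps equivalent requests from producing different filenames.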
find_existing_month_file(save_dir, filename_base, year, month)
Searches for an existing file for a given month, handling different date separators and extensions.
Returns:
- Path if file exists, else None
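A tolerant month-file lookup along these lines can be sketched with pathlib globbing. It assumes filenames of the form `<filename_base>_YYYY-MM.<ext>` or `<filename_base>_YYYY_MM.<ext>`, as described for find_existing_month_file; `find_month_file_sketch` is a hypothetical name, and the sketch does not escape glob metacharacters in the base name.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_month_file_sketch(save_dir, filename_base: str, year: int, month: int) -> Optional[Path]:
    """Return the first file matching either date separator, or None."""
    directory = Path(save_dir)
    for separator in ("-", "_"):
        pattern = f"{filename_base}_{year:04d}{separator}{month:02d}.*"
        matches = sorted(p for p in directory.glob(pattern) if p.is_file())
        if matches:
            return matches[0]
    return None


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "era5_abc_2023_01.grib").touch()
    found = find_month_file_sketch(tmp, "era5_abc", 2023, 1)
    missing = find_month_file_sketch(tmp, "era5_abc", 2023, 2)
    found_name = found.name if found else None
print(found_name, missing)
```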
estimate_era5_monthly_file_size(variables, area, grid_resolution, ...)
Estimates file size for a monthly ERA5 download based on empirical data.
Parameters:
- variables (list): Variables to download
- area (list): Region bounds
- grid_resolution (float): Grid spacing in degrees
- timestep_hours (float): Temporal resolution
- bytes_per_value (float): Storage size per value
Returns:
- float: Estimated size in MB
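For intuition about how these parameters interact, the sketch below computes a naive, uncompressed size from grid points, timesteps, and bytes per value. It is deliberately not the package's empirical model, which is not shown on this page; `naive_monthly_size_mb`, the fixed 30-day month, and the use of 1 MB = 10^6 bytes are all assumptions.

```python
def naive_monthly_size_mb(
    variables,
    area,
    grid_resolution=0.25,
    timestep_hours=1.0,
    bytes_per_value=4.0,
    days_in_month=30,  # assumption: a fixed-length month keeps the arithmetic simple
) -> float:
    """Raw array size in MB: grid points x timesteps x variables x bytes."""
    north, west, south, east = area
    # Both edges of the box are included, hence the +1 on each axis.
    n_lat = int(round((north - south) / grid_resolution)) + 1
    n_lon = int(round((east - west) / grid_resolution)) + 1
    n_steps = int(days_in_month * 24 / timestep_hours)
    total_bytes = n_lat * n_lon * n_steps * len(variables) * bytes_per_value
    return total_bytes / 1e6


size_mb = naive_monthly_size_mb(["2m_temperature"], [40, -10, 35, 5])
print(f"{size_mb:.2f} MB")
```

Real GRIB files are packed and carry headers, so actual sizes can differ noticeably from this raw figure.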
estimate_cds_download(variables, area, start_date, end_date, observed_speed_mbps, ...)
Comprehensive download estimation including size, time, and parallelization effects.
Returns:
- dict:
{
    "months": 12,              # Number of monthly files
    "file_size_MB": 42.3,      # Average file size
    "total_size_MB": 507.6,    # Total download size
    "time_per_file_sec": 180,  # Time per file
    "total_time_sec": 2160     # Total estimated time
}
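The relationship between the returned fields can be sketched as follows. How the package actually combines transfer time with the two overhead parameters is not documented here, so the additive formula, the helper names `months_between` and `estimate_sketch`, and passing the per-file size in directly are assumptions for illustration only.

```python
from datetime import date


def months_between(start_date: str, end_date: str) -> int:
    """Count calendar months touched by an inclusive YYYY-MM-DD range."""
    start, end = date.fromisoformat(start_date), date.fromisoformat(end_date)
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def estimate_sketch(
    file_size_mb, n_variables, start_date, end_date, observed_speed_mbps,
    overhead_per_request_s=180.0, overhead_per_var_s=12.0,
) -> dict:
    """Assumed model: per-file time = transfer time + fixed and per-variable overhead."""
    months = months_between(start_date, end_date)
    transfer_s = file_size_mb * 8.0 / observed_speed_mbps  # megabytes to megabits
    time_per_file = transfer_s + overhead_per_request_s + overhead_per_var_s * n_variables
    return {
        "months": months,
        "file_size_MB": file_size_mb,
        "total_size_MB": file_size_mb * months,
        "time_per_file_sec": time_per_file,
        "total_time_sec": time_per_file * months,
    }


sketch_estimate = estimate_sketch(50.0, 2, "2023-01-01", "2023-12-31", observed_speed_mbps=100.0)
print(sketch_estimate)
```

With the default overheads, queueing time dominates the transfer time for small files, which is why a faster connection may change the total only slightly.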
File Format Helpers:
- is_zip_file(path) - Check if file is a ZIP archive
- is_grib_file(path) - Check if file is a GRIB file
- unpack_zip_to_grib(zip_path, final_grib_path) - Extract GRIB from ZIP
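Magic-number checks like these can be sketched by comparing the first bytes of a file with a known signature. `has_magic` is a hypothetical helper, not the package's function; the signatures themselves are standard (GRIB messages start with the ASCII bytes "GRIB", and a non-empty ZIP archive starts with a local file header, "PK\x03\x04").

```python
import tempfile
import zipfile
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"  # local file header of a non-empty ZIP archive
GRIB_MAGIC = b"GRIB"       # first bytes of a GRIB message


def has_magic(path, magic: bytes) -> bool:
    """Return True when the file starts with the given signature."""
    try:
        with open(path, "rb") as handle:
            return handle.read(len(magic)) == magic
    except OSError:
        # Unreadable or missing files are treated as "not this format".
        return False


with tempfile.TemporaryDirectory() as tmp:
    grib_path = Path(tmp) / "sample.grib"
    grib_path.write_bytes(b"GRIB" + b"\x00" * 12)
    zip_path = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("sample.grib", b"GRIB")
    grib_is_grib = has_magic(grib_path, GRIB_MAGIC)
    zip_is_zip = has_magic(zip_path, ZIP_MAGIC)
    zip_is_grib = has_magic(zip_path, GRIB_MAGIC)
print(grib_is_grib, zip_is_zip, zip_is_grib)
```

Checking content rather than the file extension helps when a download is delivered as a ZIP under a .grib name.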
utils/logging.py¶
Logging configuration and utilities.
weather_data_retrieval.utils.logging ¶
build_download_summary ¶
build_download_summary(
session, estimates, speed_mbps, save_dir=None
)
Construct a formatted summary string of the current download configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | SessionState | Current session state containing all parameters. | required |
| estimates | dict | Dictionary containing download size and time estimates. | required |
| speed_mbps | float | Measured or estimated internet speed in Mbps. | required |
Returns:
| Type | Description |
|---|---|
| str | Nicely formatted summary string for display or logging. |
setup_logger ¶
setup_logger(
save_dir=None, run_mode="interactive", verbose=False
)
Initialize and return a configured logger.
Logs are written to
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| save_dir | str or None | Directory to save log files. If None, defaults to osme_common.paths.log_dir(). | None |
| run_mode | str | Either 'interactive' or 'automatic', by default 'interactive'. | 'interactive' |
| verbose | bool | Whether to echo logs to console in automatic mode, by default False. | False |
Returns:
| Type | Description |
|---|---|
| Logger | Configured logger instance. |
log_msg ¶
log_msg(
msg,
logger,
*,
level="info",
echo_console=False,
force=False
)
Unified logging utility.
- Always logs to file.
- Optionally echoes to console via tqdm.write.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| echo_console | bool | Print to console when True (used for verbose mode). | False |
| force | bool | Print to console regardless of echo_console (used for summaries/errors). | False |
create_final_log_file ¶
create_final_log_file(
session,
filename_base,
original_logger,
*,
delete_original=True,
reattach_to_final=True
)
Create a final log file with the same naming pattern as data files. Copies content from the original log file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| session | Any(SessionState) | Current session state. | required |
| filename_base | str | Base filename pattern (same as data files). | required |
| original_logger | Logger | The original logger instance. | required |
| delete_original | bool | Whether to delete the original log file after creating the final one, by default True. | True |
| reattach_to_final | bool | Whether to reattach the logger to the final log file, by default True. | True |
Returns:
| Type | Description |
|---|---|
| str | Path to the final log file. |
Key Functions:
setup_logger(save_dir, run_mode, verbose)
Initializes a configured logger for the package.
Parameters:
- save_dir (str, optional): Directory for log files (defaults to logs/weather_data_retrieval/)
- run_mode (str): "interactive" or "automatic"
- verbose (bool): Whether to show console output
Returns:
- logging.Logger: Configured logger instance
Configuration:
- File handler: DEBUG level (everything)
- Console handler: INFO level (only in interactive or verbose mode)
- Format: "%(asctime)s | %(levelname)s | %(message)s"
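The handler configuration listed above can be reproduced with the standard logging module. This is a sketch under assumptions, not the package's setup_logger: `setup_logger_sketch` takes an explicit log file path, and the logger name "wdr_sketch" is made up for the example.

```python
import logging
import tempfile
from pathlib import Path

LOG_FORMAT = "%(asctime)s | %(levelname)s | %(message)s"


def setup_logger_sketch(log_file, run_mode="interactive", verbose=False) -> logging.Logger:
    """File handler at DEBUG; console handler at INFO in interactive or verbose mode."""
    logger = logging.getLogger("wdr_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter(LOG_FORMAT)

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    if run_mode == "interactive" or verbose:
        console_handler = logging.StreamHandler()
        console_handler.setLevel(logging.INFO)
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    logger = setup_logger_sketch(Path(tmp) / "run.log", run_mode="automatic")
    logger.debug("written to the file only")
    handler_count = len(logger.handlers)
    # Close handlers so the temporary directory can be removed cleanly.
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
print(handler_count)
```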
log_msg(msg, logger, level, echo_console, force)
Unified logging utility that writes to file and optionally to console.
Parameters:
- msg (str): Message to log
- logger (logging.Logger): Logger instance
- level (str): "debug", "info", "warning", "error", "exception"
- echo_console (bool): Print to console
- force (bool): Print even if echo_console=False (for critical messages)
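The file-plus-optional-console behavior can be sketched as below. `log_msg_sketch` and its `write` parameter are hypothetical: the docstring above says the real function echoes through tqdm.write, whereas this sketch defaults to print so that it runs without third-party packages.

```python
import logging


def log_msg_sketch(msg, logger, *, level="info", echo_console=False, force=False, write=print):
    """Log via the named level method, then echo to console if requested or forced."""
    # Fall back to logger.info when an unknown level name is passed.
    log_method = getattr(logger, level, logger.info)
    log_method(msg)
    if echo_console or force:
        write(msg)


echoed = []
quiet_logger = logging.getLogger("wdr_log_msg_sketch")
quiet_logger.addHandler(logging.NullHandler())
log_msg_sketch("hidden from console", quiet_logger, write=echoed.append)
log_msg_sketch("always shown", quiet_logger, level="warning", force=True, write=echoed.append)
print(echoed)
```

Passing a different `write` callable (for example tqdm's `tqdm.write`, if tqdm is installed) keeps console output from breaking an active progress bar.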
build_download_summary(session, estimates, speed_mbps, save_dir)
Constructs a formatted summary of the download configuration.
Returns:
- str: Multi-line summary string
create_final_log_file(session, filename_base, original_logger, ...)
Creates a final log file with the same naming convention as data files.
Parameters:
- session (SessionState): Current session
- filename_base (str): Base filename
- original_logger (logging.Logger): Current logger
- delete_original (bool): Remove temporary log
- reattach_to_final (bool): Continue logging to final file
Returns:
- str: Path to final log file
Usage Examples¶
Creating a Custom Download Script¶
from weather_data_retrieval.runner import run
# Define your configuration
config = {
    "data_provider": "cds",
    "dataset_short_name": "era5-land",
    "api_url": "https://cds.climate.copernicus.eu/api",
    "api_key": "YOUR_KEY",
    "start_date": "2023-01-01",
    "end_date": "2023-03-31",
    "region_bounds": [40, -10, 35, 5],
    "variables": ["2m_temperature"],
    "existing_file_action": "skip_all",
    "retry_settings": {"max_retries": 3, "retry_delay_sec": 15},
    "parallel_settings": {"enabled": True, "max_concurrent": 2}
}
# Run the download
exit_code = run(config, run_mode="automatic", verbose=True)
if exit_code == 0:
    print("Download completed successfully!")
elif exit_code == 2:
    print("Download completed with some failures. Check logs.")
else:
    print("Download failed. Check logs.")
Validating a Config Before Running¶
from weather_data_retrieval.utils.data_validation import validate_config
from weather_data_retrieval.utils.logging import setup_logger
logger = setup_logger(run_mode="automatic", verbose=True)
config = { ... } # Your config
try:
    validate_config(config, logger=logger, run_mode="automatic")
    print("✓ Configuration is valid!")
except ValueError as e:
    print(f"✗ Configuration error: {e}")
Estimating Download Before Running¶
from weather_data_retrieval.utils.file_management import estimate_cds_download
from weather_data_retrieval.utils.session_management import internet_speedtest
# Test internet speed
speed_mbps = internet_speedtest(max_seconds=10, logger=None, echo_console=True)
# Estimate download
estimates = estimate_cds_download(
    variables=["2m_temperature", "total_precipitation"],
    area=[40, -10, 35, 5],
    start_date="2023-01-01",
    end_date="2023-12-31",
    observed_speed_mbps=speed_mbps,
    grid_resolution=0.1
)
print(f"Estimated files: {estimates['months']}")
print(f"Total size: {estimates['total_size_MB']:.1f} MB")
print(f"Estimated time: {estimates['total_time_sec'] / 60:.1f} minutes")
Extending the Package¶
Adding a New Data Provider¶
To add a new provider (e.g., Open-Meteo):
1. Create provider module: sources/my_provider.py
2. Implement download function:
    def orchestrate_my_provider_downloads(
        session,
        filename_base,
        save_dir,
        successful_downloads,
        failed_downloads,
        skipped_downloads,
        logger,
        echo_console,
        allow_prompts,
    ):
        # Your implementation
        pass
3. Add to session mapping: Update session_management.py to handle the new provider
4. Add validation: Update data_validation.py to validate provider-specific params
5. Update prompts: Add provider selection in io/prompts.py
Adding New Variables¶
Variables are validated against hard-coded lists in data_validation.py:
# In data_validation.py
ERA5_LAND_VARIABLES = [
    "2m_temperature",
    "total_precipitation",
    # ... add new variable here
]
Adding New Configuration Parameters¶
1. Define in SessionState: Add to the required/optional keys list
2. Add validation: Update validate_config() in data_validation.py
3. Add prompt (if interactive): Create prompt_new_parameter() in io/prompts.py
4. Use in runner: Access via session.get("new_parameter") in your code
Testing¶
Running Unit Tests¶
# From repo root
pytest packages/weather_data_retrieval/tests/
Testing Individual Modules¶
# Test validation
from weather_data_retrieval.utils.data_validation import validate_config
config = {...}
validate_config(config, logger=None, run_mode="automatic")
Mock Testing Downloads¶
For testing without hitting the CDS API, you can mock the download functions:
from pathlib import Path
from unittest.mock import patch, MagicMock
with patch('weather_data_retrieval.sources.cds_era5.download_monthly_era5_file') as mock_download:
    # The patched function returns a fake path instead of contacting the CDS API
    mock_download.return_value = Path("/fake/file.grib")
    # Run your test
Further Reading¶
- User Guide for usage patterns
- Configuration Reference for all parameters
- Troubleshooting for common issues
- GitHub Repository for source code