API
search
- dtotools.search.get_collection_url() -> str
Return the base URL of the STAC collections endpoint.
This function exposes the configured STAC collections endpoint used by the underlying pystac client. It can be useful for logging, debugging, or validation purposes.
- Returns:
The base URL of the STAC collections endpoint.
- Return type:
str
- dtotools.search.search_on_title(title: str, collection: str | None = None, verbose: int = 0) -> List[Item]
Search STAC items whose title contains a given substring.
This function performs a client-side scan of one or more STAC collections and returns all items whose title property contains the provided search string (case-insensitive). Since the target STAC API does not support the /search endpoint, server-side filtering is unavailable and a full collection scan is required.
- Parameters:
title (str) – Substring to search for within the item title field. The match is performed in a case-insensitive manner.
collection (str, optional) – Identifier of a specific collection to search. If not provided, all available collections in the catalog are searched.
verbose (int, optional) – Verbosity level controlling console output:
0: silent mode (default)
>0: print progress and status messages
- Returns:
A list of STAC items whose title field contains the specified substring.
- Return type:
List[pystac.Item]
- Raises:
ValueError – If the specified collection identifier does not exist in the catalog.
Notes
This implementation performs a brute-force scan over collections and items because the target API does not support the STAC ITEM_SEARCH conformance class. For large catalogs, this operation may be slow and network-bound.
Examples
Search all collections for items containing "koster" in their title:
>>> items = search_on_title("koster", verbose=1)
Search only within a specific collection:
>>> items = search_on_title("koster", collection="emodnet-biology")
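The brute-force matching step described in the Notes can be pictured with a minimal sketch. This is not the dtotools implementation: the helper name filter_items_by_title and the FakeItem class are hypothetical, and the sketch only assumes that each item exposes a properties mapping that may hold a title key, as pystac.Item does.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, List


@dataclass
class FakeItem:
    """Hypothetical stand-in for pystac.Item: an id plus a properties mapping."""
    id: str
    properties: dict = field(default_factory=dict)


def filter_items_by_title(items: Iterable[Any], title: str) -> List[Any]:
    """Return the items whose 'title' property contains `title`, ignoring case."""
    needle = title.casefold()
    matches = []
    for item in items:
        # Items without a title (or with a non-string title) are skipped.
        item_title = item.properties.get("title")
        if isinstance(item_title, str) and needle in item_title.casefold():
            matches.append(item)
    return matches


# Illustrative data only; the titles are made up for this sketch.
catalog = [
    FakeItem("a", {"title": "Koster seafloor observatory"}),
    FakeItem("b", {"title": "North Sea plankton counts"}),
    FakeItem("c", {}),
]
hits = filter_items_by_title(catalog, "koster")
print([item.id for item in hits])
```

In a real scan the items iterable would come from walking each collection, which is where the network cost mentioned in the Notes arises; the matching itself is cheap.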
- dtotools.inspect_parquet.get_schema(dataset_url: Dataset | str, output_file: str | None = None) -> Schema
Print and return the parquet schema.
- Parameters:
dataset_url (Dataset | str)
output_file (str | None, optional)
- Returns:
The parquet schema.
- Return type:
pyarrow.lib.Schema
Examples
>>> schema = get_schema("file:///tmp/example.parquet")
- dtotools.inspect_parquet.inspect_parquet(dataset: Dataset | str, output_file: str, columns: list[str] | None = None, filters: Mapping[str, Any] | list[tuple[str, Any]] | None = None, logs: bool = True) -> str
Inspect parquet data and write unique values/counts per column to CSV.
- Parameters:
dataset (pyarrow.dataset.Dataset or str) – Dataset object, parquet URL/path, or data-explorer wrapper URL.
output_file (str) – Destination CSV file.
columns (list[str] | None, optional) – Columns to inspect. If None or empty, all columns are inspected.
filters (mapping | list[tuple[str, Any]] | None, optional) – Row filters applied before counting values. Mapping example: {"parameter_imisdasid": 4687}. List example: [("country", ["NL", "BE"]), ("year", 2024)].
logs (bool, optional) – If True, print timestamped progress information.
- Returns:
The path written to output_file.
- Return type:
str
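To make the filter forms and the per-column counting concrete, here is a rough sketch over plain Python dictionaries. It is not how inspect_parquet is implemented. The helper names (row_matches, count_unique_values, write_counts_csv) and the CSV layout (column, value, count) are hypothetical, and the sketch assumes that a list-valued filter means "value is one of these" while any other value means equality.

```python
import csv
import io
from collections import Counter
from typing import Any, Mapping


def row_matches(row: Mapping[str, Any], filters) -> bool:
    """Assumed semantics: a list/tuple/set value means membership, anything else equality."""
    pairs = filters.items() if isinstance(filters, Mapping) else (filters or [])
    for column, wanted in pairs:
        if isinstance(wanted, (list, tuple, set)):
            if row.get(column) not in wanted:
                return False
        elif row.get(column) != wanted:
            return False
    return True


def count_unique_values(rows, columns=None, filters=None):
    """Count how often each value occurs per column, after applying the filters."""
    counts = {}
    for row in rows:
        if not row_matches(row, filters):
            continue
        # None or an empty list means "inspect every column".
        for column in (columns or row.keys()):
            counts.setdefault(column, Counter())[row.get(column)] += 1
    return counts


def write_counts_csv(counts, handle) -> None:
    """Write one CSV line per (column, value) pair; this layout is an assumption."""
    writer = csv.writer(handle)
    writer.writerow(["column", "value", "count"])
    for column, counter in counts.items():
        for value, count in counter.most_common():
            writer.writerow([column, value, count])


rows = [
    {"country": "NL", "year": 2024},
    {"country": "BE", "year": 2024},
    {"country": "NL", "year": 2023},
    {"country": "FR", "year": 2024},
]
counts = count_unique_values(rows, filters=[("country", ["NL", "BE"]), ("year", 2024)])
buffer = io.StringIO()
write_counts_csv(counts, buffer)
print(buffer.getvalue())
```

The mapping form, for example {"year": 2024}, goes through the same row_matches helper, so both filter styles shown in the parameter description behave identically in this sketch.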
- dtotools.read_parquet.read_parquet(parquet: Dataset | str, columns: list[str] | None = None, filters: Mapping[str, Any] | list[tuple[str, Any]] | None = None, max_rows: int = 25, output_file: str | None = None, logs: bool = True) -> dict[str, Any]
Read and filter parquet data, optionally save to CSV, and display results.
This function reads at most max_rows rows from the parquet file by iterating through batches and stopping early, so only a minimal amount of data is loaded into memory. Because the entire file is not scanned, the operation is typically fast.
- Parameters:
parquet (pyarrow.dataset.Dataset or str) – Dataset object, parquet URL/path, or a data-explorer wrapper URL.
columns (list[str] | None, optional) – Columns to read. If None or empty, all columns are read.
filters (mapping | list[tuple[str, Any]] | None, optional) – Row filters applied before reading values. Mapping example: {"parameter_imisdasid": 4687}. List example: [("country", ["NL", "BE"]), ("year", 2024)].
max_rows (int, optional) – Maximum rows to read from the parquet file (default: 25). Only these rows are loaded into memory and returned.
output_file (str | None, optional) – If provided, write the filtered data to this CSV file.
logs (bool, optional) – If True, print timestamped progress information.
- Returns:
Dictionary with keys:
- 'total_rows': rows read (limited by max_rows)
- 'displayed_rows': rows shown (should equal total_rows in most cases)
- 'columns': list of column names
- 'data': list of dictionaries (rows)
- 'output_file': path to CSV if written, else None
- Return type:
dict[str, Any]
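The early-stopping behaviour described for read_parquet can be illustrated with a small sketch over in-memory batches. It is only an approximation under stated assumptions: iter_batches and read_first_rows are hypothetical names, the batch source stands in for a parquet scanner, and no CSV is written. The returned dictionary mirrors the keys documented above.

```python
from typing import Any, Dict, Iterable, Iterator, List


def iter_batches(rows: List[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Hypothetical batch source standing in for a parquet scanner."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def read_first_rows(batches: Iterable[List[Dict[str, Any]]], max_rows: int = 25) -> Dict[str, Any]:
    """Collect at most max_rows rows, stopping as soon as the limit is reached."""
    data: List[Dict[str, Any]] = []
    for batch in batches:
        # Take only as many rows from this batch as are still needed.
        data.extend(batch[: max_rows - len(data)])
        if len(data) >= max_rows:
            break  # later batches are never requested
    return {
        "total_rows": len(data),
        "displayed_rows": len(data),
        "columns": list(data[0].keys()) if data else [],
        "data": data,
        "output_file": None,  # this sketch never writes a CSV
    }


rows = [{"id": i, "year": 2024} for i in range(100)]
result = read_first_rows(iter_batches(rows, batch_size=10), max_rows=25)
print(result["total_rows"], result["columns"])
```

Because the batches come from a generator, breaking out of the loop leaves the remaining batches unread, which is the property the description above relies on for low memory use.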