API

search

dtotools.search.get_collection_url() → str

Return the base URL of the STAC collections endpoint.

This function exposes the configured STAC collections endpoint used by the underlying pystac client. It can be useful for logging, debugging, or validation purposes.

Returns:

The base URL of the STAC collections endpoint.

Return type:

str

dtotools.search.search_on_title(title: str, collection: str | None = None, verbose: int = 0) → List[Item]

Search STAC items whose title contains a given substring.

This function performs a client-side scan of one or more STAC collections and returns all items whose title property contains the provided search string (case-insensitive). Since the target STAC API does not support the /search endpoint, server-side filtering is unavailable and a full collection scan is required.

Parameters:
  • title (str) – Substring to search for within the item title field. The match is performed in a case-insensitive manner.

  • collection (str, optional) – Identifier of a specific collection to search. If not provided, all available collections in the catalog are searched.

  • verbose (int, optional) –

    Verbosity level controlling console output:

    • 0: silent mode (default)

    • >0: print progress and status messages

Returns:

A list of STAC items whose title field contains the specified substring.

Return type:

List[pystac.Item]

Raises:

ValueError – If the specified collection identifier does not exist in the catalog.

Notes

This implementation performs a brute-force scan over collections and items due to the lack of support for the STAC ITEM_SEARCH conformance class by the target API. For large catalogs, this operation may be slow and network-bound.
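
The matching rule described above can be sketched on plain dictionaries. This is a minimal illustration, not the library's implementation: filter_by_title and sample_items are hypothetical names, and the items are assumed to be shaped like STAC item JSON with the title stored under properties.

```python
from typing import Any, Iterable, Mapping


def filter_by_title(
    items: Iterable[Mapping[str, Any]], title: str
) -> list[Mapping[str, Any]]:
    """Keep items whose properties.title contains `title`, ignoring case."""
    needle = title.casefold()
    matches = []
    for item in items:
        # Assumption: an item without a title is treated as having an empty one.
        item_title = (item.get("properties") or {}).get("title") or ""
        if needle in str(item_title).casefold():
            matches.append(item)
    return matches


# Hypothetical items, shaped like STAC item JSON.
sample_items = [
    {"id": "a", "properties": {"title": "Koster seafloor observatory"}},
    {"id": "b", "properties": {"title": "North Sea plankton"}},
    {"id": "c", "properties": {}},
]

matched_ids = [item["id"] for item in filter_by_title(sample_items, "koster")]
```

Every item has to be fetched before it can be tested, which is why the cost grows with the size of the catalog rather than with the number of matches.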

Examples

Search all collections for items containing "koster" in their title:

>>> items = search_on_title("koster", verbose=1)

Search only within a specific collection:

>>> items = search_on_title("koster", collection="emodnet-biology")

dtotools.inspect_parquet.get_schema(dataset_url: Dataset | str, output_file: str | None = None) → Schema

Print and return the parquet schema.

Parameters:
  • dataset_url (pyarrow.dataset.Dataset or str) – Dataset object, parquet URL/path, or a data-explorer wrapper URL.

  • output_file (str | None, optional) – If provided, a CSV file is written with columns name and dtype. The default is None.

Returns:

The parquet schema.

Return type:

pyarrow.lib.Schema

Examples

>>> schema = get_schema("file:///tmp/example.parquet")
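
The CSV layout produced by output_file (columns name and dtype) might look like the following sketch. It is an illustration under assumptions: write_schema_csv is a hypothetical helper, the field list is made up, and the exact dtype strings written by get_schema are not specified in this reference.

```python
import csv
import io


def write_schema_csv(fields, handle):
    """Write one row per field with the columns `name` and `dtype`."""
    writer = csv.writer(handle)
    writer.writerow(["name", "dtype"])
    for name, dtype in fields:
        # str() so that type objects and plain strings are written alike.
        writer.writerow([name, str(dtype)])


# Hypothetical fields standing in for a real parquet schema.
fields = [("country", "string"), ("year", "int64")]

buffer = io.StringIO()
write_schema_csv(fields, buffer)
csv_text = buffer.getvalue()
```

With pyarrow, the (name, dtype) pairs could be taken from the name and type attributes of each field in the returned schema.
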

dtotools.inspect_parquet.inspect_parquet(dataset: Dataset | str, output_file: str, columns: list[str] | None = None, filters: Mapping[str, Any] | list[tuple[str, Any]] | None = None, logs: bool = True) → str

Inspect parquet data and write unique values/counts per column to CSV.

Parameters:
  • dataset (pyarrow.dataset.Dataset or str) – Dataset object, parquet URL/path, or a data-explorer wrapper URL.

  • output_file (str) – Destination CSV file.

  • columns (list[str] | None, optional) – Columns to inspect. If None or empty, all columns are inspected.

  • filters (mapping | list[tuple[str, Any]] | None, optional) – Row filters applied before counting values. Mapping example: {"parameter_imisdasid": 4687}. List example: [("country", ["NL", "BE"]), ("year", 2024)].

  • logs (bool, optional) – If True, print timestamped progress information.

Returns:

The path of the CSV file that was written (output_file).

Return type:

str
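
The per-column summary (unique values and their counts) can be sketched with the standard library. This is an illustration on in-memory rows, not the library's implementation: count_values and the sample rows are hypothetical, and the real function works on parquet data and writes the result to CSV.

```python
from collections import Counter


def count_values(rows, columns=None):
    """Return {column: Counter of values} for the requested columns."""
    rows = list(rows)
    if not columns:
        # None or empty: inspect every column, in first-seen order.
        columns = list(dict.fromkeys(key for row in rows for key in row))
    return {col: Counter(row.get(col) for row in rows) for col in columns}


# Hypothetical rows standing in for parquet records.
rows = [
    {"country": "NL", "year": 2024},
    {"country": "BE", "year": 2024},
    {"country": "NL", "year": 2023},
]

counts = count_values(rows)
```

Passing columns=["year"] restricts the summary to that column, mirroring the documented behaviour of the columns parameter.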

dtotools.read_parquet.read_parquet(parquet: Dataset | str, columns: list[str] | None = None, filters: Mapping[str, Any] | list[tuple[str, Any]] | None = None, max_rows: int = 25, output_file: str | None = None, logs: bool = True) → dict[str, Any]

Read and filter parquet data, optionally save to CSV, and display results.

This function reads only the first max_rows rows from the parquet file by iterating through batches and stopping as soon as enough rows have been collected. Because the rest of the file is not scanned, little data is loaded into memory and the call typically returns quickly, even for large files.
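
The early-stopping idea can be sketched with a generic batch iterator. This is a minimal illustration under assumptions: take_rows and fake_batches are hypothetical names, batches are modelled as lists of dictionaries, and the real function reads parquet batches instead.

```python
produced = []  # records which batches were actually generated


def fake_batches(n_batches, batch_size):
    """Lazily yield batches of rows, noting each batch that gets produced."""
    for i in range(n_batches):
        produced.append(i)
        yield [{"row": i * batch_size + j} for j in range(batch_size)]


def take_rows(batches, max_rows=25):
    """Collect at most `max_rows` rows, consuming as few batches as possible."""
    rows = []
    if max_rows <= 0:
        return rows
    for batch in batches:
        rows.extend(batch[: max_rows - len(rows)])
        if len(rows) >= max_rows:
            break  # stop before requesting another batch
    return rows


rows = take_rows(fake_batches(n_batches=100, batch_size=10), max_rows=25)
```

Only three of the hundred available batches are generated here, which is the property that keeps memory use and run time low.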

Parameters:
  • parquet (pyarrow.dataset.Dataset or str) – Dataset object, parquet URL/path, or a data-explorer wrapper URL.

  • columns (list[str] | None, optional) – Columns to read. If None or empty, all columns are read.

  • filters (mapping | list[tuple[str, Any]] | None, optional) – Row filters applied before reading values. Mapping example: {"parameter_imisdasid": 4687}. List example: [("country", ["NL", "BE"]), ("year", 2024)].

  • max_rows (int, optional) – Maximum rows to read from the parquet file (default: 25). Only these rows are loaded into memory and returned.

  • output_file (str | None, optional) – If provided, write the filtered data to this CSV file.

  • logs (bool, optional) – If True, print timestamped progress information.

Returns:

Dictionary with keys:

  • total_rows – rows read (limited by max_rows)

  • displayed_rows – rows shown (should equal total_rows in most cases)

  • columns – list of column names

  • data – list of dictionaries (rows)

  • output_file – path to the CSV file if written, else None

Return type:

dict
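
The two accepted shapes of the filters parameter (a mapping, or a list of (column, value) pairs) can be illustrated on in-memory rows. This is a sketch under stated assumptions, not the library's implementation: normalize_filters and row_matches are hypothetical helpers, a list value is assumed to mean "any of these values", a scalar is assumed to mean equality, and all conditions are assumed to be combined with AND.

```python
from collections.abc import Mapping


def normalize_filters(filters):
    """Turn either accepted shape into a list of (column, value) pairs."""
    if not filters:
        return []
    if isinstance(filters, Mapping):
        return list(filters.items())
    return [(column, value) for column, value in filters]


def row_matches(row, filters):
    """Return True when the row satisfies every condition (assumed AND)."""
    for column, expected in normalize_filters(filters):
        actual = row.get(column)
        if isinstance(expected, (list, tuple, set)):
            # Assumption: a collection means membership in that collection.
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    return True


# Hypothetical rows standing in for parquet records.
rows = [
    {"country": "NL", "year": 2024},
    {"country": "BE", "year": 2023},
    {"country": "FR", "year": 2024},
]

list_form = [("country", ["NL", "BE"]), ("year", 2024)]
kept = [row for row in rows if row_matches(row, list_form)]
kept_by_mapping = [row for row in rows if row_matches(row, {"year": 2024})]
```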