The version on the docs site is a static render — to actually run the cells, click the badge above (Google Colab) or follow Setup to run locally.

In [ ]:

Copied!





# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
    %pip install -q duckdb pandas requests
# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
    %pip install -q duckdb pandas requests

02 — OBIS Parquet on S3 via DuckDB¶

Time budget: ~15 min · Goal: the headline performance demo. Pull whole datasets in seconds, run analytical SQL directly against S3.

OBIS publishes per-dataset Parquet files (see iobis/obis-open-data) at:

https://obis-open-data.s3.amazonaws.com/occurrence/{dataset_uuid}.parquet

No auth, no SDK. DuckDB scans the file over HTTPS using HTTP range requests — it only fetches the columns and row groups it needs. The schema has nested source and interpreted structs (plus extensions); always unwrap interpreted.* unless you specifically want the raw, as-submitted values in source.*.

This is what explore-cioos/harvester/cde_harvester/obis_harvester.py:294–322 does in production.

In [ ]:

Copied!





import duckdb
import pandas as pd
import time

DEMO_UUID = 'd895e645-a98d-4720-b6fb-332929190f36'  # Maritimes Spring RV Surveys
PARQUET_BASE = 'https://obis-open-data.s3.amazonaws.com/occurrence'

con = duckdb.connect()
# httpfs is the DuckDB extension that lets it read files over HTTP(S)/S3 —
# install it once, then load it into this connection.
con.sql('INSTALL httpfs; LOAD httpfs;')
# The OBIS open-data bucket lives in AWS region us-east-1; tell DuckDB so its
# S3 requests are signed/routed correctly.
con.sql("SET s3_region='us-east-1';")
print('DuckDB version:', duckdb.__version__)
import duckdb
import pandas as pd
import time

DEMO_UUID = 'd895e645-a98d-4720-b6fb-332929190f36'  # Maritimes Spring RV Surveys
PARQUET_BASE = 'https://obis-open-data.s3.amazonaws.com/occurrence'

con = duckdb.connect()
# httpfs is the DuckDB extension that lets it read files over HTTP(S)/S3 —
# install it once, then load it into this connection.
con.sql('INSTALL httpfs; LOAD httpfs;')
# The OBIS open-data bucket lives in AWS region us-east-1; tell DuckDB so its
# S3 requests are signed/routed correctly.
con.sql("SET s3_region='us-east-1';")
print('DuckDB version:', duckdb.__version__)

1. Schema discovery¶

Before querying, look at what's there. DESCRIBE plus LIMIT 0 is cheap — it only fetches the Parquet footer.

The top-level table has just 15 columns, but source, interpreted, and extensions are nested structs — their column_type is a long STRUCT(...) string that the display truncates with …. To see what's actually inside one, expand it with <struct>.* (next cell).

In [2]:

Copied!

url = f"{PARQUET_BASE}/{DEMO_UUID}.parquet"
schema = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{url}') LIMIT 0").df()
schema
url = f"{PARQUET_BASE}/{DEMO_UUID}.parquet"
schema = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{url}') LIMIT 0").df()
schema

Out[2]:

	column_name	column_type	null	key	default	extra
0	_id	VARCHAR	YES	None	None	None
1	_event_id	VARCHAR	YES	None	None	None
2	_occurrence_id	VARCHAR	YES	None	None	None
3	dataset_id	VARCHAR	YES	None	None	None
4	node_ids	VARCHAR[]	YES	None	None	None
5	source	STRUCT(acceptedNameUsage VARCHAR, acceptedName...	YES	None	None	None
6	interpreted	STRUCT(acceptedNameUsage VARCHAR, acceptedName...	YES	None	None	None
7	extensions	STRUCT("http://rs.gbif.org/terms/1.0/DNADerive...	YES	None	None	None
8	missing	VARCHAR[]	YES	None	None	None
9	invalid	VARCHAR[]	YES	None	None	None
10	flags	VARCHAR[]	YES	None	None	None
11	dropped	BOOLEAN	YES	None	None	None
12	absence	BOOLEAN	YES	None	None	None
13	tags	VARCHAR[]	YES	None	None	None
14	geometry	GEOMETRY('OGC:CRS84')	YES	None	None	None

In [ ]:

Copied!





# Expand a nested struct into its individual fields with `<struct>.*`.
# `interpreted` is the big one — the parsed/QC'd/enriched Darwin Core we query
# throughout (~276 fields). Swap in `source.*` to inspect the raw, as-submitted
# values instead (~188 fields, no OBIS-added fields like `aphiaid`).
interpreted_fields = con.sql(
    f"DESCRIBE SELECT interpreted.* FROM read_parquet('{url}') LIMIT 0"
).df()
print(f'`interpreted` contains {len(interpreted_fields)} fields')
interpreted_fields[['column_name', 'column_type']]
# Expand a nested struct into its individual fields with `<struct>.*`.
# `interpreted` is the big one — the parsed/QC'd/enriched Darwin Core we query
# throughout (~276 fields). Swap in `source.*` to inspect the raw, as-submitted
# values instead (~188 fields, no OBIS-added fields like `aphiaid`).
interpreted_fields = con.sql(
    f"DESCRIBE SELECT interpreted.* FROM read_parquet('{url}') LIMIT 0"
).df()
print(f'`interpreted` contains {len(interpreted_fields)} fields')
interpreted_fields[['column_name', 'column_type']]

2. Single-dataset pull¶

Verbatim adaptation of the production query in obis_harvester.py:294–319. Note the interpreted.* unwrap and the lat/lon bounds filter (records outside the Web Mercator range break tile rendering).

In [3]:

Copied!





occurrences_query = f"""
    SELECT
        interpreted.decimalLatitude       AS decimalLatitude,
        interpreted.decimalLongitude      AS decimalLongitude,
        interpreted.date_start            AS date_start,
        interpreted.date_end              AS date_end,
        interpreted.minimumDepthInMeters  AS minDepth,
        interpreted.maximumDepthInMeters  AS maxDepth,
        interpreted.scientificName        AS scientificName,
        _id                               AS id
    FROM read_parquet('{url}')
    WHERE interpreted.decimalLatitude  BETWEEN -85.06 AND 85.06
      AND interpreted.decimalLongitude BETWEEN -180   AND 180
"""

t0 = time.perf_counter()
df = con.sql(occurrences_query).df()
parquet_secs = time.perf_counter() - t0
print(f'Parquet pull: {len(df):,} records in {parquet_secs:.2f} s ({len(df)/parquet_secs:,.0f} rec/s)')
df.head()
occurrences_query = f"""
    SELECT
        interpreted.decimalLatitude       AS decimalLatitude,
        interpreted.decimalLongitude      AS decimalLongitude,
        interpreted.date_start            AS date_start,
        interpreted.date_end              AS date_end,
        interpreted.minimumDepthInMeters  AS minDepth,
        interpreted.maximumDepthInMeters  AS maxDepth,
        interpreted.scientificName        AS scientificName,
        _id                               AS id
    FROM read_parquet('{url}')
    WHERE interpreted.decimalLatitude  BETWEEN -85.06 AND 85.06
      AND interpreted.decimalLongitude BETWEEN -180   AND 180
"""

t0 = time.perf_counter()
df = con.sql(occurrences_query).df()
parquet_secs = time.perf_counter() - t0
print(f'Parquet pull: {len(df):,} records in {parquet_secs:.2f} s ({len(df)/parquet_secs:,.0f} rec/s)')
df.head()

Parquet pull: 650,166 records in 2.15 s (301,712 rec/s)

Out[3]:

	decimalLatitude	decimalLongitude	date_start	date_end	minDepth	maxDepth	scientificName	id
0	40.231167	-68.820667	636163200000	636163200000	NaN	NaN	Pisces	73952c55-6a04-43b0-b69c-f4fa4b575d83
1	40.450667	-69.419667	1109462400000	1109462400000	NaN	NaN	Bothus	bf6612d6-855c-4bb1-a36d-ede18a526f2d
2	40.450667	-69.419667	1109462400000	1109462400000	NaN	NaN	Bothus	c1462459-7e24-4ffd-91f0-85b2aa3552bc
3	44.850000	-61.333333	321062400000	321062400000	NaN	NaN	Gadus morhua	be11fe1c-52ea-49dd-8913-981482ed0720
4	43.883333	-62.250000	322099200000	322099200000	NaN	NaN	Gadus morhua	abf67d47-b1ba-445f-8f5d-29723ac0c71a

3. Bench: Parquet vs paged REST¶

Same dataset, two routes, side-by-side timing. We cap the REST pull at 2 pages so the cell stays workshop-sized; the speed ratio holds in both directions.

In [4]:

Copied!





import requests

API = 'https://api.obis.org/v3'

def fetch_via_rest(dataset_id: str, page_size: int = 10000, max_pages: int = 2):
    params = {'datasetid': dataset_id, 'size': page_size}
    out = []
    for _ in range(max_pages):
        r = requests.get(f'{API}/occurrence', params=params, timeout=120)
        r.raise_for_status()
        results = r.json().get('results', [])
        if not results:
            break
        out.extend(results)
        if len(results) < page_size:
            break
        params['after'] = results[-1].get('id')
    return out

t0 = time.perf_counter()
rest_records = fetch_via_rest(DEMO_UUID, max_pages=2)
rest_secs = time.perf_counter() - t0
print(f'REST  (2 pages): {len(rest_records):,} records in {rest_secs:.2f} s ({len(rest_records)/rest_secs:,.0f} rec/s)')
print(f'Parquet (full):  {len(df):,} records in {parquet_secs:.2f} s ({len(df)/parquet_secs:,.0f} rec/s)')
print(f'\nParquet is ~{(len(df)/parquet_secs)/(len(rest_records)/rest_secs):.1f}× faster on rec/s.')
import requests

API = 'https://api.obis.org/v3'

def fetch_via_rest(dataset_id: str, page_size: int = 10000, max_pages: int = 2):
    params = {'datasetid': dataset_id, 'size': page_size}
    out = []
    for _ in range(max_pages):
        r = requests.get(f'{API}/occurrence', params=params, timeout=120)
        r.raise_for_status()
        results = r.json().get('results', [])
        if not results:
            break
        out.extend(results)
        if len(results) < page_size:
            break
        params['after'] = results[-1].get('id')
    return out

t0 = time.perf_counter()
rest_records = fetch_via_rest(DEMO_UUID, max_pages=2)
rest_secs = time.perf_counter() - t0
print(f'REST  (2 pages): {len(rest_records):,} records in {rest_secs:.2f} s ({len(rest_records)/rest_secs:,.0f} rec/s)')
print(f'Parquet (full):  {len(df):,} records in {parquet_secs:.2f} s ({len(df)/parquet_secs:,.0f} rec/s)')
print(f'\nParquet is ~{(len(df)/parquet_secs)/(len(rest_records)/rest_secs):.1f}× faster on rec/s.')

REST  (2 pages): 20,000 records in 13.63 s (1,468 rec/s)
Parquet (full):  650,166 records in 2.15 s (301,712 rec/s)

Parquet is ~205.6× faster on rec/s.

4. Spatial query — Atlantic-Canada bbox¶

Push the spatial filter into DuckDB instead of pulling everything and filtering in pandas. Cheap, even when the dataset is huge — DuckDB only fetches the row groups that overlap. The demo dataset is the Maritimes survey, so we filter to an Atlantic-Canada box (the Pacific is in the Q2 exercise below).

In [ ]:

Copied!





spatial_query = f"""
    SELECT interpreted.scientificName    AS scientificName,
           interpreted.decimalLatitude   AS lat,
           interpreted.decimalLongitude  AS lon
    FROM read_parquet('{url}')
    WHERE interpreted.decimalLatitude  BETWEEN 41 AND 48
      AND interpreted.decimalLongitude BETWEEN -70 AND -56
"""
atlantic = con.sql(spatial_query).df()
print(f'{len(atlantic):,} records in Atlantic-Canada bbox')
atlantic.head()
spatial_query = f"""
    SELECT interpreted.scientificName    AS scientificName,
           interpreted.decimalLatitude   AS lat,
           interpreted.decimalLongitude  AS lon
    FROM read_parquet('{url}')
    WHERE interpreted.decimalLatitude  BETWEEN 41 AND 48
      AND interpreted.decimalLongitude BETWEEN -70 AND -56
"""
atlantic = con.sql(spatial_query).df()
print(f'{len(atlantic):,} records in Atlantic-Canada bbox')
atlantic.head()

5. Aggregation — top species¶

The GROUP BY runs inside DuckDB, against data streamed straight from the remote Parquet — pandas only receives the final 20-row result, never the ~650k underlying rows. Contrast with pulling every row into a DataFrame and calling .value_counts(): that moves 650k rows into memory just to produce 20. Push the arithmetic down to the data instead.

In [6]:

Copied!





top_species = con.sql(f"""
    SELECT interpreted.scientificName AS scientificName,
           COUNT(*) AS n_records
    FROM read_parquet('{url}')
    WHERE interpreted.scientificName IS NOT NULL
    GROUP BY 1
    ORDER BY n_records DESC
    LIMIT 20
""").df()
top_species
top_species = con.sql(f"""
    SELECT interpreted.scientificName AS scientificName,
           COUNT(*) AS n_records
    FROM read_parquet('{url}')
    WHERE interpreted.scientificName IS NOT NULL
    GROUP BY 1
    ORDER BY n_records DESC
    LIMIT 20
""").df()
top_species

Out[6]:

	scientificName	n_records
0	Melanogrammus aeglefinus	139499
1	Merluccius bilinearis	61573
2	Gadus morhua	37493
3	Leucoraja ocellata	35597
4	Myoxocephalus octodecemspinosus	30336
5	Myzopsetta ferruginea	30048
6	Clupea harengus	29435
7	Leucoraja erinaceus	25957
8	Hippoglossoides platessoides	19921
9	Placopecten magellanicus	18739
10	Squalus acanthias	15687
11	Scomber scombrus	14209
12	Sebastes	13194
13	Scophthalmus aquosus	13178
14	Urophycis chuss	12046
15	Hemitripterus americanus	11103
16	Alosa pseudoharengus	10533
17	Pollachius virens	9815
18	Homarus americanus	9255
19	Pseudopleuronectes americanus	9119

6. Multi-dataset union¶

DuckDB's read_parquet([...]) accepts a list of URLs and unions them into a single virtual table. This is how you'd start building a regional summary across many datasets.

In [7]:

Copied!





# Pick a few small Canadian datasets for the demo (replace with your own UUIDs).
uuids = [
    DEMO_UUID,
    '4b5e4ccb-cf66-44e4-8890-fa68f8404c3f',
]
urls = [f'{PARQUET_BASE}/{u}.parquet' for u in uuids]
url_list_sql = ', '.join(f"'{u}'" for u in urls)

summary = con.sql(f"""
    SELECT _dataset_id AS dataset_id,
           COUNT(*) AS n_records,
           COUNT(DISTINCT interpreted.scientificName) AS n_species
    FROM read_parquet([{url_list_sql}], filename = '_dataset_id')
    GROUP BY 1
""").df()
summary
# Pick a few small Canadian datasets for the demo (replace with your own UUIDs).
uuids = [
    DEMO_UUID,
    '4b5e4ccb-cf66-44e4-8890-fa68f8404c3f',
]
urls = [f'{PARQUET_BASE}/{u}.parquet' for u in uuids]
url_list_sql = ', '.join(f"'{u}'" for u in urls)

summary = con.sql(f"""
    SELECT _dataset_id AS dataset_id,
           COUNT(*) AS n_records,
           COUNT(DISTINCT interpreted.scientificName) AS n_species
    FROM read_parquet([{url_list_sql}], filename = '_dataset_id')
    GROUP BY 1
""").df()
summary

Out[7]:

	dataset_id	n_records	n_species
0	https://obis-open-data.s3.amazonaws.com/occurr...	345	3
1	https://obis-open-data.s3.amazonaws.com/occurr...	650166	440

7. Cracking open an extension — eMoF¶

Sections 1–6 only read interpreted.* (the occurrence fields). The third struct, extensions, holds the child records attached to each occurrence. In this dataset that's the Extended Measurement or Fact (eMoF) extension — tow speed, trawl distance, fish length, bottom temperature, and so on.

Two wrinkles vs. the flat columns:

extensions is keyed by the extension's full URI (http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact).
The value is a list of measurements (one occurrence → many measurements), so we UNNEST it into one row per measurement, then reach into each measurement's source.* fields.

This is the same eMoF you'd validate as a standalone file with pyobistools — here it's nested inline rather than stored as a separate linked table.

In [ ]:

Copied!





EMOF = 'http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact'

# UNNEST the per-occurrence list of measurements into one row each,
# then pull fields out of each measurement's `source.*` sub-struct.
emof = con.sql(f"""
    SELECT
        _id                         AS occurrence_id,
        interpreted.scientificName  AS scientificName,
        m.source.measurementType    AS measurementType,
        m.source.measurementValue   AS measurementValue,
        m.source.measurementUnit    AS measurementUnit
    FROM read_parquet('{url}'),
         UNNEST("extensions"['{EMOF}']) AS t(m)
    LIMIT 10
""").df()
emof
EMOF = 'http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact'

# UNNEST the per-occurrence list of measurements into one row each,
# then pull fields out of each measurement's `source.*` sub-struct.
emof = con.sql(f"""
    SELECT
        _id                         AS occurrence_id,
        interpreted.scientificName  AS scientificName,
        m.source.measurementType    AS measurementType,
        m.source.measurementValue   AS measurementValue,
        m.source.measurementUnit    AS measurementUnit
    FROM read_parquet('{url}'),
         UNNEST("extensions"['{EMOF}']) AS t(m)
    LIMIT 10
""").df()
emof

Check your understanding¶

Q1 — interpreted vs source. Every query in this notebook reads interpreted.decimalLatitude, never a top-level decimalLatitude column. Why? And when would you reach into the source.* struct instead?

Answer

The Parquet schema has no flat Darwin Core columns — the fields live inside nested structs. interpreted.* holds OBIS's parsed, quality-controlled, and enriched values: coordinates as floats, dates as epoch millis, names resolved against WoRMS, plus OBIS-added fields like aphiaid and full taxonomy that weren't in the original submission (~276 fields, vs ~188 in source). That's what you want for analysis. source.* is the raw Darwin Core exactly as the publisher submitted it — original strings, spellings, and date formats, with none of the OBIS enrichment. You only reach for source.* when doing QC or provenance work: comparing what was submitted against what OBIS interpreted, or debugging why a record was flagged. For everything analytical, unwrap interpreted.*.

Q2 — Hands on. Re-run the section-4 spatial query, but for a Canadian Pacific bbox (47–60°N, 140–120°W) instead of the Atlantic one. Predict the row count before you run it, then explain the result. Fill in the starter cell below, then expand the solution to check your work.

In [ ]:

Copied!





# Q2 — your turn.
# Same spatial query as section 4, but for a Canadian *Pacific* bbox.
# Predict the count before running, then think about why you get it.

lat_min, lat_max = 47, 60        # Canadian Pacific latitudes
lon_min, lon_max = -140, -120    # Canadian Pacific longitudes

con.sql(f"""
    SELECT COUNT(*) AS n
    FROM read_parquet('{url}')
    WHERE interpreted.decimalLatitude  BETWEEN {lat_min} AND {lat_max}
      AND interpreted.decimalLongitude BETWEEN {lon_min} AND {lon_max}
""").df()

# Q: is the result a bug, or the right answer? Why?
# Q2 — your turn.
# Same spatial query as section 4, but for a Canadian *Pacific* bbox.
# Predict the count before running, then think about why you get it.

lat_min, lat_max = 47, 60        # Canadian Pacific latitudes
lon_min, lon_max = -140, -120    # Canadian Pacific longitudes

con.sql(f"""
    SELECT COUNT(*) AS n
    FROM read_parquet('{url}')
    WHERE interpreted.decimalLatitude  BETWEEN {lat_min} AND {lat_max}
      AND interpreted.decimalLongitude BETWEEN {lon_min} AND {lon_max}
""").df()

# Q: is the result a bug, or the right answer? Why?

Q2 — solution

0 records — and that's the correct answer, not a bug. The demo dataset is the Maritimes Spring RV Surveys, an Atlantic bottom-trawl survey. None of its records fall in the Pacific box (47–60°N, 140–120°W), so the count is genuinely zero. Section 4's Atlantic box returned ~590k from the same file — the filter works fine; there's just no data out west.

Takeaway: an empty result from a pushed-down spatial filter almost always means "no data in that box," not "query broken." Before debugging the SQL, sanity-check the bbox against where the dataset actually lives.

Gotchas to flag in the talk¶

Lag: the Parquet exports trail the live API by hours/days — don't use Parquet for time-critical pulls.
Coverage: a few private/embargoed datasets are API-only; if read_parquet 404s, fall back to REST (Notebook 01).
Schema: always unwrap interpreted.*; the source.* struct is the raw, as-submitted Darwin Core (useful for QC, not analysis).
Coordinate validity: ±85.06° lat / ±180° lon bounds are essential before sending data to a Web Mercator map renderer.

Next: Notebook 03 — resolve those scientificName strings against WoRMS to get authoritative AphiaIDs and full taxonomic classifications.

In [ ]: