The version on the docs site is a static render — to actually run the cells, click the badge above (Google Colab) or follow Setup to run locally.
# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
%pip install -q duckdb pandas requests
02 — OBIS Parquet on S3 via DuckDB¶
Time budget: ~15 min · Goal: the headline performance demo. Pull whole datasets in seconds, run analytical SQL directly against S3.
OBIS publishes per-dataset Parquet files (see iobis/obis-open-data) at:
https://obis-open-data.s3.amazonaws.com/occurrence/{dataset_uuid}.parquet
No auth, no SDK. DuckDB scans the file over HTTPS using HTTP range requests — it only fetches the columns and row groups it needs. The schema has nested source and interpreted structs (plus extensions); always unwrap interpreted.* unless you specifically want the raw, as-submitted values in source.*.
This is what explore-cioos/harvester/cde_harvester/obis_harvester.py:294–322 does in production.
import duckdb
import pandas as pd
import time
DEMO_UUID = 'd895e645-a98d-4720-b6fb-332929190f36' # Maritimes Spring RV Surveys
PARQUET_BASE = 'https://obis-open-data.s3.amazonaws.com/occurrence'
con = duckdb.connect()
# httpfs is the DuckDB extension that lets it read files over HTTP(S)/S3 —
# install it once, then load it into this connection.
con.sql('INSTALL httpfs; LOAD httpfs;')
# The OBIS open-data bucket lives in AWS region us-east-1; tell DuckDB so its
# S3 requests are signed/routed correctly.
con.sql("SET s3_region='us-east-1';")
print('DuckDB version:', duckdb.__version__)
1. Schema discovery¶
Before querying, look at what's there. DESCRIBE plus LIMIT 0 is cheap — it only fetches the Parquet footer.
The top-level table has just 15 columns, but source, interpreted, and extensions are nested structs — their column_type is a long STRUCT(...) string that the display truncates with …. To see what's actually inside one, expand it with <struct>.* (next cell).
url = f"{PARQUET_BASE}/{DEMO_UUID}.parquet"
schema = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{url}') LIMIT 0").df()
schema
| column_name | column_type | null | key | default | extra | |
|---|---|---|---|---|---|---|
| 0 | _id | VARCHAR | YES | None | None | None |
| 1 | _event_id | VARCHAR | YES | None | None | None |
| 2 | _occurrence_id | VARCHAR | YES | None | None | None |
| 3 | dataset_id | VARCHAR | YES | None | None | None |
| 4 | node_ids | VARCHAR[] | YES | None | None | None |
| 5 | source | STRUCT(acceptedNameUsage VARCHAR, acceptedName... | YES | None | None | None |
| 6 | interpreted | STRUCT(acceptedNameUsage VARCHAR, acceptedName... | YES | None | None | None |
| 7 | extensions | STRUCT("http://rs.gbif.org/terms/1.0/DNADerive... | YES | None | None | None |
| 8 | missing | VARCHAR[] | YES | None | None | None |
| 9 | invalid | VARCHAR[] | YES | None | None | None |
| 10 | flags | VARCHAR[] | YES | None | None | None |
| 11 | dropped | BOOLEAN | YES | None | None | None |
| 12 | absence | BOOLEAN | YES | None | None | None |
| 13 | tags | VARCHAR[] | YES | None | None | None |
| 14 | geometry | GEOMETRY('OGC:CRS84') | YES | None | None | None |
# Expand a nested struct into its individual fields with `<struct>.*`.
# `interpreted` is the big one — the parsed/QC'd/enriched Darwin Core we query
# throughout (~276 fields). Swap in `source.*` to inspect the raw, as-submitted
# values instead (~188 fields, no OBIS-added fields like `aphiaid`).
interpreted_fields = con.sql(
f"DESCRIBE SELECT interpreted.* FROM read_parquet('{url}') LIMIT 0"
).df()
print(f'`interpreted` contains {len(interpreted_fields)} fields')
interpreted_fields[['column_name', 'column_type']]
2. Single-dataset pull¶
Verbatim adaptation of the production query in obis_harvester.py:294–319. Note the interpreted.* unwrap and the lat/lon bounds filter (records outside the Web Mercator range break tile rendering).
occurrences_query = f"""
SELECT
interpreted.decimalLatitude AS decimalLatitude,
interpreted.decimalLongitude AS decimalLongitude,
interpreted.date_start AS date_start,
interpreted.date_end AS date_end,
interpreted.minimumDepthInMeters AS minDepth,
interpreted.maximumDepthInMeters AS maxDepth,
interpreted.scientificName AS scientificName,
_id AS id
FROM read_parquet('{url}')
WHERE interpreted.decimalLatitude BETWEEN -85.06 AND 85.06
AND interpreted.decimalLongitude BETWEEN -180 AND 180
"""
t0 = time.perf_counter()
df = con.sql(occurrences_query).df()
parquet_secs = time.perf_counter() - t0
print(f'Parquet pull: {len(df):,} records in {parquet_secs:.2f} s ({len(df)/parquet_secs:,.0f} rec/s)')
df.head()
Parquet pull: 650,166 records in 2.15 s (301,712 rec/s)
| decimalLatitude | decimalLongitude | date_start | date_end | minDepth | maxDepth | scientificName | id | |
|---|---|---|---|---|---|---|---|---|
| 0 | 40.231167 | -68.820667 | 636163200000 | 636163200000 | NaN | NaN | Pisces | 73952c55-6a04-43b0-b69c-f4fa4b575d83 |
| 1 | 40.450667 | -69.419667 | 1109462400000 | 1109462400000 | NaN | NaN | Bothus | bf6612d6-855c-4bb1-a36d-ede18a526f2d |
| 2 | 40.450667 | -69.419667 | 1109462400000 | 1109462400000 | NaN | NaN | Bothus | c1462459-7e24-4ffd-91f0-85b2aa3552bc |
| 3 | 44.850000 | -61.333333 | 321062400000 | 321062400000 | NaN | NaN | Gadus morhua | be11fe1c-52ea-49dd-8913-981482ed0720 |
| 4 | 43.883333 | -62.250000 | 322099200000 | 322099200000 | NaN | NaN | Gadus morhua | abf67d47-b1ba-445f-8f5d-29723ac0c71a |
3. Bench: Parquet vs paged REST¶
Same dataset, two routes, side-by-side timing. We cap the REST pull at 2 pages so the cell stays workshop-sized; the speed ratio holds in both directions.
import requests
API = 'https://api.obis.org/v3'
def fetch_via_rest(dataset_id: str, page_size: int = 10000, max_pages: int = 2):
params = {'datasetid': dataset_id, 'size': page_size}
out = []
for _ in range(max_pages):
r = requests.get(f'{API}/occurrence', params=params, timeout=120)
r.raise_for_status()
results = r.json().get('results', [])
if not results:
break
out.extend(results)
if len(results) < page_size:
break
params['after'] = results[-1].get('id')
return out
t0 = time.perf_counter()
rest_records = fetch_via_rest(DEMO_UUID, max_pages=2)
rest_secs = time.perf_counter() - t0
print(f'REST (2 pages): {len(rest_records):,} records in {rest_secs:.2f} s ({len(rest_records)/rest_secs:,.0f} rec/s)')
print(f'Parquet (full): {len(df):,} records in {parquet_secs:.2f} s ({len(df)/parquet_secs:,.0f} rec/s)')
print(f'\nParquet is ~{(len(df)/parquet_secs)/(len(rest_records)/rest_secs):.1f}× faster on rec/s.')
REST (2 pages): 20,000 records in 13.63 s (1,468 rec/s) Parquet (full): 650,166 records in 2.15 s (301,712 rec/s) Parquet is ~205.6× faster on rec/s.
4. Spatial query — Atlantic-Canada bbox¶
Push the spatial filter into DuckDB instead of pulling everything and filtering in pandas. Cheap, even when the dataset is huge — DuckDB only fetches the row groups that overlap. The demo dataset is the Maritimes survey, so we filter to an Atlantic-Canada box (the Pacific is in the Q2 exercise below).
spatial_query = f"""
SELECT interpreted.scientificName AS scientificName,
interpreted.decimalLatitude AS lat,
interpreted.decimalLongitude AS lon
FROM read_parquet('{url}')
WHERE interpreted.decimalLatitude BETWEEN 41 AND 48
AND interpreted.decimalLongitude BETWEEN -70 AND -56
"""
atlantic = con.sql(spatial_query).df()
print(f'{len(atlantic):,} records in Atlantic-Canada bbox')
atlantic.head()
5. Aggregation — top species¶
The GROUP BY runs inside DuckDB, against data streamed straight from the remote Parquet — pandas only receives the final 20-row result, never the ~650k underlying rows. Contrast with pulling every row into a DataFrame and calling .value_counts(): that moves 650k rows into memory just to produce 20. Push the arithmetic down to the data instead.
top_species = con.sql(f"""
SELECT interpreted.scientificName AS scientificName,
COUNT(*) AS n_records
FROM read_parquet('{url}')
WHERE interpreted.scientificName IS NOT NULL
GROUP BY 1
ORDER BY n_records DESC
LIMIT 20
""").df()
top_species
| scientificName | n_records | |
|---|---|---|
| 0 | Melanogrammus aeglefinus | 139499 |
| 1 | Merluccius bilinearis | 61573 |
| 2 | Gadus morhua | 37493 |
| 3 | Leucoraja ocellata | 35597 |
| 4 | Myoxocephalus octodecemspinosus | 30336 |
| 5 | Myzopsetta ferruginea | 30048 |
| 6 | Clupea harengus | 29435 |
| 7 | Leucoraja erinaceus | 25957 |
| 8 | Hippoglossoides platessoides | 19921 |
| 9 | Placopecten magellanicus | 18739 |
| 10 | Squalus acanthias | 15687 |
| 11 | Scomber scombrus | 14209 |
| 12 | Sebastes | 13194 |
| 13 | Scophthalmus aquosus | 13178 |
| 14 | Urophycis chuss | 12046 |
| 15 | Hemitripterus americanus | 11103 |
| 16 | Alosa pseudoharengus | 10533 |
| 17 | Pollachius virens | 9815 |
| 18 | Homarus americanus | 9255 |
| 19 | Pseudopleuronectes americanus | 9119 |
6. Multi-dataset union¶
DuckDB's read_parquet([...]) accepts a list of URLs and unions them into a single virtual table. This is how you'd start building a regional summary across many datasets.
# Pick a few small Canadian datasets for the demo (replace with your own UUIDs).
uuids = [
DEMO_UUID,
'4b5e4ccb-cf66-44e4-8890-fa68f8404c3f',
]
urls = [f'{PARQUET_BASE}/{u}.parquet' for u in uuids]
url_list_sql = ', '.join(f"'{u}'" for u in urls)
summary = con.sql(f"""
SELECT _dataset_id AS dataset_id,
COUNT(*) AS n_records,
COUNT(DISTINCT interpreted.scientificName) AS n_species
FROM read_parquet([{url_list_sql}], filename = '_dataset_id')
GROUP BY 1
""").df()
summary
| dataset_id | n_records | n_species | |
|---|---|---|---|
| 0 | https://obis-open-data.s3.amazonaws.com/occurr... | 345 | 3 |
| 1 | https://obis-open-data.s3.amazonaws.com/occurr... | 650166 | 440 |
7. Cracking open an extension — eMoF¶
Sections 1–6 only read interpreted.* (the occurrence fields). The third struct, extensions, holds the child records attached to each occurrence. In this dataset that's the Extended Measurement or Fact (eMoF) extension — tow speed, trawl distance, fish length, bottom temperature, and so on.
Two wrinkles vs. the flat columns:
extensionsis keyed by the extension's full URI (http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact).- The value is a list of measurements (one occurrence → many measurements), so we
UNNESTit into one row per measurement, then reach into each measurement'ssource.*fields.
This is the same eMoF you'd validate as a standalone file with pyobistools — here it's nested inline rather than stored as a separate linked table.
EMOF = 'http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact'
# UNNEST the per-occurrence list of measurements into one row each,
# then pull fields out of each measurement's `source.*` sub-struct.
emof = con.sql(f"""
SELECT
_id AS occurrence_id,
interpreted.scientificName AS scientificName,
m.source.measurementType AS measurementType,
m.source.measurementValue AS measurementValue,
m.source.measurementUnit AS measurementUnit
FROM read_parquet('{url}'),
UNNEST("extensions"['{EMOF}']) AS t(m)
LIMIT 10
""").df()
emof
Check your understanding¶
Q1 — interpreted vs source. Every query in this notebook reads interpreted.decimalLatitude, never a top-level decimalLatitude column. Why? And when would you reach into the source.* struct instead?
Answer
The Parquet schema has no flat Darwin Core columns — the fields live inside nested structs. interpreted.* holds OBIS's parsed, quality-controlled, and enriched values: coordinates as floats, dates as epoch millis, names resolved against WoRMS, plus OBIS-added fields like aphiaid and full taxonomy that weren't in the original submission (~276 fields, vs ~188 in source). That's what you want for analysis. source.* is the raw Darwin Core exactly as the publisher submitted it — original strings, spellings, and date formats, with none of the OBIS enrichment. You only reach for source.* when doing QC or provenance work: comparing what was submitted against what OBIS interpreted, or debugging why a record was flagged. For everything analytical, unwrap interpreted.*.
Q2 — Hands on. Re-run the section-4 spatial query, but for a Canadian Pacific bbox (47–60°N, 140–120°W) instead of the Atlantic one. Predict the row count before you run it, then explain the result. Fill in the starter cell below, then expand the solution to check your work.
# Q2 — your turn.
# Same spatial query as section 4, but for a Canadian *Pacific* bbox.
# Predict the count before running, then think about why you get it.
lat_min, lat_max = 47, 60 # Canadian Pacific latitudes
lon_min, lon_max = -140, -120 # Canadian Pacific longitudes
con.sql(f"""
SELECT COUNT(*) AS n
FROM read_parquet('{url}')
WHERE interpreted.decimalLatitude BETWEEN {lat_min} AND {lat_max}
AND interpreted.decimalLongitude BETWEEN {lon_min} AND {lon_max}
""").df()
# Q: is the result a bug, or the right answer? Why?
Q2 — solution
0 records — and that's the correct answer, not a bug. The demo dataset is the Maritimes Spring RV Surveys, an Atlantic bottom-trawl survey. None of its records fall in the Pacific box (47–60°N, 140–120°W), so the count is genuinely zero. Section 4's Atlantic box returned ~590k from the same file — the filter works fine; there's just no data out west.
Takeaway: an empty result from a pushed-down spatial filter almost always means "no data in that box," not "query broken." Before debugging the SQL, sanity-check the bbox against where the dataset actually lives.
Gotchas to flag in the talk¶
- Lag: the Parquet exports trail the live API by hours/days — don't use Parquet for time-critical pulls.
- Coverage: a few private/embargoed datasets are API-only; if
read_parquet404s, fall back to REST (Notebook 01). - Schema: always unwrap
interpreted.*; thesource.*struct is the raw, as-submitted Darwin Core (useful for QC, not analysis). - Coordinate validity: ±85.06° lat / ±180° lon bounds are essential before sending data to a Web Mercator map renderer.
Next: Notebook 03 — resolve those scientificName strings against WoRMS to get authoritative AphiaIDs and full taxonomic classifications.