The version on the docs site is a static render — to actually run the cells, click the badge above (Google Colab) or follow Setup to run locally.
# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
    %pip install -q requests pandas tqdm ipywidgets
01 - OBIS REST API directly
Time budget: ~10 min · Goal: the request-level view of OBIS — endpoints, cursor pagination, and faceting. Everything higher-level (clients, exports) ultimately lands on these calls.
Three endpoints cover almost everything CIOOS does with OBIS:
| Endpoint | Purpose |
|---|---|
| /v3/dataset | List or get datasets |
| /v3/occurrence | Occurrence records (paginated) |
| /v3/facet | Aggregations (taxonomic class counts, etc.) |
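Pagination quirk: OBIS uses an after= cursor (the id of the last record on the previous page), not a page-number or offset parameter. If you write &start=…, the parameter is silently ignored and every request returns the first page again, so you never reach the rest of the data.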
Country-filter quirk: country=CA does not filter geographically — it filters by OBIS node. For Canadian-waters records, use a geometry= polygon.
Reference implementation: explore-cioos/harvester/cde_harvester/obis_harvester.py:324–360.
import requests
import pandas as pd
from tqdm.notebook import tqdm
API = 'https://api.obis.org/v3'
DEMO_UUID = 'd895e645-a98d-4720-b6fb-332929190f36' # Maritimes Spring RV Surveys
CANADA_BBOX_WKT = 'POLYGON((-141 41, -52 41, -52 84, -141 84, -141 41))'
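The WKT string above is the four corners of a bounding box written as a closed ring, in longitude-latitude order, with the first vertex repeated at the end. As a minimal sketch, a helper like the one below could build the same kind of polygon for any box. The names bbox_to_wkt and _fmt are hypothetical, introduced here for illustration only; they are not part of OBIS or the CIOOS harvester, and the sketch does not handle boxes that cross the antimeridian.

```python
def _fmt(value: float) -> str:
    """Format a coordinate without trailing zeros (41.0 -> '41', -52.5 -> '-52.5')."""
    return f'{value:.6f}'.rstrip('0').rstrip('.')


def bbox_to_wkt(min_lon: float, min_lat: float, max_lon: float, max_lat: float) -> str:
    """Return a closed WKT POLYGON ring (lon lat order) for a bounding box.

    Hypothetical helper for illustration; not an OBIS or CIOOS function.
    """
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError('min values must be smaller than max values')
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first vertex to close the ring
    ]
    ring = ', '.join(f'{_fmt(lon)} {_fmt(lat)}' for lon, lat in corners)
    return f'POLYGON(({ring}))'


canada_wkt = bbox_to_wkt(-141, 41, -52, 84)
print(canada_wkt)
```

Called with the corner values used in this notebook, the helper reproduces the CANADA_BBOX_WKT constant, which makes it easy to swap in a smaller study area later.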
1. Find datasets in Canadian waters
Server-side filtering with a geometry= polygon — much cheaper than pulling everything and filtering client-side.
r = requests.get(
    f'{API}/dataset',
    params={'geometry': CANADA_BBOX_WKT, 'size': 10},
    timeout=60,
)
r.raise_for_status()
payload = r.json()
print(f"{payload['total']:,} datasets intersect Canadian-waters bbox")
ca = pd.DataFrame(payload['results'])
cols = [c for c in ('id', 'title', 'records') if c in ca.columns]
ca[cols].head(10)
1,021 datasets intersect Canadian-waters bbox
| | id | title | records |
|---|---|---|---|
| 0 | 75bd3b87-1059-4355-b757-40a2f141fd3f | DFO Pacific: Herring Biosample Database | 1923172 |
| 1 | 5061d21c-6161-4ea2-a8d4-38f8285dfc47 | Pacific Multispecies Small Mesh Bottom Trawl S... | 1665776 |
| 2 | 7b6fa45f-e4fd-4e40-a537-97eb2f63c690 | Maritimes Summer Research Vessel Surveys | 1519459 |
| 3 | a390823a-2a59-4680-9f42-c34af84eeae8 | Quoddy Region Pelagics Telemetry | 1316591 |
| 4 | e0f45753-759a-4bc2-ab24-26dba4376707 | Inner Bay of Fundy Striped Bass | 1112261 |
| 5 | d17e4cae-baf5-4f8b-980f-aa6806f070e2 | DATRAS Canadian Maritimes Trawl Survey | 1031032 |
| 6 | d91cd0b2-66a9-4758-b18f-add691bcd0ce | OTN VR2W Loan - Emera Snow Crab and American L... | 833675 |
| 7 | 4d52d23d-4cd0-488b-8f28-ee71b03f98b8 | Nahant Collection | 829176 |
| 8 | d895e645-a98d-4720-b6fb-332929190f36 | Maritimes Spring Research Vessel Surveys | 588991 |
| 9 | 800210cb-3c7a-45e4-bbf4-4f246a1eb6dd | Large Scale Movements of Sturgeon in the St. L... | 520151 |
2. Single-dataset metadata
What does the JSON shape actually look like? Walk it once so the rest of the workshop makes sense.
import html, json
from IPython.display import JSON
r = requests.get(f'{API}/dataset/{DEMO_UUID}', timeout=30)
r.raise_for_status()
ds = r.json()['results'][0]
# Collapsed by default — click ▶ to drill into nested fields.
JSON(ds, expanded=False, root='dataset')
<IPython.core.display.JSON object>
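Where the collapsible viewer is not available (a static render, a plain terminal), the same walk can be done in plain Python. The sketch below lists every leaf path in a nested structure together with its type. The function name iter_key_paths is hypothetical, and the sample dictionary is made up for illustration: id, title and records appear in the dataset listing above, but the nested contacts field is an assumption, not a confirmed part of the OBIS response.

```python
def iter_key_paths(obj, prefix=''):
    """Yield (path, type_name) for every leaf in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f'{prefix}.{key}' if prefix else key
            yield from iter_key_paths(value, child)
    elif isinstance(obj, list):
        # Only the first element is inspected; lists are assumed homogeneous.
        if obj:
            yield from iter_key_paths(obj[0], f'{prefix}[]')
        else:
            yield f'{prefix}[]', 'empty list'
    else:
        yield prefix, type(obj).__name__


# Illustrative sample, NOT the real OBIS schema ('contacts' is hypothetical).
sample = {
    'id': 'd895e645-a98d-4720-b6fb-332929190f36',
    'title': 'Maritimes Spring Research Vessel Surveys',
    'records': 588991,
    'contacts': [{'name': 'Example Person', 'role': 'owner'}],
}

paths = dict(iter_key_paths(sample))
for path, kind in paths.items():
    print(f'{path}: {kind}')
```

In the notebook, passing the real ds dictionary instead of sample gives a flat overview of the fields before drilling into any of them.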
3. Cursor-paginated occurrence pull
The pattern: ask for size=10000 records, look at the last id you got back, pass it as after=… to the next request. Stop when a page is shorter than your page size.
This is the same loop our production harvester uses (obis_harvester.py:328–354).
def fetch_all_occurrences(dataset_id: str, page_size: int = 10000, max_pages=None):
    """Cursor-paginate /v3/occurrence and return a list of dicts."""
    params = {'datasetid': dataset_id, 'size': page_size}
    all_results = []
    page = 0
    pbar = tqdm(desc=f'{dataset_id[:8]}…', unit='rec')
    while True:
        r = requests.get(f'{API}/occurrence', params=params, timeout=120)
        r.raise_for_status()
        results = r.json().get('results', [])
        if not results:
            break
        all_results.extend(results)
        pbar.update(len(results))
        page += 1
        # A short page means the data is exhausted; max_pages is an optional cap.
        if len(results) < page_size or (max_pages and page >= max_pages):
            break
        # The id of the last record becomes the cursor for the next request.
        last_id = results[-1].get('id')
        if not last_id:
            break
        params['after'] = last_id
    pbar.close()
    return all_results
# Cap at 2 pages (~20 000 rows) so the cell finishes in workshop time.
records = fetch_all_occurrences(DEMO_UUID, max_pages=2)
print(f'Pulled {len(records):,} records')
Pulled 20,000 records
occ = pd.DataFrame(records)
print(f'{len(occ.columns)} columns:', occ.columns.tolist())
cols = [c for c in ('scientificName', 'decimalLatitude', 'decimalLongitude', 'eventDate', 'minimumDepthInMeters') if c in occ.columns]
occ[cols].head()
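To see the cursor logic in isolation, without touching the network, the loop can be written as a generator over an injected page-fetching function. This is a sketch under stated assumptions: iter_cursor_pages and fake_fetch are hypothetical names, and the in-memory backend only imitates the after= behaviour described above; it is not the production harvester code.

```python
def iter_cursor_pages(fetch_page, page_size, max_pages=None):
    """Yield pages from fetch_page(after, size) until the data runs out."""
    after = None
    pages = 0
    while True:
        results = fetch_page(after, page_size)
        if not results:
            return
        yield results
        pages += 1
        # A short page means there is nothing left; max_pages is an optional cap.
        if len(results) < page_size or (max_pages is not None and pages >= max_pages):
            return
        after = results[-1].get('id')  # cursor for the next request
        if not after:
            return


# Fake backend: 25 records, served in order, resuming after the cursor id.
FAKE_RECORDS = [{'id': f'rec-{i:03d}'} for i in range(25)]


def fake_fetch(after, size):
    start = 0
    if after is not None:
        start = next(i for i, r in enumerate(FAKE_RECORDS) if r['id'] == after) + 1
    return FAKE_RECORDS[start:start + size]


page_sizes = [len(page) for page in iter_cursor_pages(fake_fetch, page_size=10)]
print(page_sizes)  # [10, 10, 5]
```

Injecting the fetch function keeps the stop conditions testable: the final short page ends the loop, and a dataset whose size is an exact multiple of the page size ends on the empty page that follows.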
4. Faceting
/v3/facet is the cheap way to ask aggregate questions. Here: how many records per taxonomic class in the dataset?
r = requests.get(f'{API}/facet', params={'datasetid': DEMO_UUID, 'facets': 'class'}, timeout=30)
r.raise_for_status()
facet_results = r.json()['results']['class']
facets = pd.DataFrame(facet_results)
facets.head(15)
| | key | records |
|---|---|---|
| 0 | Teleostei | 496321 |
| 1 | Elasmobranchii | 97005 |
| 2 | Malacostraca | 20160 |
| 3 | Bivalvia | 18902 |
| 4 | Cephalopoda | 9490 |
| 5 | Asteroidea | 2252 |
| 6 | Gastropoda | 1037 |
| 7 | Echinoidea | 736 |
| 8 | Polychaeta | 338 |
| 9 | Petromyzonti | 309 |
# Visualise the top classes
ax = facets.head(10).set_index('key')['records'].plot.barh(figsize=(8, 4))
ax.invert_yaxis()
ax.set_xlabel('records')
ax.set_title(f'Top taxonomic classes — {DEMO_UUID[:8]}…')
Real-world callback: facet → EOV mapping
This same ?facets=class call powers fetch_eovs_from_taxonomy() in cioos-metadata-conversion — one cheap request tells you the per-class record histogram, which we turn into a list of Essential Ocean Variables for the CIOOS catalogue. Stripped-down sketch below; the production version has a ~80-entry class→EOV map and per-EOV thresholds.
# Abridged — real map lives in cioos_metadata_conversion/load_from/obis.py
TAXON_CLASS_TO_EOV = {
    'Teleostei': 'fish_abundance_and_distribution',
    'Elasmobranchii': 'fish_abundance_and_distribution',
    'Bacillariophyceae': 'phytoplankton_biomass_and_diversity',
    'Dinophyceae': 'phytoplankton_biomass_and_diversity',
    'Copepoda': 'zooplankton_biomass_and_diversity',
    'Malacostraca': 'invertebrate_abundance_and_distribution',
    'Bivalvia': 'invertebrate_abundance_and_distribution',
    'Cephalopoda': 'invertebrate_abundance_and_distribution',
    'Anthozoa': 'hard_coral_cover_and_composition',  # "cover" EOV
}
COVER_EOVS = {'hard_coral_cover_and_composition'}
COVER_EOV_MIN_FRACTION = 0.05  # suppress by-catch false positives

def fetch_eovs_from_taxonomy(dataset_id):
    r = requests.get(
        f'{API}/facet',
        params={'datasetid': dataset_id, 'facets': 'class'},
        timeout=30,
    )
    r.raise_for_status()
    classes = r.json().get('results', {}).get('class', [])
    total = sum(c.get('records', 0) or 0 for c in classes)
    eovs = set()
    for c in classes:
        eov = TAXON_CLASS_TO_EOV.get(c['key'])
        if not eov:
            continue
        # Cover-type EOVs need to be dominant, not by-catch
        if eov in COVER_EOVS and total and c['records'] / total < COVER_EOV_MIN_FRACTION:
            continue
        eovs.add(eov)
    return sorted(eovs)
fetch_eovs_from_taxonomy(DEMO_UUID)
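The effect of the cover-EOV threshold is easiest to see on made-up numbers. The sketch below separates the mapping step from the HTTP call and runs it on two synthetic facet payloads. The function name eovs_from_classes, the two-entry map, and all counts are invented for illustration; they are not OBIS data and not the production map.

```python
CLASS_TO_EOV = {
    'Teleostei': 'fish_abundance_and_distribution',
    'Anthozoa': 'hard_coral_cover_and_composition',
}
COVER_EOVS = {'hard_coral_cover_and_composition'}
MIN_FRACTION = 0.05  # same 5 % threshold as the abridged sketch above


def eovs_from_classes(classes):
    """Apply the class -> EOV map with the cover-EOV dominance threshold."""
    total = sum(c.get('records', 0) or 0 for c in classes)
    eovs = set()
    for c in classes:
        eov = CLASS_TO_EOV.get(c['key'])
        if not eov:
            continue
        share = (c.get('records', 0) or 0) / total if total else 0.0
        # Cover-type EOVs are dropped when the class is only a small fraction.
        if eov in COVER_EOVS and total and share < MIN_FRACTION:
            continue
        eovs.add(eov)
    return sorted(eovs)


# Synthetic facet payloads (made-up counts, not OBIS data).
trawl_survey = [{'key': 'Teleostei', 'records': 990}, {'key': 'Anthozoa', 'records': 10}]
reef_survey = [{'key': 'Teleostei', 'records': 600}, {'key': 'Anthozoa', 'records': 400}]

trawl_eovs = eovs_from_classes(trawl_survey)
reef_eovs = eovs_from_classes(reef_survey)
print(trawl_eovs)  # ['fish_abundance_and_distribution']
print(reef_eovs)   # ['fish_abundance_and_distribution', 'hard_coral_cover_and_composition']
```

In the trawl example corals make up 1 % of the records, so the coral-cover EOV is suppressed as by-catch; in the reef example they make up 40 %, so it is kept.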
Check your understanding
Q1 — Spot the bugs. A colleague wants the second page of Canadian-waters occurrences and writes:
requests.get(f'{API}/occurrence', params={'country': 'CA', 'start': 10000, 'size': 10000})
This request returns data, but it's wrong in two ways. What are they?
Answer
- country='CA' is not a geographic filter: it filters by OBIS node, not by where records were collected. To get Canadian waters, pass a geometry= polygon (like CANADA_BBOX_WKT) instead.
- start= does nothing: OBIS paginates with an after=<last id> cursor, not a numeric offset. start=10000 is silently ignored, so this returns page one again, not page two. To advance, take the id of the last record and pass it as after=.
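Putting both fixes together, a corrected request might be assembled as in the sketch below. The helper name occurrence_params and the cursor value 'example-last-id' are placeholders for illustration; in practice the cursor is the id of the last record returned by the previous page.

```python
from urllib.parse import urlencode

CANADA_BBOX_WKT = 'POLYGON((-141 41, -52 41, -52 84, -141 84, -141 41))'


def occurrence_params(geometry, size=10000, after=None):
    """Build query parameters for one page of a geometry-filtered pull."""
    params = {'geometry': geometry, 'size': size}
    if after is not None:
        # Cursor: id of the last record on the previous page.
        params['after'] = after
    return params


first_page = occurrence_params(CANADA_BBOX_WKT)
# 'example-last-id' is a placeholder; use results[-1]['id'] from page one.
second_page = occurrence_params(CANADA_BBOX_WKT, after='example-last-id')
query_string = urlencode(second_page)
print(query_string)
```

The resulting dictionary can be passed as params= to requests.get(f'{API}/occurrence', ...) exactly as in section 3; note that neither country nor start appears in it.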
Q2 — Hands on. Without pulling a single occurrence record, use the /v3/facet result to find how many records belong to class Elasmobranchii (sharks, skates, and rays). Fill in the starter cell below, then expand the solution to check your work.
# Q2 — your turn.
# Goal: how many records in the demo dataset belong to class 'Elasmobranchii'?
# Reuse the /v3/facet call from section 4, then look up the class in the result.
r = requests.get(
    f'{API}/facet',
    params={'datasetid': DEMO_UUID, 'facets': 'class'},
    timeout=30,
)
r.raise_for_status()
classes = r.json()['results']['class'] # list of {'key': <class>, 'records': <count>}
# TODO: find the entry whose 'key' == 'Elasmobranchii' and print its record count.
# Note: keys are case-sensitive ('Elasmobranchii', not 'elasmobranchii').
Q2 — solution
match = next((c for c in classes if c['key'] == 'Elasmobranchii'), None)
print(f"{match['records']:,} Elasmobranchii records" if match else 'Not in the top classes')
→ 97,005 Elasmobranchii records — one cheap request, no occurrence pull.
Two things worth noting:
- The lookup is case-sensitive. The keys come straight from the taxonomy ('Elasmobranchii', 'Teleostei'), so == 'elasmobranchii' would never match.
- /v3/facet returns only the top ~10 buckets by default. So a class missing from the list (e.g. Aves) isn't necessarily absent from the dataset; it just didn't crack the top 10. To widen the list, pass a larger size=; to count one specific class regardless of rank, query the /v3/occurrence endpoint with a taxonid/scientificname filter instead.
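Both caveats can be folded into a small lookup helper. The sketch below is one possible shape, with facet_count as a hypothetical name: it matches keys regardless of case and returns None, rather than 0, when the class is not among the returned buckets. The sample list reuses the first three rows of the class table from section 4.

```python
def facet_count(buckets, name):
    """Return the record count for a class in a facet bucket list, or None.

    None means "not among the returned buckets", which is not the same as
    zero records. Matching ignores case, so 'elasmobranchii' still finds
    'Elasmobranchii'.
    """
    wanted = name.casefold()
    for bucket in buckets:
        if str(bucket.get('key', '')).casefold() == wanted:
            return bucket.get('records')
    return None


# First three rows of the class facet table shown in section 4.
buckets = [
    {'key': 'Teleostei', 'records': 496321},
    {'key': 'Elasmobranchii', 'records': 97005},
    {'key': 'Malacostraca', 'records': 20160},
]

shark_count = facet_count(buckets, 'elasmobranchii')
bird_count = facet_count(buckets, 'Aves')
print(shark_count)  # 97005
print(bird_count)   # None
```

Returning None keeps the "not in the top buckets" case visibly distinct from a genuine count, so a caller has to decide whether to widen the facet or fall back to an occurrence query.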
What's next
- The cursor-paged pull above is honest but slow — we'll redo it in seconds with Parquet+DuckDB in Notebook 02. OBIS itself recommends this path for large subsets: obis.org/data/access — "for programmatic analysis of large subsets of data, the GeoParquet version of the dataset hosted on AWS is highly recommended." The snapshots live at iobis/obis-open-data.
- scientificName strings here aren't authoritative; resolving them to AphiaIDs against the World Register of Marine Species (WoRMS) is what Notebook 03 is for. OBIS requires publishers to match against an authoritative register (see OBIS Manual §12 "Match Taxonomic Names"), and the WoRMS taxon-match tool is the canonical resolver.
- For private/embargoed datasets that don't have Parquet exports yet, this REST path is your only option, so keep it in your back pocket as a fallback. The obis-open-data snapshot is a periodic export of published data, so very recent or embargoed datasets aren't in it until the next cycle.