The version on the docs site is a static render — to actually run the cells, click the badge above (Google Colab) or follow Setup to run locally.

In [1]:

# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
    %pip install -q requests pandas tqdm ipywidgets

01 — OBIS REST API directly

Time budget: ~10 min · Goal: the request-level view of OBIS — endpoints, cursor pagination, and faceting. Everything higher-level (clients, exports) ultimately lands on these calls.

Three endpoints cover almost everything CIOOS does with OBIS:

Endpoint	Purpose
`/v3/dataset`	List or get datasets
`/v3/occurrence`	Occurrence records (paginated)
`/v3/facet`	Aggregations (taxonomic class counts, etc.)

Pagination quirk: OBIS uses an after= cursor — the id of the last record on the previous page — not a page-number parameter. If you write &start=… you'll silently miss data.

Country-filter quirk: country=CA does not filter geographically — it filters by OBIS node. For Canadian-waters records, use a geometry= polygon.

Reference implementation: explore-cioos/harvester/cde_harvester/obis_harvester.py:324–360.
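
The geometry= filter takes a WKT polygon. As a minimal sketch, a bounding box can be turned into that string with a small helper. `bbox_to_wkt` is a hypothetical name introduced here for illustration; it is not part of OBIS or the harvester. It assumes longitude/latitude order and a box that does not cross the antimeridian.

```python
def bbox_to_wkt(west, south, east, north):
    """Return a closed WKT POLYGON ring for a lon/lat bounding box.

    Hypothetical helper for illustration. Corners run counter-clockwise
    from the south-west corner, and the first point is repeated at the
    end to close the ring.
    """
    if west >= east or south >= north:
        raise ValueError('expected west < east and south < north')
    corners = [
        (west, south),
        (east, south),
        (east, north),
        (west, north),
        (west, south),
    ]
    ring = ', '.join(f'{lon} {lat}' for lon, lat in corners)
    return f'POLYGON(({ring}))'


canada_bbox_wkt = bbox_to_wkt(-141, 41, -52, 84)
print(canada_bbox_wkt)
```

With these four integers the result is the same string as the CANADA_BBOX_WKT constant used in the rest of this notebook.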

In [2]:

import requests
import pandas as pd
from tqdm.notebook import tqdm

API = 'https://api.obis.org/v3'
DEMO_UUID = 'd895e645-a98d-4720-b6fb-332929190f36'  # Maritimes Spring RV Surveys
CANADA_BBOX_WKT = 'POLYGON((-141 41, -52 41, -52 84, -141 84, -141 41))'

1. Find datasets in Canadian waters

Server-side filtering with a geometry= polygon — much cheaper than pulling everything and filtering client-side.

In [3]:

r = requests.get(
    f'{API}/dataset',
    params={'geometry': CANADA_BBOX_WKT, 'size': 10},
    timeout=60,
)
r.raise_for_status()
payload = r.json()
print(f"{payload['total']:,} datasets intersect Canadian-waters bbox")
ca = pd.DataFrame(payload['results'])
cols = [c for c in ('id', 'title', 'records') if c in ca.columns]
ca[cols].head(10)

1,021 datasets intersect Canadian-waters bbox

Out[3]:

	id	title	records
0	75bd3b87-1059-4355-b757-40a2f141fd3f	DFO Pacific: Herring Biosample Database	1923172
1	5061d21c-6161-4ea2-a8d4-38f8285dfc47	Pacific Multispecies Small Mesh Bottom Trawl S...	1665776
2	7b6fa45f-e4fd-4e40-a537-97eb2f63c690	Maritimes Summer Research Vessel Surveys	1519459
3	a390823a-2a59-4680-9f42-c34af84eeae8	Quoddy Region Pelagics Telemetry	1316591
4	e0f45753-759a-4bc2-ab24-26dba4376707	Inner Bay of Fundy Striped Bass	1112261
5	d17e4cae-baf5-4f8b-980f-aa6806f070e2	DATRAS Canadian Maritimes Trawl Survey	1031032
6	d91cd0b2-66a9-4758-b18f-add691bcd0ce	OTN VR2W Loan - Emera Snow Crab and American L...	833675
7	4d52d23d-4cd0-488b-8f28-ee71b03f98b8	Nahant Collection	829176
8	d895e645-a98d-4720-b6fb-332929190f36	Maritimes Spring Research Vessel Surveys	588991
9	800210cb-3c7a-45e4-bbf4-4f246a1eb6dd	Large Scale Movements of Sturgeon in the St. L...	520151

2. Single-dataset metadata

What does the JSON shape actually look like? Walk it once so the rest of the workshop makes sense.

In [11]:

import html, json
from IPython.display import JSON

r = requests.get(f'{API}/dataset/{DEMO_UUID}', timeout=30)
r.raise_for_status()
ds = r.json()['results'][0]

# Collapsed by default — click ▶ to drill into nested fields.
JSON(ds, expanded=False, root='dataset')

Out[11]:

<IPython.core.display.JSON object>
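
The interactive JSON viewer does not survive a static render, so a plain-text outline of keys and value types can serve as a rough substitute. This is a sketch only: `outline` is a hypothetical helper, and the `sample` dict is made up for illustration rather than copied from a real OBIS response.

```python
def outline(obj, depth=0, max_depth=2):
    """Return indented 'key: type' lines describing a nested JSON value."""
    pad = '  ' * depth
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f'{pad}{key}: {type(value).__name__}')
            if depth < max_depth and isinstance(value, (dict, list)):
                lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Describe only the first element; JSON lists are usually homogeneous.
        lines.append(f'{pad}[0]: {type(obj[0]).__name__}')
        if depth < max_depth and isinstance(obj[0], (dict, list)):
            lines.extend(outline(obj[0], depth + 1, max_depth))
    return lines


# Made-up stand-in for `ds`; field names here are illustrative assumptions.
sample = {
    'id': 'example-id',
    'title': 'Example survey',
    'records': 123,
    'contacts': [{'givenname': 'A.', 'surname': 'Person'}],
    'extent': {'west': -70.0, 'east': -56.0},
}
print('\n'.join(outline(sample)))
```

In the notebook, calling `outline(ds)` on the real record would list its actual top-level fields and the first two levels of nesting.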

3. Cursor-paginated occurrence pull

The pattern: ask for size=10000 records, look at the last id you got back, pass it as after=… to the next request. Stop when a page is shorter than your page size.

This is the same loop our production harvester uses (obis_harvester.py:328–354).

In [4]:

def fetch_all_occurrences(dataset_id: str, page_size: int = 10000, max_pages=None):
    """Cursor-paginate /v3/occurrence and return a list of dicts."""
    params = {'datasetid': dataset_id, 'size': page_size}
    all_results = []
    page = 0
    pbar = tqdm(desc=f'{dataset_id[:8]}…', unit='rec')
    while True:
        r = requests.get(f'{API}/occurrence', params=params, timeout=120)
        r.raise_for_status()
        results = r.json().get('results', [])
        if not results:
            break
        all_results.extend(results)
        pbar.update(len(results))
        page += 1
        if len(results) < page_size or (max_pages and page >= max_pages):
            break
        last_id = results[-1].get('id')
        if not last_id:
            break
        params['after'] = last_id
    pbar.close()
    return all_results

# Cap at 2 pages (~20 000 rows) so the cell finishes in workshop time.
records = fetch_all_occurrences(DEMO_UUID, max_pages=2)
print(f'Pulled {len(records):,} records')

d895e645…: 0rec [00:00, ?rec/s]

Pulled 20,000 records
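
To check the stop conditions without touching the network, the same loop can be written as a generator that takes the page fetcher as an argument. This is a sketch under stated assumptions: `paginate` and `make_fake_fetch` are hypothetical names, and the stub uses small integer ids where real OBIS ids are strings.

```python
def paginate(fetch_page, page_size, max_pages=None):
    """Yield records page by page, following an `after=` style cursor.

    `fetch_page(after)` must return a list of dicts that each carry an 'id'.
    """
    after = None
    pages = 0
    while True:
        results = fetch_page(after)
        if not results:
            return  # empty page: nothing left
        yield from results
        pages += 1
        if len(results) < page_size or (max_pages and pages >= max_pages):
            return  # short page or page cap reached
        after = results[-1].get('id')
        if not after:
            return  # no cursor to continue from


def make_fake_fetch(total, page_size):
    """Hypothetical stub: serves ids 1..total, `page_size` at a time."""
    def fetch_page(after):
        start = 0 if after is None else after
        stop = min(start + page_size, total)
        return [{'id': i} for i in range(start + 1, stop + 1)]
    return fetch_page


# 25 records at 10 per page: pages of 10, 10 and 5, then stop on the short page.
rows = list(paginate(make_fake_fetch(total=25, page_size=10), page_size=10))
print(len(rows), rows[-1])
```

A generator also lets a caller process each page as it arrives instead of holding every record in one list, which matters more as the page cap is lifted.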

In [ ]:

occ = pd.DataFrame(records)
print(f'{len(occ.columns)} columns:', occ.columns.tolist())

cols = [c for c in ('scientificName', 'decimalLatitude', 'decimalLongitude', 'eventDate', 'minimumDepthInMeters') if c in occ.columns]
occ[cols].head()

4. Faceting

/v3/facet is the cheap way to ask aggregate questions. Here: how many records per taxonomic class in the dataset?

In [6]:

r = requests.get(f'{API}/facet', params={'datasetid': DEMO_UUID, 'facets': 'class'}, timeout=30)
r.raise_for_status()
facet_results = r.json()['results']['class']
facets = pd.DataFrame(facet_results)
facets.head(15)

Out[6]:

	key	records
0	Teleostei	496321
1	Elasmobranchii	97005
2	Malacostraca	20160
3	Bivalvia	18902
4	Cephalopoda	9490
5	Asteroidea	2252
6	Gastropoda	1037
7	Echinoidea	736
8	Polychaeta	338
9	Petromyzonti	309
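
Raw counts are easier to compare as shares. The sketch below hard-codes the ten rows from the table above so it runs offline. The shares are relative to the buckets listed, not to the whole dataset, and `facet_counts`, `shown_total` and `shares` are names introduced here for illustration.

```python
# Counts copied from the Out[6] table above.
facet_counts = [
    {'key': 'Teleostei', 'records': 496321},
    {'key': 'Elasmobranchii', 'records': 97005},
    {'key': 'Malacostraca', 'records': 20160},
    {'key': 'Bivalvia', 'records': 18902},
    {'key': 'Cephalopoda', 'records': 9490},
    {'key': 'Asteroidea', 'records': 2252},
    {'key': 'Gastropoda', 'records': 1037},
    {'key': 'Echinoidea', 'records': 736},
    {'key': 'Polychaeta', 'records': 338},
    {'key': 'Petromyzonti', 'records': 309},
]

# Denominator is the sum of the listed buckets only.
shown_total = sum(row['records'] for row in facet_counts)
shares = {row['key']: row['records'] / shown_total for row in facet_counts}

for key, share in shares.items():
    print(f'{key:<16}{share:7.2%}')
```

In the notebook the same two lines work on the live response by swapping `facet_counts` for `facet_results`.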

In [ ]:

# Visualise the top classes
ax = facets.head(10).set_index('key')['records'].plot.barh(figsize=(8, 4))
ax.invert_yaxis()
ax.set_xlabel('records')
ax.set_title(f'Top taxonomic classes — {DEMO_UUID[:8]}…')

Real-world callback: facet → EOV mapping

This same ?facets=class call powers fetch_eovs_from_taxonomy() in cioos-metadata-conversion: one cheap request returns the per-class record histogram, which we turn into a list of Essential Ocean Variables (EOVs) for the CIOOS catalogue. The sketch below is stripped down; the production version has a ~80-entry class→EOV map and per-EOV thresholds.

In [ ]:

# Abridged — real map lives in cioos_metadata_conversion/load_from/obis.py
TAXON_CLASS_TO_EOV = {
    'Teleostei':         'fish_abundance_and_distribution',
    'Elasmobranchii':    'fish_abundance_and_distribution',
    'Bacillariophyceae': 'phytoplankton_biomass_and_diversity',
    'Dinophyceae':       'phytoplankton_biomass_and_diversity',
    'Copepoda':          'zooplankton_biomass_and_diversity',
    'Malacostraca':      'invertebrate_abundance_and_distribution',
    'Bivalvia':          'invertebrate_abundance_and_distribution',
    'Cephalopoda':       'invertebrate_abundance_and_distribution',
    'Anthozoa':          'hard_coral_cover_and_composition',  # "cover" EOV
}
COVER_EOVS = {'hard_coral_cover_and_composition'}
COVER_EOV_MIN_FRACTION = 0.05  # suppress by-catch false positives


def fetch_eovs_from_taxonomy(dataset_id):
    r = requests.get(
        f'{API}/facet',
        params={'datasetid': dataset_id, 'facets': 'class'},
        timeout=30,
    )
    r.raise_for_status()
    classes = r.json().get('results', {}).get('class', [])

    total = sum(c.get('records', 0) or 0 for c in classes)
    eovs = set()
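    for c in classes:
        eov = TAXON_CLASS_TO_EOV.get(c.get('key'))
        if not eov:
            continue
        # Tolerate a missing or null count, the same way `total` does above
        count = c.get('records') or 0
        # Cover-type EOVs need a meaningful share of the records, not just by-catch
        if eov in COVER_EOVS and total and count / total < COVER_EOV_MIN_FRACTION:
            continue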
        eovs.add(eov)
    return sorted(eovs)


fetch_eovs_from_taxonomy(DEMO_UUID)
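
The cover threshold is easier to see with the network call factored out. The sketch below applies the same rule to hand-written counts. `eovs_from_counts`, the two-entry map and both sample surveys are made up for illustration; none of them come from the production code.

```python
CLASS_TO_EOV = {
    'Teleostei': 'fish_abundance_and_distribution',
    'Anthozoa': 'hard_coral_cover_and_composition',
}
COVER = {'hard_coral_cover_and_composition'}
MIN_FRACTION = 0.05


def eovs_from_counts(classes, class_to_eov=CLASS_TO_EOV,
                     cover=COVER, min_fraction=MIN_FRACTION):
    """Map facet buckets to EOVs, dropping cover EOVs below `min_fraction`."""
    total = sum(c.get('records') or 0 for c in classes)
    eovs = set()
    for c in classes:
        eov = class_to_eov.get(c.get('key'))
        if not eov:
            continue
        count = c.get('records') or 0
        if eov in cover and total and count / total < min_fraction:
            continue  # cover-type class is too small a share of the records
        eovs.add(eov)
    return sorted(eovs)


# Made-up trawl survey: coral is 1% of records, so the coral EOV is dropped.
trawl = [{'key': 'Teleostei', 'records': 9900}, {'key': 'Anthozoa', 'records': 100}]
# Made-up reef survey: coral is 40% of records, so the coral EOV is kept.
reef = [{'key': 'Teleostei', 'records': 600}, {'key': 'Anthozoa', 'records': 400}]

print(eovs_from_counts(trawl))
print(eovs_from_counts(reef))
```

Separating the pure mapping from the HTTP request also makes the rule straightforward to unit-test against edge cases such as an empty bucket list.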

Check your understanding

Q1 — Spot the bugs. A colleague wants the second page of Canadian-waters occurrences and writes:

requests.get(f'{API}/occurrence', params={'country': 'CA', 'start': 10000, 'size': 10000})

This request returns data, but it's wrong in two ways. What are they?

Answer

country='CA' is not a geographic filter — it filters by OBIS node, not by where records were collected. To get Canadian waters, pass a geometry= polygon (like CANADA_BBOX_WKT) instead.
start= does nothing — OBIS paginates with an after=<last id> cursor, not a numeric offset. start=10000 is silently ignored, so this returns page one again, not page two. To advance, take the id of the last record and pass it as after=.
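
Putting both fixes together, a corrected pair of requests might be built like this. The sketch only constructs URLs so it runs offline. `occurrence_url` is a hypothetical helper, and the `after` value is a placeholder standing in for the id of the last record on page one.

```python
from urllib.parse import urlencode

API = 'https://api.obis.org/v3'
CANADA_BBOX_WKT = 'POLYGON((-141 41, -52 41, -52 84, -141 84, -141 41))'


def occurrence_url(geometry, size, after=None):
    """Build an /occurrence URL; `after` is the last id of the previous page."""
    params = {'geometry': geometry, 'size': size}
    if after is not None:
        params['after'] = after
    return f'{API}/occurrence?{urlencode(params)}'


# Page one: geographic filter, no cursor yet.
page_one = occurrence_url(CANADA_BBOX_WKT, 10000)

# Page two: same filter plus the cursor. The id below is a placeholder,
# not a real OBIS id; in practice it comes from page one's last record.
page_two = occurrence_url(CANADA_BBOX_WKT, 10000, after='LAST-ID-FROM-PAGE-ONE')

print(page_one)
print(page_two)
```

The filter parameters stay identical between pages; only the cursor changes.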

Q2 — Hands on. Without pulling a single occurrence record, use the /v3/facet result to find how many records belong to class Elasmobranchii (sharks, skates, and rays). Fill in the starter cell below, then expand the solution to check your work.

In [ ]:

# Q2 — your turn.
# Goal: how many records in the demo dataset belong to class 'Elasmobranchii'?
# Reuse the /v3/facet call from section 4, then look up the class in the result.

r = requests.get(
    f'{API}/facet',
    params={'datasetid': DEMO_UUID, 'facets': 'class'},
    timeout=30,
)
r.raise_for_status()
classes = r.json()['results']['class']   # list of {'key': <class>, 'records': <count>}

# TODO: find the entry whose 'key' == 'Elasmobranchii' and print its record count.
#       Note: keys are case-sensitive ('Elasmobranchii', not 'elasmobranchii').

Q2 — solution

match = next((c for c in classes if c['key'] == 'Elasmobranchii'), None)
print(f"{match['records']:,} Elasmobranchii records" if match else 'Not in the top classes')

→ 97,005 Elasmobranchii records — one cheap request, no occurrence pull.

Two things worth noting:

The lookup is case-sensitive. The keys come straight from the taxonomy ('Elasmobranchii', 'Teleostei'), so == 'elasmobranchii' would never match.
/v3/facet returns only the top ~10 buckets by default. So a class missing from the list (e.g. Aves) isn't necessarily absent from the dataset — it just didn't crack the top 10. To widen the list, pass a larger size=; to count one specific class regardless of rank, query the /v3/occurrence endpoint with a taxonid/scientificname filter instead.
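
If case mismatches are a recurring nuisance, one option is to fold case on both sides of the lookup. This is a sketch with hypothetical helper names (`facet_index`, `class_count`), run here on two rows copied from the Out[6] table. A None result means only that the class is not among the returned buckets.

```python
def facet_index(buckets):
    """Map case-folded class name -> record count for a list of facet buckets."""
    return {b['key'].casefold(): b['records'] for b in buckets}


def class_count(buckets, name):
    """Return the count for `name`, or None if it is not among the buckets."""
    return facet_index(buckets).get(name.casefold())


# Two rows copied from the Out[6] table; in the notebook, pass `classes` instead.
sample_buckets = [
    {'key': 'Teleostei', 'records': 496321},
    {'key': 'Elasmobranchii', 'records': 97005},
]

print(class_count(sample_buckets, 'elasmobranchii'))  # → 97005
print(class_count(sample_buckets, 'Aves'))            # → None
```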

What's next

The cursor-paged pull above is honest but slow — we'll redo it in seconds with Parquet+DuckDB in Notebook 02. OBIS itself recommends this path for large subsets: obis.org/data/access — "for programmatic analysis of large subsets of data, the GeoParquet version of the dataset hosted on AWS is highly recommended." The snapshots live at iobis/obis-open-data.
scientificName strings here aren't authoritative; resolving them to AphiaIDs against the World Register of Marine Species (WoRMS) is what Notebook 03 is for. OBIS requires publishers to match against an authoritative register — see OBIS Manual §12 "Match Taxonomic Names" — and the WoRMS taxon-match tool is the canonical resolver.
For private/embargoed datasets that don't have Parquet exports yet, this REST path is your only option — keep it in your back pocket as a fallback. The obis-open-data snapshot is a periodic export of published data, so very recent or embargoed datasets aren't in it until the next cycle.
