The version on the docs site is a static render — to actually run the cells, click the badge above (Google Colab) or follow Setup to run locally.
# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
    %pip install -q pyworms requests pandas
03 — WoRMS REST API + pyworms
Time budget: ~15 min · Goal: turn the raw scientificName strings we pulled from OBIS in Notebooks 01–02 into authoritative taxonomy — stable AphiaIDs, accepted names, full classification trees.
The World Register of Marine Species (WoRMS) is the taxonomic backbone OBIS itself uses for name interpretation. Knowing how to query it directly matters because:
- OBIS records carry scientificName as supplied by the publisher, which is often misspelt, synonymised, or out of date.
- The interpreted.* struct in the Parquet exports already does WoRMS resolution, but if you're working with non-OBIS data (your own collection, GBIF, a fresh DwC archive) you have to do it yourself.
- WoRMS exposes a JSON REST service at https://www.marinespecies.org/rest/ and a thin Python wrapper, pyworms, maintained by the OBIS Secretariat.
We'll use both — pyworms for the common cases, raw requests when we need to control batching or fields.
import pyworms
import requests
import pandas as pd
WORMS = 'https://www.marinespecies.org/rest'
def aphia_id_by_name(name, marine_only=True):
    """Resolve a scientific name to its AphiaID via the WoRMS REST API.
    pyworms 0.4.0 ships `aphiaIDByName` as an unimplemented stub, so we call the
    REST endpoint directly. Returns the AphiaID as an int, or None if unmatched.
    """
    r = requests.get(
        f'{WORMS}/AphiaIDByName/{name}',
        params={'marine_only': str(marine_only).lower()},
        timeout=30,
    )
    if r.status_code == 204 or not r.text.strip():
        return None
    r.raise_for_status()
    return int(r.text)
1. Single-name lookup
Atlantic cod — one of the top species in the Maritimes Spring RV dataset from Notebook 02. The two-step pattern is: name → AphiaID → full record.
Note: pyworms 0.4.0 doesn't implement aphiaIDByName, so we use the small aphia_id_by_name() helper defined in the imports cell, a single call to the WoRMS AphiaIDByName REST endpoint. The vernacular lookup in §6 needs the same workaround, and §5 uses raw REST on purpose; everything else in this notebook uses pyworms directly.
aphia_id = aphia_id_by_name('Gadus morhua')
print('AphiaID:', aphia_id)
record = pyworms.aphiaRecordByAphiaID(aphia_id)
{k: record[k] for k in ('AphiaID', 'scientificname', 'authority', 'rank', 'status', 'valid_AphiaID', 'valid_name', 'kingdom', 'phylum', 'class', 'order', 'family')}
2. Synonym → accepted name
Some names are still in circulation but have been superseded. For such a name, WoRMS returns the synonym's own record (status: "unaccepted") with a pointer to the accepted record in valid_AphiaID. Always follow that pointer before using the name downstream.
# 'Clupea harengus harengus' is an obsolete subspecies designation for Atlantic herring.
rec = pyworms.aphiaRecordByAphiaID(aphia_id_by_name('Clupea harengus harengus'))
print('Looked up :', rec['scientificname'], '— status:', rec['status'])
print('Accepted :', rec['valid_name'], '(AphiaID', rec['valid_AphiaID'], ')')
# Always resolve to the accepted record:
accepted = pyworms.aphiaRecordByAphiaID(rec['valid_AphiaID'])
accepted['scientificname'], accepted['rank']
3. Batch name matching
The everyday use case: you have a column of scientificName strings from OBIS, GBIF, or a CSV, and you want an AphiaID for each. aphiaRecordsByMatchNames accepts up to 50 names per call and fuzzy-matches them against WoRMS. It returns a list of lists: one inner list per input name, holding multiple candidates when the match is ambiguous.
# A mix taken from the top species in Notebook 02, with one typo to exercise fuzzy match.
names = [
    'Melanogrammus aeglefinus',
    'Gadus morhua',
    'Squalus acanthias',
    'Homarus americanus',
    'Gadus morrhua',  # historical misspelling
]
matches = pyworms.aphiaRecordsByMatchNames(names)
rows = []
for query, candidates in zip(names, matches):
    if not candidates:
        rows.append({'query': query, 'matched_name': None, 'AphiaID': None, 'match_type': None, 'status': None})
        continue
    best = candidates[0]
    rows.append({
        'query': query,
        'matched_name': best['scientificname'],
        'AphiaID': best['AphiaID'],
        'match_type': best.get('match_type'),
        'status': best['status'],
        'valid_name': best.get('valid_name'),
    })
pd.DataFrame(rows)
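For a real column with more than 50 names, the list has to be split before it is sent. A minimal sketch of that batching, assuming the matcher returns one candidate list per input name in input order: chunked and match_in_batches are hypothetical helpers, and match_fn stands in for pyworms.aphiaRecordsByMatchNames so the demo runs offline.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


def match_in_batches(names, match_fn, size=50):
    """Call `match_fn` once per batch and concatenate the per-name results.

    `match_fn` takes a list of names and returns one candidate list per name,
    in the same order as the input.
    """
    results = []
    for batch in chunked(list(names), size):
        results.extend(match_fn(batch))
    return results


# Offline demo: a stand-in matcher that records the batch sizes it receives.
seen_batch_sizes = []

def fake_match(batch):
    seen_batch_sizes.append(len(batch))
    return [[] for _ in batch]  # "no candidates" for every name

demo_names = [f'Name {i}' for i in range(120)]
demo_matches = match_in_batches(demo_names, fake_match)
print(seen_batch_sizes)  # [50, 50, 20]
```

Because the results are concatenated in input order, zip(names, matches) from the cell above keeps working unchanged.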
4. Full classification
The flat kingdom/phylum/class/... fields on a record only show the seven "standard" ranks. aphiaClassificationByAphiaID returns the complete classification — every rank WoRMS holds, including intermediate clades like subphylum, gigaclass, and superclass.
pyworms 0.4.0 returns this as a flattened {rank: name, <rank>id: id} dict (it calls flatten() on the WoRMS nested tree), so we pull out the rank→name pairs rather than walking a nested child chain.
cod_aphia = aphia_id_by_name('Gadus morhua')
classification = pyworms.aphiaClassificationByAphiaID(cod_aphia)
# pyworms flattens the classification into a {rank: name, <rank>id: id} dict.
# Keep the rank→name pairs (those whose key has a matching '<rank>id' entry),
# in taxonomic order — dict order runs high rank → low.
lineage = [(rank, name) for rank, name in classification.items() if f'{rank}id' in classification]
pd.DataFrame(lineage, columns=['rank', 'scientificname'])
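If you call the REST endpoint yourself instead of going through pyworms, you get the nested tree rather than the flattened dict. The sketch below shows one way to walk it, assuming each node carries rank, scientificname, and a child key that is empty at the lowest rank. lineage_from_tree is a hypothetical helper, and the sample tree is hand-written with placeholder AphiaIDs, not a real response.

```python
def lineage_from_tree(node):
    """Walk a nested classification tree and return [(rank, name), ...] top-down."""
    lineage = []
    while node:
        lineage.append((node['rank'], node['scientificname']))
        node = node.get('child')  # None (or missing) at the lowest rank
    return lineage


# Hand-written sample in the assumed nested shape; AphiaIDs are placeholders.
sample_tree = {
    'AphiaID': 101, 'rank': 'Kingdom', 'scientificname': 'Animalia',
    'child': {
        'AphiaID': 102, 'rank': 'Phylum', 'scientificname': 'Chordata',
        'child': {
            'AphiaID': 103, 'rank': 'Species', 'scientificname': 'Gadus morhua',
            'child': None,
        },
    },
}

for rank, name in lineage_from_tree(sample_tree):
    print(f'{rank:<8} {name}')
```

The loop is iterative rather than recursive because each node has at most one child, so the tree is really a linked list.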
5. Same call, raw REST
pyworms is a thin wrapper — every call maps 1:1 to a URL. Useful to know when you need a field pyworms doesn't surface, want to control timeouts, or are debugging from curl. All endpoints are documented at https://www.marinespecies.org/rest/.
r = requests.get(f'{WORMS}/AphiaRecordByAphiaID/{cod_aphia}', timeout=30)
r.raise_for_status()
raw = r.json()  # parse the response body once
{k: raw[k] for k in ('AphiaID', 'scientificname', 'rank', 'status', 'lsid', 'url')}
6. Common (vernacular) names
A scientific name is precise but unfriendly — most people don't know Gadus morhua is Atlantic cod. WoRMS curates vernacular names in dozens of languages, which is exactly what you'd surface in a public catalogue, a map pop-up, or a report.
pyworms 0.4.0 doesn't implement aphiaVernacularsByAphiaID (another stub), so — as with the ID lookup in §1 — we call the AphiaVernacularsByAphiaID REST endpoint directly.
def vernaculars_by_aphia_id(aphia_id):
    """Common names for an AphiaID, as a DataFrame (one row per name + language)."""
    r = requests.get(f'{WORMS}/AphiaVernacularsByAphiaID/{aphia_id}', timeout=30)
    if r.status_code == 204 or not r.text.strip():
        return pd.DataFrame(columns=['vernacular', 'language_code', 'language'])
    r.raise_for_status()
    return pd.DataFrame(r.json())
# Atlantic cod (AphiaID from §1) — every common name WoRMS holds, across all languages.
vernaculars = vernaculars_by_aphia_id(cod_aphia)
print(f'{len(vernaculars)} names in {vernaculars["language_code"].nunique()} languages')
vernaculars.head(10)
# CIOOS is bilingual, so for display we want the English *and* French names.
EN_FR = ['eng', 'fra']
print('Atlantic cod — EN/FR names:')
for _, row in vernaculars[vernaculars['language_code'].isin(EN_FR)].iterrows():
    print(f"  [{row['language_code']}] {row['vernacular']}")
# The same call works for any AphiaID — e.g. Atlantic herring (Clupea harengus):
herring = vernaculars_by_aphia_id(aphia_id_by_name('Clupea harengus'))
herring.loc[herring['language_code'].isin(EN_FR), ['vernacular', 'language']]
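For a map pop-up or catalogue entry you usually want exactly one label per language rather than every name WoRMS holds. A minimal sketch of that selection, assuming rows with the vernacular and language_code fields used above: display_names is a hypothetical helper, "first row wins" is an arbitrary choice, and the sample rows are hand-written rather than a real response.

```python
def display_names(vernaculars, language_codes=('eng', 'fra')):
    """Pick the first vernacular per requested language code.

    `vernaculars` is a list of dicts with 'vernacular' and 'language_code' keys.
    Returns {language_code: name or None}.
    """
    picked = {code: None for code in language_codes}
    for row in vernaculars:
        code = row.get('language_code')
        if code in picked and picked[code] is None:
            picked[code] = row['vernacular']
    return picked


# Hand-written sample rows, not a real WoRMS response.
sample = [
    {'vernacular': 'Kabeljau', 'language_code': 'deu'},
    {'vernacular': 'Atlantic cod', 'language_code': 'eng'},
    {'vernacular': 'cod', 'language_code': 'eng'},
    {'vernacular': 'morue franche', 'language_code': 'fra'},
]
labels = display_names(sample)
print(labels)  # {'eng': 'Atlantic cod', 'fra': 'morue franche'}
```

With the DataFrame from the cells above, the same helper can be fed vernaculars.to_dict('records'). A language with no entry comes back as None, so the caller can fall back to the scientific name.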
Check your understanding
Before moving on, see if you can answer these. Click each ▶ to reveal the answer.
Q1 — Which key do you store? You match a name and WoRMS returns a record with status: 'unaccepted', AphiaID: 126436, and valid_AphiaID: 126435. Which AphiaID do you persist as your stable taxon key, and why does it matter?
Answer
Store valid_AphiaID (126435) — the accepted name's ID. The looked-up record (126436) is a synonym; its status is 'unaccepted'. If you store the synonym's AphiaID, two records that are the same taxon under different names get different keys, so counts, joins, and species lists silently fragment. The rule from §2 and the gotchas: any record whose status != 'accepted' is a pointer — follow valid_AphiaID to the accepted record before storing anything downstream.
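The fragmentation is easy to see on a toy example. The rows below are hand-written and reuse the two IDs from the question; only the grouping key changes between the two counts.

```python
from collections import Counter

# Hand-written occurrence rows: the same taxon recorded under two names.
# IDs mirror the Q1 scenario (126436 is the synonym, 126435 the accepted ID).
occurrences = [
    {'AphiaID': 126436, 'valid_AphiaID': 126435},  # synonym
    {'AphiaID': 126435, 'valid_AphiaID': 126435},  # accepted name
    {'AphiaID': 126436, 'valid_AphiaID': 126435},  # synonym again
]

by_matched_id = Counter(row['AphiaID'] for row in occurrences)
by_accepted_id = Counter(row['valid_AphiaID'] for row in occurrences)

print(dict(by_matched_id))   # two keys: the taxon is split across them
print(dict(by_accepted_id))  # one key holding all three records
```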
Q2 — The marine_only trap. Your dataset is from a river-mouth survey and includes a freshwater species. You batch-match the names with the defaults and that species comes back with no match. What's the likely cause, and what's the one-argument fix?
Answer
aphiaRecordsByMatchNames(..., marine_only=True) is the default, so WoRMS filters out non-marine taxa — a freshwater or terrestrial species returns empty even when it exists in WoRMS. The fix is to pass marine_only=False:
pyworms.aphiaRecordsByMatchNames(names, marine_only=False)
The marine-only default is fine to leave on for pure OBIS data (marine by definition), but it bites on mixed datasets: river-mouth surveys, GBIF downloads, anything spanning the freshwater/marine boundary.
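For a mixed dataset, one option is to keep the marine-only filter for the first pass and retry only the misses without it, so you can tell afterwards which matches needed the wider search. This is a sketch under assumptions, not a pyworms feature: match_with_fallback is a hypothetical helper, and match_fn is assumed to accept (names, marine_only=...) the way aphiaRecordsByMatchNames is called above. The stand-in matcher and its names are made up so the demo runs offline.

```python
def match_with_fallback(names, match_fn):
    """Match with the marine-only filter first, then retry misses without it.

    `match_fn(names, marine_only=...)` must return one candidate list per name,
    in input order. Returns a list of (candidates, marine_only_used) tuples.
    """
    first = match_fn(names, marine_only=True)
    results = [(cands, True) for cands in first]
    missed = [i for i, cands in enumerate(first) if not cands]
    if missed:
        retry = match_fn([names[i] for i in missed], marine_only=False)
        for i, cands in zip(missed, retry):
            results[i] = (cands, False)
    return results


# Offline stand-in: 'Freshwater sp.' only matches when marine_only is False.
def fake_match(names, marine_only=True):
    known = {'Marine sp.': True, 'Freshwater sp.': False}  # name -> is marine
    out = []
    for name in names:
        found = name in known and (known[name] or not marine_only)
        out.append([{'scientificname': name}] if found else [])
    return out

demo = match_with_fallback(['Marine sp.', 'Freshwater sp.', 'Unknown sp.'], fake_match)
for candidates, marine_only_used in demo:
    print(len(candidates), 'candidate(s), marine_only =', marine_only_used)
```

The retry list is a subset of the input, so a batch that respected the 50-name limit on the first call still respects it on the second.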
Gotchas to flag in the talk
- Rate limits: WoRMS' public REST endpoint is shared infrastructure. Batch with aphiaRecordsByMatchNames (≤ 50 names/call) before you parallelise; don't fire one request per row.
- Marine-only filter: aphiaRecordsByMatchNames(..., marine_only=True) is the default. That is fine for OBIS, wrong if your dataset includes freshwater or terrestrial species (e.g. river-mouth surveys, GBIF mixed downloads).
- Always follow valid_AphiaID: a record with status != 'accepted' is a synonym, so don't store its AphiaID as your stable key.
- For OBIS data, prefer interpreted.scientificNameID: the AphiaID is already in the Parquet exports as an LSID; only do the WoRMS round-trip when the input isn't OBIS-interpreted.
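Since the last point means the AphiaID arrives wrapped in an LSID string, a small parser saves the round-trip. The sketch assumes the usual WoRMS LSID shape, urn:lsid:marinespecies.org:taxname:<AphiaID>; aphia_id_from_lsid is a hypothetical helper, and anything that doesn't match that shape yields None rather than raising.

```python
import re

# Assumed LSID shape: urn:lsid:marinespecies.org:taxname:<AphiaID>
_WORMS_LSID = re.compile(r'^urn:lsid:marinespecies\.org:taxname:(\d+)$')


def aphia_id_from_lsid(lsid):
    """Return the AphiaID embedded in a WoRMS LSID, or None if it doesn't match."""
    if not isinstance(lsid, str):
        return None
    match = _WORMS_LSID.match(lsid.strip())
    return int(match.group(1)) if match else None


print(aphia_id_from_lsid('urn:lsid:marinespecies.org:taxname:126436'))  # 126436
print(aphia_id_from_lsid('not an lsid'))                                # None
```

Returning None for non-strings also covers missing values (None or NaN) when the function is mapped over a DataFrame column.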
Wrap-up
Three APIs, three jobs:
- OBIS REST (Notebook 01) — fine-grained queries, faceting, fallback for embargoed data.
- OBIS Parquet on S3 (Notebook 02) — bulk + analytical, the production harvester's path.
- WoRMS / pyworms (this notebook) — the authoritative taxonomic backbone behind both.