The version on the docs site is a static render — to run the cells, open the Colab link above or clone the repo and run locally.
🌊 How to Validate Your Biodiversity Data
OBIS Developer Workshop: Pyobistools
What you'll learn — all 6 validation functions:
| Block | Function | What it checks |
|---|---|---|
| 2 | check_fields | Required and recommended Darwin Core fields are present |
| 3 | check_occurrence_core_and_extension | Valid occurrenceStatus, basisOfRecord, no duplicate IDs |
| 4 | check_eventids | Event hierarchy is consistent (parent–child IDs) |
| 5 | check_measurementids | All measurementID values are unique |
| 6 | check_scientificname_and_ids | Species names and LSIDs match WoRMS |
| 7 | check_onland | Coordinates fall in the ocean, not on land |
Resources:
- 📖 Darwin Core standard
- 📋 Darwin Core terms — OBIS manual
- 🐍 pyobistools on GitHub
- 🌐 WoRMS — World Register of Marine Species
⏱️ Estimated time: ~35 minutes
📶 Internet required for: Block 6 (WoRMS API) and Block 7 bonus (OBIS API)
🗺️ Mermaid diagram below requires internet to load from CDN
Validation workflow — each check feeds into the next; fix errors before moving on:
Your CSV → check_fields → check_occurrence → check_eventids → check_measurementids → check_scientificname_and_ids → check_onland → Ready to publish
# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
    %pip install -q pyobistools nest_asyncio pyobis plotly
import sys
import pandas as pd
import numpy as np
import nest_asyncio
import plotly.graph_objects as go
from pyobistools.validation.check_fields import check_fields
from pyobistools.validation.check_occurrence_core_and_extension import check_occurrence_core_and_extension
from pyobistools.validation.check_eventids import check_eventids, check_extension_eventids
from pyobistools.validation.check_measurementids import check_measurementids
from pyobistools.validation.check_scientificname_and_ids import check_scientificname_and_ids
from pyobistools.validation.check_onland import check_onland
pd.set_option('max_colwidth', None)
print('All imports successful ✅')
📂 Block 1: Load Your Datasets
We'll use four synthetic datasets throughout this workshop — load them all now.
| Variable | File | Used in |
|---|---|---|
| df_occ | workshop_ex_occurrence.csv | Blocks 2, 3, 6, 8: field checks, occurrence validation, WoRMS, validation pipeline |
| df_event | workshop_ex_event_core.csv | Blocks 4, 4b: event IDs, extension event ID check |
| df_emof | workshop_ex_emof.csv | Blocks 4b, 5: extension event IDs, measurement IDs |
| df_onland | workshop_ex_onland.csv | Block 7: on-land coordinate check |
_COLAB = "google.colab" in sys.modules
_BASE = "https://raw.githubusercontent.com/cioos-siooc/CPDW-VI/main/docs/notebooks/"
def _csv(name):
    return (_BASE + name) if _COLAB else name
df_occ = pd.read_csv(_csv('workshop_ex_occurrence.csv'))
df_event = pd.read_csv(_csv('workshop_ex_event_core.csv'))
df_emof = pd.read_csv(_csv('workshop_ex_emof.csv'))
df_onland = pd.read_csv(_csv('workshop_ex_onland.csv'))
print('Datasets loaded:')
for name, data in [('df_occ', df_occ), ('df_event', df_event), ('df_emof', df_emof), ('df_onland', df_onland)]:
    print(f' {name:<12} {data.shape[0]:>4} rows x {data.shape[1]:>2} cols')
📝 Block 2: check_fields
This function evaluates a Darwin Core DataFrame and reports:
- Absent fields — required or recommended columns missing entirely from the file
- Empty values — required or recommended columns that are present but contain blank cells
| Argument | Default | Effect |
|---|---|---|
| data | (no default) | DataFrame to validate |
| analysis_type | (no default) | DWC file type: 'occurrence_core', 'event_core', 'occurrence_extension', 'extended_measurement_or_fact_extension' |
| level | 'error' | 'error': absent and empty required fields (field presence check is case-insensitive). 'warning': absent and empty recommended fields, plus any present required or recommended field with incorrect column name case. |
| accepted_name_usage_id_check | False | When True, suppresses the empty-scientificNameID error on rows where acceptedNameUsageID is filled instead |
Output columns: field | level | row | message
→ row = NaN — the column is entirely absent from the file
→ row = N — that specific row has an empty value in a required/recommended field
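The NaN convention in the row column makes it easy to separate the two kinds of issue with plain pandas. The sketch below builds a small hand-made report in the documented field | level | row | message layout; the rows and message strings are hypothetical placeholders, not actual pyobistools output.

```python
import numpy as np
import pandas as pd

# Hypothetical report in the documented layout: field | level | row | message.
# The message strings are placeholders, not pyobistools' actual wording.
report = pd.DataFrame({
    "field": ["geodeticDatum", "scientificNameID", "scientificNameID"],
    "level": ["error", "error", "error"],
    "row": [np.nan, 3, 7],
    "message": ["column missing", "empty value", "empty value"],
})

# row is NaN -> the whole column is absent from the file
absent_columns = report.loc[report["row"].isna(), "field"].tolist()

# row is a number -> that row holds an empty value in an existing column
empty_cells = report[report["row"].notna()]
rows_by_field = {
    field: sorted(group["row"].astype(int).tolist())
    for field, group in empty_cells.groupby("field")
}

print("Absent columns:", absent_columns)
print("Rows with empty values:", rows_by_field)
```

The same two filters should work on a real report as long as it keeps the column names listed above.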
print(df_occ.columns)
df_occ.head()
# Check for ERROR-level issues: missing required fields and empty required values
errors = check_fields(df_occ, analysis_type='occurrence_core', level='error')
print(f'Errors found: {len(errors)}')
errors
# Check for WARNING-level issues: missing recommended fields, incorrect column case
warnings = check_fields(df_occ, analysis_type='occurrence_core', level='warning')
print(f'Warnings found: {len(warnings)}')
warnings
🔧 Your Task: Block 2
The error output shows two types of issues in df_occ:
- Missing column (row = NaN): geodeticDatum is entirely absent from the file
- Empty values in a present required field (row = N): scientificNameID exists as a column but is blank on some rows
Fix the geodeticDatum error by adding that column to df_fixed.
Run check_fields again to confirm the geodeticDatum error disappears.
💡 All OBIS coordinates use WGS84, so the value should be the string 'WGS84'.
The scientificNameID empty-value errors will remain; those require a WoRMS lookup (Block 6).
df_fixed = df_occ.copy()
# Add the missing geodeticDatum column
#@title 🔑 SOLUTION — double-click to reveal
df_fixed = df_occ.copy()
df_fixed['geodeticDatum'] = 'WGS84'
print('geodeticDatum column added.')
df_fixed.head(2)
# Verify: run the check again on your fixed dataset
errors_after = check_fields(df_fixed, analysis_type='occurrence_core', level='error')
print(f'Errors before fix : {len(errors)}')
print(f'Errors after fix : {len(errors_after)}')
print(f'\nRemaining errors:')
errors_after
📋 Block 3: check_occurrence_core_and_extension
This function validates controlled vocabulary and uniqueness constraints in occurrence files:
| What it checks | Valid values |
|---|---|
| Duplicate occurrenceID | All values must be unique |
| occurrenceStatus | present or absent (case-sensitive, lowercase) |
| basisOfRecord | PreservedSpecimen, FossilSpecimen, LivingSpecimen, HumanObservation, MachineObservation, MaterialSample, MaterialCitation, MaterialEntity, Occurrence, Taxon, Event (case-sensitive) |
We'll work with a synthetic dataset that contains intentional errors — just like a file a collaborator might send you.
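To make the three rules concrete, here is a minimal pandas sketch that applies them to a toy table. It illustrates the rules as described in the table, not how pyobistools implements them; the toy IDs and values are invented for the example.

```python
import pandas as pd

# Accepted vocabularies, copied from the table above (both are case-sensitive)
VALID_STATUS = {"present", "absent"}
VALID_BASIS = {
    "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen", "HumanObservation",
    "MachineObservation", "MaterialSample", "MaterialCitation", "MaterialEntity",
    "Occurrence", "Taxon", "Event",
}

# Toy data with one problem of each kind (values are made up)
toy = pd.DataFrame({
    "occurrenceID": ["A-1", "A-1", "A-2", "A-3"],
    "occurrenceStatus": ["present", "absent", "Present", "present"],
    "basisOfRecord": ["HumanObservation", "Event", "Taxon", "humanobservation"],
})

# keep=False marks every member of a duplicated group, not just the repeats
dup_rows = toy.index[toy["occurrenceID"].duplicated(keep=False)].tolist()
bad_status_rows = toy.index[~toy["occurrenceStatus"].isin(VALID_STATUS)].tolist()
bad_basis_rows = toy.index[~toy["basisOfRecord"].isin(VALID_BASIS)].tolist()

print("Duplicate occurrenceID rows:", dup_rows)
print("Invalid occurrenceStatus rows:", bad_status_rows)
print("Invalid basisOfRecord rows:", bad_basis_rows)
```

Because the comparison is case-sensitive, 'Present' and 'humanobservation' are rejected even though they differ from a valid term only by capitalisation.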
# Explore the dataset — can you spot any issues?
df_occ
# Run the validation
occ_errors = check_occurrence_core_and_extension(df_occ)
print(f'Issues found: {len(occ_errors)}')
occ_errors
🔧 Your Task: Block 3
The output above found issues across 4 different rows:
- A duplicate occurrenceID: two rows share the same ID
- An invalid basisOfRecord: FieldObservation is not in the accepted vocabulary
- An invalid occurrenceStatus: a value that is not present or absent
- An invalid basisOfRecord case: Humanobservation has incorrect capitalisation
Fix all 4 issues in df_occ_fixed and run the check again to confirm 0 errors.
💡 Look at the row column in the error output; it tells you which row index to fix.
Use df_occ_fixed.loc[row_number, 'column_name'] = 'new_value' to fix individual cells.
df_occ_fixed = df_occ.copy()
# Fix 1: duplicate occurrenceID
# Fix 2: invalid basisOfRecord (wrong vocabulary term)
# Fix 3: invalid occurrenceStatus
# Fix 4: invalid basisOfRecord (wrong case)
#@title 🔑 SOLUTION — double-click to reveal
df_occ_fixed = df_occ.copy()
# Fix 1: make the duplicate occurrenceID unique
df_occ_fixed.loc[1, 'occurrenceID'] = 'WS-OCC-001-B'
# Fix 2: use a valid basisOfRecord vocabulary term
df_occ_fixed.loc[5, 'basisOfRecord'] = 'HumanObservation'
# Fix 3: use 'present' (lowercase) — not 'presence'
df_occ_fixed.loc[8, 'occurrenceStatus'] = 'present'
# Fix 4: correct the capitalisation
df_occ_fixed.loc[11, 'basisOfRecord'] = 'HumanObservation'
# Verify: should return 0 errors
result = check_occurrence_core_and_extension(df_occ_fixed)
print(f'Issues remaining: {len(result)}')
if len(result) == 0:
    print('All issues fixed! ✅')
🔗 Block 4: check_eventids
Event-based Darwin Core datasets use a parent–child hierarchy:
Cruise (top-level, no parent)
├── Station A ← parentEventID = cruise eventID
└── Station B ← parentEventID = cruise eventID
Top-level events have an empty parentEventID. All other events must set
parentEventID to an eventID that actually exists in the same file.
| What it checks | Condition |
|---|---|
| eventID field | Must be present in the dataset |
| parentEventID field | Must be present in the dataset |
| eventID values | Must be unique (no duplicates) |
| parentEventID values | Every non-empty value must reference an existing eventID |
This function is used with an event_core or occurrence_core — not an extension.
We'll use df_event, which has a cruise–station hierarchy with two intentional errors.
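The two value-level rules (unique eventID, resolvable parentEventID) can be sketched in a few lines of standard-library Python. This is only an illustration of the logic on invented IDs, not the pyobistools implementation.

```python
from collections import Counter

# Toy event table: one cruise, three stations, with a duplicate and an orphan
events = [
    {"eventID": "CRUISE-A", "parentEventID": ""},          # top-level: no parent
    {"eventID": "STN-1", "parentEventID": "CRUISE-A"},
    {"eventID": "STN-2", "parentEventID": "CRUISE-A"},
    {"eventID": "STN-2", "parentEventID": "CRUISE-A"},     # duplicate eventID
    {"eventID": "STN-3", "parentEventID": "CRUISE-X"},     # parent does not exist
]

# Rule 1: every eventID appears exactly once
counts = Counter(e["eventID"] for e in events)
duplicate_ids = sorted(i for i, n in counts.items() if n > 1)

# Rule 2: every non-empty parentEventID points at an eventID in the same file
known_ids = set(counts)
orphans = [
    (e["eventID"], e["parentEventID"])
    for e in events
    if e["parentEventID"] and e["parentEventID"] not in known_ids
]

print("Duplicate eventIDs:", duplicate_ids)
print("Orphaned (eventID, parentEventID) pairs:", orphans)
```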
# Explore the event core
print(f'df_event : {df_event.shape[0]} rows x {df_event.shape[1]} cols')
print(f'Columns : {list(df_event.columns)}')
df_event
# Validate event IDs in the event core
eventid_errors = check_eventids(df_event)
print(f'Event ID issues found: {len(eventid_errors)}')
eventid_errors
🔧 Your Task: Block 4
The output above found 2 issues:
- A duplicate eventID: STATION-002 appears on two rows
- A broken parentEventID: STATION-003 has parentEventID = CRUISE-9999, but no row in the file has eventID = CRUISE-9999 (the parent event is missing)
Fix both errors in df_event_fixed and run check_eventids again to confirm 0 errors.
💡 For the duplicate: keep the first occurrence and drop the second. For the broken parent: STATION-003 should belong to CRUISE-2024.
df_event_fixed = df_event.copy()
# Fix 1: remove the duplicate eventID
# Fix 2: correct the orphaned parentEventID
#@title 🔑 SOLUTION — double-click to reveal
df_event_fixed = df_event.copy()
# Fix 1: keep only the first occurrence of each eventID
df_event_fixed = df_event_fixed.drop_duplicates(subset='eventID', keep='first').reset_index(drop=True)
# Fix 2: STATION-003 belongs to CRUISE-2024, not the non-existent CRUISE-9999
df_event_fixed.loc[df_event_fixed['eventID'] == 'STATION-003', 'parentEventID'] = 'CRUISE-2024'
print('Fixes applied.')
eventid_errors_fixed = check_eventids(df_event_fixed)
print(f'Event ID issues remaining: {len(eventid_errors_fixed)}')
if len(eventid_errors_fixed) == 0:
    print('Event IDs are valid! ✅')
df_event_fixed
check_extension_eventids
When you publish an eMoF file alongside a core file, every eventID in the
extension must match an eventID in the core — otherwise those measurements
cannot be linked to any sampling event.
Supported file pairings:
- event_core + occurrence_extension
- event_core + eMoF
- occurrence_core + eMoF
| Argument | Default | Effect |
|---|---|---|
| core | (no default) | DataFrame of the event_core or occurrence_core |
| extension_or_emof | (no default) | DataFrame of the eMoF or extension file |
| field | 'eventID' | Linking field: 'eventID' or 'occurrenceID' |
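Conceptually, the linkage rule is a membership test: every value of the linking field in the extension must also appear in the core. A minimal pandas sketch of that idea, on invented IDs and independent of pyobistools, could look like this:

```python
import pandas as pd

# Toy core and extension tables (IDs and measurement types are made up)
core = pd.DataFrame({"eventID": ["ST-1", "ST-2"]})
extension = pd.DataFrame({
    "eventID": ["ST-1", "ST-2", "ST-2", "ST-9"],
    "measurementType": ["temperature"] * 4,
})

# Extension rows whose eventID has no counterpart in the core
unlinked = extension[~extension["eventID"].isin(core["eventID"])]
missing_ids = sorted(unlinked["eventID"].unique().tolist())

print(f"{len(unlinked)} unlinked extension row(s); missing core IDs: {missing_ids}")
```

Several extension rows may share one eventID (many measurements per event), so only the extension-to-core direction is checked here.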
df_event has stations STATION-001 through STATION-003.
df_emof references STATION-001 through STATION-004 — let's see what happens.
df_event
df_emof
# Check that all df_emof eventIDs have a match in the event core
ext_errors = check_extension_eventids(df_event, df_emof, field='eventID')
print(f'Extension linkage errors: {len(ext_errors)}')
ext_errors
🔧 Your Task: Block 4b
df_emof contains measurements for STATION-004, but that station has no
corresponding eventID in df_event.
Fix the error by adding STATION-004 to df_event_fixed and rerunning the check.
💡 In a real dataset this means either:
- the core file is incomplete (the station was never recorded as an event), or
- the eMoF has a typo in one of its eventID values.
df_event_fixed = df_event.copy()
# Add the missing station to the event core
#@title 🔑 SOLUTION — double-click to reveal
# STATION-004 is a station, so its parent is the cruise (CRUISE-2024), not the station itself
df_event_fixed = pd.concat([
    df_event,
    pd.DataFrame({'eventID': ['STATION-004'], 'parentEventID': ['CRUISE-2024']})
], ignore_index=True)
result = check_extension_eventids(df_event_fixed, df_emof, field='eventID')
print(f'Extension linkage errors remaining: {len(result)}')
if len(result) == 0:
    print('All eMoF records are linked to a core event! ✅')
📐 Block 5: check_measurementids
In the extended Measurement or Fact (eMoF) extension, each measurement record should have a unique measurementID.
Duplicate IDs break the ability to reference individual measurements and are rejected by OBIS.
This function has one argument: data — the eMoF DataFrame.
It returns a standard [field | level | row | message] error DataFrame.
# Explore the eMoF dataset — can you spot the duplicate IDs?
df_emof
# Run the check
meas_errors = check_measurementids(df_emof)
print(f'Measurement ID issues found: {len(meas_errors)}')
meas_errors
🔧 Your Task: Block 5
MEAS-002 and MEAS-004 each appear on two rows — the check flagged all 4 offending rows.
Assign unique measurementID values to all rows in df_emof_fixed.
Run the check again to confirm 0 errors.
💡 There are many valid approaches. The simplest is to generate a new sequential ID for every row.
df_emof_fixed = df_emof.copy()
# Assign unique measurementIDs
#@title 🔑 SOLUTION — double-click to reveal
df_emof_fixed = df_emof.copy()
df_emof_fixed['measurementID'] = [f'MEAS-{i:03d}' for i in range(1, len(df_emof_fixed) + 1)]
# Verify: should return 0 errors
result = check_measurementids(df_emof_fixed)
print(f'Measurement ID issues remaining: {len(result)}')
if len(result) == 0:
    print('All measurementIDs are now unique! ✅')
df_emof_fixed
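Regenerating every ID is the simplest fix, but it also changes identifiers that were already unique. One alternative, sketched here on a toy table with invented values, keeps the first occurrence of each ID and appends a counter only to the repeats. The "-2" suffix scheme is an arbitrary choice for illustration, not an OBIS convention.

```python
import pandas as pd

# Toy eMoF table with two duplicated IDs (values are made up)
emof = pd.DataFrame({
    "measurementID": ["MEAS-001", "MEAS-002", "MEAS-002",
                      "MEAS-003", "MEAS-004", "MEAS-004"],
    "measurementValue": [10.1, 10.4, 10.6, 9.8, 11.0, 11.2],
})

# 0 for the first occurrence of an ID, 1 for the second, and so on
repeat_no = emof.groupby("measurementID").cumcount()

# Keep first occurrences untouched; suffix later ones with their ordinal
suffixed = emof["measurementID"] + "-" + (repeat_no + 1).astype(str)
emof["measurementID"] = emof["measurementID"].where(repeat_no == 0, suffixed)

# A suffixed ID could in principle collide with an existing one, so re-check
print(emof["measurementID"].tolist())
print("All unique:", emof["measurementID"].is_unique)
```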
🧬 Block 6: check_scientificname_and_ids
This function queries the World Register of Marine Species (WoRMS) to validate:
- Whether scientific names are recognized
- Whether scientificNameID (LSID) matches the accepted name
- Whether taxon rank is correct
What is an LSID? A globally unique persistent identifier: urn:lsid:marinespecies.org:taxname:126505
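The trailing number in a WoRMS LSID is the record's AphiaID. A small standard-library helper can check that a scientificNameID follows the pattern shown above and pull that number out; the function name aphia_id_from_lsid is a hypothetical helper for this sketch, not part of pyobistools.

```python
import re

# Pattern for the WoRMS LSID form shown above: urn:lsid:marinespecies.org:taxname:<digits>
LSID_PATTERN = re.compile(r"urn:lsid:marinespecies\.org:taxname:(\d+)")


def aphia_id_from_lsid(value):
    """Return the numeric AphiaID from a WoRMS LSID, or None if the format is wrong."""
    match = LSID_PATTERN.fullmatch(str(value).strip())
    return int(match.group(1)) if match else None


print(aphia_id_from_lsid("urn:lsid:marinespecies.org:taxname:126505"))
print(aphia_id_from_lsid("126505"))  # bare number, not an LSID
```

This only checks the shape of the identifier. Whether the ID belongs to the stated name still requires the WoRMS lookup performed by check_scientificname_and_ids.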
| value | Returns | What it checks |
|---|---|---|
| 'names' | 1 DataFrame | Name recognized in WoRMS? |
| 'names_ids' | 2 DataFrames | Above + LSID matches WoRMS? |
| 'names_taxons_ids' | 2 DataFrames | Above + taxon rank correct? |
Output: Oui/Yes = match, Non/No = mismatch
📶 Internet required — live WoRMS API calls. Expect ~30–90 seconds for this dataset.
nest_asyncio.apply()
print('Ready to query WoRMS 🌎')
df_occ.scientificname.unique()
# Step 1: validate scientific names only
# All 4 species in df_occ are fictional — expect Non/No for every name
names_result = check_scientificname_and_ids(df_occ, value='names')
n_total = len(names_result)
n_match = (names_result[('Validation', 'Exact_Match')] == 'Oui/Yes').sum()
n_nomatch = (names_result[('Validation', 'Exact_Match')] == 'Non/No').sum()
print(f'Unique names checked : {n_total}')
print(f' Matched (Oui/Yes) : {n_match}')
print(f' Mismatch (Non/No) : {n_nomatch}')
names_result
🔧 Your Task: Block 6, Part 1
The names_result table shows a Non/No in Exact_Match for species not recognized by WoRMS.
Write code to:
- Filter names_result to show only the Non/No rows
- Print just the list of unrecognized species names
💡 The column is a multi-index tuple: ('Validation', 'Exact_Match')
# Filter to Non/No rows and print the species names
#@title 🔑 SOLUTION — double-click to reveal
non_match = names_result[names_result[('Validation', 'Exact_Match')] == 'Non/No']
print(f'Species not recognized by WoRMS: {len(non_match)}')
print()
for name in non_match[('Dataset Values', 'scientificName')].values:
    print(f' - {name}')
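Two-level column labels are awkward to export or share, so it can help to flatten them into single strings. The sketch below assumes the result uses tuple column labels like the two seen above and rebuilds a toy frame with invented species names to show the idea.

```python
import pandas as pd

# Toy frame mimicking the two-level column labels used above (names are made up)
toy = pd.DataFrame({
    ("Dataset Values", "scientificName"): ["Species one", "Species two"],
    ("Validation", "Exact_Match"): ["Oui/Yes", "Non/No"],
})

# Join each (top, sub) label pair into one flat string
flat = toy.copy()
flat.columns = [f"{top}.{sub}" for top, sub in toy.columns]

unmatched = flat.loc[
    flat["Validation.Exact_Match"] == "Non/No",
    "Dataset Values.scientificName",
].tolist()

print(list(flat.columns))
print("Unmatched names:", unmatched)
```

After flattening, the frame can be filtered with ordinary string column names and written out with to_csv without a two-row header.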
💡 Working with freshwater or terrestrial data?
Use itis_usage=True to fall back to ITIS when WoRMS has no result:
check_scientificname_and_ids(df, value='names', itis_usage=True)
ITIS queries add several minutes for large datasets.
🗺️ Block 7: check_onland
Marine data should be in the ocean. Coordinates on land are a red flag. Common causes:
- Decimal separator error (-66,12 instead of -66.12)
- Swapped latitude/longitude
- Wrong coordinate reference system (not WGS84)
- Georeferencing error (point snapped to wrong location)
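The first two causes can often be caught before running any shoreline check. The helpers below are hypothetical names for this sketch: one parses a coordinate written with a decimal comma, the other flags pairs that are only valid once latitude and longitude are exchanged.

```python
def parse_coordinate(text):
    """Parse a coordinate string, accepting a decimal comma such as '-66,12'.

    Assumes the comma is a decimal separator, never a thousands separator.
    """
    return float(str(text).strip().replace(",", "."))


def looks_swapped(lat, lon):
    """True when lat is outside +/-90 but the pair is valid with lat and lon exchanged."""
    return 90 < abs(lat) <= 180 and abs(lon) <= 90


print(parse_coordinate("-66,12"))
print(looks_swapped(-123.4, 48.5))   # latitude cannot be -123.4
print(looks_swapped(44.6, -63.6))    # plausible as given
```

The swap test only catches cases where the stated latitude is out of range. When both values lie within ±90 a swap still yields a valid coordinate, so only a map or a check such as check_onland can reveal it.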
| Argument | Options | Effect |
|---|---|---|
| offline | True / False | True = Natural Earth shoreline (fast); False = OBIS web service (precise) |
| buffer | degrees | Points within this distance of shore are considered valid |
| report | True / False | True returns error format; False returns the flagged rows |
We'll use a synthetic 25-record dataset. A few coordinates are obviously wrong — dropped into distant inland cities — while several others look plausible but still fall on land. The offline check below flags 10 of the 25.
# Explore the dataset — see if you can spot the suspicious coordinates
print(f'{len(df_onland)} records')
df_onland[['occurrenceID', 'scientificName', 'decimalLatitude', 'decimalLongitude']].head(10)
# Run the on-land check (offline = Natural Earth shoreline, no extra internet needed)
flagged = check_onland(df_onland, offline=True)
print(f'Records flagged as on land: {len(flagged)}')
flagged[['occurrenceID', 'scientificName', 'decimalLatitude', 'decimalLongitude', 'on_land']]
# Visualize: all records blue, on-land records red
# The map zooms out to fit every point — the 3 distant-city errors stretch the view;
# the other red points sit on land within the Atlantic Canada cluster.
all_lats = df_onland['decimalLatitude'].astype(float)
all_lons = df_onland['decimalLongitude'].astype(float)
centre_lat = 0.5 * (all_lats.max() + all_lats.min())
centre_lon = 0.5 * (all_lons.max() + all_lons.min())
area = (all_lats.max() - all_lats.min()) * (all_lons.max() - all_lons.min())
zoom = np.interp(x=area, xp=[0.0005, .02, .05, 30, 350, 3500], fp=[12, 9.5, 6, 4, 2, 1])
fig = go.Figure()
fig.add_trace(go.Scattermapbox(
lat=all_lats, lon=all_lons, mode='markers',
marker={'size': 7, 'color': 'steelblue', 'opacity': 0.7},
name=f'All records ({len(df_onland)})'
))
fig.add_trace(go.Scattermapbox(
lat=flagged['decimalLatitude'].astype(float),
lon=flagged['decimalLongitude'].astype(float), mode='markers',
marker={'size': 12, 'color': 'red', 'opacity': 0.9},
name=f'On land ⚠️ ({len(flagged)})'
))
fig.update_layout(
mapbox={'style': 'open-street-map', 'center': {'lon': centre_lon, 'lat': centre_lat}, 'zoom': zoom},
showlegend=True, legend={'x': 0.01, 'y': 0.99},
margin={'l': 0, 'r': 0, 'b': 0, 't': 30},
title_text='Workshop dataset — on-land observations highlighted in red',
height=480
)
fig.show()
🔧 Your Task: Block 7
check_onland flagged 10 of the 25 records (stored in flagged). On the map, three red points sit far outside the survey area — obvious errors in distant cities — while the rest fall on land within Atlantic Canada: plausible-looking coordinates that are still invalid for marine data.
⚠️ The offline Natural Earth shoreline is coarse, so it can flag near-shore points generously. Re-run with offline=False to use OBIS's precise shoreline service, which may flag fewer.
Create df_ocean containing only the valid ocean observations (remove the on-land rows).
Then verify with check_onland that no flagged records remain.
💡 flagged.index contains the original row indices of the on-land records.
Use .isin() to identify which rows to exclude from df_onland.
# Create df_ocean with on-land rows removed
df_ocean = df_onland
# Your fix here
#@title 🔑 SOLUTION — double-click to reveal
df_ocean = df_onland[~df_onland.index.isin(flagged.index)]
print(f'Removed {len(flagged)} on-land record(s).')
print(f'Ocean records remaining: {len(df_ocean)}')
# Verify: check_onland should return 0 flagged rows
flagged_after = check_onland(df_ocean, offline=True)
print(f'Flagged records remaining: {len(flagged_after)}')
if len(flagged_after) == 0:
    print('All coordinates are now in the ocean! ✅')
# Show cleaned map (all blue, no red)
clean_lats = df_ocean['decimalLatitude'].astype(float)
clean_lons = df_ocean['decimalLongitude'].astype(float)
c_lat = 0.5 * (clean_lats.max() + clean_lats.min())
c_lon = 0.5 * (clean_lons.max() + clean_lons.min())
c_area = (clean_lats.max() - clean_lats.min()) * (clean_lons.max() - clean_lons.min())
c_zoom = np.interp(x=c_area, xp=[0.0005, .02, .05, 30, 350, 3500], fp=[12, 9.5, 6, 4, 2, 1])
fig2 = go.Figure(go.Scattermapbox(
lat=clean_lats, lon=clean_lons, mode='markers',
marker={'size': 8, 'color': 'steelblue', 'opacity': 0.8},
name=f'Ocean records ({len(df_ocean)})'
))
fig2.update_layout(
mapbox={'style': 'open-street-map', 'center': {'lon': c_lon, 'lat': c_lat}, 'zoom': c_zoom},
showlegend=True, margin={'l': 0, 'r': 0, 'b': 0, 't': 30},
title_text='Cleaned dataset — all records in the ocean ✅',
height=400
)
fig2.show()
📈 Block 8: Validation Pipeline
Now let's build a reusable run_validation() function that runs the occurrence-core checks
(check_fields and check_occurrence_core_and_extension) in one pass and returns a single consolidated error report.
def run_validation(df, analysis_type='occurrence_core'):
    """Run the occurrence-core checks (check_fields + check_occurrence_core_and_extension) and return a combined error report."""
    results = []
    # check_fields is run twice: once for errors, once for warnings
    for level in ('error', 'warning'):
        r = check_fields(df, level=level, analysis_type=analysis_type).copy()
        if len(r):
            r['check'] = f'check_fields ({level})'
            results.append(r)
    r = check_occurrence_core_and_extension(df).copy()
    if len(r):
        r['check'] = 'check_occurrence'
        results.append(r)
    if not results:
        print('No issues found! ✅')
        return pd.DataFrame(columns=['field', 'level', 'row', 'message', 'check'])
    return pd.concat(results, ignore_index=True)
print('run_validation() ready ✅')
# Run the full pipeline on df_occ
report = run_validation(df_occ)
print(f'Total issues: {len(report)} ({(report["level"]=="error").sum()} errors, {(report["level"]=="warning").sum()} warnings)')
# Scoreboard: group by check, level, and field
scoreboard = (
    report
    .groupby(['check', 'level', 'field'])
    .size()
    .reset_index(name='count')
    .sort_values(['level', 'count'], ascending=[True, False])
    .reset_index(drop=True)
)
print('\n🏆 Data Quality Scoreboard')
scoreboard
🔧 Your Task: Block 8
Using the scoreboard above, apply all the fixes from previous blocks to create a df_final
and run the pipeline again. How much did the error count drop?
💡 You already solved the geodeticDatum error in Block 2. Chain that fix with any others you can address.
df_final = df_occ.copy()
# Apply fixes from previous blocks
# Run the pipeline
report_final = run_validation(df_final)
print(f'Issues remaining: {len(report_final)}')
# 🔑 SOLUTION — Expand to reveal
df_final = df_occ.copy()
# Fix from Block 2: add the missing required column
df_final['geodeticDatum'] = 'WGS84'
report_final = run_validation(df_final)
print(f'Before : {len(report)} issues')
print(f'After : {len(report_final)} issues')
print(f'Fixed : {len(report) - len(report_final)}')
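A single before/after total hides which fields actually improved. The sketch below compares two reports field by field; it uses hand-made stand-ins for the reports (only the field column, with invented rows), and issues_per_field is a hypothetical helper name.

```python
import pandas as pd

# Stand-ins for the pipeline reports before and after a fix (rows are made up)
before = pd.DataFrame({"field": ["geodeticDatum", "scientificNameID", "scientificNameID"]})
after = pd.DataFrame({"field": ["scientificNameID", "scientificNameID"]})


def issues_per_field(before, after):
    """Count issues per field in both reports and compute how many were fixed."""
    counts = pd.concat(
        {"before": before["field"].value_counts(), "after": after["field"].value_counts()},
        axis=1,
    ).fillna(0).astype(int)          # a field missing from one report counts as 0
    counts["fixed"] = counts["before"] - counts["after"]
    return counts.sort_values("fixed", ascending=False)


summary = issues_per_field(before, after)
print(summary)
```

Passing the two reports produced by run_validation should work the same way, since both contain a field column.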
✅ Pre-Publication Checklist
You've run all 6 validation checks. Before publishing to OBIS:
- Fix all error-level check_fields issues (hard requirements)
- Fix duplicate IDs from check_occurrence_core_and_extension and check_measurementids
- Fix invalid vocabulary (occurrenceStatus, basisOfRecord) from check_occurrence_core_and_extension
- Resolve orphaned parentEventIDs from check_eventids
- Correct scientificNameID to use WoRMS LSIDs from check_scientificname_and_ids
- Investigate and remove or document on-land points from check_onland
- Address warning-level fields (depth, sampling protocol, etc.) for richer data
- Publish via IPT using pyIPT or directly on your node's OBIS IPT instance
Next tools in the workflow:
- pyIPT — publish to OBIS programmatically
- OBIS2CIOOS — transform OBIS datasets for CIOOS discovery
- OBIS data download API — fetch data in Parquet format
- OBIS data quality manual
🙋 Questions? Open an issue at github.com/cioos-siooc/pyobistools