The version on the docs site is a static render — to run the cells, open the Colab link above or clone the repo and run locally.

🌊 How to Validate Your Biodiversity Data

OBIS Developer Workshop: Pyobistools

What you'll learn — all 6 validation functions:

Block	Function	What it checks
2	`check_fields`	Required and recommended Darwin Core fields are present
3	`check_occurrence_core_and_extension`	Valid `occurrenceStatus`, `basisOfRecord`, no duplicate IDs
4	`check_eventids`	Event hierarchy is consistent (parent–child IDs)
5	`check_measurementids`	All `measurementID` values are unique
6	`check_scientificname_and_ids`	Species names and LSIDs match WoRMS
7	`check_onland`	Coordinates fall in the ocean, not on land

Resources:

📖 Darwin Core standard
📋 Darwin Core terms — OBIS manual
🐍 pyobistools on GitHub
🌐 WoRMS — World Register of Marine Species

⏱️ Estimated time: ~35 minutes
📶 Internet required for: Block 6 (WoRMS API) and Block 7 bonus (OBIS API)
🗺️ Mermaid diagram below requires internet to load from CDN

Validation workflow — each check feeds into the next; fix errors before moving on:

Your CSV → check_fields → check_occurrence_core_and_extension → check_eventids → check_measurementids → check_scientificname_and_ids → check_onland → Ready to publish

In [ ]:

# @colab-setup
# Run this cell first. On Colab it installs the deps; locally it is a no-op.
import sys
if "google.colab" in sys.modules:
    %pip install -q pyobistools nest_asyncio pyobis plotly

In [ ]:

import sys

import pandas as pd
import numpy as np
import nest_asyncio
import plotly.graph_objects as go

from pyobistools.validation.check_fields import check_fields
from pyobistools.validation.check_occurrence_core_and_extension import check_occurrence_core_and_extension
from pyobistools.validation.check_eventids import check_eventids, check_extension_eventids
from pyobistools.validation.check_measurementids import check_measurementids
from pyobistools.validation.check_scientificname_and_ids import check_scientificname_and_ids
from pyobistools.validation.check_onland import check_onland

pd.set_option('max_colwidth', None)
print('All imports successful ✅')

📂 Block 1: Load Your Datasets

We'll use four synthetic datasets throughout this workshop — load them all now.

Variable	File	Used in
`df_occ`	`workshop_ex_occurrence.csv`	Blocks 2, 3, 6, 8: field checks, occurrence validation, WoRMS, validation pipeline
`df_event`	`workshop_ex_event_core.csv`	Blocks 4, 4b: event ID hierarchy, extension event ID check
`df_emof`	`workshop_ex_emof.csv`	Blocks 4b, 5: extension event IDs, measurement IDs
`df_onland`	`workshop_ex_onland.csv`	Block 7: on-land coordinate check

In [ ]:

_COLAB = "google.colab" in sys.modules
_BASE  = "https://raw.githubusercontent.com/cioos-siooc/CPDW-VI/main/docs/notebooks/"
def _csv(name):
    return (_BASE + name) if _COLAB else name

df_occ    = pd.read_csv(_csv('workshop_ex_occurrence.csv'))
df_event  = pd.read_csv(_csv('workshop_ex_event_core.csv'))
df_emof   = pd.read_csv(_csv('workshop_ex_emof.csv'))
df_onland = pd.read_csv(_csv('workshop_ex_onland.csv'))

print('Datasets loaded:')
for name, data in [('df_occ', df_occ), ('df_event', df_event), ('df_emof', df_emof), ('df_onland', df_onland)]:
    print(f'  {name:<12} {data.shape[0]:>4} rows x {data.shape[1]:>2} cols')

📝 Block 2: `check_fields`

This function evaluates a Darwin Core DataFrame and reports:

Absent fields — required or recommended columns missing entirely from the file
Empty values — required or recommended columns that are present but contain blank cells

Argument	Default	Effect
`data`	—	DataFrame to validate
`analysis_type`	—	Darwin Core file type: `'occurrence_core'`, `'event_core'`, `'occurrence_extension'`, `'extended_measurement_or_fact_extension'`
`level`	`'error'`	`'error'`: absent and empty required fields (the field-presence check is case-insensitive). `'warning'`: absent and empty recommended fields, plus any required or recommended field that is present but whose column name has incorrect case.
`accepted_name_usage_id_check`	`False`	When `True`, suppresses the empty-`scientificNameID` error on rows where `acceptedNameUsageID` is filled instead

Output columns: field | level | row | message
row = NaN: the column is entirely absent from the file
row = N: that specific row has an empty value in a required or recommended field
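
To make the two `row` conventions concrete, here is a minimal pandas sketch of the idea. It is not the pyobistools implementation: the helper name `sketch_required_fields`, its message strings, and the toy frame are invented for illustration, and the required-field list is passed in by hand.

```python
import numpy as np
import pandas as pd


def sketch_required_fields(df, required):
    """Illustrative only: report absent columns (row = NaN) and empty cells (row = index)."""
    issues = []
    for field in required:
        if field not in df.columns:
            # Whole column missing: there is no specific row to point at
            issues.append({'field': field, 'level': 'error', 'row': np.nan,
                           'message': f'Required field {field} is missing'})
            continue
        # Column present: flag each row whose value is NaN or blank
        empty = df[field].isna() | df[field].astype(str).str.strip().eq('')
        for idx in df.index[empty]:
            issues.append({'field': field, 'level': 'error', 'row': idx,
                           'message': f'Empty value for required field {field}'})
    return pd.DataFrame(issues, columns=['field', 'level', 'row', 'message'])


toy = pd.DataFrame({
    'occurrenceID': ['a', 'b', 'c'],
    'scientificNameID': ['urn:lsid:marinespecies.org:taxname:126505', None, ''],
})
toy_report = sketch_required_fields(
    toy, ['occurrenceID', 'scientificNameID', 'geodeticDatum']
)
print(toy_report)
```

In this toy case the report has one NaN-row entry for the absent `geodeticDatum` column and two numbered entries for the blank `scientificNameID` cells.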

In [ ]:

print(df_occ.columns)
df_occ.head()

In [ ]:

# Check for ERROR-level issues: missing required fields and empty required values
errors = check_fields(df_occ, analysis_type='occurrence_core', level='error')
print(f'Errors found: {len(errors)}')
errors

In [ ]:

# Check for WARNING-level issues: missing recommended fields, incorrect column case
warnings = check_fields(df_occ, analysis_type='occurrence_core', level='warning')
print(f'Warnings found: {len(warnings)}')
warnings

🔧 Your Task: Block 2

The error output shows two types of issues in df_occ:

Missing column (row = NaN) — geodeticDatum is entirely absent from the file
Empty values in a present required field (row = N) — scientificNameID exists as a column but is blank on some rows

Fix the geodeticDatum error by adding that column to df_fixed. Run check_fields again to confirm the geodeticDatum error disappears.

💡 All OBIS coordinates use WGS84 — the value should be the string 'WGS84'. The scientificNameID empty-value errors will remain — those require a WoRMS lookup (Block 6).

In [ ]:

df_fixed = df_occ.copy()
# Add the missing geodeticDatum column

In [ ]:

#@title 🔑 SOLUTION — double-click to reveal
df_fixed = df_occ.copy()
df_fixed['geodeticDatum'] = 'WGS84'
print('geodeticDatum column added.')
df_fixed.head(2)

In [ ]:

# Verify: run the check again on your fixed dataset
errors_after = check_fields(df_fixed, analysis_type='occurrence_core', level='error')
print(f'Errors before fix : {len(errors)}')
print(f'Errors after fix  : {len(errors_after)}')
print(f'\nRemaining errors:')
errors_after

📋 Block 3: `check_occurrence_core_and_extension`

This function validates controlled vocabulary and uniqueness constraints in occurrence files:

What it checks	Valid values
Duplicate `occurrenceID`	All values must be unique
`occurrenceStatus`	`present` or `absent` (case-sensitive, lowercase)
`basisOfRecord`	`PreservedSpecimen`, `FossilSpecimen`, `LivingSpecimen`, `HumanObservation`, `MachineObservation`, `MaterialSample`, `MaterialCitation`, `MaterialEntity`, `Occurrence`, `Taxon`, `Event` (case-sensitive)

We'll work with a synthetic dataset that contains intentional errors — just like a file a collaborator might send you.
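
As a rough illustration of the three rules in the table, the same conditions can be written in a few lines of pandas. This is a sketch on an invented toy frame, not what `check_occurrence_core_and_extension` does internally; the vocabulary sets are copied from the table above.

```python
import pandas as pd

VALID_STATUS = {'present', 'absent'}
VALID_BASIS = {
    'PreservedSpecimen', 'FossilSpecimen', 'LivingSpecimen', 'HumanObservation',
    'MachineObservation', 'MaterialSample', 'MaterialCitation', 'MaterialEntity',
    'Occurrence', 'Taxon', 'Event',
}

# Invented rows: one duplicated ID, one wrong-case status, one wrong-case basis
toy_occ = pd.DataFrame({
    'occurrenceID': ['X-1', 'X-1', 'X-2', 'X-3'],
    'occurrenceStatus': ['present', 'absent', 'Present', 'present'],
    'basisOfRecord': ['HumanObservation', 'Occurrence', 'Taxon', 'humanobservation'],
})

# keep=False marks every member of a duplicate group, not just the later ones
dup_rows = toy_occ.index[toy_occ['occurrenceID'].duplicated(keep=False)].tolist()
# isin is an exact, case-sensitive comparison
bad_status = toy_occ.index[~toy_occ['occurrenceStatus'].isin(VALID_STATUS)].tolist()
bad_basis = toy_occ.index[~toy_occ['basisOfRecord'].isin(VALID_BASIS)].tolist()

print('duplicate occurrenceID rows:', dup_rows)
print('invalid occurrenceStatus rows:', bad_status)
print('invalid basisOfRecord rows:', bad_basis)
```

Because the comparison is exact, `Present` and `humanobservation` are rejected even though they differ from a valid term only by case.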

In [ ]:

# Explore the dataset — can you spot any issues?
df_occ

In [ ]:

# Run the validation
occ_errors = check_occurrence_core_and_extension(df_occ)
print(f'Issues found: {len(occ_errors)}')
occ_errors

🔧 Your Task: Block 3

The output above found issues across 4 different rows:

A duplicate occurrenceID — two rows share the same ID
An invalid basisOfRecord — FieldObservation is not in the accepted vocabulary
An invalid occurrenceStatus — a value that is not present or absent
An invalid basisOfRecord case — Humanobservation has incorrect capitalisation

Fix all 4 issues in df_occ_fixed and run the check again to confirm 0 errors.

💡 Look at the row column in the error output — it tells you which row index to fix.
Use df_occ_fixed.loc[row_number, 'column_name'] = 'new_value' to fix individual cells.

In [ ]:

df_occ_fixed = df_occ.copy()
# Fix 1: duplicate occurrenceID

# Fix 2: invalid basisOfRecord (wrong vocabulary term)

# Fix 3: invalid occurrenceStatus

# Fix 4: invalid basisOfRecord (wrong case)

In [ ]:

#@title 🔑 SOLUTION — double-click to reveal
df_occ_fixed = df_occ.copy()
# Fix 1: make the duplicate occurrenceID unique
df_occ_fixed.loc[1, 'occurrenceID'] = 'WS-OCC-001-B'
# Fix 2: use a valid basisOfRecord vocabulary term
df_occ_fixed.loc[5, 'basisOfRecord'] = 'HumanObservation'
# Fix 3: use 'present' (lowercase) — not 'presence'
df_occ_fixed.loc[8, 'occurrenceStatus'] = 'present'
# Fix 4: correct the capitalisation
df_occ_fixed.loc[11, 'basisOfRecord'] = 'HumanObservation'

In [ ]:

# Verify: should return 0 errors
result = check_occurrence_core_and_extension(df_occ_fixed)
print(f'Issues remaining: {len(result)}')
if len(result) == 0:
    print('All issues fixed! ✅') 

🔗 Block 4: `check_eventids`

Event-based Darwin Core datasets use a parent–child hierarchy:

Cruise (top-level, no parent)
 ├── Station A  ← parentEventID = cruise eventID
 └── Station B  ← parentEventID = cruise eventID

Top-level events have an empty parentEventID. All other events must set parentEventID to an eventID that actually exists in the same file.

What it checks	Condition
`eventID` field	Must be present in the dataset
`parentEventID` field	Must be present in the dataset
`eventID` values	Must be unique — no duplicates
`parentEventID` values	Every non-empty value must reference an existing `eventID`

This function is used with an event_core or occurrence_core — not an extension. We'll use df_event, which has a cruise–station hierarchy with two intentional errors.
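
The two value-level rules can be sketched with plain Python collections. The records below are invented and the function name `sketch_event_hierarchy` is hypothetical; it only mirrors the conditions in the table, not the internals of `check_eventids`.

```python
from collections import Counter


def sketch_event_hierarchy(events):
    """Illustrative only: events is a list of (eventID, parentEventID) pairs."""
    counts = Counter(event_id for event_id, _ in events)
    known_ids = set(counts)
    duplicates = sorted(eid for eid, n in counts.items() if n > 1)
    # An empty parent marks a top-level event, so it is skipped
    orphans = sorted(
        eid for eid, parent in events
        if parent and parent not in known_ids
    )
    return duplicates, orphans


toy_events = [
    ('CRUISE-A', ''),
    ('STN-1', 'CRUISE-A'),
    ('STN-2', 'CRUISE-A'),
    ('STN-2', 'CRUISE-A'),   # duplicate eventID
    ('STN-3', 'CRUISE-Z'),   # parent does not exist in the list
]
duplicates, orphans = sketch_event_hierarchy(toy_events)
print('duplicate eventIDs:', duplicates)
print('events with unknown parent:', orphans)
```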

In [ ]:

# Explore the event core
print(f'df_event : {df_event.shape[0]} rows x {df_event.shape[1]} cols')
print(f'Columns  : {list(df_event.columns)}')
df_event

In [ ]:

# Validate event IDs in the event core
eventid_errors = check_eventids(df_event)
print(f'Event ID issues found: {len(eventid_errors)}')
eventid_errors

🔧 Your Task: Block 4

The output above found 2 issues:

A duplicate eventID — STATION-002 appears on two rows
A broken parentEventID — STATION-003 has parentEventID = CRUISE-9999, but no row in the file has eventID = CRUISE-9999 (the parent event is missing)

Fix both errors in df_event_fixed and run check_eventids again to confirm 0 errors.

💡 For the duplicate: keep the first occurrence and drop the second. For the broken parent: STATION-003 should belong to CRUISE-2024.

In [ ]:

df_event_fixed = df_event.copy()
# Fix 1: remove the duplicate eventID

# Fix 2: correct the orphaned parentEventID

In [ ]:

#@title 🔑 SOLUTION — double-click to reveal
df_event_fixed = df_event.copy()
# Fix 1: keep only the first occurrence of each eventID
df_event_fixed = df_event_fixed.drop_duplicates(subset='eventID', keep='first').reset_index(drop=True)
# Fix 2: STATION-003 belongs to CRUISE-2024, not the non-existent CRUISE-9999
df_event_fixed.loc[df_event_fixed['eventID'] == 'STATION-003', 'parentEventID'] = 'CRUISE-2024'
print('Fixes applied.')

eventid_errors_fixed = check_eventids(df_event_fixed)
print(f'Event ID issues remaining: {len(eventid_errors_fixed)}')
if len(eventid_errors_fixed) == 0:
    print('Event IDs are valid! ✅')
df_event_fixed

`check_extension_eventids`

When you publish an eMoF file alongside a core file, every eventID in the extension must match an eventID in the core — otherwise those measurements cannot be linked to any sampling event.

Supported file pairings:

event_core + occurrence_extension
event_core + eMoF
occurrence_core + eMoF

Argument	Default	Effect
`core`	—	DataFrame of the event_core or occurrence_core
`extension_or_emof`	—	DataFrame of the eMoF or extension file
`field`	`'eventID'`	Linking field: `'eventID'` or `'occurrenceID'`

df_event has stations STATION-001 through STATION-003. df_emof references STATION-001 through STATION-004 — let's see what happens.
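
Conceptually this is an anti-join: keep the extension rows whose linking value has no partner in the core. A minimal pandas sketch on invented frames (an illustration of the idea, not the pyobistools code path):

```python
import pandas as pd

toy_core = pd.DataFrame({'eventID': ['E1', 'E2', 'E3']})
toy_ext = pd.DataFrame({
    'eventID': ['E1', 'E2', 'E4', 'E4'],
    'measurementValue': [10.2, 9.8, 11.1, 11.4],
})

# Extension rows whose eventID is absent from the core
unlinked = toy_ext[~toy_ext['eventID'].isin(toy_core['eventID'])]
missing_ids = sorted(unlinked['eventID'].unique())

print(unlinked)
print('Missing from core:', missing_ids)
```

Note that `E3` in the core has no measurements, which is fine: the rule only runs from extension to core.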

In [ ]:

df_event

In [ ]:

df_emof

In [ ]:

# Check that all df_emof eventIDs have a match in the event core
ext_errors = check_extension_eventids(df_event, df_emof, field='eventID')
print(f'Extension linkage errors: {len(ext_errors)}')
ext_errors

🔧 Your Task: Block 4b

df_emof contains measurements for STATION-004, but that station has no corresponding eventID in df_event.

Fix the error by adding STATION-004 to df_event_fixed and rerunning the check.

💡 In a real dataset this means either:

the core file is incomplete (the station was never recorded as an event), or
the eMoF has a typo in one of its eventID values.

In [ ]:

df_event_fixed = df_event.copy()
# Add the missing station to the event core

In [ ]:

#@title 🔑 SOLUTION — double-click to reveal
df_event_fixed = pd.concat([
    df_event,
    # The new station's parent is the cruise, not the station itself
    pd.DataFrame({'eventID': ['STATION-004'], 'parentEventID': ['CRUISE-2024']})
], ignore_index=True)

result = check_extension_eventids(df_event_fixed, df_emof, field='eventID')
print(f'Extension linkage errors remaining: {len(result)}')
if len(result) == 0:
    print('All eMoF records are linked to a core event! ✅')

📐 Block 5: `check_measurementids`

In the extended Measurement or Fact (eMoF) extension, each measurement record should have a unique measurementID.
Duplicate IDs break the ability to reference individual measurements and are rejected by OBIS.

This function has one argument: data — the eMoF DataFrame.
It returns a standard [field | level | row | message] error DataFrame.

In [ ]:

# Explore the eMoF dataset — can you spot the duplicate IDs?
df_emof

In [ ]:

# Run the check
meas_errors = check_measurementids(df_emof)
print(f'Measurement ID issues found: {len(meas_errors)}')
meas_errors

🔧 Your Task: Block 5

MEAS-002 and MEAS-004 each appear on two rows — the check flagged all 4 offending rows.

Assign unique measurementID values to all rows in df_emof_fixed.
Run the check again to confirm 0 errors.

💡 There are many valid approaches. The simplest is to generate a new sequential ID for every row.
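
One alternative, if existing IDs should be preserved wherever possible, is to leave first occurrences untouched and suffix only the repeats. The sketch below uses an invented frame and an invented suffix scheme (`-dup1`, `-dup2`, ...); whether such a scheme suits your dataset is a judgment call, and it assumes the suffixed values do not collide with IDs already in the file.

```python
import pandas as pd

toy_emof = pd.DataFrame({
    'measurementID': ['M-1', 'M-2', 'M-2', 'M-3', 'M-2'],
    'measurementValue': [1.0, 2.0, 2.5, 3.0, 2.7],
})

# cumcount numbers rows within each measurementID group: 0 for the first, then 1, 2, ...
repeat_no = toy_emof.groupby('measurementID').cumcount()
is_repeat = repeat_no > 0

toy_fixed = toy_emof.copy()
toy_fixed.loc[is_repeat, 'measurementID'] = (
    toy_emof.loc[is_repeat, 'measurementID'] + '-dup' + repeat_no[is_repeat].astype(str)
)

print(toy_fixed['measurementID'].tolist())
```

Re-running the uniqueness check afterwards is still worthwhile, since a suffix could in principle reproduce an ID that already exists.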

In [ ]:

df_emof_fixed = df_emof.copy()
# Assign unique measurementIDs

In [ ]:

#@title 🔑 SOLUTION — double-click to reveal
df_emof_fixed = df_emof.copy()
df_emof_fixed['measurementID'] = [f'MEAS-{i:03d}' for i in range(1, len(df_emof_fixed) + 1)]

In [ ]:

# Verify: should return 0 errors
result = check_measurementids(df_emof_fixed)
print(f'Measurement ID issues remaining: {len(result)}')
if len(result) == 0:
    print('All measurementIDs are now unique! ✅')
df_emof_fixed

🧬 Block 6: `check_scientificname_and_ids`

This function queries the World Register of Marine Species (WoRMS) to validate:

Whether scientific names are recognized
Whether scientificNameID (LSID) matches the accepted name
Whether taxon rank is correct

What is an LSID? A Life Science Identifier: a globally unique, persistent identifier such as urn:lsid:marinespecies.org:taxname:126505
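
A WoRMS LSID ends in the numeric AphiaID, so its shape is easy to sanity-check before making any API call. The sketch below is a format check only; the helper name `aphia_id_from_lsid` is hypothetical, and it does not confirm that the ID exists in WoRMS or matches the scientific name.

```python
import re

# Pattern for WoRMS taxon LSIDs of the form shown above
WORMS_LSID = re.compile(r'^urn:lsid:marinespecies\.org:taxname:(\d+)$')


def aphia_id_from_lsid(lsid):
    """Return the numeric AphiaID from a WoRMS LSID, or None if the string is malformed."""
    match = WORMS_LSID.match(lsid.strip())
    return int(match.group(1)) if match else None


good = aphia_id_from_lsid('urn:lsid:marinespecies.org:taxname:126505')
bad = aphia_id_from_lsid('marinespecies.org:126505')
print(good)
print(bad)
```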

`value`	Returns	What it checks
`'names'`	1 DataFrame	Name recognized in WoRMS?
`'names_ids'`	2 DataFrames	Above + LSID matches WoRMS?
`'names_taxons_ids'`	2 DataFrames	Above + taxon rank correct?

Output: Oui/Yes = match; Non/No = mismatch

📶 Internet required — live WoRMS API calls. Expect ~30–90 seconds for this dataset.

In [ ]:

nest_asyncio.apply()
print('Ready to query WoRMS 🌎')

In [ ]:

df_occ.scientificname.unique()

In [ ]:

# Step 1: validate scientific names only
# All 4 species in df_occ are fictional — expect Non/No for every name
names_result = check_scientificname_and_ids(df_occ, value='names')

n_total   = len(names_result)
n_match   = (names_result[('Validation', 'Exact_Match')] == 'Oui/Yes').sum()
n_nomatch = (names_result[('Validation', 'Exact_Match')] == 'Non/No').sum()
print(f'Unique names checked  : {n_total}')
print(f'  Matched  (Oui/Yes) : {n_match}')
print(f'  Mismatch (Non/No)  : {n_nomatch}')
names_result

🔧 Your Task: Block 6, Part 1

The names_result table shows a Non/No in Exact_Match for species not recognized by WoRMS.

Write code to:

Filter names_result to show only the Non/No rows
Print just the list of unrecognized species names

💡 The column is a multi-index tuple: ('Validation', 'Exact_Match')

In [ ]:

# Filter to Non/No rows and print the species names

In [ ]:

#@title 🔑 SOLUTION — double-click to reveal
non_match = names_result[names_result[('Validation', 'Exact_Match')] == 'Non/No']
print(f'Species not recognized by WoRMS: {len(non_match)}')
print()
for name in non_match[('Dataset Values', 'scientificName')].values:
    print(f'  - {name}')

💡 Working with freshwater or terrestrial data?
Use itis_usage=True to fall back to ITIS when WoRMS has no result:
check_scientificname_and_ids(df, value='names', itis_usage=True)
ITIS queries add several minutes for large datasets.

🗺️ Block 7: `check_onland`

Marine data should be in the ocean. Coordinates on land are a red flag — common causes:

Decimal separator error (-66,12 instead of -66.12)
Swapped latitude/longitude
Wrong coordinate reference system (not WGS84)
Georeferencing error (point snapped to wrong location)
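
Two of these causes can be screened for before running any shoreline check. The sketch below is a hypothetical pre-check (both function names are invented): it normalises a decimal comma and flags latitudes outside ±90, which often indicates swapped columns. It cannot catch a swap when both values happen to lie within ±90.

```python
def parse_coordinate(text):
    """Convert '-66,12' or '-66.12' to a float; raises ValueError on anything else."""
    return float(str(text).strip().replace(',', '.'))


def looks_swapped(lat, lon):
    """True when lat is out of range but the pair would be valid if swapped."""
    return abs(lat) > 90 and abs(lat) <= 180 and abs(lon) <= 90


lat = parse_coordinate('44,65')
lon = parse_coordinate('-63,57')
print(lat, lon)

print(looks_swapped(lat, lon))        # in-range pair
print(looks_swapped(-123.4, 48.5))    # latitude beyond ±90
```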

Argument	Options	Effect
`offline`	`True` / `False`	`True` = Natural Earth shoreline (fast); `False` = OBIS web service (precise)
`buffer`	degrees	Points within this distance of shore are considered valid
`report`	`True` / `False`	`True` returns error format; `False` returns the flagged rows

We'll use a synthetic 25-record dataset. A few coordinates are obviously wrong — dropped into distant inland cities — while several others look plausible but still fall on land. The offline check below flags 10 of the 25.

In [ ]:

# Explore the dataset — see if you can spot the suspicious coordinates
print(f'{len(df_onland)} records')
df_onland[['occurrenceID', 'scientificName', 'decimalLatitude', 'decimalLongitude']].head(10)

In [ ]:

# Run the on-land check (offline = Natural Earth shoreline, no extra internet needed)
flagged = check_onland(df_onland, offline=True)
print(f'Records flagged as on land: {len(flagged)}')
flagged[['occurrenceID', 'scientificName', 'decimalLatitude', 'decimalLongitude', 'on_land']]

In [ ]:

# Visualize: all records blue, on-land records red
# The map zooms out to fit every point — the 3 distant-city errors stretch the view;
# the other red points sit on land within the Atlantic Canada cluster.
all_lats = df_onland['decimalLatitude'].astype(float)
all_lons = df_onland['decimalLongitude'].astype(float)

centre_lat = 0.5 * (all_lats.max() + all_lats.min())
centre_lon = 0.5 * (all_lons.max() + all_lons.min())
area = (all_lats.max() - all_lats.min()) * (all_lons.max() - all_lons.min())
zoom = np.interp(x=area, xp=[0.0005, .02, .05, 30, 350, 3500], fp=[12, 9.5, 6, 4, 2, 1])

fig = go.Figure()
fig.add_trace(go.Scattermapbox(
    lat=all_lats, lon=all_lons, mode='markers',
    marker={'size': 7, 'color': 'steelblue', 'opacity': 0.7},
    name=f'All records ({len(df_onland)})'
))
fig.add_trace(go.Scattermapbox(
    lat=flagged['decimalLatitude'].astype(float),
    lon=flagged['decimalLongitude'].astype(float), mode='markers',
    marker={'size': 12, 'color': 'red', 'opacity': 0.9},
    name=f'On land ⚠️ ({len(flagged)})'
))
fig.update_layout(
    mapbox={'style': 'open-street-map', 'center': {'lon': centre_lon, 'lat': centre_lat}, 'zoom': zoom},
    showlegend=True, legend={'x': 0.01, 'y': 0.99},
    margin={'l': 0, 'r': 0, 'b': 0, 't': 30},
    title_text='Workshop dataset — on-land observations highlighted in red',
    height=480
)
fig.show()

🔧 Your Task: Block 7

check_onland flagged 10 of the 25 records (stored in flagged). On the map, three red points sit far outside the survey area — obvious errors in distant cities — while the rest fall on land within Atlantic Canada: plausible-looking coordinates that are still invalid for marine data.

⚠️ The offline Natural Earth shoreline is coarse, so it can flag near-shore points generously. Re-run with offline=False to use OBIS's precise shoreline service, which may flag fewer.

Create df_ocean containing only the valid ocean observations (remove the on-land rows).
Then verify with check_onland that no flagged records remain.

💡 flagged.index contains the original row indices of the on-land records.
Use .isin() to identify which rows to exclude from df_onland.

In [ ]:

# Create df_ocean with on-land rows removed
df_ocean = df_onland.copy()  # copy so that edits do not alter df_onland
# Your fix here

In [ ]:

#@title 🔑 SOLUTION — double-click to reveal
df_ocean = df_onland[~df_onland.index.isin(flagged.index)]
print(f'Removed {len(flagged)} on-land record(s).')
print(f'Ocean records remaining: {len(df_ocean)}')

In [ ]:

# Verify: check_onland should return 0 flagged rows
flagged_after = check_onland(df_ocean, offline=True)
print(f'Flagged records remaining: {len(flagged_after)}')
if len(flagged_after) == 0:
    print('All coordinates are now in the ocean! ✅')

# Show cleaned map (all blue, no red)
clean_lats = df_ocean['decimalLatitude'].astype(float)
clean_lons = df_ocean['decimalLongitude'].astype(float)
c_lat = 0.5 * (clean_lats.max() + clean_lats.min())
c_lon = 0.5 * (clean_lons.max() + clean_lons.min())
c_area = (clean_lats.max() - clean_lats.min()) * (clean_lons.max() - clean_lons.min())
c_zoom = np.interp(x=c_area, xp=[0.0005, .02, .05, 30, 350, 3500], fp=[12, 9.5, 6, 4, 2, 1])

fig2 = go.Figure(go.Scattermapbox(
    lat=clean_lats, lon=clean_lons, mode='markers',
    marker={'size': 8, 'color': 'steelblue', 'opacity': 0.8},
    name=f'Ocean records ({len(df_ocean)})'
))
fig2.update_layout(
    mapbox={'style': 'open-street-map', 'center': {'lon': c_lon, 'lat': c_lat}, 'zoom': c_zoom},
    showlegend=True, margin={'l': 0, 'r': 0, 'b': 0, 't': 30},
    title_text='Cleaned dataset — all records in the ocean ✅',
    height=400
)
fig2.show()

📈 Block 8: Validation Pipeline

Now let's build a reusable run_validation() function that runs the occurrence-core checks
(check_fields and check_occurrence_core_and_extension) in one pass and returns a single consolidated error report.

In [ ]:

def run_validation(df, analysis_type='occurrence_core'):
    """Run the occurrence-core checks (check_fields + check_occurrence_core_and_extension) and return a combined error report."""
    results = []

    for level in ('error', 'warning'):
        r = check_fields(df, level=level, analysis_type=analysis_type).copy()
        if len(r):
            r['check'] = f'check_fields ({level})'
            results.append(r)

    r = check_occurrence_core_and_extension(df).copy()
    if len(r):
        r['check'] = 'check_occurrence'
        results.append(r)

    if not results:
        print('No issues found! ✅')
        return pd.DataFrame(columns=['field', 'level', 'row', 'message', 'check'])

    return pd.concat(results, ignore_index=True)


print('run_validation() ready ✅')
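
The same pattern generalises to any set of checks. The sketch below is not part of pyobistools: `run_checks` and the stand-in `missing_datum` check (including its message text) are hypothetical, and it assumes every check is a callable that takes a DataFrame and returns a DataFrame of issues.

```python
import pandas as pd


def run_checks(df, checks):
    """Hypothetical generalisation of run_validation().

    `checks` maps a label to a callable that takes the DataFrame and
    returns a DataFrame of issues (possibly empty).
    """
    results = []
    for label, check in checks.items():
        issues = check(df).copy()
        if len(issues):
            issues['check'] = label  # remember which check raised the issue
            results.append(issues)
    if not results:
        return pd.DataFrame(columns=['field', 'level', 'row', 'message', 'check'])
    return pd.concat(results, ignore_index=True)


# Stand-in check so the sketch runs without pyobistools installed.
def missing_datum(df):
    if 'geodeticDatum' in df.columns:
        return pd.DataFrame(columns=['field', 'level', 'row', 'message'])
    return pd.DataFrame([{
        'field': 'geodeticDatum', 'level': 'error', 'row': None,
        'message': 'Required field geodeticDatum is missing',
    }])


demo = pd.DataFrame({'occurrenceID': ['a', 'b']})
demo_report = run_checks(demo, {'missing_datum': missing_datum})
print(demo_report[['check', 'field', 'level']])
```

With pyobistools installed, the runner could be given `{'check_fields (error)': lambda d: check_fields(d, level='error', analysis_type='occurrence_core'), 'check_occurrence': check_occurrence_core_and_extension}`, mirroring the calls in run_validation(). Whether the other checks return the same columns is not confirmed here, so inspect their output before concatenating.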

In [ ]:

# Run the occurrence-core pipeline on df_occ
report = run_validation(df_occ)
print(f'Total issues: {len(report)} ({(report["level"]=="error").sum()} errors, {(report["level"]=="warning").sum()} warnings)')

# Scoreboard: group by check, level, and field
scoreboard = (
    report
    .groupby(['check', 'level', 'field'])
    .size()
    .reset_index(name='count')
    .sort_values(['level', 'count'], ascending=[True, False])
    .reset_index(drop=True)
)
print('\n🏆 Data Quality Scoreboard')
scoreboard
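
For a quick per-check view of errors versus warnings, a cross-tabulation is a compact alternative to the grouped scoreboard. The sketch below uses a small hand-made report with the same columns as the run_validation() output; the rows are illustrative, not real results from df_occ.

```python
import pandas as pd

# Illustrative rows only; a real report comes from run_validation(df_occ).
toy_report = pd.DataFrame({
    'check': ['check_fields (error)', 'check_fields (warning)',
              'check_fields (warning)', 'check_occurrence', 'check_occurrence'],
    'level': ['error', 'warning', 'warning', 'error', 'error'],
    'field': ['geodeticDatum', 'minimumDepthInMeters', 'maximumDepthInMeters',
              'occurrenceStatus', 'occurrenceStatus'],
})

# Rows: one per check. Columns: one per level. Cells: number of issues.
summary = pd.crosstab(toy_report['check'], toy_report['level'])
print(summary)
```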

🔧 Your Task — Block 8

Using the scoreboard above, apply all the fixes from previous blocks to create a df_final
and run the pipeline again. How much did the error count drop?

💡 You already solved the geodeticDatum error in Block 2. Chain that fix with any others you can address.

In [ ]:

df_final = df_occ.copy()
# Apply fixes from previous blocks

# Run the pipeline
report_final = run_validation(df_final)
print(f'Issues remaining: {len(report_final)}')

In [ ]:

# 🔑 SOLUTION — Expand to reveal
df_final = df_occ.copy()
# Fix from Block 2: add the missing required column
df_final['geodeticDatum'] = 'WGS84'

report_final = run_validation(df_final)
print(f'Before : {len(report)} issues')
print(f'After  : {len(report_final)} issues')
print(f'Fixed  : {len(report) - len(report_final)}')
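
One way to chain several fixes is to write each as a small function and combine them with `DataFrame.pipe`. The sketch below runs on a toy frame; the three helper names are hypothetical, and dropping duplicate IDs is only one possible remedy (re-issuing unique IDs may be more appropriate for your dataset).

```python
import pandas as pd


def add_geodetic_datum(df, datum='WGS84'):
    """Add the geodeticDatum column if it is missing."""
    out = df.copy()
    if 'geodeticDatum' not in out.columns:
        out['geodeticDatum'] = datum
    return out


def normalise_occurrence_status(df):
    """Trim whitespace and lower-case occurrenceStatus ('Present' -> 'present')."""
    out = df.copy()
    out['occurrenceStatus'] = out['occurrenceStatus'].str.strip().str.lower()
    return out


def drop_duplicate_occurrence_ids(df):
    """Keep the first row for each occurrenceID (assumption: later rows are true duplicates)."""
    return df.drop_duplicates(subset='occurrenceID', keep='first').reset_index(drop=True)


toy = pd.DataFrame({
    'occurrenceID': ['occ-1', 'occ-2', 'occ-2'],
    'occurrenceStatus': ['Present', ' absent', 'present'],
})

fixed = (
    toy
    .pipe(add_geodetic_datum)
    .pipe(normalise_occurrence_status)
    .pipe(drop_duplicate_occurrence_ids)
)
print(fixed)
```

Each helper returns a new DataFrame, so the original stays untouched and the before/after reports can still be compared.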

✅ Pre-Publication Checklist

You've run all 6 validation checks. Before publishing to OBIS:

Fix all error-level check_fields issues — hard requirements
Fix duplicate IDs from check_occurrence and check_measurementids
Fix invalid vocabulary (occurrenceStatus, basisOfRecord) from check_occurrence
Resolve orphaned parentEventIDs from check_eventids
Correct scientificNameID to use WoRMS LSIDs from check_scientificname_and_ids
Investigate and remove or document on-land points from check_onland
Address warning-level fields (depth, sampling protocol, etc.) for richer data
Publish via IPT using pyIPT or directly on your node's OBIS IPT instance
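
The first checklist item can be turned into a simple gate on the consolidated report. This is a minimal sketch: `publication_summary` is a hypothetical helper, it assumes a `level` column holding 'error' or 'warning' as in the run_validation() output, and it only covers checks that were merged into that report.

```python
import pandas as pd


def publication_summary(report):
    """Hypothetical helper: count issues by level and flag whether any errors remain."""
    counts = report['level'].value_counts()
    n_errors = int(counts.get('error', 0))
    n_warnings = int(counts.get('warning', 0))
    return {'errors': n_errors, 'warnings': n_warnings, 'ready': n_errors == 0}


# Illustrative reports; real ones come from run_validation().
blocked = pd.DataFrame({'level': ['error', 'warning', 'warning']})
clean = pd.DataFrame({'level': ['warning']})
empty = pd.DataFrame(columns=['field', 'level', 'row', 'message', 'check'])

print(publication_summary(blocked))  # errors remain, not ready
print(publication_summary(clean))    # warnings only, ready
print(publication_summary(empty))    # no issues, ready
```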

Next tools in the workflow:

pyIPT — publish to OBIS programmatically
OBIS2CIOOS — transform OBIS datasets for CIOOS discovery
OBIS data download API — fetch data in Parquet format
OBIS data quality manual

🙋 Questions? Open an issue at github.com/cioos-siooc/pyobistools
