Description
Context
load_dwca_data() in src/dataset_tools/utils.py:20 hardcodes the columns it selects from the DwC-A occurrence data, and verbatimScientificName is not among them. This means all downstream CLI commands (clean-dataset, verify-images, split-dataset, etc.) cannot natively use verbatimScientificName as a label column.
Currently, this is worked around by scripts/build_species_list.py, which reads the DwC-A independently and joins the name column onto the annotations CSV after the clean-dataset step. This works but adds an extra step to the pipeline.
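The workaround amounts to a keyed join between the annotations and the occurrence records. A minimal sketch of that idea follows; it is not the actual contents of build_species_list.py. The function name attach_name_column, the gbifID join key, and the list-of-dicts row representation are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def attach_name_column(
    annotations: Iterable[Dict[str, str]],
    occurrences: Iterable[Dict[str, str]],
    key: str = "gbifID",  # assumed join key; the real script may use another
    name_column: str = "verbatimScientificName",
) -> List[Dict[str, str]]:
    """Left-join name_column from occurrences onto annotations by key."""
    # Build a lookup from occurrence key to the name value.
    names = {row[key]: row[name_column] for row in occurrences}
    # Keep every annotation row; unmatched rows get an empty name.
    return [{**ann, name_column: names.get(ann[key], "")} for ann in annotations]
```

Because the join happens after clean-dataset, every consumer of the name column depends on this extra pass, which is the step the proposal below would remove.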
Proposed Changes
- Add an extra_columns parameter (or a name_column option) to load_dwca_data() in src/dataset_tools/utils.py:20 so that additional DwC-A columns can be carried through the pipeline.
- Update the CLI decorators/options in src/dataset_tools/cli.py to expose this option on relevant commands (clean-dataset, split-dataset, etc.).
- Once supported natively, the build_species_list.py bridge script could be simplified or removed.
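One way the extra_columns idea could look is sketched below, using only the standard library. This is a hypothetical stand-in, not the real load_dwca_data(): the function name load_occurrence_rows, the DEFAULT_COLUMNS tuple, and the assumption that the occurrence data is tab-separated text are illustrative, and the actual hardcoded column list in utils.py may differ.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

# Assumed default selection; the real hardcoded list may differ.
DEFAULT_COLUMNS = ("gbifID", "species")


def load_occurrence_rows(
    handle: TextIO, extra_columns: Iterable[str] = ()
) -> List[Dict[str, str]]:
    """Read tab-separated occurrence rows, keeping defaults plus extras."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Preserve order and avoid duplicating a column already in the defaults.
    wanted = list(DEFAULT_COLUMNS)
    for col in extra_columns:
        if col not in wanted:
            wanted.append(col)
    # Fail early with a clear message if a requested column is absent.
    available = reader.fieldnames or []
    missing = [col for col in wanted if col not in available]
    if missing:
        raise KeyError(f"columns not found in occurrence data: {missing}")
    return [{col: row[col] for col in wanted} for row in reader]


# Usage: carry verbatimScientificName through alongside the defaults.
sample = (
    "gbifID\tspecies\tverbatimScientificName\tcountry\n"
    "1\tApis mellifera\tApis mellifera L.\tDE\n"
)
rows = load_occurrence_rows(
    io.StringIO(sample), extra_columns=["verbatimScientificName"]
)
```

Defaulting extra_columns to an empty selection keeps existing callers unchanged, so the CLI commands would only need to pass the option through when a user asks for it.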
Related
- PR #69, feat: add species classifier training pipeline (species classifier pipeline), uses the workaround