feat: support verbatimScientificName as species label in existing dataset tools #70

@mihow

Context

load_dwca_data() in src/dataset_tools/utils.py:20 hardcodes the columns it selects from the DwC-A occurrence data, and verbatimScientificName is not among them. As a result, none of the downstream CLI commands (clean-dataset, verify-images, split-dataset, etc.) can natively use verbatimScientificName as a label column.

The current workaround is scripts/build_species_list.py, which reads the DwC-A independently and joins the name column onto the annotations CSV after the clean-dataset step. This works, but it adds an extra step to the pipeline.
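
For readers unfamiliar with the bridge step, here is a minimal sketch of the kind of join it performs. This is not the code in scripts/build_species_list.py: the function name `join_name_column`, the `gbifID` join key, the `image_path` column, and the sample rows are all assumptions made for illustration, and it uses only the standard library.

```python
import csv
import io


def join_name_column(annotation_rows, occurrence_rows,
                     key="gbifID", name_column="verbatimScientificName"):
    """Attach name_column from the occurrence rows to each annotation row.

    Hypothetical helper: the join key and column names are assumptions.
    Annotation rows without a matching occurrence get an empty string.
    """
    names = {row[key]: row.get(name_column, "") for row in occurrence_rows}
    joined = []
    for row in annotation_rows:
        merged = dict(row)
        merged[name_column] = names.get(row[key], "")
        joined.append(merged)
    return joined


# Illustrative data only: a tab-separated occurrence table and an annotations CSV.
occurrence_text = (
    "gbifID\tverbatimScientificName\n"
    "1\tVanessa atalanta\n"
    "2\tPieris rapae\n"
)
annotations_text = "gbifID,image_path\n1,a.jpg\n3,c.jpg\n"

occurrences = list(csv.DictReader(io.StringIO(occurrence_text), delimiter="\t"))
annotations = list(csv.DictReader(io.StringIO(annotations_text)))
joined = join_name_column(annotations, occurrences)
```

Because this join happens after clean-dataset, any command that runs earlier in the pipeline never sees the name column, which is the gap this issue is about.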

Proposed Changes

  1. Add an extra_columns parameter (or a name_column option) to load_dwca_data() in src/dataset_tools/utils.py:20 so that additional DwC-A columns can be carried through the pipeline.

  2. Update the CLI decorators/options in src/dataset_tools/cli.py to expose this option on relevant commands (clean-dataset, split-dataset, etc.).

  3. Once supported natively, the build_species_list.py bridge script could be simplified or removed.
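
One possible shape for change 1, as a rough sketch rather than a patch against the real function. The current signature and column list of load_dwca_data() are not shown in this issue, so `DEFAULT_COLUMNS`, `load_occurrences`, and the sample data below are hypothetical stand-ins. The idea is only that the default selection stays unchanged and callers opt in to additional columns.

```python
import csv
import io

# Hypothetical stand-in for the hardcoded column selection in load_dwca_data().
DEFAULT_COLUMNS = ("gbifID", "species")


def load_occurrences(handle, extra_columns=()):
    """Read a tab-separated occurrence table, keeping the default columns
    plus any requested extra_columns (assumed parameter name)."""
    reader = csv.DictReader(handle, delimiter="\t")
    wanted = list(DEFAULT_COLUMNS)
    for column in extra_columns:
        if column not in wanted:  # avoid duplicating a default column
            wanted.append(column)
    # Fail early with a clear message if a requested column is absent.
    missing = [c for c in wanted if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in occurrence data: {missing}")
    return [{c: row[c] for c in wanted} for row in reader]


# Illustrative data only.
sample = (
    "gbifID\tspecies\tverbatimScientificName\tcountryCode\n"
    "1\tVanessa atalanta\tVanessa atalanta (Linnaeus, 1758)\tDE\n"
)

default_rows = load_occurrences(io.StringIO(sample))
extended_rows = load_occurrences(
    io.StringIO(sample), extra_columns=["verbatimScientificName"]
)
```

Keeping the default empty means existing callers and CLI commands behave as before. A CLI option on clean-dataset, split-dataset, and the other relevant commands would then only need to forward its value to this parameter.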
