feat: support verbatimScientificName as species label in existing dataset tools #70

## Context

`load_dwca_data()` in `src/dataset_tools/utils.py:20` hardcodes the columns it selects from the DwC-A occurrence data, and `verbatimScientificName` is not among them. As a result, none of the downstream CLI commands (`clean-dataset`, `verify-images`, `split-dataset`, etc.) can use `verbatimScientificName` as a label column natively.

The current workaround is `scripts/build_species_list.py`, which reads the DwC-A independently and joins the name column onto the annotations CSV after the `clean-dataset` step. This works, but it adds an extra step to the pipeline.
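For readers unfamiliar with the workaround, the join it performs can be sketched roughly as below. This is not the code in `build_species_list.py`: the helper name `attach_name_column`, the `gbifID` join key, the `image` column, and the sample rows are all assumptions made for illustration, and only the standard library is used.

```python
import csv
import io


def attach_name_column(annotation_rows, occurrence_rows, key="gbifID",
                       name_column="verbatimScientificName"):
    """Left-join name_column from occurrence rows onto annotation rows.

    Hypothetical helper: the join key and column layout are assumptions.
    """
    names = {row[key]: row[name_column] for row in occurrence_rows}
    # Annotation rows without a match get an empty label instead of being dropped.
    return [{**row, name_column: names.get(row[key], "")} for row in annotation_rows]


# Made-up sample data standing in for the DwC-A occurrence file (tab-separated)
# and the annotations CSV produced by clean-dataset.
occurrences = list(csv.DictReader(io.StringIO(
    "gbifID\tverbatimScientificName\n"
    "1\tApis mellifera L.\n"
    "2\tBombus terrestris (L.)\n"
), delimiter="\t"))
annotations = list(csv.DictReader(io.StringIO(
    "gbifID,image\n"
    "1,a.jpg\n"
    "3,c.jpg\n"
)))

joined = attach_name_column(annotations, occurrences)
```

The point of the issue is that this second read and join would be unnecessary if the loader carried the column through in the first place.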

## Proposed Changes

1. Add an `extra_columns` parameter (or a `name_column` option) to `load_dwca_data()` in `src/dataset_tools/utils.py:20` so that additional DwC-A columns can be carried through the pipeline.

2. Update the CLI decorators/options in `src/dataset_tools/cli.py` to expose this option on relevant commands (`clean-dataset`, `split-dataset`, etc.).

3. Once supported natively, the `build_species_list.py` bridge script could be simplified or removed.
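As a starting point for discussion, proposed change 1 might look something like the sketch below. It is a minimal illustration and does not reflect the actual signature or body of `load_dwca_data()`: the `DEFAULT_COLUMNS` tuple, the file-object argument, and the use of the standard-library `csv` module are all assumptions.

```python
import csv
import io

# Assumed stand-in for the hardcoded column list; the real one lives in utils.py.
DEFAULT_COLUMNS = ("gbifID", "species")


def load_dwca_data(occurrence_file, extra_columns=()):
    """Read occurrence rows, keeping the default columns plus any extras.

    Hypothetical sketch: the real function's arguments and return type may differ.
    """
    reader = csv.DictReader(occurrence_file, delimiter="\t")
    # Preserve order and drop duplicates so a repeated column name is harmless.
    wanted = list(dict.fromkeys((*DEFAULT_COLUMNS, *extra_columns)))
    missing = [c for c in wanted if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in occurrence data: {missing}")
    return [{c: row[c] for c in wanted} for row in reader]


# Made-up sample data for illustration only.
sample = io.StringIO(
    "gbifID\tspecies\tverbatimScientificName\n"
    "1\tApis mellifera\tApis mellifera L.\n"
)
rows = load_dwca_data(sample, extra_columns=["verbatimScientificName"])
```

Defaulting `extra_columns` to an empty tuple would keep existing callers unchanged, and failing early on a missing column should make a mistyped CLI option easier to diagnose.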

## Related

- PR #69 (species classifier pipeline) uses the workaround
