Conversation
| pd = _get_pandas() | ||
| tqdm = _get_tqdm() |
There was a problem hiding this comment.
Is there a reason these are not just being imported at the top of the file?
cleanlab_studio/studio/enrichment.py
Outdated
| Returns: | ||
| Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset. | ||
| """ | ||
| run_online = _get_run_online() |
There was a problem hiding this comment.
Why is this not just imported at the top of the file?
cleanlab_studio/studio/enrichment.py
Outdated
| Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset. | ||
| """ | ||
| run_online = _get_run_online() | ||
| job_info = run_online(data, options, new_column_name, self._api_key) |
There was a problem hiding this comment.
I don't think passing in self._api_key because run_online expects a Studio object?
cleanlab_studio/studio/enrichment.py
Outdated
|
|
||
| Args: | ||
| data (Union[pd.DataFrame, List[dict]]): The dataset to enrich. | ||
| options (EnrichmentOptions): Options for enriching the dataset. |
There was a problem hiding this comment.
Link to EnrichmentOptions docstring
| return job_info | ||
|
|
||
|
|
||
| def _validate_enrichment_options(options: EnrichmentOptions) -> None: |
There was a problem hiding this comment.
Can you clarify why there is a separate _validate_enrichment_options defined here rather than using the validation function in run() here?
| regex: Union[str, Replacement, List[Replacement]], | ||
| ) -> Union[pd.Series, List[str]]: | ||
| column_data: Union["pd.Series", List[str]], | ||
| regex: Union[str, Tuple[str, str], List[Tuple[str, str]]], |
There was a problem hiding this comment.
[nit] Replacement is a type alias for the Tuple[str, str] type (ref here), not entirely sure why you made this change?
| @lru_cache(maxsize=None) | ||
| def _get_pandas(): | ||
| import pandas as pd | ||
|
|
||
| return pd |
There was a problem hiding this comment.
what is going on here?
pandas is already a dependency of this package, there should be no special logic to lazy-import it
Line 52 in c2a3013
| @lru_cache(maxsize=None) | ||
| def _get_tqdm(): | ||
| from tqdm import tqdm | ||
|
|
||
| return tqdm | ||
|
|
There was a problem hiding this comment.
what is going on here?
tqdm is already a dependency of this package, there should be no special logic to lazy import it
Line 57 in c2a3013
| @lru_cache(maxsize=None) | ||
| def _get_run_online(): | ||
| from cleanlab_studio.utils.data_enrichment.enrich import run_online | ||
|
|
||
| return run_online | ||
|
|
There was a problem hiding this comment.
get rid of this and do a standard import, unsure why you are using such an odd approach
| @@ -17,6 +17,7 @@ | |||
| import pandas as pd | |||
There was a problem hiding this comment.
please update the PR description with:
-
User code if they just want to do some real-time data enrichment quickly.
-
User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Refactored client side code based on this task: https://www.notion.so/cleanlab/make-data-enrichment-client-side-API-match-backend-API-105c7fee85be8097b54dfb121b7dba4e
Goal: make Dynamic API match the Static API.
in all facets: cohesive naming of methods, argument types, regular expression libraries, etc.
This way user can use backend API to prototype Enrichment jobs and run them over a big static dataset, but then use client-API when they need to run the same logic in real-time over streaming data one at a time.
For any packages we need to import client-side, make these lazy optional imports, so cleanlab-studio package still works without those installed.
This will contain:
User code if they just want to do some real-time data enrichment quickly.
User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
Still doing some testing (unit test runs) but will request review on overall structure.