Catalogs: Generate during Tidy #1272
base: catalog
|
Partial work for #1270 |
|
Can you share what such a catalogue looks like? |
|
For the AWI-ESM 2 Tutorial example on Albedo, one gets something in YAML as below. Fold out to see the details, it is long.
Catalog Example
esmcat_version: 0.1.0
id: c90f4a285ba316f6296291a331024402f78cff3f5cc09462863f2f60406b5637
last_updated: '2025-01-20 15:07:54.677422'
title: Intake-ESM Catalog for Experiment sylvester-004
description: Basic AWIESM Test for PI
aggregation_control:
  aggregations:
  - attribute_name: variable_id
    options: {}
    type: union
  - attribute_name: time_min
    options:
      compat: override
      coords: minimal
      dim: time
    type: join_existing
  - attribute_name: variable_id
    options: {}
    type: union
  - attribute_name: time_min
    options:
      compat: override
      coords: minimal
      dim: time
    type: join_existing
  - attribute_name: variable_id
    options: {}
    type: union
  - attribute_name: time_min
    options:
      compat: override
      coords: minimal
      dim: time
    type: join_existing
  groupby_attrs:
  - project
  - institution_id
  - source_id
  - experiment_id
  - realm
  - project
  - institution_id
  - source_id
  - experiment_id
  - realm
  - project
  - institution_id
  - source_id
  - experiment_id
  - realm
  variable_column_name: variable_id
assets:
  column_name: uri
  format_column_name: format
attributes:
- column_name: variable_id
  vocabulary: ''
- column_name: project
  vocabulary: ''
- column_name: institution_id
  vocabulary: ''
- column_name: source_id
  vocabulary: ''
- column_name: experiment_id
  vocabulary: ''
- column_name: realm
  vocabulary: ''
- column_name: time_min
  vocabulary: ''
- column_name: time_max
  vocabulary: ''
- column_name: variable_id
  vocabulary: ''
- column_name: project
  vocabulary: ''
- column_name: institution_id
  vocabulary: ''
- column_name: source_id
  vocabulary: ''
- column_name: experiment_id
  vocabulary: ''
- column_name: realm
  vocabulary: ''
- column_name: time_min
  vocabulary: ''
- column_name: time_max
  vocabulary: ''
- column_name: variable_id
  vocabulary: ''
- column_name: project
  vocabulary: ''
- column_name: institution_id
  vocabulary: ''
- column_name: source_id
  vocabulary: ''
- column_name: experiment_id
  vocabulary: ''
- column_name: realm
  vocabulary: ''
- column_name: time_min
  vocabulary: ''
- column_name: time_max
  vocabulary: ''
catalog_dict:
... many many entries
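To illustrate what the groupby_attrs entry in the catalog above is for: intake-esm groups catalog rows that share the same values in those columns into one dataset and, by default, names that dataset by joining the values with a dot. The following is a minimal plain-Python sketch of that grouping, not intake-esm itself. The rows (including the project, institution_id, source_id and variable_id values and the file paths) are hypothetical stand-ins for the elided catalog_dict entries.

```python
from collections import defaultdict

# The groupby_attrs from the catalog, listed once
GROUPBY_ATTRS = ["project", "institution_id", "source_id", "experiment_id", "realm"]

# Hypothetical rows standing in for the elided catalog_dict entries
rows = [
    {"project": "esm-tools", "institution_id": "AWI", "source_id": "AWI-ESM-2",
     "experiment_id": "sylvester-004", "realm": "atm", "variable_id": "temp2",
     "uri": "/hypothetical/outdata/echam/temp2_185001.nc"},
    {"project": "esm-tools", "institution_id": "AWI", "source_id": "AWI-ESM-2",
     "experiment_id": "sylvester-004", "realm": "atm", "variable_id": "temp2",
     "uri": "/hypothetical/outdata/echam/temp2_185002.nc"},
    {"project": "esm-tools", "institution_id": "AWI", "source_id": "AWI-ESM-2",
     "experiment_id": "sylvester-004", "realm": "ocean", "variable_id": "sst",
     "uri": "/hypothetical/outdata/fesom/sst_1850.nc"},
]


def dataset_key(row, attrs=GROUPBY_ATTRS, sep="."):
    """Build a dataset key by joining the groupby attribute values."""
    return sep.join(str(row[attr]) for attr in attrs)


def group_rows(rows):
    """Map each dataset key to the list of asset URIs that belong to it."""
    groups = defaultdict(list)
    for row in rows:
        groups[dataset_key(row)].append(row["uri"])
    return dict(groups)


groups = group_rows(rows)
for key, uris in groups.items():
    print(key, len(uris))
```

Files that end up under the same key are the ones the aggregations (union over variable_id, join_existing along time) are then applied to.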
These catalogues then behave the same way as the standard DKRZ catalog. An abbreviated example from DKRZ:
import intake
import intake_esm
dkrz_catalog = intake.open_catalog(["https://dkrz.de/s/intake"])
pm2 = dkrz_catalog.dkrz_palmod2_disk
pm2.search(variable_id="tas")
The functionality added in this PR creates such catalogs for each experiment at
import intake
import intake_esm
import pandas as pd
import pathlib
import yaml
catalog_file = pathlib.Path("/albedo/work/user/pgierz/SciComp/Tutorials/AWIESM_Basics/experiments/sylvester-005/sylvester-005_intake_catalog.yaml")
with open(catalog_file) as f:
    catalog_dict = yaml.safe_load(f)
catalog_df = pd.DataFrame(catalog_dict)
cat = intake.open_esm_datastore(obj=dict(esmcat=catalog_dict, df=catalog_df))
At this point, the |
…er hashing for ID
|
@mandresm, I have one idea for improvement here: at the moment, I assume that the tidy job has already finished moving data, and then iterate through the entire outdata folder. This isn't really efficient, since some of the outdata files will already have been catalogued. I think it would be better to restrict the catalog created during the current tidy job to the outdata files of the current run. Do we keep a record of the files the tidy job rearranges? If not, I would need to implement that earlier on in the job. |
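The idea in this comment could be sketched roughly as follows: the move step records every file it relocates in a small manifest, and the catalog step reads only that manifest instead of walking the whole outdata folder. This is a sketch under assumptions, not the esm_runscripts implementation. The function names move_and_record and files_to_catalog, the manifest file name, and the use of JSON (chosen here only to stay within the standard library) are all hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path


def move_and_record(moves, manifest_path):
    """Move each (src, dst) pair and write the destinations to a JSON manifest."""
    moved = []
    for src, dst in moves:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved.append(str(dst))
    manifest_path.write_text(json.dumps(moved, indent=2))
    return moved


def files_to_catalog(manifest_path):
    """Return only the files recorded for the current run."""
    return [Path(p) for p in json.loads(manifest_path.read_text())]


# Demonstration in a throwaway directory (all file names are hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    work, outdata = tmp / "work", tmp / "outdata"
    work.mkdir()
    outdata.mkdir()
    (outdata / "old_1850.nc").write_text("catalogued in an earlier run")
    (work / "new_1851.nc").write_text("output of the current run")

    manifest = tmp / "tidy_manifest.json"
    move_and_record([(work / "new_1851.nc", outdata / "new_1851.nc")], manifest)

    to_catalog = [p.name for p in files_to_catalog(manifest)]
    all_outdata = sorted(p.name for p in outdata.iterdir())

print("to catalog:", to_catalog)
print("all outdata:", all_outdata)
```

With a manifest like this, the cost of the catalog step scales with the files moved in the current run rather than with the size of the whole outdata folder.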
|
Waits for automatic creation of yaml describing which files are cleaned up during this run. |

This allows you to create an intake catalog from your output files.
I'll run my own tests and post how this might look in practice. Tagging @mandresm for developer feedback and @JanStreffing as a friendly tester ;-)
Next Steps:
Copilot Summary
This pull request introduces new functionality to create and manage intake-esm catalogs for simulation configurations. The main changes include the addition of new functions to handle the creation and writing of these catalogs, as well as updates to the configuration files to incorporate these new steps.
New Functionality for Intake-ESM Catalogs:
src/esm_runscripts/catalog.py: Added create_intake_esm_catalog and write_intake_esm_catalog functions to create and save intake-esm catalogs based on simulation configurations. These functions allow for the generation of catalogs that can be controlled via configuration keys.
Configuration Updates:
configs/esm_software/esm_plugins.yaml: Added create_intake_esm_catalog and write_intake_esm_catalog steps to the catalog workflow to integrate the new catalog creation functionality.
configs/esm_software/esm_runscripts/esm_runscripts.yaml: Included create_intake_esm_catalog and write_intake_esm_catalog steps in the choose_job_type section to ensure these steps are executed during the appropriate job types.
Dependency Additions:
setup.py: Added the dpath library as a new dependency to support the merging of catalog data.
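As a rough illustration of what merging catalog data from successive runs involves, here is a plain-Python sketch (deliberately not using dpath, whose exact use in the PR is not shown here). It merges nested dictionaries and extends lists while skipping items that are already present, so that repeated metadata such as groupby_attrs is not appended again on every run. The function name merge_catalogs and the two small example catalogs are hypothetical, and treating catalog_dict as a list of rows is an assumption.

```python
import copy


def merge_catalogs(dst, src):
    """Return a merged copy of dst and src.

    Dicts are merged recursively; lists are extended with items from src
    that are not already present; any other value in src replaces dst's.
    """
    merged = copy.deepcopy(dst)
    for key, value in src.items():
        if key not in merged:
            merged[key] = copy.deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_catalogs(merged[key], value)
        elif isinstance(merged[key], list) and isinstance(value, list):
            for item in value:
                if item not in merged[key]:
                    merged[key].append(copy.deepcopy(item))
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical catalogs from two consecutive runs
run_1 = {
    "aggregation_control": {"groupby_attrs": ["project", "realm"]},
    "attributes": [{"column_name": "variable_id", "vocabulary": ""}],
    "catalog_dict": [{"uri": "/hypothetical/outdata/temp2_1850.nc"}],
}
run_2 = {
    "aggregation_control": {"groupby_attrs": ["project", "realm"]},
    "attributes": [{"column_name": "variable_id", "vocabulary": ""}],
    "catalog_dict": [{"uri": "/hypothetical/outdata/temp2_1851.nc"}],
}

merged = merge_catalogs(run_1, run_2)
print(merged["aggregation_control"]["groupby_attrs"])
print(len(merged["attributes"]), len(merged["catalog_dict"]))
```

The membership check is linear in the list length, which is fine for short metadata lists; for a very large catalog_dict a keyed lookup (for example on uri) would be the more sensible choice.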