Catalogs: Generate during Tidy #1272

pgierz · 2025-01-20T12:26:07Z

This allows you to create an intake catalog from your output files.

I'll run my own tests and post how this might look in practice. Tagging @mandresm for developer feedback and @JanStreffing as a friendly tester ;-)

Next Steps:

Sharing catalogs centrally

Copilot Summary

This pull request introduces new functionality to create and manage intake-esm catalogs for simulation configurations. The main changes include the addition of new functions to handle the creation and writing of these catalogs, as well as updates to the configuration files to incorporate these new steps.

New Functionality for Intake-ESM Catalogs:

src/esm_runscripts/catalog.py: Added create_intake_esm_catalog and write_intake_esm_catalog functions to create and save intake-esm catalogs based on simulation configurations. These functions allow for the generation of catalogs that can be controlled via configuration keys.
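As a rough illustration of the create/write split described above, here is a minimal sketch. It is not the code in src/esm_runscripts/catalog.py: the function names `sketch_create_catalog` and `sketch_write_catalog`, the config keys `expid` and `description`, and the use of JSON instead of YAML are all assumptions, chosen so the example runs with the standard library only. The SHA-256 ID is likewise a guess, based on the 64-character hexadecimal ID in the catalog example further down in this thread.

```python
import hashlib
import json
import pathlib
import tempfile


def sketch_create_catalog(config):
    """Build a minimal catalog header from a (hypothetical) flat config dict."""
    # Stringify every value explicitly so the hash does not depend on Python types.
    flat = {str(key): str(value) for key, value in config.items()}
    digest = hashlib.sha256(json.dumps(flat, sort_keys=True).encode("utf-8")).hexdigest()
    return {
        "esmcat_version": "0.1.0",
        "id": digest,
        "title": f"Intake-ESM Catalog for Experiment {flat.get('expid', 'unknown')}",
        "description": flat.get("description", ""),
    }


def sketch_write_catalog(catalog, directory):
    """Write the catalog into the given directory and return the file path."""
    path = pathlib.Path(directory) / "intake_catalog.json"
    path.write_text(json.dumps(catalog, indent=2))
    return path


# Hypothetical config; a real simulation config is nested and much larger.
config = {"expid": "sylvester-004", "description": "Basic AWIESM Test for PI"}
catalog = sketch_create_catalog(config)
with tempfile.TemporaryDirectory() as tmp:
    written = sketch_write_catalog(catalog, tmp)
    reloaded = json.loads(written.read_text())
```

Keeping creation and writing separate means the in-memory catalog can be inspected or merged before anything touches the disk.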

Configuration Updates:

configs/esm_software/esm_plugins.yaml: Added create_intake_esm_catalog and write_intake_esm_catalog steps to the catalog workflow to integrate the new catalog creation functionality.
configs/esm_software/esm_runscripts/esm_runscripts.yaml: Included create_intake_esm_catalog and write_intake_esm_catalog steps in the choose_job_type section to ensure these steps are executed during the appropriate job types.

Dependency Additions:

setup.py: Added the dpath library as a new dependency to support the merging of catalog data.

pgierz · 2025-01-20T12:37:36Z

Partial work for #1270

JanStreffing · 2025-01-20T12:39:05Z

Can you share what such a catalogue looks like?

pgierz · 2025-01-20T14:48:37Z

For the AWI-ESM 2 tutorial example on Albedo, you get YAML like the one below. Fold out to see the details; it is long.

Catalog Example

esmcat_version: 0.1.0
id: c90f4a285ba316f6296291a331024402f78cff3f5cc09462863f2f60406b5637
last_updated: '2025-01-20 15:07:54.677422'
title: Intake-ESM Catalog for Experiment sylvester-004
description: Basic AWIESM Test for PI
aggregation_control:
  aggregations:
    - attribute_name: variable_id
      options: {}
      type: union
    - attribute_name: time_min
      options:
        compat: override
        coords: minimal
        dim: time
      type: join_existing
    - attribute_name: variable_id
      options: {}
      type: union
    - attribute_name: time_min
      options:
        compat: override
        coords: minimal
        dim: time
      type: join_existing
    - attribute_name: variable_id
      options: {}
      type: union
    - attribute_name: time_min
      options:
        compat: override
        coords: minimal
        dim: time
      type: join_existing
  groupby_attrs:
    - project
    - institution_id
    - source_id
    - experiment_id
    - realm
    - project
    - institution_id
    - source_id
    - experiment_id
    - realm
    - project
    - institution_id
    - source_id
    - experiment_id
    - realm
  variable_column_name: variable_id
assets:
  column_name: uri
  format_column_name: format
attributes:
  - column_name: variable_id
    vocabulary: ''
  - column_name: project
    vocabulary: ''
  - column_name: institution_id
    vocabulary: ''
  - column_name: source_id
    vocabulary: ''
  - column_name: experiment_id
    vocabulary: ''
  - column_name: realm
    vocabulary: ''
  - column_name: time_min
    vocabulary: ''
  - column_name: time_max
    vocabulary: ''
  - column_name: variable_id
    vocabulary: ''
  - column_name: project
    vocabulary: ''
  - column_name: institution_id
    vocabulary: ''
  - column_name: source_id
    vocabulary: ''
  - column_name: experiment_id
    vocabulary: ''
  - column_name: realm
    vocabulary: ''
  - column_name: time_min
    vocabulary: ''
  - column_name: time_max
    vocabulary: ''
  - column_name: variable_id
    vocabulary: ''
  - column_name: project
    vocabulary: ''
  - column_name: institution_id
    vocabulary: ''
  - column_name: source_id
    vocabulary: ''
  - column_name: experiment_id
    vocabulary: ''
  - column_name: realm
    vocabulary: ''
  - column_name: time_min
    vocabulary: ''
  - column_name: time_max
    vocabulary: ''
catalog_dict:
   ... many many entries
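
The aggregations, groupby_attrs and attributes lists in the example above each repeat the same entries three times. The thread does not say whether that is intended. Assuming it is not, a small post-processing step could drop the repeats while keeping the original order. The helper name `unique_in_order` is hypothetical and not part of this PR.

```python
import json


def unique_in_order(items):
    """Return items without repeats, keeping the first occurrence of each."""
    seen = set()
    result = []
    for item in items:
        # Dicts are not hashable, so compare on a canonical JSON string instead.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


# Shortened stand-ins for the repeated lists in the catalog example.
groupby_attrs = ["project", "institution_id", "source_id", "experiment_id", "realm"] * 3
attributes = [
    {"column_name": "variable_id", "vocabulary": ""},
    {"column_name": "project", "vocabulary": ""},
] * 3

deduplicated_groupby = unique_in_order(groupby_attrs)
deduplicated_attributes = unique_in_order(attributes)
```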

These catalogues then behave the same way as the standard DKRZ catalog. An abbreviated example from DKRZ:

import intake
import intake_esm
dkrz_catalog = intake.open_catalog(["https://dkrz.de/s/intake"])
pm2 = dkrz_catalog.dkrz_palmod2_disk
pm2.search(variable_id="tas")

The functionality added in this PR creates such catalogs for each experiment at tidy time. So, one can do something like:

import intake
import intake_esm
import pandas as pd
import pathlib
import yaml  # PyYAML, needed for yaml.safe_load below

catalog_file = pathlib.Path("/albedo/work/user/pgierz/SciComp/Tutorials/AWIESM_Basics/experiments/sylvester-005/sylvester-005_intake_catalog.yaml")
# Load the catalog that was written at tidy time
with open(catalog_file) as f:
    catalog_dict = yaml.safe_load(f)
catalog_df = pd.DataFrame(catalog_dict)
# Build an intake-esm datastore from the in-memory description and table
cat = intake.open_esm_datastore(obj=dict(esmcat=catalog_dict, df=catalog_df))

At this point, the cat object is the same kind of object as pm2. For now, you get one catalog per experiment, but the overall design goal is for these catalogs to eventually be stored centrally and shared between many users. The "what you can search for" is exemplified here with the variable ID, but it could in principle be expanded to anything of interest: namelist settings, which machine you were on, the model version, and so on. It would then become trivial to do things like "show me all experiments with momix" or "get all runs by Fernanda and Christian for CO2 settings greater than 1000".
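
To make that kind of query concrete, here is a minimal sketch using plain Python rows in place of a real catalog. The column names `user`, `co2_ppm` and `momix`, the values, and the helper `search_rows` are all invented for illustration; the catalogs produced by this PR do not necessarily carry these columns.

```python
def search_rows(rows, **conditions):
    """Return rows matching every condition; a condition is a value or a predicate."""
    matches = []
    for row in rows:
        ok = True
        for column, wanted in conditions.items():
            value = row.get(column)
            if callable(wanted):
                # Predicates allow range queries such as "greater than 1000".
                ok = value is not None and wanted(value)
            else:
                ok = value == wanted
            if not ok:
                break
        if ok:
            matches.append(row)
    return matches


# Hypothetical catalog rows, one per experiment.
rows = [
    {"experiment_id": "exp-a", "user": "fernanda", "co2_ppm": 1120.0, "momix": True},
    {"experiment_id": "exp-b", "user": "christian", "co2_ppm": 280.0, "momix": False},
    {"experiment_id": "exp-c", "user": "christian", "co2_ppm": 1400.0, "momix": True},
]

high_co2 = search_rows(
    rows,
    user=lambda name: name in {"fernanda", "christian"},
    co2_ppm=lambda value: value > 1000,
)
with_momix = search_rows(rows, momix=True)
```

With a real intake-esm catalog, exact-match filters would presumably go through `cat.search(...)` as in the DKRZ example, while numeric comparisons would likely need to be applied to the underlying pandas table.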


pgierz · 2025-01-23T12:28:46Z

@mandresm, I have one idea for improvement here: at the moment, I assume that the tidy job has already finished moving data, and then iterate through the entire outdata folder. This isn't really efficient, since some of the outdata files will already have been catalogued.

I think it would be better to restrict the catalog created during the current tidy job to the current outdata files only. Do we keep a record of the files the tidy job rearranges? If not, I would need to implement that earlier on in the job.
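
One way to picture the incremental approach: keep the set of URIs already in the catalog and only index files that are not in it. This is a sketch under assumptions, not the PR's implementation. `find_uncatalogued` is a hypothetical helper, and the file names are made up. It still walks the outdata folder, so a record written by the tidy job itself would be cheaper.

```python
import pathlib
import tempfile


def find_uncatalogued(outdata_dir, catalogued_uris):
    """Return sorted paths under outdata_dir that are not yet in the catalog."""
    known = {str(uri) for uri in catalogued_uris}
    return sorted(
        path
        for path in pathlib.Path(outdata_dir).rglob("*")
        if path.is_file() and str(path) not in known
    )


with tempfile.TemporaryDirectory() as tmp:
    outdata = pathlib.Path(tmp) / "outdata" / "echam"
    outdata.mkdir(parents=True)
    old_file = outdata / "exp_echam_185001.nc"
    new_file = outdata / "exp_echam_185002.nc"
    old_file.touch()
    new_file.touch()

    # Pretend the first file was catalogued by an earlier tidy job.
    new_files = find_uncatalogued(pathlib.Path(tmp) / "outdata", [old_file])
    new_names = [path.name for path in new_files]
```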

pgierz · 2025-02-03T08:49:53Z

Waiting on the automatic creation of a YAML file describing which files are cleaned up during this run.

pgierz requested a review from mandresm January 20, 2025 12:29

pgierz changed the base branch from release to catalog January 21, 2025 10:54

pgierz changed the title ~~Catalogs~~ Catalogs: Generate during Tidy Jan 21, 2025

pgierz mentioned this pull request Jan 21, 2025

Catalogs: Upload Catalog to Server #1274

Draft

1 task

pgierz marked this pull request as ready for review January 21, 2025 10:57

pgierz added 8 commits January 21, 2025 12:14

feat(catalog): start of catalog work

73e1b11

fix(catalog): forgot to add new module in __init__

ab37b8f

fix(catalog): forgot some imports

2c62442

wip(catalog): convert all config elements to strings explicitly, better hashing for ID

098fb6b

wip(catalog): better use of title and description keys

69c40f9

wip(catalog): better saving by using serialize directly from catalog instance

2fd4955

wip(catalog): adds cfgrib dependency for indexing grib files correctly

44bac3f

wip(catalog): finishing touches for first part

dd2bf6d

pgierz force-pushed the feat/catalog branch from 22bfca5 to dd2bf6d Compare January 21, 2025 11:14

pgierz self-assigned this Jan 21, 2025

pgierz added 2 commits February 3, 2025 13:41

feat: adds some example intake files

146d872

wip

78975dc
