@pgierz
Member

@pgierz pgierz commented Jan 20, 2025

This PR lets you create an intake-esm catalog from your experiment's output files.

I'll run my own tests and post how this might look in practice. Tagging @mandresm for developer feedback and @JanStreffing as a friendly tester ;-)

Next Steps:

  • Sharing catalogs centrally

Copilot Summary

This pull request introduces new functionality to create and manage intake-esm catalogs for simulation configurations. The main changes include the addition of new functions to handle the creation and writing of these catalogs, as well as updates to the configuration files to incorporate these new steps.

New Functionality for Intake-ESM Catalogs:

  • src/esm_runscripts/catalog.py: Added create_intake_esm_catalog and write_intake_esm_catalog functions to create and save intake-esm catalogs based on simulation configurations. These functions allow for the generation of catalogs that can be controlled via configuration keys.

Configuration Updates:

Dependency Additions:

  • setup.py: Added the dpath library as a new dependency to support the merging of catalog data.

@pgierz pgierz requested a review from mandresm January 20, 2025 12:29
@pgierz
Member Author

pgierz commented Jan 20, 2025

Partial work for #1270

@JanStreffing
Contributor

Can you share what such a catalogue looks like?

@pgierz
Member Author

pgierz commented Jan 20, 2025

For the AWI-ESM 2 Tutorial example on Albedo, the resulting YAML looks like the example below. Fold out to see the details; it is long.

Catalog Example
esmcat_version: 0.1.0
id: c90f4a285ba316f6296291a331024402f78cff3f5cc09462863f2f60406b5637
last_updated: '2025-01-20 15:07:54.677422'
title: Intake-ESM Catalog for Experiment sylvester-004
description: Basic AWIESM Test for PI
aggregation_control:
  aggregations:
    - attribute_name: variable_id
      options: {}
      type: union
    - attribute_name: time_min
      options:
        compat: override
        coords: minimal
        dim: time
      type: join_existing
    - attribute_name: variable_id
      options: {}
      type: union
    - attribute_name: time_min
      options:
        compat: override
        coords: minimal
        dim: time
      type: join_existing
    - attribute_name: variable_id
      options: {}
      type: union
    - attribute_name: time_min
      options:
        compat: override
        coords: minimal
        dim: time
      type: join_existing
  groupby_attrs:
    - project
    - institution_id
    - source_id
    - experiment_id
    - realm
    - project
    - institution_id
    - source_id
    - experiment_id
    - realm
    - project
    - institution_id
    - source_id
    - experiment_id
    - realm
  variable_column_name: variable_id
assets:
  column_name: uri
  format_column_name: format
attributes:
  - column_name: variable_id
    vocabulary: ''
  - column_name: project
    vocabulary: ''
  - column_name: institution_id
    vocabulary: ''
  - column_name: source_id
    vocabulary: ''
  - column_name: experiment_id
    vocabulary: ''
  - column_name: realm
    vocabulary: ''
  - column_name: time_min
    vocabulary: ''
  - column_name: time_max
    vocabulary: ''
  - column_name: variable_id
    vocabulary: ''
  - column_name: project
    vocabulary: ''
  - column_name: institution_id
    vocabulary: ''
  - column_name: source_id
    vocabulary: ''
  - column_name: experiment_id
    vocabulary: ''
  - column_name: realm
    vocabulary: ''
  - column_name: time_min
    vocabulary: ''
  - column_name: time_max
    vocabulary: ''
  - column_name: variable_id
    vocabulary: ''
  - column_name: project
    vocabulary: ''
  - column_name: institution_id
    vocabulary: ''
  - column_name: source_id
    vocabulary: ''
  - column_name: experiment_id
    vocabulary: ''
  - column_name: realm
    vocabulary: ''
  - column_name: time_min
    vocabulary: ''
  - column_name: time_max
    vocabulary: ''
catalog_dict:
   ... many many entries

These catalogues then behave the same way as the standard DKRZ catalog. An abbreviated example from DKRZ:

import intake
import intake_esm
dkrz_catalog = intake.open_catalog(["https://dkrz.de/s/intake"])
pm2 = dkrz_catalog.dkrz_palmod2_disk
pm2.search(variable_id="tas")

The functionality added in this PR creates such catalogs for each experiment at tidy time. So, one can do something like:

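import pathlib

import intake
import intake_esm  # registers the esm_datastore driver with intake
import pandas as pd
import yaml  # needed for yaml.safe_load below

catalog_file = pathlib.Path("/albedo/work/user/pgierz/SciComp/Tutorials/AWIESM_Basics/experiments/sylvester-005/sylvester-005_intake_catalog.yaml")
# Load the catalog description written at tidy time
with open(catalog_file) as f:
    catalog_dict = yaml.safe_load(f)
# Build the table of entries that intake-esm searches over
catalog_df = pd.DataFrame(catalog_dict)
cat = intake.open_esm_datastore(obj=dict(esmcat=catalog_dict, df=catalog_df))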

At this point, the cat object is the same kind of object as pm2. For now, you get one catalog per experiment, but the overall design goal is for these catalogs to eventually be stored centrally and shared between many users. The "what you can search for" is exemplified here with the variable id, but in principle it could be expanded to anything of interest: namelist settings, which machine you ran on, the model version, and so on. It would then become trivial to do things like "show me all experiments with momix" or "get all runs by Fernanda and Christian with CO2 settings greater than 1000".
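
To make the search idea concrete, here is a minimal sketch in plain Python. It is not the intake-esm API and not what this PR implements: it assumes each catalog row is a dict, and the extra columns `user`, `co2_ppm`, and `mixing_scheme` (and all the values in them) are hypothetical. With a real catalog, the same kind of filter would presumably go through `cat.search(...)` or the catalog's DataFrame.

```python
# Hypothetical catalog rows. The columns `user`, `co2_ppm`, and `mixing_scheme`
# are illustrative assumptions; this PR does not record them.
ROWS = [
    {"experiment_id": "exp-a", "user": "fernanda", "co2_ppm": 1120.0, "mixing_scheme": "momix"},
    {"experiment_id": "exp-b", "user": "christian", "co2_ppm": 280.0, "mixing_scheme": "kpp"},
    {"experiment_id": "exp-c", "user": "christian", "co2_ppm": 1400.0, "mixing_scheme": "momix"},
    {"experiment_id": "exp-d", "user": "paul", "co2_ppm": 2000.0, "mixing_scheme": "momix"},
]


def search(rows, **criteria):
    """Return the rows that satisfy every criterion.

    A criterion may be a plain value (equality), a list/tuple/set
    (membership), or a callable predicate applied to the row's value.
    """

    def matches(row, key, wanted):
        if key not in row:
            return False
        value = row[key]
        if callable(wanted):
            return bool(wanted(value))
        if isinstance(wanted, (list, tuple, set, frozenset)):
            return value in wanted
        return value == wanted

    return [
        row
        for row in rows
        if all(matches(row, key, wanted) for key, wanted in criteria.items())
    ]


# "Show me all experiments with momix"
momix_ids = [r["experiment_id"] for r in search(ROWS, mixing_scheme="momix")]

# "Get all runs by Fernanda and Christian with CO2 settings greater than 1000"
high_co2_ids = [
    r["experiment_id"]
    for r in search(
        ROWS,
        user=["fernanda", "christian"],
        co2_ppm=lambda ppm: ppm > 1000,
    )
]

print(momix_ids)
print(high_co2_ids)
```

The point of the sketch is only that every extra column recorded at tidy time becomes one more thing that can be searched on; which columns are worth recording is still an open design question.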

@pgierz pgierz changed the base branch from release to catalog January 21, 2025 10:54
@pgierz pgierz changed the title from "Catalogs" to "Catalogs: Generate during Tidy" Jan 21, 2025
@pgierz pgierz marked this pull request as ready for review January 21, 2025 10:57
@pgierz pgierz self-assigned this Jan 21, 2025
@pgierz
Member Author

pgierz commented Jan 23, 2025

@mandresm, I have one idea for improvement here: at the moment, I assume that the tidy job has already finished moving data, and then iterate through the entire outdata folder. This isn't really efficient, since some of the outdata files will already have been catalogued.

I think it would be better to restrict the catalog created during the current tidy job to the outdata files that this job has just produced. Do we keep a record of the files the tidy job rearranges? If not, I would need to implement that earlier on in the job.
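
A minimal sketch of what such a record could look like, under stated assumptions: the function names `move_and_record` and `files_to_catalog`, the manifest file name, and the `moved_files` key are all hypothetical, and none of this is existing esm_runscripts code. The tidy step writes down what it moved, and the catalog step reads only that list instead of walking the whole outdata folder. JSON is used here to keep the sketch free of extra dependencies; YAML would work the same way.

```python
import json
import pathlib
import shutil
import tempfile


def move_and_record(files, dest_dir, manifest_path):
    """Move files into dest_dir and record their new locations in a manifest."""
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in files:
        target = dest_dir / pathlib.Path(src).name
        shutil.move(str(src), str(target))
        moved.append(str(target))
    manifest = {"moved_files": moved}
    pathlib.Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return moved


def files_to_catalog(manifest_path):
    """Return only the files that the current tidy job has moved."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    return [pathlib.Path(p) for p in manifest["moved_files"]]


# Demonstration in a throwaway directory; all file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    work = root / "work"
    outdata = root / "outdata"
    work.mkdir()
    outdata.mkdir()

    # A file from an earlier run that is already in the catalog.
    (outdata / "old_0001.nc").touch()

    # A file produced by the current run, still in the work folder.
    fresh = work / "new_0002.nc"
    fresh.touch()

    manifest_file = root / "tidy_manifest.json"
    move_and_record([fresh], outdata, manifest_file)

    to_catalog_names = [p.name for p in files_to_catalog(manifest_file)]
    all_outdata_names = sorted(p.name for p in outdata.iterdir())

print(to_catalog_names)
print(all_outdata_names)
```

With a manifest like this, the catalog step only has to open and inspect the newly moved files, while the files from earlier runs keep their existing catalog entries.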

@pgierz
Member Author

pgierz commented Feb 3, 2025

This PR is waiting on the automatic creation of a YAML file describing which files are cleaned up during this run.
