
Storing model parameters for NTK analysis in Colibri #2433

Open
achiefa wants to merge 12 commits into master from ntk_colibri

Conversation


@achiefa achiefa commented Feb 23, 2026

This PR implements checkpointing of the model parameters during training. Parameters are serialised in npz format as a single flattened array, which is the format the n3fit module in Colibri expects.

There are a few other changes that are meant as workarounds, to make the serialised objects compatible with what I have already implemented in Colibri for the NTK. This is therefore a temporary solution until the n3fit module in Colibri is ready. This PR is not meant to be merged!

I'll perform tests and post the results here.
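As a rough sketch of the serialisation described above, one could flatten the per-layer weights into a single 1-D array and store it with NumPy's npz format. The function and file names here are illustrative, not the PR's actual implementation:

```python
import numpy as np

def save_flattened_parameters(weights, out_path):
    """Flatten a list of per-layer weight arrays into one 1-D array
    and store it as an npz archive under the key 'parameters'."""
    flat = np.concatenate([w.ravel() for w in weights])
    np.savez(out_path, parameters=flat)

# Two hypothetical layer weight arrays, shapes (2, 3) and (4,)
layers = [np.ones((2, 3)), np.zeros(4)]
save_flattened_parameters(layers, "checkpoint.npz")

# Reading it back gives a single flattened array of length 6 + 4 = 10
loaded = np.load("checkpoint.npz")["parameters"]
```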

@achiefa achiefa requested a review from scarlehoff February 23, 2026 23:51
@achiefa achiefa self-assigned this Feb 23, 2026
@achiefa achiefa added enhancement New feature or request dont-merge labels Feb 23, 2026
```python
epoch = self.stopping_object.would_stop_epoch
with open(out_path, "w", encoding="utf-8") as f:
    f.write(str(epoch) if epoch is not None else "None")
    f.write("\n")
```
Member:
I know you said this is not meant to be merged, but if instead of adding a new file you add this to the final .json file, so that normally would_stop_epoch == stop_epoch unless some ntk flag makes them different, this could very well live in the standard n3fit.

(If you store it all in the same object, that also means it should work with parallel replicas out of the box.)
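The suggestion above could be sketched roughly as follows, with a plain dict standing in for whatever n3fit writes to its final .json file. The field names are illustrative, not the actual n3fit schema:

```python
import json

# What the fit already writes (illustrative field name)
fit_info = {"stop_epoch": 1500}

# Instead of a separate file, record the NTK stopping epoch alongside it.
# Normally the two coincide; an ntk flag could make them differ.
would_stop_epoch = 1500
fit_info["would_stop_epoch"] = would_stop_epoch

# Serialise everything in one object, so parallel replicas each carry
# their own copy without extra per-replica files.
serialised = json.dumps(fit_info)
```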

Contributor (Author):

Thanks, this seems much better.

```python
    -1 if self._history.final_epoch is None else self._history.final_epoch + 1
)
if not self._dont_stop:
    self._restore_best_weights()
```
Member:

I would remove this condition. If the rest is correct, then the best weights should be the last ones. If not, there's something missing / not working as intended.

It is a good colibri down the mine

achiefa (Contributor, Author) commented Feb 24, 2026:

You mean the if not self._dont_stop: condition? In other words, always call self._restore_best_weights()?

Ah no, you meant the condition for self._would_stop_epoch.

