
Conversation

@adilhusain-s (Collaborator) commented on Dec 24, 2025

Overview

This PR stabilizes the release pipeline by introducing partial manifest tooling, refactoring CI workflows to eliminate race conditions, and improving fault tolerance across architectures.

The key motivation is reliability.

Before this change, the pipeline tightly coupled builds, releases, and git updates inside matrix jobs. This made releases fragile, hard to recover from, and increasingly error-prone after adding Trivy scanning. In particular:

  • Non-release artifacts (Trivy SBOMs and scan reports) were accidentally being picked up during manifest parsing.
  • A failure on a single architecture could cancel the entire workflow.
  • Concurrent matrix jobs attempted to push to the repository, causing race conditions and flaky failures.
  • Recovering from partial failures required rerunning the full workflow.

This PR decouples artifact generation from manifest updates, introduces an explicit aggregation step, and makes the pipeline resilient to partial failures.


How the Release Pipeline Works (After This PR)

At a high level, the pipeline now runs in four clearly separated phases:

  1. Discover which Python versions to release
  2. Build artifacts per architecture
  3. Create or update GitHub releases
  4. Aggregate partial manifests and update tracked data atomically

This separation is intentional and is what fixes the reliability issues.


Pipeline Flow Explained

1. Tag Discovery (get-tags job)

The workflow first determines which Python versions should be processed.

  • If .github/release/python-tag-filter.txt exists, it is used as a filter (e.g. 3.13.*).
  • Otherwise, the workflow derives a filter from the latest upstream Python version.
  • Matching Python tags are collected and passed as a JSON matrix.

This keeps the workflow deterministic and avoids manual inputs while still allowing controlled releases.
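
A rough sketch of this selection step, for illustration only (the select_tags helper and the fnmatch-style matching are assumptions, not the workflow's actual implementation; the filter-file path is the one named above):

# Illustrative tag-discovery sketch (hypothetical helper; not the actual workflow code).
import fnmatch
import json
from pathlib import Path

FILTER_FILE = Path(".github/release/python-tag-filter.txt")

def select_tags(all_tags: list[str], latest_upstream: str) -> str:
    """Return the JSON matrix of Python tags to release."""
    if FILTER_FILE.exists():
        pattern = FILTER_FILE.read_text().strip()           # explicit filter, e.g. "3.13.*"
    else:
        major_minor = ".".join(latest_upstream.split(".")[:2])
        pattern = f"{major_minor}.*"                         # derived from the latest upstream version
    matched = sorted(t for t in all_tags if fnmatch.fnmatch(t, pattern))
    return json.dumps({"tag": matched})

print(select_tags(["3.12.9", "3.13.0", "3.13.1"], latest_upstream="3.13.1"))
# -> {"tag": ["3.13.0", "3.13.1"]}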


2. Build & Package (Matrix Jobs)

For each discovered Python tag, the workflow runs a matrix build across:

  • Architectures (ppc64le, s390x)
  • Ubuntu versions (22.04, 24.04)

Key design choices:

  • fail-fast: false
    A failure on one architecture does not cancel other builds.
  • Each matrix job:
    • Builds the Python artifacts
    • Uploads them to GitHub Releases
    • Generates a partial manifest describing only its own artifacts

Partial manifests are uploaded as workflow artifacts and do not touch git.
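
As a concrete illustration of that partial-manifest step, a single matrix job might do something like the following (the helper name, field names, and file layout are illustrative assumptions, not the exact schema emitted by the tooling):

# Hypothetical sketch: each matrix job describes only its own artifact and writes a small JSON file.
import json
from pathlib import Path

def write_partial_manifest(version: str, arch: str, ubuntu: str, tarball: Path, out_dir: Path) -> Path:
    entry = {
        "version": version,          # e.g. "3.13.3"
        "arch": arch,                # e.g. "ppc64le" or "s390x"
        "os": f"ubuntu-{ubuntu}",    # e.g. "ubuntu-22.04"
        "filename": tarball.name,    # this job's tarball only; never Trivy SBOMs or scan reports
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"partial-{version}-{arch}-{ubuntu}.json"
    out_file.write_text(json.dumps(entry, indent=2) + "\n")
    return out_file

# The resulting directory is uploaded as a workflow artifact; nothing is pushed to git here.
write_partial_manifest("3.13.3", "ppc64le", "22.04",
                       Path("python-3.13.3-linux-22.04-ppc64le.tar.gz"),
                       Path("partial-manifests"))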


3. Release Asset Finalization (release-assets job)

Once builds complete, a follow-up job ensures release assets are finalized per Python version.

  • Operates per Python tag (not per architecture)
  • Proceeds even if some architectures failed
  • Ensures release metadata is consistent

4. Manifest Aggregation (update-manifests job)

Instead of each build job pushing to the repository, a single aggregation job now runs:

  • Downloads all available partial manifest artifacts
    (missing artifacts are tolerated for failed architectures)
  • Merges partial manifests into the tracked data
  • Commits and pushes changes once, atomically

Concurrency is controlled so only one aggregation runs per ref.

If a build for one architecture fails, only that job needs to be rerun.
The regenerated partial manifest can then be recombined without restarting the full workflow.
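
A minimal sketch of the merge idea (hypothetical paths and schema; this is not the actual apply_partial_manifests.py implementation):

# Illustrative aggregation: fold every downloaded partial manifest into the tracked data.
import json
from pathlib import Path

def merge_partials(partials_dir: Path, tracked_file: Path) -> int:
    """Merge all partial JSON files into the tracked manifest; return how many entries were added."""
    tracked = json.loads(tracked_file.read_text()) if tracked_file.exists() else {"artifacts": []}
    known = {a["filename"] for a in tracked["artifacts"]}

    added = 0
    for partial in sorted(partials_dir.rglob("*.json")):     # failed arches simply contribute no files
        entry = json.loads(partial.read_text())
        if entry["filename"] not in known:                    # idempotent: reruns do not duplicate entries
            tracked["artifacts"].append(entry)
            known.add(entry["filename"])
            added += 1

    tracked_file.write_text(json.dumps(tracked, indent=2) + "\n")
    return added

# The aggregation job runs this once per workflow, then commits and pushes the result in a single commit.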


Key Changes

Infrastructure & Security

  • Added retry logic (8 attempts, 5s delay) to dotnet-install.py to handle transient network failures (see the sketch after this list)
  • Upgraded Trivy to v0.68.2 with strict failure thresholds
  • Simplified Makefile by removing unnecessary sudo usage
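
As a sketch of the retry behaviour described in the first bullet above (simplified and hedged: the real dotnet-install.py uses typer for output, and only the attempt count, delay, and constant names are taken from this PR):

# Simplified, illustrative retry loop; not the exact merged implementation.
import json
import time
import urllib.error
import urllib.request

FETCH_MAX_RETRIES = 8   # "8 attempts" per the description above
FETCH_RETRY_DELAY = 5   # seconds between attempts

def fetch_json(url: str) -> dict:
    """Fetch JSON, retrying transient server-side HTTP errors a bounded number of times."""
    for attempt in range(FETCH_MAX_RETRIES):
        try:
            with urllib.request.urlopen(url) as response:
                return json.loads(response.read())
        except urllib.error.HTTPError as exc:
            if exc.code in (500, 502, 503, 504) and attempt < FETCH_MAX_RETRIES - 1:
                print(f"HTTP {exc.code} fetching {url}; retrying in {FETCH_RETRY_DELAY}s "
                      f"({attempt + 1}/{FETCH_MAX_RETRIES})")
                time.sleep(FETCH_RETRY_DELAY)
                continue
            raise
    raise RuntimeError(f"exhausted {FETCH_MAX_RETRIES} attempts fetching {url}")  # defensive; the loop always returns or raises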

Partial Manifest Tooling

  • generate_partial_manifest.py: Generates architecture-scoped partial manifests
  • apply_partial_manifests.py: Merges partial manifests
  • backfill-manifests.yml: Regenerates or fixes manifests for existing releases without rebuilding binaries
  • Added unit tests for manifest generation and merging logic

This prevents Trivy-generated assets from leaking into release metadata.
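
For example, the generation step can restrict itself to release tarballs so that scanner outputs never reach release metadata; the filename patterns in this sketch are illustrative guesses, not the exact ones used:

# Hypothetical asset filter: keep Python release tarballs, skip Trivy SBOMs and scan reports.
import fnmatch

RELEASE_PATTERN = "python-*.tar.gz"
EXCLUDED_PATTERNS = ("*.sbom.json", "*.spdx.json", "trivy-*.json", "*scan-report*")

def is_release_asset(filename: str) -> bool:
    if any(fnmatch.fnmatch(filename, p) for p in EXCLUDED_PATTERNS):
        return False
    return fnmatch.fnmatch(filename, RELEASE_PATTERN)

assert is_release_asset("python-3.13.3-linux-22.04-ppc64le.tar.gz")
assert not is_release_asset("trivy-scan-report.json")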


CI/CD Workflow Refactor

  • Removed git push operations from matrix jobs
  • Introduced a single aggregation step
  • Added concurrency controls to serialize updates
  • Disabled fail-fast to preserve successful builds

The pipeline now follows an Artifact → Aggregate → Commit model.


Technical Rationale

Pushing to main from within a matrix strategy caused race conditions and flaky failures.
The new aggregation model eliminates these issues and allows partial recovery without full reruns.


Verification

  • ✅ Unit tests for partial manifest generation and merging
  • ✅ Infrastructure validated with upgraded Trivy
  • ✅ Backfill workflow verified to parse tags and generate partial artifacts correctly

adilhusain-s and others added 12 commits December 24, 2025 10:04
- dotnet-install.py: Add retry logic (8 attempts) for JSON fetching to handle network flakes.
- Makefile: Upgrade Trivy to v0.68.2 and enforce build failure on High/Critical vulnerabilities.

Signed-off-by: Adilhusain Shaikh <[email protected]>
- Add 'generate_partial_manifest.py' and 'apply_partial_manifests.py' scripts.
- Add 'backfill-manifests.yml' workflow to process partial manifests.
- Add unit tests for manifest generation and application logic.

Signed-off-by: Adilhusain Shaikh <[email protected]>

fix(tests): update error message assertion for invalid JSON handling

Signed-off-by: Adilhusain Shaikh <[email protected]>
- release-matching-python-tags: Target Python 3.13.* and implement concurrency groups.
- reusable-release-python-tar: Remove direct Git push logic; generate partial manifest artifacts instead.
- release-matching-python-tags: Add 'update-manifests' job to aggregate partials and commit atomically.
- Optimize 'max-parallel' and disable 'fail-fast' for better resilience.

Signed-off-by: Adilhusain Shaikh <[email protected]>
- Drop legacy manifest files for Python 3.9, 3.10, 3.11, and 3.12.
- Add and update manifest definitions for Python 3.13.x and 3.14.x on ppc64le and s390x architectures.

Signed-off-by: Adilhusain Shaikh <[email protected]>
…nd improve descriptions

Signed-off-by: Adilhusain Shaikh <[email protected]>
        return json.loads(response.read())
    except urllib.error.HTTPError as exc:  # Retry transient HTTP errors
        if exc.code in [500, 502, 503, 504] and attempt < FETCH_MAX_RETRIES - 1:
            typer.echo(f"⚠️ HTTP {exc.code} fetching {url}. Retrying in {FETCH_RETRY_DELAY}s... ({attempt + 1}/{FETCH_MAX_RETRIES})")
Member:

we could try exponential back-off for retry attempts, in case many of our CI jobs are retrying for whatever reason.

Member:

something like time.sleep(FETCH_RETRY_DELAY * (2 ** attempt))

Collaborator Author:

Good suggestion! I’ll update the retry logic to use exponential backoff (e.g., time.sleep(FETCH_RETRY_DELAY * (2 ** attempt))) to reduce load during repeated failures.
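
For reference, a rough illustration of what that schedule would mean in practice (hypothetical helper; only the constant names and the suggested formula come from this thread):

# Illustrative backoff schedule, not the merged code.
FETCH_MAX_RETRIES = 8
FETCH_RETRY_DELAY = 5  # base delay in seconds

def backoff_delays(base: int = FETCH_RETRY_DELAY, retries: int = FETCH_MAX_RETRIES) -> list[int]:
    """Seconds slept between attempts: FETCH_RETRY_DELAY * (2 ** attempt)."""
    return [base * (2 ** attempt) for attempt in range(retries - 1)]

print(backoff_delays())  # [5, 10, 20, 40, 80, 160, 320] -> about 10.5 minutes of sleeping in the worst case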

Comment:

Before you get into backoffs, figure out exactly why we are getting those errors. Papering over an issue by extending retries with longer backoffs doesn't usually solve things; it just hides the real issues better.

            time.sleep(FETCH_RETRY_DELAY)
            continue
        raise
    except Exception as exc:
Member:

Can you speak to why we need this broader, catch-all exception for anything that isn't already caught in the previous two sections?

Collaborator Author:

The catch-all exception is intended to handle any unexpected errors not covered by the specific handlers, ensuring the retry loop is robust. If you prefer, I can narrow it further or add a comment explaining its purpose.

            typer.echo(f"⚠️ Error fetching {url}: {exc}. Retrying in {FETCH_RETRY_DELAY}s... ({attempt + 1}/{FETCH_MAX_RETRIES})")
            time.sleep(FETCH_RETRY_DELAY)
            continue
        raise
Member:

do we have any logic to show a fetching failure after max retries?

Collaborator Author:

Yes, after the final retry, the script prints an error message and exits with a non-zero code using typer.Exit, so failures are clearly reported in CI logs.

Comment:

How long does this wind up retrying with the current or any extended backoffs, and how does that compare to the time slice allowed for a GitHub job? And why do we really have to fetch from MS x86 space on every run? blech... Adding retries when there are network issues puts more pressure on the network, which exacerbates any problem. Backoffs are good, but for how long? If you back off for 10 minutes (I don't know how long the backoff goes), how much time slice remains for the job to run? This sounds like it would benefit from good logging, and from deciding whether we should just fail sooner rather than retry until the user is left with a two-minute time slice.


    if not partials_path.exists():
        print(f"No partial manifests found in {partials_path}")
        return 0
Member:

should we return 0 here silently or fail the script with an explicit 1?

Collaborator Author:

In our CI workflows, it’s expected that sometimes there are no partial manifests to apply, so returning 0 is intentional and prevents unnecessary failures. I’ll add a clarifying comment in the code.

    files = discover_partial_files(partials_path)
    if not files:
        print(f"No JSON files discovered under {partials_path}")
        return 0
Member:

same as previous

Collaborator Author:

Confirmed: return codes are handled consistently for all "no work to do" cases, matching our pipeline's fault-tolerant design.

@adilhusain-s (Collaborator Author) commented:

@pleia2

This PR addresses the failure mode documented in #15.

The issue captures the root causes: incorrect architecture mappings (e.g. ppc64le builds resolving to s390x tarballs), manifest pollution after introducing Trivy artifacts, and workflow cancellations due to infrastructure failures.

The changes in this PR refactor the release pipeline to:

  • prevent parallel jobs from mutating shared manifest state
  • tolerate partial architecture failures
  • serialize manifest updates into a single, atomic commit

Linking here for context and future reference.

Comment:

Why are we moving to 3.14 here? Is it expected to be in the OSes you install later, e.g. Ubuntu 22.04? I don't think RHEL/UBI uses 3.14 yet, and we've standardized mostly on 3.12 for various reasons. Is there a real reason to be on 3.14 yet?

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
Comment:

3.13 here but 3.14 earlier? Again, 3.12 would be a lot safer?

-FAIL_ON_CRITICAL ?= 0
-FAIL_ON_HIGH ?= 0
+FAIL_ON_CRITICAL ?= 1
+FAIL_ON_HIGH ?= 1
Comment:

Is this really the default that end users want? To fail their entire GitHub runs on Trivy high/critical findings? It is nice, but there's a lot of software that does not set the bar that high. How does this compare to other runners?

sample_entry = {
    "version": "3.13.3",
    "filename": "python-3.13.3-linux-22.04-ppc64le.tar.gz",
    "arch": "ppc64le",
Comment:

Again, I don't get the inconsistencies with Python versions here.
