
Conversation

@adilhusain-s (Collaborator) commented on Dec 24, 2025

Overview

This PR stabilizes the release pipeline by introducing partial manifest tooling, refactoring CI workflows to eliminate race conditions, and improving fault tolerance across architectures.

The key motivation is reliability.

Before this change, the pipeline tightly coupled builds, releases, and git updates inside matrix jobs. This made releases fragile, hard to recover from, and increasingly error-prone after adding Trivy scanning. In particular:

  • Non-release artifacts (Trivy SBOMs and scan reports) were accidentally being picked up during manifest parsing.
  • A failure on a single architecture could cancel the entire workflow.
  • Concurrent matrix jobs attempted to push to the repository, causing race conditions and flaky failures.
  • Recovering from partial failures required rerunning the full workflow.

This PR decouples artifact generation from manifest updates, introduces an explicit aggregation step, and makes the pipeline resilient to partial failures.


How the Release Pipeline Works (After This PR)

At a high level, the pipeline now runs in four clearly separated phases:

  1. Discover which Python versions to release
  2. Build artifacts per architecture
  3. Create or update GitHub releases
  4. Aggregate partial manifests and update tracked data atomically

This separation is intentional and is what fixes the reliability issues.


Pipeline Flow Explained

1. Tag Discovery (get-tags job)

The workflow first determines which Python versions should be processed.

  • If .github/release/python-tag-filter.txt exists, it is used as a filter (e.g. 3.13.*).
  • Otherwise, the workflow derives a filter from the latest upstream Python version.
  • Matching Python tags are collected and passed as a JSON matrix.

This keeps the workflow deterministic and avoids manual inputs while still allowing controlled releases.
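
A rough sketch of this selection step, for illustration only (the select_tags helper and the fnmatch-style matching are assumptions, not the workflow's actual implementation; the filter-file path is the one named above):

# Illustrative tag-discovery sketch (hypothetical helper; not the actual workflow code).
import fnmatch
import json
from pathlib import Path

FILTER_FILE = Path(".github/release/python-tag-filter.txt")

def select_tags(all_tags: list[str], latest_upstream: str) -> str:
    """Return the JSON matrix of Python tags to release."""
    if FILTER_FILE.exists():
        pattern = FILTER_FILE.read_text().strip()           # explicit filter, e.g. "3.13.*"
    else:
        major_minor = ".".join(latest_upstream.split(".")[:2])
        pattern = f"{major_minor}.*"                         # derived from the latest upstream version
    matched = sorted(t for t in all_tags if fnmatch.fnmatch(t, pattern))
    return json.dumps({"tag": matched})

print(select_tags(["3.12.9", "3.13.0", "3.13.1"], latest_upstream="3.13.1"))
# -> {"tag": ["3.13.0", "3.13.1"]}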


2. Build & Package (Matrix Jobs)

For each discovered Python tag, the workflow runs a matrix build across:

  • Architectures (ppc64le, s390x)
  • Ubuntu versions (22.04, 24.04)

Key design choices:

  • fail-fast: false
    A failure on one architecture does not cancel other builds.
  • Each matrix job:
    • Builds the Python artifacts
    • Uploads them to GitHub Releases
    • Generates a partial manifest describing only its own artifacts

Partial manifests are uploaded as workflow artifacts and do not touch git.
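
As a concrete illustration of that partial-manifest step, a single matrix job might do something like the following (the helper name, field names, and file layout are illustrative assumptions, not the exact schema emitted by the tooling):

# Hypothetical sketch: each matrix job describes only its own artifact and writes a small JSON file.
import json
from pathlib import Path

def write_partial_manifest(version: str, arch: str, ubuntu: str, tarball: Path, out_dir: Path) -> Path:
    entry = {
        "version": version,          # e.g. "3.13.3"
        "arch": arch,                # e.g. "ppc64le" or "s390x"
        "os": f"ubuntu-{ubuntu}",    # e.g. "ubuntu-22.04"
        "filename": tarball.name,    # this job's tarball only; never Trivy SBOMs or scan reports
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"partial-{version}-{arch}-{ubuntu}.json"
    out_file.write_text(json.dumps(entry, indent=2) + "\n")
    return out_file

# The resulting directory is uploaded as a workflow artifact; nothing is pushed to git here.
write_partial_manifest("3.13.3", "ppc64le", "22.04",
                       Path("python-3.13.3-linux-22.04-ppc64le.tar.gz"),
                       Path("partial-manifests"))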


3. Release Asset Finalization (release-assets job)

Once builds complete, a follow-up job ensures release assets are finalized per Python version.

  • Operates per Python tag (not per architecture)
  • Proceeds even if some architectures failed
  • Ensures release metadata is consistent

4. Manifest Aggregation (update-manifests job)

Instead of each build job pushing to the repository, a single aggregation job now runs:

  • Downloads all available partial manifest artifacts
    (missing artifacts are tolerated for failed architectures)
  • Merges partial manifests into the tracked data
  • Commits and pushes changes once, atomically

Concurrency is controlled so only one aggregation runs per ref.

If a build for one architecture fails, only that job needs to be rerun.
The regenerated partial manifest can then be recombined without restarting the full workflow.
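
A minimal sketch of the merge idea (hypothetical paths and schema; this is not the actual apply_partial_manifests.py implementation):

# Illustrative aggregation: fold every downloaded partial manifest into the tracked data.
import json
from pathlib import Path

def merge_partials(partials_dir: Path, tracked_file: Path) -> int:
    """Merge all partial JSON files into the tracked manifest; return how many entries were added."""
    tracked = json.loads(tracked_file.read_text()) if tracked_file.exists() else {"artifacts": []}
    known = {a["filename"] for a in tracked["artifacts"]}

    added = 0
    for partial in sorted(partials_dir.rglob("*.json")):     # failed arches simply contribute no files
        entry = json.loads(partial.read_text())
        if entry["filename"] not in known:                    # idempotent: reruns do not duplicate entries
            tracked["artifacts"].append(entry)
            known.add(entry["filename"])
            added += 1

    tracked_file.write_text(json.dumps(tracked, indent=2) + "\n")
    return added

# The aggregation job runs this once per workflow, then commits and pushes the result in a single commit.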


Key Changes

Infrastructure & Security

  • Added retry logic (8 attempts, 5s delay) to dotnet-install.py to handle transient network failures (see the sketch after this list)
  • Upgraded Trivy to v0.68.2 with strict failure thresholds
  • Simplified Makefile by removing unnecessary sudo usage
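
As a sketch of the retry behaviour described in the first bullet above (simplified and hedged: the real dotnet-install.py uses typer for output, and only the attempt count, delay, and constant names are taken from this PR):

# Simplified, illustrative retry loop; not the exact merged implementation.
import json
import time
import urllib.error
import urllib.request

FETCH_MAX_RETRIES = 8   # "8 attempts" per the description above
FETCH_RETRY_DELAY = 5   # seconds between attempts

def fetch_json(url: str) -> dict:
    """Fetch JSON, retrying transient server-side HTTP errors a bounded number of times."""
    for attempt in range(FETCH_MAX_RETRIES):
        try:
            with urllib.request.urlopen(url) as response:
                return json.loads(response.read())
        except urllib.error.HTTPError as exc:
            if exc.code in (500, 502, 503, 504) and attempt < FETCH_MAX_RETRIES - 1:
                print(f"HTTP {exc.code} fetching {url}; retrying in {FETCH_RETRY_DELAY}s "
                      f"({attempt + 1}/{FETCH_MAX_RETRIES})")
                time.sleep(FETCH_RETRY_DELAY)
                continue
            raise
    raise RuntimeError(f"exhausted {FETCH_MAX_RETRIES} attempts fetching {url}")  # defensive; the loop always returns or raises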

Partial Manifest Tooling

  • generate_partial_manifest.py: Generates architecture-scoped partial manifests
  • apply_partial_manifests.py: Merges partial manifests
  • backfill-manifests.yml: Regenerates or fixes manifests for existing releases without rebuilding binaries
  • Added unit tests for manifest generation and merging logic

This prevents Trivy-generated assets from leaking into release metadata.
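
For example, the generation step can restrict itself to release tarballs so that scanner outputs never reach release metadata; the filename patterns in this sketch are illustrative guesses, not the exact ones used:

# Hypothetical asset filter: keep Python release tarballs, skip Trivy SBOMs and scan reports.
import fnmatch

RELEASE_PATTERN = "python-*.tar.gz"
EXCLUDED_PATTERNS = ("*.sbom.json", "*.spdx.json", "trivy-*.json", "*scan-report*")

def is_release_asset(filename: str) -> bool:
    if any(fnmatch.fnmatch(filename, p) for p in EXCLUDED_PATTERNS):
        return False
    return fnmatch.fnmatch(filename, RELEASE_PATTERN)

assert is_release_asset("python-3.13.3-linux-22.04-ppc64le.tar.gz")
assert not is_release_asset("trivy-scan-report.json")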


CI/CD Workflow Refactor

  • Removed git push operations from matrix jobs
  • Introduced a single aggregation step
  • Added concurrency controls to serialize updates
  • Disabled fail-fast to preserve successful builds

The pipeline now follows an Artifact → Aggregate → Commit model.


Technical Rationale

Pushing to main from within a matrix strategy caused race conditions and flaky failures.
The new aggregation model eliminates these issues and allows partial recovery without full reruns.


Verification

  • ✅ Unit tests for partial manifest generation and merging
  • ✅ Infrastructure validated with upgraded Trivy
  • ✅ Backfill workflow verified to parse tags and generate partial artifacts correctly

adilhusain-s and others added 12 commits December 24, 2025 10:04
- dotnet-install.py: Add retry logic (8 attempts) for JSON fetching to handle network flakes.
- Makefile: Upgrade Trivy to v0.68.2 and enforce build failure on High/Critical vulnerabilities.

Signed-off-by: Adilhusain Shaikh <[email protected]>
- Add 'generate_partial_manifest.py' and 'apply_partial_manifests.py' scripts.
- Add 'backfill-manifests.yml' workflow to process partial manifests.
- Add unit tests for manifest generation and application logic.

Signed-off-by: Adilhusain Shaikh <[email protected]>

fix(tests): update error message assertion for invalid JSON handling

Signed-off-by: Adilhusain Shaikh <[email protected]>
- release-matching-python-tags: Target Python 3.13.* and implement concurrency groups.
- reusable-release-python-tar: Remove direct Git push logic; generate partial manifest artifacts instead.
- release-matching-python-tags: Add 'update-manifests' job to aggregate partials and commit atomically.
- Optimize 'max-parallel' and disable 'fail-fast' for better resilience.

Signed-off-by: Adilhusain Shaikh <[email protected]>
- Drop legacy manifest files for Python 3.9, 3.10, 3.11, and 3.12.
- Add and update manifest definitions for Python 3.13.x and 3.14.x on ppc64le and s390x architectures.

Signed-off-by: Adilhusain Shaikh <[email protected]>
…nd improve descriptions

Signed-off-by: Adilhusain Shaikh <[email protected]>
        return json.loads(response.read())
    except urllib.error.HTTPError as exc:  # Retry transient HTTP errors
        if exc.code in [500, 502, 503, 504] and attempt < FETCH_MAX_RETRIES - 1:
            typer.echo(f"⚠️ HTTP {exc.code} fetching {url}. Retrying in {FETCH_RETRY_DELAY}s... ({attempt + 1}/{FETCH_MAX_RETRIES})")
Member:

we could try exponential back-off for retry attempts, in case many of our CI jobs are retrying for whatever reason.

Member:

something like time.sleep(FETCH_RETRY_DELAY * (2 ** attempt))

Collaborator Author:

Good suggestion! I’ll update the retry logic to use exponential backoff (e.g., time.sleep(FETCH_RETRY_DELAY * (2 ** attempt))) to reduce load during repeated failures.
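
For reference, a rough illustration of what that schedule would mean in practice (hypothetical helper; only the constant names and the suggested formula come from this thread):

# Illustrative backoff schedule, not the merged code.
FETCH_MAX_RETRIES = 8
FETCH_RETRY_DELAY = 5  # base delay in seconds

def backoff_delays(base: int = FETCH_RETRY_DELAY, retries: int = FETCH_MAX_RETRIES) -> list[int]:
    """Seconds slept between attempts: FETCH_RETRY_DELAY * (2 ** attempt)."""
    return [base * (2 ** attempt) for attempt in range(retries - 1)]

print(backoff_delays())  # [5, 10, 20, 40, 80, 160, 320] -> about 10.5 minutes of sleeping in the worst case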

Comment:

Before you get into backoffs, figure out exactly why we are getting those errors. Papering over an issue by extending retries with longer backoffs doesn't usually solve things; it just hides the real issues better.

            time.sleep(FETCH_RETRY_DELAY)
            continue
        raise
    except Exception as exc:
Member:

Can you speak to why we need this broader, catch-all exception for anything that isn't already caught in the previous two sections?

Collaborator Author:

The catch-all exception is intended to handle any unexpected errors not covered by the specific handlers, ensuring the retry loop is robust. If you prefer, I can narrow it further or add a comment explaining its purpose.

            typer.echo(f"⚠️ Error fetching {url}: {exc}. Retrying in {FETCH_RETRY_DELAY}s... ({attempt + 1}/{FETCH_MAX_RETRIES})")
            time.sleep(FETCH_RETRY_DELAY)
            continue
        raise
Member:

do we have any logic to show a fetching failure after max retries?

Collaborator Author:

Yes, after the final retry, the script prints an error message and exits with a non-zero code using typer.Exit, so failures are clearly reported in CI logs.

Comment:

How long does this wind up retrying with the current or any extended backoffs, and how does that compare to the time slice allowed for a GitHub job? And why do we really have to fetch from MS x86 space on every run? blech... Adding retries when there are network issues puts more pressure on the network, which exacerbates any problem. Backoffs are good, but for how long? If you back off for 10 minutes (I don't know how long the backoff goes), how much time slice remains for the job to run? This sounds like it would benefit from good logging, and from deciding whether we should just fail sooner rather than retry until the user is left with a two-minute time slice.


    if not partials_path.exists():
        print(f"No partial manifests found in {partials_path}")
        return 0
Member:

should we return 0 here silently or fail the script with an explicit 1?

Collaborator Author:

In our CI workflows, it’s expected that sometimes there are no partial manifests to apply, so returning 0 is intentional and prevents unnecessary failures. I’ll add a clarifying comment in the code.

    files = discover_partial_files(partials_path)
    if not files:
        print(f"No JSON files discovered under {partials_path}")
        return 0
Member:

same as previous

Collaborator Author:

Confirmed: return codes are handled consistently for all "no work to do" cases, matching our pipeline's fault-tolerant design.

@adilhusain-s (Collaborator Author) commented:

@pleia2

This PR addresses the failure mode documented in #15.

The issue captures the root causes: incorrect architecture mappings (e.g. ppc64le builds resolving to s390x tarballs), manifest pollution after introducing Trivy artifacts, and workflow cancellations due to infrastructure failures.

The changes in this PR refactor the release pipeline to:

  • prevent parallel jobs from mutating shared manifest state
  • tolerate partial architecture failures
  • serialize manifest updates into a single, atomic commit

Linking here for context and future reference.

Comment:

Why are we moving to 3.14 here? Is it expected to be in the OSes you install later, e.g. Ubuntu 22.04? I don't think RHEL/UBI uses 3.14 yet, and we've standardized mostly on 3.12 for various reasons. Is there a real reason to be on 3.14 yet?

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
Comment:

3.13 here but 3.14 earlier? Again, 3.12 would be a lot safer?

-FAIL_ON_CRITICAL ?= 0
-FAIL_ON_HIGH ?= 0
+FAIL_ON_CRITICAL ?= 1
+FAIL_ON_HIGH ?= 1
Comment:

Is this really the default that end users want? To fail their entire GitHub runs on Trivy high/critical findings? It is nice, but there's a lot of software that does not set the bar that high. How does this compare to other runners?

sample_entry = {
    "version": "3.13.3",
    "filename": "python-3.13.3-linux-22.04-ppc64le.tar.gz",
    "arch": "ppc64le",
Comment:

Again, I don't get the inconsistencies with Python versions here.
