feat: manifest generation tooling, workflow optimization, and Python 3.13/3.14 support #11
Conversation
- dotnet-install.py: Add retry logic (8 attempts) for JSON fetching to handle network flakes.
- Makefile: Upgrade Trivy to v0.68.2 and enforce build failure on High/Critical vulnerabilities.

Signed-off-by: Adilhusain Shaikh <[email protected]>
- Add 'generate_partial_manifest.py' and 'apply_partial_manifests.py' scripts.
- Add 'backfill-manifests.yml' workflow to process partial manifests.
- Add unit tests for manifest generation and application logic.

Signed-off-by: Adilhusain Shaikh <[email protected]>

fix(tests): update error message assertion for invalid JSON handling

Signed-off-by: Adilhusain Shaikh <[email protected]>
- release-matching-python-tags: Target Python 3.13.* and implement concurrency groups.
- reusable-release-python-tar: Remove direct Git push logic; generate partial manifest artifacts instead.
- release-matching-python-tags: Add 'update-manifests' job to aggregate partials and commit atomically.
- Optimize 'max-parallel' and disable 'fail-fast' for better resilience.

Signed-off-by: Adilhusain Shaikh <[email protected]>
- Drop legacy manifest files for Python 3.9, 3.10, 3.11, and 3.12.
- Add and update manifest definitions for Python 3.13.x and 3.14.x on ppc64le and s390x architectures.

Signed-off-by: Adilhusain Shaikh <[email protected]>
Signed-off-by: Adilhusain Shaikh <[email protected]>
…gged URLs

Signed-off-by: Adilhusain Shaikh <[email protected]>
Signed-off-by: Adilhusain Shaikh <[email protected]>
Signed-off-by: Adilhusain Shaikh <[email protected]>
…nd improve descriptions

Signed-off-by: Adilhusain Shaikh <[email protected]>
Signed-off-by: Adilhusain Shaikh <[email protected]>
PowerShell/dotnet-install.py
Outdated
```python
            return json.loads(response.read())
        except urllib.error.HTTPError as exc:  # Retry transient HTTP errors
            if exc.code in [500, 502, 503, 504] and attempt < FETCH_MAX_RETRIES - 1:
                typer.echo(f"⚠️ HTTP {exc.code} fetching {url}. Retrying in {FETCH_RETRY_DELAY}s... ({attempt + 1}/{FETCH_MAX_RETRIES})")
```
we could try exponential back-off for retry attempts, in case many of our CI jobs are retrying for whatever reason.
something like `time.sleep(FETCH_RETRY_DELAY * (2 ** attempt))`
Good suggestion! I’ll update the retry logic to use exponential backoff (e.g., time.sleep(FETCH_RETRY_DELAY * (2 ** attempt))) to reduce load during repeated failures.
Before you get into backoffs, figure out exactly why we are getting those errors. Papering over an issue by extending retries with longer backoffs doesn't usually solve things; it just hides real issues better.
```python
                time.sleep(FETCH_RETRY_DELAY)
                continue
            raise
        except Exception as exc:
```
Can you speak to why we need this broader catch-all exception for errors that aren't already caught in the previous two sections?
The catch-all exception is intended to handle any unexpected errors not covered by the specific handlers, ensuring the retry loop is robust. If you prefer, I can narrow it further or add a comment explaining its purpose.
```python
                typer.echo(f"⚠️ Error fetching {url}: {exc}. Retrying in {FETCH_RETRY_DELAY}s... ({attempt + 1}/{FETCH_MAX_RETRIES})")
                time.sleep(FETCH_RETRY_DELAY)
                continue
            raise
```
do we have any logic to show a fetching failure after max retries?
Yes, after the final retry, the script prints an error message and exits with a non-zero code using typer.Exit, so failures are clearly reported in CI logs.
How long does this wind up retrying with the current or any extended backoffs, and how does that compare to the time slice allowed for a GitHub job? And why do we really have to fetch from MS x86 space on every run? Blech... Adding retries when there are network issues puts more pressure on the network, meaning it exacerbates any problem. Backoffs are good, but for how long? If you back off for 10 minutes (I don't know how long the backoff goes), how much time slice remains for the job to run? This sounds like it would benefit from good logging, and we should decide whether to just fail sooner rather than retry until the user is left with a two-minute time slice.
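To make the trade-off discussed in this thread concrete, here is a sketch of exponential backoff with both a per-attempt cap and a cumulative wait budget, so the total retry time stays bounded well under a job's time slice. `FETCH_MAX_RETRIES` and `FETCH_RETRY_DELAY` are the script's existing constants (values here are illustrative); the cap and `max_total_wait` budget are assumptions, not part of the PR.

```python
import time
import urllib.error
import urllib.request

# Assumed from the PR description (8 attempts); exact values are illustrative.
FETCH_MAX_RETRIES = 8
FETCH_RETRY_DELAY = 5.0

def backoff_delay(attempt: int, base: float = FETCH_RETRY_DELAY, cap: float = 60.0) -> float:
    """Exponential backoff delay for a zero-based attempt, capped per attempt."""
    return min(base * (2 ** attempt), cap)

def fetch_json_bytes(url: str, max_total_wait: float = 120.0) -> bytes:
    """Fetch url, retrying transient HTTP errors; fail fast once the wait budget is spent."""
    waited = 0.0
    for attempt in range(FETCH_MAX_RETRIES):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as exc:
            delay = backoff_delay(attempt)
            transient = exc.code in (500, 502, 503, 504)
            if (not transient or attempt == FETCH_MAX_RETRIES - 1
                    or waited + delay > max_total_wait):
                raise  # surface the real failure instead of eating the job's time slice
            print(f"HTTP {exc.code} fetching {url}; retrying in {delay:.0f}s "
                  f"({attempt + 1}/{FETCH_MAX_RETRIES})")
            time.sleep(delay)
            waited += delay
    raise RuntimeError(f"exhausted retries fetching {url}")
```

With these example values the delays are 5, 10, 20, 40, 60... seconds, and the 120-second budget trips before the cumulative wait can dominate the job's runtime.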
```python
    if not partials_path.exists():
        print(f"No partial manifests found in {partials_path}")
        return 0
```
should we return 0 here silently or fail the script with an explicit 1?
In our CI workflows, it’s expected that sometimes there are no partial manifests to apply, so returning 0 is intentional and prevents unnecessary failures. I’ll add a clarifying comment in the code.
```python
    files = discover_partial_files(partials_path)
    if not files:
        print(f"No JSON files discovered under {partials_path}")
        return 0
```
same as previous
Confirmed: return codes are handled consistently for all 'no work to do' cases, matching our pipeline's fault-tolerant design.
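For readers following this thread, the 'no work to do' convention might look roughly like the sketch below. `discover_partial_files` and the directory layout are assumptions based on the snippets quoted above, not the actual script.

```python
from pathlib import Path

def discover_partial_files(root: Path) -> list[Path]:
    """Hypothetical helper: find partial manifest JSON files under root."""
    return sorted(root.rglob("*.json"))

def apply_partials(partials_path: Path) -> int:
    """Return 0 (success) when there is nothing to apply; CI expects this no-op."""
    if not partials_path.exists():
        # Intentional: some runs legitimately produce no partial manifests.
        print(f"No partial manifests found in {partials_path}")
        return 0
    files = discover_partial_files(partials_path)
    if not files:
        print(f"No JSON files discovered under {partials_path}")
        return 0
    # ... merge `files` into the full manifest here ...
    return 0
```

Returning 0 here keeps a scheduled run green when no architecture produced partials, which is the fault-tolerant behavior the reply above describes.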
This PR addresses the failure mode documented in #15. That issue captures the root cause of the incorrect architecture mappings, and the changes in this PR refactor the release pipeline accordingly.
Linking here for context and future reference.
Signed-off-by: Adilhusain Shaikh <[email protected]>
Why are we moving to 3.14 here? Is it expected to be in the OSes you install later, e.g. Ubuntu 22.04? I don't think RHEL/UBI uses 3.14 yet, and we've standardized mostly on 3.12 for various reasons. Is there a real reason to be on 3.14 yet?
```yaml
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.13'
```
3.13 here but 3.14 earlier? Again, 3.12 would be a lot safer?
```diff
-FAIL_ON_CRITICAL ?= 0
-FAIL_ON_HIGH ?= 0
+FAIL_ON_CRITICAL ?= 1
+FAIL_ON_HIGH ?= 1
```
Is this really the default that end users want? To fail their entire github runs on trivy high/critical? It is nice, but there's a lot of software that does not set the bar that high. How does this compare to other runners?
```python
        sample_entry = {
            "version": "3.13.3",
            "filename": "python-3.13.3-linux-22.04-ppc64le.tar.gz",
            "arch": "ppc64le",
```
Again, I don't get the inconsistencies with python versions here?
Overview
This PR stabilizes the release pipeline by introducing partial manifest tooling, refactoring CI workflows to eliminate race conditions, and improving fault tolerance across architectures.
The key motivation is reliability.
Before this change, the pipeline tightly coupled builds, releases, and git updates inside matrix jobs. This made releases fragile, hard to recover from, and increasingly error-prone after adding Trivy scanning.
This PR decouples artifact generation from manifest updates, introduces an explicit aggregation step, and makes the pipeline resilient to partial failures.
How the Release Pipeline Works (After This PR)
At a high level, the pipeline now runs in four clearly separated phases:
1. Tag discovery
2. Build & package
3. Release asset finalization
4. Manifest aggregation

This separation is intentional and is what fixes the reliability issues.
Pipeline Flow Explained
1. Tag Discovery (`get-tags` job)

The workflow first determines which Python versions should be processed. If `.github/release/python-tag-filter.txt` exists, it is used as a filter (e.g. `3.13.*`). This keeps the workflow deterministic and avoids manual inputs while still allowing controlled releases.
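The filtering step described above could be sketched as follows. The filter-file path comes from the PR; the glob matching via `fnmatch` and the function name are illustrative assumptions, not the workflow's actual implementation.

```python
from fnmatch import fnmatch
from pathlib import Path

def filter_tags(tags: list[str], filter_file: Path) -> list[str]:
    """Keep tags matching any glob pattern in the filter file; no file means no filter."""
    if not filter_file.exists():
        return tags  # no filter file: process every discovered tag
    patterns = [line.strip() for line in filter_file.read_text().splitlines()
                if line.strip() and not line.startswith("#")]
    return [tag for tag in tags if any(fnmatch(tag, pat) for pat in patterns)]
```

For example, a filter file containing `3.13.*` would keep `3.13.1` and `3.13.2` while dropping `3.12.4` and `3.14.0`, which is how a controlled release targets one minor series.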
2. Build & Package (Matrix Jobs)
For each discovered Python tag, the workflow runs a matrix build across:
- Architectures: `ppc64le`, `s390x`
- Ubuntu versions: `22.04`, `24.04`

Key design choices:
- `fail-fast: false`: a failure on one architecture does not cancel other builds.
- Partial manifests are uploaded as workflow artifacts and do not touch git.
3. Release Asset Finalization (`release-assets` job)

Once builds complete, a follow-up job ensures release assets are finalized per Python version.
4. Manifest Aggregation (`update-manifests` job)

Instead of each build job pushing to the repository, a single aggregation job now runs (missing artifacts are tolerated for failed architectures). Concurrency is controlled so only one aggregation runs per ref. If a build for one architecture fails, only that job needs to be rerun; the regenerated partial manifest can then be recombined without restarting the full workflow.
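Conceptually, the aggregation phase reduces to merging per-architecture JSON fragments into one manifest. A hedged sketch follows; the field names mirror the `sample_entry` shown in the test diff above, but the merge key and last-writer-wins conflict policy are assumptions about the design, not the actual `apply_partial_manifests.py` code.

```python
import json
from pathlib import Path

def merge_partials(partial_files: list[Path]) -> list[dict]:
    """Merge partial manifest entries, de-duplicating on (version, arch, filename)."""
    merged: dict[tuple, dict] = {}
    for path in partial_files:
        for entry in json.loads(path.read_text()):
            key = (entry["version"], entry["arch"], entry["filename"])
            merged[key] = entry  # later partials win on conflict (rerun case)
    return sorted(merged.values(), key=lambda e: (e["version"], e["arch"]))
```

Because the merge is keyed rather than append-only, re-running a single failed architecture and recombining its regenerated partial is idempotent, which is what makes partial recovery safe.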
Key Changes
Infrastructure & Security
- Retry logic in `dotnet-install.py` to handle transient network failures
- Trivy upgraded to `v0.68.2` with strict failure thresholds
- Changes to `sudo` usage

Partial Manifest Tooling
- `generate_partial_manifest.py`: Generates architecture-scoped partial manifests
- `apply_partial_manifests.py`: Merges partial manifests
- `backfill-manifests.yml`: Regenerates or fixes manifests for existing releases without rebuilding binaries

This prevents Trivy-generated assets from leaking into release metadata.
CI/CD Workflow Refactor
- Removed `git push` operations from matrix jobs

The pipeline now follows an Artifact → Aggregate → Commit model.
Technical Rationale
Pushing to `main` from within a matrix strategy caused race conditions and flaky failures. The new aggregation model eliminates these issues and allows partial recovery without full reruns.
Verification