fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485
Open
ruben1294 wants to merge 3 commits intoWrightonLabCSU:devfrom
Open
fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485ruben1294 wants to merge 3 commits intoWrightonLabCSU:devfrom
ruben1294 wants to merge 3 commits intoWrightonLabCSU:devfrom
Conversation
1. Move scikit-bio imports inside functions (lazy import)
- `from skbio import ...` was at module level, causing ImportError in
Docker containers that do not have scikit-bio installed (e.g.
python_pandas_hmmer_mmseqs2_pruned). The import is only needed when
a gene_ko_link file is actually processed, so it is now placed inside
the two functions that use it: process_kegg() and
generate_modified_kegg_fasta().
2. Fix MMseqs2 output database name (kegg.mmsdb, not kegg.<date>.mmsdb)
- The database was written as kegg.<download_date>.mmsdb but
modules/local/annotate/mmseqs_search.nf expects the file to be named
exactly kegg.mmsdb (it constructs the path as ${db_name}.mmsdb where
db_name is the parent directory name "kegg"). The date suffix caused a
"No such file or directory" error at annotation time.
3. Fix --skip_gene_ko_link argparse definition
- Using `type=bool` does NOT work as a flag: argparse passes the string
"False" / "True" to bool(), and bool("False") == True. Replaced with
`action="store_true"` so the flag behaves as intended.
1. Replace container that lacks mmseqs2
- FORMAT_KEGG_DB used python_scikit-bio_scipy which does not include
mmseqs2. The process calls mmseqs createdb / createindex, so it fails
immediately with "No such file or directory: mmseqs". Replaced with
python_pandas_hmmer_mmseqs2_pruned, which already carries mmseqs2 and
is used by other annotation processes in the pipeline.
2. Fix bash condition for skip_gene_ko_link
- The Nextflow value passed to the process is the string "0" or "1"
(see dram.nf). In bash, `if [ "0" ]` evaluates to TRUE because any
non-empty string is truthy. FORMAT_KEGG_DB therefore always ran the
--skip_gene_ko_link branch, ignoring the gene_ko_link file.
Fixed with an explicit string comparison:
if [ "${skip_gene_ko_link}" = "true" ]
(see companion fix in workflows/dram.nf)
The companion fix for the bash condition in format_kegg_db.nf requires
that skip_gene_ko_link be the string "true" or "false" rather than the
integer 1 or 0.
In bash:
`if [ "0" ]` -> true (non-empty string)
`if [ "false" ]` -> true (still non-empty - also wrong)
The correct pattern used in format_kegg_db.nf is:
`if [ "${skip_gene_ko_link}" = "true" ]`
which requires this value to be exactly the string "true" or "false".
Changed `params.skip_gene_ko_link ? 1 : 0` to
`params.skip_gene_ko_link ? "true" : "false"`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR fixes six bugs found while running the KEGG database formatting
and annotation pipeline (
FORMAT_KEGG_DBprocess). Together they causethe pipeline to fail when
format_kegg = trueor producesilently wrong results. All fixes were validated end-to-end on an HPC
cluster running Nextflow with Docker.
Bug 1 —
format_kegg_database.py: top-levelscikit-bioimport crashes containers without itFile:
bin/format_kegg_database.pyfrom skbio import ...was imported at module level. Any container that does not have scikit-bio installed raisesImportErroron startup, before any real work is done. scikit-bio is only needed when agene_ko_linkfile is actually processed, so the imports are now moved into the two functions that require them:process_keggandgenerate_modified_kegg_fasta.Bug 2 —
format_kegg_database.py: MMseqs2 DB written askegg.<date>.mmsdb, expected askegg.mmsdbFile:
bin/format_kegg_database.pyprocess_kegg()previously wrote the MMseqs2 database as:However,
modules/local/annotate/mmseqs_search.nfsymlinks the database directory and then references the database as${db_name}.mmsdb(wheredb_nameis the directory name"kegg"), so it expects exactlykegg/kegg.mmsdb. The date suffix caused a silent "file not found" error at annotation time.This is fixed by removing the date suffix:
Bug 3 —
format_kegg_database.py:--skip_gene_ko_linkargparse usestype=boolinstead ofaction="store_true"File:
bin/format_kegg_database.pytype=booldoes not work as a flag. argparse passes the literal string"False"tobool(), andbool("False") == True, so any invocation with--skip_gene_ko_linkwould receive the wrong value.Bug 4 —
format_kegg_db.nf: containerpython_scikit-bio_scipydoes not includemmseqs2File:
modules/local/database/format_kegg_db.nfThe
FORMAT_KEGG_DBprocess calledmmseqs createdbandmmseqs createindexbut ran inside
python_scikit-bio_scipy, which does not shipmmseqs2.This caused an immediate
FileNotFoundError: mmseqs.The container is now aligned with the rest of the pipeline and uses an image that includes
mmseqs2:Bug 5 —
format_kegg_db.nf: bash conditionif [ ${skip_gene_ko_link} ]is always trueFile:
modules/local/database/format_kegg_db.nfIn bash,
if [ "0" ]evaluates to true because any non-empty string is truthy. As a result, the process always took the--skip_gene_ko_linkbranch and silently ignored thegene_ko_linkfile even when it was provided.This is fixed by comparing explicitly to the string
"true"(see also Bug 6):Bug 6 —
dram.nf:skip_gene_ko_linkpassed as integer1/0instead of string"true"/"false"File:
workflows/dram.nfThe value passed to
FORMAT_KEGG_DBwas:This produces the strings
"1"or"0"inside the bash script. As noted in Bug 5,if [ "0" ]is true in bash. The fix informat_kegg_db.nfrequires the value to be exactly"true"or"false":Files changed
bin/format_kegg_database.pykegg.mmsdbname), #3 (argparse flag)modules/local/database/format_kegg_db.nfworkflows/dram.nf