Skip to content

fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485

Open
ruben1294 wants to merge 3 commits intoWrightonLabCSU:devfrom
ruben1294:fix/format-kegg-database-pipeline
Open

fix: six bugs that break FORMAT_KEGG_DB and KEGG annotation pipeline#485
ruben1294 wants to merge 3 commits intoWrightonLabCSU:devfrom
ruben1294:fix/format-kegg-database-pipeline

Conversation

@ruben1294
Copy link

Summary

This PR fixes six bugs found while running the KEGG database formatting
and annotation pipeline (FORMAT_KEGG_DB process). Together they cause
the pipeline to fail when format_kegg = true or produce
silently wrong results. All fixes were validated end-to-end on an HPC
cluster running Nextflow with Docker.


Bug 1 — format_kegg_database.py: top-level scikit-bio import crashes containers without it

File: bin/format_kegg_database.py

from skbio import ... was imported at module level. Any container that does not have scikit-bio installed raises ImportError on startup, before any real work is done. scikit-bio is only needed when a gene_ko_link file is actually processed, so the imports are now moved into the two functions that require them: process_kegg and generate_modified_kegg_fasta.


Bug 2 — format_kegg_database.py: MMseqs2 DB written as kegg.<date>.mmsdb, expected as kegg.mmsdb

File: bin/format_kegg_database.py

process_kegg() previously wrote the MMseqs2 database as:

kegg_mmseqs_db = path.join(output_dir, "kegg.%s.mmsdb" % download_date)
# e.g. kegg/kegg.2018-12-30.mmsdb

However, modules/local/annotate/mmseqs_search.nf symlinks the database directory and then references the database as ${db_name}.mmsdb (where db_name is the directory name "kegg"), so it expects exactly kegg/kegg.mmsdb. The date suffix caused a silent "file not found" error at annotation time.

This is fixed by removing the date suffix:

kegg_mmseqs_db = path.join(output_dir, "kegg.mmsdb")

Bug 3 — format_kegg_database.py: --skip_gene_ko_link argparse uses type=bool instead of action="store_true"

File: bin/format_kegg_database.py

type=bool does not work as a flag. argparse passes the literal string "False" to bool(), and bool("False") == True, so any invocation with --skip_gene_ko_link would receive the wrong value.

# Before (broken):
parser.add_argument("--skip_gene_ko_link", type=bool, default=False, ...)

# After (fixed):
parser.add_argument("--skip_gene_ko_link", action="store_true", ...)

Bug 4 — format_kegg_db.nf: container python_scikit-bio_scipy does not include mmseqs2

File: modules/local/database/format_kegg_db.nf

The FORMAT_KEGG_DB process called mmseqs createdb and mmseqs createindex
but ran inside python_scikit-bio_scipy, which does not ship mmseqs2.
This caused an immediate FileNotFoundError: mmseqs.

The container is now aligned with the rest of the pipeline and uses an image that includes mmseqs2:

// Before:
container "community.wave.seqera.io/library/python_scikit-bio_scipy:0f89a100e990daf2"

// After:
container "community.wave.seqera.io/library/python_pandas_hmmer_mmseqs2_pruned:d2c88b719ab1322c"

Bug 5 — format_kegg_db.nf: bash condition if [ ${skip_gene_ko_link} ] is always true

File: modules/local/database/format_kegg_db.nf

In bash, if [ "0" ] evaluates to true because any non-empty string is truthy. As a result, the process always took the --skip_gene_ko_link branch and silently ignored the gene_ko_link file even when it was provided.

This is fixed by comparing explicitly to the string "true" (see also Bug 6):

# Before:
if [ ${skip_gene_ko_link} ]; then

# After:
if [ "${skip_gene_ko_link}" = "true" ]; then

Bug 6 — dram.nf: skip_gene_ko_link passed as integer 1/0 instead of string "true"/"false"

File: workflows/dram.nf

The value passed to FORMAT_KEGG_DB was:

// Before:
skip_gene_ko_link = params.skip_gene_ko_link ? 1 : 0

This produces the strings "1" or "0" inside the bash script. As noted in Bug 5, if [ "0" ] is true in bash. The fix in format_kegg_db.nf requires the value to be exactly "true" or "false":

// After:
skip_gene_ko_link = params.skip_gene_ko_link ? "true" : "false"

Files changed

File Bugs fixed
bin/format_kegg_database.py #1 (lazy scikit-bio import), #2 (kegg.mmsdb name), #3 (argparse flag)
modules/local/database/format_kegg_db.nf #4 (container with mmseqs2), #5 (bash condition)
workflows/dram.nf #6 (boolean string value)

1. Move scikit-bio imports inside functions (lazy import)
   - `from skbio import ...` was at module level, causing ImportError in
     Docker containers that do not have scikit-bio installed (e.g.
     python_pandas_hmmer_mmseqs2_pruned). The import is only needed when
     a gene_ko_link file is actually processed, so it is now placed inside
     the two functions that use it: process_kegg() and
     generate_modified_kegg_fasta().

2. Fix MMseqs2 output database name (kegg.mmsdb, not kegg.<date>.mmsdb)
   - The database was written as kegg.<download_date>.mmsdb but
     modules/local/annotate/mmseqs_search.nf expects the file to be named
     exactly kegg.mmsdb (it constructs the path as ${db_name}.mmsdb where
     db_name is the parent directory name "kegg"). The date suffix caused a
     "No such file or directory" error at annotation time.

3. Fix --skip_gene_ko_link argparse definition
   - Using `type=bool` does NOT work as a flag: argparse passes the string
     "False" / "True" to bool(), and bool("False") == True. Replaced with
     `action="store_true"` so the flag behaves as intended.
1. Replace container that lacks mmseqs2
   - FORMAT_KEGG_DB used python_scikit-bio_scipy which does not include
     mmseqs2. The process calls mmseqs createdb / createindex, so it fails
     immediately with "No such file or directory: mmseqs". Replaced with
     python_pandas_hmmer_mmseqs2_pruned, which already carries mmseqs2 and
     is used by other annotation processes in the pipeline.

2. Fix bash condition for skip_gene_ko_link
   - The Nextflow value passed to the process is the string "0" or "1"
     (see dram.nf). In bash, `if [ "0" ]` evaluates to TRUE because any
     non-empty string is truthy. FORMAT_KEGG_DB therefore always ran the
     --skip_gene_ko_link branch, ignoring the gene_ko_link file.
     Fixed with an explicit string comparison:
       if [ "${skip_gene_ko_link}" = "true" ]
     (see companion fix in workflows/dram.nf)
The companion fix for the bash condition in format_kegg_db.nf requires
that skip_gene_ko_link be the string "true" or "false" rather than the
integer 1 or 0.

In bash:
  `if [ "0" ]`   -> true  (non-empty string)
  `if [ "false" ]` -> true  (still non-empty - also wrong)

The correct pattern used in format_kegg_db.nf is:
  `if [ "${skip_gene_ko_link}" = "true" ]`

which requires this value to be exactly the string "true" or "false".
Changed `params.skip_gene_ko_link ? 1 : 0` to
`params.skip_gene_ko_link ? "true" : "false"`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: To Sort

Development

Successfully merging this pull request may close these issues.

1 participant