Conversation
e869e31 to
18232af
Compare
18232af to
b0d6728
Compare
config/configfile.yaml
Outdated
| genome/all-time: | ||
| samples: | ||
| all-time: | ||
| <<: [*subsample_genome, *subsample_all-time, *subsample_defaults] |
There was a problem hiding this comment.
In general, I'm not sure this pattern of YAML anchors/aliases is very user friendly. If we do use anchors/aliases, then we should probably disable them in the dumped config so that users can at least see the fully expanded config in results/run_config.yaml (see suggested workaround in yaml/pyyaml#103).
I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu. These could be expanded at start of Snakemake (or whatever comes of discussion in nextstrain/public#23).
There was a problem hiding this comment.
If we do use anchors/aliases, then we should probably disable them in the dumped config
Yes, agreed. (I'd go further: we should write out small-multiples with lots of duplication, but that's beyond this PR.)
I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu.
Across the 9 builds almost all of the difference is the subsampling parameters, so at some level it is going to be either complex or verbose. I don't think glob-like syntax will help here with the structure as it is now, but it could if you moved towards something more like:
subsample:
samples:
min_length:
"genome/*": 10_000
"G/*": 600
"F/*": 1_200
group_by: "year country"
min_coverage: 0.3
resolutions:
"*/all-time": {"min_date": "1795-01-01"}
"*/6y": {"min_date": "6Y", "background_min_date": "1975-01-01"}
"*/3y": {"min_date": "3Y", "background_min_date": "1975-01-01"}There was a problem hiding this comment.
I also don't think we should use YAML anchors and aliases, and would like to decide on a good alternative in nextstrain/public#27. I'll consider something like the above with config pre-processing to translate into augur subsample-ready config.
There was a problem hiding this comment.
For my ability to follow this thread, the reason against using YAML anchors and aliases:
- not sure if the pattern of YAML anchors/aliases is very user friendly (for which users? A google search is returning several articles recommending YAML anchors (example) to avoid YAML duplication, but maybe these are not our user group or feel free to add comments from WA-DOH/others etc.)
- requires resolving the anchors explicitly in the dumped config (similar lift as
config pre-processing to translate into augur subsample-ready config)
Feel free to add to the above list, mostly looking for a summary statement
There was a problem hiding this comment.
not sure if the pattern of YAML anchors/aliases is very user friendly
I don't think we're advising against anchors per-se -- they can be very useful! -- rather I think the pushback was against the specific usage of anchors in this config. (The pushback wasn't from me, so I won't elaborate.)
requires resolving the anchors explicitly in the dumped config
Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.
There was a problem hiding this comment.
Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.
I got this wrong (see nextstrain/shared#62), at least partially. Anchors of primitive types won't be preserved, but anchors of more complex types are preserved. E.g. the following YAML
foo: &foo
a: 1
b: 2
bar: *foo
x: &x 42
y: *xwhen roundtripped through yaml.dump(yaml.load(...)) will keep one of the anchors and drop the other:
bar: &id001
a: 1
b: 2
foo: *id001
x: 42
y: 42|
I'll wait for a decision in nextstrain/public#27 before continuing here. |
b0d6728 to
5ee2efa
Compare
3340439 to
7dd63f2
Compare
5ee2efa to
5490644
Compare
5490644 to
42aa5f6
Compare
e243d2f to
0b22185
Compare
Preparing to use build_dir in config.smk, and I figured it'd be good to move both auspice_dir with it.
The same logic can be represented as a query directly in augur filter, avoiding the extra step in the workflow.
Similar to "Add separate frequencies config" (0b22185), this rule shouldn't rely on config from another rule.
The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration. To keep the workflow config schema concise, we generate each augur subsample config dynamically using a matrix defined in the config and merging code in config.smk. This is a breaking change and the old configuration will no longer work.
42aa5f6 to
b437074
Compare
Description of proposed changes
This PR contains 3 prep commits + 1 main commit. Message from main commit:
The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration.
To keep the workflow config schema concise, we generate each augur subsample config dynamically using a matrix defined in the config and merging code in config.smk.
This is a breaking change and the old configuration will no longer work.
Related issue(s)
Closes #101
Checklist
Update changelog