A Light Discussion about Dataset Choices for URL (at least)
Besides a small subset of (m)C4, I would prefer to pick datasets that lie at the intersection of what the metadata (URL at least), promptsource, and evaluation WGs use.
For each of the two WGs other than ours (metadata):
- From evaluation
  - GEM from the eval WG, specifically
- From promptsource