Consider a dataset where observations are grouped into discrete partitions, e.g., clusters or factors. We would like to sample a subset of observations for further analysis, typically for time-consuming steps where the full dataset would be too large. In doing so, we would also like to preserve the distribution of observations across partitions within the subset. This improves the relevance of the subset's results when they are extrapolated to the full dataset.
partisub implements a simple algorithm for subsampling a dataset with user-defined partitions. It subsamples observations within each partition to minimize the effect of sampling noise on the relative frequencies of partitions in the subset. Optionally, it can also force partitions to be represented by at least one observation. This is useful for guaranteeing the presence of low-frequency partitions, e.g., rare cell types in single-cell applications.
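To make the idea of preserving relative frequencies concrete, the following is a minimal sketch of proportional allocation with largest-remainder rounding. It is not taken from partisub and may differ from what the library actually does; `allocate_quota` and its parameters are hypothetical names chosen for illustration. Given the size of each partition and a target subset size, it returns how many observations to keep from each partition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of partisub: choose per-partition quotas so
// that the subset's partition frequencies track those of the full dataset as
// closely as integer counts allow.
std::vector<std::size_t> allocate_quota(
    const std::vector<std::size_t>& sizes, // number of observations in each partition
    std::size_t target,                    // desired size of the subset
    bool force_non_empty)                  // keep at least one observation per non-empty partition
{
    const std::size_t total = std::accumulate(sizes.begin(), sizes.end(), std::size_t(0));
    assert(target <= total);
    std::vector<std::size_t> quota(sizes.size(), 0);
    if (total == 0) {
        return quota;
    }

    // Start from the floor of each partition's proportional share.
    // (For very large inputs, sizes[i] * target could overflow; ignored in this sketch.)
    std::vector<std::size_t> remainder(sizes.size());
    std::size_t assigned = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        const std::size_t product = sizes[i] * target;
        quota[i] = product / total;
        remainder[i] = product % total;
        assigned += quota[i];
    }

    // Hand the leftover slots to the partitions with the largest remainders.
    std::vector<std::size_t> order(sizes.size());
    std::iota(order.begin(), order.end(), std::size_t(0));
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return remainder[a] > remainder[b];
    });
    for (std::size_t k = 0; assigned < target; ++k, ++assigned) {
        ++quota[order[k]];
    }

    // Optionally promote empty quotas to 1, taking the slot from the partition
    // with the largest quota so that the total stays at the target when possible.
    if (force_non_empty) {
        for (std::size_t i = 0; i < sizes.size(); ++i) {
            if (sizes[i] > 0 && quota[i] == 0) {
                quota[i] = 1;
                auto largest = std::max_element(quota.begin(), quota.end());
                if (*largest > 1) {
                    --(*largest);
                }
            }
        }
    }
    return quota;
}
```

With partition sizes of 500, 499 and 1 and a target of 100, this sketch allocates 50, 50 and 0 observations without forcing, and gives the singleton partition one observation when `force_non_empty` is set. Actually drawing the observations within each partition is a separate step that is not shown here.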
#include "partisub/partisub.hpp"
#include <algorithm>
#include <vector>
int nobs = 1000;
std::vector<int> clusters(nobs); // or some type of partition.
std::fill(clusters.begin() + 500, clusters.end(), 1);
// Subsampling to 100 observations.
auto selected = partisub::compute(nobs, clusters.data(), 100, {});
// Each partition will be represented by default, even if it is rare:
clusters[0] = 2;
auto selected2 = partisub::compute(nobs, clusters.data(), 100, {});

Check out the reference documentation for more details.
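One way to check that a subset preserves the partition frequencies is to tabulate them before and after subsampling. The sketch below assumes that the selection is available as a vector of observation indices; the exact return type of `partisub::compute` is described in the reference documentation and is not reproduced here. `partition_fractions` is a hypothetical helper, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper, not part of partisub: fraction of the observations
// listed in `indices` that fall into each partition.
std::map<int, double> partition_fractions(
    const std::vector<int>& partition,       // partition assignment of every observation
    const std::vector<std::size_t>& indices) // observations to tabulate
{
    std::map<int, double> fractions;
    if (indices.empty()) {
        return fractions;
    }
    // Count the observations in each partition.
    for (std::size_t i : indices) {
        assert(i < partition.size());
        fractions[partition[i]] += 1.0;
    }
    // Convert the counts to fractions of the tabulated observations.
    for (auto& entry : fractions) {
        entry.second /= static_cast<double>(indices.size());
    }
    return fractions;
}
```

Calling this once with all indices and once with the selected indices gives two tables that can be compared partition by partition.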
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
include(FetchContent)
FetchContent_Declare(
partisub
GIT_REPOSITORY https://github.com/libscran/partisub
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(partisub)

Then you can link to partisub to make the headers available during compilation:
# For executables:
target_link_libraries(myexe libscran::partisub)
# For libraries:
target_link_libraries(mylib INTERFACE libscran::partisub)

By default, this will use FetchContent to fetch all external dependencies.
Applications should consider pinning versions of all dependencies - see extern/CMakeLists.txt for suggested versions.
If you want to install them manually, use -DPARTISUB_FETCH_EXTERN=OFF.
Alternatively, if partisub is already installed on the system, find_package() can locate it:

find_package(libscran_partisub CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::partisub)

To install the library, use:
mkdir build && cd build
cmake .. -DPARTISUB_TESTS=OFF
cmake --build . --target install

Again, this will use FetchContent to retrieve dependencies; see the comments above.
If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I.
This also requires the external dependencies listed in extern/CMakeLists.txt.