-
Notifications
You must be signed in to change notification settings - Fork 50
(improvement)Optimize DCAware/RackAware/TokenAware RoundRobinPolicy with host distance caching #651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
This is interesting, my change has exposed this - Need to understand this better :-/ |
if host is None:
host = self._cluster.metadata.get_host_by_host_id(host_id)
if host and host.endpoint != endpoint:
log.debug("[control connection] Updating host ip from %s to %s for (%s)", host.endpoint, endpoint, host_id)
old_endpoint = host.endpoint
host.endpoint = endpoint
self._cluster.metadata.update_host(host, old_endpoint)
reconnector = host.get_and_set_reconnection_handler(None)
if reconnector:
reconnector.cancel()
self._cluster.on_down(host, is_host_addition=False, expect_host_to_be_down=True)So first we update the host with the new endpoint, then mark it as down? |
|
This fixes it for me: which also makes sense to me. |
76ee195 to
edab823
Compare
…tate Refactor `DCAwareRoundRobinPolicy` to simplify distance calculations and memory usage. Key changes: - Remove `_hosts_by_distance` and the complex caching of LOCAL hosts. - `distance()` now checks `host.datacenter` directly for LOCAL calculation, which is correct and static. - Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`. - Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes). - This removes the overhead of maintaining a redundant LOCAL cache and ensures correct behavior even if a local host is marked down. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
When a host changes its IP address, the driver previously updated the host endpoint to the new IP before calling on_down. This caused on_down to mistakenly target the new IP for connection cleanup. This change reorders the operations to ensure on_down cleans up the old IP's resources before the host object is updated and on_up is called for the new IP. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
|
I think CI failure is unrelated and is #359 |
edab823 to
1884f59
Compare
|
By using the (not amazing) benchmark from #653 , I got the following results: For master branch as a baseline: This branch (with just DC aware improvements): ** 433 -> 781 Kops/sec improvement ** With improvement to rack aware (on top of master), I got: ** 277 -> 324 Kops/sec improvement ** And on top of this branch: ** 277 -> 344 Kops/sec improvement ** And finally, for #650 : which kinda makes me suspect that branch is no good :-/ |
Sent separate PR - #654 |
|
With rack aware added (3rd commit), these are the current numbers: |
…distances Refactor `RackAwareRoundRobinPolicy` to simplify distance calculations and memory usage. Add self._remote_hosts to cache remote hosts distance, self._non_local_rack_hosts for non-local rack host distance. This improves the performance nicely, from ~290K query plans per second to ~600K query plans per second. - Only cache `_remote_hosts` to efficiently handle `used_hosts_per_remote_dc`. - Optimize control plane operations (`on_up`, `on_down`) to only rebuild the remote cache when necessary (when remote hosts change or local DC changes). Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
cc0204d to
6282e6f
Compare
Now that I also cache non-local hosts, not just remote (duh!), perf. is better: |
|
Added for TokenAware as well some optimization (need to improve commit message). So reasonable improvement, at least in this micro-benchmark. |
…y planning.
Optimize TokenAwarePolicy query plan generation
This patch significantly improves the performance of TokenAwarePolicy by
reducing overhead in list materialization and distance calculation.
Key changes:
1. Introduced `make_query_plan_with_exclusion()` to the LoadBalancingPolicy
interface.
- This allows a parent policy (like TokenAware) to request a plan from
a child policy while efficiently skipping a set of already-yielded
hosts (replicas).
- Implemented optimized versions in `DCAwareRoundRobinPolicy` and
`RackAwareRoundRobinPolicy`. These implementations integrate the
exclusion check directly into their generation loops, avoiding the
need for inefficient external filtering or full list materialization.
2. Optimized `TokenAwarePolicy.make_query_plan`:
- Removed list materialization of the child query plan.
- Replaced multiple passes over replicas (checking `child.distance`
each time) with a single pass that buckets replicas into local/remote
lists.
- Utilizes `make_query_plan_with_exclusion` to yield the remainder
of the plan.
- Added `__slots__` to reduce memory overhead and attribute access cost.
Performance Impact:
Benchmarks show query plan generation throughput increasing by approximately
4x for TokenAware configurations:
- TokenAware(DCAware): ~80 Kops/s -> ~355 Kops/s
- TokenAware(RackAware): ~75 Kops/s -> ~320 Kops/s
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
8f96d39 to
87c6a01
Compare
Refactor
DCAwareRoundRobinPolicyto use a Copy-On-Write (COW) strategy for managing host distances.Key changes:
_remote_hoststo cacheREMOTEhosts, enabling O(1) distance lookups during query planning for distance.IGNOREDhosts do not need to be stored in the cache.For 'LOCAL' we do a simple comparison.
_refresh_remote_hoststo handle node changes.This is a different attempt from #650 to add caching to host distance to make query planning faster.
I'm not sure code-wise it's less code, but it certainly does make sense to do it per policy, not just for the TokenAware? Unsure.
The 1st commit is for DCAwareRoundRobinPolicy. Once it passes CI, I have an additional commit for RackAware one, and perhaps more are needed later.
Pre-review checklist
./docs/source/.Fixes:annotations to PR description.