
Conversation


@devabhishekpal devabhishekpal commented Jan 5, 2026

What changes were proposed in this pull request?

Create the Cluster Capacity page UI

Please describe your PR in detail:

  • This PR adds a Cluster Capacity page to the Recon UI.
  • The intention of this page is to give the end user a clear picture of the space-usage breakdown.
  • It also aims to alleviate some of the pain points around stuck deletions by showing more detail about pending deletions and the stages in which they are pending.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13183

How was this patch tested?

Patch was tested manually.

(Screenshots of the new Cluster Capacity page are attached to the PR.)


@devmadhuu devmadhuu left a comment


Thanks @devabhishekpal for the patch.

  1. I think we decided to make the datanode dropdown a combo of textbox and dropdown, so the user can also type to search within a long list if needed.
  2. We need an export option at least for the datanodes card.
  3. By default, datanodes in the dropdown should be populated in descending order of pending deletion size.
  4. We should gray out failed datanodes in the dropdown, as their pending deletion data is no longer useful to the user. This should be possible with CSS.
  5. How do we plan to handle the case where the API response contains duplicate nodes with the same name? Please check the API behavior when DNs with the same name join the cluster: the API response contains datanode names as well as UUIDs, and only the UUID is treated as unique.
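Point 5 above boils down to keying datanodes on UUID rather than hostname. A minimal sketch of that idea, assuming a hypothetical `DatanodeEntry` record (not the actual Recon response type):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical minimal model; the real Recon response type differs.
record DatanodeEntry(UUID uuid, String hostname, long pendingDeletionBytes) {}

public class DatanodeDedup {
  // Key entries by UUID so two datanodes that happen to share a hostname
  // stay distinct, while a re-reported UUID overwrites its stale entry.
  public static Map<UUID, DatanodeEntry> dedupeByUuid(List<DatanodeEntry> reported) {
    Map<UUID, DatanodeEntry> byUuid = new LinkedHashMap<>();
    for (DatanodeEntry e : reported) {
      byUuid.put(e.uuid(), e); // last report for a given UUID wins
    }
    return byUuid;
  }
}
```

With this, two DNs that join with the same hostname remain two entries, which is the behavior the comment asks to verify.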


@priyeshkaratha priyeshkaratha left a comment


Thanks @devabhishekpal for working on the patch.
I have two comments regarding the current UI implementation for showing DN pendingDeletion and the related refresh strategy. Please have a look at them.


@priyeshkaratha priyeshkaratha left a comment


Thanks @devabhishekpal for updating the patch. Overall changes LGTM

@devabhishekpal

@adoroszlai I have added the commons-csv library from Apache Commons.
It would be great if you could take a look as well, in case we have an existing library that serves the purpose.


@adoroszlai adoroszlai left a comment


Thanks @devabhishekpal for the heads-up.

I have added the commons-csv library from Apache Commons. It would be great if you could take a look as well, in case we have an existing library that serves the purpose.

The dependency check failure has instructions on what needs to be done when adding a new dependency.


@devmadhuu devmadhuu left a comment


Thanks @devabhishekpal for the patch. Kindly add unit and integration tests for the code, including the new API endpoint, as well as UI dev tests. And a small nit.


yandrey321 commented Jan 21, 2026

[Usability issue] It would be better to use a grid (plus pagination or infinite scrolling, and filtering/sorting/searching capabilities) with info for all datanodes. In the case of hundreds of datanodes, it would be hard to explore and pinpoint the datanodes that cause problems.


devabhishekpal commented Jan 21, 2026

[Usability issue] It would be better to use a grid (plus pagination or infinite scrolling, and filtering/sorting/searching capabilities) with info for all datanodes. In the case of hundreds of datanodes, it would be hard to explore and pinpoint the datanodes that cause problems.

Thanks @yandrey321, you are correct that pagination would help. However, since this list highlights datanodes that might be stuck in the deletion state, we decided not to show all DNs; instead we show the top 10 datanodes sorted by pending deletion size, i.e. the first DN has the largest pending deletion size, and so on.
This assumes the user is concerned with where deletion is stuck in order to debug further. There is also an option to download the list of all DNs and their pending deletion sizes as a CSV, so they can process that raw data as they see fit.

@devabhishekpal

@yandrey321 Thanks for the inputs. I reflected on this and added a tooltip to better convey this information to the end user, i.e. that the list only contains the top 15 DNs and that the full data can be downloaded as a CSV file.

@priyeshkaratha

Thanks @devabhishekpal for improving the patch. The changes look fine to me. Please check the CI failures.


@priyeshkaratha priyeshkaratha left a comment


Thanks @devabhishekpal for fixing the CI issues. LGTM

```diff
   }

-  public DataNodeMetricsServiceResponse getCollectedMetrics() {
+  public DataNodeMetricsServiceResponse getCollectedMetrics(Integer limit) {
```


@devmadhuu I was thinking about whether we should implement sorting and fetching the top X results in the backend API, or move this logic to the frontend. Since we rely on JMX, we have to fetch all the data from JMX before we can perform any sorting. Because of this, applying sorting or limits in the backend does not provide any performance benefit; it only reduces the response size.

Given this, would it make more sense to move the sorting logic (based on pending deletion data) to the frontend?



But if we are not sure of the worst-case number of DNs, this might cause issues with sorting and slicing a very large dataset in the browser. If it is in the hundreds or thousands it may still perform well, but if this data is expanded to include other properties and there are thousands of nodes, I think the browser might not handle it properly.

Perhaps we can use something like infinite scroll instead of a limit, but for sorting I'd prefer that it is done in the backend itself.
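A backend-side sort-and-limit matching the `getCollectedMetrics(Integer limit)` signature from the diff could look like this sketch (the `DatanodeEntry` record and its field names are hypothetical, not the actual Recon types):

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical minimal model of one datanode's collected metrics.
record DatanodeEntry(String hostname, long pendingDeletionBytes) {}

public class TopPendingDeletes {
  // Sort descending by pending deletion size, then truncate to `limit`.
  // A null limit returns everything, mirroring the optional parameter.
  public static List<DatanodeEntry> topN(List<DatanodeEntry> all, Integer limit) {
    List<DatanodeEntry> sorted = all.stream()
        .sorted(Comparator.comparingLong(DatanodeEntry::pendingDeletionBytes).reversed())
        .toList();
    return (limit == null || limit >= sorted.size()) ? sorted : sorted.subList(0, limit);
  }
}
```

As noted above, this still materializes every JMX result before sorting, so it only shrinks the response size rather than the server-side work.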



Ok



@priyeshkaratha yes, it may have performance issues if the number of DNs is high and some DNs respond slowly, and 10 minutes is quite a high timeout here, because we wait for all futures to complete; in the worst case it can take the full 10 minutes for all futures to complete and return the pending-blocks data.

Since we display only the top 10 or top 20 datanodes sorted by size (descending), collectMetrics could instead be implemented so that, rather than waiting for all datanodes' futures to complete, each result is pushed into a queue as it arrives. Another thread (any client program that needs this data, such as the API endpoint here) could then start picking results from that queue and use a priority queue to keep only the top 10 or 20. That way the memory footprint stays small: since the UI displays only the top 10 or 20, the backend can keep just that many entries and discard the rest. The UI should then have dynamic scroll logic that sends a request to the backend if the user goes beyond the top 10 or 20.

But we can add this logic later, and for now assume that even on large clusters with 1000+ datanodes, the JMX responses from all datanodes come back quickly, or at worst within the 10-minute timeout.
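The bounded-priority-queue consumer described above could be sketched roughly as follows; the `DnResult` record and `topN` helper are hypothetical names, and a real implementation would read from a concurrent queue fed by the completing futures rather than a plain iterator:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical minimal model of one datanode's JMX result.
record DnResult(String hostname, long pendingDeletionBytes) {}

public class StreamingTopN {
  // Consume results as they arrive (e.g. from a queue fed by completed
  // futures) while keeping only the current top n in a bounded min-heap,
  // so memory stays O(n) regardless of cluster size.
  public static List<DnResult> topN(Iterator<DnResult> results, int n) {
    PriorityQueue<DnResult> heap =
        new PriorityQueue<>(Comparator.comparingLong(DnResult::pendingDeletionBytes));
    while (results.hasNext()) {
      heap.offer(results.next());
      if (heap.size() > n) {
        heap.poll(); // discard the smallest; only the top n matter
      }
    }
    // Return the survivors in descending order for display.
    List<DnResult> top = new ArrayList<>(heap);
    top.sort(Comparator.comparingLong(DnResult::pendingDeletionBytes).reversed());
    return top;
  }
}
```

Unlike a full sort, this never holds more than n entries, which is the memory-footprint point made in the comment.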


@devmadhuu devmadhuu left a comment


Thanks @devabhishekpal for updating the patch. The changes largely LGTM, +1. I left a comment on an approach we can take in the future, but I still insist on doing some testing on a real cluster with 1000+ DNs where some DNs could be running slow. IMO you can write an integration test that introduces some delay in the JMX response (by adding a sleep), observe how your API endpoint responds, and from that predict the UI behavior for the user. Basically, it should not be a long wait where the UI simply gets stuck for up to 10 minutes.

@devmadhuu devmadhuu self-requested a review January 23, 2026 07:20

@ArafatKhan2198 ArafatKhan2198 left a comment


LGTM +1

@devabhishekpal

Thanks for the reviews and inputs @priyeshkaratha @devmadhuu @yandrey321 @adoroszlai @ArafatKhan2198 .

Merging this PR

@devabhishekpal devabhishekpal merged commit 7a791ad into apache:master Jan 23, 2026
44 checks passed
