HDDS-13183. Create Cluster Capacity page UI. #9584
Conversation
(cherry picked from commit cc5cf02)
devmadhuu
left a comment
Thanks @devabhishekpal for the patch.
- I think we decided to make the datanode dropdown a combo of textbox and dropdown, so the user can also type to search within a long list if needed.
- We need an export option for the datanodes card at least.
- By default, datanodes in the dropdown need to be populated in descending order of pending deletion size.
- We should gray out failed datanodes in the dropdown, since their pending deletion data is no longer useful to display. This should be possible using CSS.
- How are we planning to handle the case when the API response has duplicate nodes with the same name? Please check the API behavior when datanodes with the same name join the cluster: the response contains both datanode names and UUIDs, and only the UUID is treated as unique.
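On the duplicate-name point, a minimal sketch of deduplicating by UUID and disambiguating display names when two distinct nodes share a hostname. The `DatanodeInfo` record and its field names are hypothetical, not the actual Recon model:

```java
import java.util.*;
import java.util.stream.*;

public class DatanodeDedup {
  // Hypothetical shape of one entry in the pending-deletion response.
  record DatanodeInfo(String host, String uuid, long pendingDeletionBytes) {}

  // Collapse duplicate rows by UUID, the only truly unique key.
  static Map<String, DatanodeInfo> dedupeByUuid(List<DatanodeInfo> nodes) {
    Map<String, DatanodeInfo> byUuid = new LinkedHashMap<>();
    for (DatanodeInfo n : nodes) {
      byUuid.putIfAbsent(n.uuid(), n); // same UUID => same node, keep first
    }
    return byUuid;
  }

  // Append a short UUID prefix only when the hostname is ambiguous,
  // so the dropdown label stays unique for re-joined nodes.
  static String displayName(DatanodeInfo node, Map<String, Long> hostCounts) {
    return hostCounts.get(node.host()) > 1
        ? node.host() + " (" + node.uuid().substring(0, 8) + ")"
        : node.host();
  }

  public static void main(String[] args) {
    List<DatanodeInfo> raw = List.of(
        new DatanodeInfo("dn1.example.com", "aaaaaaaa-1111", 100L),
        new DatanodeInfo("dn1.example.com", "aaaaaaaa-1111", 100L), // duplicate row
        new DatanodeInfo("dn1.example.com", "bbbbbbbb-2222", 50L)); // re-joined node, same name
    Map<String, DatanodeInfo> unique = dedupeByUuid(raw);
    Map<String, Long> hostCounts = unique.values().stream()
        .collect(Collectors.groupingBy(DatanodeInfo::host, Collectors.counting()));
    unique.values().forEach(n -> System.out.println(displayName(n, hostCounts)));
  }
}
```

This keeps the first row per UUID and only decorates names that actually collide, so the common single-node-per-host case stays clean.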
hadoop-ozone/recon/src/main/resources/webapps/recon/ozone-recon-web/api/db.json
priyeshkaratha
left a comment
Thanks @devabhishekpal for working on the patch.
I have two comments regarding the current UI implementation for showing DN pendingDeletion and the related refresh strategy. Please have a look at them.
...ne/recon/src/main/resources/webapps/recon/ozone-recon-web/src/v2/pages/capacity/capacity.tsx
...ne/recon/src/main/resources/webapps/recon/ozone-recon-web/src/v2/pages/capacity/capacity.tsx
priyeshkaratha
left a comment
Thanks @devabhishekpal for updating the patch. Overall, the changes LGTM.
@adoroszlai I have added the commons-csv library from Apache Commons.
adoroszlai
left a comment
Thanks @devabhishekpal for the heads-up.

> I have added the commons-csv library from apache commons. It would be great if you could take a look as well in case we have any existing library which serves the purpose.

The dependency check failure has instructions on what needs to be done when adding a new one.
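For context on what the new dependency buys us, here is a stdlib-only sketch of the RFC 4180-style field quoting that Commons CSV handles automatically (along with headers, configurable delimiters, and escaping edge cases); the field names are illustrative:

```java
public class CsvEscape {
  // Minimal RFC 4180-style quoting: wrap fields containing a comma,
  // quote, or newline in double quotes, doubling any embedded quotes.
  static String escape(String field) {
    if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
      return "\"" + field.replace("\"", "\"\"") + "\"";
    }
    return field;
  }

  // Join escaped fields into one CSV row.
  static String toCsvRow(String... fields) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(escape(fields[i]));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toCsvRow("datanode", "uuid", "pendingDeletionBytes"));
    System.out.println(toCsvRow("dn1,rack\"a\"", "aaaa-1111", "1048576"));
  }
}
```

Hand-rolling this is easy to get subtly wrong (embedded newlines, quote doubling), which is a reasonable argument for pulling in a vetted library instead.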
devmadhuu
left a comment
Thanks @devabhishekpal for the patch. Kindly add unit and integration tests for the code, including the new API endpoint, as well as UI dev tests. Also a small nit.
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/PendingDeletionEndpoint.java
[Usability issue] It's better to use a grid (plus pagination or infinite scrolling, and filtering/sorting/searching capabilities) with info for all datanodes. In the case of hundreds of datanodes it would be hard to explore and pinpoint the datanodes that cause problems.
Thanks @yandrey321, you are correct that we could have used pagination. However, since this list shows datanodes that might be stuck in a deletion state, we decided here to show only the top 10 datanodes sorted by pending deletion size, i.e. the first DN has the largest pending deletion size, and so on.
@yandrey321 Thanks for the inputs. I reflected on this and added a tooltip to better convey this to the end user, i.e. the list only contains the top 15 DNs, and the whole information can be downloaded as a CSV file.
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/DataNodeMetricsService.java
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/api/PendingDeletionEndpoint.java
Thanks @devabhishekpal for improving the patch. The changes look fine to me. Please check the CI failures.
priyeshkaratha
left a comment
Thanks @devabhishekpal for fixing ci issues. LGTM
```diff
   }

-  public DataNodeMetricsServiceResponse getCollectedMetrics() {
+  public DataNodeMetricsServiceResponse getCollectedMetrics(Integer limit) {
```
@devmadhuu I was thinking about whether we should implement sorting and fetching the top X results in the backend API, or move this logic to the frontend. Since we rely on JMX, we have to fetch all the data from JMX before we can perform any sorting. Because of this, applying sorting or limits in the backend does not provide any performance benefit; it only reduces the response size.
Given this, would it make more sense to move the sorting logic (based on pending deletion data) to the frontend?
But if we are not sure of the worst-case number of DNs, this might cause issues with sorting and slicing a very large dataset. If it's in the hundreds or thousands it may still perform well, but if this data is expanded to include other properties across thousands of nodes, I think the browser might not be able to handle it properly.
Perhaps we can use something like infinite scroll instead of a limit, but for sorting I'd prefer that it is done in the backend itself.
Ok
@priyeshkaratha yes, it may have performance issues if the number of DNs is high and some DNs respond slowly; 10 min is quite a high timeout here, because we wait for all futures to complete, so in the worst case it can take the full 10 minutes for all futures to finish and return the pending blocks data. Since we display only the top 10 or top 20 datanodes sorted by size (descending), we could have implemented collectMetrics so that instead of waiting for all datanodes, each completed future pushes its result onto a queue, and another thread (any client program that needs this data, such as the API endpoint here) picks results off that queue into a priority queue that keeps only the top 10 or 20 and discards the rest. That way the memory footprint stays small, since the UI displays only the top entries. The UI should then have dynamic scroll logic that sends a request to the backend if the user goes beyond the top 10 or 20. But we can take this up later, and for now assume that in large clusters with 1000+ datanodes, the JMX responses from all datanodes either come quickly or, at worst, take 10 minutes.
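The queue-plus-priority-queue idea above can be sketched as a bounded min-heap over streamed results, keeping O(k) memory instead of buffering every datanode's response. The `DnPending` record is a hypothetical result shape, not the actual Recon class:

```java
import java.util.*;

public class TopKPendingDeletion {
  // Hypothetical per-datanode result as futures complete.
  record DnPending(String uuid, long pendingBytes) {}

  // Keep only the top-k datanodes by pending deletion size as results
  // stream in: a min-heap of size k gives O(n log k) time, O(k) memory.
  static List<DnPending> topK(Iterable<DnPending> results, int k) {
    PriorityQueue<DnPending> heap =
        new PriorityQueue<>(Comparator.comparingLong(DnPending::pendingBytes));
    for (DnPending r : results) {
      heap.offer(r);
      if (heap.size() > k) {
        heap.poll(); // discard the current smallest
      }
    }
    List<DnPending> top = new ArrayList<>(heap);
    top.sort(Comparator.comparingLong(DnPending::pendingBytes).reversed());
    return top; // descending by pending deletion size
  }

  public static void main(String[] args) {
    List<DnPending> all = List.of(
        new DnPending("a", 10), new DnPending("b", 500),
        new DnPending("c", 300), new DnPending("d", 50));
    System.out.println(topK(all, 2)); // top 2 by size: b, then c
  }
}
```

In the threaded design described above, the consumer thread would feed completed-future results into this heap as they arrive, so the endpoint never materializes the full cluster's response.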
devmadhuu
left a comment
Thanks @devabhishekpal for updating the patch. Largely the changes LGTM, +1. One comment on the approach, which we can do in future, but I still insist on some testing on a real cluster with 1000+ DNs where some DNs could be running slow. IMO you can write an integration test that introduces a delay in the JMX response (e.g. by adding a sleep) and observe how your API endpoint responds; from that you can predict what the UI behavior will be for the user. Basically it should not be a long wait where the UI simply gets stuck for up to 10 minutes.
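One way to avoid the worst-case 10-minute stall discussed here is a per-node deadline with a fallback, so slow datanodes are skipped rather than blocking the whole response. A minimal sketch using `CompletableFuture.orTimeout` (Java 9+); the `fetchJmx` stub with its sleep stands in for the real JMX call and also shows how a test could simulate a slow datanode:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class BoundedJmxCollect {
  // Give each per-datanode fetch its own short deadline and fall back
  // to an empty result for slow or unreachable nodes, so one laggard
  // cannot stall the whole endpoint for the full global timeout.
  static List<String> collect(List<String> nodes, long perNodeTimeoutMs) {
    List<CompletableFuture<String>> futures = nodes.stream()
        .map(n -> CompletableFuture.supplyAsync(() -> fetchJmx(n))
            .orTimeout(perNodeTimeoutMs, TimeUnit.MILLISECONDS)
            .exceptionally(e -> null)) // slow node => skip, don't fail all
        .collect(Collectors.toList());
    return futures.stream()
        .map(CompletableFuture::join)
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
  }

  // Stub for the real JMX call; "slow-dn" simulates a laggard, which is
  // also how an integration test could inject the delay devmadhuu suggests.
  static String fetchJmx(String node) {
    if (node.equals("slow-dn")) {
      try {
        Thread.sleep(5_000);
      } catch (InterruptedException ignored) {
      }
    }
    return node + ":ok";
  }

  public static void main(String[] args) {
    // slow-dn is dropped after its 500 ms deadline; dn1 and dn2 survive.
    System.out.println(collect(List.of("dn1", "slow-dn", "dn2"), 500));
  }
}
```

The trade-off is partial results under load, which seems acceptable for a monitoring view; the UI could flag nodes that timed out instead of silently omitting them.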
ArafatKhan2198
left a comment
LGTM +1
Thanks for the reviews and inputs @priyeshkaratha @devmadhuu @yandrey321 @adoroszlai @ArafatKhan2198. Merging this PR.
What changes were proposed in this pull request?
Create the Cluster Capacity page UI
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13183
How was this patch tested?
Patch was tested manually.