HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. #6318

Merged
merged 11 commits into from
Apr 16, 2024

Contributor

@ArafatKhan2198 ArafatKhan2198 commented Mar 2, 2024

What changes were proposed in this pull request?

This pull request enhances the Recon disk usage endpoint to improve usability and performance on large datasets:

  • Top Entities Focus: The endpoint now sorts sub-paths by size and returns only the largest ones. This helps users identify the biggest space consumers, since visualizing thousands of records in a single view is impractical.
  • Efficient Sorting with Parallel Streams: To sort large numbers of records, the sort runs on a parallel stream.
  • Key advantages of using parallel streams:
    1. Better Utilization of Multi-core Processors: Sorting runs concurrently across multiple cores, which can substantially cut processing time for large datasets.
    2. Optimized for Large Datasets: The fixed overhead of parallelism is amortized over a large number of elements, which suits this use case.
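
As an illustration only, the top-N selection described above could be sketched as follows. This is not the code from the patch: the class `TopNBySize`, the `Entry` type, and the `topN` method are hypothetical names chosen for the example, and the sample paths are borrowed from the manual test in this description.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class TopNBySize {
  /** Hypothetical stand-in for one sub-path entry of the disk usage response. */
  public static final class Entry {
    public final String path;
    public final long size;

    public Entry(String path, long size) {
      this.path = path;
      this.size = size;
    }
  }

  /** Returns at most {@code limit} entries, largest first, sorted on a parallel stream. */
  public static List<Entry> topN(List<Entry> entries, int limit) {
    return entries.parallelStream()
        // Descending by size; sorted() yields an ordered stream, so limit() keeps the largest.
        .sorted(Comparator.comparingLong((Entry e) -> e.size).reversed())
        .limit(limit)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Entry> entries = Arrays.asList(
        new Entry("/volumetest/buckettest/dir1/key10kb", 10_000L),
        new Entry("/volumetest/buckettest/dir1/key100MB", 100_000_000L),
        new Entry("/volumetest/buckettest/dir1/key1MB", 1_000_000L),
        new Entry("/volumetest/buckettest/dir1/key10mb", 10_000_000L));
    // Print the two largest entries.
    for (Entry e : topN(entries, 2)) {
      System.out.println(e.path + " " + e.size);
    }
  }
}
```

The sketch sorts the whole list and then truncates it, which matches the approach described here; whether a parallel stream actually helps depends on the input size and the hardware.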

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10452

How was this patch tested?

Tested the API manually and with integration tests.

Results from manual testing:

  • Created 4 files of 100 MB, 10 MB, 1 MB, and 10 KB under dir1
{
  "status": "OK",
  "path": "/volumetest/buckettest/dir1",
  "size": 111010000,
  "sizeWithReplica": -1,
  "subPathCount": 4,
  "subPaths": [
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key100MB",
      "size": 100000000,
      "sizeWithReplica": -1,
      "isKey": true
    },
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key10mb",
      "size": 10000000,
      "sizeWithReplica": -1,
      "isKey": true
    },
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key1MB",
      "size": 1000000,
      "sizeWithReplica": -1,
      "isKey": true
    },
    {
      "key": true,
      "path": "/volumetest/buckettest/dir1/key10kb",
      "size": 10000,
      "sizeWithReplica": -1,
      "isKey": true
    }
  ],
  "sizeDirectKey": 111010000
}

@ArafatKhan2198 ArafatKhan2198 marked this pull request as ready for review March 4, 2024 08:29
@ArafatKhan2198 ArafatKhan2198 marked this pull request as draft March 4, 2024 08:29
@ArafatKhan2198 ArafatKhan2198 marked this pull request as ready for review March 4, 2024 08:33
@SaketaChalamchala
Contributor

@devmadhuu and @dombizita could you please take a look?

@smitajoshi12
Contributor

@ArafatKhan2198
Arafat, can you set a limit on the API?

Contributor

@devmadhuu devmadhuu left a comment

Thanks @ArafatKhan2198 for working on this patch. A few comments.

@ArafatKhan2198 ArafatKhan2198 requested a review from devmadhuu March 8, 2024 19:18
@ArafatKhan2198
Contributor Author

@devmadhuu @adoroszlai @smitajoshi12

Could you please review the latest changes? Here's a quick summary:

  • Switched to Parallel Sorting: To improve performance, we're now using parallel sorting. More details are in the description.
  • Added a Toggle for Sorting: There's a new boolean flag to turn sorting on or off.
  • Set a Limit of 30 Records: We've added a constant to limit the response to the top 30 records in Disk Usage.

Contributor

@dombizita dombizita left a comment

Thanks for working on this, @ArafatKhan2198. Overall it looks good to me. I'd like the Javadoc and comments to be more accurate; please see my inline comments.

@devmadhuu
Contributor

Thanks @ArafatKhan2198 for handling some points. However, I am not sure that parallel streams always improve performance; sometimes they add overhead and do more harm than good. Please have a look here.

Contributor

@devmadhuu devmadhuu left a comment

Some comments are still open. Please address them.

@ArafatKhan2198
Contributor Author

ArafatKhan2198 commented Mar 26, 2024

Thanks a lot, @devmadhuu, for the comment and the article! I've read through it carefully and here's my analysis:

Parallel Streaming concern:

  • Parallel streams introduce overhead for managing multiple threads.
  • This overhead can outweigh the benefits of parallel processing for small datasets or simple operations.

After going through the article, I can summarise the following:

  • Factors affecting performance:
    • Data size: Parallel streams benefit from large datasets where the overhead is justified.
      • This sort is applied to the response objects at a single level of the file system hierarchy, which in the worst case could contain millions of items.
    • Computation intensity: Operations involving complex calculations benefit more from parallelization.
      • Sorting is a moderately expensive operation in the context of parallelization.
    • Stream source: Easily splittable sources like arrays perform better in parallel streams.
      • We are using Lists as our source.

@adoroszlai
Contributor

@ArafatKhan2198 @devmadhuu Please omit @mention when quoting the message that asked for review. Including it re-subscribes folks mentioned who may have already unsubscribed from the discussion (sorry, I don't have time to review this).

@devmadhuu
Contributor

devmadhuu commented Mar 26, 2024

Do we have any performance measurements on at least 1 million records, with and without parallel streaming? I am emphasizing this because, in my experience, parallel streaming can do more harm than good even with a few tens of thousands of records. So I would suggest publishing some performance figures, with and without parallel streaming, on at least 1 million records.

@devmadhuu
Contributor

devmadhuu commented Mar 26, 2024

Please check in the UI what maximum limit the dropdown sets and uses. I think it has changed to 10k+. Please check and confirm.

@ArafatKhan2198
Contributor Author

Thanks for the comments, @devmadhuu. I tested this out on a cluster with 10 million keys. These were the results:

Sequential sort time: 7657 ms
Parallel sort time: 1279 ms

I believe we could go with parallel sort.
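
For readers who want to try a similar comparison locally, a rough timing harness might look like the sketch below. It is not the test that produced the figures above: the class `SortTimingSketch`, its methods, the list of random `Long` sizes, and the 1 million element count are all assumptions made for illustration. Wall-clock timing like this is noisy, so a benchmarking framework such as JMH would give more reliable numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class SortTimingSketch {
  /** Builds a list of pseudo-random sizes; a fixed seed keeps runs comparable. */
  public static List<Long> randomSizes(int count, long seed) {
    Random random = new Random(seed);
    List<Long> sizes = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      sizes.add((long) random.nextInt(Integer.MAX_VALUE));
    }
    return sizes;
  }

  /** Sorts largest first, on either a sequential or a parallel stream. */
  public static List<Long> sortDescending(List<Long> sizes, boolean parallel) {
    return (parallel ? sizes.parallelStream() : sizes.stream())
        .sorted(Comparator.reverseOrder())
        .collect(Collectors.toList());
  }

  /** Returns the elapsed wall-clock time of one sort, in milliseconds. */
  public static long timeMillis(List<Long> sizes, boolean parallel) {
    long start = System.nanoTime();
    sortDescending(sizes, parallel);
    return (System.nanoTime() - start) / 1_000_000L;
  }

  public static void main(String[] args) {
    List<Long> sizes = randomSizes(1_000_000, 42L);
    // Warm-up runs so that JIT compilation does not dominate the measurement.
    timeMillis(sizes, false);
    timeMillis(sizes, true);
    System.out.println("Sequential sort time: " + timeMillis(sizes, false) + " ms");
    System.out.println("Parallel sort time: " + timeMillis(sizes, true) + " ms");
  }
}
```

The results will vary with core count, heap size, and element type, so numbers from such a harness are only indicative.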

@devmadhuu
Contributor

Thanks @ArafatKhan2198 for testing this and publishing the figures. This looks promising.

Contributor

@devmadhuu devmadhuu left a comment

Changes LGTM +1. Pls resolve conflicts.

Contributor

@devmadhuu devmadhuu left a comment

A minor comment.

Contributor

@devmadhuu devmadhuu left a comment

Thanks @ArafatKhan2198 for working on this patch. Changes LGTM +1

@ArafatKhan2198
Contributor Author

@dombizita Could you please take a final look at it?
I believe we are done and can merge it.

@devmadhuu
Contributor

Thanks @ArafatKhan2198 for working on this patch.

@devmadhuu devmadhuu merged commit 93a2489 into apache:master Apr 16, 2024
40 of 51 checks passed
Tejaskriya pushed a commit to Tejaskriya/ozone that referenced this pull request Apr 17, 2024
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024
6 participants