Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: reduce memory footprint of deleteFiles by utilizing getFilesStre… #2147

Merged

Conversation

ddelgrosso1
Copy link
Contributor

…am and using smaller queue of promises

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@ddelgrosso1 ddelgrosso1 requested review from a team as code owners February 13, 2023 23:14
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: storage Issues related to the googleapis/nodejs-storage API. labels Feb 13, 2023
@ddelgrosso1 ddelgrosso1 added the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 13, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 13, 2023
@ddelgrosso1 ddelgrosso1 merged commit f792f25 into googleapis:main Feb 15, 2023
@ddelgrosso1 ddelgrosso1 deleted the refactor-bucket-deletefiles branch February 15, 2023 12:54
const filesStream = this.getFilesStream(query);

for await (const curFile of filesStream) {
if (promises.length >= MAX_QUEUE_SIZE) {
Copy link

@timscottbell timscottbell Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is both MAX_QUEUE_SIZE and MAX_PARALLEL_LIMIT. Do we need both as they are trying to accomplish the same thing? Limiting the number of files being deleted at once.

Also 10 seems pretty low for deleting files, but I could be missing something, 1000 seems like a more reasonable number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full disclosure I'm not really sure why the original authors decided to utilize pLimit for this one particular function. That said, MAX_PARALLEL_LIMIT represents how many concurrent promises will be allowed to execute and I'm not sure why it was set to 10. We are having some discussion around if we want to expose this to be user setable. In unrelated testing we have found that setting concurrent promises really high / uncapped led to other problems, i.e. network saturation. It is probably less of a concern here as delete doesn't return much in the way of data.

I implemented MAX_QUEUE_SIZE as a means of further reducing memory utilization. The issue you were seeing was because 1. we fetched every file into memory during getFiles and 2. we then created a giant array of promises 1 for each file delete. We are also discussing allowing this to be user setable.

Copy link

@timscottbell timscottbell Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, when the authors leave it can be hard to figure out why they wrote something, and potentially dangerous to remove without sufficient testing. It seems after more testing a better number between 10 and 1000 could be set, or potentially as you said making it user-configurable.


for await (const curFile of filesStream) {
if (promises.length >= MAX_QUEUE_SIZE) {
await Promise.all(promises);
Copy link

@timscottbell timscottbell Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we hit the queue limit, all execution gets stopped until all the pending promises have finished. This technically works, but doesn't maximize the benefit of streams and results in lower performance.

When I was trying to find a workaround for this issue, I wrote this example transform stream for limiting execution of a stream: https://gist.github.com/timscottbell/364e2683354683c05cdfd93c708e9072. This is not production-ready code or fully tested, but more to give an idea of the direction I thought this solution would go in.

Our current workaround is this:

  async deleteBucket(name: string) {
    const bucket = StorageManagerGoogle._bucket(name, config.CloudProjectId);

    const bucketExists = await StorageManagerGoogle._exists(bucket);

    if (!bucketExists) {
      logger.warn(`Bucket [${name}] does not exist, cannot delete it`);

      return;
    }

    while (true) {
      const [files] = await bucket.getFiles({
        autoPaginate: false,
        maxResults: 1000,
      });

      if (files.length === 0) {
        break;
      }

      await Promise.all(
        files.map(async (file: File) => {
          try {
            await file.delete({ignoreNotFound: true});
          } catch (error) {
            const {response} = error;

            if (
              !response ||
              response.statusCode !== 409 ||
              response.statusMessage !== 'Conflict'
            ) {
              throw error;
            }
          }
        })
      );
    }

    await bucket.delete();
  }

which looks very surprisingly similar to the approach here. (We opted for the less performant solution over the stream solution in the gist because some team members have less familiarity with streams, and it was a workaround we only wanted to commit temporarily).

I couldn't find any standard libraries that do this (I didn't look super hard). Surprised... there really should be one that handles this common generic use case.

Maybe you could use pmap since the stream is an async iterable!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some other improvements to our backlog around deleteFiles performance in addition to some other stream related work we have planned for the next major version.

You are correct, we essentially back pressure the stream once the queue is full. I did this intentionally to not hold thousands / tens of thousands of objects in the queue at the same time. Hopefully this is enough of a reduction in memory for the time being. I don't think this will degrade performance as even the previous version only allowed 10 concurrent promises executing even if the queue contained 100,000 items.

If this doesn't help to resolve the issue you were seeing please feel free to update the open ticket and we will of course be happy to take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the googleapis/nodejs-storage API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants