Browser crashes (OOM?) when using a custom ReadableStream #145

Open
adamkewley opened this issue Mar 12, 2020 · 11 comments


@adamkewley

Hi, this library seems very useful for a project I'm working on, which involves downloading extremely large (>10 GB) datasets (with client-side transformations). Skipping the details a bit, my use-case is going to involve copying a lot of buffers out of wasm contexts, rather than the data source being a Response object (for example).

As a first step, I wrote a prototype underlyingSource that generates null-initialized buffers and enqueues them onto the controller (code example below). This implementation works as expected for smaller file sizes on the browsers I'm targeting at the moment (Firefox 73, Chromium 80); however, once I increase TOTAL_SIZE above ~100 MB on my dev machine, the browser session just crashes.

I understand that the implementation below could recycle buffers using something like a ReadableByteStream or something more appropriate, but my understanding is that the implementation should start sinking the file one chunk at a time, applying backpressure where necessary (so that pull isn't over-called). However, debugging shows that the implementation calls pull thousands of times, with no file prompt appearing, and then crashes. What have I done wrong?

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
  </head>
  <body>
    <script src="https://cdn.jsdelivr.net/npm/web-streams-polyfill@2.0.2/dist/ponyfill.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/streamsaver@2.0.3/StreamSaver.min.js"></script>
    <script src="impl.js"></script>
  </body>
</html>
// impl.js

const TOTAL_SIZE = 100_000_000;
const CHUNK_SIZE = Math.pow(2, 16);

const fileStream = window.streamSaver.createWriteStream('filename.txt', {
    size: TOTAL_SIZE,
    
    // These *should* ensure that only one chunk is queued before
    // flushing to the output file?
    writableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
    readableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
});

// A source that just produces NUL bytes per-pull until all data is
// pulled
class UnderlyingSource {
    constructor() {
        this.remaining = TOTAL_SIZE;
    }
    
    start(controller) {
    }
    
    pull(controller) {        
        const buffer = new ArrayBuffer(CHUNK_SIZE);

        if (this.remaining > 0) {
            const readSize = Math.min(this.remaining, CHUNK_SIZE);
            this.remaining -= readSize;
            controller.enqueue(new Uint8Array(buffer, 0, readSize));
        } else {
            controller.close();
        }
    }    
}

const ponyfill = window.WebStreamsPolyfill || {};
const reader = window.ReadableStream.prototype.pipeTo ?
      new window.ReadableStream(new UnderlyingSource()) :
      new ponyfill.ReadableStream(new UnderlyingSource());
reader.pipeTo(fileStream).catch(err => console.error(err));
@jimmywarting
Owner

jimmywarting commented Mar 12, 2020

StreamSaver isn't a silver bullet when it comes to pulling/pushing data to the service worker over MessageChannel; that part is sort of half implemented.

MessageChannel is used when transferable streams aren't available, though it flushes the queue almost instantaneously, so there is no real backpressure.

The writableStrategy/readableStrategy options weren't implemented before I learned that transferable streams had become available.

It's possible to emulate this behaviour by using postMessage():
The service worker would have to listen for pulls and request more data from the main thread, instead of data just flowing in from the main thread.
postmessage-transferable-stream-wrapper and remote-web-streams are two existing techniques that resemble this, and that we could learn from.
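
A rough sketch of what such a pull-based protocol could look like (a sketch only - the 'pull'/'done' message names and produceNextChunk are made up for illustration, not StreamSaver's actual wire format):

// Main thread: wait for the service worker to ask before producing a chunk.
channel.port1.onmessage = async (evt) => {
    if (evt.data === 'pull') {
        const chunk = await produceNextChunk() // hypothetical data source
        if (chunk) {
            channel.port1.postMessage(chunk, [ chunk.buffer ])
        } else {
            channel.port1.postMessage('done')
        }
    }
}

// Service worker side (holding port2): request exactly one chunk per pull(),
// so the browser's own backpressure decides how often to ask for data.
const readable = new ReadableStream({
    pull (controller) {
        return new Promise(resolve => {
            port2.onmessage = (evt) => {
                evt.data === 'done' ? controller.close() : controller.enqueue(evt.data)
                resolve()
            }
            port2.postMessage('pull')
        })
    }
})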


Browsers aren't a silver bullet when it comes to pulling data either. Downloads are atomic: the browser starts pulling data and writing it to a temporary file immediately, even before you have selected a place/name to save to. When the stream closes, it moves/renames the temporary file to where you want your file to end up.

Still, though, you are able to pause the stream so that it stops pulling data - this is possible in Canary or with some experimental flags enabled, but that is because it uses transferable streams instead of MessageChannel.

In most scenarios you don't have all the data buffered up in the browser at one single point - often you wait on some data to arrive - so this hasn't been an issue for anyone yet. One "dirty" solution would be just to delay the writes:

    async pull (controller) {
        await sleep(100)
        controller.enqueue(data)
    }
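
where sleep is just a Promise-wrapped setTimeout (the same helper as in the full example further down):

    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))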

@adamkewley
Author

Thanks a lot for this information @jimmywarting, I didn't realize some of the implementation quirks (clearly, I haven't spent enough time reading the source code). I will try out your suggestion and also look into changing my implementation a little; I will report back with the results.

@adamkewley
Author

Hey @jimmywarting, your suggestion works perfectly - yielding to the browser for a millisecond or so seems to prevent the OOM described in this ticket. My final implementation, which worked for >10 GB sizes, is:

const TOTAL_SIZE = 10_000_000_000;
const CHUNK_SIZE = Math.pow(2, 18);

const fileStream = window.streamSaver.createWriteStream('filename.txt', {
    size: TOTAL_SIZE,
    
    // These *should* ensure that only one chunk is queued before
    // flushing to the output file?
    writableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
    readableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
});

function sleep(msec) {
    return new Promise(resolve => window.setTimeout(resolve, msec));
}

// A source that just produces NUL bytes per-pull until all data is
// pulled
class UnderlyingSource {
    constructor() {
        this.remaining = TOTAL_SIZE;
    }
    
    start(controller) {
    }
    
    async pull(controller) {
        await sleep(1);
        
        const buffer = new ArrayBuffer(CHUNK_SIZE);

        if (this.remaining > 0) {
            const readSize = Math.min(this.remaining, CHUNK_SIZE);
            this.remaining -= readSize;
            controller.enqueue(new Uint8Array(buffer, 0, readSize));
        } else {
            controller.close();
        }
    }    
}

const ponyfill = window.WebStreamsPolyfill || {};
const reader = window.ReadableStream.prototype.pipeTo ?
      new window.ReadableStream(new UnderlyingSource()) :
      new ponyfill.ReadableStream(new UnderlyingSource());
reader.pipeTo(fileStream).catch(err => console.error(err));

@jimmywarting
Owner

PS: another performance trick would be to make all the chunks you're sending transferable as well, so the browser doesn't have to copy the data.

// TODO: Kind of important that the service worker responds back when
// the chunk has been written. Otherwise we can't handle backpressure.
// EDIT: Transferable streams solve this...

// copies the chunk:
channel.port1.postMessage(chunk)

// transfers (moves) the chunk's underlying buffer instead:
channel.port1.postMessage(chunk, [ chunk.buffer ])

But doing so has side effects in the main thread if you plan to reuse the chunk: after the transfer, the chunk's buffer is detached.
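
For example, if the sender still needs the bytes afterwards it has to copy them before transferring (a sketch - port stands in for whichever MessagePort is in use):

const copy = chunk.slice()                // copies the bytes you want to keep
port.postMessage(chunk, [ chunk.buffer ]) // the transfer detaches chunk's buffer
// `chunk` now has byteLength 0; keep using `copy` instead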

@adamkewley
Author

@jimmywarting great suggestion - I've been getting high CPU usage from having to copy the chunks.

Looking into the Transferable docs, it looks like it's an interface that the browser vendors implement (as in, I can't implement it for a custom class).

I don't need to recycle the chunks, but the issue (I think?) is that I would need to pass either an ArrayBuffer, MessagePort, ImageBitmap, or OffscreenCanvas to .enqueue() to take advantage of the move semantics. However, StreamSaver.js doesn't seem to enjoy being passed raw ArrayBuffers, maybe because it relies on the chunk having a .length (so it can compute bytesWritten).

What's the best strategy for me to adapt the above code to get transferable semantics?

@jimmywarting
Owner

Correct, you can't implement transferability on a custom class.

The readable stream that the Response is built from expects a byte stream of Uint8Array chunks and nothing else, so all chunks need to be written as such.

But it is also possible to transfer ArrayBuffer views as well - you transfer the view's underlying buffer:

var chunk = new Uint8Array(20)
console.log(chunk.length) // 20
mc.port1.postMessage(chunk, [chunk.buffer])
console.log(chunk.length) // 0

So if you would like to get fancy and implement some zip protocol or something, you can have multiple views that use the same shared buffer:

const buffer = new ArrayBuffer(20)
const payload = {
  header: new Uint8Array(buffer, 0, 5),  // bytes 0-4
  body: new Uint8Array(buffer, 5, 10),   // bytes 5-14
  footer: new Uint8Array(buffer, 15)     // bytes 15-19
}
postMessage(payload, [buffer])
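
On the receiving side the three views arrive still sharing the single transferred buffer, so no bytes are copied (a sketch - port is the receiving MessagePort):

port.onmessage = (evt) => {
  const { header, body, footer } = evt.data
  // all three views still reference the same ArrayBuffer
  console.log(header.byteLength, body.byteLength, footer.byteLength) // 5 10 5
}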

@jimmywarting
Owner

jimmywarting commented Mar 13, 2020

Now that I think of it... it may well be that MessageChannel is faster than transferable streams.

I expect transferable streams clone the data, but I haven't measured it.
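
One rough way to check would be a micro-benchmark along these lines (a sketch only - port is an assumed MessagePort, and allocation cost is included in the timing):

// Time 1000 transfers of 1 MiB chunks over a MessagePort.
const t0 = performance.now()
for (let i = 0; i < 1000; i++) {
    const chunk = new Uint8Array(2 ** 20)
    port.postMessage(chunk, [ chunk.buffer ]) // transfer: the buffer is detached here
}
console.log(`${(performance.now() - t0).toFixed(1)} ms for 1000 transfers`)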

@jimmywarting
Owner

Non-goals

  • Avoiding cloning the chunks is not a goal at this stage; see "future work".
  • As such, transfer-only objects (such as ImageBitmap) will not be supported yet; only serializable objects and the built-in types supported by the structured serialization algorithm.

As expected - I didn't read it all.

@jimmywarting
Owner

jimmywarting commented Mar 13, 2020

Would you like to take a stab at improving StreamSaver with a PR, to maybe help solve this backpressure issue (including sending back a cancel event, #13)?

And potentially include an option for making chunks transferable:

if (option.transfer) {
    channel.port1.postMessage(chunk, [ chunk.buffer ])
} else {
    channel.port1.postMessage(chunk)
}
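
A caller might then opt in like this (transfer is a hypothetical option name, not part of StreamSaver's current API):

const fileStream = streamSaver.createWriteStream('filename.txt', {
    size: TOTAL_SIZE,
    transfer: true // hypothetical: move chunks to the service worker instead of copying them
})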

@adamkewley
Author

adamkewley commented Mar 13, 2020

Looking at this upcoming project I'm spinning up, I will almost certainly hit some performance and compatibility issues that might necessitate upstreaming some features into this library, yeah.

However, I'll need to talk with my colleagues, and also push my work forward a bit more, before I can commit to anything concrete - it may turn out that a completely unrelated project constraint means that we don't move forward with an implementation that uses this.

If I do get the time, in or outside work, I'll have a stab at those PRs, but I can't currently make any promises. Regardless, the library looks very promising and has definitely opened up some new avenues I didn't previously think were possible in a browser sandbox.

@adamkewley
Author

As for this particular issue, your feedback has been really helpful - I now have an implementation that can stream large (40 GB) datasets to a file in Chrome, Edge (80+; I had to monkeypatch for earlier versions), and Firefox 73.
