Browser crashes (OOM?) when using a custom ReadableStream #145

Open
adamkewley opened this issue Mar 12, 2020 · 11 comments


@adamkewley

Hi, this library seems very useful for a project I'm working on, which involves downloading extremely large (>10 GB) datasets (with client-side transformations). Skipping the details a bit, my use-case is going to involve copying a lot of buffers out of wasm contexts, rather than the data source being a Response object (for example).

As a first step, I wrote a prototype underlyingSource that generates null-initialized buffers and enqueues them onto the controller (code example below). This implementation works as expected for smaller file sizes on the browsers I'm targeting at the moment (Firefox 73, Chromium 80); however, once I increase TOTAL_SIZE above ~100 MB on my dev machine, the browser session just crashes.

I understand that the implementation below could recycle buffers using something like a ReadableByteStream or something more appropriate, but my understanding is that the implementation should start sinking the file one chunk at a time, applying backpressure where necessary (so that pull isn't over-called). However, debugging shows that the implementation calls pull thousands of times, with no file prompt appearing, and then crashes. What have I done wrong?

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
  </head>
  <body>
    <script src="https://cdn.jsdelivr.net/npm/web-streams-polyfill@2.0.2/dist/ponyfill.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/streamsaver@2.0.3/StreamSaver.min.js"></script>
    <script src="impl.js"></script>
  </body>
</html>
// impl.js

const TOTAL_SIZE = 100_000_000;
const CHUNK_SIZE = Math.pow(2, 16);

const fileStream = window.streamSaver.createWriteStream('filename.txt', {
    size: TOTAL_SIZE,
    
    // These *should* ensure that only one chunk is queued before
    // flushing to the output file?
    writableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
    readableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
});

// A source that just produces NUL bytes per-pull until all data is
// pulled
class UnderlyingSource {
    constructor() {
        this.remaining = TOTAL_SIZE;
    }
    
    start(controller) {
    }
    
    pull(controller) {        
        const buffer = new ArrayBuffer(CHUNK_SIZE);

        if (this.remaining > 0) {
            const readSize = Math.min(this.remaining, CHUNK_SIZE);
            this.remaining -= readSize;
            controller.enqueue(new Uint8Array(buffer, 0, readSize));
        } else {
            controller.close();
        }
    }    
}

const ponyfill = window.WebStreamsPolyfill || {};
const reader = window.ReadableStream.prototype.pipeTo ?
      new window.ReadableStream(new UnderlyingSource()) :
      new ponyfill.ReadableStream(new UnderlyingSource());
reader.pipeTo(fileStream).catch(err => console.error(err));
@jimmywarting
Owner

jimmywarting commented Mar 12, 2020

StreamSaver isn't a silver bullet when it comes to pulling/pushing data to the service worker over MessageChannel; that part is sort of half implemented.

MessageChannel is used when transferable streams aren't available, though it flushes the queue almost instantaneously, so there is no real backpressure.

The writableStrategy/readableStrategy options weren't implemented before I learned that transferable streams had become available.

It's possible to emulate this behaviour by using postMessage():
The service worker would have to listen for pulls and request more data from the main thread, instead of data just flowing in from the main thread.
postmessage-transferable-stream-wrapper and remote-web-streams are two existing techniques that resemble this, and that we could learn from.
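
A rough sketch of what such a pull-based protocol could look like (a sketch only - the 'pull'/'done' message names and produceNextChunk are made up for illustration, not StreamSaver's actual wire format):

// Main thread: wait for the service worker to ask before producing a chunk.
channel.port1.onmessage = async (evt) => {
    if (evt.data === 'pull') {
        const chunk = await produceNextChunk() // hypothetical data source
        if (chunk) {
            channel.port1.postMessage(chunk, [ chunk.buffer ])
        } else {
            channel.port1.postMessage('done')
        }
    }
}

// Service worker side (holding port2): request exactly one chunk per pull(),
// so the browser's own backpressure decides how often to ask for data.
const readable = new ReadableStream({
    pull (controller) {
        return new Promise(resolve => {
            port2.onmessage = (evt) => {
                evt.data === 'done' ? controller.close() : controller.enqueue(evt.data)
                resolve()
            }
            port2.postMessage('pull')
        })
    }
})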


Browsers aren't a silver bullet when it comes to pulling data either. Downloads are atomic: the browser starts pulling data and writing it to a temporary file immediately, even before you have selected a place/name to save to. When the stream closes, it moves/renames the temporary file to where you want your file to end up.

Still, though, you are able to pause the stream so that it stops pulling data - this is possible in Canary or with some experimental flags enabled, but that is because it uses transferable streams instead of MessageChannel.

In most scenarios you don't have all the data buffered up in the browser at one single point - often you wait on some data to arrive - so this hasn't been an issue for anyone yet. One "dirty" solution would be just to delay the writes:

    async pull (controller) {
        await sleep(100)
        controller.enqueue(data)
    }
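
where sleep is just a Promise-wrapped setTimeout (the same helper as in the full example further down):

    const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))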

@adamkewley
Author

Thanks a lot for this information @jimmywarting, I didn't realize some of the implementation quirks (clearly, I haven't spent enough time reading the source code). I will try out your suggestion and also look into changing my implementation a little; I will report back with the results.

@adamkewley
Author

Hey @jimmywarting, your suggestion works perfectly - yielding to the browser for a millisecond or so seems to prevent the OOM described in this ticket. My final implementation, which worked for >10 GB sizes, is:

const TOTAL_SIZE = 10_000_000_000;
const CHUNK_SIZE = Math.pow(2, 18);

const fileStream = window.streamSaver.createWriteStream('filename.txt', {
    size: TOTAL_SIZE,
    
    // These *should* ensure that only one chunk is queued before
    // flushing to the output file?
    writableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
    readableStrategy: new CountQueuingStrategy({ highWaterMark: 1 }),
});

function sleep(msec) {
    return new Promise(resolve => window.setTimeout(resolve, msec));
}

// A source that just produces NUL bytes per-pull until all data is
// pulled
class UnderlyingSource {
    constructor() {
        this.remaining = TOTAL_SIZE;
    }
    
    start(controller) {
    }
    
    async pull(controller) {
        await sleep(1);
        
        const buffer = new ArrayBuffer(CHUNK_SIZE);

        if (this.remaining > 0) {
            const readSize = Math.min(this.remaining, CHUNK_SIZE);
            this.remaining -= readSize;
            controller.enqueue(new Uint8Array(buffer, 0, readSize));
        } else {
            controller.close();
        }
    }    
}

const ponyfill = window.WebStreamsPolyfill || {};
const reader = window.ReadableStream.prototype.pipeTo ?
      new window.ReadableStream(new UnderlyingSource()) :
      new ponyfill.ReadableStream(new UnderlyingSource());
reader.pipeTo(fileStream).catch(err => console.error(err));

@jimmywarting
Owner

PS: another performance trick would be to make all the chunks you're sending transferable as well, so the browser doesn't have to copy the data.

// TODO: Kind of important that the service worker responds back when
// the chunk has been written. Otherwise we can't handle backpressure.
// EDIT: Transferable streams solve this...

// copies the chunk:
channel.port1.postMessage(chunk)

// transfers (moves) the chunk's underlying buffer instead:
channel.port1.postMessage(chunk, [ chunk.buffer ])

But doing so has side effects in the main thread if you plan to reuse the chunk: after the transfer, the chunk's buffer is detached.
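
For example, if the sender still needs the bytes afterwards it has to copy them before transferring (a sketch - port stands in for whichever MessagePort is in use):

const copy = chunk.slice()                // copies the bytes you want to keep
port.postMessage(chunk, [ chunk.buffer ]) // the transfer detaches chunk's buffer
// `chunk` now has byteLength 0; keep using `copy` instead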

@adamkewley
Author

@jimmywarting great suggestion - I've been getting high CPU usage from having to copy the chunks.

Looking into the Transferable docs, it looks like it's an interface that the browser vendors implement (as in, I can't implement it for a custom class).

I don't need to recycle the chunks, but the issue (I think?) is that I would need to pass either an ArrayBuffer, MessagePort, ImageBitmap, or OffscreenCanvas to .enqueue() to take advantage of the move semantics. However, StreamSaver.js doesn't seem to enjoy being passed raw ArrayBuffers, maybe because it relies on the chunk having a .length (so it can compute bytesWritten).

What's the best strategy for me to adapt the above code to get transferable semantics?

@jimmywarting
Owner

Correct, you can't implement transferability on a custom class.

The readable stream that the Response is built from expects a byte stream of Uint8Array chunks and nothing else, so all chunks need to be written as such.

But it is also possible to transfer ArrayBuffer views as well - you transfer the view's underlying buffer:

var chunk = new Uint8Array(20)
console.log(chunk.length) // 20
mc.port1.postMessage(chunk, [chunk.buffer])
console.log(chunk.length) // 0

So if you would like to get fancy and implement some zip protocol or something, you can have multiple views that use the same shared buffer:

const buffer = new ArrayBuffer(20)
const payload = {
  header: new Uint8Array(buffer, 0, 5),  // bytes 0-4
  body: new Uint8Array(buffer, 5, 10),   // bytes 5-14
  footer: new Uint8Array(buffer, 15)     // bytes 15-19
}
postMessage(payload, [buffer])
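
On the receiving side the three views arrive still sharing the single transferred buffer, so no bytes are copied (a sketch - port is the receiving MessagePort):

port.onmessage = (evt) => {
  const { header, body, footer } = evt.data
  // all three views still reference the same ArrayBuffer
  console.log(header.byteLength, body.byteLength, footer.byteLength) // 5 10 5
}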

@jimmywarting
Owner

jimmywarting commented Mar 13, 2020

Now that I think of it... it may well be that MessageChannel is faster than transferable streams.

I expect transferable streams clone the data, but I haven't measured it.
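
One rough way to check would be a micro-benchmark along these lines (a sketch only - port is an assumed MessagePort, and allocation cost is included in the timing):

// Time 1000 transfers of 1 MiB chunks over a MessagePort.
const t0 = performance.now()
for (let i = 0; i < 1000; i++) {
    const chunk = new Uint8Array(2 ** 20)
    port.postMessage(chunk, [ chunk.buffer ]) // transfer: the buffer is detached here
}
console.log(`${(performance.now() - t0).toFixed(1)} ms for 1000 transfers`)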

@jimmywarting
Owner

Non-goals

  • Avoiding cloning the chunks is not a goal at this stage; see "future work".
  • As such, transfer-only objects (such as ImageBitmap) will not be supported yet; only serializable objects and the built-in types supported by the structured serialization algorithm.

As expected - I didn't read it all.

@jimmywarting
Owner

jimmywarting commented Mar 13, 2020

Would you like to take a stab at improving StreamSaver with a PR, to maybe help solve this backpressure issue (including sending back a cancel event, #13)?

And potentially include an option for making chunks transferable:

if (option.transfer) {
    channel.port1.postMessage(chunk, [ chunk.buffer ])
} else {
    channel.port1.postMessage(chunk)
}
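
A caller might then opt in like this (transfer is a hypothetical option name, not part of StreamSaver's current API):

const fileStream = streamSaver.createWriteStream('filename.txt', {
    size: TOTAL_SIZE,
    transfer: true // hypothetical: move chunks to the service worker instead of copying them
})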

@adamkewley
Author

adamkewley commented Mar 13, 2020

Looking at this upcoming project I'm spinning up, I will almost certainly hit some performance and compatibility issues that might necessitate upstreaming some features into this library, yeah.

However, I'll need to talk with my colleagues, and also push my work forward a bit more, before I can commit to anything concrete - it may turn out that a completely unrelated project constraint means that we don't move forward with an implementation that uses this.

If I do get the time, in or outside work, I'll have a stab at those PRs, but I can't currently make any promises. Regardless, the library looks very promising and has definitely opened up some new avenues I didn't previously think were possible in a browser sandbox.

@adamkewley
Author

As for this particular issue, your feedback has been really helpful - I now have an implementation that can stream large (40 GB) datasets to a file in Chrome, Edge (80+; I had to monkeypatch for earlier versions), and Firefox 73.
