# Problems and deficiencies around uploading big files with k6 #2311
## Scope

We want to identify how to pass and process an in-memory Reader (binary or encoded) through the HTTP API, in order to minimize memory usage (in terms of copies replicated across VUs and per single iteration) while maintaining concurrency-safe logic across VUs and the event loop. Even though they are complementary and should be considered from a higher level, the following points are not in the scope of this issue; understanding how and what the best solution is for them is out of scope:
## Goals

## Upload types to support
@mstoykov I tried to summarise what we are trying to resolve in this issue. Do you think it is accurate enough?

About others

As you mentioned, they are probably designed without memory sharing in mind, so I don't know how much they can help. Do we have more relevant ones to mention and check?

API changes proposal
@mstoykov Should it continue to use the

Do you mean

As an alternative:

```go
type StreamType struct {
	chunks []struct{ArrayBuffer, RWLock}
}
```

It should allow us to lock on a per-chunk basis, so we could potentially accept writes. So, as a summary, the changes that would be required:
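The chunk-with-lock idea could be sketched in runnable Go roughly as follows. This is a minimal sketch, not a proposed k6 API: the `stream`/`chunk` names and the copy-on-read in `readChunk` are my assumptions, and a real implementation would have to decide whether readers get copies or locked views.

```go
package main

import (
	"fmt"
	"sync"
)

// chunk pairs a slice of the shared body with its own lock, so concurrent
// readers (uploads in flight) and an occasional writer only contend per chunk.
type chunk struct {
	mu   sync.RWMutex
	data []byte
}

// stream is the hypothetical StreamType from the comment above.
type stream struct{ chunks []*chunk }

// readChunk returns a copy of the chunk's data under a read lock.
func (s *stream) readChunk(i int) []byte {
	c := s.chunks[i]
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]byte, len(c.data))
	copy(out, c.data)
	return out
}

// writeChunk replaces the chunk's data under a write lock.
func (s *stream) writeChunk(i int, b []byte) {
	c := s.chunks[i]
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data = append(c.data[:0], b...)
}

func main() {
	s := &stream{chunks: []*chunk{{data: []byte("aa")}, {data: []byte("bb")}}}

	// Many concurrent readers on chunk 0 are fine; they only share RLocks.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); _ = s.readChunk(0) }()
	}
	wg.Wait()

	// A write to chunk 1 never blocks readers of chunk 0.
	s.writeChunk(1, []byte("cc"))
	fmt.Println(string(s.readChunk(1)))
}
```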
Scope:

As I tried to say in the preface of the original post, the most important part of this issue is actually discussing the problems and figuring out which ones we want fixed; then we can talk about solutions :). I would expect we want to fix all of them, and I would argue it's better if we have an abstraction that fixes all of them well enough.
This seems more like searching for a solution, while I am not certain, due to the lack of comments on this, that we have agreed on my assessment of what the problems are and what we want fixed. Arguably we might have 10 problems but want to prioritize fixing 7 of them now and fix the remaining 3 later (numbers are just an example). In that case we will likely take a different development approach than if we want to fix all of the identified problems. The whole post was provoked by work around these issues that IMO fixes corner cases somewhat, but not the underlying problem. So let us first discuss the problems and then search for a solution.

About others:

At the time I did not try to figure out a solution; as I said before, I was trying to identify problems. There probably are good APIs out there, and maybe even those are good enough. Whether we decide to implement them exactly, just be inspired by them, or completely ignore them is a different question, one that I currently don't have an opinion on.

API changes proposal
I don't know, again. IMO my main point is that if working with request bodies (but arguably also response bodies, which IMO will be even more involved) requires that they are loaded into memory in full, there will be memory problems. Go's io.Reader is designed (with all of its additional interfaces) in a way that makes more or less exactly this a non-problem: you don't need the whole body to send it, you only need parts of it at a time. The http.Request specifically also has

Ideally we would have built-in solutions for simple(r) problems such as:
Also of note here: while "we will need the body again" is an important problem, it likely isn't a problem for every API where this will be needed, so we might want to only require it in some places :).
What I meant by

Here I was talking about the fact that if you have a generator, to get the next part you need to execute JS code, and since we are in async code we now need to queue this. Arguably it would be nice if a user can create this in JS code, and it can be partially optimized. In Go there are a bunch of faster interfaces that

Given just the above few thoughts off the top of my head, I would argue that coming to a solution will take a while and will likely require a lot of changes. I haven't even touched on the fact that httpext will likely need to grow a lot. Or that

Arguably we should design it with a lot of those ideas in mind, but start by implementing only a small part of it. But again, first we need to agree on the issues and which ones we think are worth fixing. To this end it's probably better to restructure them as a list with bigger examples 🤔. I did write this issue somewhat in a hurry :(.

p.s. the final example you are giving seems to still require a copy of the data in memory? That is what I am mostly against. But I think we will need much bigger and more thought-out examples going forward, because I am already pretty sure the above (writing of mine) is hard to follow without concrete examples.
There seems to be a somewhat standard Streams API that does roughly what I was mentioning above. Parts(?) of it seem to be experimental, and I don't know how widely used it is. But it is basically what is used in the fetch API, so probably at least parts of it are fairly well understood and stable.
This will discuss deficiencies around `net/http`, but in reality some of those apply to other APIs as well. The idea is mostly to:

But let's first start with a fairly simple script.
Given this very simple example, currently (v0.35.0) we have:

1. `file.bin` created for each VU on line A. This can be addressed with #1931 and PR #2303.
2. `binFile` is a mutable ArrayBuffer, so it is technically possible to edit the underlying data after `data` is created. And if `http.post` were asynchronous (and 3 is fixed), possibly even a data race. Even if it was copying, and if there wasn't dynamic data, this could've been done once in the init context.
3. `data` gets "marshaled" into a multipart body, effectively copying the body.
4. `res` also has a field `request`, which has a field `body`, which is a string dump of the body that was created above.

So for this particular script, 1, 3 and 4 make copies of a potentially big amount of bytes.
Let's go with a similar but slightly different script.

Here we have the exact same thing, except that instead of 3. happening inside k6, it happens in `FormData`'s `.body()` call, which is in JavaScript. This also happens to be the currently advised way to do multipart requests, due to a number of issues.

Arguably just #1931 can help with line A, and as long as the body doesn't need anything dynamic, it also lets us do `new SharedArrayBuffer(name, func() { return arrayBuffer })`, which can be used to build the whole body in the init context.

edit: turns out that I am wrong, as at least in the case of FormData you also need the boundary, which then requires `SharedArrayBuffer` to somehow return both a buffer and a string (the boundary). While this is doable, I think it means we make it extremely specific, so arguably we now need to make it more general. I guess (haven't checked) we could have FormData take in the boundary and make it a constant for each VU 🤷♂, but this now seems like fixing a hack with a hack IMO.

This still means that on each request that uploads, we make a string copy of the whole body. This body is also, IME, never actually used. In reality, the only case where it's even relevant is in the first example, as it can be used to see what k6 actually sent.
The currently proposed solution is to just reference the original value in all cases but the one where we generate it. In practice, though, this again falls short by a lot in the case where the values are dynamic and the body needs to be generated.

All of this also isn't really thought through from the perspective of having asynchronous calls. With asynchronous operations there is no problem to have:

In this case there are multiple race condition points: if `someBody` gets modified, not only can it race with any of the posts being done, but `response.request.body` will not actually be accurate.

I really can't figure out any way for `response.request.body` to be accurate in cases where we don't want to copy it and it's modifiable. Requiring all `body` arguments to `http.*` to be immutable will probably be way too big of an ask and a breaking change.

Combined with the race conditions around asynchronous programming, and the fact that copying big binary blobs will be problematic forever, I kind of feel this needs a more ... drastic change to fix.
The thing I mostly think of is generators. If the body is given by a generator (or a generator-like object), the body can be read once from it and sent. There are two problems (apart from no generator support in goja currently):

1. can be fixed by having a custom type `BodyGenerator` with both `next()` and `reset()` methods 🤷♂
2. can be fixed by having custom in-Go types that wrap immutable types in ways that will be more optimal

Along the same vein (of bigger changes), we can require that the body is written with explicit `write()` calls which are blocking, or combine this with the above.

This likely needs more investigation to see how it works in other APIs, such as in Node.js and browsers, but I expect that for them this might be less of a problem, as there won't be hundreds if not thousands of JS VMs doing uploads at the same time. So for them, something like copying might be a lot more reasonable, with only a "streaming" variant for compliance.
Things I have not touched on:

- how `http.*` will mess up any outside fixes.

In my opinion, while the changes proposed in PR #2303 will have some beneficial effect on the situation with uploading big files in k6, they will not address the underlying problem: the k6 HTTP module was not designed with uploads in mind (or so it seems to me).
edit: Due to the above problem, where even for static FormData a pure `SharedArrayBuffer` will not be sufficient, and since I would expect there are even more deficiencies with it, I feel even more strongly that just adding `SharedArrayBuffer` will have negligible positive effects but will warp any future change to try to work with it, instead of actually coming up with a better solution.

I would really prefer if, instead of making some changes which may or may not have to be advised against in the future (as we might move to something different entirely), a redesign of the HTTP module were being considered, so that uploading anything is possible in a way that is both safe and performant. Hopefully it will also be fairly obvious how to use the API in a way that makes this true, instead of requiring jumping through many hoops 🤷♂
Relevant issues open at the time:

Some of those are only tangentially about uploads (and there are more around the current k6 multipart support for "object" bodies).