add multi-proc in to_json
#2747
Conversation
Thank you for working on this, @bhavitvyamalik. 10% is not solving the issue; we want 5-10x faster on a machine that has lots of resources but limited processing time. So let's benchmark it on an instance with many more cores: I can test with 12 on my dev box and 40 on JZ. Could you please share the test I could run with both versions? Should we also test the sharded version I shared in #2663 (comment)? So, optionally, three versions to test.
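For context, a rough sketch of what such a manual-sharding export might look like (this is not the exact snippet from #2663; the dataset name, shard count, and output file names are illustrative assumptions): each worker process writes its own `Dataset.shard()` slice to a separate JSON file.

```python
# Rough sketch only: a "manual sharding" export in the spirit of the #2663
# suggestion, not the exact code from that comment. The dataset name, number
# of shards, and output file names are illustrative assumptions.
from multiprocessing import Pool

from datasets import load_dataset

NUM_SHARDS = 4  # illustrative


def export_shard(index: int) -> None:
    # Each worker loads the (cached) dataset and writes its own shard to disk.
    ds = load_dataset("ag_news", split="train")
    shard = ds.shard(num_shards=NUM_SHARDS, index=index)
    shard.to_json(f"ag_news-{index:05d}.jsonl")


if __name__ == "__main__":
    with Pool(NUM_SHARDS) as pool:
        pool.map(export_shard, range(NUM_SHARDS))
```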
Since I was facing … Update: I was able to convert into JSON, which took 50% less time compared to v1.
Here are the benchmarks with the current branch for both v1 and v2 (dataset: …).
Increasing the batch size on my machine helped make v2 around 50% faster compared to v1. Timings may vary depending on the machine. I'm including the benchmarking script as well. The CircleCI errors are unrelated (something related to …).
This will download the dataset in every iteration and run the conversion. Important: the benchmarking script will delete the newly generated JSON and the cached dataset files after each run.
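The actual benchmarking script isn't reproduced here; a minimal sketch of the kind of timing comparison it describes might look like the following (the dataset name, output path, and `num_proc` value are assumptions, and only the generated file is removed between runs rather than the HF cache):

```python
# Minimal sketch of a v1-vs-v2 timing comparison for Dataset.to_json.
# Assumes the num_proc argument added in this PR; dataset and paths are illustrative.
import os
import time

from datasets import load_dataset

ds = load_dataset("ag_news", split="train")  # illustrative dataset
out_path = "bench_output.jsonl"

# v1: single-process export
start = time.time()
ds.to_json(out_path)
print(f"v1 (single process): {time.time() - start:.1f}s")
os.remove(out_path)  # remove only the generated file, not the dataset cache

# v2: multi-process export
start = time.time()
ds.to_json(out_path, num_proc=4)
print(f"v2 (num_proc=4):     {time.time() - start:.1f}s")
os.remove(out_path)
```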
Thank you for sharing the benchmark, @bhavitvyamalik. Your results look promising. But if I remember correctly, the sharded version at #2663 (comment) was much faster, so we should probably compare to it as well. And if it's faster, then at least document that manual sharding version. Also, that's a dangerous benchmark, as it'd wipe out many other HF things. Why not wipe out only:
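Purely as an illustration of such a narrower cleanup, a sketch that removes only one dataset's cache directory (the cache location assumes the library's default path; the dataset name is a placeholder):

```python
# Illustration only: wipe just one dataset's cache directory instead of the
# entire HF cache. Assumes the default cache location and a placeholder dataset.
import shutil
from pathlib import Path

cache_root = Path.home() / ".cache" / "huggingface" / "datasets"
for path in cache_root.glob("ag_news*"):  # dataset name is illustrative
    shutil.rmtree(path, ignore_errors=True)
```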
Running the benchmark now.
Weird, I tried to adapt your benchmark to use shards and the program no longer works. Instead it quickly uses up all available RAM and hangs. Has something changed recently in …?
Just be careful so that it won't crash your compute environment, as it almost crashed mine.
So this part seems to no longer work:
If you are using …
That's because some dataset-related files were still left inside …
I tried this …
I don't think the issue has anything to do with your work, @bhavitvyamalik. I forgot to mention that I saw the same problem with the latest datasets release. Interesting: I tried your suggestion. This:
works fine and takes just a few GBs to complete. This, on the other hand, blows up memory-wise:
and I have to kill it before it uses up all the RAM. (I have 128GB of it, so it should be more than enough.)
I think a recent datasets version added a method that prints out the paths of all the different components of a given dataset; I can't recall the name though. It came up when we were discussing a janitor program to clear up space selectively.
The same thing just happened on my machine too. A memory leak somewhere, maybe? Even if you were to load this dataset entirely into memory, it shouldn't take more than 4GB. You were earlier doing this for …
Hmm, looks like … Were you able to reproduce the memory blow-up with …? But yes, oscar worked just fine with …
What I tried is:
and got:
Yes, this blows up memory-wise on my machine too. I found that a similar error was posted on the forum on 5th March. Since you already know how much time the #2663 approach took, can you try benchmarking v1 and v2 for now, until we have a fix for this memory blow-up?
OK, so I benchmarked using "lama", though it's too small for this kind of test, since the sharding is much slower than one thread here. Results: https://gist.github.com/stas00/dc1597a1e245c5915cfeefa0eee6902c So sharding does really badly there, and your JSON-over-procs version is doing great! Any suggestions for a somewhat bigger dataset, but not too big? Say, 10 times the size of lama?
Looks great! I had a few questions/suggestions related to …
You can use:
Thank you, @bhavitvyamalik. My apologies: at the moment I haven't found time to do more benchmarking with the other proposed datasets. I will try to do it later, but I don't want it to hold up your PR; it's definitely a great improvement based on the benchmarks I did run! The comparison to the sharded version is really just of interest to me, to see whether it's on par or slower. So if the other reviewers are happy, this definitely looks like a great improvement to me and addresses the request I made in the first place.
Oh, no particular reason; I was just comparing to 4 shards on my desktop. Typically it's sufficient to go from 1 to 2-4 processes to see whether the distributed approach is faster or not. Once you hit larger numbers you often run into bottlenecks like IO, and then the numbers can be less representative. I hope that makes sense.
Cool, thanks!
I just have a few comments, especially one regarding memory when `num_proc > 1`.
Tested it with a larger dataset (…).
Nice @bhavitvyamalik this is awesome :)
Indeed we need to have a test for this.
Maybe you can avoid `OSError: [Errno 12] Cannot allocate memory` by using a smaller `batch_size` (maybe < 1000) and a small number of processes (maybe 2)?
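As a rough illustration of that suggestion (the dataset and output path are placeholders; the argument names follow the `batch_size`/`num_proc` parameters discussed in this PR):

```python
# Illustration only: keep the multiprocessing test small so it doesn't hit
# "OSError: [Errno 12] Cannot allocate memory" in constrained CI environments.
from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(10_000)]})

# small batch_size (< 1000) and only 2 worker processes, as suggested above
ds.to_json("tiny_test.jsonl", batch_size=500, num_proc=2)
```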
A few nits.
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Thanks a lot for this feature @bhavitvyamalik :)
It looks all good!
Closes #2663. I've tried adding multiprocessing in `to_json`. Here's some benchmarking I did to compare the timings of the current version (say v1) and the multi-proc version (say v2). I did this with `cpu_count` 4 (2015 MacBook Air).

Dataset name: `ascent_kb` - 8.9M samples (all samples were used, reporting this for a single run)
v1 - ~225 seconds for converting the whole dataset to JSON
v2 - ~200 seconds for converting the whole dataset to JSON

Dataset name: `lama` - 1.3M samples (all samples were used, reporting this for 2 runs)
v1 - ~26 seconds for converting the whole dataset to JSON
v2 - ~23.6 seconds for converting the whole dataset to JSON

I think it's safe to say that v2 is 10% faster compared to v1. Timings may improve further with a better configuration.

The only bottleneck I feel is writing to the file from the output list. If we can improve that aspect, then timings may improve further.

Let me know if any changes/improvements can be made to this, @stas00, @lhoestq, @albertvillanova. @lhoestq even suggested extending this work to other export methods as well, like `csv` or `parquet`.
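For reference, a minimal usage sketch of the new argument (the dataset and the `num_proc` value here are illustrative, not the exact benchmark setup above):

```python
# Minimal usage sketch of the multi-proc export discussed in this PR;
# the dataset and num_proc value are illustrative.
from datasets import load_dataset

ds = load_dataset("imdb", split="train")

ds.to_json("imdb_v1.jsonl")              # v1: single-process export
ds.to_json("imdb_v2.jsonl", num_proc=4)  # v2: multi-process export (this PR)
```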