This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


git pack generation #3

Closed
joshtriplett opened this issue Jul 14, 2020 · 5 comments

Comments

@joshtriplett
Contributor

As promised on Reddit, here's an outline of the parallel pack generation use case:

  • I have a set of objects in the repository, identified via a list of object roots.
  • I have an optional set of potential thin-pack bases, which the other end has; the objects that go in the pack are all objects reachable from the roots and not reachable from the bases, and the objects usable for deltas are all objects reachable from either the roots or the bases.
  • I'd like to generate a pack (or thin-pack if any bases are specified), and stream that pack either to disk or over a network connection.
  • Sometimes that connection or disk will be slow, other times it'll be absurdly fast. I'd like to have some reasonable control over the tradeoffs between pack generation speed and pack size.
  • It would be nice to handle generating a pack that includes some objects that aren't in a repository, without first having to add the objects to the repository. (For instance, blobs and trees generated from a directory or extracted from a tarball.) Not a hard requirement, but helpful.
  • Massive bonus points if git-oxide could start streaming the pack almost immediately, and adaptively do as well at compression as you can in the time until the next bits are needed to send over the wire, to come as close as possible to saturating the available network speed. (The pack can later be repacked for space, taking more time to do so more effectively.)
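
The first two bullets amount to a set difference over reachability: pack everything the roots reach minus everything the bases reach, while the union remains fair game for delta bases. A minimal sketch in Rust, using a toy in-memory graph in place of a real object database (the object ids and the `Graph` type here are invented for illustration; none of this is gitoxide API):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Toy object graph: each object id lists the ids it references
// (a commit points at its tree and parents, a tree at its blobs).
type Graph = HashMap<&'static str, Vec<&'static str>>;

/// Collect every object reachable from `starts` via breadth-first traversal.
fn reachable(graph: &Graph, starts: &[&'static str]) -> HashSet<&'static str> {
    let mut seen: HashSet<&'static str> = starts.iter().copied().collect();
    let mut queue: VecDeque<&'static str> = starts.iter().copied().collect();
    while let Some(id) = queue.pop_front() {
        for &child in graph.get(id).into_iter().flatten() {
            if seen.insert(child) {
                queue.push_back(child);
            }
        }
    }
    seen
}

/// Objects to place in the (thin) pack: reachable from the roots
/// but not from the bases.
fn pack_objects(
    graph: &Graph,
    roots: &[&'static str],
    bases: &[&'static str],
) -> HashSet<&'static str> {
    let have = reachable(graph, bases);
    reachable(graph, roots)
        .into_iter()
        .filter(|id| !have.contains(id))
        .collect()
}

fn main() {
    let mut graph = Graph::new();
    graph.insert("c2", vec!["t2", "c1"]); // new commit on top of the base
    graph.insert("t2", vec!["b1", "b2"]);
    graph.insert("c1", vec!["t1"]); // base commit the other end already has
    graph.insert("t1", vec!["b1"]);

    let mut to_pack: Vec<_> = pack_objects(&graph, &["c2"], &["c1"]).into_iter().collect();
    to_pack.sort();
    // c1, t1 and b1 are on the other side already; only the rest goes in.
    println!("{:?}", to_pack); // ["b2", "c2", "t2"]
}
```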

I'd love to test this out, and I'd be happy to do so on the biggest machines I can throw at it, though I'd also like it to work well in the 2-8 CPU case.

@Byron
Member

Byron commented Jul 14, 2020

Thanks so much for the write-up! And I can tell you that the bonus point especially is what interests me! If this happened on the server side, it could result in massive time savings. That also implies there should be some control over how many resources you actually want to put into it; otherwise a slow connection would likely trigger higher CPU usage.

Something that should always be an option is to directly copy entries from a possibly highly compressed pack, which costs basically nothing. That way, one would differentiate between creating maximally efficient packs (during operations like gc) and transferring data quickly, ideally without initial delay.
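
One way to picture that knob is a per-entry decision driven by the consumer's measured throughput: copy the already-compressed bytes verbatim when the sink is fast, and spend CPU recompressing when it is slow. A hypothetical sketch; the `EntryPlan` type, `plan_entry`, and the thresholds are all invented here, not gitoxide's actual heuristics:

```rust
/// How to emit one pack entry. Illustrative only.
#[derive(Debug, PartialEq)]
enum EntryPlan {
    /// Reuse the already-compressed bytes from the source pack as-is.
    CopyVerbatim,
    /// Recompress at the given zlib-style level (1..=9).
    Recompress(u32),
}

/// Pick a plan from the consumer's observed throughput: a fast sink
/// leaves no time for extra work, while a slow one pays for heavy
/// compression with bandwidth that would otherwise sit idle.
fn plan_entry(consumer_bytes_per_sec: u64) -> EntryPlan {
    match consumer_bytes_per_sec {
        s if s > 100_000_000 => EntryPlan::CopyVerbatim, // ~saturated disk/LAN
        s if s > 10_000_000 => EntryPlan::Recompress(3),
        s if s > 1_000_000 => EntryPlan::Recompress(6),
        _ => EntryPlan::Recompress(9), // modem-class link: spend the CPU
    }
}

fn main() {
    assert_eq!(plan_entry(200_000_000), EntryPlan::CopyVerbatim);
    assert_eq!(plan_entry(5_000), EntryPlan::Recompress(9));
    println!("plans chosen");
}
```

A real implementation would re-measure throughput as it streams, so the level adapts as the link speeds up or slows down.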

A cool hybrid of both operations would be to use the transfer time for packing, and after the transfer one could re-integrate the more thoroughly compressed parts of the transferred pack into the existing one. For git hosting providers, that would be something like auto-gc, if that is ever useful :D.

Generally, the idea of a massively bored 96-core server that sends data to a receiver through a 5kbit modem line and uses all 96 cores to crush the pack down to a 1/60 compression ratio sounds like great fun!

Potentially making it easy to stream directories or tar files directly into a pack would turn git packs into a genuine transport format, which also sounds interesting. That case can probably be supported at first by adding the source data to an in-memory object database, from which objects can later be streamed.
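
Such an in-memory object database can be pictured as a content-addressed map that a directory walk or tar reader feeds blobs into, with deduplication falling out of the addressing. A toy sketch; the `MemoryOdb` type is invented for illustration, and std's `DefaultHasher` stands in for git's real SHA-1/SHA-256 object ids:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// A toy in-memory object database: content-addressed storage that
/// sources like a directory walk or tar stream can write into, and
/// that pack streaming can later enumerate.
#[derive(Default)]
struct MemoryOdb {
    objects: HashMap<u64, Vec<u8>>,
}

impl MemoryOdb {
    /// Store a blob and return its (toy) object id. Identical content
    /// hashes to the same id and is stored only once.
    fn write_blob(&mut self, data: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        let id = h.finish();
        self.objects.entry(id).or_insert_with(|| data.to_vec());
        id
    }

    fn len(&self) -> usize {
        self.objects.len()
    }
}

fn main() {
    let mut odb = MemoryOdb::default();
    let a = odb.write_blob(b"fn main() {}\n");
    let b = odb.write_blob(b"fn main() {}\n"); // identical content dedups
    let c = odb.write_blob(b"README\n");
    assert_eq!(a, b);
    assert_ne!(a, c);
    println!("{} unique objects staged for packing", odb.len());
}
```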

Anyway, I will keep you posted here once there is anything to test :).

@Byron
Member

Byron commented Aug 14, 2020

In the current iteration, I believe it's time to implement a first stab at pack generation. It certainly won't be fancy, but I will do my best to build the machinery so that it can support streaming packs with various settings.
To my mind this will absolutely be possible, and I also want to start by reducing the time it takes to begin streaming a pack and, overall, to clone and fetch.

Exciting times ahead!

@aidanhs

aidanhs commented Sep 25, 2020

I'm interested in following along with the pack writing/generation work. If there ends up being a branch that I can watch, I'd love to hear about it.
(No pressure - it's totally up to you what makes the most sense when you do the work.)

@Byron
Member

Byron commented Sep 25, 2020

Great - unless I forget, I will announce it here and work in a PR, no problem. That's probably the easiest way for you to follow along, and it sets the stage for early feedback - for that task I can use all the help I can get 😅!

@GaurangTandon mentioned this issue Feb 10, 2021
@Byron
Member

Byron commented Mar 17, 2021

@Byron closed this as completed Mar 17, 2021
@GitoxideLabs locked and limited conversation to collaborators Mar 17, 2021

