Support for tar archives #1540
Comments
We don't have any existing tar code to leverage, if that's what you mean. Tar is a pretty different format from Zip (particularly since it doesn't compress) so we would need to start mostly from scratch. That's not to say it isn't worthwhile, though. I'd love to be able to handle all popular compression formats. I'm not sure though where we would even want to put this if we did add it. System.IO.Compression only kind of makes sense since a tar doesn't compress. I guess FileSystem? tar is so frequently associated with gzip it seems incorrect to not place it alongside it.
In my opinion it would be ideal for it to be as similar to ZipArchive as possible. |
@ericstj @jasonwilliams200OK |
Thanks @ianhays. From dotnet/corefx#9673:
Usually bz2 is the compressed format which contains a tarball (as bz2 only compresses one file: https://en.wikipedia.org/wiki/Bzip2). We can probably use the same API methods to support bz2 (except for some format specific settings). This way, tarball expansion / contraction might make sense in S.I.C as part of bz2 (or even zip) compression / decompression. |
@ianhays I recently implemented a set of classes for manipulating tar files, including ustar and PAX (but not GNU, IIRC) support. We needed this for our Docker PowerShell cmdlets. They might be a good starting point for this work. https://github.com/Microsoft/Docker-PowerShell/tree/master/src/Tar
I should also note that tar archives are fundamentally different from zip archives in that they are stream-oriented and do not contain a central directory of files. This means that both The reality is that tar demands a very different interface from zip. You may want to look at my implementation for some ideas of what works better with tar. |
In my above comment (under "Implementation") I was operating under the tentative plan that Entry indexing would require a seekable stream or throw an exception if it wasn't, but as you said this isn't likely to be frequently done since it will nearly always be wrapped in a GZip or LZMA stream. It may be worth adding anyways to cover edge cases, but I doubt it. Enumerating entries would be the preferred way of reading the archive.
The nice thing about not having a common parent with ZipArchive is that we can diverge the interface where it's necessary. While it would be ideal to have the API be similar, it isn't required. That said, I think we can at least keep the TarArchive/TarArchiveEntry structure if we just make some tweaks.
Thanks @jstarks, that looks very close to what I had in mind with the exception of some minor API differences (e.g. a unified |
Per discussion with @ianhays a conservative estimate for all this is 5 weeks, if it was forward only that would be less time. |
+1 on requesting this support on .NET Core. This would be super helpful for any cloud service that's trying to untar developer-uploaded files from a Linux machine. We are one of them (using .NET Core). We would really prefer to avoid picking third-party libraries or resorting to running a shell process to achieve this. This becomes more crucial because file permissions on zip archives are not set correctly for files archived on Linux. The fact that neither this nor zip is fully functional out of the box on Linux makes it hard to support the Linux platform on our service in a clean way.
Triage: |
As the author of Sharpcompress, my key wish for this and Zip is to make the API stream oriented and not require a seekable stream. Then I can base my code on yours or not support it altogether! |
Transferring to the dotnet/runtime repo. |
Any chance this may be prioritized for .NET 6? |
@stephentoub @danmosemsft Can we make an API proposal based on @ianhays and @adamhathcock ? |
@deinok it's your preference. We usually like to have the latest proposal maintained in the top post, so in this case it may be easiest to create a new proposal using the template below and link to this one, which we can close, but you can also maintain it in a new comment here. To approve the API we also need to have a pretty good idea of how we would implement it. In the platform, it needs to be fast and stable. Any parser needs to be secure and well fuzz tested. There's some existing code mentioned above -- it would be interesting to know whether it is compatibly licensed, xplat, high performance, stable, etc., or whether the right approach would be to reimplement it using existing code as appropriate. Realistically, if it were to be in .NET 6, it would likely be a community effort. It does seem reasonable to me in principle to have it in the platform notwithstanding that.
I should note that @ericstj and @carlossanlop own this area not me. Note that @ianhays is no longer on the team - but if you're reading this Ian, I hope all is well with you! |
Okay, thanks for the help @danmosemsft. I will make a proposal in a new issue. Until then, let's keep this open.
I'm happy to help design and write/donate code for various algorithms and formats as I'm interested in a streaming API. I looked at reusing the current zip implementation in the runtime but it's all file or seeking based. |
The issue description has been updated based on the initial proposal by @ianhays, and the feedback we received.
@carlossanlop I wonder whether it's worth moving the above into a new issue, and linking to/closing this one? That way it's the topmost entry.
@carlossanlop First of all I'd like to state that both me and @Numpsy have done bugfixes etc. on the Tar APIs, but we are not the authors. The code base is almost 20 years old... (😲).
Supporting all "formats" is probably a lot easier to do from the start than it would be to add them gradually. That wouldn't mean supporting all the different extensions (which are just "unknown" file types), just the general structure. This might be what was intended by "Add tar archival/de-archival support for GNU tar". |
@carlossanlop Please take from SharpCompress whatever you need. One thing I notice is that it's unclear if you're supporting Forward-only (network stream) scenarios for Reading and/or Writing. Is I would also recommend the As for compression/writing, again I recommend having a separate I do think it's easier if you cover Streaming (forward-only) scenarios first, then build Random Access ones on top of that. Zip should be retrofitted onto that. Though Zip is a bit more challenging, as it only supports streaming scenarios if the file entries have a suffix trailer telling you the expected size vs having to read it from the dictionary. Obviously, I'm biased and I think the basic strategy I've taken with SharpCompress is the best if you want to support Streaming (forward-only) scenarios. If there's things I'm missing, I'm happy to be wrong. If you choose not to support streaming, I'd like more of the internals to be exposed for the file formats/compression algos so that I can base SharpCompress on your specific file format implementations.
@carlossanlop et al. I edited the issue description to promote the updated/latest proposal up to the top (instead of closing this and creating a new issue). |
Hey everyone, I created a new issue with an updated proposal to get fresh feedback: #65951 I'm closing this issue but I'll link it in the new description. |
Update: New proposal here: #65951
Summary
The TAR archive format is commonly used in Unix/Linux-native workloads. .NET applications should be able to produce and consume these archives with built-in APIs that support the most frequently used TAR features and variations.
API Proposal
Reading APIs
We could gradually offer functionality. Initially, we must offer APIs that can read archives.
Writing APIs
The next step would be to add writing capabilities:
Static APIs - Sync
These APIs were heavily inspired by the ZipFile APIs. The following static methods would be powerful because they would be able to decompress the file, then read the internal tar.
We are unsure whether, from the perspective of API design, it makes sense to mix purposes.
Static APIs - Async
Static extension APIs - Sync
These extension APIs are similar to the ZipArchiveEntry ones. We could directly add these methods to the TarArchiveEntry class instead of making them extensions, since we are currently designing it all at the same time. The overwriteFiles boolean argument should be clearly documented with warnings about potential tarbomb behavior.
Static extension APIs - Async
Usage examples
Here is a basic example of opening a tar.gz file for reading. First we decompress the gzip, then we read the archive.
TODO: More examples to come.
Tar format description
Optional read. Feel free to skip.
A tar archive is a linear sequence of blocks. Each block consists of a header and the file contents described by that header.
The blocks are aligned to a fixed block size, usually 512 bytes. In other words, the space occupied by the file contents needs to be a multiple of the block size, which is achieved by appending trailing null bytes to the file contents when necessary.
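The padding rule above comes down to rounding up to the next multiple of the block size. This is a minimal, language-neutral sketch in Python, not the proposed .NET API. The names `BLOCK_SIZE`, `padded_size` and `padding_needed` are hypothetical, and 512 is assumed as the block size.

```python
BLOCK_SIZE = 512  # assumed: the usual tar block size


def padded_size(content_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Round content_size up to the next multiple of block_size."""
    # Ceiling division written with integer arithmetic only.
    return -(-content_size // block_size) * block_size


def padding_needed(content_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of trailing null bytes to append after the file contents."""
    return padded_size(content_size, block_size) - content_size
```

An empty file needs no padding at all, while a 700-byte file occupies two full blocks on disk.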
The header describes the metadata of the file contents (filename, mode, uid, gid, size, last modification time, etc.). The size of a header is fixed. Its fields all have a predefined max size.
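To make the fixed-size header concrete, here is a rough Python sketch that decodes a few fields from a 512-byte header. It is not the proposed .NET API. The offsets are the ones shared by the v7 and ustar layouts, and the names `FIELDS` and `parse_header` are hypothetical. It assumes numeric fields are NUL- or space-terminated octal ASCII, so the GNU base-256 encoding for very large values is not handled.

```python
import tarfile

HEADER_SIZE = 512

# field name -> (offset, length); layout shared by v7 and ustar headers
FIELDS = {
    "name": (0, 100),
    "mode": (100, 8),
    "uid": (108, 8),
    "gid": (116, 8),
    "size": (124, 12),
    "mtime": (136, 12),
}


def _octal(raw: bytes) -> int:
    """Decode an octal ASCII field padded with spaces and/or NULs."""
    digits = raw.strip(b" \0")
    return int(digits, 8) if digits else 0


def parse_header(block: bytes) -> dict:
    """Decode a handful of fields from one 512-byte tar header."""
    if len(block) != HEADER_SIZE:
        raise ValueError("a tar header is exactly 512 bytes")
    fields = {}
    for key, (offset, length) in FIELDS.items():
        raw = block[offset:offset + length]
        if key == "name":
            # Text fields end at the first NUL byte.
            fields[key] = raw.split(b"\0", 1)[0].decode("utf-8")
        else:
            fields[key] = _octal(raw)
    return fields


# Build a sample ustar header with the standard library, then decode it.
info = tarfile.TarInfo("hello.txt")
info.size = 5
info.mode = 0o644
fields = parse_header(info.tobuf(format=tarfile.USTAR_FORMAT))
```

Because every field sits at a fixed offset, a reader can decode a header without any lookahead.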
The file contents can be 0 or more raw bytes, representing the contents of the file.
If the block represents a directory, the size of the file contents can be 0. It is non-zero when the contents hold a list of the filesystem entries inside that directory, which some tar format versions allow.
A tar archive is navigated by jumping from header to header. The beginning of the next header can be found by adding up the fixed size of a header plus the size of the file contents, minding the block size padding.
Tar archives do not contain a central directory like zip archives. A zip central directory is an uncompressed region of the zip archive that indicates the total number of files in the archive. If the user wants to know the total number of files contained in a tar archive, the whole archive needs to be traversed to count the total number of block headers found.
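The header-to-header traversal and the cost of counting entries can be sketched as follows. This is illustrative Python, not the proposed .NET API, and `count_entries` is a hypothetical helper. It assumes plain ustar entries, so pax or GNU metadata headers would be counted as entries too, and for simplicity it stops at the first all-zero block rather than checking for two.

```python
import io
import tarfile

BLOCK = 512


def count_entries(stream) -> int:
    """Count header blocks by walking a tar stream front to back."""
    count = 0
    while True:
        header = stream.read(BLOCK)
        # An all-zero block marks the end of the archive; a short read
        # means the stream ended or was truncated.
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            break
        count += 1
        # The size field is octal ASCII at offset 124, 12 bytes long.
        size = int(header[124:136].strip(b" \0") or b"0", 8)
        # Skip the contents plus padding by reading, so this also works
        # on forward-only (non-seekable) streams.
        remaining = -(-size // BLOCK) * BLOCK
        while remaining:
            chunk = stream.read(min(remaining, 64 * 1024))
            if not chunk:
                break
            remaining -= len(chunk)
    return count


# Build a small in-memory archive to walk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"x" * 600)]:
        member = tarfile.TarInfo(name)
        member.size = len(data)
        tar.addfile(member, io.BytesIO(data))
buf.seek(0)
total = count_entries(buf)
```

The count is only known after the whole stream has been consumed, which is why a `Count`-style property is awkward for tar.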
The tar spec was not designed to include compression capabilities, but tars are commonly combined with a compression method. The most popular method is to first generate the tar file, then compress it, usually with GZip (.tar.gz) or with LZMA (.tar.xz). While this method simplifies and separates the archival and compression stages, it also means that the only way the user can read the contents of the tar file is by decompressing it first.
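The layering described above (decompress first, then read the archive) can be illustrated with Python's standard library. This is a sketch of the general idea, not the proposed .NET API. The helper names `make_tar_gz` and `list_tar_gz` are hypothetical. The `"r|"` mode tells `tarfile` to treat its input as a forward-only stream of uncompressed tar blocks.

```python
import gzip
import io
import tarfile


def make_tar_gz(files: dict) -> bytes:
    """Archive first, then compress the finished tar as a whole."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, data in files.items():
            member = tarfile.TarInfo(name)
            member.size = len(data)
            tar.addfile(member, io.BytesIO(data))
    return gzip.compress(raw.getvalue())


def list_tar_gz(data: bytes) -> list:
    """Decompress, then read entries in order without seeking."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as unzipped:
        with tarfile.open(fileobj=unzipped, mode="r|") as tar:
            return [(member.name, member.size) for member in tar]


archive = make_tar_gz({"a.txt": b"hello", "dir/b.txt": b"world!"})
listing = list_tar_gz(archive)
```

The tar reader never sees compressed bytes. It only sees what the gzip layer hands it, which is why entry enumeration over a compressed archive is naturally forward-only.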
Another not-so-common method is to compress the file contents individually, leaving the header readable by the user. The reason why it's not so common is because the header offers no field to indicate which compression method was used to compress each file contents block, so the user needs to preserve that information somewhere else.
There are multiple versions of the tar format: v7, ustar, pax, gnu, oldgnu, solaris, aix, macosx. We should focus on v7, ustar, pax and gnu.
Sources:
Open questions
Tar versions
Assembly
TarArchiveEntry
EntryType property? I'd say yes, especially because some entries are LongLink and the actual entry is expected to be located in the next position.
EntryType values should be allowed? Can the user programmatically add a Block, Fifo, Contiguous, Character entry?
FullName to be consistent with other full path properties. But should we instead use FullPath?
TryGetNextEntry? EOF is marked in a tar with two 512-byte blocks filled with nulls.
TarOptions
init? Consider that the TarArchive would cache it, but it may not make sense for the user to be able to change the value of the cached options.
Static APIs
TarOptions) one for extraction and another one for creation? We can pass an instance of this class as an argument, and have only one method, instead of several overloads. This would be helpful in case we grow the options in the future.
Compression
Security
Entries property, like in zip. This is because we don't have a central directory. If we receive a network stream, we wouldn't be able to know the Count.
TarArchive is opened in Create or Update mode. This is because the assumption is that we will modify the tar file on dispose, either because we want to add new entries, or because we want to delete existing entries.
Entries property that can only be used if the stream is a seekable FileStream, in which case we can use the new RandomAccess APIs to get the files.
Mode, Uname and/or GName that does not match that of the current user, should we allow the user to read/update/delete/extract that entry, or should we forbid access to it?
Testing
tar command, which generates gnutar files.