Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Tar APIs #67883

Merged
merged 48 commits into from
Apr 21, 2022
Merged

Implement Tar APIs #67883

merged 48 commits into from
Apr 21, 2022

Conversation

carlossanlop
Copy link
Member

@carlossanlop carlossanlop commented Apr 12, 2022

Fixes #65951

Approved proposal in this comment: #65951 (comment)

It's ready to review but I still have some small TODOs to address.

  • Some tests depend on the runtime-assets that need to be merged first: Add System.Formats.Tar test assets runtime-assets#238
  • Need to implement the latest stream-based TarFile methods.
  • Async methods (might submit them in a separate PR if that's ok).
  • Some Unix-specific syscall issues. @eerhardt @tmds I would appreciate if you could help me with those. I'll tag you.

cc @baronfel

@ghost
Copy link

ghost commented Apr 12, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #65951

It's ready to review but I still have some small TODOs to address.

  • Some tests depend on the runtime-assets that need to be merged first: Add System.Formats.Tar test assets runtime-assets#238
  • Need to implement the latest stream-based TarFile methods.
  • Async methods (might submit them in a separate PR if that's ok).
  • Some Unix-specific syscall issues. @eerhardt @tmds I would appreciate if you could help me with those. I'll tag you.

cc @baronfel

Author: carlossanlop
Assignees: carlossanlop
Labels:

area-System.IO

Milestone: 7.0.0

@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.


using System.IO;

namespace System.Formats.Tar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we share the existing WrappedStream.cs by moving it into Common?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually have one already being shared, but I needed to slightly modify it for my own purposes and didn't want to break the other tests depending on it. So that's why I had to copy it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very similar. I wonder if there's a way to parameterize a single copy.

@danmoseley
Copy link
Member

I didn't review properly - someone else will need to 😃

get => _header._aTime;
set
{
// TODO: Is there a max value?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general what's the plan for all these TODOs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address them before merging.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we start getting close to the Preview 4 snap such that the merge is at-risk, I'd also be open to filing a follow-up issue for any of the TODOs that aren't blocking, for them to be addressed in Preview 5.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can collect a list of the pending TODOs into a comment so we can document them in this preview as features that won't yet be fully working.

Comment on lines +626 to +641
StringSplitOptions splitOptions = StringSplitOptions.RemoveEmptyEntries;

string[] attributeArray = nextLine.Split(' ', 2, splitOptions);
if (attributeArray.Length != 2)
{
return false;
}

string[] keyAndValueArray = attributeArray[1].Split('=', 2, splitOptions);
if (keyAndValueArray.Length != 2)
{
return false;
}

key = keyAndValueArray[0];
value = keyAndValueArray[1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be much more efficient to parse this with IndexOf rather than Split

Comment on lines 17 to 21
private static readonly byte[] s_paxMagic = new byte[] { 0x75, 0x73, 0x74, 0x61, 0x72, 0x0 }; // "ustar\0"
private static readonly byte[] s_paxVersion = new byte[] { 0x30, 0x30 }; // "00"

private static readonly byte[] s_gnuMagic = new byte[] { 0x75, 0x73, 0x74, 0x61, 0x72, 0x20 }; // "ustar "
private static readonly byte[] s_gnuVersion = new byte[] { 0x20, 0x0 }; // " \0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be spans instead? e.g.

private static ReadOnlySpan<byte> PaxMagic => new byte[] { ... };

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the suggestion is a property. Is this smart enough to cache the array, instead of returning new byte[] every time it is called?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this, but I'm still curious. I suspect the answer is that it gets cached.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roslyn has an optimization to refer to the static data segment of the assembly. So no allocation happens here.
See sharplab.

checksum += WriteRightAlignedBytesAndGetChecksum(uidBytes, _uidBytes);

byte[] gidBytes = TarHelpers.GetAsciiBytes(TarHelpers.ConvertDecimalToOctal(_gid));
checksum += WriteRightAlignedBytesAndGetChecksum(gidBytes, _gidBytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll stop commenting on these, too, but there are more efficient ways to do this rather than allocating new byte[]s each time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed with ArrayPool.

@carlossanlop
Copy link
Member Author

carlossanlop commented Apr 14, 2022

I seem to have generated the tar and tar.gz files from runtime-testdata with the wrong uid and gid, so some tests will fail due to that. I'll have to generate them again with the script and update the package here.

Edit: Fixed.

// If the permissions weren't set at all, don't write the file's permissions.
if (permissions != 0)
{
Interop.CheckIo(Interop.Sys.FChMod(handle, permissions), destinationFileName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike tar this doesn't respect umask while extracting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my elaborate comment here: #67883 (comment).

I think we want to respect the umask since that is something that protects the user for unintentionally opening up permissions.

To do that, you can call the internal SafeHandle.Open on Unix that accepts Interop.Sys.Permissions:

internal static SafeFileHandle Open(string fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, long preallocationSize,
Interop.Sys.Permissions openPermissions = DefaultOpenPermissions,
Func<Interop.ErrorInfo, Interop.Sys.OpenFlags, string, Exception?>? createOpenException = null)
{
return Open(fullPath, mode, access, share, options, preallocationSize, openPermissions, out _, out _, createOpenException);
}

Per that comment, at some point we'll also want to support applying all permissions (including sticky, setgroup, setuser), and setting ownership (uid, gid). The API will need to be extended to support that use-case.

cc @eerhardt

const int ExtractPermissionMask = 0x1FF;
int permissions = (int)Mode & ExtractPermissionMask;

// If the permissions weren't set at all, don't write the file's permissions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// If the permissions weren't set at all

Why are we ignoring permissions when none are set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your other comment. This is probably wrong so I'll fix it.

@tmds
Copy link
Member

tmds commented Apr 14, 2022

tar files have permissions and ownership information.
The tar command has flags that controls these.

By default:

  • For a regular user, tar will not set the owner, and filter out priviledged bits from the permissions (because the user can't set them). The process umask is respected.
  • For root, tar will set the owner, and set all permission bits. The process umask is ignored.

The implementation isn't considering ownership at the moment, and the permission bits get filtered for priviledged bits but the umask gets ignored.

I think we should default to the regular user tar behavior.

To be able to opt into the other mode, the API will need to be extended. This doesn't need to happen now if only the default mode should be supported. Though maybe you want to take it into account already, and for example introduce ExtractOptions and add Overwrite to it instead of making it an argument.

/// <item>In all platforms: <see cref="TarEntryType.Directory"/>, <see cref="TarEntryType.HardLink"/>, <see cref="TarEntryType.SymbolicLink"/>, <see cref="TarEntryType.RegularFile"/>.</item>
/// <item>In Unix platforms only: <see cref="TarEntryType.BlockDevice"/>, <see cref="TarEntryType.CharacterDevice"/> and <see cref="TarEntryType.Fifo"/>.</item>
/// </list>
/// </remarks>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to use inheritdoc and avoid the need of having dedicated docs for methods that just call base class implementation?


public Task ExtractToFileAsync(string destinationFileName, bool overwrite, CancellationToken cancellationToken = default)
{
throw new NotImplementedException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it's on the TODO list?

Copy link
Member Author

@carlossanlop carlossanlop Apr 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, async will have to be implemented later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not going to implement these as part of this PR, I'd prefer they be removed, at least from the ref, until they're implemented. It's otherwise a bad experience to add new APIs that aren't yet implemented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. Removing them from the ref.

// - POSIX IEEE 1003.1-2001 ("POSIX.1") Pax Interchange Tar Format (pax).
// - GNU Tar Format (gnu).
// Documentation: https://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5
internal partial struct TarHeader
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this struct seems to be very large and I would not expect it to get us some perf. What are we getting by keeping it a struct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer if I use class? And maybe split it into internal class TarHeaderRead and internal class TarHeaderWrite?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect us to get no perf benefits from keeping it struct, while we can run into issues like modifying a copy instead:

TarHeader header = new();
header._mode = 1;

SetMode(header);

Console.WriteLine(header._mode); // prints 1 as we have passed struct copy to SetMode and modified a copy

static void SetMode(TarHeader h) => h._mode = 2;

if (value < DateTimeOffset.UnixEpoch)
{
throw new ArgumentOutOfRangeException(nameof(value));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking current time is expensive. IMO the current implementation (having only the UnixEpoch check) is all we need.

// Does not support writing.
internal sealed class SeekableSubReadStream : SubReadStream
{
public SeekableSubReadStream(Stream superStream, long startPosition, long maxLength)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call these something other than "sub" and "super"? I find the terminology confusing.

@stephentoub SuperSubmarineStream? ;)


_header = default;

_header._extendedAttributes = new Dictionary<string, string>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now I lazy-initialize this field.

@carlossanlop have you pushed these changes? I can still see _header._extendedAttributes = new Dictionary<string, string>();

public int Uid
{
get => _header._uid;
set => _header._uid = value;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we re-propose this as uint instead of int?

That would be my preference, but get ready for a lot of pushback at the review :)

// If the path can be extracted in the specified destination directory, returns the full path with sanitized file name. Otherwise, throws.
static string GetSanitizedFullPath(string destinationDirectoryFullPath, string path, string exceptionMessage)
{
string actualPath = Path.Join(Path.GetDirectoryName(path), ArchivingUtils.SanitizeEntryFilePath(Path.GetFileName(path)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf: since Path.GetFileName returns only a substring of the provided path, we could use the overload that returns ReadOnlySpan<char> here and make SanitizeEntryFilePath accept ROS and avoid additional string allocation

{
string actualPath = Path.Join(Path.GetDirectoryName(path), ArchivingUtils.SanitizeEntryFilePath(Path.GetFileName(path)));

if (!Path.IsPathFullyQualified(actualPath))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for my education: for what input this could be true?

{
throw new IOException(string.Format(SR.IO_AlreadyExists_Name, filePath));
}
File.Delete(filePath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we like to remove the file now, if the method can still throw few checks later?

If we want to overwrite a file we can use FileMode.Create when creating a FileStream:

Specifies that the operating system should create a new file. If the file already exists, it will be overwritten.

Comment on lines +75 to +80
if (Path.Exists(destinationFileName))
{
throw new IOException(string.Format(SR.IO_FileExists_Name, destinationFileName));
}

using FileStream fs = File.Create(destinationFileName, bufferSize: 0x1000, FileOptions.None);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you create FileStream with FileMode.CreateNew it's going to throw if the file already exists.

Suggested change
if (Path.Exists(destinationFileName))
{
throw new IOException(string.Format(SR.IO_FileExists_Name, destinationFileName));
}
using FileStream fs = File.Create(destinationFileName, bufferSize: 0x1000, FileOptions.None);
using FileStream fs = new (destinationFileName, FileMode.CreateNew, FileAccess.Write);

Debug.Assert(entryNameLength > 0);

bool isDirectory = file.Attributes.HasFlag(FileAttributes.Directory);
string entryName = ArchivingUtils.EntryFromPath(file.FullName, basePath.Length, entryNameLength, ref entryNameBuffer, appendPathSeparator: isDirectory);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to your PR: I find it unintuitive that a buffer rented from array pool is passed by reference to a helper method. I needed to verify that the helper method returns the buffer to the pool in case it's too small.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything I should do on my code to help improve its readability? Or maybe should we open an issue to discuss it separately?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just do the rent/return inside EntryFromPath as an implementation detail.

// - POSIX IEEE 1003.1-2001 ("POSIX.1") Pax Interchange Tar Format (pax).
// - GNU Tar Format (gnu).
// Documentation: https://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5
internal partial struct TarHeader
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect us to get no perf benefits from keeping it struct, while we can run into issues like modifying a copy instead:

TarHeader header = new();
header._mode = 1;

SetMode(header);

Console.WriteLine(header._mode); // prints 1 as we have passed struct copy to SetMode and modified a copy

static void SetMode(TarHeader h) => h._mode = 2;

@carlossanlop
Copy link
Member Author

The 3 test failures are unrelated to my changes.

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, offline, let's merge it now, include in Preview4 and address remaining feedback later (#68230).

Big thanks for you work @carlossanlop !

@carlossanlop
Copy link
Member Author

Thanks Adam! I'll add your latest feedback to the follow-up issue checklist.

@carlossanlop carlossanlop merged commit 2389815 into dotnet:main Apr 21, 2022
@carlossanlop carlossanlop deleted the tar branch April 21, 2022 16:01
@carlossanlop
Copy link
Member Author

/backport to release/7.0-preview4

@github-actions
Copy link
Contributor

Started backporting to release/7.0-preview4: /~https://github.com/dotnet/runtime/actions/runs/2203040454

@github-actions
Copy link
Contributor

@carlossanlop backporting to release/7.0-preview4 failed, the patch most likely resulted in conflicts:

$ git am --3way --ignore-whitespace --keep-non-patch changes.patch

Applying: Implement Tar APIs
Applying: Address some comment suggestions.
Applying: Add SystemFormatsTarTestData package dependency entries and versions.
Applying: Merge the major and minor p/invokes into a single one.
Applying: Address exception and nullability suggestions.
Applying: Adjust tests for the latest suggestions.
Applying: Document elevation requirement to extract device files on Unix.
Applying: Add tests to verify extraction without elevation throws for device files. Rename file for TarFile Extraction to filesystem.
Applying: Add separate expected mode for special files from assets (644).
Applying: Bump assets version to one with the fix
error: sha1 information is lacking or useless (eng/Versions.props).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0010 Bump assets version to one with the fix
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

carlossanlop added a commit to carlossanlop/runtime that referenced this pull request Apr 21, 2022
carlossanlop added a commit that referenced this pull request Apr 22, 2022
* Implement Tar APIs (#67883)

API proposal: #65951

* Add assembly to NetCoreAppLibrary.props

* Remove <PackageDescription> from src csproj since it is not OOB.

* Add NetCoreAppCurrent to src csproj TargetFrameworks

* Additional src csproj changes.

* Allow sharing of input tar file for read

Co-authored-by: carlossanlop <carlossanlop@users.noreply.github.com>
Co-authored-by: Dan Moseley <danmose@microsoft.com>
@ghost ghost locked as resolved and limited conversation to collaborators Jun 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[API Proposal]: APIs to support tar archives