Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SARIF has per-line rolling (partial) hash support #2605

Merged
merged 23 commits into from
Jan 26, 2023

Conversation

suvamM
Copy link
Collaborator

@suvamM suvamM commented Jan 12, 2023

This PR adds a new hash utility to compute per-line rolling (partial) hashes for a given file. The PR also provides a port of a numeric library for representing and operating on 64-bit two's complement integers.

NuGet.Config Outdated Show resolved Hide resolved
src/Sarif/FileRegionsCache.cs Outdated Show resolved Hide resolved
src/Sarif/FileRegionsCache.cs Outdated Show resolved Hide resolved
src/Sarif/HashUtilities.cs Outdated Show resolved Hide resolved
src/Sarif/FileRegionsCache.cs Fixed Show fixed Hide fixed
src/Sarif/FileRegionsCache.cs Fixed Show fixed Hide fixed
src/Sarif/FileRegionsCache.cs Fixed Show fixed Hide fixed
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

using System;
using System.Collections.Generic;

namespace Microsoft.CodeAnalysis.Sarif.Numeric

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you delete this and just use C# long or ulong? The code does some funny things that look like it was ported from JavaScript.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KalleOlaviNiemitalo we did this port of the Long library to align the semantics with that of a Javascript implementation. The C# long was causing a behavior drift.

Copy link

@KalleOlaviNiemitalo KalleOlaviNiemitalo Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This being a port of an Apache-2.0 licensed library, do you need to add third-party notices?

Do the rolling hashes need to be compatible with some other implementation, is the algorithm documented?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RollingHash code seems to be a port of /~https://github.com/github/codeql-action/blob/v2.1.39/src/fingerprints.ts.

What is the intended use of these hashes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. CodeQL has shipped a very nice rolling hash that they've used historically for result matching. GHAS understands this fingerprint natively. Our goal here is to broaden the availability of this hash mechanism to other scenarios.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelcfanning Do we need to add any third-party notices as @KalleOlaviNiemitalo points out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see section 4 Redistribution of the apache license for this module. Let's get this in and immediately follow up with this.

@suvamM suvamM changed the title Adding rolling hash algorithm SARIF has per-line rolling (partial) hash support Jan 23, 2023
@suvamM suvamM marked this pull request as ready for review January 23, 2023 21:46
Comment on lines 18 to 19
private static readonly Dictionary<int, Long> INT_CACHE = new Dictionary<int, Long>();
private static readonly Dictionary<uint, Long> UINT_CACHE = new Dictionary<uint, Long>();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These dictionaries are not thread-safe. If they are mutated from multiple threads without locking, it can cause the dictionary to be corrupted such that every thread that subsequently tries to use it falls in an infinite loop.

At minimum, you need to use locks or switch to ConcurentDictionary or just a fixed-size array. But I'd really prefer it if Long were removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KalleOlaviNiemitalo Good catch. Switched to using ConcurrentDictionary. Unfortunately, as pointed out in a different comment, we need to use this port of Long to align the semantics with the CodeQL implementation. So the better option is to make this class thread safe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KalleOlaviNiemitalo @michaelcfanning Could you kindly sign off on this change?

}
}
}
catch (IOException) { }
catch (UnauthorizedAccessException) { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dictionary

this method really argues for an xml doc comment.

Copy link
Member

@michaelcfanning michaelcfanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@michaelcfanning michaelcfanning merged commit e35f895 into main Jan 26, 2023
@michaelcfanning michaelcfanning deleted the users/suvam/rolling-hash branch January 26, 2023 00:43
This was referenced Feb 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants