SARIF has per-line rolling (partial) hash support #2605

Merged: 23 commits, Jan 26, 2023
Commits
cda8b9e
starting rolling hash implementation: added computeFirstMod and data …
suvamM Dec 17, 2022
b83c52e
starting rolling hash implementation: added computeFirstMod and data …
suvamM Dec 17, 2022
ed02fef
Merge branch 'users/suvam/rolling-hash' of https://github.com/microso…
suvamM Jan 10, 2023
ddb0a20
[wip] added hashing algorithm to file regions
suvamM Jan 12, 2023
69ec96e
[wip] Fixes to the rolling hash computation
suvamM Jan 17, 2023
8518dbe
[wip] fixes to the hashing algorithm.
suvamM Jan 17, 2023
95b3a0b
[wip] Moving hash computation to HashUtilities
suvamM Jan 18, 2023
96a4d57
Porting tests from CodeQL repo
suvamM Jan 21, 2023
b7c9fd3
Adding unit tests for rolling hash
suvamM Jan 23, 2023
b974cf9
Adding comments
suvamM Jan 23, 2023
77b5297
[wip] added hashing algorithm to file regions
suvamM Jan 12, 2023
822c0bb
[wip] Fixes to the rolling hash computation
suvamM Jan 17, 2023
b34980d
[wip] fixes to the hashing algorithm.
suvamM Jan 17, 2023
e52643a
[wip] Moving hash computation to HashUtilities
suvamM Jan 18, 2023
9b31212
Porting tests from CodeQL repo
suvamM Jan 21, 2023
095df5d
Adding unit tests for rolling hash
suvamM Jan 23, 2023
6cd2eec
Adding comments
suvamM Jan 23, 2023
983f8e5
Merge branch 'users/suvam/rolling-hash' of https://github.com/microso…
suvamM Jan 23, 2023
b1169f9
removing generics in file regions cache
suvamM Jan 23, 2023
b64c8c9
updating Release History
suvamM Jan 23, 2023
060174b
Merge branch 'main' into users/suvam/rolling-hash
suvamM Jan 25, 2023
a674a0c
incorporating PR feedback
suvamM Jan 25, 2023
573291a
format fixing Long
suvamM Jan 25, 2023
2 changes: 1 addition & 1 deletion NuGet.Config
@@ -1,4 +1,4 @@
-<?xml version="1.0" encoding="UTF-8"?>
+<?xml version="1.0" encoding="utf-8"?>
<configuration>
<packageSources>
<clear />
1 change: 1 addition & 0 deletions src/ReleaseHistory.md
@@ -2,6 +2,7 @@

## **v3.2.0** (UNRELEASED)

* FEATURE: Allow per-line rolling (partial) hash computation for a file. [#2605](https://github.com/microsoft/sarif-sdk/pull/2605)
* BREAKING: Rename `--normalize-for-github` argument to `--normalize-for-ghas` for `convert` command and mark `--normalize-for-github` as obsolete. [#2581](https://github.com/microsoft/sarif-sdk/pull/2581)
* BREAKING: Update `IAnalysisContext.LogToolNotification` method to add `ReportingDescriptor` parameter. This is required in order to populate `AssociatedRule` data in `Notification` instances. The new method has a default value of null for the `associatedRule` parameter to maximize build compatibility. [#2604](https://github.com/microsoft/sarif-sdk/pull/2604)
* BREAKING: Correct casing of `LogMissingreportingConfiguration` helper to `LogMissingReportingConfiguration`. [#2599](https://github.com/microsoft/sarif-sdk/pull/2599)
139 changes: 139 additions & 0 deletions src/Sarif/HashUtilities.cs
@@ -10,12 +10,22 @@
using System.Text;
using System.Threading.Tasks;

using Microsoft.CodeAnalysis.Sarif.Numeric;

namespace Microsoft.CodeAnalysis.Sarif
{
    public static class HashUtilities
    {
        static HashUtilities() => FileSystem = Sarif.FileSystem.Instance;

        private static readonly int TAB = '\t';
        private static readonly int SPACE = ' ';
        private static readonly int LF = '\n';
        private static readonly int CR = '\r';
        private static readonly int EOF = 65535;
        private static readonly int BLOCK_SIZE = 100;
        private static readonly Long MOD = new Long(37, 0, false);

        private static IFileSystem _fileSystem;
        internal static IFileSystem FileSystem
        {
@@ -206,5 +216,134 @@ public static string ComputeMD5Hash(string fileName)
            catch (UnauthorizedAccessException) { }
Review comment (Member):

Dictionary

this method really argues for an xml doc comment.

            return md5;
        }

        /// <summary>
        /// Computes a rolling (partial) hash for each line of <paramref name="fileText"/>.
        /// Each hash covers a window of BLOCK_SIZE (100) characters that begins at the
        /// line's first character that is not a space or tab. Spaces and tabs are ignored,
        /// CR and CRLF line endings are treated as LF, and windows that run past the end
        /// of the input are padded.
        /// </summary>
        /// <param name="fileText">The file contents to hash. May be null.</param>
        /// <returns>
        /// A dictionary that maps 1-based line numbers to strings of the form "hash:count",
        /// where hash is the hexadecimal window hash and count is the number of times that
        /// hash value has been emitted so far for this input. The dictionary is empty when
        /// <paramref name="fileText"/> is null.
        /// </returns>
        public static Dictionary<int, string> RollingHash(string fileText)
        {
            Dictionary<int, string> rollingHashes = new Dictionary<int, string>();

            // A rolling view into the input
            int[] window = new int[BLOCK_SIZE];

            int[] lineNumbers = new int[BLOCK_SIZE];
            for (int i = 0; i < lineNumbers.Length; i++)
            {
                lineNumbers[i] = -1;
            }

            Long hashRaw = new Long(0, 0, false);
            Long firstMod = ComputeFirstMod();

            // The current index in the window, will wrap around to zero when we reach BLOCK_SIZE
            int index = 0;
            // The line number of the character we are currently processing from the input
            int lineNumber = 0;
            // Is the next character to be read the start of a new line
            bool lineStart = true;
            // Was the previous character a CR (carriage return)
            bool prevCR = false;

            Dictionary<string, int> hashCounts = new Dictionary<string, int>();

            // Output the current hash and line number to the cache
            Action outputHash = () =>
            {
                string hashValue = hashRaw.ToUnsigned().ToString(16);

                if (!hashCounts.ContainsKey(hashValue))
                {
                    hashCounts[hashValue] = 0;
                }

                hashCounts[hashValue]++;
                rollingHashes[lineNumbers[index]] = $"{hashValue}:{hashCounts[hashValue]}";
                lineNumbers[index] = -1;
            };

            // Update the current hash value and increment the index in the window
            Action<int> updateHash = (current) =>
            {
                int begin = window[index];
                window[index] = current;

                hashRaw = MOD.Multiply(hashRaw)
                    .Add(Long.FromInt(current))
                    .Subtract(firstMod.Multiply(Long.FromInt(begin)));

                index = (index + 1) % BLOCK_SIZE;
            };

            // First process every character in the input, updating the hash and lineNumbers
            // as we go. Once we reach a point in the window again then we've processed
            // BLOCK_SIZE characters and if the last character at this point in the window
            // was the start of a line then we should output the hash for that line.
            Action<int> processCharacter = (current) =>
            {
                // skip tabs, spaces, and line feeds that come directly after a carriage return
                if (current == SPACE || current == TAB || (prevCR && current == LF))
                {
                    prevCR = false;
                    return;
                }
                // replace CR with LF
                if (current == CR)
                {
                    current = LF;
                    prevCR = true;
                }
                else
                {
                    prevCR = false;
                }
                if (lineNumbers[index] != -1)
                {
                    outputHash();
                }
                if (lineStart)
                {
                    lineStart = false;
                    lineNumber++;
                    lineNumbers[index] = lineNumber;
                }
                if (current == LF)
                {
                    lineStart = true;
                }
                updateHash(current);
            };

            if (fileText != null)
            {
                for (int i = 0; i < fileText.Length; i++)
                {
                    processCharacter(fileText[i]);
                }

                processCharacter(EOF);

                // Flush the remaining lines
                for (int i = 0; i < BLOCK_SIZE; i++)
                {
                    if (lineNumbers[index] != -1)
                    {
                        outputHash();
                    }
                    updateHash(0);
                }
            }

            return rollingHashes;
        }

        /// <summary>
        /// Computes MOD raised to the power BLOCK_SIZE in <see cref="Long"/> arithmetic.
        /// This is the weight of the character that leaves the window on each update.
        /// </summary>
        private static Long ComputeFirstMod()
        {
            Long firstMod = new Long(1, 0, false);

            // Use BLOCK_SIZE rather than a literal 100 so the weight tracks the window size.
            for (int i = 0; i < BLOCK_SIZE; i++)
            {
                firstMod = firstMod.Multiply(MOD);
            }

            return firstMod;
        }
    }
}
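
The update in `updateHash` is a polynomial rolling hash: pushing `current` into the window and dropping `begin` turns the hash `H` into `MOD * H + current - MOD^BLOCK_SIZE * begin`. The following is a minimal sketch of that recurrence, not the SDK code. It assumes that `Long` behaves like a wrapping 64-bit integer (replaced here with `ulong` in `unchecked` blocks), uses a hypothetical window size of 4 to keep the example small, and omits the whitespace skipping and line-number bookkeeping. `RollingHashSketch`, `DirectHash`, and `RollingHashes` are illustrative names that do not exist in the SDK.

```csharp
using System;

public static class RollingHashSketch
{
    // Illustrative values: the PR uses a base of 37 and a window of 100 characters.
    private const ulong Base = 37;
    private const int BlockSize = 4;

    // Base^BlockSize with 64-bit wraparound: the weight of the character leaving the window.
    private static ulong ComputeFirstMod()
    {
        ulong result = 1;
        for (int i = 0; i < BlockSize; i++)
        {
            unchecked { result *= Base; }
        }
        return result;
    }

    // Hash of one full window computed from scratch (Horner's rule).
    public static ulong DirectHash(string window)
    {
        ulong hash = 0;
        foreach (char c in window)
        {
            unchecked { hash = hash * Base + c; }
        }
        return hash;
    }

    // Returns the rolling hash value after each character of the text has been pushed.
    public static ulong[] RollingHashes(string text)
    {
        ulong firstMod = ComputeFirstMod();
        ulong[] window = new ulong[BlockSize]; // zero-initialized, like the window in the PR
        ulong[] hashes = new ulong[text.Length];
        ulong hash = 0;
        int index = 0;

        for (int i = 0; i < text.Length; i++)
        {
            ulong begin = window[index]; // the character written BlockSize steps ago
            window[index] = text[i];
            unchecked { hash = Base * hash + text[i] - firstMod * begin; }
            index = (index + 1) % BlockSize;
            hashes[i] = hash;
        }

        return hashes;
    }

    public static void Main()
    {
        const string text = "int x = 0;";
        ulong[] rolling = RollingHashes(text);

        // Compare the rolling value with a from-scratch hash of the same window.
        for (int end = BlockSize; end <= text.Length; end++)
        {
            string window = text.Substring(end - BlockSize, BlockSize);
            ulong direct = DirectHash(window);
            Console.WriteLine($"'{window}' -> {direct:x} (match: {direct == rolling[end - 1]})");
        }
    }
}
```

Every window should report `match: True`, because subtracting `firstMod * begin` removes exactly the term that the oldest character contributes after the multiplication by the base. The PR runs the same recurrence with `MOD` = 37 and `BLOCK_SIZE` = 100 and layers the whitespace filtering, line-ending normalization, and per-line output on top of it.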