- Rust Performance Guide for Hayabusa Developers
- Table of Contents
- Author
- English translation
- About this document
- Speed improvement
- Reducing memory usage
- Benchmarking
- References
- Contributions
Fukusuke Takahashi
Zach Mathis (@yamatosecurity)
Hayabusa (English: "peregrine falcon") is a fast forensics analysis tool developed by the Yamato Security group in Japan. It is developed in Rust in order to (threat) hunt as fast as a peregrine falcon. Rust is a fast language in itself; however, there are many pitfalls that can result in slow speeds and high memory usage. We created this document based on actual performance improvements in Hayabusa (see the changelog here), but these techniques should be applicable to other Rust programs as well. We hope you can benefit from the knowledge we have gained through our trial and error.
Simply changing the default memory allocator may improve speed significantly. For example, according to these benchmarks, memory allocators such as jemalloc and mimalloc are much faster than the default memory allocator. We were able to confirm a significant speed improvement by changing our memory allocator from jemalloc to mimalloc, so we made mimalloc the default since version 1.8.0. (Although mimalloc does use slightly more memory than jemalloc.)
(You do not need to declare anything to use the default memory allocator.)
You only need to perform the following 2 steps in order to change the global memory allocator:
- Add the mimalloc crate to the Cargo.toml file's [dependencies] section:

  [dependencies]
  mimalloc = { version = "*", default-features = false }

- Define that you want to use mimalloc under #[global_allocator] somewhere in the program:

  use mimalloc::MiMalloc;

  #[global_allocator]
  static GLOBAL: MiMalloc = MiMalloc;
That is all you need to do to change the memory allocator.
How much speed improves will depend on the program, but in the following example, changing the memory allocator to mimalloc resulted in a 20-30% performance increase on Intel CPUs. (For some reason, the performance increase was not as significant on ARM-based macOS devices.)
Disk IO processing is much slower than processing in memory. Therefore, it is desirable to avoid IO processing as much as possible, especially in loops.
The example below shows a file being opened one million times in a loop:
use std::fs;

fn main() {
    for _ in 0..1000000 {
        let f = fs::read_to_string("sample.txt").unwrap();
        f.len();
    }
}
By opening the file outside of the loop as follows
use std::fs;

fn main() {
    let f = fs::read_to_string("sample.txt").unwrap();
    for _ in 0..1000000 {
        f.len();
    }
}
there will be about a 1000 times speed increase.
In the following example, IO processing that had been performed once per detection result was moved outside the loop:
This resulted in a speed improvement of about 20%.
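As a simplified sketch of this idea (the Detection type and file name below are hypothetical, not Hayabusa's actual code), the output is accumulated in memory inside the loop and written to disk in a single IO operation afterwards:

// Hypothetical detection result type, for illustration only.
struct Detection {
    msg: String,
}

fn main() {
    let results = vec![
        Detection { msg: String::from("detection 1") },
        Detection { msg: String::from("detection 2") },
    ];
    // Accumulate the output in memory inside the loop...
    let mut output = String::new();
    for d in &results {
        output.push_str(&d.msg);
        output.push('\n');
    }
    // ...and perform the file IO only once, outside the loop.
    std::fs::write("out.csv", output).unwrap();
}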
Regular expression compilation is a very costly process compared to regular expression matching. Therefore, it is advisable to avoid regular expression compilation as much as possible, especially in loops.
For example, the following code performs 100,000 regular expression matches in a loop, recompiling the regular expression on every iteration:
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    for _ in 0..100000 {
        if Regex::new(match_str).unwrap().is_match(text) { // Regular expression compilation in a loop
            println!("matched!");
        }
    }
}
By compiling the regular expression outside the loop, as shown below
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    let r = Regex::new(match_str).unwrap(); // Compile the regular expression outside the loop
    for _ in 0..100000 {
        if r.is_match(text) {
            println!("matched!");
        }
    }
}
the updated code is about 100 times faster.
In the following example, regular expression compilation is performed outside the loop and cached.
This resulted in significant speed improvements.
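A minimal sketch of such a cache (a HashMap keyed by the pattern string; not Hayabusa's actual implementation) might look like this: each pattern is compiled at most once, and the compiled Regex is reused on later lookups.

use regex::Regex;
use std::collections::HashMap;

fn main() {
    let mut cache: HashMap<String, Regex> = HashMap::new();
    let patterns = ["abc", "[0-9]+", "abc"]; // "abc" appears twice
    let text = "1234567890";
    for pattern in patterns {
        // Compile only on the first occurrence of each pattern.
        let re = cache
            .entry(pattern.to_string())
            .or_insert_with(|| Regex::new(pattern).unwrap());
        if re.is_match(text) {
            println!("{pattern} matched!");
        }
    }
}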
Without buffered IO, file IO is slow. With buffered IO, operations are performed through an in-memory buffer, reducing the number of system calls and improving speed.
For example, in the following process, write occurs 1,000,000 times.
use std::fs::File;
use std::io::Write;

fn main() {
    let mut f = File::create("sample.txt").unwrap();
    for _ in 0..1000000 {
        f.write_all(b"hello world!").unwrap(); // One write system call per iteration
    }
}
By using BufWriter as follows
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() {
    let f = File::create("sample.txt").unwrap();
    let mut writer = BufWriter::new(f);
    for _ in 0..1000000 {
        writer.write_all(b"hello world!").unwrap(); // Writes go to the in-memory buffer
    }
    writer.flush().unwrap();
}
there is about a 50 times speed improvement.
The method described above was implemented here
and has resulted in significant speed improvements in output processing.
While regular expressions can cover complex matching patterns, they are slower than standard String methods. Therefore, it is faster to use standard String methods for simple string matching such as the following.
- Starts-with matching (Regex: foo.*) -> String::starts_with()
- Ends-with matching (Regex: .*foo) -> String::ends_with()
- Contains matching (Regex: .*foo.*) -> String::contains()
For example, the following code performs ends-with matching with a regular expression one million times.
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = ".*abc";
    let r = Regex::new(match_str).unwrap();
    for _ in 0..1000000 {
        if r.is_match(text) {
            println!("matched!");
        }
    }
}
By using String::ends_with() as follows
fn main() {
    let text = "1234567890";
    let match_str = "abc";
    for _ in 0..1000000 {
        if text.ends_with(match_str) {
            println!("matched!");
        }
    }
}
processing will be 10 times faster.
Since Hayabusa requires case-insensitive string comparison, we call to_lowercase() first and then apply the methods above. Even with this overhead, in the following examples
- Improving speed by changing wildcard search process from regular expression match to starts_with/ends_with match #890
- Improving speed by using eq_ignore_ascii_case() before regular expression match #884
speed improved by about 15%.
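The second idea can be sketched as follows (a simplified illustration with hypothetical function names, not Hayabusa's actual code): when the pattern is a plain literal that the regular expression would also match, a cheap ASCII case-insensitive equality check provides a fast path, and the regular expression only runs when that check fails.

use regex::Regex;

// Try a cheap case-insensitive equality check before the regex (fast path).
fn is_match(target: &str, literal: &str, re: &Regex) -> bool {
    if target.eq_ignore_ascii_case(literal) {
        return true; // no regex evaluation needed
    }
    re.is_match(target) // expensive fallback
}

fn main() {
    let re = Regex::new("(?i)abc").unwrap(); // case-insensitive regex
    assert!(is_match("ABC", "abc", &re)); // handled by the fast path
    assert!(is_match("xxabcxx", "abc", &re)); // falls back to the regex
}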
Depending on the characteristics of the strings being handled, adding a simple filter may reduce the number of string matching attempts and speed up processing. If you frequently compare strings of varying lengths that usually do not match, you can use string length as a primary filter.
For example, the following code attempts one million regular expression matches.
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    let r = Regex::new(match_str).unwrap();
    for _ in 0..1000000 {
        if r.is_match(text) {
            println!("matched!");
        }
    }
}
By using String::len() as a primary filter, as shown below
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    let r = Regex::new(match_str).unwrap();
    for _ in 0..1000000 {
        if text.len() == match_str.len() { // Primary filter by string length
            if r.is_match(text) {
                println!("matched!");
            }
        }
    }
}
speed will improve by about 20 times.
In the following example, the above method is used.
This improved speed by about 15%.
Many articles on Rust performance optimization advise adding codegen-units = 1 under the [profile.release] section of Cargo.toml.
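The setting in question looks like this:

[profile.release]
codegen-units = 1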
This will cause slower compilation times, since the default is to compile in parallel, but in theory it should produce more optimized, faster code.
However, in our testing, Hayabusa actually ran slower with this option enabled, and compilation took longer, so we keep it off.
The resulting binary is about 100KB smaller, though, so the option may still be ideal for embedded systems where disk space is limited.
Using clone() or to_string() is an easy way to resolve compilation errors related to ownership. However, these calls usually result in high memory usage and should be avoided. It is always best to first check whether you can replace them with low-cost references.
For example, if you want to iterate the same Vec multiple times, you can use clone() to eliminate compilation errors:
fn main() {
    let lst = vec![1, 2, 3];
    for x in lst.clone() { // In order to eliminate compile errors
        println!("{x}");
    }
    for x in lst {
        println!("{x}");
    }
}
However, by using references as shown below, you can remove the need to use clone().
fn main() {
    let lst = vec![1, 2, 3];
    for x in &lst { // Eliminate compile errors with a reference
        println!("{x}");
    }
    for x in lst {
        println!("{x}");
    }
}
By removing the clone() usage, memory usage is reduced by up to 50%.
In the following example, by replacing unnecessary clone(), to_string(), and to_owned() usage,
we were able to significantly reduce memory usage.
Vec keeps all elements in memory, so it uses a lot of memory in proportion to the number of elements. If processing one element at a time is sufficient, then using an Iterator instead will use much less memory.
For example, the following return_lines() function reads a file of about 1 GB and returns a Vec:
use std::fs::File;
use std::io::{BufRead, BufReader};

fn return_lines() -> Vec<String> {
    let f = File::open("sample.txt").unwrap();
    let buf = BufReader::new(f);
    buf.lines()
        .map(|l| l.expect("Could not parse line"))
        .collect()
}

fn main() {
    let lines = return_lines();
    for line in lines {
        println!("{}", line)
    }
}
Instead, you should return an iterator as follows:
use std::fs::File;
use std::io::{BufRead, BufReader};

fn return_lines() -> impl Iterator<Item = String> {
    let f = File::open("sample.txt").unwrap();
    let buf = BufReader::new(f);
    // Return the iterator here instead of calling collect()
    buf.lines()
        .map(|l| l.expect("Could not parse line"))
}

fn main() {
    let lines = return_lines();
    for line in lines {
        println!("{}", line)
    }
}
Or if the type is different depending on which branch is taken, you can return a Box<dyn Iterator<Item = T>>
as follows:
use std::fs::File;
use std::io::{BufRead, BufReader};

fn return_lines(need_filter: bool) -> Box<dyn Iterator<Item = String>> {
    let f = File::open("sample.txt").unwrap();
    let buf = BufReader::new(f);
    if need_filter {
        let result = buf.lines()
            .filter_map(|l| l.ok())
            .map(|l| l.replace("A", "B"));
        return Box::new(result);
    }
    let result = buf.lines()
        .map(|l| l.expect("Could not parse line"));
    Box::new(result)
}

fn main() {
    let lines = return_lines(true);
    for line in lines {
        println!("{}", line)
    }
}
Memory usage drops significantly from 1 GB to only 3 MB.
The following example uses the method described above:
When tested on a 1.7GB JSON file, memory usage decreased by 75%.
When dealing with a large number of short strings of 24 bytes or less, the compact_str crate can be used to reduce memory usage.
In the example below, the Vec holds 10 million strings.
fn main() {
    let v: Vec<String> = vec![String::from("ABCDEFGHIJKLMNOPQRSTUV"); 10000000];
    // do some kind of processing
}
It is better to replace them with a CompactString:
use compact_str::CompactString;

fn main() {
    let v: Vec<CompactString> = vec![CompactString::from("ABCDEFGHIJKLMNOPQRSTUV"); 10000000];
    // do some kind of processing
}
By doing this, memory usage is reduced by around 50%.
In the following example, short strings are handled with CompactString:
This gave a reduction of memory usage by about 20%.
Structures that remain in memory for the lifetime of the process can affect overall memory usage. In Hayabusa (as of version 2.2.2), certain structures, such as the DetectInfo struct shown below, are retained in large numbers. Removing fields from these structures had a measurable effect on reducing overall memory usage.
For example, until version 1.8.1, the DetectInfo struct was defined as follows:
#[derive(Debug, Clone)]
pub struct DetectInfo {
    pub rulepath: CompactString,
    pub ruletitle: CompactString,
    pub level: CompactString,
    pub computername: CompactString,
    pub eventid: CompactString,
    pub detail: CompactString,
    pub record_information: CompactString,
    pub ext_field: Vec<(CompactString, Profile)>,
    pub is_condition: bool,
}
By deleting the record_information
field as follows
#[derive(Debug, Clone)]
pub struct DetectInfo {
    pub rulepath: CompactString,
    pub ruletitle: CompactString,
    pub level: CompactString,
    pub computername: CompactString,
    pub eventid: CompactString,
    pub detail: CompactString,
    // remove record_information field
    pub ext_field: Vec<(CompactString, Profile)>,
    pub is_condition: bool,
}
a reduction in memory usage of several bytes per detection result record was achieved.
In the following examples, when tested against data with about 1.5 million detection result records,
- Reduced memory usage of DetectInfo/EvtxRecordInfo #837
- Reduce memory usage by removing unnecessary regex #894
we were able to achieve about a 300MB reduction in memory usage.
Some memory allocators maintain their own memory usage statistics. For example, in mimalloc, the mi_stats_print_out() function can be called to obtain memory usage.
Prerequisites: You need to be using mimalloc as explained in the Change the memory allocator section.
- In the Cargo.toml file's [dependencies] section, add the libmimalloc-sys crate:

  [dependencies]
  libmimalloc-sys = { version = "*", features = ["extended"] }
- Whenever you want to print the memory usage statistics, call mi_stats_print_out() inside an unsafe block. The memory usage statistics will be output to standard out.

  use libmimalloc_sys::mi_stats_print_out;
  use std::ptr::null_mut;

  fn main() {
      // Write the following code where you want to measure memory usage
      unsafe {
          mi_stats_print_out(None, null_mut());
      }
  }
- The peak/reserved value in the upper left is the maximum memory usage.
The above implementation was applied in the following:
In Hayabusa, if you add the --debug option, memory usage statistics will be output at the end.
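For example, an illustrative invocation (the directory and output file names here are placeholders):

hayabusa csv-timeline -d sample -o out.csv --debug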
Various resource usage statistics can be obtained from the OS. When using them, the following two points should be noted.
- Influence from anti-virus software (Windows Defender)
- Only the first run is affected by the scan and is slower, so results from the second and subsequent runs after the build are suitable for comparison. (Or you can disable your anti-virus for more accurate results.)
- Influence from file caching
- Runs after the first one since OS startup are faster because evtx files and other file IO are read from the in-memory file cache, so the first run after the OS boots is the most suitable for benchmarking.
Prerequisites: The following procedure is only valid for Windows environments where PowerShell 7 is already installed.
- Restart the OS
- Run PowerShell 7's Get-Counter command, which will continuously record the performance counters every second to a CSV file. (If you would like to measure resources other than those listed below, this article is a good reference.)

  Get-Counter -Counter "\Memory\Available MBytes", "\Processor(_Total)\% Processor Time" -Continuous |
      ForEach { $_.CounterSamples | ForEach {
          [pscustomobject]@{
              TimeStamp = $_.TimeStamp
              Path = $_.Path
              Value = $_.CookedValue
          }
      } } | Export-Csv -Path PerfMonCounters.csv -NoTypeInformation
- Execute the process you want to measure.
The following contains an example procedure for measuring performance with Hayabusa.
heaptrack is a sophisticated memory profiler available for Linux and macOS. By using heaptrack, you can thoroughly investigate bottlenecks.
Prerequisites: Below is the procedure for Ubuntu 22.04. You cannot use heaptrack on Windows.
- Install heaptrack with the following two commands:

  sudo apt install heaptrack
  sudo apt install heaptrack-gui
- Remove the mimalloc code (shown in the Change the memory allocator section) from Hayabusa. (You cannot use heaptrack's memory profiler with mimalloc.)
- Replace the [profile.release] section in Hayabusa's Cargo.toml file with the following:

  [profile.release]
  debug = true
- Build a release build:

  cargo build --release
- Run:

  heaptrack hayabusa csv-timeline -d sample -o out.csv
Now when Hayabusa finishes running, heaptrack's results will automatically open in a GUI application.
An example of heaptrack's results is shown below. The Flame Graph and Top-Down tabs allow you to visually check functions with high memory usage.
This document is based on findings from actual improvement cases in Hayabusa. If you find any errors or techniques that can improve performance, please send us an issue or pull request.