Remove the ownership requirement from the into Cholesky factorization… #8
base: master
Conversation
… and linear system solver routines.
This commit would allow one to pass a mutable array view and end up modifying the original matrix. I don't think that's desirable behavior for *_into functions.
Would you prefer it be called
I think an *_onto style of function makes sense in cases where there is only one output, with the exact same shape as the input. So, the Cholesky factorization and make_triangular methods would make sense. For the linear solver, I wouldn't be in favor, since both A and b are modified in the course of solving.
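A toy sketch of the distinction (invented names and a flat row-major `Vec` as the "matrix" — not linxal's actual API): the in-place form makes the mutation explicit in the signature, and the by-value *_into form reuses the owned buffer on top of it.

```rust
// Hypothetical sketch, not linxal's API: an unblocked Cholesky (lower factor)
// written once as an in-place routine, with an owning *_into form on top.

// In-place: overwrite the row-major n x n matrix `a` with its lower Cholesky
// factor L (so that L * L^T reproduces the original SPD matrix).
fn cholesky_onto(a: &mut [f64], n: usize) {
    for j in 0..n {
        // subtract the contributions of the already-computed columns
        for k in 0..j {
            let ajk = a[j * n + k];
            for i in j..n {
                a[i * n + j] -= a[i * n + k] * ajk;
            }
        }
        // scale the pivot column
        let d = a[j * n + j].sqrt();
        for i in j..n {
            a[i * n + j] /= d;
        }
        // zero the strict upper triangle so the result is a clean factor
        for k in (j + 1)..n {
            a[j * n + k] = 0.0;
        }
    }
}

// Owning form: consumes the matrix and returns the factor in the same buffer.
fn cholesky_into(mut a: Vec<f64>, n: usize) -> Vec<f64> {
    cholesky_onto(&mut a, n);
    a
}

fn main() {
    // [[4, 2], [2, 3]] is SPD; its lower factor is [[2, 0], [1, sqrt(2)]].
    let l = cholesky_into(vec![4.0, 2.0, 2.0, 3.0], 2);
    assert!((l[0] - 2.0).abs() < 1e-12);
    assert!((l[2] - 1.0).abs() < 1e-12);
    assert!((l[3] - 2f64.sqrt()).abs() < 1e-12);
}
```

The by-value version can't accidentally clobber a caller's view, because it takes ownership of the buffer.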
(W.r.t. linear solve) I was not aware (oops, I'll have to check my code), thanks. However, I'd still like to add an interface for in-place computation: calling malloc in a hot (or maybe not so hot) loop is Bad(TM) for perf, and I'm trying to make sure my code doesn't allocate at all for its algorithms. I think that providing the in-place versions would benefit developers, even if both inputs are modified. How about adding separate functions to the linear solver interfaces with the suffix
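A minimal sketch of the shape such an in-place solver could take (invented name `solve_inplace`, flat row-major storage, not linxal's API): both `a` and `b` are clobbered, and nothing allocates, so it can sit in a hot loop.

```rust
// Hypothetical sketch: solve a * x = b in place via Gaussian elimination.
// Both `a` (n x n, row-major) and `b` are overwritten; `b` ends up holding x.
// No pivoting here — a real routine would pivot for stability.
fn solve_inplace(a: &mut [f64], b: &mut [f64], n: usize) {
    // forward elimination
    for k in 0..n {
        for i in (k + 1)..n {
            let m = a[i * n + k] / a[k * n + k];
            for j in k..n {
                a[i * n + j] -= m * a[k * n + j];
            }
            b[i] -= m * b[k];
        }
    }
    // back substitution, writing the solution x into b
    for i in (0..n).rev() {
        let mut s = b[i];
        for j in (i + 1)..n {
            s -= a[i * n + j] * b[j];
        }
        b[i] = s / a[i * n + i];
    }
}

fn main() {
    // solve [[2, 1], [1, 3]] x = [3, 5]  =>  x = [0.8, 1.4]
    let mut a = vec![2.0, 1.0, 1.0, 3.0];
    let mut b = vec![3.0, 5.0];
    solve_inplace(&mut a, &mut b, 2);
    assert!((b[0] - 0.8).abs() < 1e-12);
    assert!((b[1] - 1.4).abs() < 1e-12);
}
```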
@DiamondLovesYou The calloc changes upstream (rust-lang/rust/pull/40409) are exciting for our field. Always have to keep things in context, though. The default matrix multiplication in ndarray allocates (the area needed is limited in size, though), but it has five nested loops following that, so it's not exactly in the inner loop — it's five levels removed! 😉 The allocation is dwarfed as soon as the matrix multiplication is as large as 10x10x10 or so. matrixmultiply code
@bluss Indeed, nice to see that it got approved recently (and I use nightly, so yay)! W.r.t.
What in particular is the C style? I can say that we cannot use, for example, slice types, since ndarray's memory borrowing is not delimited into contiguous regions. It is the allocator's job already to enable efficient reuse of memory regions, so I haven't really explored whether it's worth it to reimplement that.
I don't know if it's useful to linxal, but the high-level interface to matrixmultiply is ndarray::linalg::general_mat_mul. I also need help/suggestions on how to properly resolve the BLAS backend selection. Very hard to find a perfect solution there. As a library, it should not really have an opinion, just enable the end user application to pick.
Yeah, but that's what the stride arguments are for; we're talking about dense matrices, so an unsized slice would just be the whole matrix, the same as if a pointer+len were passed. This really isn't that big of an issue, and I kinda phrased it poorly: bounds checking in slice indexing would probably be harmful, and the Vec workspace Thingy(tm) can of course be provided via a mutable Option ref.
It may be worth it if one has to do a bunch of gemms back-to-back, as I have for my Kalman filter experiments.
It looks like you punt to BLAS only when the matrix is big "enough". I think either: always punt or never punt (never punting meaning implement the algorithms in Rust). Either the problem isn't big enough that it matters if it's a tiny bit faster (debug mode 60fps), or the arch-specific optimizations gained from BLAS are game-changing (the problem is big). I wager the latter is more important for most production codebases. As for BLAS library selection, a very, very rough suggestion:

```rust
pub trait BlasBackend<Field> {
    // all the functions here; signatures should match the "spec" for `Field`
}

pub struct Generic;

static GENERIC_BACKEND: Generic = Generic;

impl BlasBackend<f32> for Generic {
    // ...
}
impl BlasBackend<f64> for Generic {
    // ...
}
// ...

// spelled out per field, since `impl` can't be quantified over types:
pub trait BlasBackends: BlasBackend<f32> + BlasBackend<f64> /* + ... for all the fields */ {
    fn f32_backend(&self) -> &dyn BlasBackend<f32>;
    // etc.
}

// (`impl` is a keyword, so the modules below are named `backend` instead)
#[cfg(target_arch = "spirv")]
pub mod backend {
    use super::*;

    pub struct SpirV;
    impl BlasBackend<f32> for SpirV {
        // ...
    }

    pub fn set_backend<T>()
    where
        T: Default + BlasBackends,
    {
        panic!("can't set blas backend");
    }

    static BACKEND: SpirV = SpirV;
    pub fn backend() -> &'static dyn BlasBackends {
        &BACKEND
    }
}

#[cfg(not(target_arch = "spirv"))]
pub mod backend {
    use super::*;

    // another option: a macro, maybe? Possibly other magic with `std::raw::TraitObject`.
    pub fn set_backend<T>()
    where
        T: Default + BlasBackends,
    {
        // pseudocode: monomorphizing this emits a strong, #[no_mangle]
        // definition of a `__BACKENDS_TRAIT` symbol pointing at T's trait object.
    }

    pub fn backend() -> &'static dyn BlasBackends {
        // pseudocode: read the weakly-linked (`extern_weak`) `__BACKENDS_TRAIT`
        // symbol; if it is null, `set_backend` was never monomorphized, so
        // atomically write (and return) a pointer to the generic backend.
        // We only need atomicity, not ordering, because at any time at runtime,
        // __BACKENDS_TRAIT is null => `set_backend` was never monomorphized;
        // thus if two threads race, or a thread has a stale cache, they'll just
        // try to write the same value.
        unimplemented!()
    }
}
```

IDK if LLVM is powerful enough to devirtualize something like this, for cases when the specific target processor is known in advance. I also don't know how much variation there is between the BLAS libraries. Honestly, I'd kinda like to see this stuff happen in Rust: BLAS is annoying to use on Windows, and what about GPGPU use?
I think using slices instead of raw pointers would risk being invalid. Imagine one input is the right half of a matrix, another mutable input is the left half of a matrix. If we used slices, the slices would overlap. Even if the strides ensure they don't use overlapping elements, it seems to me this would be invalid use of

The biggest problem with BLAS backend selection is the design around crate features and blas-sys's features (OpenBLAS vs netlib vs other, etc.). My open question is how linxal and ndarray should allow BLAS implementation selection for the end user application.
For matrix multiplication, the matrixmultiply lib supports all ndarray arrays (arbitrary strides and 64-bit integer sizes). BLAS doesn't support this (one dimension must have stride 1, and sizes are 32-bit integers!). We want to use BLAS whenever the representation is compatible (it doesn't matter which BLAS implementation; that's the separate problem). The integer size issue effectively means that if matrices are large enough (silly large) we can't use BLAS.
That's fair, but I'd argue it's no more risk than what's already present w/ the use of pointers (since the regions must overlap by assumption), and Rust won't permit (assuming no I'm just saying that
Ahhhh, I see. Yeah, that's tedious. Maybe keep structure definitions in ndarray, with the ops in per-backend crates:

```rust
extern crate ndarray as nd;
extern crate ndarray_openblas_ops as nd_ops;
use nd_ops::general_mat_mul;
```
I think matrixmultiply will be more widely used this way; it is already used by ndarray and rulinalg, and maybe they wouldn't use it if some ndarray data structure were required. The high-level interface to matrixmultiply already exists, and it is the ndarray crate, in particular