
Remove the ownership requirement from the into Cholesky factorization… #8

Open
wants to merge 1 commit into base: master

Conversation

DiamondLovesYou
Contributor

… and linear system solver routines.

NFC.

@masonium
Owner

This commit would allow one to pass a mutable array view and end up modifying the original matrix. I don't think that's a desirable behavior for *_into functions.
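The concern can be illustrated without ndarray. Below is a minimal sketch using plain `Vec`/slice types (the function names are invented for illustration): a routine that takes ownership consumes the caller's storage, while one that accepts anything mutable will scribble over the caller's original buffer.

```rust
// Toy "into"-style routine: transforms in place and returns the same storage.
// Because it takes ownership, the caller's original value is consumed and
// cannot be observed half-modified afterwards.
fn scale_into(mut data: Vec<f64>, factor: f64) -> Vec<f64> {
    for x in data.iter_mut() {
        *x *= factor;
    }
    data
}

// If the bound is relaxed to "anything mutable", a borrowed view works too,
// and the caller's original buffer is modified behind their back.
fn scale_into_view(data: &mut [f64], factor: f64) {
    for x in data.iter_mut() {
        *x *= factor;
    }
}

fn main() {
    let owned = vec![1.0, 2.0];
    let result = scale_into(owned, 2.0); // `owned` is gone; no surprise aliasing
    assert_eq!(result, vec![2.0, 4.0]);

    let mut original = vec![1.0, 2.0];
    scale_into_view(&mut original, 2.0); // `original` itself is now changed
    assert_eq!(original, vec![2.0, 4.0]);
}
```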

@DiamondLovesYou
Contributor Author

Would you prefer it be called _onto?

@masonium
Owner

I think an *_onto style of function makes sense in cases where there is only one output, with the exact same shape as the input. So, the cholesky factorization and make_triangular methods would make sense. For the linear solver, I wouldn't be in favor, since both A and b are modified in the course of solving.
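A hypothetical illustration of the distinction (function names invented, plain slices standing in for ndarray views): an *_onto function writes its single result into a caller-provided output of the same shape, leaving the input untouched.

```rust
/// Hypothetical "_onto" sketch: write a transformation of `input` into
/// `output`, which must have the same length. `input` is not modified,
/// which is the contract that makes "_onto" naming sensible.
fn double_onto(input: &[f64], output: &mut [f64]) {
    assert_eq!(input.len(), output.len());
    for (o, i) in output.iter_mut().zip(input) {
        *o = *i * 2.0;
    }
}

fn main() {
    let input = vec![1.0, 2.0];
    let mut output = vec![0.0; 2];
    double_onto(&input, &mut output);
    assert_eq!(output, vec![2.0, 4.0]);
    assert_eq!(input, vec![1.0, 2.0]); // input untouched
}
```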

@DiamondLovesYou
Contributor Author

(W.r.t. linear solve) I was not aware (oops, I'll have to check my code), thanks. However, I'd still like to add an interface for in-place computation: calling malloc in a hot (or maybe not-so-hot) loop is Bad(TM) for perf, and I'm trying to make sure my code doesn't allocate at all for its algorithms.

I think that providing the inplace versions would benefit developers, even if both inputs are modified.

How about adding separate functions to the linear solver interfaces with the suffix _inplace for this purpose?
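One possible shape for such an interface, sketched with plain slices rather than ndarray types (the name and signature are hypothetical, not linxal's actual API): the routine borrows both arguments mutably, documents that both may be clobbered, and writes the solution over `b`.

```rust
/// Hypothetical "_inplace" solver sketch: solves the lower-triangular system
/// L * x = b by forward substitution, overwriting `b` with `x`.
/// `l` is an n*n row-major matrix. A real LAPACK-style driver would clobber
/// both arguments, which is why neither is returned by value and no
/// allocation happens.
fn solve_lower_triangular_inplace(l: &mut [f64], b: &mut [f64]) {
    let n = b.len();
    assert_eq!(l.len(), n * n);
    for i in 0..n {
        let mut sum = b[i];
        for j in 0..i {
            sum -= l[i * n + j] * b[j];
        }
        b[i] = sum / l[i * n + i];
    }
}

fn main() {
    // L = [[2, 0], [1, 1]], b = [4, 5]  =>  x = [2, 3]
    let mut l = vec![2.0, 0.0, 1.0, 1.0];
    let mut b = vec![4.0, 5.0];
    solve_lower_triangular_inplace(&mut l, &mut b);
    assert_eq!(b, vec![2.0, 3.0]);
}
```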

@bluss
Contributor

bluss commented Mar 15, 2017

@DiamondLovesYou The calloc changes upstream (rust-lang/rust/pull/40409) are exciting for our field.

Always have to keep things in context, though. The default matrix multiplication in ndarray allocates (the area needed is limited in size, though), but it has five nested loops following that, so it's not exactly in the inner loop; it's five levels removed! 😉 The allocation is dwarfed as soon as the matrix multiplication is as large as 10x10x10 or so. matrixmultiply code

@DiamondLovesYou
Contributor Author

@bluss Indeed, nice to see that it got approved recently (and I use nightly, so yay)! W.r.t. matrixmultiply: I was also unaware of that. I have a question not directly related to this issue, however: is there a reason you kept the C style when you wrote matrixmultiply? sgemm/dgemm aren't externed, so use from C code doesn't seem to be a requirement. I ask because it seems pretty trivial to allow users to pass in an empty Vec (or provide a trait for resizing ...), which is `reserve`d and `set_len`ed to the correct size if too small (thus this packing_vec could be shared between multiple gemms with potentially different sized matrix arguments).
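The reuse idea can be sketched concretely (this is hypothetical; matrixmultiply's actual API takes raw pointers and manages its own packing buffer). The point is only that a caller-owned scratch buffer can be grown once and then shared across differently sized calls without reallocating:

```rust
/// Hypothetical workspace-reuse sketch: grow a shared scratch buffer to the
/// packing size a given problem needs, reallocating only when it is too small.
fn ensure_workspace(buf: &mut Vec<f64>, needed: usize) -> &mut [f64] {
    if buf.len() < needed {
        // `resize` both reserves and zero-initializes; a real implementation
        // might use `reserve` + `set_len` to skip the zeroing.
        buf.resize(needed, 0.0);
    }
    &mut buf[..needed]
}

fn main() {
    let mut workspace = Vec::new();
    // First "gemm" needs a large packing area...
    let w1 = ensure_workspace(&mut workspace, 1024);
    assert_eq!(w1.len(), 1024);
    let cap_after_first = workspace.capacity();
    // ...a second, smaller call reuses the same allocation: no malloc.
    let w2 = ensure_workspace(&mut workspace, 256);
    assert_eq!(w2.len(), 256);
    assert_eq!(workspace.capacity(), cap_after_first);
}
```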

@bluss
Contributor

bluss commented Mar 17, 2017

What in particular is the C style? I can say that we cannot use, for example, slice types, since ndarray's memory borrowing is not delimited into contiguous regions.

It is the allocator's job already to enable efficient reuse of memory regions, so I haven't really explored if it's worth it to reimplement that.

@bluss
Contributor

bluss commented Mar 17, 2017

I don't know if it's useful to linxal, but the high-level interface to matrixmultiply is ndarray::linalg::general_mat_mul.

I also need help/suggestions on how to properly resolve the blas backend selection. Very hard to find a perfect solution there. As a library, it should not really have an opinion, just enable the end user application to pick.

@DiamondLovesYou
Contributor Author

DiamondLovesYou commented Mar 21, 2017

What in particular is the C style? I can say that we cannot use, for example, slice types, since ndarray's memory borrowing is not delimited into contiguous regions.

Yeah, but that's what the stride arguments are for; we're talking about dense matrices, so an unsized slice would just cover the whole matrix, the same as if a pointer+len were passed.

This really isn't that big of an issue, and I kinda phrased it poorly: bounds checking in slice indexing would probably be harmful, and the Vec workspace Thingy(TM) could of course be provided via a mutable Option reference.

It is the allocator's job already to enable efficient reuse of memory regions, so I haven't really explored if it's worth it to reimplement that.

It may be worth it if one has to do a bunch of gemms back-to-back, as I have for my Kalman filter experiments.

I also need help/suggestions on how to properly resolve the blas backend selection. Very hard to find a perfect solution there. As a library, it should not really have an opinion, just enable the end user application to pick.

It looks like you punt to BLAS only when the matrix is big "enough". I think you should either always punt or never punt (never punting meaning implementing the algorithms in Rust). Either the problem isn't big enough that it matters if it's a tiny bit faster (debug mode 60fps), or the arch-specific optimizations gained from BLAS are game changing (the problem is big). I wager the latter is more important for most production codebases.

As for BLAS library selection, a very very rough suggestion:

pub trait BlasBackend<Field> {
  // all the functions here; signatures should match the "spec" for `Field`
}

#[repr(C)]
pub struct Generic;
const GENERIC_BACKEND: Generic = Generic;
impl BlasBackend<f32> for Generic {
  // ...
}
impl BlasBackend<f64> for Generic {
  // ...
}
// ...

pub trait BlasBackends: BlasBackend<f32> + BlasBackend<f64> /* + all the other fields */ {
  fn f32_backend(&self) -> &BlasBackend<f32> { self as &BlasBackend<f32> }
  // etc.
}

#[cfg(target_arch = "spirv")]
pub mod backend_impl { // `impl` is a keyword, so the module needs another name
  pub struct SpirV;
  impl BlasBackend<f32> for SpirV {
    // ...
  }

  pub fn set_backend<T>()
    where T: Default + BlasBackends,
  {
    panic!("can't set blas backend");
  }
  const BACKEND: SpirV = SpirV;
  pub fn backend() -> &'static BlasBackends { &BACKEND as &BlasBackends }
}
#[cfg(not(target_arch = "spirv"))]
pub mod backend_impl {
  // another option: a macro, maybe? Possibly other magic with `std::raw::TraitObject`.
  pub fn set_backend<T>()
    where T: Default + BlasBackends,
  {
    const BACKEND: T = Default::default();
    #[no_mangle]
    static __BACKENDS_TRAIT: &'static BlasBackends = &BACKEND as &BlasBackends;
  }
  pub fn backend() -> &'static BlasBackends {
    #[allow(improper_ctypes)]
    extern {
      #[linkage = "extern_weak"]
      static __BACKENDS_TRAIT: &'static BlasBackends;
    }

    // Check for a null __BACKENDS_TRAIT and atomically write the generic pointer.
    // We only need atomicity, not ordering, because at any time at runtime,
    // __BACKENDS_TRAIT being null implies `set_backend` was never monomorphized.
    // Thus if two threads race, or if a thread has a stale cache, they'll just
    // try to write the same value.

    unsafe {
      __BACKENDS_TRAIT
    }
  }
}

IDK if LLVM is powerful enough to devirtualize something like this, for cases when the specific target processor is known in advance. I also don't know how much variation there is between the BLAS libraries.

Honestly, I'd kinda like to see this stuff happen in Rust: BLAS is annoying to use on Windows, and what about GPGPU use?

@bluss
Contributor

bluss commented Mar 21, 2017

I think using slices instead of raw pointers would risk being invalid. Imagine one input is the right half of a matrix, and another, mutable input is the left half of a matrix. If we used slices, the slices would overlap. Even if the strides ensure they don't use overlapping elements, it seems to me this would be invalid use of &mut [T] if it doesn't have exclusive access to the whole slice.
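The aliasing rule can be made concrete in safe Rust: two &mut slices may never overlap as ranges. Disjoint contiguous halves of one buffer are expressible via split_at_mut, but interleaved strided "halves" of a matrix cannot be expressed as slices at all, which is why strided views end up on raw pointers internally. A small demonstration:

```rust
fn main() {
    // A 2x4 row-major "matrix" stored in one flat buffer.
    let mut m = vec![0i32; 8];

    // Disjoint contiguous halves (top and bottom rows) are expressible
    // as two simultaneous &mut slices:
    let (top, bottom) = m.split_at_mut(4);
    top[0] = 1;
    bottom[0] = 2;
    assert_eq!(m[0], 1);
    assert_eq!(m[4], 2);

    // But the *left* and *right* column blocks interleave in memory
    // (elements 0,1,4,5 vs 2,3,6,7): any pair of slices covering them would
    // overlap as ranges, so safe Rust cannot hand out two &mut slices over
    // them even though the strided elements themselves are disjoint.
}
```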

The biggest problem with blas backend selection is the design around crate features and blas-sys's features. (Openblas vs netlib vs other etc). My open question is how linxal and ndarray should allow blas implementation selection for the end user application.

@bluss
Contributor

bluss commented Mar 21, 2017

For matrix multiplication, lib matrixmultiply supports all ndarray arrays (arbitrary stride and 64-bit integers). Blas doesn't support this (one dimension must have stride 1, and sizes are 32-bit integers!). We want to use blas whenever the representation is compatible (doesn't matter which BLAS implementation, that's the separate problem).

The integer size issue effectively means that if matrices are large enough (silly large) we can't use BLAS.
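A hedged sketch of the dispatch condition (the actual checks in ndarray differ in detail; this function and its parameters are invented for illustration): fall back to the pure-Rust path whenever any dimension or stride cannot be represented in BLAS's 32-bit integers, or when no dimension has unit stride.

```rust
use std::convert::TryFrom;

/// Sketch of a BLAS-compatibility check: BLAS takes `i32` sizes and requires
/// one unit-stride dimension, while ndarray allows arbitrary strides and
/// pointer-sized dimensions.
fn blas_compatible(rows: usize, cols: usize, row_stride: isize, col_stride: isize) -> bool {
    let dim_fits = |x: usize| i32::try_from(x).is_ok();
    let stride_fits = |s: isize| i32::try_from(s).is_ok();
    let unit_stride = row_stride == 1 || col_stride == 1;
    dim_fits(rows) && dim_fits(cols)
        && stride_fits(row_stride) && stride_fits(col_stride)
        && unit_stride
}

fn main() {
    assert!(blas_compatible(1000, 1000, 1000, 1)); // plain row-major: fine
    assert!(!blas_compatible(1000, 1000, 2000, 2)); // no unit stride: punt
    // "Silly large" (on a 64-bit target): rows overflow i32, so no BLAS.
    assert!(!blas_compatible(3_000_000_000, 2, 2, 1));
}
```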

@DiamondLovesYou
Contributor Author

I think using slices instead of raw pointers would risk being invalid. Imagine one input is the right half of a matrix, and another, mutable input is the left half of a matrix. If we used slices, the slices would overlap. Even if the strides ensure they don't use overlapping elements, it seems to me this would be invalid use of &mut [T] if it doesn't have exclusive access to the whole slice.

That's fair, but I'd argue it's no more risk than what's already present with the use of pointers (since the regions must overlap by assumption), and Rust won't permit such slicing (assuming no unsafe) for row-major matrices anyway.

I'm just saying that ArrayBase and ViewRepr etc provide safe access to disjoint regions, so why not build the matrix multiplication on top of those structures, instead of almost the reverse?

For matrix multiplication, lib matrixmultiply supports all ndarray arrays (arbitrary stride and 64-bit integers). Blas doesn't support this (one dimension must have stride 1, and sizes are 32-bit integers!). We want to use blas whenever the representation is compatible (doesn't matter which BLAS implementation, that's the separate problem).
The integer size issue effectively means that if matrices are large enough (silly large) we can't use BLAS.

Ahhhh, I see. Yeah, that's tedious. Maybe keep the structure definitions in ndarray and put the operations in a separate crate, and have the provider crates expose type-compatible versions of the various functions. One crate per BLAS library, e.g. ndarray-openblas-ops, ndarray-netlib-ops, etc. Then

extern crate ndarray as nd;
extern crate ndarray_openblas_ops as nd_ops;

use nd_ops::general_mat_mul;

@bluss
Contributor

bluss commented Apr 15, 2017

I think matrixmultiply will be more widely used this way, it is already used by ndarray and rulinalg, and maybe they wouldn't use it if some ndarray data structure was required.

The high-level interface to matrixmultiply already exists, and it is the ndarray crate, in particular ndarray::linalg::general_mat_mul.

3 participants