2020-12-11 Putting Glenside in context with other current DSE work (Gemmini, Eyeriss's taxonomy, PolySA) #97
gussmith23 started this conversation in Ideas
The goal is to set up an optimization problem which considers both hardware and software concerns simultaneously. The problem I've been having is that I don't think I've set up the representation to capture what we really care about. The representation wastes too much space on figuring out every last data dependency, and doesn't have enough power to express different hardware concerns.
I focused a bit too hard on taking a whole-program view. I genuinely thought there was value in treating all operators as equals, but the focus really should be on fully connected and convolution operators. I still imagine there should be a way to build a representation that can at least express other operators, but the emphasis needs to be on optimizing conv/fc layers.
We are also somewhat constrained to the weight-stationary case, partly because that's the only hardware we can build. But there might be an argument for looking at this one level up: if there aren't many degrees of freedom once we're locked into weight-stationary, is there a way to explore the other types of systolic arrays?
The Eyeriss paper's taxonomy of systolic array types (they call them "dataflows", not systolic array types) is part of what made me start to think we were limiting ourselves too much. By the time you've decided to be weight-stationary, many of the interesting hardware and software decisions have already been made for you. So the interesting space is the space of potential systolic arrays (and the hardware decisions around them).
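To make the taxonomy concrete, here is a minimal Python sketch (mine, not anything from the Eyeriss paper or from Glenside) of how two dataflows differ for a plain matrix multiply. The dataflow is essentially the choice of which operand stays resident in a PE while the others stream past, which shows up directly in the loop structure:

```python
# Illustrative only: C[m][n] += A[m][k] * B[k][n] on an array of PEs.

def weight_stationary(A, B, C, M, K, N):
    # Each weight B[k][n] is loaded into a PE once and reused across
    # every row of A that streams through.
    for k in range(K):
        for n in range(N):
            w = B[k][n]              # resident in a PE for the whole inner loop
            for m in range(M):       # activations stream past the stationary weight
                C[m][n] += A[m][k] * w

def output_stationary(A, B, C, M, K, N):
    # Each PE owns one output element and accumulates into it locally;
    # both inputs stream through.
    for m in range(M):
        for n in range(N):
            acc = C[m][n]            # partial sum resident in a PE
            for k in range(K):
                acc += A[m][k] * B[k][n]
            C[m][n] = acc
```

Committing to one of these orderings up front is exactly what I mean by "a lot of the decisions have been made for you."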
This line of thinking leads me, finally, back to Gemmini, which is, as far as I can tell, a design space exploration tool for exploring possible GEMM systolic arrays. A quote: "Systolic array hardware generators should target pertinent architectural parameters such as dataflow, pipeline depth, banking strategy, precision, and on-chip memory capacity." Their actual design-space exploration is lacking, though -- essentially trying out different designs by hand. They've done a lot of work making it easy to get a working design out of a given set of knob settings, but the knob-turning itself appears to be manual.
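To make concrete why I call this "turning knobs", here's a hypothetical sketch of the design space as a plain grid sweep. The parameter names come from the quote above; the value ranges and the cost model are invented for illustration and have nothing to do with Gemmini's actual flow:

```python
from itertools import product

design_space = {
    "dataflow":       ["weight_stationary", "output_stationary"],
    "pipeline_depth": [1, 2, 4],
    "banks":          [1, 2, 4],
    "precision_bits": [8, 16],
    "scratchpad_kib": [64, 128, 256],
}

def evaluate(config):
    # Stand-in for the expensive step: generating the design, synthesizing it,
    # and measuring area/latency on a benchmark. Returns a fake cost here.
    return config["scratchpad_kib"] / (config["banks"] * config["precision_bits"])

def sweep():
    keys = list(design_space)
    for values in product(*(design_space[k] for k in keys)):
        config = dict(zip(keys, values))
        yield config, evaluate(config)

best_config, best_cost = min(sweep(), key=lambda pair: pair[1])
```

The interesting research question is what replaces the exhaustive `sweep()`, not how the configs are turned into hardware.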
PolySA is a better example of a project that focuses on the design space exploration angle. There are significant differences from Glenside -- they design the systolic arrays automatically, targeting FPGAs (via HLS) as a backend. I like the paper because it does what I feel we should have been doing: it doesn't treat, for example, a convolution as something that needs to be broken down into an (extremely regular) tree of computation. Instead, using the polyhedral model, it views a convolution as a set of loop nests and transforms the loop nests directly, rather than transforming single iterations of loops (as in Glenside).
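A toy Python sketch of the distinction, using a 1D convolution (this is purely illustrative, not PolySA's IR or Glenside code): the polyhedral view rewrites the loop nest itself, here by tiling the output loop, instead of reasoning about individual iterations one at a time.

```python
def conv1d(x, w, y):
    # Original loop nest: y[i] = sum_j x[i + j] * w[j]
    # (assumes len(x) >= len(y) + len(w) - 1)
    for i in range(len(y)):
        for j in range(len(w)):
            y[i] += x[i + j] * w[j]

def conv1d_tiled(x, w, y, tile=4):
    # The same computation after one loop-nest transformation (strip-mining i).
    # A polyhedral tool manipulates this structure symbolically; Glenside today
    # would instead materialize a graph node per tile (or per invocation).
    n = len(y)
    for ii in range(0, n, tile):
        for i in range(ii, min(ii + tile, n)):
            for j in range(len(w)):
                y[i] += x[i + j] * w[j]
```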
This represents one of the biggest lessons -- and one of the things I'm trying not to kick myself over. Specifically, Glenside allows too much control and too much expressivity over how the data dependency graph is expressed. A blocked matrix multiply running on a systolic array does not need a unique program node for every single systolic array invocation, which is how we currently represent it. This gives us control we don't need -- for example, the ability to block a single convolution across multiple, differently-sized systolic arrays. I have to admit that that was, semi-consciously, part of my initial bet: that we could discover irregular hardware designs that accelerate specific models. We haven't shown that that's false, but I feel more and more pessimistic about it being true...and regardless, I'm just not sure the rest of the system will be set up to discover it if it is true. I'm not sure how well we'll be able to model multiple systolic arrays of different sizes, for example.
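Here's a rough sketch of the problem, with `systolic_array()` as a stand-in for one hardware invocation (it's just a matrix multiply here, not Glenside's actual atom): every call in the innermost loop below currently becomes its own node in the program graph, even though the blocking is completely regular.

```python
import numpy as np

def systolic_array(a_block, b_block):
    # Stand-in for one invocation of a fixed-size systolic array.
    return a_block @ b_block

def blocked_matmul(A, B, block=16):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % block == 0 and K % block == 0 and N % block == 0
    C = np.zeros((M, N))
    for i in range(0, M, block):
        for j in range(0, N, block):
            for k in range(0, K, block):
                # Glenside currently gives each of these calls its own program
                # node; the loop nest itself already captures the structure.
                C[i:i+block, j:j+block] += systolic_array(
                    A[i:i+block, k:k+block], B[k:k+block, j:j+block])
    return C
```

The per-invocation representation would only pay off if we actually wanted, say, different blocks routed to differently-sized arrays, which is exactly the bet I'm now doubting.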
What are we set up to do? Well, we're constrained at the low level by our atom library; that is, we can't go lower in abstraction than the atoms, and everything has to map to one of them at some point. We can't generate new systolic array types, for now; we can only control the size of the systolic array. (In other words, our innermost loops are fixed.) We can represent the whole program, but that capability may not be worth as much as I'd hoped.
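As a tiny illustration of what "innermost loops are fixed" means (again a stand-in, not the real atom library): the only free parameter an atom exposes is the array size; the kernel inside is baked in, and exploration happens in how programs are blocked to fit it.

```python
import numpy as np

def make_systolic_array_atom(rows, cols):
    # The size is the only knob; the inner computation cannot be changed.
    def run(a_block, b_block):
        assert a_block.shape[0] == rows and b_block.shape[1] == cols
        return a_block @ b_block
    return run

atom_16x16 = make_systolic_array_atom(16, 16)
```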