<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.13"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<title>POWER Vector Library Manual: POWER Vector Library (pveclib)</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">POWER Vector Library Manual
 <span id="projectnumber">1.0.4</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.13 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<script type="text/javascript" src="menudata.js"></script>
<script type="text/javascript" src="menu.js"></script>
<script type="text/javascript">
$(function() {
initMenu('',true,false,'search.php','Search');
$(document).ready(function() { init_search(); });
});
</script>
<div id="main-nav"></div>
</div><!-- top -->
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">POWER Vector Library (pveclib) </div> </div>
</div><!--header-->
<div class="contents">
<div class="textblock"><p>A library of useful vector functions for POWER. This library fills in the gap between the instructions defined in the POWER Instruction Set Architecture (<b>PowerISA</b>) and higher level library APIs. The intent is to improve the productivity of application developers who need to optimize their applications or dependent libraries for POWER. </p><dl class="section author"><dt>Authors</dt><dd>Steven Munroe</dd></dl>
<dl class="section copyright"><dt>Copyright</dt><dd>2017-2018 IBM Corporation. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: <a href="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</a> .</dd></dl>
<p>Unless required by applicable law or agreed to in writing, software and documentation distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.</p>
<h1><a class="anchor" id="mainpage_notices"></a>
Notices</h1>
<p>IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at <a href="http://www.ibm.com/legal/copytrade.shtml">http://www.ibm.com/legal/copytrade.shtml</a>.</p>
<p>The following terms are trademarks or registered trademarks licensed by Power.org in the United States and/or other countries: Power ISA<sup>TM</sup>, Power Architecture<sup>TM</sup>. Information on the list of U.S. trademarks licensed by Power.org may be found at <a href="http://www.power.org/about/brand-center/">http://www.power.org/about/brand-center/</a>.</p>
<p>The following terms are trademarks or registered trademarks of Freescale Semiconductor in the United States and/or other countries: AltiVec<sup>TM</sup>. Information on the list of U.S. trademarks owned by Freescale Semiconductor may be found at <a href="http://www.freescale.com/files/abstract/help_page/TERMSOFUSE.html">http://www.freescale.com/files/abstract/help_page/TERMSOFUSE.html</a>.</p>
<h2><a class="anchor" id="mainpage_ref_docs"></a>
Reference Documentation</h2>
<ul>
<li>Power Instruction Set Architecture, Versions <a href="https://ibm.ent.box.com/s/jd5w15gz301s5b5dt375mshpq9c3lh4u">2.07B</a> and <a href="https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv">3.0B</a>, IBM, 2013-2017. Available from the <a href="https://www-355.ibm.com/systems/power/openpower/">IBM Portal for OpenPOWER</a> under the <b>Public Documents</b> tab.<ul>
<li>Publicly available PowerISA docs for older processors are hard to find. But here is a link to <a href="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=995FB78240B0A62F1629AB3454C3DFB7?doi=10.1.1.175.7365&rep=rep1&type=pdf">PowerISA-2.06B</a> for POWER7.</li>
</ul>
</li>
<li><a href="http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf">ALTIVEC PIM</a>: AltiVec<sup>TM</sup> Technology Programming Interface Manual, Freescale Semiconductor, 1999.</li>
<li><a href="http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html">64-bit PowerPC ELF Application Binary Interface (ABI)</a> Supplement 1.9.</li>
<li><a href="http://openpowerfoundation.org/wp-content/uploads/resources/leabi/leabi-20170510.pdf">OpenPOWER ELF V2 application binary interface (ABI)</a>, OpenPOWER Foundation, 2017.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/">Using the GNU Compiler Collection (GCC)</a>, Free Software Foundation, 1988-2018.</li>
<li><a href="https://sourceware.org/glibc/wiki/GNU_IFUNC">What is an indirect function (IFUNC)?</a>, glibc wiki.</li>
<li><a href="https://ibm.ent.box.com/s/649rlau0zjcc0yrulqf4cgx5wk3pgbfk">POWER8 Processor User’s Manual</a> for the Single-Chip Module.</li>
<li><a href="https://ibm.ent.box.com/s/8uj02ysel62meji4voujw29wwkhsz6a4">POWER9 Processor User’s Manual</a>.</li>
<li>Warren, Henry S. Jr, Hacker's Delight, 2nd Edition, Upper Saddle River, NJ: Addison Wesley, 2013.</li>
</ul>
<h1><a class="anchor" id="mainpage_rationale"></a>
Rationale</h1>
<p>The C/C++ language compilers (that support PowerISA) may implement vector intrinsic functions (compiler built-ins as embodied by altivec.h). These vector intrinsics offer an alternative to assembler programming, but do little to reduce the complexity of the underlying PowerISA. Higher level vector intrinsic operations are needed to improve productivity and encourage developers to optimize their applications for PowerISA. Another key goal is to smooth over the complexity of the evolving PowerISA and compiler support.</p>
<p>For example: the PowerISA 2.07 (POWER8) provides population count and count leading zero operations on vectors of byte, halfword, word, and doubleword elements but not on the whole vector as a __int128 value. Before PowerISA 2.07, neither operation was supported, for any element size.</p>
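<p>As a portable scalar sketch (not pveclib's vector code; the helper names are invented for illustration), quadword population count and count leading zeros can be composed from doubleword primitives, much as pveclib composes them from the PowerISA's doubleword vector operations:</p>

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Population count of a 128-bit value: sum the counts of the
   high and low doublewords. */
static inline int popcount128 (u128 x)
{
  return __builtin_popcountll ((uint64_t) (x >> 64))
       + __builtin_popcountll ((uint64_t) x);
}

/* Count leading zeros of a 128-bit value: use the high doubleword's
   count unless it is zero, then fall through to the low doubleword. */
static inline int clz128 (u128 x)
{
  uint64_t hi = (uint64_t) (x >> 64);
  uint64_t lo = (uint64_t) x;
  if (hi != 0)
    return __builtin_clzll (hi);
  if (lo != 0)
    return 64 + __builtin_clzll (lo);
  return 128; /* all-zero input */
}
```

<p>The vector implementations follow the same decomposition, but select between the doubleword results with compares and permutes instead of branches.</p>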
<p>Another example: The original <b>Altivec</b> (AKA Vector Multimedia Extension (<b>VMX</b>)) provided Vector Multiply Odd / Even operations for signed / unsigned byte and halfword elements. The PowerISA 2.07 added Vector Multiply Even/Odd operations for signed / unsigned word elements. This release also added a Vector Multiply Unsigned Word Modulo operation. This was important to allow auto vectorization of C loops using 32-bit (int) multiply.</p>
<p>But PowerISA 2.07 did not add support for doubleword or quadword (__int128) multiply directly. Nor did it fill in the missing multiply modulo operations for byte and halfword. However, it did add support for doubleword and quadword add / subtract modulo. This can be helpful if you are willing to apply grade school arithmetic (add, carry the 1) to vector elements.</p>
<p>PowerISA 3.0 (POWER9) did add a Vector Multiply-Sum Unsigned Doubleword Modulo operation. With this instruction (and a generated vector of zeros as input) you can effectively implement the simple doubleword integer multiply modulo operation in a few instructions. Similarly for Vector Multiply-Sum Unsigned Halfword Modulo. But this may not be obvious.</p>
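<p>A scalar model makes the trick concrete (the function names are hypothetical; the real operation works on vector doubleword elements): the multiply-sum semantics reduce to a plain doubleword multiply once one element pair and the addend are zeroed.</p>

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Scalar model of the vmsumudm semantics:
   result = a0*b0 + a1*b1 + c, in 128-bit modulo arithmetic. */
static inline u128 msumudm (uint64_t a0, uint64_t a1,
                            uint64_t b0, uint64_t b1, u128 c)
{
  return (u128) a0 * b0 + (u128) a1 * b1 + c;
}

/* A 64x64 -> 128 multiply falls out by zeroing the second element
   pair and the 128-bit addend. */
static inline u128 muludm (uint64_t a, uint64_t b)
{
  return msumudm (a, 0, b, 0, 0);
}
```

<p>In the vector form the zeros come from a generated vector of zeros, which is why the instruction sequence stays short.</p>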
<p>This history embodies a set of trade-offs negotiated between the Software and Processor design architects at specific points in time. But most programmers would prefer to use a set of operators applied across the supported element types and sizes.</p>
<h2><a class="anchor" id="mainpage_sub0"></a>
POWER Vector Library Goals</h2>
<p>Many useful operations can be constructed from existing PowerISA operations and GCC &lt;altivec.h&gt; built-ins, but the implementation may not be obvious. The optimum sequence will vary across the PowerISA levels as new instructions are added. And finally the compiler's built-in support for new PowerISA instructions evolves with the compiler's release cycle.</p>
<p>So the goal of this project is to provide well crafted implementations of useful vector and large number operations.</p>
<ul>
<li>Provide equivalent functions across versions of the PowerISA. This includes some of the most useful vector instructions added to POWER9 (PowerISA 3.0B). Many of these operations can be implemented as inline functions in a few vector instructions on earlier PowerISA versions.</li>
<li>Provide equivalent functions across versions of the compiler. For example built-ins provided in later versions of the compiler can be implemented as inline functions with inline asm in earlier compiler versions.</li>
<li>Provide complete arithmetic operations across supported C types. For example multiply modulo and even/odd for int, long, and __int128.</li>
<li>Provide complete extended arithmetic (carry / extend / multiply high) operations across supported C types. For example add / subtract with carry and extend for int, long, and __int128.</li>
<li>Provide higher order functions not provided directly by the PowerISA. For example, vector SIMD implementations of ASCII __isalpha, etc. As another example, full __int128 implementations of Count Leading Zeros, Population Count, Shift left/right immediate, and large integer multiply/divide.</li>
<li>Most implementations should be small enough to inline and allow the compiler opportunity to apply common optimization techniques.</li>
<li>Larger implementations should be built into platform specific object archives and dynamic shared objects. Shared objects should use <b>IFUNC resolvers</b> to bind the dynamic symbol to the best implementation for the platform (see <a class="el" href="index.html#main_libary_issues_0_0">Putting the Library into PVECLIB</a>).</li>
</ul>
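<p>The higher order functions mentioned above can be modeled in portable scalar code. This sketch (with invented helper names, not pveclib's API) shows the SIMD-style ASCII isalpha pattern: each byte lane yields an all-ones or all-zeros mask, exactly as a vector range compare would produce per element.</p>

```c
#include <stdint.h>

/* Per-lane predicate: 0xFF if the byte is an ASCII letter, else 0x00.
   A vector unit computes this with unsigned range compares
   (vcmpgtub-based) and a logical OR of the two ranges. */
static uint8_t isalpha_lane (uint8_t c)
{
  int upper = (c >= 'A') & (c <= 'Z');
  int lower = (c >= 'a') & (c <= 'z');
  return (upper | lower) ? 0xFF : 0x00;
}

/* Apply the predicate across all 16 byte lanes of a "vector". */
static void vec_isalpha_mask (const uint8_t in[16], uint8_t mask[16])
{
  for (int i = 0; i < 16; i++)
    mask[i] = isalpha_lane (in[i]);
}
```

<p>The resulting masks can then drive vector select or be gathered into a scalar bitmask, processing 16 characters per operation.</p>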
<h3><a class="anchor" id="mainpage_sub0_1"></a>
POWER Vector Library Intrinsic headers</h3>
<p>The POWER Vector Library will be primarily delivered as C language inline functions in header files.</p><ul>
<li><a class="el" href="vec__common__ppc_8h.html" title="Common definitions and typedef used by the collection of Power Vector Library (pveclib) headers...">vec_common_ppc.h</a> Typedefs and helper macros</li>
<li><a class="el" href="vec__int512__ppc_8h.html" title="Header package containing a collection of multiple precision quadword integer computation functions i...">vec_int512_ppc.h</a> Operations on multiple precision integer values</li>
<li><a class="el" href="vec__int128__ppc_8h.html" title="Header package containing a collection of 128-bit computation functions implemented with PowerISA VMX...">vec_int128_ppc.h</a> Operations on vector __int128 values</li>
<li><a class="el" href="vec__int64__ppc_8h.html" title="Header package containing a collection of 128-bit SIMD operations over 64-bit integer elements...">vec_int64_ppc.h</a> Operations on vector long int (64-bit) values</li>
<li><a class="el" href="vec__int32__ppc_8h.html" title="Header package containing a collection of 128-bit SIMD operations over 32-bit integer elements...">vec_int32_ppc.h</a> Operations on vector int (32-bit) values</li>
<li><a class="el" href="vec__int16__ppc_8h.html" title="Header package containing a collection of 128-bit SIMD operations over 16-bit integer elements...">vec_int16_ppc.h</a> Operations on vector short int (16-bit) values</li>
<li><a class="el" href="vec__char__ppc_8h.html" title="Header package containing a collection of 128-bit SIMD operations over 8-bit integer (char) elements...">vec_char_ppc.h</a> Operations on vector char (8-bit) values</li>
<li><a class="el" href="vec__bcd__ppc_8h.html" title="Header package containing a collection of Binary Coded Decimal (BCD) computation and Zoned Character ...">vec_bcd_ppc.h</a> Operations on vectors of Binary Coded Decimal and Zoned Decimal values</li>
<li><a class="el" href="vec__f128__ppc_8h.html" title="Header package containing a collection of 128-bit SIMD operations over Quad-Precision floating point ...">vec_f128_ppc.h</a> Operations on vector _Float128 values</li>
<li><a class="el" href="vec__f64__ppc_8h.html" title="Header package containing a collection of 128-bit SIMD operations over 64-bit double-precision floati...">vec_f64_ppc.h</a> Operations on vector double values</li>
<li><a class="el" href="vec__f32__ppc_8h.html" title="Header package containing a collection of 128-bit SIMD operations over 4x32-bit floating point elemen...">vec_f32_ppc.h</a> Operations on vector float values</li>
</ul>
<dl class="section note"><dt>Note</dt><dd>The list above is complete in the current public GitHub as a first pass. A backlog of functions remains to be implemented across these headers. Development continues while we work on the backlog listed in: <a href="/~https://github.com/open-power-sdk/pveclib/issues/13">Issue #13 TODOs</a></dd></dl>
<p>The goal is to provide high quality implementations that adapt to the specifics of the compile target (-mcpu=) and compiler (&lt;altivec.h&gt;) version you are using. Initially pveclib will focus on the GCC compiler and -mcpu=[power7|power8|power9] for Linux. Testing will focus on Little Endian (<b>powerpc64le</b>) for power8 and power9 targets. Any testing for Big Endian (<b>powerpc64</b>) will be initially restricted to power7 and power8 targets.</p>
<p>Expanding pveclib support beyond this list to include:</p><ul>
<li>additional compilers (e.g., Clang)</li>
<li>additional PPC platforms (970, power6, ...)</li>
<li>Larger functions that just happen to use vector registers (Checksum, Crypto, compress/decompress, lower precision neural networks, ...)</li>
</ul>
<p>will largely depend on additional skilled practitioners joining this project and contributing (code and platform testing) on a sustained basis.</p>
<h2><a class="anchor" id="mainpage_sub1"></a>
How pveclib is different from compiler vector built-ins</h2>
<p>The PowerPC vector built-ins evolved from the original <a href="https://www.nxp.com/docs/en/reference-manual/ALTIVECPIM.pdf">AltiVec (TM) Technology Programming Interface Manual</a> (PIM). The PIM defined the minimal extensions to the application binary interface (ABI) required to support the Vector Facility. This included new keywords (vector, pixel, bool) for defining new vector types, and new operators (built-in functions) required for any supporting and compliant C language compiler.</p>
<p>The vector built-in function support included:</p><ul>
<li>generic AltiVec operations, like vec_add()</li>
<li>specific AltiVec operations (instructions, like vec_vaddubm())</li>
<li>predicates computed from AltiVec operations, like vec_all_eq(), which are also generic</li>
</ul>
<p>See <a class="el" href="index.html#mainpage_sub2">Background on the evolution of &lt;altivec.h&gt;</a> for more details.</p>
<p>There are clear advantages with the compiler implementing the vector operations as built-ins:</p><ul>
<li>The compiler can access the C language type information and vector extensions to implement the function overloading required to process generic operations.</li>
<li>Built-ins can be generated inline, which eliminates function call overhead and allows more compact code generation.</li>
<li>The compiler can then apply higher order optimization across built-ins including: Local and global register allocation. Global common subexpression elimination. Loop-invariant code motion.</li>
<li>The compiler can automatically select the best instructions for the <em>target</em> processor ISA level (from the -mcpu compiler option).</li>
</ul>
<p>While this is an improvement over writing assembler code, it does not provide much function beyond the specific operations specified in the PowerISA. As a result the generic operations were not uniformly applied across vector element types. And this situation often persisted long after the PowerISA added instructions for wider elements. Some examples:</p><ul>
<li>Initially vec_add / vec_sub applied to float, int, short and char.</li>
<li>Later compilers added support for double (with POWER7 and the Vector Scalar Extensions (VSX) facility).</li>
<li>Later still, integer long (64-bit) and __int128 support (with POWER8 and PowerISA 2.07B).</li>
</ul>
<p>But vec_mul / vec_div did not:</p><ul>
<li>Initially vec_mul applied to vector float only. Later, vector double was supported for POWER7 VSX. Much later, integer multiply modulo was added under the generic vec_mul intrinsic.</li>
<li>vec_mule / vec_mulo (Multiply even / odd elements) applied to [signed | unsigned] integer short and char. Later compilers added support for vector int after POWER8 added vector multiply word instructions.</li>
<li>vec_div was not included in the original PIM as Altivec (VMX) only included vector reciprocal estimate for float and no vector integer divide for any size. Later compilers added support for vec_div float / double after POWER7 (VSX) added vector divide single/double-precision instructions.</li>
</ul>
<dl class="section note"><dt>Note</dt><dd>While the processor you (plan to) use may support the specific instructions you want to exploit, the compiler you are using may not support the generic or specific vector operations for the element sizes/types you want to use. This is common for GCC versions installed by "Enterprise Linux" distributions. They tend to freeze the GCC version early and maintain that GCC version for long term stability. One solution is to use the <a href="https://developer.ibm.com/linuxonpower/advance-toolchain/">IBM Advance Toolchain for Linux on Power</a> (AT). AT is free for download and new AT versions are released yearly (usually in August) with the latest stable GCC from that spring.</dd></dl>
<p>This can be a frustrating situation unless you are familiar with:</p><ul>
<li>the PowerISA and how it has evolved.</li>
<li>the history and philosophy behind the implementation of &lt;altivec.h&gt;.</li>
<li>The specific level of support provided by the compiler(s) you are using.</li>
</ul>
<p>And to be fair, this author believes this is too much to ask of your average library or application developer. A higher level and more intuitive API is needed.</p>
<h3><a class="anchor" id="mainpage_sub_1_1"></a>
What can we do about this?</h3>
<p>A lot can be done to improve this situation. For older compilers we can substitute inline assembler for missing &lt;altivec.h&gt; operations. For older processors we can substitute short instruction sequences as equivalents for new instructions. And useful higher level (and more intuitive) operations can be written and shared. All can be collected and provided in headers and libraries.</p>
<h4><a class="anchor" id="mainpage_sub_1_1_1"></a>
Use inline assembler carefully</h4>
<p>First, the Binutils assembler is usually updated within weeks of the public release of the PowerISA document. So while your compiler may not support the latest vector operations as built-in operations, an older compiler with an updated assembler may support the instructions as inline assembler.</p>
<p>Sequences of inline assembler instructions can be wrapped within C language static inline functions and placed in header files for shared use. If you are careful with the input / output register <em>constraints</em>, the GCC compiler can provide local register allocation and minimize parameter marshaling overhead. This is very close (in function) to a specific Altivec (built-in) operation.</p>
<dl class="section note"><dt>Note</dt><dd>Using GCC's inline assembler can be challenging even for the experienced programmer. The register constraints have grown in complexity as new facilities and categories were added. The fact that some (VMX) instructions are restricted to the original 32 Vector Registers (<b>VRs</b>) (the high half of the Vector-Scalar Registers <b>VSRs</b>), while others (Binary and Decimal Floating-Point) are restricted to the original 32 Floating-Point Registers (<b>FPRs</b>, overlapping the low half of the VSRs), and the new VSX instructions can access all 64 VSRs, is just one source of complexity. So it is very important to get your input/output constraints correct if you want inline assembler code to work correctly.</dd></dl>
<p>In-line assembler should be reserved for the first implementation using the latest PowerISA. Where possible you should use existing vector built-ins to implement specific operations for wider element types, support older hardware, or higher order operations. Again wrapping these implementations in static inline functions for collection in header files for reuse and distribution is recommended.</p>
<h4><a class="anchor" id="mainpage_sub_1_1_2"></a>
Define multi-instruction sequences to fill in gaps</h4>
<p>The PowerISA vector facility has all the instructions you need to implement extended precision operations for add, subtract, and multiply. Add / subtract with carry-out and permute or double vector shift and grade-school arithmetic is all you need.</p>
<p>For example the Vector Add Unsigned Quadword Modulo introduced in POWER8 (PowerISA 2.07B) can be implemented for POWER7 and earlier machines in 10-11 instructions. This uses a combination of Vector Add Unsigned Word Modulo (vadduwm), Vector Add and Write Carry-Out Unsigned Word (vaddcuw), and Vector Shift Left Double by Octet Immediate (vsldoi), to propagate the word carries through the quadword.</p>
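<p>The same word-add / carry-out / propagate pattern can be sketched in portable scalar C (the function name is invented here; pveclib's actual vec_adduqm() works on vector registers and propagates carries with vsldoi shifts rather than a loop):</p>

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Scalar sketch of quadword add from word elements: modulo word adds
   (vadduwm), per-word carry-outs (vaddcuw), then carries rippled into
   the next higher words (the vsldoi shift-and-re-add steps).
   Word 0 is the least significant word. */
static u128 adduqm (u128 a, u128 b)
{
  uint32_t t[4], c[4];
  for (int i = 0; i < 4; i++)
    {
      uint64_t s = (uint64_t)(uint32_t)(a >> (32 * i))
                 + (uint32_t)(b >> (32 * i));
      t[i] = (uint32_t) s;          /* vadduwm: modulo word add */
      c[i] = (uint32_t)(s >> 32);   /* vaddcuw: word carry-out  */
    }
  /* Propagate each word carry upward through the quadword; a carry
     out of word 3 is discarded (modulo arithmetic). */
  for (int i = 0; i < 3; i++)
    {
      uint64_t s = (uint64_t) t[i + 1] + c[i];
      t[i + 1] = (uint32_t) s;
      c[i + 1] += (uint32_t)(s >> 32);
    }
  u128 r = 0;
  for (int i = 3; i >= 0; i--)
    r = (r << 32) | t[i];
  return r;
}
```
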
<p>For POWER8 and later, C vector integer (modulo) multiply can be implemented in a single Vector Multiply Unsigned Word Modulo (<b>vmuluwm</b>) instruction. This was added explicitly to address vectorizing loops using int multiply in C language code. And some newer compilers do support generic vec_mul() for vector int. But this is not documented. Similarly for char (byte) and short (halfword) elements.</p>
<p>POWER8 also introduced Vector Multiply Even Signed|Unsigned Word (<b>vmulesw</b>|<b>vmuleuw</b>) and Vector Multiply Odd Signed|Unsigned Word (<b>vmulosw</b>|<b>vmulouw</b>) instructions. So you would expect the generic vec_mule and vec_mulo operations to be extended to support <em>vector int</em>, as these operations have long been supported for char and short. Sadly this is not supported as of GCC 7.3 and inline assembler is required for this case. This support was added for GCC 8.</p>
<p>So what will the compiler do for vector multiply int (modulo, even, or odd) when targeting power7? Older compilers will reject this as an <em>invalid parameter combination ...</em>. A newer compiler may implement the equivalent function in a short sequence of VMX instructions from PowerISA 2.06 or earlier. And GCC 7.3 does support vec_mul (modulo) for element types char, short, and int. These sequences are in the 2-7 instruction range depending on the operation and element type. This includes some constant loads and permute control vectors that can be factored and reused across operations. See <a class="el" href="vec__int32__ppc_8h.html#ab3ea7653d4e60454b91d669e2b1bcfdf" title="Vector Multiply Unsigned Word Modulo. ">vec_muluwm()</a> code for details.</p>
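<p>The shape of such a sequence can be modeled in scalar C (hypothetical names; the real code uses vmuleuh/vmulouh plus a permute to merge lanes): the full-width even/odd products are computed, and keeping only the low half of each product is exactly the modulo result.</p>

```c
#include <stdint.h>

/* Scalar model of halfword multiply modulo synthesized from multiply
   even/odd: each pair of lanes gets a full 32-bit product (vmuleuh /
   vmulouh); truncating to 16 bits and merging the lanes back (the
   vector code uses a permute) gives the per-element modulo product. */
static void muluhm_v (const uint16_t a[8], const uint16_t b[8],
                      uint16_t r[8])
{
  for (int i = 0; i < 8; i += 2)
    {
      uint32_t even = (uint32_t) a[i]     * b[i];     /* vmuleuh lane */
      uint32_t odd  = (uint32_t) a[i + 1] * b[i + 1]; /* vmulouh lane */
      r[i]     = (uint16_t) even;  /* low half = modulo result */
      r[i + 1] = (uint16_t) odd;
    }
}

/* Single-lane view, convenient for checking the arithmetic. */
static uint16_t muluhm (uint16_t a, uint16_t b)
{
  uint16_t av[8] = { a }, bv[8] = { b }, rv[8];
  muluhm_v (av, bv, rv);
  return rv[0];
}
```
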
<h4><a class="anchor" id="mainpage_sub_1_1_3"></a>
Define new and useful operations</h4>
<p>Once the pattern is understood it is not hard to write equivalent sequences using operations from the original &lt;altivec.h&gt;. With a little care these sequences will be compatible with older compilers and older PowerISA versions. These concepts can be extended to operations that the PowerISA and the compiler do not support yet. For example, a processor may not have multiply even/odd/modulo of the required width (word, doubleword, or quadword). This might take 10-12 instructions to implement the next element size bigger than the current processor supports. A full 128-bit by 128-bit multiply with 256-bit result only requires 36 instructions on POWER8 (using multiply word even/odd) and 15 instructions on POWER9 (using vmsumudm).</p>
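<p>The grade-school pattern scales down as well as up. This portable sketch (invented name, scalar stand-in for the vector code) builds a 64x64 → 128-bit multiply from 32-bit "word" multiplies, the same partial-product-and-carry structure pveclib scales up to 128x128 → 256 with vector multiply even/odd word:</p>

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Grade-school 64x64 -> 128 multiply from four 32-bit partial
   products, with carries propagated through the middle columns. */
static u128 mul64x64 (uint64_t a, uint64_t b)
{
  uint32_t a0 = (uint32_t) a, a1 = (uint32_t)(a >> 32);
  uint32_t b0 = (uint32_t) b, b1 = (uint32_t)(b >> 32);

  uint64_t p00 = (uint64_t) a0 * b0;  /* low  x low  */
  uint64_t p01 = (uint64_t) a0 * b1;  /* low  x high */
  uint64_t p10 = (uint64_t) a1 * b0;  /* high x low  */
  uint64_t p11 = (uint64_t) a1 * b1;  /* high x high */

  /* Middle column sum; cannot overflow 64 bits. */
  uint64_t mid = (p00 >> 32) + (uint32_t) p01 + (uint32_t) p10;
  uint64_t lo  = (mid << 32) | (uint32_t) p00;
  uint64_t hi  = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);

  return ((u128) hi << 64) | lo;
}
```

<p>The vector version replaces the four partial products with two multiply even/odd instructions and the carry propagation with add-and-write-carry plus shift-by-octet steps.</p>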
<h4><a class="anchor" id="mainpage_sub_1_1_4"></a>
Leverage other PowerISA facilities</h4>
<p>Also, many of the operations missing from the vector facility exist in the Fixed-point, Floating-point, or Decimal Floating-point scalar facilities. There will be some loss of efficiency in the data transfer, but compared to a complex operation like divide or decimal conversion, this can be a workable solution. On older POWER processors (before power7/8) transfers between register banks (GPR, FPR, VR) had to go through memory. But with the VSX facility (POWER7) FPRs and VRs overlap with the lower and upper halves of the 64 VSR registers. So FPR &lt;-&gt; VSR transfers are 0-2 cycles latency. And with power8 we have direct transfer (GPR &lt;-&gt; FPR | VR | VSR) instructions in the 4-5 cycle latency range.</p>
<p>For example POWER8 added Decimal (<b>BCD</b>) Add/Subtract Modulo (<b>bcdadd</b>, <b>bcdsub</b>) instructions for signed 31 digit vector values. POWER9 added Decimal Convert From/To Signed Quadword (<b>bcdcfsq</b>, <b>bcdctsq</b>) instructions. So far vector unit does not support BCD multiply / divide. But the Decimal Floating-Point (<b>DFP</b>) facility (introduced with PowerISA 2.05 and Power6) supports up to 34-digit (__Decimal128) precision and all the expected (add/subtract/multiply/divide/...) arithmetic operations. DFP also supports conversion to/from 31-digit BCD and __Decimal128 precision. This is all supported with a hardware Decimal Floating-Point Unit (<b>DFU</b>).</p>
<p>So we can implement <a class="el" href="vec__bcd__ppc_8h.html#a047be6d6339193b854e0b41759888939" title="Decimal Add Signed Modulo Quadword. ">vec_bcdadd()</a> and <a class="el" href="vec__bcd__ppc_8h.html#aeb48adc4d015b874089fdf9fc4318509" title="Subtract two Vector Signed BCD 31 digit values. ">vec_bcdsub()</a> with single instructions on POWER8, and 10-11 instructions for Power6/7. This count includes the VSR &lt;-&gt; FPR pair transfers, BCD &lt;-&gt; DFP conversions, and DFP add/sub. Similarly for <a class="el" href="vec__bcd__ppc_8h.html#a5a1aec05a6dadcf5a1a8e028223745df" title="Vector Decimal Convert From Signed Quadword returning up to 31 BCD digits. ">vec_bcdcfsq()</a> and <a class="el" href="vec__bcd__ppc_8h.html#a5086ba6056febb11acd5d5cd18e96dfb" title="Vector Decimal Convert to Signed Quadword. ">vec_bcdctsq()</a>. The POWER8 and earlier implementations are a bit bigger (83 and 32 instructions, respectively) but even the POWER9 hardware implementation runs 37 and 23 cycles (respectively).</p>
<p>The <a class="el" href="vec__bcd__ppc_8h.html#a31e982fe4ae794073eb8e60a2525bb0e" title="Divide a Vector Signed BCD 31 digit value by another BCD value. ">vec_bcddiv()</a> and <a class="el" href="vec__bcd__ppc_8h.html#abd65a5de9b45c2ecd452ee8a546d1418" title="Multiply two Vector Signed BCD 31 digit values. ">vec_bcdmul()</a> operations are implemented by transfer/conversion to __Decimal128 and execute in the DFU. This is slightly complicated by the requirement to preserve correct fixed-point alignment/truncation in the floating-point format. The operation timing runs ~100-200 cycles, mostly driven by the DFP multiply/divide and the number of digits involved.</p>
<dl class="section note"><dt>Note</dt><dd>So why does anybody care about BCD and DFP? Sometimes you get large numbers in decimal that you need converted to binary for extended computation. Sometimes you need to display the results of your extended binary computation in decimal. The multiply by 10 and BCD vector operations help simplify and speed-up these conversions.</dd></dl>
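<p>The decimal-to-binary direction of these conversions reduces to repeated multiply-by-10 with digit accumulation. Here is a scalar sketch (the function name is invented; pveclib's vector code performs the multiply-by-10 steps on quadwords in vector registers):</p>

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Accumulate a string of ASCII decimal digits into a 128-bit binary
   value: each step is acc = acc * 10 + digit, the scalar equivalent
   of the vector multiply-by-10 quadword operations. No overflow
   checking; the caller must supply at most 39 digits. */
static u128 decimal_to_bin (const char *digits)
{
  u128 acc = 0;
  for (; *digits; digits++)
    acc = acc * 10 + (u128)(*digits - '0');
  return acc;
}
```

<p>The binary-to-decimal direction inverts this with divide/modulo by powers of 10, which is where the BCD and DFP hardware support pays off.</p>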
<h4><a class="anchor" id="mainpage_sub_1_1_5"></a>
Use clever tricks</h4>
<p>And finally: Henry S. Warren's wonderful book Hacker's Delight provides inspiration for SIMD versions of count leading zeros, population count, parity, etc.</p>
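<p>For instance, the classic Hacker's Delight divide-and-conquer population count is bit-parallel, so the same pattern maps directly onto SIMD lanes of any width:</p>

```c
#include <stdint.h>

/* Hacker's Delight population count: sum adjacent 1-bit fields into
   2-bit fields, then 4-bit, then 8-bit fields, and finally sum all
   the byte counts with one multiply. Each step operates on all
   fields in parallel, which is why it vectorizes so naturally. */
static int popcount64 (uint64_t x)
{
  x = x - ((x >> 1) & 0x5555555555555555ULL);        /* 2-bit sums */
  x = (x & 0x3333333333333333ULL)
    + ((x >> 2) & 0x3333333333333333ULL);            /* 4-bit sums */
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;        /* byte sums  */
  return (int)((x * 0x0101010101010101ULL) >> 56);   /* total      */
}
```
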
<h3><a class="anchor" id="mainpage_sub_1_2"></a>
So what can the Power Vector Library project do?</h3>
<p>Clearly the PowerISA provides multiple, extensive, and powerful computational facilities that continue to evolve and grow. But the best instruction sequence for a specific computation depends on which POWER processor(s) you have or plan to support. It can also depend on the specific compiler version you use, unless you are willing to write some of your application code in assembler. Even then you need to be aware of the PowerISA versions and when specific instructions were introduced. This can be frustrating if you just want to port your application to POWER for a quick evaluation.</p>
<p>So you would like to start evaluating how to leverage this power for key algorithms at the heart of your application.</p><ul>
<li>But you are working with an older POWER processor (until the latest POWER box is delivered).</li>
<li>Or the latest POWER machine just arrived at your site (or cloud) but you are stuck using an older/stable Linux distro version (with an older distro compiler).</li>
<li>Or you need extended precision multiply for your crypto code but you are not really an assembler level programmer (or don't want to be).</li>
<li>Or you would like to program with higher level operations to improve your own productivity.</li>
</ul>
<p>Someone with the right background (knowledge of the PowerISA, assembler level programming, compilers and the vector built-ins, ...) can solve any of the issues described above. But you don't have time for this.</p>
<p>There should be an easier way to exploit the POWER vector hardware without getting lost in the details. And this extends beyond classical vector (Single Instruction Multiple Data (SIMD)) programming to exploiting larger data widths (128-bit and beyond) and a larger register space (64 x 128-bit Vector Scalar Registers).</p>
<h4><a class="anchor" id="mainpage_para_1_2_0"></a>
Vector Add Unsigned Quadword Modulo example</h4>
<p>Here is an example of what can be done:</p><div class="fragment"><div class="line"><span class="keyword">static</span> <span class="keyword">inline</span> <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a></div><div class="line"><a class="code" href="vec__int128__ppc_8h.html#a539de2a4426a84102471306acc571ce8">vec_adduqm</a> (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> a, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> b)</div><div class="line">{</div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> t;</div><div class="line"><span class="preprocessor">#ifdef _ARCH_PWR8</span></div><div class="line"><span class="preprocessor">#ifndef vec_vadduqm</span></div><div class="line"> __asm__(</div><div class="line"> <span class="stringliteral">"vadduqm %0,%1,%2;"</span></div><div class="line"> : <span class="stringliteral">"=v"</span> (t)</div><div class="line"> : <span class="stringliteral">"v"</span> (a),</div><div class="line"> <span class="stringliteral">"v"</span> (b)</div><div class="line"> : );</div><div class="line"><span class="preprocessor">#else</span></div><div class="line"> t = (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) vec_vadduqm (a, b);</div><div class="line"><span class="preprocessor">#endif</span></div><div class="line"><span class="preprocessor">#else</span></div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> c, c2;</div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> z= { 0,0,0,0};</div><div class="line"></div><div class="line"> c = vec_vaddcuw ((<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>)a, (<a class="code" 
href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>)b);</div><div class="line"> t = vec_vadduwm ((<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>)a, (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>)b);</div><div class="line"> c = vec_sld (c, z, 4);</div><div class="line"> c2 = vec_vaddcuw (t, c);</div><div class="line"> t = vec_vadduwm (t, c);</div><div class="line"> c = vec_sld (c2, z, 4);</div><div class="line"> c2 = vec_vaddcuw (t, c);</div><div class="line"> t = vec_vadduwm (t, c);</div><div class="line"> c = vec_sld (c2, z, 4);</div><div class="line"> t = vec_vadduwm (t, c);</div><div class="line"><span class="preprocessor">#endif</span></div><div class="line"> <span class="keywordflow">return</span> ((<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>) t);</div><div class="line">}</div></div><!-- fragment --><p>The <b>_ARCH_PWR8</b> macro is defined by the compiler when it targets POWER8 (PowerISA 2.07) or later. This is the first processor and PowerISA level to support vector quadword add/subtract. Otherwise we need to use the vector word add modulo and vector word add and write carry-out word, to add 32-bit chunks and propagate the carries through the quadword.</p>
<p>One little detail remains. Support for vec_vadduqm was added to GCC in March of 2014, after GCC 4.8 was released and GCC 4.9's feature freeze. So the only guarantee is that this feature is in GCC 5.0 and later. At some point this change was backported to GCC 4.8 and 4.9, as it is included in the current GCC 4.8/4.9 documentation. When or if these backports were propagated to a specific Linux Distro version or update is difficult to determine. So support for this vector built-in depends on the specific version of the GCC compiler, or on whether a specific Distro update includes these backports for the GCC 4.8/4.9 compiler it ships. The:</p><div class="fragment"><div class="line"><span class="preprocessor">#ifndef vec_vadduqm</span></div></div><!-- fragment --><p> C preprocessor conditional checks whether <b>vec_vadduqm</b> is defined in <altivec.h>. If defined we can assume that the compiler implements <b>__builtin_vec_vadduqm</b> and that <altivec.h> includes the macro definition:</p><div class="fragment"><div class="line"><span class="preprocessor">#define vec_vadduqm __builtin_vec_vadduqm</span></div></div><!-- fragment --><p> For <b>_ARCH_PWR7</b> and earlier we need a little grade school arithmetic using Vector Add Unsigned Word Modulo (<b>vadduwm</b>) and Vector Add and Write Carry-Out Unsigned Word (<b>vaddcuw</b>). This treats the vector __int128 as 4 32-bit binary digits. The first instruction sums each (32-bit digit) column and the second records the carry out of the high order bit of each word. This leaves the carry bit in the original (word) column, so a 32-bit left shift is needed to line up the carries with the next higher word.</p>
<p>To propagate any carries across all 4 (word) digits, repeat this (add / carry / shift) sequence three times. Then a final add modulo word completes the 128-bit add. This sequence requires 10-11 instructions. The 11th instruction is a vector splat word 0 immediate, which is needed in the shift left (vsldoi) instructions. This is common in vector code and the compiler can usually reuse this register across several blocks of code and inline functions.</p>
<p>For POWER7/8 these instructions are all 2 cycle latency and 2 per cycle throughput. The vadduwm / vaddcuw instruction pairs should issue in the same cycle and execute in parallel. So the expected latency for this sequence is 14 cycles. For POWER8 the vadduqm instruction has a 4 cycle latency.</p>
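<p>The add / carry / shift sequence can be modeled in portable scalar C. The type and function names below are illustrative only (not the pveclib API); the model treats a quadword as four 32-bit digits, with index 0 least significant:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the pre-POWER8 sequence: generate per-word sums
   (vadduwm) and carries (vaddcuw), then shift the carries up one
   word (vsldoi) and repeat until they are absorbed. */
typedef struct { uint32_t w[4]; } quad_t;   /* hypothetical model type */

static quad_t quad_add (quad_t a, quad_t b)
{
  quad_t t, c;
  for (int i = 0; i < 4; i++)
    {
      uint64_t s = (uint64_t) a.w[i] + b.w[i];
      t.w[i] = (uint32_t) s;          /* vadduwm: modulo word sums   */
      c.w[i] = (uint32_t) (s >> 32);  /* vaddcuw: carry-out per word */
    }
  /* Three passes of (shift carries up one word, add, re-carry)
     propagate a carry across all four digits. */
  for (int pass = 0; pass < 3; pass++)
    {
      quad_t c2 = { { 0, 0, 0, 0 } }, cs = { { 0, 0, 0, 0 } };
      for (int i = 1; i < 4; i++)
        cs.w[i] = c.w[i - 1];         /* vsldoi: align carry-in */
      for (int i = 0; i < 4; i++)
        {
          uint64_t s = (uint64_t) t.w[i] + cs.w[i];
          t.w[i] = (uint32_t) s;
          c2.w[i] = (uint32_t) (s >> 32);
        }
      c = c2;
    }
  return t;
}

/* (2^96 - 1) + 1 must ripple a carry across three word boundaries. */
static int quad_add_selftest (void)
{
  quad_t a = { { 0xffffffffu, 0xffffffffu, 0xffffffffu, 0 } };
  quad_t b = { { 1, 0, 0, 0 } };
  quad_t r = quad_add (a, b);
  return r.w[0] == 0 && r.w[1] == 0 && r.w[2] == 0 && r.w[3] == 1;
}
```

The hardware sequence fuses the model's inner loops into paired vadduwm / vaddcuw instructions that issue together, which is why the vector version needs only 10-11 instructions.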
<p>Similarly for the carry / extend forms which can be combined to support wider (256, 512, 1024, ...) extended arithmetic. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int128__ppc_8h.html#ad7aaadba249ce46c4c94f78df1020da3" title="Vector Add & write Carry Unsigned Quadword. ">vec_addcuq</a>, <a class="el" href="vec__int128__ppc_8h.html#a44e63f70b182d60fe03b43a80647451a" title="Vector Add Extended Unsigned Quadword Modulo. ">vec_addeuqm</a>, and <a class="el" href="vec__int128__ppc_8h.html#af18b98d2d73f1afbc439e1407c78f305" title="Vector Add Extended & write Carry Unsigned Quadword. ">vec_addecuq</a></dd></dl>
<h4><a class="anchor" id="mainpage_para_1_2_1"></a>
Vector Multiply-by-10 Unsigned Quadword example</h4>
<p>PowerISA 3.0 (POWER9) added this instruction and it's extend / carry forms to speed up decimal to binary conversion for large numbers. But this operation is generally useful and not that hard to implement for earlier processors. </p><div class="fragment"><div class="line"><span class="keyword">static</span> <span class="keyword">inline</span> <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a></div><div class="line"><a class="code" href="vec__int128__ppc_8h.html#a3675fa1a2334eff913df447904be78ad">vec_mul10uq</a> (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> a)</div><div class="line">{</div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> t;</div><div class="line"><span class="preprocessor">#ifdef _ARCH_PWR9</span></div><div class="line"> __asm__(</div><div class="line"> <span class="stringliteral">"vmul10uq %0,%1;\n"</span></div><div class="line"> : <span class="stringliteral">"=v"</span> (t)</div><div class="line"> : <span class="stringliteral">"v"</span> (a)</div><div class="line"> : );</div><div class="line"><span class="preprocessor">#else</span></div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#afb47075b07673afbf78f8c60298f3712">vui16_t</a> ts = (<a class="code" href="vec__common__ppc_8h.html#afb47075b07673afbf78f8c60298f3712">vui16_t</a>) a;</div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#afb47075b07673afbf78f8c60298f3712">vui16_t</a> t10;</div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> t_odd, t_even;</div><div class="line"> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> z = { 0, 0, 0, 0 };</div><div class="line"> t10 = vec_splat_u16(10);</div><div class="line"><span class="preprocessor">#if __BYTE_ORDER__ == 
__ORDER_LITTLE_ENDIAN__</span></div><div class="line"> t_even = vec_vmulouh (ts, t10);</div><div class="line"> t_odd = vec_vmuleuh (ts, t10);</div><div class="line"><span class="preprocessor">#else</span></div><div class="line"> t_even = vec_vmuleuh(ts, t10);</div><div class="line"> t_odd = vec_vmulouh(ts, t10);</div><div class="line"><span class="preprocessor">#endif</span></div><div class="line"> t_even = vec_sld (t_even, z, 2);</div><div class="line"><span class="preprocessor">#ifdef _ARCH_PWR8</span></div><div class="line"> t = (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) vec_vadduqm ((<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>) t_even, (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>) t_odd);</div><div class="line"><span class="preprocessor">#else</span></div><div class="line"> t = (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) <a class="code" href="vec__int128__ppc_8h.html#a539de2a4426a84102471306acc571ce8">vec_adduqm</a> ((<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>) t_even, (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>) t_odd);</div><div class="line"><span class="preprocessor">#endif</span></div><div class="line"><span class="preprocessor">#endif</span></div><div class="line"> <span class="keywordflow">return</span> ((<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>) t);</div><div class="line">}</div></div><!-- fragment --><p>Notice that under the <b>_ARCH_PWR9</b> conditional, there is no check for the specific <b>vec_vmul10uq</b> built-in. As of this writing <b>vec_vmul10uq</b> is not included in the <em>OpenPOWER ELF2 ABI</em> documentation nor in the latest GCC trunk source code.</p>
<dl class="section note"><dt>Note</dt><dd>The <em>OpenPOWER ELF2 ABI</em> does define <b>bcd_mul10</b> which (from the description) will actually generate Decimal Shift (<b>bcds</b>). This instruction shifts 4-bit nibbles (BCD digits) left or right while preserving the BCD sign nibble in bits 124-127. While this is a handy instruction to have, it is not the same operation as <b>vec_vmul10uq</b>, which is a true 128-bit binary multiply by 10. As of this writing <b>bcd_mul10</b> support is not included in the latest GCC trunk source code.</dd></dl>
<p>For <b>_ARCH_PWR8</b> and earlier we need a little grade school arithmetic using <b>Vector Multiply Even/Odd Unsigned Halfword</b>. This treats the vector __int128 as 8 16-bit binary digits. We multiply each of these 16-bit digits by 10, which is done in two (even and odd) parts. The result is 4 32-bit (2 16-bit digits) partial products for the even digits and 4 32-bit products for the odd digits. Within the vector register (independent of endian), the even product elements are higher order and the odd product elements are lower order.</p>
<p>The even digit partial products are offset right by 16-bits in the register. If we shift the even products left 1 (16-bit) digit, the even digits are lined up in columns with the odd digits. Now we can sum across partial products to get the final 128 bit product.</p>
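<p>The digit decomposition can be checked with a scalar sketch, assuming a compiler that provides GCC's unsigned __int128. The helper below is an illustration (not pveclib code): it multiplies each 16-bit digit by 10 into a 32-bit partial product and sums the partial products at their digit positions, which is exactly what the shifted even/odd product sum computes:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vec_mul10uq's pre-POWER9 path: treat the quadword
   as eight 16-bit digits (index 0 least significant here), multiply
   each digit by 10 into a 32-bit partial product, then accumulate the
   partial products at their digit positions. In the vector code the
   even products are shifted left one halfword so their columns line
   up with the odd products before the quadword sum. */
static unsigned __int128 mul10_by_digits (unsigned __int128 a)
{
  unsigned __int128 sum = 0;
  for (int i = 0; i < 8; i++)
    {
      uint32_t prod = (uint32_t)(uint16_t)(a >> (16 * i)) * 10u;
      sum += (unsigned __int128) prod << (16 * i);
    }
  return sum;
}
```

Since each partial product is placed at its own digit position, the accumulated sum equals the full a * 10 product (modulo 2**128, matching the modulo semantics of the instruction).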
<p>Notice also the conditional code for endian around the <b>vec_vmulouh</b> and <b>vec_vmuleuh</b> built-ins:</p><div class="fragment"><div class="line"><span class="preprocessor">#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__</span></div></div><!-- fragment --><p>Little endian (<b>LE</b>) changes the element numbering. This also changes the meaning of even / odd and this affects the code generated by compilers. But the relationship of high and low order bytes, within multiplication products, is defined by the hardware and does not change. (See: <a class="el" href="index.html#mainpage_endian_issues_1_1">General Endian Issues</a>) So the pveclib implementation needs to pre-swap the even/odd partial product multiplies for LE. This in effect nullifies the even / odd swap hidden in the compiler's <b>LE</b> code generation and the resulting code gives the correct results.</p>
<p>Now we are ready to sum the partial product <em>digits</em> while propagating the digit carries across the 128-bit product. For <b>_ARCH_PWR8</b> we can use <b>Vector Add Unsigned Quadword Modulo</b> which handles all the internal carries in hardware. Before <b>_ARCH_PWR8</b> we only have <b>Vector Add Unsigned Word Modulo</b> and <b>Vector Add and Write Carry-Out Unsigned Word</b>.</p>
<p>We see these instructions used in the <b>else</b> leg of the pveclib <b>vec_adduqm</b> implementation above. We can assume that this implementation is correct and tested for supported platforms. So here we use another pveclib function to complete the implementation of <b>Vector Multiply-by-10 Unsigned Quadword</b>.</p>
<p>Again similarly for the carry / extend forms which can be combined to support wider (256, 512, 1024, ...) extended decimal to binary conversions. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int128__ppc_8h.html#a8c641b0107fc3e1621ef729c04efd583" title="Vector Multiply by 10 & write Carry Unsigned Quadword. ">vec_mul10cuq</a>, <a class="el" href="vec__int128__ppc_8h.html#a2245626e7b90621b33ba79b763a4215e" title="Vector Multiply by 10 Extended Unsigned Quadword. ">vec_mul10euq</a>, and <a class="el" href="vec__int128__ppc_8h.html#a7ca2a6427ecb9458858b5caaac8c4dca" title="Vector Multiply by 10 Extended & write Carry Unsigned Quadword. ">vec_mul10ecuq</a></dd></dl>
<p>And similarly for full 128-bit x 128-bit multiply which combined with the add quadword carry / extended forms above can be used to implement wider (256, 512, 1024, ...) multiply operations. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int128__ppc_8h.html#a9aaaf0e4c2705be1e0e8e925b09c52de" title="Vector Multiply Low Unsigned Quadword. ">vec_mulluq</a> and <a class="el" href="vec__int128__ppc_8h.html#aee5c5b2998ef105b4c6f39739748ffa8" title="Vector Multiply Unsigned Double Quadword. ">vec_muludq</a> </dd>
<dd>
<a class="el" href="vec__int32__ppc_8h.html#i32_example_0_0_0">Vector Merge Algebraic High Word example</a> </dd>
<dd>
<a class="el" href="vec__int32__ppc_8h.html#i32_example_0_0_1">Vector Multiply High Unsigned Word example</a></dd></dl>
<h3><a class="anchor" id="mainpage_sub3"></a>
pveclib is not a matrix math library</h3>
<p>The pveclib does not implement general purpose matrix math operations. These should continue to be developed and improved within existing projects (e.g. LAPACK, OpenBLAS, ATLAS). We believe that pveclib will be helpful to implementors of matrix math libraries by providing a higher level, more portable, and more consistent vector interface for the PowerISA.</p>
<p>A decision is still pending on: extended arithmetic, cryptographic, compression/decompression, pattern matching / search, and small vector math libraries (libmvec). This author believes that the small vector math implementation should be part of GLIBC (libmvec). But the lack of optimized implementations, or even good documentation and examples, for these topics is a concern. This may be something that PVECLIB can address by providing enabling kernels or examples.</p>
<h2><a class="anchor" id="mainpage_sub_2x"></a>
Practical considerations.</h2>
<h3><a class="anchor" id="mainpage_endian_issues_1_1"></a>
General Endian Issues</h3>
<p>For POWER8, IBM made the explicit decision to support Little Endian (<b>LE</b>) data format in the Linux ecosystem. The goal was to enhance application code portability across Linux platforms. This goal was integrated into the OpenPOWER ELF V2 Application Binary Interface <b>ABI</b> specification.</p>
<p>The POWER8 processor architecturally supports an <em>Endian Mode</em> and supports both BE and LE storage access in hardware. However, register to register operations are not effected by endian mode. The ABI extends the LE storage format to vector register (logical) element numbering. See OpenPOWER ABI specification <a href="http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655244_pgfId-1095944.html">Chapter 6. Vector Programming Interfaces</a> for details.</p>
<p>This has no effect for most altivec.h operations where the input elements and the results "stay in their lanes". For operations of the form (T[n] = A[n] op B[n]), it does not matter if elements are numbered [0, 1, 2, 3] or [3, 2, 1, 0].</p>
<p>But there are cases where element renumbering can change the results. Changing element numbering does change the even / odd relationship for merge and integer multiply. For <b>LE</b> targets, operations accessing even vector elements are implemented using the equivalent odd instruction (and vice versa) and inputs are swapped. Similarly for high and low merges. Inputs are also swapped for Pack, Unpack, and Permute operations and the permute select vector is inverted. The above is just a sampling of a larger list of <em>LE transforms</em>. The OpenPOWER ABI specification provides a helpful table of <a href="http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655244_90667.html">Endian-Sensitive Operations</a>.</p>
<dl class="section note"><dt>Note</dt><dd>This means that the vector built-ins provided by altivec.h may not generate the instructions you expect.</dd></dl>
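<p>The even/odd swap follows directly from the renumbering rule. For an n-element vector, BE element i occupies the same physical lane as LE element (n-1-i), so with an even element count every BE-even element is LE-odd. A trivial sketch (illustrative only):</p>

```c
#include <assert.h>

/* Map a BE element index to the LE index of the same physical lane
   in an n-element vector register. With n even, BE-even indices
   always map to LE-odd indices, which is why the compiler swaps
   multiply even/odd (and merge high/low) built-ins for LE targets. */
static int le_index (int be_index, int n)
{
  return n - 1 - be_index;
}
```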
<p>This does matter when doing extended precision arithmetic. Here we need to maintain most-to-least significant byte order and align "digit" columns for summing partial products. Many of these operations were defined long before Little Endian was seriously considered and are decidedly Big Endian in register format. Basically, any operation where the element changes size (truncated, extended, converted, subsetted) from input to output is suspect for <b>LE</b> targets.</p>
<p>The coding for these higher level operations is complicated by <em>Little Endian</em> (LE) support as specified in the OpenPOWER ABI and as implemented in the compilers. Little Endian changes the effective vector element numbering and the location of even and odd elements.</p>
<p>This is a general problem for using vectors to implement extended precision arithmetic. The multiply even/odd operations are the primary example. The products are double-wide and in BE order in the vector register. This is reinforced by the Vector Add/Subtract Unsigned Doubleword/Quadword instructions. And the products from multiply even instructions are always <em>numerically</em> higher digits than multiply odd products. The pack, unpack, and sum operations have similar issues.</p>
<p>This matters when you need to align (shift) the partial products or select the <em>numerically</em> high or lower portion of the products. The (high to low) order of elements for the multiply has to match the order of the largest element size used in accumulating partial sums. This is normally a quadword (vadduqm instruction).</p>
<p>So the element order is fixed while the element numbering and the partial products (between even and odd) will change between BE and LE. This affects splatting and octet shift operations required to align partial products for summing. These are the places where careful programming is required to nullify the compiler's LE transforms so we get the correct numerical answer.</p>
<p>So what can the Power Vector Library do to help?</p><ul>
<li>Be aware of these mandated LE transforms and if required provide compliant inline assembler implementations for LE.</li>
<li>Where required for correctness provide LE specific implementations that have the effect of nullifying the unwanted transforms.</li>
<li>Provide higher level operations that help pveclib and application code work in an endian neutral way and get correct results.</li>
</ul>
<dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int32__ppc_8h.html#i32_endian_issues_0_0">Endian problems with word operations</a> </dd>
<dd>
<a class="el" href="index.html#mainpage_para_1_2_1">Vector Multiply-by-10 Unsigned Quadword example</a></dd></dl>
<h3><a class="anchor" id="mainpage_sub_1_3"></a>
Returning extended quadword results.</h3>
<p>Extended quadword add, subtract and multiply results can exceed the width of a single 128-bit vector. A 128-bit add can produce a 129-bit result. An unsigned 128-bit by 128-bit multiply can produce a 256-bit result. This is simplified for the <em>modulo</em> case, where any result bits above the low order 128 can be discarded. But extended arithmetic requires returning the full precision result. Returning double wide quadword results is a complication for both RISC processor and C language library design.</p>
<h4><a class="anchor" id="mainpage_sub_1_3_1"></a>
PowerISA and Implementation.</h4>
<p>For a RISC processor, encoding multiple return registers forces hard trade-offs in a fixed sized instruction format. Also building a vector register file that can support at least one (or more) double wide register writes per cycle is challenging. For a super-scalar machine with multiple vector execution pipelines, the processor can issue and complete multiple instructions per cycle. As most operations return single vector results, this is a higher priority than optimizing for double wide results.</p>
<p>The PowerISA addresses this by splitting these operations into two instructions that execute independently. Here independent means that given the same inputs, one instruction does not depend on the result of the other. Independent instructions can execute out-of-order, or if the processor has multiple vector execution pipelines, can execute (issue and complete) concurrently.</p>
<p>The original VMX implementation had Vector Add/Subtract Unsigned Word Modulo (<b>vadduwm</b> / <b>vsubuwm</b>), paired with Vector Add/Subtract and Write Carry-out Unsigned Word (<b>vaddcuw</b> / <b>vsubcuw</b>). Most usage ignores the carry-out and only uses the add/sub modulo instructions. Applications requiring extended precision, pair the add/sub modulo with add/sub write carry-out, to capture the carry and propagate it to higher order bits.</p>
<p>The (four word) carries are generated into the same <em>word lane</em> as the source addends and modulo result. Propagating the carries requires a separate shift (to align the carry-out with the low order (carry-in) bit of the next higher word) and another add word modulo.</p>
<p>POWER8 (PowerISA 2.07B) added full Vector Add/Subtract Unsigned Quadword Modulo (<b>vadduqm</b> / <b>vsubuqm</b>) instructions, paired with corresponding Write Carry-out instructions. (<b>vaddcuq</b> / <b>vsubcuq</b>). A further improvement over the word instructions was the addition of three operand <em>Extend</em> forms which combine add/subtract with carry-in (<b>vaddeuqm</b>, <b>vsubeuqm</b>, <b>vaddecuq</b> and <b>vsubecuq</b>). This simplifies propagating the carry-out into higher quadword operations. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int128__ppc_8h.html#a539de2a4426a84102471306acc571ce8" title="Vector Add Unsigned Quadword Modulo. ">vec_adduqm</a>, <a class="el" href="vec__int128__ppc_8h.html#ad7aaadba249ce46c4c94f78df1020da3" title="Vector Add & write Carry Unsigned Quadword. ">vec_addcuq</a>, <a class="el" href="vec__int128__ppc_8h.html#a44e63f70b182d60fe03b43a80647451a" title="Vector Add Extended Unsigned Quadword Modulo. ">vec_addeuqm</a>, <a class="el" href="vec__int128__ppc_8h.html#af18b98d2d73f1afbc439e1407c78f305" title="Vector Add Extended & write Carry Unsigned Quadword. ">vec_addecuq</a></dd></dl>
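<p>The semantics of this quadword add family can be sketched in scalar C using GCC's unsigned __int128 (function names here are illustrative, not the pveclib API). The extend forms consume only the low bit of the carry-in, and chaining a modulo/write-carry pair with an extend pair builds 256-bit and wider adds:</p>

```c
#include <assert.h>

typedef unsigned __int128 u128;

/* vadduqm: 128-bit modulo sum. */
static u128 adduqm (u128 a, u128 b)  { return a + b; }

/* vaddcuq: carry-out of the 128-bit sum (0 or 1). */
static u128 addcuq (u128 a, u128 b)  { return (a + b) < a; }

/* vaddeuqm: modulo sum plus the low bit of the carry-in. */
static u128 addeuqm (u128 a, u128 b, u128 ci)
{
  return a + b + (ci & 1);
}

/* vaddecuq: carry-out of (a + b + carry-in).  With a carry-in of 1,
   the sum wraps exactly when the wrapped result is <= a. */
static u128 addecuq (u128 a, u128 b, u128 ci)
{
  u128 s = a + b + (ci & 1);
  return (ci & 1) ? (s <= a) : (s < a);
}
```

For a 256-bit add of (ah:al) + (bh:bl), compute low = adduqm(al, bl), c = addcuq(al, bl), then high = addeuqm(ah, bh, c), with addecuq(ah, bh, c) available as the carry into yet wider sums.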
<p>POWER9 (PowerISA 3.0B) added Vector Multiply-by-10 Unsigned Quadword (Modulo is implied), paired with Vector Multiply-by-10 and Write Carry-out Unsigned Quadword (<b>vmul10uq</b> / <b>vmul10cuq</b>). And the <em>Extend</em> forms (<b>vmul10euq</b> / <b>vmul10ecuq</b>) simplifies the digit (0-9) carry-in for extended precision decimal to binary conversions. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int128__ppc_8h.html#a3675fa1a2334eff913df447904be78ad" title="Vector Multiply by 10 Unsigned Quadword. ">vec_mul10uq</a>, <a class="el" href="vec__int128__ppc_8h.html#a8c641b0107fc3e1621ef729c04efd583" title="Vector Multiply by 10 & write Carry Unsigned Quadword. ">vec_mul10cuq</a>, <a class="el" href="vec__int128__ppc_8h.html#a2245626e7b90621b33ba79b763a4215e" title="Vector Multiply by 10 Extended Unsigned Quadword. ">vec_mul10euq</a>, <a class="el" href="vec__int128__ppc_8h.html#a7ca2a6427ecb9458858b5caaac8c4dca" title="Vector Multiply by 10 Extended & write Carry Unsigned Quadword. ">vec_mul10ecuq</a></dd></dl>
<p>The VMX integer multiply operations are split into multiply even/odd instructions by element size. The product requires the next larger element size (twice as many bits). So a vector multiply byte would generate 16 halfword products (256-bits in total). Requiring separate even and odd multiply instructions cuts the total generated product bits (per instruction) in half. It also simplifies the hardware design by keeping the generated product in adjacent element lanes. So each vector multiply even or odd byte operation generates 8 halfword products (128-bits) per instruction.</p>
<p>This multiply even/odd technique applies to most element sizes from byte up to doubleword. The original VMX supports multiply even/odd byte and halfword operations. In the original VMX, arithmetic operations were restricted to byte, halfword, and word elements. Multiply halfword products fit within the integer word element. No multiply byte/halfword modulo instructions were provided, but they could be implemented via a vmule, vmulo, vperm sequence.</p>
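<p>A scalar model of the even/odd halfword multiply illustrates the lane structure (BE element numbering, illustrative names — not the altivec.h API). Each operation reads both 8-lane inputs but produces only 4 word products, from the even (or odd) lanes, each product staying within the lane pair it came from:</p>

```c
#include <assert.h>
#include <stdint.h>

/* vmuleuh model: word product i comes from even halfword lanes 2i. */
static void mule_uh (const uint16_t a[8], const uint16_t b[8],
                     uint32_t p[4])
{
  for (int i = 0; i < 4; i++)
    p[i] = (uint32_t) a[2 * i] * b[2 * i];
}

/* vmulouh model: word product i comes from odd halfword lanes 2i+1. */
static void mulo_uh (const uint16_t a[8], const uint16_t b[8],
                     uint32_t p[4])
{
  for (int i = 0; i < 4; i++)
    p[i] = (uint32_t) a[2 * i + 1] * b[2 * i + 1];
}

/* Multiply lanes {1..8} by 10 and spot-check both product vectors. */
static int mul_uh_selftest (void)
{
  uint16_t a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  uint16_t b[8] = { 10, 10, 10, 10, 10, 10, 10, 10 };
  uint32_t e[4], o[4];
  mule_uh (a, b, e);
  mulo_uh (a, b, o);
  return e[0] == 10 && e[3] == 70 && o[0] == 20 && o[3] == 80;
}
```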
<p>POWER8 (PowerISA 2.07B) added multiply even/odd word and multiply modulo word instructions. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int32__ppc_8h.html#ac93f07d5ad73243db2771da83b50d6d8" title="Vector multiply even unsigned words. ">vec_muleuw</a>, <a class="el" href="vec__int32__ppc_8h.html#a3ca45c65b9627abfc493d4ad500a961d" title="Vector multiply odd unsigned words. ">vec_mulouw</a>, <a class="el" href="vec__int32__ppc_8h.html#ab3ea7653d4e60454b91d669e2b1bcfdf" title="Vector Multiply Unsigned Word Modulo. ">vec_muluwm</a></dd></dl>
<p>The latest PowerISA (3.0B for POWER9) does add a doubleword integer multiply via <b>Vector Multiply-Sum unsigned Doubleword Modulo</b>. This is a departure from the Multiply even/odd byte/halfword/word instructions available in earlier Power processors. But careful conditioning of the inputs can generate the equivalent of multiply even/odd unsigned doubleword. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int64__ppc_8h.html#a1d183ebd232e5826be109cdaa421aeed" title="Vector Multiply-Sum Unsigned Doubleword Modulo. ">vec_msumudm</a>, <a class="el" href="vec__int64__ppc_8h.html#a26f95e02f7b0551e3f2bb7e4b4da040d" title="Vector Multiply Even Unsigned Doublewords. ">vec_muleud</a>, <a class="el" href="vec__int64__ppc_8h.html#aa989582cbfaa7984f78a937225e92f4a" title="Vector Multiply Odd Unsigned Doublewords. ">vec_muloud</a></dd></dl>
<p>This (multiply even/odd) technique breaks down when the input element size is quadword or larger. A quadword integer multiply forces a different split. The easiest next step would be a high/low split (like the Fixed-point integer multiply). A multiply low (modulo) quadword would be a useful function. Paired with multiply high quadword provides the double quadword product. This would provide the basis for higher (multi-quadword) precision multiplies. </p><dl class="section see"><dt>See also</dt><dd><a class="el" href="vec__int128__ppc_8h.html#a9aaaf0e4c2705be1e0e8e925b09c52de" title="Vector Multiply Low Unsigned Quadword. ">vec_mulluq</a>, <a class="el" href="vec__int128__ppc_8h.html#aee5c5b2998ef105b4c6f39739748ffa8" title="Vector Multiply Unsigned Double Quadword. ">vec_muludq</a></dd></dl>
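<p>The high/low split is ordinary schoolbook multiplication on 64-bit limbs. A scalar sketch using GCC's unsigned __int128 (illustrative only, not the pveclib implementation) forms the four partial products and folds the carries into a 256-bit (high, low) result; a vec_muludq-style operation does the same with vector multiply even/odd doubleword partial products:</p>

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* 128 x 128 -> 256-bit multiply from four 64x64->128 partial
   products: a*b = hh<<128 + (lh + hl)<<64 + ll. */
static void mul128 (u128 a, u128 b, u128 *high, u128 *low)
{
  uint64_t al = (uint64_t) a, ah = (uint64_t)(a >> 64);
  uint64_t bl = (uint64_t) b, bh = (uint64_t)(b >> 64);
  u128 ll = (u128) al * bl;   /* contributes to bits   0..127 */
  u128 lh = (u128) al * bh;   /* contributes to bits  64..191 */
  u128 hl = (u128) ah * bl;   /* contributes to bits  64..191 */
  u128 hh = (u128) ah * bh;   /* contributes to bits 128..255 */
  /* Sum the middle column; mid cannot overflow 128 bits. */
  u128 mid = (u128)(uint64_t) lh + (uint64_t) hl + (ll >> 64);
  *low  = ((u128)(uint64_t) mid << 64) | (uint64_t) ll;
  *high = hh + (lh >> 64) + (hl >> 64) + (mid >> 64);
}

/* Check 2^64 * 2^64 = 2^128 and the extreme (2^128-1)^2 case. */
static int mul128_selftest (void)
{
  u128 h, l;
  mul128 ((u128) 1 << 64, (u128) 1 << 64, &h, &l);
  if (h != 1 || l != 0)
    return 0;
  mul128 (~(u128) 0, ~(u128) 0, &h, &l);  /* 2^256 - 2^129 + 1 */
  return h == ~(u128) 0 - 1 && l == 1;
}
```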
<h4><a class="anchor" id="mainpage_sub_1_3_2"></a>
C Language restrictions.</h4>
<p>The Power Vector Library is implemented using C language (inline) functions and this imposes its own restrictions. Standard C language allows an arbitrary number of formal parameters and one return value per function. Parameters and return values with simple C types are normally transferred (passed / returned) efficiently in local (high performance) hardware registers. Aggregate types (struct, union, and arrays of arbitrary size) are normally handled by pointer indirection. The details are defined in the appropriate Application Binary Interface (ABI) documentation.</p>
<p>The POWER processor provides lots of registers (96) so we want to use registers wherever possible. Especially when our application is composed of collections of small functions. And more especially when these functions are small enough to inline and we want the compiler to perform local register allocation and common subexpression elimination optimizations across these functions. The PowerISA defines three kinds of registers:</p><ul>
<li>General Purpose Registers (GPRs),</li>
<li>Floating-point Registers (FPRs),</li>
<li>Vector registers (VRs),</li>
</ul>
<p>with 32 of each kind. We will ignore the various special registers for now.</p>
<p>The PowerPC64 64-bit ELF (and OpenPOWER ELF V2) ABIs normally pass simple arguments and return values in a single register (of the appropriate kind) per value. Arguments of aggregate types are passed as storage pointers in General Purpose Registers (GPRs).</p>
<p>The language specification, the language implementation, and the ABI provide some exceptions. The C99 language adds _Complex floating types which are composed of real and imaginary parts. GCC adds _Complex integer types. For PowerPC ABIs complex values are held in a pair of registers of the appropriate kind. C99 also adds double word integers as the <em>long long int</em> type. This only matters for PowerPC 32-bit ABIs. For PowerPC64 ABIs <em>long long</em> and <em>long</em> are both 64-bit integers and are held in 64-bit GPRs.</p>
<p>GCC also adds the __int128 type for some targets including the PowerPC64 ABIs. Values of __int128 type are held (for operations, parameter passing and function return) in 64-bit GPR pairs. Starting with version 4.9 GCC supports the vector signed/unsigned __int128 type. This is passed and returned as a single vector register and should be used for all 128-bit integer types (bool/signed/unsigned).</p>
<p>GCC supports the __ibm128 and _Decimal128 floating point types, which are held in Floating-point Register pairs. These are distinct types from vector double and are oriented differently in the VSX register file. But the doubleword halves can be moved between types using the VSX permute doubleword immediate instruction (xxpermdi). This is useful for type conversions and for implementing some vector BCD operations.</p>
<p>GCC recently added the __float128 floating point type, which is held in a single vector register. The compiler considers this a floating-point scalar that is not cast compatible with any vector type. To access the __float128 value as a vector it must be passed through a union.</p>
<dl class="section note"><dt>Note</dt><dd>The implementation will need to provide transfer functions between vectors and other 128-bit types.</dd></dl>
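<p>On POWER a union is the usual vehicle for such transfer functions. The sketch below is a hypothetical helper (not a PVECLIB API); it demonstrates the same union technique with GCC's unsigned __int128 in place of a vector type, so it is portable to any 64-bit GCC target:</p>

```c
#include <stdint.h>

/* Hypothetical transfer union: on POWER, the same pattern can
   reinterpret a __float128 (or vui128_t) as another 128-bit type;
   after inlining, the compiler can reduce this to a register move.
   Shown here with GCC's unsigned __int128 for portability. */
typedef union
{
  unsigned __int128 i128;
  uint64_t dw[2];   /* two 64-bit doublewords */
} xfer_u;

/* Extract the high 64 bits of a 128-bit value via the union. */
static inline uint64_t
high_dword (unsigned __int128 x)
{
  xfer_u t;
  t.i128 = x;
  /* On little-endian targets dw[1] is the high doubleword. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  return t.dw[1];
#else
  return t.dw[0];
#endif
}
```

After inlining, GCC typically optimizes the union store/reload away, which is why PVECLIB can afford this idiom in transfer functions.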
<p>GCC defines Generic Vector Extensions that allow typedefs for vectors of various element sizes/types and generic SIMD (arithmetic, logical, and element indexing) operations. For PowerPC64 ABIs this is currently restricted to 16-byte vectors as defined in <altivec.h>. For currently available compilers attempts to define vector types with larger (32 or 64 byte) <em>vector_size</em> values are treated as arrays of scalar elements. Only vector_size(16) variables are passed and returned in vector registers.</p>
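<p>A minimal sketch of the generic vector extension described above; the names (v4si, v4si_add) are illustrative and not PVECLIB operations:</p>

```c
/* GCC's generic vector extension: a 16-byte vector of four 32-bit
   ints supports element indexing and SIMD arithmetic on any target.
   Only vector_size(16) variables map to a single vector register
   under the PowerPC64 ABIs. */
typedef int v4si __attribute__ ((vector_size (16)));

static inline v4si
v4si_add (v4si a, v4si b)
{
  return a + b;   /* element-wise add, like the generic vec_add */
}
```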
<p>The OpenPOWER 64-Bit ELF V2 ABI Specification makes specific provisions for passing/returning <em>homogeneous aggregates</em> of multiple like (scalar/vector) data types. Such aggregates can be passed/returned in up to eight floating-point or vector registers. A parameter list may include multiple <em>homogeneous aggregates</em> with up to a total of twelve parameter registers.</p>
<p>This is defined for the Little Endian ELF V2 ABI and is not applicable to Big Endian ELF V1 targets. Also, GCC versions before GCC 8 do not fully implement this ABI feature and revert to the old ABI structure passing (passing through storage).</p>
<p>Passing large <em>homogeneous aggregates</em> becomes the preferred solution as PVECLIB starts to address wider (256 and 512-bit) vector operations. For example the ABI allows passing up to 3 512-bit parameters and returning a 1024-bit result in vector registers (as in <a class="el" href="vec__int512__ppc_8h.html#a666241f67c39d7fae639235edfb8c3b5" title="Vector 512-bit Unsigned Integer Multiply-Add. ">vec_madd512x512a512_inline()</a>). For large multi-quadword precision operations the only practical solution uses reference parameters to arrays or structs in storage (as in <a class="el" href="vec__int512__ppc_8h.html#a8287aa4483acb25ac3188a97cc23b89a" title="Vector 2048x2048-bit Unsigned Integer Multiply. ">vec_mul2048x2048()</a>). See <a class="el" href="vec__int512__ppc_8h.html" title="Header package containing a collection of multiple precision quadword integer computation functions i...">vec_int512_ppc.h</a> for more examples.</p>
<p>So we have shown that there are mechanisms for functions to return multiple vector register values.</p>
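<p>A scalar model of such a homogeneous aggregate return, using GCC's unsigned __int128 in place of vector registers. The struct and function names are illustrative, loosely following PVECLIB's __VEC_U_256 and vec_mul128x128_inline():</p>

```c
#include <stdint.h>

typedef unsigned __int128 ui128;

/* Scalar model of a 256-bit homogeneous aggregate; under the ELF V2
   ABI a struct of two vectors is returned in two vector registers. */
typedef struct { ui128 vx0; ui128 vx1; } u256_t;  /* vx0 = low 128 */

/* Full 128x128 -> 256-bit unsigned multiply via 64-bit partial
   products, the same decomposition a vector implementation uses
   with multiply even/odd doubleword operations. */
static inline u256_t
mul128x128 (ui128 a, ui128 b)
{
  uint64_t a0 = (uint64_t) a, a1 = (uint64_t) (a >> 64);
  uint64_t b0 = (uint64_t) b, b1 = (uint64_t) (b >> 64);
  ui128 p00 = (ui128) a0 * b0;
  ui128 p01 = (ui128) a0 * b1;
  ui128 p10 = (ui128) a1 * b0;
  ui128 p11 = (ui128) a1 * b1;
  /* Middle column: high half of p00 plus low halves of the cross
     products; its high half is the carry into the upper quadword. */
  ui128 mid = (p00 >> 64) + (uint64_t) p01 + (uint64_t) p10;
  u256_t r;
  r.vx0 = (mid << 64) | (uint64_t) p00;
  r.vx1 = p11 + (p01 >> 64) + (p10 >> 64) + (mid >> 64);
  return r;
}
```

Because the result is returned by value, a conforming ELF V2 compiler can keep both halves in registers with no storage round trip.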
<h4><a class="anchor" id="mainpage_sub_1_3_3"></a>
Subsetting the problem.</h4>
<p>We can simplify this problem by remembering that:</p><ul>
<li>Only a subset of the pveclib functions need to return more than one 128-bit vector.</li>
<li>The PowerISA normally splits these cases into multiple instructions anyway.</li>
<li>Most of these functions are small and fully inlined.</li>
<li>The exception will be the multiple quadword precision arithmetic operations.</li>
</ul>
<p>So we have several options, given the current state of GCC compilers in common use:</p><ul>
<li>Mimic the PowerISA and split the operation into two functions, where each function only returns (up to) 128-bits of the result.</li>
<li>Use pointer parameters to return a second vector value in addition to the function return.</li>
<li>Support both options above and let the user decide which works best.</li>
<li>With the availability of GCC 8/9 compilers, pass/return 256, 512 and 1024-bit vectors as <em>homogeneous aggregates</em>.</li>
</ul>
<p>The add/subtract quadword operations provide good examples. For example adding two 256-bit unsigned integer values and returning the 257-bit (the high / low sum and the carry) result looks like this:</p><div class="fragment"><div class="line">s1 = vec_vadduqm (a1, b1); <span class="comment">// sum low 128-bits a1+b1</span></div><div class="line">c1 = vec_vaddcuq (a1, b1); <span class="comment">// write-carry from low a1+b1</span></div><div class="line">s0 = vec_vaddeuqm (a0, b0, c1); <span class="comment">// Add-extend high 128-bits a0+b0+c1</span></div><div class="line">c0 = vec_vaddecuq (a0, b0, c1); <span class="comment">// write-carry from high a0+b0+c1</span></div></div><!-- fragment --><p> This sequence uses the built-ins from <altivec.h> and generates instructions that will execute on POWER8 and POWER9. The compiler must target POWER8 (-mcpu=power8) or higher. In fact the compilation will fail if the target is POWER7.</p>
<p>Now let's look at the pveclib version of these operations from <<a class="el" href="vec__int128__ppc_8h.html" title="Header package containing a collection of 128-bit computation functions implemented with PowerISA VMX...">vec_int128_ppc.h</a>>:</p><div class="fragment"><div class="line">s1 = <a class="code" href="vec__int128__ppc_8h.html#a539de2a4426a84102471306acc571ce8">vec_adduqm</a> (a1, b1); <span class="comment">// sum low 128-bits a1+b1</span></div><div class="line">c1 = <a class="code" href="vec__int128__ppc_8h.html#ad7aaadba249ce46c4c94f78df1020da3">vec_addcuq</a> (a1, b1); <span class="comment">// write-carry from low a1+b1</span></div><div class="line">s0 = <a class="code" href="vec__int128__ppc_8h.html#a44e63f70b182d60fe03b43a80647451a">vec_addeuqm</a> (a0, b0, c1); <span class="comment">// Add-extend high 128-bits a0+b0+c1</span></div><div class="line">c0 = <a class="code" href="vec__int128__ppc_8h.html#af18b98d2d73f1afbc439e1407c78f305">vec_addecuq</a> (a0, b0, c1); <span class="comment">// write-carry from high a0+b0+c1</span></div></div><!-- fragment --><p> Looks almost the same but the operations do not use the 'v' prefix on the operation name. This sequence generates the same instructions for (-mcpu=power8) as the <altivec.h> version above. It will also generate a different (slightly longer) instruction sequence for (-mcpu=power7) which is functionally equivalent.</p>
<p>The pveclib <<a class="el" href="vec__int128__ppc_8h.html" title="Header package containing a collection of 128-bit computation functions implemented with PowerISA VMX...">vec_int128_ppc.h</a>> header also provides a coding style alternative:</p><div class="fragment"><div class="line">s1 = <a class="code" href="vec__int128__ppc_8h.html#a363fa7103ccd730c47bb34cb9f05e80b">vec_addcq</a> (&c1, a1, b1);</div><div class="line">s0 = <a class="code" href="vec__int128__ppc_8h.html#a9e27910c148d525e17d099688aec9ba1">vec_addeq</a> (&c0, a0, b0, c1);</div></div><!-- fragment --><p> Here vec_addcq combines the adduqm/addcuq operations into a <em>add and carry quadword</em> operation. The first parameter is a pointer to vector to receive the carry-out while the 128-bit modulo sum is the function return value. Similarly vec_addeq combines the addeuqm/addecuq operations into a <em>add with extend and carry quadword</em> operation.</p>
<p>As these functions are inlined by the compiler the implied store / reload of the carry can be converted into a simple register assignment. For (-mcpu=power8) the compiler should generate the same instruction sequence as the two previous examples.</p>
<p>For (-mcpu=power7) these functions will expand into a different (slightly longer) instruction sequence which is functionally equivalent to the instruction sequence generated for (-mcpu=power8).</p>
<p>For older processors (POWER7 and earlier), and under some circumstances, the instructions generated for this "combined form" may perform better than the "split form" equivalent from the second example. Here the compiler may not recognize all the common subexpressions, as the "split forms" are expanded before optimization.</p>
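<p>The semantics of the combined forms can be modeled in portable scalar code. This sketch (illustrative names, with GCC's unsigned __int128 standing in for vector quadwords) shows the carry and carry-with-extend logic that vec_addcq / vec_addeq implement with vector instructions:</p>

```c
#include <stdint.h>

typedef unsigned __int128 ui128;

/* Scalar model of vec_addcq: return the 128-bit modulo sum and
   store the write-carry (0 or 1) through the pointer parameter. */
static inline ui128
addcq (ui128 *cout, ui128 a, ui128 b)
{
  ui128 s = a + b;
  *cout = (s < a);          /* carry out of the quadword add */
  return s;
}

/* Scalar model of vec_addeq: add with carry-in, returning the sum
   and storing the carry-out. */
static inline ui128
addeq (ui128 *cout, ui128 a, ui128 b, ui128 cin)
{
  ui128 s = a + b + (cin & 1);
  /* Carry out iff the true sum reached 2**128: either the wrapped
     sum is below a, or it equals a with a carry-in consumed. */
  *cout = (s < a) | ((s == a) & (cin & 1));
  return s;
}
```

A 256-bit add is then two calls: `s1 = addcq (&c1, a1, b1); s0 = addeq (&c0, a0, b0, c1);`, mirroring the PVECLIB sequence above.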
<h1><a class="anchor" id="mainpage_sub2"></a>
Background on the evolution of <altivec.h></h1>
<p>The original <a href="https://www.nxp.com/docs/en/reference-manual/ALTIVECPIM.pdf">AltiVec (TM) Technology Programming Interface Manual</a> defined the minimal vector extensions to the application binary interface (ABI), new keywords (vector, pixel, bool) for defining new vector types, and new operators (built-in functions).</p>
<ul>
<li>generic AltiVec operations, like vec_add()</li>
<li>specific AltiVec operations (instructions, like vec_addubm())</li>
<li>predicates computed from an AltiVec operation, like vec_all_eq()</li>
</ul>
<p>A generic operation generates specific instructions based on the types of the actual parameters. So a generic vec_add operation, with vector char parameters, will generate the (specific) vector add unsigned byte modulo (vaddubm) instruction. Predicates are used within if statement conditional clauses to access the condition code from vector operations that set Condition Register 6 (vector SIMD compares and Decimal Integer arithmetic and format conversions).</p>
<p>The PIM defined a set of compiler built-ins for vector instructions (see section "4.4 Generic and Specific AltiVec Operations") that compilers should support. The document suggests that any required typedefs and supporting macro definitions be collected into an include file named <altivec.h>.</p>
<p>The built-ins defined by the PIM closely match the vector instructions of the underlying PowerISA. For example: vec_mul, vec_mule / vec_mulo, and vec_muleub / vec_muloub.</p><ul>
<li>vec_mul is defined for float and double and will (usually) generate a single instruction for the type. This is a simpler case as floating point operations usually stay in their lanes (result elements are the same size as the input operand elements).</li>
<li>vec_mule / vec_mulo (multiply even / odd) are defined for integer multiply as integer products require twice as many bits as the inputs (the results don't stay in their lane).</li>
</ul>
<p>The RISC philosophy resists, and the POWER Architecture avoids, instructions that write to more than one register. So the hardware and PowerISA vector integer multiplies generate even and odd product results (from even and odd input elements) via two instructions executing separately. The PIM defines these operations as overloaded built-ins, and the compiler selects the specific instructions based on the operand (char or short) type.</p>
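<p>The even/odd split can be modeled in portable scalar code. The helpers below are illustrative (not PVECLIB operations); they number the 16 byte elements 0 through 15 and produce the eight 16-bit products from the even or odd element positions:</p>

```c
#include <stdint.h>

/* Scalar model of vector multiply even unsigned byte: the byte
   elements at positions 0,2,...,14 of each input yield eight
   16-bit products (double-width results don't stay in their lane,
   so one instruction can only cover half the elements). */
static void
mul_even_ub (uint16_t prod[8], const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    prod[i] = (uint16_t) a[2 * i] * b[2 * i];
}

/* Scalar model of vector multiply odd unsigned byte: positions
   1,3,...,15.  Together the two operations cover all 16 lanes,
   which is how the PowerISA avoids a single instruction writing
   two registers of results. */
static void
mul_odd_ub (uint16_t prod[8], const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    prod[i] = (uint16_t) a[2 * i + 1] * b[2 * i + 1];
}
```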
<p>As the PowerISA evolved adding new vector (VMX) instructions, new facilities (Vector Scalar Extended (VSX)), and specialized vector categories (little endian, AES, SHA2, RAID), some of these new operators were added to <altivec.h>. This included some new specific and generic operations and additional vector element types (long (64-bit) int, __int128, double and quad precision (__Float128) float). This support was <em>staged</em> across multiple compiler releases in response to perceived need and stake-holder requests.</p>
<p>The result was a patchwork of <altivec.h> built-ins support versus new instructions in the PowerISA and shipped hardware. The original AltiVec (VMX) provided Vector Multiply (Even / Odd) operations for byte (char) and halfword (short) integers. Vector Multiply Even / Odd Word (int) instructions were not introduced until PowerISA V2.07 (POWER8) under the generic built-ins vec_mule, vec_mulo. PowerISA 2.07 also introduced Vector Multiply Word Modulo under the generic built-in vec_mul. Both were first available in GCC 8. Specific built-in forms (vec_vmuleuw, vec_vmulouw, vec_vmuluwm) were not provided. PowerISA V3.0 (POWER9) added Multiply-Sum Unsigned Doubleword Modulo but neither generic (vec_msum) nor specific (vec_msumudm) forms have been provided (so far, as of GCC 9).</p>
<p>However the original PIM documents were primarily focused on embedded processors and were not updated to include the vector extensions implemented by the server processors. So any documentation for new vector operations was relegated to the various compilers. This was a haphazard process, and some divergence in operation naming did occur between compilers.</p>
<p>In the run up to the POWER8 launch and the OpenPOWER initiative it was recognized that switching to Little Endian would require a new and well-documented Application Binary Interface (<b>ABI</b>). It was also recognized that new <altivec.h> extensions needed to be documented in a common place so the various compilers could implement a common vector built-in API. So ...</p>
<h2><a class="anchor" id="mainpage_sub2_1"></a>
The ABI is evolving</h2>
<p>The <a href="https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture">OpenPOWER ELF V2 application binary interface (ABI)</a>: Chapter 6. <b>Vector Programming Interfaces</b> and <b>Appendix A. Predefined Functions for Vector Programming</b> document the current and proposed vector built-ins we expect all C/C++ compilers to implement for the PowerISA.</p>
<p>The ABI defines generic operations as overloaded built-in functions. Here the ABI suggests a specific PowerISA implementation based on the operand (vector element) types. The ABI also defines the (big/little) endian behavior, and may suggest different instructions based on the endianness of the target.</p>
<p>This is an important point as the vector element numbering changes between big and little endian, and so does the meaning of even and odd. Both affect what the compiler supports and the instruction sequence generated.</p><ul>
<li><b>vec_mule</b> and <b>vec_mulo</b> (multiply even / odd) are examples of generic built-ins defined by the ABI. One would assume these built-ins will generate the matching instruction based only on the input vector type; however, the GCC compiler will adjust the generated instruction based on the target endianness (reversing even / odd for little endian).</li>
<li>Similarly for the merge (even/odd high/low) operations. For little endian the compiler reverses even/odd (high/low) and swaps operands as well.</li>
<li>See <b>Table 6.1. Endian-Sensitive Operations</b> for details.</li>
</ul>
<p>The many existing specific built-ins (where the name includes explicit type and signed/unsigned notation) are included in the ABI but listed as deprecated. Specifically the Appendix <b>A.6. Deprecated Compatibility Functions</b> and <b>Table A.8. Functions Provided for Compatibility</b>.</p>
<p>This reflects an explicit decision by the ABI and compiler maintainers that a generic-only interface would be smaller and easier to implement and document as the PowerISA evolves.</p>
<p>Certainly the addition of VSX to POWER7 and the many vector extensions added to POWER8 and POWER9 added hundreds of vector instructions. Many of these new instructions needed built-ins to:</p><ul>
<li>Enable early library exploitations. For example new floating point element sizes (double and Float128).</li>
<li>Support specialized operations not generally supported in the language. For example detecting Not-a-Number and Infinities without triggering exceptions. These are needed in the POSIX library implementation.</li>
<li>Supporting wider integer element sizes can result in large multiples of specific built-ins if you include variants for:<ul>
<li>signed and unsigned</li>
<li>saturated</li>
<li>even, odd, modulo, write-carry, and extend</li>
<li>high and low</li>
<li>and additional associated merge, pack, unpack, splat, operations</li>
</ul>
</li>
</ul>
<p>So implementing new instructions as generic built-ins first, and delaying the specific built-in permutations, is a welcome simplification. What began as a tactical choice quickly became the strategy and the plan. Dropping the specific built-ins for new instructions and deprecating the existing specific built-ins saves a lot of work.</p>
<p>As the ABI places more emphasis on generic built-in operations, we are seeing more cases where the compiler generates multiple instruction sequences. The first example was vec_abs (vector absolute value) from the original Altivec PIM. There was no vector absolute instruction for any of the supported types (including vector float at the time). But this could be implemented in a 3 instruction sequence. This generic operation was extended to vector double for VSX (PowerISA 2.06) which introduced hardware instructions for absolute value of single and double precision vectors. But vec_abs remains a multiple instruction sequence for integer elements.</p>
<p>Another example is vec_mul. POWER8 (PowerISA 2.07) introduced Vector Multiply Unsigned Word Modulo (vmuluwm). This was included in the ISA as it simplified vectorizing C language (int) loops. This also allowed a single instruction implementation for vec_mul for vector (signed/unsigned) int. The PowerISA does not provide direct vector multiply modulo instructions for char, short, or long. Again this requires a multiple-instruction sequence to implement.</p>
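<p>The modulo (truncating) multiply semantics can be sketched with GCC's generic vector extension, which is portable across targets; mul_modulo_w is an illustrative name, not a PVECLIB operation:</p>

```c
#include <stdint.h>

/* Four 32-bit unsigned elements in a 16-byte vector. */
typedef uint32_t v4ui __attribute__ ((vector_size (16)));

/* Element-wise multiply keeping only the low 32 bits of each
   product: the semantics vmuluwm provides in one instruction on
   POWER8, and what vec_mul must expand into a multi-instruction
   sequence for char, short, or long elements. */
static inline v4ui
mul_modulo_w (v4ui a, v4ui b)
{
  return a * b;   /* each lane truncated modulo 2**32 */
}
```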
<h2><a class="anchor" id="mainpage_sub2_2"></a>
The current <altivec.h> is a mixture</h2>
<p>The current vector ABI implementation in the compiler and <altivec.h> is mixture of old and new.</p><ul>
<li>Many new instructions (since PowerISA 2.06) are supported only under existing built-ins (with new element types: vec_mul, vec_mule, vec_mulo), or as newly defined generic built-ins (vec_eqv, vec_nand, vec_orc).<ul>
<li>Specific types/element sizes under these generic built-ins may be marked <em>phased in</em>.</li>
</ul>
</li>
<li>Some new instructions are supported with both generic (vec_popcnt) and specific built-ins (vec_vpopcntb, vec_vpopcntd, vec_vpopcnth, vec_vpopcntw).</li>
<li>Other new instructions are only supported with specific built-ins (vec_vaddcuq, vec_vaddecuq, vec_vaddeuqm, vec_vsubcuq, vec_vsubecuq, vec_vsubeuqm). To be fair only the quadword element supports the write-carry and extend variants.</li>
<li>Endian sensitivity may be applied in surprising ways.<ul>
<li><b>vec_muleub</b> and <b>vec_muloub</b> (multiply even / odd unsigned byte) are examples of non-overloaded built-ins provided by the GCC compiler but not defined in the ABI. One would assume these built-ins will generate the matching instruction, however the GCC compiler will adjust the generated instruction based on the target endianness (even / odd is reversed for little endian).</li>
<li><b>vec_sld</b>, <b>vec_sldw</b>, <b>vec_sll</b>, and <b>vec_slo</b> (vector shift left) are <b>not</b> endian sensitive. Historically, these built-ins are often used to shift by amounts not a multiple of the element size, across types.</li>
</ul>
</li>
<li>A number of built-ins are defined in the ABI and marked (all or in part) as <em>phased in</em>. This implies that compilers <b>shall</b> implement these built-ins (eventually) in <altivec.h>. However the specific compiler version you are using may not have implemented them yet.</li>
</ul>
<h2><a class="anchor" id="mainpage_sub2_3"></a>
Best practices</h2>
<p>This is a small sample of the complexity we encounter programming at this low level (vector intrinsic) API. This is also an opportunity for a project like the Power Vector Library (PVECLIB) to smooth off the rough edges and simplify software development for the OpenPOWER ecosystem.</p>
<p>If the generic vector built-in operation you need:</p><ul>
<li>is defined in the ABI, and</li>
<li>defined in the PowerISA across the processor versions you need to support, and</li>
<li>defined in <altivec.h> for the compilers and compiler versions you expect to use, and</li>
<li>implemented for the vector types/element sizes you need for the compilers and compiler versions you expect to use.</li>
</ul>
<p>Then use the generic vector built-in from <altivec.h> in your application/library.</p>
<p>Otherwise if the specific vector built-in operation you need is defined in <altivec.h>:</p><ul>
<li>For the vector types/element sizes you need, and</li>
<li>defined in the PowerISA across the processor versions you need to support, and</li>
<li>implemented for the compilers and compiler versions you expect to use.</li>
</ul>
<p>Then use the specific vector built-in from <altivec.h> in your application/library.</p>
<p>Otherwise if the vector operation you need is defined in PVECLIB.</p><ul>
<li>For the vector types/element sizes you need.</li>
</ul>
<p>Then use the vector operation from PVECLIB in your application/library.</p>
<p>Otherwise</p><ul>
<li>Check on <a href="/~https://github.com/open-power-sdk/pveclib">/~https://github.com/open-power-sdk/pveclib</a> and see if there is newer version of PVECLIB.</li>
<li>Open an issue on <a href="/~https://github.com/open-power-sdk/pveclib/issues">/~https://github.com/open-power-sdk/pveclib/issues</a> for the operation you would like to see.</li>
<li>Look at source for PVECLIB for examples similar to what you are trying to do.</li>
</ul>
<h1><a class="anchor" id="main_libary_issues_0_0"></a>
Putting the Library into PVECLIB</h1>
<p>Until recently (as of v1.0.3) PVECLIB operations were <b>static inline</b> only. This was reasonable as most operations were small (one to a few vector instructions). This offered the compiler opportunity for:</p><ul>
<li>Better register allocation.</li>
<li>Identifying common subexpressions and factoring them across operation instances.</li>
<li>Better instruction scheduling across operations.</li>
</ul>
<p>Even then, a few operations (quadword multiply, BCD multiply, BCD <-> binary conversions, and some POWER8/7 implementations of POWER9 instructions) were getting uncomfortably large (10s of instructions). But it was the multiple quadword precision operations that forced the issue as they can run to 100s and sometimes 1000s of instructions. So, we need to build some functions from pveclib into a static archive and/or a dynamic library (DSO).</p>
<h2><a class="anchor" id="main_libary_issues_0_0_0"></a>
Building Multi-target Libraries</h2>
<p>Building libraries of compiled binaries is not that difficult. The challenge is effectively supporting multiple processor (POWER7/8/9) targets, as many PVECLIB operations have different implementations for each target. This is especially evident on the multiply integer word, doubleword, and quadword operations (see; <a class="el" href="vec__int128__ppc_8h.html#aee5c5b2998ef105b4c6f39739748ffa8" title="Vector Multiply Unsigned Double Quadword. ">vec_muludq()</a>, <a class="el" href="vec__int128__ppc_8h.html#ad6be9c8f02e43c39a659d6bbc9c3a2d2" title="Vector Multiply High Unsigned Quadword. ">vec_mulhuq()</a>, <a class="el" href="vec__int128__ppc_8h.html#a9aaaf0e4c2705be1e0e8e925b09c52de" title="Vector Multiply Low Unsigned Quadword. ">vec_mulluq()</a>, <a class="el" href="vec__int128__ppc_8h.html#a84e6361054b52ac4564bcef25b718151" title="Vector Multiply Even Unsigned Doublewords. ">vec_vmuleud()</a>, <a class="el" href="vec__int128__ppc_8h.html#a208744996e7482604ad274b44999d6ce" title="Vector Multiply Odd Unsigned Doublewords. ">vec_vmuloud()</a>, <a class="el" href="vec__int128__ppc_8h.html#a1d183ebd232e5826be109cdaa421aeed" title="Vector Multiply-Sum Unsigned Doubleword Modulo. ">vec_msumudm()</a>, <a class="el" href="vec__int32__ppc_8h.html#ac93f07d5ad73243db2771da83b50d6d8" title="Vector multiply even unsigned words. ">vec_muleuw()</a>, <a class="el" href="vec__int32__ppc_8h.html#a3ca45c65b9627abfc493d4ad500a961d" title="Vector multiply odd unsigned words. ">vec_mulouw()</a>).</p>
<p>This is dictated by both changes in the PowerISA and in the micro-architecture as it evolved across processor generations. So an implementation to run on a POWER7 is necessarily restricted to the instructions of PowerISA 2.06. But if we are running on a POWER9, leveraging new instructions from PowerISA 3.0 can yield better performance than the POWER7 compatible implementation. When we are dealing with larger operations (10s and 100s of instructions) the compiler can schedule instruction sequences based on the platform (-mtune=) for better performance.</p>
<p>So, we need to deliver multiple implementations for some operations and we need to provide mechanisms to select a specific target implementation statically at compile/build or dynamically at runtime. First we need to compile multiple version of these operations, as unique functions, each with a different effective compile target (-mcpu= options).</p>
<p>Obviously, creating multiple source files implementing the same large operation, each supporting a different specific target platform, is a possibility. However, this could cause maintenance problems where changes to an operation must be coordinated across multiple source files. This is also inconsistent with the current PVECLIB coding style, where a file contains an operation's complete implementation, including documentation and target specific implementation variants.</p>
<p>The current PVECLIB implementation makes extensive use of C Preprocessor (<b>CPP</b>) conditional source code. These conditionals test for compiler version, target endianness, and current target processor, then select the appropriate source code snippet (<a class="el" href="index.html#mainpage_sub_1_2">So what can the Power Vector Library project do?</a>). This was intended to simplify the application/library developer's life, as they could use the PVECLIB API without worrying about these details.</p>
<p>So far, this works as intended (single vector source for multiple PowerISA VMX/VSX targets) when the entire application is compiled for a single target. However, this dependence on CPP conditionals is a mixed blessing when the application needs to support multiple platforms in a single package.</p>
<h3><a class="anchor" id="main_libary_issues_0_0_0_0"></a>
The mechanisms available</h3>
<p>The compiler and ABI offer options that at first glance seem to allow multiple target specific binaries from a single source. Besides the compiler's command-line target options, there are a number of source-level mechanisms to change the target. These include:</p><ul>
<li>__ attribute __ (target ("cpu=power8"))</li>
<li>__ attribute __ (target_clones ("cpu=power9,default"))</li>
<li>#pragma GCC target ("cpu=power8")</li>
<li>multiple compiles with different command line options (i.e. -mcpu=)</li>
</ul>
<p>The target and target_clones attributes are function attributes (they apply to a single function). The target attribute overrides the command line -mcpu= option. However it is not clear which version of GCC added explicit support for target ("cpu="); this was not explicitly documented until GCC 5. The target_clones attribute will cause GCC to create two (or more) function clones, one (or more) compiled with the specified cpu= target and another with the default (or command line -mcpu=) target. It also creates a resolver function that dynamically selects a clone implementation suitable for the current platform architecture. This PowerPC specific variant was not explicitly documented until GCC 8.</p>
<p>There are a few issues with function attributes:</p><ul>
<li>The Doxygen preprocessor cannot parse function attributes without a lot of intervention.</li>
<li>The availability of these attributes seems to be limited to the latest GCC compilers.</li>
</ul>
<dl class="section note"><dt>Note</dt><dd>The Clang/LLVM compilers don't provide equivalents to attribute (target) or #pragma target.</dd></dl>
<p>But there is a deeper problem related to the usage of CPP conditionals. Many PVECLIB operation implementations depend on GCC/compiler predefined macros including:</p><ul>
<li>__ GNUC __</li>
<li>__ GNUC_MINOR __</li>
<li>__ BYTE_ORDER __</li>
<li>__ ORDER_LITTLE_ENDIAN __</li>
<li>__ ORDER_BIG_ENDIAN __</li>
</ul>
<p>PVECLIB also depends on many system-specific predefined macros including:</p><ul>
<li>__ ALTIVEC __</li>
<li>__ VSX __</li>
<li>__ FLOAT128 __</li>
<li>_ARCH_PWR9</li>
<li>_ARCH_PWR8</li>
<li>_ARCH_PWR7</li>
</ul>
<p>PVECLIB also depends on the <altivec.h> include file, which provides the mapping between the ABI defined intrinsics and compiler defined built-ins. In some places PVECLIB conditionally tests if a specific built-in is defined and substitutes an in-line assembler implementation if not. The altivec.h header also depends on system-specific predefined macros to enable/disable blocks of intrinsic built-ins based on the PowerISA level of the compile target.</p>
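<p>As an illustration of this CPP-conditional style, the portable sketch below (hypothetical helpers, not PVECLIB code) selects on the predefined __BYTE_ORDER__ macro at compile time and cross-checks the result at runtime:</p>

```c
#include <stdint.h>

/* Detect endianness at runtime by inspecting the first byte of a
   known 32-bit pattern. */
static inline int
runtime_is_little_endian (void)
{
  union { uint32_t w; uint8_t b[4]; } u = { .w = 0x01020304u };
  return u.b[0] == 0x04;
}

/* Detect endianness at compile time, the way PVECLIB's CPP
   conditionals do; the preprocessor resolves this once per source
   file, before any function attributes are seen. */
static inline int
cpp_is_little_endian (void)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  return 1;
#else
  return 0;
#endif
}
```

Because the macros are expanded once per translation unit, both helpers must agree; that single expansion is exactly what breaks the per-function target attributes discussed next.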
<h3><a class="anchor" id="main_libary_issues_0_0_0_1"></a>
Some things just do not work</h3>
<p>The issue is that the compiler (GCC at least) only expands the compiler and system-specific predefined macros once per source file. The preprocessed source does not change due to embedded function attributes that change the target. So the following does not work as expected.</p>
<div class="fragment"><div class="line"><span class="preprocessor">#include <altivec.h></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int128__ppc_8h.html">pveclib/vec_int128_ppc.h</a>></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int512__ppc_8h.html">pveclib/vec_int512_ppc.h</a>></span></div><div class="line"></div><div class="line"><span class="comment">// Defined in vec_int512_ppc.h but included here for clarity.</span></div><div class="line"><span class="keyword">static</span> <span class="keyword">inline</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> a, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> b)</div><div class="line">{</div><div class="line"> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a> result;</div><div class="line"> <span class="comment">// vec_muludq is defined in vec_int128_ppc.h</span></div><div class="line"> result.vx0 = <a class="code" href="vec__int128__ppc_8h.html#aee5c5b2998ef105b4c6f39739748ffa8">vec_muludq</a> (&result.vx1, a, b);</div><div class="line"> <span class="keywordflow">return</span> result;</div><div class="line">}</div><div class="line"></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a> __attribute__(target (<span class="stringliteral">"cpu=power7"</span>))</div><div class="line">vec_mul128x128_PWR7 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span 
class="keywordflow">return</span> <a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div><div class="line"></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a> __attribute__(target (<span class="stringliteral">"cpu=power8"</span>))</div><div class="line">vec_mul128x128_PWR8 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span class="keywordflow">return</span> <a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div><div class="line"></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a> __attribute__(target (<span class="stringliteral">"cpu=power9"</span>))</div><div class="line">vec_mul128x128_PWR9 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span class="keywordflow">return</span> <a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div></div><!-- fragment --><p>For example if we assume that the compiler default is (or the command line specifies) -mcpu=power8 the compiler will use this to generate the system-specific predefined macros. This is done before the first include file is processed. 
In this case <altivec.h>, <a class="el" href="vec__int128__ppc_8h.html" title="Header package containing a collection of 128-bit computation functions implemented with PowerISA VMX...">vec_int128_ppc.h</a>, and <a class="el" href="vec__int512__ppc_8h.html" title="Header package containing a collection of multiple precision quadword integer computation functions i...">vec_int512_ppc.h</a> source will be expanded for power8 (PowerISA-2.07). The result is that the vec_muludq and vec_mul128x128_inline source implementations will be the power8 specific versions.</p>
<p>This will all be established before the compiler starts to parse and generate code for vec_mul128x128_PWR7. This compile is likely to fail because we are trying to compile code containing power8 instructions for a -mcpu=power7 target.</p>
<p>The compilation of vec_mul128x128_PWR8 should work as we are compiling power8 code with a -mcpu=power8 target. vec_mul128x128_PWR9 will compile without error but will generate essentially the same code as vec_mul128x128_PWR8. The target("cpu=power9") attribute allows the compiler to use power9 instructions, but the expanded source code from vec_muludq and vec_mul128x128_inline will not contain any power9 intrinsic built-ins.</p>
<dl class="section note"><dt>Note</dt><dd>The GCC attribute <b>target_clones</b> has the same issue.</dd></dl>
<p>Pragma GCC target has a similar issue if you try to change the target multiple times within the same source file.</p>
<div class="fragment"><div class="line"><span class="preprocessor">#include <altivec.h></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int128__ppc_8h.html">pveclib/vec_int128_ppc.h</a>></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int512__ppc_8h.html">pveclib/vec_int512_ppc.h</a>></span></div><div class="line"></div><div class="line"><span class="comment">// Defined in vec_int512_ppc.h but included here for clarity.</span></div><div class="line"><span class="keyword">static</span> <span class="keyword">inline</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> a, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> b)</div><div class="line">{</div><div class="line"> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a> result;</div><div class="line"> <span class="comment">// vec_muludq is defined in vec_int128_ppc.h</span></div><div class="line"> result.vx0 = <a class="code" href="vec__int128__ppc_8h.html#aee5c5b2998ef105b4c6f39739748ffa8">vec_muludq</a> (&result.vx1, a, b);</div><div class="line"> <span class="keywordflow">return</span> result;</div><div class="line">}</div><div class="line"></div><div class="line"><span class="preprocessor">#pragma GCC push_options</span></div><div class="line"><span class="preprocessor">#pragma GCC target ("cpu=power7")</span></div><div class="line"></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">vec_mul128x128_PWR7 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" 
href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span class="keywordflow">return</span> <a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div><div class="line"></div><div class="line"><span class="preprocessor">#pragma GCC pop_options</span></div><div class="line"><span class="preprocessor">#pragma GCC push_options</span></div><div class="line"><span class="preprocessor">#pragma GCC target ("cpu=power8")</span></div><div class="line"></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">vec_mul128x128_PWR8 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span class="keywordflow">return</span> <a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div><div class="line"></div><div class="line"><span class="preprocessor">#pragma GCC pop_options</span></div><div class="line"><span class="preprocessor">#pragma GCC push_options</span></div><div class="line"><span class="preprocessor">#pragma GCC target ("cpu=power9")</span></div><div class="line"></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">vec_mul128x128_PWR9 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span class="keywordflow">return</span> <a class="code" 
href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div></div><!-- fragment --><p> This has the same issues as the target attribute example above. However, you can use #pragma GCC target if:</p><ul>
<li>it precedes the first #include in the source file.</li>
<li>there is only one target #pragma in the file.</li>
</ul>
<p>For example: </p><div class="fragment"><div class="line"><span class="preprocessor">#pragma GCC target ("cpu=power9")</span></div><div class="line"><span class="preprocessor">#include <altivec.h></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int128__ppc_8h.html">pveclib/vec_int128_ppc.h</a>></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int512__ppc_8h.html">pveclib/vec_int512_ppc.h</a>></span></div><div class="line"></div><div class="line"><span class="comment">// vec_mul128x128_inline is defined in vec_int512_ppc.h</span></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">vec_mul128x128_PWR9 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span class="keywordflow">return</span> <a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div></div><!-- fragment --><p> In this case the cpu=power9 option is applied before the compiler reads the first include file and initializes the system-specific predefined macros. So the CPP source expansion reflects the power9 target.</p>
<dl class="section note"><dt>Note</dt><dd>So far the techniques described only work reliably for C/C++ code, compiled with GCC, that does not use <altivec.h> intrinsics or CPP conditionals.</dd></dl>
<p>The implication is we need a build system that allows source files to be compiled multiple times, each with different compile targets.</p>
<h3><a class="anchor" id="main_libary_issues_0_0_0_2"></a>
Some tricks to build targeted runtime objects.</h3>
<p>We need a unique compiled object implementation for each target processor. We still prefer a single file implementation for each function to improve maintenance. So we need a way to separate setting the platform target from the implementation source. Also we need to provide a unique external symbol for each target specific implementation of a function.</p>
<p>This can be handled with a simple macro to append a suffix based on system-specific predefined macro settings.</p>
<div class="fragment"><div class="line"><span class="preprocessor">#ifdef _ARCH_PWR9</span></div><div class="line"><span class="preprocessor">#define __VEC_PWR_IMP(FNAME) FNAME ## _PWR9</span></div><div class="line"><span class="preprocessor">#else</span></div><div class="line"><span class="preprocessor">#ifdef _ARCH_PWR8</span></div><div class="line"><span class="preprocessor">#define __VEC_PWR_IMP(FNAME) FNAME ## _PWR8</span></div><div class="line"><span class="preprocessor">#else</span></div><div class="line"><span class="preprocessor">#define __VEC_PWR_IMP(FNAME) FNAME ## _PWR7</span></div><div class="line"><span class="preprocessor">#endif</span></div><div class="line"><span class="preprocessor">#endif</span></div></div><!-- fragment --><p> Then use <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> as function name wrapper in the implementation source file.</p>
<div class="fragment"><div class="line"> <span class="comment">//</span></div><div class="line"> <span class="comment">// \file vec_int512_runtime.c</span></div><div class="line"> <span class="comment">//</span></div><div class="line"></div><div class="line"><span class="preprocessor">#include <altivec.h></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int128__ppc_8h.html">pveclib/vec_int128_ppc.h</a>></span></div><div class="line"><span class="preprocessor">#include <<a class="code" href="vec__int512__ppc_8h.html">pveclib/vec_int512_ppc.h</a>></span></div><div class="line"></div><div class="line"><span class="comment">// vec_mul128x128_inline is defined in vec_int512_ppc.h</span></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a> (<a class="code" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7">vec_mul128x128</a>) (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m1l, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a> m2l)</div><div class="line">{</div><div class="line"> <span class="keywordflow">return</span> <a class="code" href="vec__int512__ppc_8h.html#a958e029fc824ec3a73ad9550bf7ea506">vec_mul128x128_inline</a> (m1l, m2l);</div><div class="line">}</div></div><!-- fragment --><p> Then use the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> function wrapper for any calling function that is linked statically to that library function. 
</p><div class="fragment"><div class="line"><a class="code" href="struct____VEC__U__1024.html">__VEC_U_1024</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a> (<a class="code" href="vec__int512__ppc_8h.html#a56a5da10870d9878e2ab888d3c4d2e7b">vec_mul512x512</a>) (<a class="code" href="struct____VEC__U__512.html">__VEC_U_512</a> m1, <a class="code" href="struct____VEC__U__512.html">__VEC_U_512</a> m2)</div><div class="line">{</div><div class="line"> <a class="code" href="struct____VEC__U__1024.html">__VEC_U_1024</a> result;</div><div class="line"> <a class="code" href="union____VEC__U__512x1.html">__VEC_U_512x1</a> mp3, mp2, mp1, mp0;</div><div class="line"></div><div class="line"> mp0.x640 = <a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a>(<a class="code" href="vec__int512__ppc_8h.html#a0cfdc3e00f5e2c3a9a959969f684203e">vec_mul512x128</a>) (m1, m2.vx0);</div><div class="line"> result.vx0 = mp0.x3.v1x128;</div><div class="line"> mp1.x640 = <a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a>(<a class="code" href="vec__int512__ppc_8h.html#acf5c808a77a8486a82a9ee87ff414fd2">vec_madd512x128a512</a>) (m1, m2.vx1, mp0.x3.v0x512);</div><div class="line"> result.vx1 = mp1.x3.v1x128;</div><div class="line"> mp2.x640 = <a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a>(<a class="code" href="vec__int512__ppc_8h.html#acf5c808a77a8486a82a9ee87ff414fd2">vec_madd512x128a512</a>) (m1, m2.vx2, mp1.x3.v0x512);</div><div class="line"> result.vx2 = mp2.x3.v1x128;</div><div class="line"> mp3.x640 = <a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a>(<a class="code" href="vec__int512__ppc_8h.html#acf5c808a77a8486a82a9ee87ff414fd2">vec_madd512x128a512</a>) (m1, m2.vx3, mp2.x3.v0x512);</div><div class="line"> result.vx3 = 
mp3.x3.v1x128;</div><div class="line"> result.vx4 = mp3.x3.v0x512.vx0;</div><div class="line"> result.vx5 = mp3.x3.v0x512.vx1;</div><div class="line"> result.vx6 = mp3.x3.v0x512.vx2;</div><div class="line"> result.vx7 = mp3.x3.v0x512.vx3;</div><div class="line"> <span class="keywordflow">return</span> result;</div><div class="line">}</div></div><!-- fragment --><p>The <b>runtime</b> library implementation is in a separate file from the <b>inline</b> implementation. The <a class="el" href="vec__int512__ppc_8h.html" title="Header package containing a collection of multiple precision quadword integer computation functions i...">vec_int512_ppc.h</a> file contains:</p><ul>
<li>static inline implementations and associated doxygen interface descriptions. These are still small enough to be used directly by application code and as building blocks for larger library implementations.</li>
<li>extern function declarations and associated doxygen interface descriptions. These names are for the dynamic shared object (<b>DSO</b>) function implementations. The functions are not qualified with inline or target suffixes. The expectation is that the dynamic linker mechanism will bind to the appropriate implementation.</li>
<li>extern function declarations qualified with a target suffix. These names are for the statically linked (<b>archive</b>) function implementations. The suffix is applied by the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> macro for the current (default) target processor. These have no doxygen descriptions as using the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> macro interferes with the doxygen scanner. But the interface is the same as the unqualified extern for the DSO implementation of the same name.</li>
</ul>
<p>The runtime source file (for example vec_int512_runtime.c) contains the common implementations for all the target qualified static interfaces.</p><ul>
<li>Again the function names are target qualified via the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> macro.</li>
<li>The runtime implementation can use any of the PVECLIB inline operations (see: <a class="el" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7" title="Vector 128x128bit Unsigned Integer Multiply. ">vec_mul128x128()</a> and <a class="el" href="vec__int512__ppc_8h.html#a131bdfc55718991610c886b2c77f6ae7" title="Vector 256x256-bit Unsigned Integer Multiply. ">vec_mul256x256()</a>) as well as other function implementations from the same file (see: <a class="el" href="vec__int512__ppc_8h.html#a56a5da10870d9878e2ab888d3c4d2e7b" title="Vector 512x512-bit Unsigned Integer Multiply. ">vec_mul512x512()</a> and <a class="el" href="vec__int512__ppc_8h.html#a8287aa4483acb25ac3188a97cc23b89a" title="Vector 2048x2048-bit Unsigned Integer Multiply. ">vec_mul2048x2048()</a>).</li>
<li>At the -O3 optimization level the compiler will attempt to inline functions referenced from the same file. Compiler heuristics will limit this based on estimates for the final generated object size. GCC also supports the function attribute __attribute__ ((flatten)) which overrides the inlining size heuristics.</li>
<li>These implementations can also use target specific CPP conditional codes to manually tweak code optimization or generated code size for specific targets.</li>
</ul>
<p>This simple strategy allows the collection of the larger function implementations into a single source file and build object files for multiple platform targets. For example collect all the multiple precision quadword implementations into a source file named <b>vec_int512_runtime.c</b>.</p>
<h2><a class="anchor" id="main_libary_issues_0_0_1"></a>
Building static runtime libraries</h2>
<p>This source file can be compiled multiple times for different platform targets. The resulting object files have unique function symbols due to the platform specific suffix provided by the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> macro. There are a number of build strategies for this.</p>
<p>For example, create a small source file named <b>vec_runtime_PWR8.c</b> that starts with the target pragma and includes the multi-platform source file. </p><div class="fragment"><div class="line"><span class="comment">// \file vec_runtime_PWR8.c</span></div><div class="line"></div><div class="line"><span class="preprocessor">#pragma GCC target ("cpu=power8")</span></div><div class="line"></div><div class="line"><span class="preprocessor">#include "vec_int512_runtime.c"</span></div></div><!-- fragment --><p> Similarly for <b>vec_runtime_PWR7.c</b> and <b>vec_runtime_PWR9.c</b> with appropriate changes for "cpu=". Additional runtime source files can be included as needed. Other multiple precision functions supporting BCD and BCD <-> binary conversions are likely candidates.</p>
<dl class="section note"><dt>Note</dt><dd>Current Clang compilers silently ignore "#pragma GCC target". This causes all such targeted runtimes to revert to the compiler default target or the configured CFLAGS "-mcpu=" option. In this case the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> macro will apply the same suffix to all functions across the targeted runtime builds. As a result linking these targeted runtime objects into the DSO will fail with duplicate symbols.</dd></dl>
<p>Projects using autotools (like PVECLIB) can use Makefile.am rules to associate runtime source files with a library. For example: </p><div class="fragment"><div class="line">libpvec_la_SOURCES = vec_runtime_PWR9.c \
vec_runtime_PWR8.c \
vec_runtime_PWR7.c</div></div><!-- fragment --><p> If compiling with GCC this is sufficient for automake to generate Makefiles to compile each of the runtime sources and combine them into a single static archive named libpvec.a. However it is not that simple, especially if the build uses a different compiler.</p>
<p>We would like to use Makefile.am rules to specify different -mcpu= compile options. This eliminates the #pragma GCC target and simplifies the platform source files to something like: </p><div class="fragment"><div class="line"> <span class="comment">//</span></div><div class="line"> <span class="comment">// \file vec_runtime_PWR8.c</span></div><div class="line"> <span class="comment">//</span></div><div class="line"></div><div class="line"><span class="preprocessor">#include "vec_int512_runtime.c"</span></div></div><!-- fragment --><p> This requires splitting the target specific runtimes into distinct automake libraries. </p><div class="fragment"><div class="line">libpveccommon_la_SOURCES = tipowof10.c <a class="code" href="vec__common__ppc_8h.html#a7b0ffb619c4d9904c405e792347b1553">decpowof2</a>.c</div><div class="line">libpvecPWR9_la_SOURCES = vec_runtime_PWR9.c</div><div class="line">libpvecPWR8_la_SOURCES = vec_runtime_PWR8.c</div><div class="line">libpvecPWR7_la_SOURCES = vec_runtime_PWR7.c</div></div><!-- fragment --><p> Then add the -mcpu compile option to the runtime library CFLAGS: </p><div class="fragment"><div class="line">libpvecPWR9_la_CFLAGS = -mcpu=power9</div><div class="line">libpvecPWR8_la_CFLAGS = -mcpu=power8</div><div class="line">libpvecPWR7_la_CFLAGS = -mcpu=power7</div></div><!-- fragment --><p> Then use additional automake rules to combine these targeted runtimes into a single static archive library. </p><div class="fragment"><div class="line">libpvecstatic_la_LIBADD = libpveccommon.la</div><div class="line">libpvecstatic_la_LIBADD += libpvecPWR9.la</div><div class="line">libpvecstatic_la_LIBADD += libpvecPWR8.la</div><div class="line">libpvecstatic_la_LIBADD += libpvecPWR7.la</div></div><!-- fragment --><p>However, this does not work if the user (at build configure) specifies flag variables (i.e. CFLAGS) containing -mcpu= options that override our internal use of target options.</p>
<dl class="section note"><dt>Note</dt><dd>Automake/libtool will always apply the user CFLAGS after any AM_CFLAGS or yourlib_la_CFLAGS (See: <a href="https://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html">Automake documentation: Flag Variables Ordering</a>) and the last -mcpu option always wins. This has the same effect as the compiler ignoring the #pragma GCC target options described above.</dd></dl>
<h3><a class="anchor" id="main_libary_issues_0_0_0_4"></a>
A deeper look at library Makefiles</h3>
<p>This requires a deeper dive into the black arts of automake and libtool. In this case the libtool macro LTCOMPILE expands the various flag variables in a specific order (with $CFLAGS last) for all --tag=CC --mode=compile commands. So we need to either:</p><ul>
<li>locally edit CFLAGS to eliminate any -mcpu= (or -O) options so that our internal build targets are applied.</li>
<li>provide our own alternative to the LTCOMPILE macro and use our own explicit make rules. (See ./pveclib/src/Makefile.am for examples.)</li>
</ul>
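<p>The first option can be sketched with GNU make's filter-out function. This is a hypothetical fragment, not PVECLIB's actual Makefile.am; the variable and target names are illustrative, and a real rule would also carry the libtool dependency-tracking boilerplate shown below.</p>

```make
# Sketch: strip any user-supplied -mcpu= option from CFLAGS so the
# per-target option we append is the one the compiler honors
# (the last -mcpu= on the command line wins).
PVECLIB_SAFE_CFLAGS = $(filter-out -mcpu=%,$(CFLAGS))

vec_staticrt_PWR9.lo: vec_runtime_PWR9.c
	$(LIBTOOL) --tag=CC --mode=compile $(CC) $(AM_CPPFLAGS) $(CPPFLAGS) \
	  $(AM_CFLAGS) $(PVECLIB_SAFE_CFLAGS) -mcpu=power9 \
	  -c -o $@ $(srcdir)/vec_runtime_PWR9.c
```

The trade-off is that filtering user flags silently discards part of the configured CFLAGS, which some packagers object to; the alternative LTCOMPILE approach below keeps the filtering explicit in the build rules instead.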
<p>So let's take a look at LTCOMPILE: </p><div class="fragment"><div class="line">LTCOMPILE = $(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) \</div><div class="line"> $(LIBTOOLFLAGS) --mode=compile $(CC) $(DEFS) \</div><div class="line"> $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) \</div><div class="line"> $(AM_CFLAGS) $(CFLAGS)</div></div><!-- fragment --> <dl class="section note"><dt>Note</dt><dd>"$(CFLAGS)" is always applied after all other <em>FLAGS</em>.</dd></dl>
<p>The generated Makefile.in includes rules that depend on LTCOMPILE. For example the general rule for compile .c source to .lo objects. </p><div class="fragment"><div class="line">.c.lo:</div><div class="line">@am__fastdepCC_TRUE@ $(AM_V_CC)depbase=`echo $@ | sed <span class="stringliteral">'s|[^/]*$$|$(DEPDIR)/&|;s|\.lo$$||'</span>`;\</div><div class="line">@am__fastdepCC_TRUE@ $(LTCOMPILE) -MT $@ -MD -MP -MF $$depbase.Tpo -c -o $@ $< &&\</div><div class="line">@am__fastdepCC_TRUE@ $(am__mv) $$depbase.Tpo $$depbase.Plo</div><div class="line">@AMDEP_TRUE@@am__fastdepCC_FALSE@ $(AM_V_CC)source=<span class="stringliteral">'$<'</span> <span class="keywordtype">object</span>=<span class="stringliteral">'$@'</span> libtool=yes @AMDEPBACKSLASH@</div><div class="line">@AMDEP_TRUE@@am__fastdepCC_FALSE@ DEPDIR=$(DEPDIR) $(CCDEPMODE) $(depcomp) @AMDEPBACKSLASH@</div><div class="line">@am__fastdepCC_FALSE@ $(AM_V_CC@am__nodep@)$(LTCOMPILE) -c -o $@ $<</div></div><!-- fragment --><p> Or the more specific rule to compile the vec_runtime_PWR9.c for the -mcpu=power9 target: </p><div class="fragment"><div class="line"> libpvecPWR9_la-vec_runtime_PWR9.lo: vec_runtime_PWR9.c</div><div class="line">@am__fastdepCC_TRUE@ $(AM_V_CC)$(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) \</div><div class="line"> $(LIBTOOLFLAGS) --mode=compile $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) \</div><div class="line"> $(AM_CPPFLAGS) $(CPPFLAGS) $(libpvecPWR9_la_CFLAGS) $(CFLAGS) \</div><div class="line"> -MT libpvecPWR9_la-vec_runtime_PWR9.lo -MD -MP -MF \</div><div class="line"> $(DEPDIR)/libpvecPWR9_la-vec_runtime_PWR9.Tpo -c -o libpvecPWR9_la-vec_runtime_PWR9.lo \</div><div class="line"> `test -f <span class="stringliteral">'vec_runtime_PWR9.c'</span> || echo <span class="stringliteral">'$(srcdir)/'</span>`vec_runtime_PWR9.c</div><div class="line">@am__fastdepCC_TRUE@ $(AM_V_at)$(am__mv) $(DEPDIR)/libpvecPWR9_la-vec_runtime_PWR9.Tpo \</div><div class="line"> 
$(DEPDIR)/libpvecPWR9_la-vec_runtime_PWR9.Plo</div><div class="line">@AMDEP_TRUE@@am__fastdepCC_FALSE@ $(AM_V_CC)source=<span class="stringliteral">'vec_runtime_PWR9.c'</span> \</div><div class="line"> <span class="keywordtype">object</span>=<span class="stringliteral">'libpvecPWR9_la-vec_runtime_PWR9.lo'</span> libtool=yes @AMDEPBACKSLASH@</div><div class="line">@AMDEP_TRUE@@am__fastdepCC_FALSE@ DEPDIR=$(DEPDIR) $(CCDEPMODE) \</div><div class="line"> $(depcomp) @AMDEPBACKSLASH@</div><div class="line">@am__fastdepCC_FALSE@ $(AM_V_CC@am__nodep@)$(LIBTOOL) $(AM_V_lt) --tag=CC \</div><div class="line"> $(AM_LIBTOOLFLAGS) $(LIBTOOLFLAGS) --mode=compile $(CC) $(DEFS) $(DEFAULT_INCLUDES) \</div><div class="line"> $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(libpvecPWR9_la_CFLAGS) $(CFLAGS) -c \</div><div class="line"> -o libpvecPWR9_la-vec_runtime_PWR9.lo `test -f <span class="stringliteral">'vec_runtime_PWR9.c'</span> \</div><div class="line"> || echo <span class="stringliteral">'$(srcdir)/'</span>`vec_runtime_PWR9.c</div></div><!-- fragment --><p> Which is eventually generated into the Makefile as: </p><div class="fragment"><div class="line">libpvecPWR9_la-vec_runtime_PWR9.lo: vec_runtime_PWR9.c</div><div class="line"> $(AM_V_CC)$(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) $(LIBTOOLFLAGS) \</div><div class="line"> --mode=compile $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) \</div><div class="line"> $(CPPFLAGS) $(libpvecPWR9_la_CFLAGS) $(CFLAGS) -MT libpvecPWR9_la-vec_runtime_PWR9.lo \</div><div class="line"> -MD -MP -MF $(DEPDIR)/libpvecPWR9_la-vec_runtime_PWR9.Tpo -c -o \</div><div class="line"> libpvecPWR9_la-vec_runtime_PWR9.lo `test -f <span class="stringliteral">'vec_runtime_PWR9.c'</span> || \</div><div class="line"> echo <span class="stringliteral">'$(srcdir)/'</span>`vec_runtime_PWR9.c</div><div class="line"> $(AM_V_at)$(am__mv) $(DEPDIR)/libpvecPWR9_la-vec_runtime_PWR9.Tpo \</div><div class="line"> 
$(DEPDIR)/libpvecPWR9_la-vec_runtime_PWR9.Plo</div><div class="line"><span class="preprocessor"># $(AM_V_CC)source='vec_runtime_PWR9.c' object='libpvecPWR9_la-vec_runtime_PWR9.lo' \</span></div><div class="line"><span class="preprocessor"># libtool=yes DEPDIR=$(DEPDIR) $(CCDEPMODE) $(depcomp) \</span></div><div class="line"><span class="preprocessor"># $(AM_V_CC_no)$(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) $(LIBTOOLFLAGS) \</span></div><div class="line"><span class="preprocessor"># --mode=compile $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) \</span></div><div class="line"><span class="preprocessor"># $(CPPFLAGS) $(libpvecPWR9_la_CFLAGS) $(CFLAGS) -c -o libpvecPWR9_la-vec_runtime_PWR9.lo \</span></div><div class="line"><span class="preprocessor"># `test -f 'vec_runtime_PWR9.c' || echo '$(srcdir)/'`vec_runtime_PWR9.c</span></div></div><!-- fragment --><p> Somehow in the internal struggle for the dark soul of automake/libtools, the <em>@am__fastdepCC_TRUE@</em> conditional wins out over <em>@AMDEP_TRUE@@am__fastdepCC_FALSE@</em> , and the alternate rule was commented out as the Makefile was generated.</p>
<p>However, this still leaves a problem. While we see that $(libpvecPWR9_la_CFLAGS) applies the "-mcpu=power9" target option, it is immediately followed by $(CFLAGS). And if CFLAGS contains any "-mcpu=" option, the last "-mcpu=" option always wins. The result will be a broken library archive with duplicate symbols.</p>
<dl class="section note"><dt>Note</dt><dd>The techniques described work reliably for most codes and compilers as long as the user does not override target (-mcpu=) with CFLAGS on configure.</dd></dl>
<h3><a class="anchor" id="main_libary_issues_0_0_0_5"></a>
Adding our own Makefile magic</h3>
<dl class="todo"><dt><b><a class="el" href="todo.html#_todo000001">Todo:</a></b></dt><dd>Is there a way for automake to compile vec_int512_runtime.c with -mcpu=power9 and -o vec_runtime_PWR9.o? And similarly for PWR7/PWR8.</dd></dl>
<p>Once we get a glimpse of the underlying automake/libtool rule generation we have a template for how to solve this problem. However, while we need to work around some automake/libtool constraints, we also want to fit into the overall flow.</p>
<p>First we need an alternative to <b>LTCOMPILE</b> where we can bypass user provided <b>CFLAGS</b>. For example: </p><div class="fragment"><div class="line">PVECCOMPILE = $(LIBTOOL) $(AM_V_lt) --tag=CC $(AM_LIBTOOLFLAGS) \</div><div class="line"> $(LIBTOOLFLAGS) --mode=compile $(CC) $(DEFS) \</div><div class="line"> $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) \</div><div class="line"> $(AM_CFLAGS)</div></div><!-- fragment --><p> In this variant (<b>PVECCOMPILE</b>) we simply leave $(CFLAGS) off the end of the macro.</p>
<p>Now we can use the generated rule above as an example to provide our own Makefile rules. These rules will be passed directly to the generated Makefile. For example: </p><div class="fragment"><div class="line">vec_staticrt_PWR9.lo: vec_runtime_PWR9.c $(pveclibinclude_HEADERS)</div><div class="line">if am__fastdepCC</div><div class="line"> $(PVECCOMPILE) $(PVECLIB_POWER9_CFLAGS) -MT $@ -MD -MP -MF \</div><div class="line"> $(DEPDIR)/$*.Tpo -c -o $@ $(srcdir)/vec_runtime_PWR9.c</div><div class="line"> mv -f $(DEPDIR)/$*.Tpo $(DEPDIR)/$*.Plo</div><div class="line">else</div><div class="line">if AMDEP</div><div class="line"> source='vec_runtime_PWR9.c' object='$@' libtool=yes @AMDEPBACKSLASH@</div><div class="line"> DEPDIR=$(DEPDIR) $(CCDEPMODE) $(depcomp) @AMDEPBACKSLASH@</div><div class="line">endif</div><div class="line"> $(PVECCOMPILE) $(PVECLIB_POWER9_CFLAGS) -c -o $@ $(srcdir)/vec_runtime_PWR9.c</div><div class="line">endif</div></div><!-- fragment --><p> We change the target (vec_staticrt_PWR9.lo) of the rule to indicate that this object is intended for a <em>static</em> runtime archive. And we list prerequisites vec_runtime_PWR9.c and $(pveclibinclude_HEADERS)</p>
<p>For the recipe we expand both clauses (am__fastdepCC and AMDEP) from the example. We don't know exactly what they represent or do, but assume they both are needed for some configurations. We use the alternative PVECCOMPILE to provide all the libtool commands and options we need without the CFLAGS. We use the new PVECLIB_POWER9_CFLAGS macro to provide all the platform specific target options we need. The automatic variable $@ provides the file name of the target object (vec_staticrt_PWR9.lo). And we specify the $(srcdir) qualified source file (vec_runtime_PWR9.c) as input to the compile. We can provide similar rules for the other processor targets (PWR8/PWR7).</p>
<p>With this technique we control the compilation of specific targets without requiring unique LTLIBRARIES. Those were only required so that libtool would allow target specific CFLAGS. So we can eliminate libpvecPWR9.la, libpvecPWR8.la, and libpvecPWR7.la from lib_LTLIBRARIES.</p>
<p>Continuing the theme of separating the static archive elements from DSO elements we rename libpveccommon.la to libpvecstatic.la. We can add the common (non-target-specific) source files and CFLAGS to <em>libpvecstatic_la</em>. </p><div class="fragment"><div class="line">libpvecstatic_la_SOURCES = tipowof10.c <a class="code" href="vec__common__ppc_8h.html#a7b0ffb619c4d9904c405e792347b1553">decpowof2</a>.c</div><div class="line"></div><div class="line">libpvecstatic_la_CFLAGS = $(AM_CPPFLAGS) $(PVECLIB_DEFAULT_CFLAGS) $(AM_CFLAGS)</div></div><!-- fragment --><p> We still need to add the target specific objects generated by the rules above to the libpvecstatic.a archive. </p><div class="fragment"><div class="line"><span class="preprocessor"># libpvecstatic_la already includes tipowof10.c decpowof2.c.</span></div><div class="line"><span class="preprocessor"># Now add the name qualified -mcpu= target runtimes.</span></div><div class="line">libpvecstatic_la_LIBADD = vec_staticrt_PWR9.lo</div><div class="line">libpvecstatic_la_LIBADD += vec_staticrt_PWR8.lo</div><div class="line">libpvecstatic_la_LIBADD += vec_staticrt_PWR7.lo</div></div><!-- fragment --> <dl class="section note"><dt>Note</dt><dd>The libpvecstatic archive will contain 2 or 3 implementations of each target specific function (i.e. the function <a class="el" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7" title="Vector 128x128bit Unsigned Integer Multiply. ">vec_mul128x128()</a> will have implementations vec_mul128x128_PWR7(), vec_mul128x128_PWR8(), and vec_mul128x128_PWR9()). This is OK because the target suffix ensures the name is unique within the archive. When an application calls a function with the appropriate target suffix (using the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. 
">__VEC_PWR_IMP()</a> wrapper macro) and links to libpvecstatic, the linker will extract only the matching implementations and include them in the static program image.</dd></dl>
<h2><a class="anchor" id="main_libary_issues_0_0_2"></a>
Building dynamic runtime libraries</h2>
<p>Building objects for dynamic runtime libraries is a bit more complicated than building static archives. First, dynamic libraries require position independent code (<b>PIC</b>) while static code does not. Second, we want to leverage the Dynamic Linker/Loader's GNU Indirect Function (See: <a href="https://sourceware.org/glibc/wiki/GNU_IFUNC">What is an indirect function (IFUNC)?</a>) binding mechanism.</p>
<p>PIC functions require a more complicated call linkage or function prologue. This usually requires the -fpic compiler option. This is the case for the OpenPOWER ELF V2 ABI. Any PIC function must assume that the caller may be from a different execution unit (library or main executable). So the called function needs to establish the Table of Contents (<b>TOC</b>) base address for itself. This is the case if the called function needs to reference static or const storage variables or make calls to functions in other dynamic libraries. So it is normal to compile library runtime code separately for static archives and DSOs.</p>
<dl class="section note"><dt>Note</dt><dd>The details of how the <b>TOC</b> is established differs between the ELF V1 ABI (Big Endian POWER) and the ELF V2 ABI (Little Endian POWER). This should not be an issue if compile options (-fpic) are used correctly.</dd></dl>
<p>There are additional differences associated with dynamic selection of function implementations for different processor targets. The Linux dynamic linker/loader (ld64.so) provides a general mechanism for target specific binding of function call linkage.</p>
<p>The dynamic linker employs a user supplied resolver mechanism as function calls are dynamically bound to an implementation. The DSO exports function symbols that externally look like a normal <em>extern</em>. For example: </p><div class="fragment"><div class="line"><span class="keyword">extern</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7">vec_mul128x128</a> (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>);</div></div><!-- fragment --><p> This symbol's implementation has a special <b>STT_GNU_IFUNC</b> attribute recognized by the dynamic linker which associates this symbol with the corresponding runtime resolver function. So in addition to any platform specific implementations we need to provide the resolver function referenced by the <em>IFUNC</em> symbol. 
For example: </p><div class="fragment"><div class="line"> <span class="comment">//</span></div><div class="line"> <span class="comment">// \file vec_runtime_DYN.c</span></div><div class="line"> <span class="comment">//</span></div><div class="line"><span class="keyword">extern</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">vec_mul128x128_PWR7 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>);</div><div class="line"></div><div class="line"><span class="keyword">extern</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">vec_mul128x128_PWR8 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>);</div><div class="line"></div><div class="line"><span class="keyword">extern</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">vec_mul128x128_PWR9 (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>);</div><div class="line"></div><div class="line"><span class="keyword">static</span></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line">(*resolve_vec_mul128x128 (<span class="keywordtype">void</span>))(<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>)</div><div class="line">{</div><div class="line"><span class="preprocessor">#ifdef __BUILTIN_CPU_SUPPORTS__</span></div><div class="line"> <span class="keywordflow">if</span> 
(__builtin_cpu_is (<span class="stringliteral">"power9"</span>))</div><div class="line"> <span class="keywordflow">return</span> vec_mul128x128_PWR9;</div><div class="line"> <span class="keywordflow">else</span></div><div class="line"> {</div><div class="line"> <span class="keywordflow">if</span> (__builtin_cpu_is (<span class="stringliteral">"power8"</span>))</div><div class="line"> <span class="keywordflow">return</span> vec_mul128x128_PWR8;</div><div class="line"> <span class="keywordflow">else</span></div><div class="line"> <span class="keywordflow">return</span> vec_mul128x128_PWR7;</div><div class="line"> }</div><div class="line"><span class="preprocessor">#else // ! __BUILTIN_CPU_SUPPORTS__</span></div><div class="line"> <span class="keywordflow">return</span> vec_mul128x128_PWR7;</div><div class="line"><span class="preprocessor">#endif</span></div><div class="line">}</div><div class="line"></div><div class="line"><a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7">vec_mul128x128</a> (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>)</div><div class="line">__attribute__ ((ifunc (<span class="stringliteral">"resolve_vec_mul128x128"</span>)));</div></div><!-- fragment --><p> For convenience we collect the:</p><ul>
<li>IFUNC symbols</li>
<li>corresponding resolver functions</li>
<li>and externs to target specific implementations</li>
</ul>
<p>into one or more source files (For example: vec_runtime_DYN.c).</p>
<p>On the program's first call to an <em>IFUNC</em> symbol, the dynamic linker calls the resolver function associated with that symbol. The resolver function performs a runtime check to determine the platform, selects the (closest) matching platform specific function, then returns that function pointer to the dynamic linker.</p>
<p>The dynamic linker stores this function pointer in the caller's Procedure Linkage Table (PLT) before forwarding the call to the resolved implementation. Any subsequent calls to this function symbol branch (via the PLT) directly to the appropriate platform specific implementation.</p>
<dl class="section note"><dt>Note</dt><dd>The platform specific implementations we use here are compiled from the same source files we used to build the static library archive.</dd></dl>
<p>Like the static libraries we need to build multiple target specific implementations of the functions. So we can leverage the example of explicit Makefile rules we used for the static archive but with some minor differences. For example: </p><div class="fragment"><div class="line">vec_dynrt_PWR9.lo: vec_runtime_PWR9.c $(pveclibinclude_HEADERS)</div><div class="line"><span class="keywordflow">if</span> am__fastdepCC</div><div class="line"> $(PVECCOMPILE) -fpic $(PVECLIB_POWER9_CFLAGS) -MT $@ -MD -MP -MF \</div><div class="line"> $(DEPDIR)/$*.Tpo -c -o $@ $(srcdir)/vec_runtime_PWR9.c</div><div class="line"> mv -f $(DEPDIR)/$*.Tpo $(DEPDIR)/$*.Plo</div><div class="line"><span class="keywordflow">else</span></div><div class="line"><span class="keywordflow">if</span> AMDEP</div><div class="line"> source=<span class="stringliteral">'vec_runtime_PWR9.c'</span> <span class="keywordtype">object</span>=<span class="stringliteral">'$@'</span> libtool=yes @AMDEPBACKSLASH@</div><div class="line"> DEPDIR=$(DEPDIR) $(CCDEPMODE) $(depcomp) @AMDEPBACKSLASH@</div><div class="line">endif</div><div class="line"> $(PVECCOMPILE) -fpic $(PVECLIB_POWER9_CFLAGS) -c -o $@ \</div><div class="line"> $(srcdir)/vec_runtime_PWR9.c</div><div class="line">endif</div></div><!-- fragment --><p> Again we change the rule target (vec_dynrt_PWR9.lo) of the rule to indicate that this object is intended for a <em>DSO</em> runtime. And we list the same prerequisites vec_runtime_PWR9.c and $(pveclibinclude_HEADERS)</p>
<p>For the recipe we expand both clauses (am__fastdepCC and AMDEP) from the example. We use the alternative PVECCOMPILE to provide all the libtool commands and options we need without the CFLAGS. But we insert the -fpic option so the compiler will generate position independent code. We use a new PVECLIB_POWER9_CFLAGS macro to provide all the platform specific target options we need. The automatic variable $@ provides the file name of the target object (vec_dynrt_PWR9.lo). And we specify the same $(srcdir) qualified source file (vec_runtime_PWR9.c) we used for the static library. We can provide similar rules for the other processor targets (PWR8/PWR7). We also build an -fpic version of vec_runtime_common.c.</p>
<p>Continuing the theme of separating the static archive elements from DSO elements, we use libpvec.la as the libtool name for libpvec.so. Here we add the source files for the IFUNC resolvers and add -fpic as library specific CFLAGS to <em>libpvec_la</em>. </p><div class="fragment"><div class="line">libpvec_la_SOURCES = vec_runtime_DYN.c</div><div class="line"></div><div class="line">libpvec_la_CFLAGS = $(AM_CPPFLAGS) -fpic $(PVECLIB_DEFAULT_CFLAGS) $(AM_CFLAGS)</div></div><!-- fragment --><p> We still need to add the target specific and common objects generated by the rules above to the libpvec library. </p><div class="fragment"><div class="line"><span class="preprocessor"># libpvec_la already includes vec_runtime_DYN.c compiled -fpic</span></div><div class="line"><span class="preprocessor"># for IFUNC resolvers.</span></div><div class="line"><span class="preprocessor"># Now adding the -fpic -mcpu= target built runtimes.</span></div><div class="line">libpvec_la_LDFLAGS = -version-info $(PVECLIB_SO_VERSION)</div><div class="line">libpvec_la_LIBADD = vec_dynrt_PWR9.lo</div><div class="line">libpvec_la_LIBADD += vec_dynrt_PWR8.lo</div><div class="line">libpvec_la_LIBADD += vec_dynrt_PWR7.lo</div><div class="line">libpvec_la_LIBADD += vec_dynrt_common.lo</div><div class="line">libpvec_la_LIBADD += -lc</div></div><!-- fragment --><h2><a class="anchor" id="make_libary_issues_0_0_3"></a>
Calling Multi-platform functions</h2>
<p>The next step is to provide mechanisms for applications to call these functions via static or dynamic linkage. For static linkage the application needs to reference a specific platform variant of the function's name. For dynamic linkage we will use <b>STT_GNU_IFUNC</b> symbol resolution (a symbol type extension to the ELF standard).</p>
<h3><a class="anchor" id="main_libary_issues_0_0_1_1"></a>
Static linkage to platform specific functions</h3>
<p>For static linkage the application is compiled for a specific platform target (via -mcpu=). So function calls should be bound to the matching platform specific implementations. The application may select the platform specific function directly by defining an <em>extern</em> and invoking the platform qualified function.</p>
<p>Or simply use the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> macro as a wrapper for the function name in the application. This selects the appropriate platform specific implementation based on the -mcpu= specified for the application compile. For example: </p><div class="fragment"><div class="line">k = <a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a> (<a class="code" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7">vec_mul128x128</a>)(i, j);</div></div><!-- fragment --><p>The <a class="el" href="vec__int512__ppc_8h.html" title="Header package containing a collection of multiple precision quadword integer computation functions i...">vec_int512_ppc.h</a> header provides the default platform qualified <em>extern</em> declarations for this and related functions based on the -mcpu= specified for the compile of the application including this header. For example: </p><div class="fragment"><div class="line"><span class="keyword">extern</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7">__VEC_PWR_IMP</a> (<a class="code" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7">vec_mul128x128</a>) (<a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>, <a class="code" href="vec__common__ppc_8h.html#aaf7a8e92d8ba681dac3d2ec3259c0820">vui128_t</a>);</div></div><!-- fragment --><p> For example, if the application calling <a class="el" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7" title="Vector 128x128bit Unsigned Integer Multiply. 
">vec_mul128x128()</a> is itself compiled with -mcpu=power8, then the <a class="el" href="vec__int512__ppc_8h.html#a77eca5d7bebe0f30894fe9669c01b7a7" title="Macro to add platform suffix for static calls. ">__VEC_PWR_IMP()</a> will insure that:</p><ul>
<li>The <a class="el" href="vec__int512__ppc_8h.html" title="Header package containing a collection of multiple precision quadword integer computation functions i...">vec_int512_ppc.h</a> header will define an extern for vec_mul128x128_PWR8.</li>
<li>That application's calls to __VEC_PWR_IMP (vec_mul128x128) will reference vec_mul128x128_PWR8.</li>
</ul>
<p>The application should then link to the libpvecstatic.a archive. Where the application references PVECLIB functions with the appropriate target suffix, the linker will extract only the matching implementations and include them in the program image.</p>
<h3><a class="anchor" id="main_libary_issues_0_0_1_2"></a>
Dynamic linkage to platform specific functions</h3>
<p>Applications using dynamic linkage will call the unqualified function symbol. For example: </p><div class="fragment"><div class="line"><span class="keyword">extern</span> <a class="code" href="struct____VEC__U__256.html">__VEC_U_256</a></div><div class="line"><a class="code" href="vec__int512__ppc_8h.html#ab5b80fd9694cea8bf502b26e55af37f7">vec_mul128x128</a> (vui128_t, vui128_t);</div></div><!-- fragment --><p>This symbol's implementation (in libpvec.so) has a special <b>STT_GNU_IFUNC</b> attribute recognized by the dynamic linker which associates this symbol with the corresponding runtime resolver function. The application simply calls the (unqualified) function and the dynamic linker (with the help of PVECLIB's IFUNC resolvers) handles the details.</p>
<h1><a class="anchor" id="perf_data"></a>
Performance data.</h1>
<p>It is useful to provide basic performance data for each pveclib function. This is challenging as these functions are small and intended to be in-lined within larger functions (algorithms). As such they are subject to both the compiler's instruction scheduling and common subexpression optimizations plus the processor's super-scalar and out-of-order execution design features.</p>
<p>As pveclib functions are normally only a few instructions, the actual timing will depend on the context they are used in (the instructions they depend on for data and the instructions that precede them in the pipelines).</p>
<p>The simplest approach is to use the same performance metrics as the Power processor User's Manual performance profiles. These are normally per instruction latency in cycles and throughput in instructions issued per cycle. There may also be additional information for special conditions that may apply.</p>
<p>For example, consider the vector float absolute value function. For recent PowerISA implementations this is a single (VSX <b>xvabssp</b>) instruction which we can look up in the POWER8 / POWER9 Processor User's Manuals (<b>UM</b>).</p>
<table class="doxtable">
<tr>
<th align="right">processor</th><th align="center">Latency</th><th align="left">Throughput </th></tr>
<tr>
<td align="right">power8 </td><td align="center">6-7 </td><td align="left">2/cycle </td></tr>
<tr>
<td align="right">power9 </td><td align="center">2 </td><td align="left">2/cycle </td></tr>
</table>
<p>The POWER8 UM specifies a latency of <em>"6 cycles to FPU (+1 cycle to other VSU ops"</em> for this class of VSX single precision FPU instructions. So the minimum latency is 6 cycles if the register result is input to another VSX single precision FPU instruction. Otherwise if the result is input to a VSU logical or integer instruction then the latency is 7 cycles. The POWER9 UM shows the pipeline improvement of 2 cycles latency for simple FPU instructions like this. Both processors support dual pipelines for a 2/cycle throughput capability.</p>
<p>A more complicated example:</p><div class="fragment"><div class="line"><span class="keyword">static</span> <span class="keyword">inline</span> <a class="code" href="vec__common__ppc_8h.html#aafeddf1e79ef817440ff01fafb0e00ca">vb32_t</a></div><div class="line"><a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (<a class="code" href="vec__common__ppc_8h.html#a18f1382a0cb269770bbb8387dfcbbe1c">vf32_t</a> vf32)</div><div class="line">{</div><div class="line"><a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> tmp2;</div><div class="line"><span class="keyword">const</span> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> expmask = <a class="code" href="vec__common__ppc_8h.html#ae4520a89b9b5a292a3e647a6d5b712ad">CONST_VINT128_W</a>(0x7f800000, 0x7f800000, 0x7f800000,</div><div class="line"> 0x7f800000);</div><div class="line"><span class="preprocessor">#if _ARCH_PWR9</span></div><div class="line"><span class="comment">// P9 has a 2 cycle xvabssp and eliminates a const load.</span></div><div class="line">tmp2 = (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) vec_abs (vf32);</div><div class="line"><span class="preprocessor">#else</span></div><div class="line"><span class="keyword">const</span> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> signmask = <a class="code" href="vec__common__ppc_8h.html#ae4520a89b9b5a292a3e647a6d5b712ad">CONST_VINT128_W</a>(0x80000000, 0x80000000, 0x80000000,</div><div class="line"> 0x80000000);</div><div class="line">tmp2 = vec_andc ((<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>)vf32, signmask);</div><div class="line"><span class="preprocessor">#endif</span></div><div class="line"><span class="keywordflow">return</span> vec_cmpgt (tmp2, expmask);</div><div 
class="line">}</div></div><!-- fragment --><p> Here we want to test for <em>Not A Number</em> without triggering any of the associate floating-point exceptions (VXSNAN or VXVC). For this test the sign bit does not effect the result so we need to zero the sign bit before the actual test. The vector abs would work for this, but we know from the example above that this instruction has a high latency as we are definitely passing the result to a non-FPU instruction (vector compare greater than unsigned word).</p>
<p>So the code needs to load two constant vector masks, then vector and-complement to clear the sign-bit, before comparing each word for greater than infinity. The generated code should look something like this:</p><div class="fragment"><div class="line">addis r9,r2,.rodata.cst16+0x10@ha</div><div class="line">addis r10,r2,.rodata.cst16+0x20@ha</div><div class="line">addi r9,r9,.rodata.cst16+0x10@l</div><div class="line">addi r10,r10,.rodata.cst16+0x20@l</div><div class="line">lvx v0,0,r10 # load vector <span class="keyword">const</span> signmask</div><div class="line">lvx v12,0,r9 # load vector <span class="keyword">const</span> expmask</div><div class="line">xxlandc vs34,vs34,vs32</div><div class="line">vcmpgtuw v2,v2,v12</div></div><!-- fragment --><p> So we have six instructions to load the const masks and two instructions for the actual vec_isnanf32 function. The first six instructions are only needed once for each containing function, can be hoisted out of loops and into the function prologue, can be <em>commoned</em> with the same constant for other pveclib functions, or executed out-of-order and early by the processor.</p>
<p>Most of the time, constant setup does not contribute measurably to the overall performance of vec_isnanf32. When it does, it is limited by the longest (in cycles latency) of the various independent paths that load constants. In this case the const load sequence is composed of three pairs of instructions that can issue and execute in parallel. The addis/addi FXU instructions support a throughput of 6/cycle and the lvx load supports 2/cycle. So the two vector constant load sequences can execute in parallel and the latency is the same as a single const load.</p>
<p>For POWER8 it appears to be (2+2+5=) 9 cycles latency for the const load. While the core vec_isnanf32 function (xxlandc/vcmpgtuw) is a dependent sequence and runs (2+2) 4 cycles latency. Similar analysis for POWER9 where the addis/addi/lvx sequence is still listed as (2+2+5) 9 cycles latency. While the xxlandc/vcmpgtuw sequence increases to (2+3) 5 cycles.</p>
<p>The next interesting question is what can we say about throughput (if anything) for this example. The thought experiment is "what would happen if":</p><ul>
<li>two or more instances of vec_isnanf32 are used within a single function,</li>
<li>in close proximity in the code,</li>
<li>with independent data as input,</li>
</ul>
<p>could the generated instructions execute in parallel, and to what extent? This is illustrated by the following (contrived) example: </p><div class="fragment"><div class="line"><span class="keywordtype">int</span></div><div class="line">test512_all_f32_nan (<a class="code" href="vec__common__ppc_8h.html#a18f1382a0cb269770bbb8387dfcbbe1c">vf32_t</a> val0, <a class="code" href="vec__common__ppc_8h.html#a18f1382a0cb269770bbb8387dfcbbe1c">vf32_t</a> val1, <a class="code" href="vec__common__ppc_8h.html#a18f1382a0cb269770bbb8387dfcbbe1c">vf32_t</a> val2, <a class="code" href="vec__common__ppc_8h.html#a18f1382a0cb269770bbb8387dfcbbe1c">vf32_t</a> val3)</div><div class="line">{</div><div class="line"><span class="keyword">const</span> <a class="code" href="vec__common__ppc_8h.html#aafeddf1e79ef817440ff01fafb0e00ca">vb32_t</a> alltrue = { -1, -1, -1, -1 };</div><div class="line"><a class="code" href="vec__common__ppc_8h.html#aafeddf1e79ef817440ff01fafb0e00ca">vb32_t</a> nan0, nan1, nan2, nan3;</div><div class="line"></div><div class="line">nan0 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val0);</div><div class="line">nan1 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val1);</div><div class="line">nan2 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val2);</div><div class="line">nan3 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val3);</div><div class="line"></div><div class="line">nan0 = vec_and (nan0, nan1);</div><div class="line">nan2 = vec_and (nan2, nan3);</div><div class="line">nan0 = vec_and (nan2, nan0);</div><div class="line"></div><div class="line"><span class="keywordflow">return</span> vec_all_eq(nan0, alltrue);</div><div class="line">}</div></div><!-- fragment --><p> which tests 4 X vector float (16 X float) values and returns true if all 16 floats are 
NaN. Recent compilers will generate something like the following PowerISA code: </p><div class="fragment"><div class="line"> addis r9,r2,-2</div><div class="line"> addis r10,r2,-2</div><div class="line"> vspltisw v13,-1 # load vector <span class="keyword">const</span> alltrue</div><div class="line"> addi r9,r9,21184</div><div class="line"> addi r10,r10,-13760</div><div class="line"> lvx v0,0,r9 # load vector <span class="keyword">const</span> signmask</div><div class="line"> lvx v1,0,r10 # load vector <span class="keyword">const</span> expmask</div><div class="line"> xxlandc vs35,vs35,vs32</div><div class="line"> xxlandc vs34,vs34,vs32</div><div class="line"> xxlandc vs37,vs37,vs32</div><div class="line"> xxlandc vs36,vs36,vs32</div><div class="line"> vcmpgtuw v3,v3,v1 # nan1 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val1);</div><div class="line"> vcmpgtuw v2,v2,v1 # nan0 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val0);</div><div class="line"> vcmpgtuw v5,v5,v1 # nan3 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val3);</div><div class="line"> vcmpgtuw v4,v4,v1 # nan2 = <a class="code" href="vec__f32__ppc_8h.html#acd364c3e220e61061f6c5ecd858a78de">vec_isnanf32</a> (val2);</div><div class="line"> xxland vs35,vs35,vs34 # nan0 = vec_and (nan0, nan1);</div><div class="line"> xxland vs36,vs37,vs36 # nan2 = vec_and (nan2, nan3);</div><div class="line"> xxland vs36,vs35,vs36 # nan0 = vec_and (nan2, nan0);</div><div class="line"> vcmpequw. v4,v4,v13 # vec_all_eq(nan0, alltrue);</div><div class="line">...</div></div><!-- fragment --><p> First the generated code loads the vector constants for signmask, expmask, and alltrue. We see that the code is generated only once for each constant. Then the compiler generates the core vec_isnanf32 function four times and interleaves the instructions. 
This enables parallel pipeline execution where conditions allow. Finally the 16X isnan results are reduced to 8X, then 4X, then to a single condition code.</p>
<p>For this exercise we will ignore the constant load as in any realistic usage it will be <em>commoned</em> across several pveclib functions and hoisted out of any loops. The reduction code is not part of the vec_isnanf32 implementation and is also ignored. The sequence of 4X xxlandc and 4X vcmpgtuw in the middle is the interesting part.</p>
<p>For POWER8 both xxlandc and vcmpgtuw are listed as 2 cycles latency and a throughput of 2 per cycle. So we can assume that (only) the first two xxlandc will issue in the same cycle (assuming the input vectors are ready). The issue of the next two xxlandc instructions will be delayed by 1 cycle. The following vcmpgtuw instructions are dependent on the xxlandc results and will not execute until their input vectors are ready. The first two vcmpgtuw instructions will execute 2 cycles (latency) after the first two xxlandc instructions execute. Execution of the second two vcmpgtuw instructions will be delayed 1 cycle due to the issue delay in the second pair of xxlandc instructions.</p>
<p>So at least for this example and this set of simplifying assumptions we suggest that the throughput metric for vec_isnanf32 is 2/cycle. For the latency metric we offer a range, with the latency for the core function (without any constant load overhead) first, followed by the total latency (the sum of the constant load and core function latency). For the vec_isnanf32 example the metrics are:</p>
<table class="doxtable">
<tr>
<th align="right">processor</th><th align="center">Latency</th><th align="left">Throughput </th></tr>
<tr>
<td align="right">power8 </td><td align="center">4-13 </td><td align="left">2/cycle </td></tr>
<tr>
<td align="right">power9 </td><td align="center">5-14 </td><td align="left">2/cycle </td></tr>
</table>
<p>Next, look at a slightly more complicated example where the core function's implementation can execute more than one instruction per cycle. Consider: </p><div class="fragment"><div class="line"><span class="keyword">static</span> <span class="keyword">inline</span> <a class="code" href="vec__common__ppc_8h.html#aafeddf1e79ef817440ff01fafb0e00ca">vb32_t</a></div><div class="line"><a class="code" href="vec__f32__ppc_8h.html#a0d808fb7bf9b6603274b1b3fdbe626a1">vec_isnormalf32</a> (<a class="code" href="vec__common__ppc_8h.html#a18f1382a0cb269770bbb8387dfcbbe1c">vf32_t</a> vf32)</div><div class="line">{</div><div class="line"><a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> tmp, tmp2;</div><div class="line"><span class="keyword">const</span> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> expmask = <a class="code" href="vec__common__ppc_8h.html#ae4520a89b9b5a292a3e647a6d5b712ad">CONST_VINT128_W</a>(0x7f800000, 0x7f800000, 0x7f800000,</div><div class="line"> 0x7f800000);</div><div class="line"><span class="keyword">const</span> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> minnorm = <a class="code" href="vec__common__ppc_8h.html#ae4520a89b9b5a292a3e647a6d5b712ad">CONST_VINT128_W</a>(0x00800000, 0x00800000, 0x00800000,</div><div class="line"> 0x00800000);</div><div class="line"><span class="preprocessor">#if _ARCH_PWR9</span></div><div class="line"><span class="comment">// P9 has a 2 cycle xvabssp and eliminates a const load.</span></div><div class="line">tmp2 = (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) vec_abs (vf32);</div><div class="line"><span class="preprocessor">#else</span></div><div class="line"><span class="keyword">const</span> <a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a> signmask = <a class="code" 
href="vec__common__ppc_8h.html#ae4520a89b9b5a292a3e647a6d5b712ad">CONST_VINT128_W</a>(0x80000000, 0x80000000, 0x80000000,</div><div class="line"> 0x80000000);</div><div class="line">tmp2 = vec_andc ((<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>)vf32, signmask);</div><div class="line"><span class="preprocessor">#endif</span></div><div class="line">tmp = vec_and ((<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) vf32, expmask);</div><div class="line">tmp2 = (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) vec_cmplt (tmp2, minnorm);</div><div class="line">tmp = (<a class="code" href="vec__common__ppc_8h.html#a2ff4a776536870e01b7c9e454586544b">vui32_t</a>) vec_cmpeq (tmp, expmask);</div><div class="line"></div><div class="line"><span class="keywordflow">return</span> (<a class="code" href="vec__common__ppc_8h.html#aafeddf1e79ef817440ff01fafb0e00ca">vb32_t</a> )vec_nor (tmp, tmp2);</div><div class="line">}</div></div><!-- fragment --><p> which requires two (independent) masking operations (sign and exponent), two (independent) compares that are dependent on the masking operations, and a final <em>not OR</em> operation dependent on the compare results.</p>
<p>The generated POWER8 code looks like this:</p><div class="fragment"><div class="line">addis r10,r2,-2</div><div class="line">addis r8,r2,-2</div><div class="line">addi r10,r10,21184</div><div class="line">addi r8,r8,-13760</div><div class="line">addis r9,r2,-2</div><div class="line">lvx v13,0,r8</div><div class="line">addi r9,r9,21200</div><div class="line">lvx v1,0,r10</div><div class="line">lvx v0,0,r9</div><div class="line">xxland vs33,vs33,vs34</div><div class="line">xxlandc vs34,vs45,vs34</div><div class="line">vcmpgtuw v0,v0,v1</div><div class="line">vcmpequw v2,v2,v13</div><div class="line">xxlnor vs34,vs32,vs34</div></div><!-- fragment --><p> Note that this sequence needs to load 3 vector constants. In previous examples we have noted that POWER8 lvx supports 2/cycle throughput. But with good scheduling, the 3rd vector constant load will only add 1 additional cycle to the timing (10 cycles).</p>
<p>Once the constant masks are loaded the xxland/xxlandc instructions can execute in parallel. The vcmpgtuw/vcmpequw can also execute in parallel but are delayed waiting for the results of the masking operations. Finally the xxlnor is dependent on the data from both compare instructions.</p>
<p>For POWER8 the latencies are 2 cycles each, and assuming parallel execution of xxland/xxlandc and vcmpgtuw/vcmpequw we can assume (2+2+2=) 6 cycles minimum latency and another 10 cycles for the constant loads (if needed).</p>
<p>While the POWER8 core has ample resources (10 issue ports across 16 execution units), this specific sequence is restricted to the two <em>issue ports and VMX execution units</em> that handle this class of (simple vector integer and logical) instructions. While this allows vec_isnormalf32 a lower latency (6 cycles versus the 10 expected for 5 serial instructions), it also implies that both of the POWER8 core's <em>VMX execution units</em> are busy for 2 of the 6 cycles.</p>
<p>So while the individual instructions can have a throughput of 2/cycle, vec_isnormalf32 cannot. It is plausible for two executions of vec_isnormalf32 to interleave with a delay of 1 cycle for the second sequence. To keep the table information simple for now, just say the throughput of vec_isnormalf32 is 1/cycle.</p>
<p>After that it gets complicated. For example, after the first two instances of vec_isnormalf32 are issued, both <em>VMX execution units</em> are busy for 4 cycles. So either the first instructions of the third vec_isnormalf32 will be delayed until the fifth cycle, or the compiler scheduler will interleave instructions across the instances of vec_isnormalf32 and the latencies of the individual vec_isnormalf32 results will increase. This is too complicated to put in a simple table.</p>
<p>For POWER9 the sequence is slightly different:</p><div class="fragment"><div class="line">addis r10,r2,-2</div><div class="line">addis r9,r2,-2</div><div class="line">xvabssp vs45,vs34</div><div class="line">addi r10,r10,-14016</div><div class="line">addi r9,r9,-13920</div><div class="line">lvx v1,0,r10</div><div class="line">lvx v0,0,r9</div><div class="line">xxland vs34,vs34,vs33</div><div class="line">vcmpgtuw v0,v0,v13</div><div class="line">vcmpequw v2,v2,v1</div><div class="line">xxlnor vs34,vs32,vs34</div></div><!-- fragment --><p> We use vec_abs (xvabssp) to replace the signmask and vec_andc, and so only need to load two vector constants. So the constant load overhead is reduced to 9 cycles. However the vector compares are now 3 cycles each, for (2+3+2=) 7 cycles for the core sequence. The final table for vec_isnormalf32:</p>
<table class="doxtable">
<tr>
<th align="right">processor</th><th align="center">Latency</th><th align="left">Throughput </th></tr>
<tr>
<td align="right">power8 </td><td align="center">6-16 </td><td align="left">1/cycle </td></tr>
<tr>
<td align="right">power9 </td><td align="center">7-16 </td><td align="left">1/cycle </td></tr>
</table>
<h2><a class="anchor" id="perf_data_sub_1"></a>
Additional analysis and tools.</h2>
<p>The overview above is a simplified analysis based on the instruction latency and throughput numbers published in the Processor User's Manuals (see <a class="el" href="index.html#mainpage_ref_docs">Reference Documentation</a>). These values are <em>best case</em> (input data is ready, SMT1 mode, no cache misses, mispredicted branches, or other hazards) for each instruction in isolation.</p>
<dl class="section note"><dt>Note</dt><dd>This information is intended as a guide for compiler and application developers wishing to optimize for the platform. Any performance tables provided for pveclib functions are in this spirit.</dd></dl>
<p>Of course the actual performance is complicated by the overall environment and how the pveclib functions are used. It would be unusual for pveclib functions to be used in isolation. The compiler will in-line pveclib functions and look for sub-expressions it can hoist out of loops or share across pveclib function instances. The compiler will also model the processor and schedule instructions across the larger containing function. So in actual use the instruction sequences for the examples above are likely to be interleaved with instructions from other pveclib functions and user written code.</p>
<p>Larger functions that use pveclib and even some of the more complicated pveclib functions (like vec_muludq) defy simple analysis. For these cases it is better to use POWER specific analysis tools. To understand the overall pipeline flows and identify hazards the instruction trace driven performance simulator is recommended.</p>
<p>The <a href="https://developer.ibm.com/linuxonpower/advance-toolchain/">IBM Advance Toolchain</a> includes an updated (POWER enabled) Valgrind tool and instruction trace plug-in (itrace). The itrace tool (--tool=itrace) collects instruction traces for the whole program or for specific functions (via the --fnname= option).</p>
<dl class="section note"><dt>Note</dt><dd>The Valgrind package provided by the Linux Distro may not be enabled for the latest POWER processor. Nor will it include the itrace plug-in or the associated vgi2qt conversion tool.</dd></dl>
<p>Instruction trace files are processed by the <a href="https://developer.ibm.com/linuxonpower/sdk-packages/">Performance Simulator</a> (sim_ppc) models. Performance simulators are specific to each processor generation (POWER7-9) and provide cycle-accurate modeling of instruction trace streams. The results of the model (a pipe file) can be viewed via one of the interactive display tools (scrollpv, jviewer) or passed to an analysis tool like <a href="https://developer.ibm.com/linuxonpower/sdk-packages/">pipestat</a>. </p>
</div></div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated on Fri Jul 17 2020 17:13:19 for POWER Vector Library Manual by  <a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.13
</small></address>
</body>
</html>