Vector processor
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions operate on one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and to scalar processors augmented with SIMD arithmetic units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.
Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the late 1990s.
History
Early work
Vector processing development began in the early 1960s at Westinghouse in their "Solomon" project. Solomon's goal was to dramatically increase math performance by using a large number of simple math co-processors under the control of a single master CPU. The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array.
In 1962, Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, massively parallel computing.
A computer for operations with functions was presented and developed by Kartsev in 1967.[1]
Supercomputers
The first successful implementations of vector processing were the Control Data Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC), both introduced in the early 1970s.
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
The STAR-100 was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks it could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.
The vector technique was first fully exploited in 1976 by the famous Cray-1. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations.
The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era.
Other examples followed. Control Data Corporation tried to re-enter the high-end market with its ETA-10 machine, but it sold poorly, and CDC took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (Fujitsu, Hitachi and Nippon Electric Corporation (NEC)) introduced register-based vector machines similar to the Cray-1, typically slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers.
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed the Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as a vector processor.
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions.
One extreme and rare example of an Array Processor was the Aspex Microelectronics ASP[2], which categorised itself as "Massive wide SIMD" but had bit-level ALUs and bit-level predication, and so could definitively be considered an Array (Vector) Processor.[3][4] The Linedancer, released in 2010, contained 4096 2-bit Predicated SIMD ALUs, each with its own Content-Addressable Memory, and was capable of 800 billion instructions per second.[5]
GPU
Modern graphics processing units (GPUs) include an array of shader pipelines which may be driven by compute kernels, and can be considered vector processors (using a similar strategy for hiding memory latencies). The key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster, but the cores receiving and executing that same instruction are otherwise reasonably normal: they have their own ALUs, their own register files, their own load/store units and their own independent L1 data caches. Thus, although all cores simultaneously execute the exact same instruction in lock-step with each other, they do so with completely different data from completely different memory locations. This is significantly more complex and involved than SIMD, which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW[6] team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.[7]
Comparison with modern architectures
As of 2016 most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition, the addition of SIMD cannot by itself qualify a processor as an actual Vector Processor, because SIMD is fixed-length and Vectors are variable. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing.[citation needed]
- Pure (fixed) SIMD - Common examples using SIMD with features inspired by Vector processors include Intel x86's MMX, SSE and AVX instructions, AMD's 3DNow! extensions, ARM NEON, Sparc's VIS extension, PowerPC's AltiVec and MIPS' MSA. In 2000, IBM, Toshiba and Sony collaborated to create the Cell processor, which is also SIMD.
- Predicated SIMD - Two notable examples which have per-element (lane-based) predication are ARM SVE2 and AVX-512.
- Pure Vectors - these include the original Cray-1, RISC-V RVV and the SX-Aurora TSUBASA. Although memory-based, the STAR-100 was also a Vector Processor.
Other CPU designs include multiple instructions for vector processing on multiple (vectorised) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word). The Fujitsu FR-V VLIW/vector processor combines both technologies.
Definition of Vector Processing
Vector Processors, inherently by definition and design, have always been variable-length since their inception.
Although pure (fixed-width, no predication) SIMD is commonly and mistakenly claimed to be "Vectors", close analysis and comparison of historic and modern ISAs shows that actual Vector processors are discernable by having one or both of two key features:[citation needed]
- a way to set the Vector Length (such as the setvl instruction in RISC-V RVV[8]) or a REP (instruction repeating) feature in some form, without limiting repeats to a power of two
- comprehensive individual element-level predicate masks on every Vector instruction, as is now available in ARM SVE2[9] and AVX-512
The difference between pure (fixed-length) SIMD and pure (variable-length) Vector Processing is illustrated in the examples below. Predicated SIMD, which uses fixed-width SIMD ALUs but allows variable (predicated) activation of units to provide the appearance of variable length, sits somewhere in between.
Description
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Memory wall.
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time.
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of the instruction itself that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time.
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
; Hypothetical RISC machine
; add 10 numbers in a to 10 numbers in b, storing results in c
; assume a, b, and c are memory locations in their respective registers
move $10, count ; count := 10
loop:
load r1, a
load r2, b
add r3, r1, r2 ; r3 := r1 + r2
store r3, c
add a, a, $4 ; move on
add b, b, $4
add c, c, $4
dec count ; decrement
jnez count, loop ; loop back if count is not yet 0
ret
But to a vector processor, this task looks considerably different:
; assume we have vector registers v1-v3
; with size equal or larger than 10
move $10, count ; count = 10
vload v1, a, count
vload v2, b, count
vadd v3, v1, v2
vstore v3, c, count
ret
Note the complete lack of looping in the instructions, because it is the hardware which has performed 10 sequential operations: effectively the loop count is on an explicit per-instruction basis.
Cray-style Vector ISAs take this a step further and provide a global "count" Register, called Vector Length (VL):
; again assume we have vector registers v1-v3
; with size larger than or equal to 10
setvli $10 # Set vector length VL=10
vload v1, a # 10 loads from a
vload v2, b # 10 loads from b
vadd v3, v1, v2 # 10 adds
vstore v3, c # 10 stores into c
ret
There are several savings inherent in this approach.
- Only three address translations are needed. Depending on the architecture, this can represent a significant saving by itself.
- Another saving is fetching and decoding the instruction itself, which has to be done only one time instead of ten.
- The code itself is also smaller, which can lead to more efficient memory use, a reduction in L1 instruction cache size, and a reduction in power consumption.
- With the length (equivalent to SIMD width) not being hard-coded into the instruction, not only is the encoding more compact, it's also "future-proof" and allows even Embedded processor designs to consider using Vectors purely to gain all the other advantages, rather than go for high performance.
Additionally, in more modern Vector Processor ISAs, "Fail on First" or "Fault First" has been introduced (see below), which brings even more advantages.
But more than that, a high performance vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required because a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus complete far faster overall, the limiting factor being the time required to fetch the data from memory.
Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding. (This can be somewhat mitigated by keeping the entire ISA to RISC principles: RVV only adds around 190 Vector instructions even with the advanced features.[10])
Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However as shown above and demonstrated by RISC-V RVV the efficiency of Vector ISAs brings other benefits which are compelling even for Embedded use-cases.
Vector instructions
The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For a greater quantity of numbers, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of Vector Control (Status) register to the programmer, usually known as a Vector Length.
The self-repeating instructions are found in early vector computers like the STAR-100, where the above action would be described in a single instruction (somewhat like vadd c, a, b, $10). They are also found in the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access becomes huge too.
Broadcom, though, included space in all Vector operations of the Videocore IV ISA for a REP field, but unlike the STAR-100, which uses memory for its repeats, the Videocore IV repeats apply to all operations, including arithmetic vector operations. The repeat length can be a small power-of-two range or sourced from one of the scalar registers.[11]
The Cray-1 introduced the idea of using processor registers to hold vector data in batches. The batch lengths (Vector Length, VL) could be dynamically set with a special instruction, the significance compared to Videocore IV (and, crucially as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, a lot more work can be done in each batch plus the instruction encoding is much more elegant and compact, the only drawback being that in order to take full advantage of this extra batch processing capacity, the memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style Vector processors: reality is, it just goes with the territory of achieving high performance throughput, as seen in GPUs, which face exactly the same issue. Put colloquially: you want number-crunching, you need bandwidth.
Modern SIMD computers claim to improve on the early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using the normal scalar pipeline. Modern Vector Processors (such as the SX-Aurora TSUBASA) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the Vector Program at runtime. Masks can be used to selectively load and store data in memory locations, and those same masks can selectively disable processing elements of SIMD ALUs. Some pure SIMD processors (AVX-512, ARM SVE2) are capable of this kind of selective, per-element ("predicated") processing, and it is these which somewhat deserve the nomenclature "Vector Processor", or at least deserve the claim of being capable of "Vector processing". SIMD processors without per-element predication (MMX, SSE, AltiVec) categorically do not.
Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use something called Single Instruction Multiple Threads (SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and the pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units.
In addition, GPUs such as the Broadcom Videocore IV and other external vector processors like the NEC SX-Aurora TSUBASA may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do a pipelined loop over 16 units for a hybrid approach. The Broadcom Videocore IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as a form of "threads".[12]
Vector instruction example
In this example we start with an algorithm ("IAXPY"), first show it in scalar instructions, then SIMD, then Predicated SIMD, and finally Vector instructions. This helps illustrate the difference between a traditional vector processor and a modern SIMD one. We start with a 32-bit integer variant of the "DAXPY" function, in C:
void iaxpy(size_t n, int a, const int x[], int y[]) {
for (size_t i = 0; i < n; i++)
y[i] = a * x[i] + y[i];
}
Scalar Assembler
Our scalar version of this would load one of each of x and y, process one calculation, store one result, and loop:
loop:
load32 r1, x ; load one 32bit data
load32 r2, y
mul32 r1, a, r1 ; r1 := r1 * a
add32 r3, r1, r2 ; r3 := r1 + r2
store32 r3, y
addl x, x, $4 ; x := x + 4
addl y, y, $4
subl n, n, $1 ; n := n - 1
jgz n, loop ; loop back if n > 0
out:
ret
The STAR-like code remains concise, but because the STAR-100's Vectorisation was by design based around memory accesses, an extra slot of memory is now required to hold the intermediate result. Twice the latency is also incurred, due to the extra memory accesses.
; Assume tmp is pre-allocated
vmul tmp, a, x, n ; tmp[i] = a * x[i]
vadd y, y, tmp, n ; y[i] = y[i] + tmp[i]
ret
Pure (non-predicated) SIMD
A modern SIMD machine can do most of the operation in batches. The code is mostly similar to the scalar version. We are assuming that both x and y are properly aligned here and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. We also assume, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, just like ARM NEON can.[13] If it does not, a "splat" (broadcast) must be used, to copy the scalar argument across a SIMD register:
splatx4 v4, a ; v4 = a,a,a,a
The time taken would be basically the same as a vector implementation of c = a + b described above.
vloop:
load32x4 v1, x
load32x4 v2, y
mul32x4 v1, a, v1 ; v1 := v1 * a
add32x4 v3, v1, v2 ; v3 := v1 + v2
store32x4 v3, y
addl x, x, $16 ; x := x + 16
addl y, y, $16
subl n, n, $4 ; n := n - 4
jgz n, vloop ; go back if n > 0
out:
ret
Note that both x and y pointers are incremented by 16, because that is how long (in bytes) four 32-bit integers are. The decision was made that the algorithm shall only cope with 4-wide SIMD, therefore the constant is hard-coded into the program.
Unfortunately for SIMD, the clue was in the assumption above, "that n is a multiple of 4" as well as "aligned access", which, clearly, is a limited specialist use-case.
Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of setup and cleanup for SIMD in order to cope with non-multiples of the SIMD width, can far exceed the instruction count inside the loop itself. Assuming worst-case that the hardware cannot do misaligned SIMD memory accesses, a real-world algorithm first has to have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7).
This more than triples the size of the code; in extreme cases it results in an order-of-magnitude increase in instruction count. This can easily be demonstrated by compiling the iaxpy example for AVX-512, using the options "-O3 -march=knl", with gcc.
Over time, as the ISA evolves to keep increasing performance, ISA architects add 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards. We start to see, therefore, why AVX-512 exists in x86.
Without predication, the wider the SIMD width the worse the problems get, leading to massive opcode proliferation. Vector ISAs require only the one suite of operations, providing variable-length data coverage independent of the back-end hardware.[14]
Vector Processors, on the other hand, are designed to issue computations of variable length for an arbitrary count, n, and thus require very little setup, and no cleanup.
Even compared to those SIMD ISAs which have masks (but no setvl instruction), Vector Processors produce much more compact code because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).
Predicated SIMD
Assuming a hypothetical predicated (mask capable) SIMD ISA, the instruction loop would look like this:
vloop:
# prepare mask. few ISAs have min though
min t0, n, $4 ; t0 = min(n, 4)
shift m, $1, t0 ; m = 1<<t0
sub m, m, $1 ; m = (1<<t0)-1
# now do the operation, masked by m bits
load32x4 v1, x, m
load32x4 v2, y, m
mul32x4 v1, a, v1, m ; v1 := v1 * a
add32x4 v3, v1, v2, m ; v3 := v1 + v2
store32x4 v3, y, m
# update x, y and n for next loop
addl x, t0*4 ; x := x + t0*4
addl y, t0*4
subl n, n, t0 ; n := n - t0
# loop?
jgz n, vloop ; go back if n > 0
out:
ret
Here we can see that the code is much cleaner but a little more complex: at least, however, there is no setup or cleanup: on the last iteration of the loop, the predicate mask will be set to either 0b0000, 0b0001, 0b0011, 0b0111 or 0b1111, resulting in between 0 and 4 SIMD element operations being performed, respectively. One additional potential complication: some RISC ISAs do not have a "min" instruction, needing instead to use a branch or a scalar predicated compare.
It is clear how predicated SIMD at least merits the term "Vector capable", because it can cope with variable-length Vectors by using predicate masks. The final evolving step to a "true" Vector ISA however is to not have any evidence in the ISA at all of a SIMD width, leaving that entirely up to the hardware.
Pure (true) Vector ISA
For Cray-style Vector ISAs such as RVV, an instruction called "setvl" (set Vector Length) is used. The hardware first defines how many data values it can process in one "Vector": this could be either actual registers or it could be an internal loop (the hybrid approach, mentioned above). This maximum amount (the number of hardware "Lanes") is termed "MVL" (Maximum Vector Length). Note that, as we saw in SX-Aurora and Videocore IV, MVL may be an actual hardware lane quantity or a virtual one. (Note: As mentioned in the ARM SVE2 Tutorial, programmers must not make the mistake of assuming a fixed Vector width: consequently MVL is not a quantity that the programmer needs to know. This can be a little disconcerting after years of SIMD mindset).
On calling setvl with the number of outstanding data elements to be processed, "setvl" is permitted (in fact, required) to limit that to the Maximum Vector Length (MVL), and thus returns the actual number that can be processed by the hardware in subsequent Vector instructions, and sets the internal special register, "VL", to that same amount. ARM refers to this technique as "Vector Length Agnostic" programming in its tutorials on SVE2.[15]
Below is the Cray-style Vector Assembler for the same SIMD-style loop above. Note carefully how t0 (which can vary) is used instead of hard-coded constants:
vloop:
setvl t0, n # VL=t0=min(MVL, n)
vld32 v0, x # load vector x
vld32 v1, y # load vector y
vmadd32 v1, v0, a # v1 += v0 * a
vst32 v1, y # store Y
add y, t0*4 # advance y by VL*4
add x, t0*4 # advance x by VL*4
sub n, t0 # n -= VL (t0)
bnez n, vloop # repeat if n != 0
This really is not that different from the SIMD version (which processes 4 data elements per loop), or from the initial Scalar version (which processes just one). We can see that n still contains the number of data elements remaining to be processed, but that t0 contains a copy of VL - the number that is going to be processed in each iteration. t0 is subtracted from n after each iteration, and if n is zero then all elements have been processed.
A number of fascinating things to note, when comparing against the Predicated SIMD assembly variant:
- the setvl instruction has embedded within it a min instruction
- Where the SIMD variant hard-coded the width (4) both into the creation of the mask and into the SIMD width (load32x4 etc.), the Vector ISA equivalents have no such limit. This makes Vector programs portable, vendor-independent, and future-proof.
- setting VL effectively creates a hidden predicate mask that is automatically applied to the Vectors
- Where with Predicated SIMD the mask bit-length is limited to that which may be held in a scalar (or special mask) register, Vector ISAs' mask registers have no such limitation. Cray-1 Vectors could be just over 1,000 elements (in 1977).
Thus we can see, very clearly, how Vector ISAs reduce the number of instructions.
Also note, that just like the Predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32 bit data, but that n is decremented by straight t0. Compared to the fixed-size SIMD assembler there is very little apparent difference: x and y are advanced by hard-coded constant 16, n is decremented by a hard-coded 4, so initially it is hard to appreciate the significance. The difference comes in the realisation that the Vector hardware could be capable of doing 4 simultaneous operations, or 64, or 10,000, it would be the exact same Vector Assembler for all of them and there would still be no SIMD cleanup code. Even compared to the Predicate-capable SIMD, it's still more compact, clearer, more elegant and uses less resources.
Not only do we have a much more compact program (saving on L1 Cache size) but as previously mentioned the Vector version can issue far more data processing to the ALUs, again saving power because Instruction Decode and Issue can sit idle.
One additional fascinating aspect here: the number of elements going into the function can start at zero. This sets the Vector Length to zero, which effectively disables all Vector instructions, turning them into no-ops at runtime. Thus, unlike non-predicated SIMD, even when there are no elements to process there is still no wasted cleanup code.
Vector Processor features
Typical features that a good Vector Processor will have are:[16][17][18]
- Vector Load and Store - these inherently save on Virtual Memory lookups and are designed to put data into registers with the minimum of fuss. Advanced Vector Load/store enhancements include support for structure packing, Fail-First, Gather-scatter, Indexed, Unit and Element strides.
- Masked Operations - as now commonly found in GPUs, Predicate Masks allow parallel if/then/else constructs without branches (which are inherently scalar by nature)
- Compress and Expand - usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in AVX-512
- Register Gather, Scatter (aka permute)[19] - a less restrictive, more generic variation of the Compress/Expand theme which instead takes one Vector to specify the indices to use to "reorder" another Vector. Gather/Scatter is more complex to implement than Compress/Expand, and, being inherently non-sequential, can interfere with vector chaining. Not to be confused with Gather-scatter Memory Load/Store modes, Gather/Scatter Vector operations act on the Vector registers, and are often termed a permute instruction instead.
- Splat and Extract - useful for interaction between Scalar and Vector, these broadcast a single value across a Vector, or extract one item from a Vector, respectively.
- Iota - a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero.
- Reduction and Iteration - operations that perform mapreduce on a Vector (for example, find the one maximum value of an entire Vector, or sum all elements). Iteration is of the form x[i] = y[i] + x[i-1], where Reduction is of the form x = y[0] + y[1] … + y[n-1]
- Matrix Multiply support - either by way of algorithmically loading data from memory, or by reordering (remapping) the normally linear access to Vector elements, or by providing "Accumulators", arbitrary-sized matrices may be efficiently processed. IBM POWER10 provides MMA instructions,[20] although for arbitrary Matrix widths that do not fit the exact SIMD size, data repetition techniques are needed, which is wasteful of register file resources.[21] The Aspex ASP Linedancer had a 2D/3D Memory-reordering DMA Engine which required significant effort to use optimally.[22][23] NVidia provides a high-level Matrix CUDA API, although the internal details are not available.[24] The most resource-efficient technique is in-place reordering of access to otherwise linear Vector data.
- Advanced Math formats - often includes Galois field arithmetic, but can include Binary-coded decimal or Decimal Fixed-Point, and support for much larger (arbitrary precision) arithmetic operations by supporting parallel carry-in and carry-out
- Bit manipulation - including Vectorised versions of bit-level permutation operations, bitfield insert and extract, Centrifuge operations, population count, and many others.
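As an illustration, the semantics of two of the features above, Masked Operations and Compress, can be sketched in plain C, with scalar loops standing in for what vector hardware does in parallel (the function names here are invented for clarity, not taken from any ISA):

```c
#include <stddef.h>

/* Predicate-masked add: where mask[i] is set, dst[i] = a[i] + b[i];
   elsewhere dst[i] is left untouched. Real vector hardware applies the
   mask per lane with no per-element branch. */
void masked_add(int *dst, const int *a, const int *b,
                const unsigned char *mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            dst[i] = a[i] + b[i];
}

/* Compress: copy only the elements whose mask bit is set, packing them
   to the front of dst while preserving sequential order and never
   duplicating values. Returns the packed element count. */
size_t compress(int *dst, const int *src,
                const unsigned char *mask, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}
```

Note how compress preserves order, which is exactly what distinguishes it from the more general (and harder to chain) register gather/scatter.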
GPU Vector Processing Features[edit]
With many 3D Shader applications needing trigonometric operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW) support for the following is typically present in modern GPUs, in addition to those found in Vector Processors:
- Sub-vectors - elements may typically contain two, three or four sub-elements (vec2, vec3, vec4), where any given bit of a predicate mask applies to the whole vec2/3/4, not to the elements within the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL").[25] Sub-vectors are a critical, integral part of the Vulkan SPIR-V spec.
- Sub-vector Swizzle - aka "Lane Shuffling", which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes", and also saves predicate mask bits. Effectively this is an in-flight mini-permute of the sub-vector; it features heavily in 3D Shader binaries and is sufficiently important as to be part of the Vulkan SPIR-V spec. The Broadcom Videocore IV uses the terminology "Lane rotate"[26] where the rest of the industry uses the term "swizzle".[27]
- Transcendentals - trigonometric operations such as sine, cosine and logarithm obviously feature far more prominently in 3D than in many demanding HPC workloads. Of interest, however, is that in 3D speed matters far more to GPUs than accuracy, since the computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU Hardware can reduce power usage. The concept of reducing accuracy where it is simply not needed is also explored in the MIPS-3D extension. Vector normalisation and dot product instructions will also typically be present.
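The sub-vector swizzle described above amounts to a small in-flight permute of a vec4's sub-elements. A minimal C sketch of the semantics follows; the type, function name, and selector convention are illustrative, not any particular GPU's ISA:

```c
/* A vec4 "swizzle": reorder the four sub-elements of a vec4 according
   to a 4-entry selector. sel = {0,1,2,3} is the identity; for example
   sel = {2,1,0,3} swaps the first and third channels (as in a
   BGRA <-> RGBA conversion). GPUs perform this in-flight, without a
   separate permute instruction. */
typedef struct { float v[4]; } vec4;

vec4 swizzle(vec4 src, const int sel[4])
{
    vec4 out;
    for (int i = 0; i < 4; i++)
        out.v[i] = src.v[sel[i]];
    return out;
}
```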
Further information and hints on GPU Vector ISA features can be found by examining the SPIR-V specification,[28] the Vulkan specification[29] and the OpenCL specification.[30] These APIs were created by the Khronos Group with members such as Intel, Google, AMD and NVIDIA, and thus provide insight into the features provided by the world's fastest GPUs. Also of help is the reverse-engineered instruction set of the MALI Midgard GPU.[31]
Fault (or Fail) First[edit]
ARM SVE2 and RISC-V RVV introduce the concept of speculative sequential Vector Loads. ARM SVE2 has a special register named the "First Fault Register",[32] whereas RVV modifies (truncates) the Vector Length (VL).[33]
The basic principle of ffirst is to attempt a large sequential Vector Load, but to allow the hardware to arbitrarily truncate the actual amount loaded, either to the amount that would succeed without raising a memory fault or simply to an amount (greater than zero) that is most convenient. The important factor is that subsequent instructions are notified of, or may determine, exactly how many Loads actually succeeded, using that quantity to carry out work only on the data that has actually been loaded.
Contrast this situation with SIMD, which has a fixed (inflexible) load width and a fixed data-processing width, is unable to cope with loads that cross page boundaries, and, even if it could, would be unable to adapt to what actually succeeded. Paradoxically, if a SIMD program were to even attempt to find out in advance (in each inner loop, every time) what might optimally succeed, those instructions would only serve to hinder performance because they would, by necessity, be part of the critical inner loop.
This begins to hint at why ffirst is so innovative, and is best illustrated by memcpy or strcpy when implemented with standard 128-bit non-predicated non-ffirst SIMD. For IBM POWER9 the number of hand-optimised instructions to implement strncpy is in excess of 240.[34] By contrast, the same strncpy routine in hand-optimised RVV assembler is a mere 22 instructions.[35]
The above SIMD example could potentially fault and fail at the end of memory, due to attempts to read too many values; it could also cause significant numbers of page or misaligned faults by similarly crossing over boundaries. In contrast, by allowing the Vector Architecture the freedom to decide how many elements to load, the first part of a strncpy, if it begins on a sub-optimal memory boundary, may return just enough loads that on subsequent iterations of the loop the batches of Vectorised memory reads are optimally aligned with the underlying caches and Virtual Memory arrangements. Additionally, the hardware may choose to end any given loop iteration's memory reads exactly on a page boundary (avoiding a costly second TLB lookup), with speculative execution preparing the next Virtual Memory page whilst data is still being processed in the current loop. All of this is determined by the hardware, not the program itself.[36]
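The fault-first behaviour described above can be modelled in plain C: a load that may truncate its requested element count at an arbitrary boundary (simulated here as a page edge), with the caller advancing by however many elements actually succeeded. This is purely a semantic sketch with invented names and a tiny simulated page size, not how any real ffirst implementation is written:

```c
#include <stddef.h>

#define PAGE 16   /* simulated page size, in elements */

/* Fault-first-style load: load up to vl elements from src[pos], but
   stop at the next simulated page boundary. Returns the count actually
   loaded -- always greater than zero, as ffirst requires. */
size_t load_ff(int *dst, const int *src, size_t pos, size_t vl)
{
    size_t to_page = PAGE - (pos % PAGE);  /* elements before boundary */
    size_t n = vl < to_page ? vl : to_page;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[pos + i];
    return n;
}

/* Copy n elements using only fault-first loads of at most 8 elements.
   The loop never needs to know where boundaries lie in advance: it
   simply advances by whatever each load reports as having succeeded. */
void copy_ff(int *dst, const int *src, size_t n)
{
    int buf[8];
    size_t pos = 0;
    while (pos < n) {
        size_t want = (n - pos < 8) ? n - pos : 8;
        size_t got = load_ff(buf, src, pos, want);
        for (size_t i = 0; i < got; i++)
            dst[pos + i] = buf[i];
        pos += got;   /* advance only by what actually succeeded */
    }
}
```

Note that the caller contains no alignment or cleanup logic at all; the "hardware" (here, `load_ff`) makes every truncation decision.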
Performance and speed up[edit]
Let r be the vector speed ratio and f be the vectorization ratio. If the time taken for the vector unit to add an array of 64 numbers is 10 times faster than its equivalent scalar counterpart, r = 10. Also, if the total number of operations in a program is 100, out of which only 10 are scalar (after vectorization), then f = 0.9, i.e., 90% of the work is done by the vector unit. The achievable speedup is:

speedup = 1 / ((1 − f) + f/r)

So, even if the performance of the vector unit is very high (r → ∞), the speedup cannot exceed 1/(1 − f), which shows that the ratio f is crucial to performance. This ratio depends on the efficiency of the compilation, such as the adjacency of the elements in memory.
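The speedup formula translates directly into code. A small sketch, normalising scalar execution time to 1 so that vectorised time is (1 − f) + f/r:

```c
/* Speedup of a partly vectorised program: f is the vectorization
   ratio (fraction of work done by the vector unit), r is the vector
   speed ratio (vector unit speed relative to scalar). */
double speedup(double f, double r)
{
    return 1.0 / ((1.0 - f) + f / r);
}
```

With f = 0.9 and r = 10 (the example above), the speedup is 1/0.19, roughly 5.26; even as r grows without bound it can never exceed 1/(1 − f) = 10.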
Programming heterogeneous computing architectures[edit]
Various machines were designed to include both traditional processors and vector processors, such as the Fujitsu AP1000 and AP3000. Programming such heterogeneous machines can be difficult since developing programs that make best use of characteristics of different processors increases the programmer's burden. It increases code complexity and decreases portability of the code by requiring hardware specific code to be interleaved throughout application code.[37] Balancing the application workload across processors can be problematic, especially given that they typically have different performance characteristics. There are different conceptual models to deal with the problem, for example using a coordination language and program building blocks (programming libraries or higher order functions). Each block can have a different native implementation for each processor type. Users simply program using these abstractions and an intelligent compiler chooses the best implementation based on the context.[38]
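The building-block approach described above can be sketched as follows: one abstract operation, several native implementations, and a selector standing in for the intelligent compiler's choice. All names here are illustrative, and the hardware decision is reduced to a single flag:

```c
#include <stddef.h>

/* One abstract building block: y[i] += a * x[i]. */
typedef void (*saxpy_fn)(float *y, const float *x, float a, size_t n);

/* Native implementation for a plain scalar processor. */
static void saxpy_scalar(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* A real system would supply a genuinely vectorised implementation
   here; for this sketch it simply reuses the scalar loop. */
static void saxpy_vector(float *y, const float *x, float a, size_t n)
{
    saxpy_scalar(y, x, a, n);
}

/* The "intelligent compiler" / runtime choice, reduced to a flag:
   application code calls through select_saxpy() and never contains
   hardware-specific code itself. */
saxpy_fn select_saxpy(int has_vector_unit)
{
    return has_vector_unit ? saxpy_vector : saxpy_scalar;
}
```

The point of the abstraction is that the interleaving of hardware-specific code into application code, criticised above, is confined to the implementations behind the function pointer.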
See also[edit]
- SX architecture
- Duncan's taxonomy on pipelined vector processors
- GPGPU
- Compute kernel
- Stream processing
- SIMD
- Automatic vectorization
- Chaining (vector processing)
- Computer for operations with functions
- RISC-V, an open ISA standard with an associated variable width vector extension.
- Barrel processor
- Tensor processing unit
- History of supercomputing
- Supercomputer architecture
References[edit]
- ^ Malinovsky, B.N. (1995). The history of computer technology in their faces (in Russian). Kiew: Firm "KIT". ISBN 5-7707-6131-8.
- ^ [1]
- ^ [2]
- ^ [3]
- ^ Aspex Linedancer HD
- ^ MIAOW Vertical Research Group
- ^ MIAOW GPU
- ^ [4]
- ^ [5]
- ^ [6]
- ^ Videocore IV Programmer's Manual
- ^ Videocore IV QPU analysis by Jeff Bush
- ^ [7]
- ^ SIMD considered harmful
- ^ ARM SVE2 tutorial
- ^ Cray Overview
- ^ RISC-V RVV ISA
- ^ SX-Arora Overview
- ^ RVV register gather-scatter instructions
- ^ [8]
- ^ [9]
- ^ [10]
- ^ [11]
- ^ [12]
- ^ LMUL > 1 in RVV
- ^ Abandoned US patent US20110227920-0096
- ^ Videocore IV QPU
- ^ [13]
- ^ [https://www.khronos.org/registry/vulkan/]
- ^ [14]
- ^ [15]
- ^ Introduction to ARM SVE2
- ^ RVV fault-first loads
- ^ PATCH to libc6 to add optimised POWER9 strncpy
- ^ RVV strncpy example
- ^ ARM SVE2 paper by N. Stevens
- ^ Kunzman, D. M.; Kale, L. V. (2011). "Programming Heterogeneous Systems". 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. p. 2061. doi:10.1109/IPDPS.2011.377. ISBN 978-1-61284-425-1.
- ^ John Darlinton; Moustafa Ghanem; Yike Guo; Hing Wing To (1996), "Guided Resource Organisation in Heterogeneous Parallel Computing", Journal of High Performance Computing, 4 (1): 13–23, CiteSeerX 10.1.1.37.4309
External links[edit]
- The History of the Development of Parallel Computing (from 1955 to 1993)
- Vector Computing, Past Present and Future, by Steve Scott, Cray Inc.
- NEC SX-Aurora ISA guide
- Broadcom VideoCore IV resources
- RISC-V Vectors, CS152, Spring 2020
- Cray-1 Hardware Reference Manual Chapters 1-3
- 1977 Cray-1 Hardware Reference Manual