9  Hardware and Its Implications

Alec Loudenback

a CPU is literally a rock that we tricked into thinking.

not to oversimplify: first you have to flatten the rock and put lightning inside it.

- Twitter user daisyowl, 2017

9.1 Chapter Overview

In this chapter, we’ll explore why a basic understanding of computing hardware is essential for optimizing financial models and working efficiently with data. Understanding how data is stored and processed can help you make better design decisions, improve performance, and avoid common pitfalls. We’ll cover topics like memory architecture, data storage, and the impact of hardware on computational speed.

9.2 Introduction

The quote that opens the chapter is a silly way of saying that most modern computers are made from silicon, an element refined from minerals commonly found in rocks. However, we will not concern ourselves with the raw materials of computers and will instead focus on the key architectural aspects.

A computer handles data at rest (in storage or memory) and data being acted upon (processed). This chapter explains both in a way that reveals key reasons why different approaches to programming can yield different results in terms of processing speed, memory usage, and compiled output.

9.3 Memory and Moving Data Around

The core of modeling on computers is performing computations on data, but unfortunately the speed at which data can be accessed has grown much more slowly than the speed at which computations can be performed. Further, the size of available persistent storage (HDDs/SSDs) has ballooned, exacerbating the problem: memory throughput is the typical constraint on a workflow. To address this bottleneck, hardware designers have built a pipeline that efficiently shuttles data between persistent storage and the processor. This memory and processing architecture applies at the single-computer level and extends to workflows spanning different data stores and computers.

We will focus primarily on the architecture of a single computer, as even a laptop today contains enough power for most modeling tasks if it is used effectively. Further, learning how to optimize a program for a single computer/processor is almost always a precursor to effective parallelization as well.

9.3.1 Memory Types and Location

Memory has an inverse relationship between size and proximity to the central processing unit (CPU): the closer the data is to the processor, the smaller the storage and the less likely the data will persist at that location for very long.

| Kind | Rough Size | Lifecycle |
|------|------------|-----------|
| Solid State Drive (SSD) or Hard Disk Drive (HDD) | TBs | Persistent/Permanent |
| Random Access Memory (RAM) | Dozens to hundreds of GBs | Seconds to hours (while the computer is powered on) |
| CPU Cache - L3 | 8 MB to 128 MB | Microseconds to milliseconds |
| CPU Cache - L2 | 2 MB to 16 MB | Nanoseconds to microseconds |
| CPU Cache - L1 | ~16 KB | Nanoseconds |

After requesting data from a persistent location like a Solid State Drive (SSD), the data is read into Random Access Memory (RAM). The advantage of RAM over persistent storage is speed: memory in RAM can typically be accessed and modified many times faster than data in the persistent location. The tradeoff is that RAM is not persistent: when the computer is powered down, the information stored in RAM is lost.

When data is needed by the CPU, it is read from RAM into a small hierarchy of caches before being accessed by the CPU. The CPU caches are small (physically and in capacity) but very fast, and they are physically colocated with the CPU for efficiency. Data is organized and funneled through the caches as an intermediary between RAM and the CPU, flowing from the Level 3 (L3) cache down through L2 to the L1 cache as it gets closer to the processor.

Note

For reference, memory units are:

  • 1 bit is a single binary digit (0 or 1)
  • 8 bits = 1 byte
  • 8 bytes = 1 word (on 64-bit architectures)
  • 1024 bytes = 1 Kilobyte (KB)
  • 1024 KB = 1 Megabyte (MB)
  • 1024 MB = 1 Gigabyte (GB)
  • 1024 GB = 1 Terabyte (TB)
  • 1024 TB = 1 Petabyte (PB)

Sometimes you might see Kb, which is kilobits, or 1024 bits. Therefore 1 KB is 8 times larger than 1 Kb.

The increments are 1024 and not the usual 1000 because 1024 is \(2^{10}\); an exact power of two is more convenient than 1000 when working with binary quantities.
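In Julia, the sizeof function reports how many bytes a value or type occupies, which is a quick way to connect these units to the data in a model. A minimal sketch:

sizeof(Int64)            # 8 bytes (64 bits, one word on a 64-bit machine)
sizeof(Float64)          # 8 bytes
v = zeros(Float64, 1024)
sizeof(v)                # 8192 bytes of element data, i.e. 8 KB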

9.3.2 Stack vs Heap

Sitting within the RAM region of memory are two sections called Stack and Heap. These are places where variables created from our program’s code can be stored. In both cases, the program will request memory space but they have some differences to be aware of.

The stack stores small, fixed-size (known bit length) data and program components. It is a last-in-first-out structure that can be written to and read from very quickly. The heap is a region that can be dynamically sized and supports random reads and writes (you need not access the data in a particular order). The heap is much slower but more flexible.
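A minimal sketch of the difference in Julia: small, fixed-size values like tuples can stay on the stack (or in registers), while dynamically sized structures like Arrays are allocated on the heap. The @allocated macro reports heap allocations (run each function once first so that compilation itself is not counted):

stack_sum() = sum((1.0, 2.0, 3.0))   # a tuple has a fixed, known size
heap_sum()  = sum([1.0, 2.0, 3.0])   # an Array is dynamically sized

stack_sum(); heap_sum();             # warm up (compile)
@allocated stack_sum()               # 0 bytes: no heap allocation needed
@allocated heap_sum()                # a small heap allocation for the array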

9.3.2.1 Garbage Collector

The garbage collector is a program that runs to free up previously requested/allocated memory. It accomplishes this by keeping track of which data in memory is still reachable from your running code. If data is no longer reachable (e.g. it was created inside a function that has returned and will never be called again, or in a loop that ran earlier in the program but is now complete), then, periodically, the garbage collector will pause execution of the primary program in order to “sweep” the memory. This step marks the space as able to be reused by your program or the operating system.
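As a rough illustration (a sketch, with arbitrary sizes), a function that allocates a new array on every iteration generates garbage that the collector must periodically sweep. @time reports both the allocations and the share of time spent in GC, and GC.gc() can force a collection manually:

function churn(n)
    total = 0.0
    for _ in 1:n
        v = rand(1_000)   # a fresh heap allocation each iteration
        total += sum(v)
    end
    return total
end

churn(10)             # warm up (compile)
@time churn(100_000)  # prints allocation count, memory, and % gc time
GC.gc()               # manually trigger a collection (rarely necessary)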

9.4 Processor

The processor reads lines (groups of bits) from the cache into registers and then executes instructions. An example would be to take the bytes from register 10 and add the bytes from register 11 to them. This is really all a processor does at the lowest level: combining bits of data using logical circuits.

Logic circuits are built from transistors: arrangements of wires whose output electrical signal varies depending on the inputs. From a collection of small building-block gates (e.g. AND, OR, NOR, XOR) more complex operations can be built up1, such as addition, multiplication, and division. Electrical impulses move the state of the program forward once per CPU cycle (controlled by a master “clock” ticking billions of times per second). CPU cycle speed is what’s quoted for chip performance, e.g. when a CPU is advertised as 3.0 GHz (3 billion cycles per second).
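As a toy illustration of how arithmetic emerges from gates, here is a half adder (which adds two single bits) expressed with Julia’s bitwise operators; this is a conceptual sketch, not how you would write arithmetic in practice:

# sum bit = XOR of the inputs; carry bit = AND of the inputs
half_adder(a::Bool, b::Bool) = (sum = xor(a, b), carry = a & b)

half_adder(true, false)   # (sum = true, carry = false)  -> 1 + 0 = 01
half_adder(true, true)    # (sum = false, carry = true)  -> 1 + 1 = 10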

The programmer (or the compiler, if we are working in a higher-level language like Julia) tells the CPU which instructions to run. The set of instructions that are valid for a given processor is called the Instruction Set Architecture, or ISA. In assembly language (roughly one level above directly manipulating the bits), the instructions are given names like ADD, SUB, MUL, DIV, and MOV. These mirror the raw instructions that are part of the ISA.

Not all instructions are created equal, however. Some instructions take many CPU cycles to complete. For example, a floating point DIV (division) takes roughly 10-20 CPU cycles while an ADD takes only a single cycle.

Some architecture examples that may be familiar:

  • Intel x86-64 (a.k.a. AMD64) processors are common computer processors that use registers 64 bits wide (the prior generation was 32 bits wide) and the x86 instruction set originally developed by Intel (the 64-bit extension came from AMD, hence the name AMD64).

  • ARM chips, including the Apple M-series processors, are characterized by the use of the ARM instruction set; recent processors of this kind are also 64 bit.

The ARM architecture is known as a reduced instruction set computer (RISC), which means that it has fewer available instructions compared to, e.g., the x86 architecture. The benefit of the reduced instruction set is that it is generally much more power efficient, but it comes at the cost of sacrificing specialized instructions such as string manipulation or, in lower-end chips, even the division operation (which then have to be implemented via software routines instead of single CPU instructions). However, for specialized workloads, the availability of a key instruction can sometimes make a program run 10-100x faster on the CPU. For example, at the time of writing, processors supporting AVX-512 are becoming available (see Chapter 11 for a discussion of vectorization), and they can benefit some workloads greatly.

Tip

Trying to optimize your program by selecting specialized chips should be one of the last ways you seek to optimize runtime, as a similar order of magnitude speedup can generally be achieved through more efficient algorithm design or general parallelization techniques. Developing programs this way makes the performance portable: able to be used on other systems and not just special architectures.

When writing in Julia, you need not be concerned with the low-level instructions as the compiler will optimize the execution for you. However, should it be useful, it is easy to inspect the compiled code. For example, if we create a function to add three numbers, we can see that the ADD instruction is called twice: first adding the first and second arguments, and then adding the third argument to that intermediate sum.

myadd(x, y, z) = x + y + z
@code_native myadd(1, 2, 3)
   .section    __TEXT,__text,regular,pure_instructions
    .build_version macos, 15, 0
    .globl  _julia_myadd_6893               ; -- Begin function julia_myadd_6893
    .p2align    2
_julia_myadd_6893:                      ; @julia_myadd_6893
; Function Signature: myadd(Int64, Int64, Int64)
; ┌ @ /Users/alecloudenback/prog/julia-fin-book/hardware.qmd:103 within `myadd`
; %bb.0:                                ; %top
; │ @ /Users/alecloudenback/prog/julia-fin-book/hardware.qmd within `myadd`
    ;DEBUG_VALUE: myadd:x <- $x0
    ;DEBUG_VALUE: myadd:x <- $x0
    ;DEBUG_VALUE: myadd:y <- $x1
    ;DEBUG_VALUE: myadd:y <- $x1
    ;DEBUG_VALUE: myadd:z <- $x2
    ;DEBUG_VALUE: myadd:z <- $x2
; │ @ /Users/alecloudenback/prog/julia-fin-book/hardware.qmd:103 within `myadd`
; │┌ @ operators.jl:596 within `+` @ int.jl:87
    add  x8, x1, x0
    add  x0, x8, x2
    ret
; └└
                                        ; -- End function
.subsections_via_symbols

Compilers are complex, hyper-optimized programs which turn your source code into the raw bits executed by the computer. Key steps in the process of converting the Julia code you write all the way to binary machine instructions are listed in Table 9.1. Note that the Julia @code_... macros allow the programmer to inspect the intermediate representations.

Table 9.1: Part of the key to Julia’s speed is its ability to compile down to a different, specialized version of machine code depending on the types given to a function. As described in the table, the instructions for adding floating point numbers together or integers together are different. Julia can reflect that distinction by compiling a different method for each combination of input types.
Step Description Example

Julia Source Code

The level written by the programmer in a high-level language.

myadd(x,y,z) = x + y + z

Lowered Abstract Syntax Tree (AST)

An intermediate representation of the code after the first stage of compilation, where the high-level syntax is simplified into a more structured form that is easier for the compiler to work with.

julia> @code_lowered myadd(1,2,3)
CodeInfo(
1 ─ %1 = x + y + z
└──      return %1
)

LLVM

The LLVM intermediate representation (IR). LLVM (originally the “Low-Level Virtual Machine”) is a widely used compiler infrastructure behind languages like Julia and Rust. The core logic is the three lines with add, add, and ret.

Note that the add instruction is add i64, which means an addition operation on 64-bit integers.

julia> @code_llvm myadd(1,2,3)
;  @ REPL[7]:1 within `myadd`
define i64 @julia_myadd_2022(i64 signext %0, i64 signext %1, i64 signext %2) #0 {
top:
; ┌ @ operators.jl:587 within `+` @ int.jl:87
   %3 = add i64 %1, %0
   %4 = add i64 %3, %2
   ret i64 %4
; └
}

Native

The final machine code output, specific to the target CPU architecture. This is at the same level as assembly language. The core logic is the three lines beginning with add, add, and ret.

If we used floating point addition instead, the CPU instruction would be fadd instead of add.

julia> @code_native myadd(1,2,3)
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0
        .globl  _julia_myadd_1851               ; -- Begin function julia_myadd_1851
        .p2align        2
_julia_myadd_1851:                      ; @julia_myadd_1851
; ┌ @ REPL[7]:1 within `myadd`
; %bb.0:                                ; %top
; │┌ @ operators.jl:587 within `+` @ int.jl:87
        add     x8, x1, x0
        add     x0, x8, x2
        ret
; └└
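To see the type specialization described in the caption, we can inspect the same function called with Float64 arguments. The output is not shown here (it varies by Julia version and platform), but the additions appear as floating point instructions rather than integer ones:

julia> @code_llvm myadd(1.0, 2.0, 3.0)    # additions compile to fadd double instead of add i64
julia> @code_native myadd(1.0, 2.0, 3.0)  # native code likewise uses floating point add instructions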

9.4.0.1 Increasing Complexity in Search of Performance

Transistors are the building blocks of the CPU and enable the physical processes that carry out computations. For a very long time, the major source of improved computer performance was simply making smaller transistors, allowing more of them to be packed together on a chip. This worked for many years, with transistor counts doubling roughly every two years. In this way, software performance improvements came as a side effect of the phenomenal scaling in hardware capability. However, raw single-core performance and clock frequency (CPU cycle speed) flattened out dramatically starting a bit before the year 2010. This happened because transistor density has increasingly been limited by:

  1. Pure physical constraints (transistors can be measured in width of atoms) where we have limited ability to manufacture something so small.
  2. Thermodynamics, where heat can’t be removed from the CPU core fast enough to avoid damaging the core and therefore operations per second are capped.

To obtain increasing performance, two main strategies have been employed in lieu of throwing more transistors into a single core:

  1. Utilize multiple, separate cores and operate in an increasingly parallel way.
  2. Use clever tricks to predict, schedule, and optimize the computations to make better use of the memory pipeline and otherwise idle CPU cycles.

We will cover techniques to utilize concurrent/parallel processing in Chapter 11. As for the second strategy, it is capable of very impressive accelerations (on the order of 2x to 100x faster than a naive implementation). However, it has sometimes caused issues. There have been famous security vulnerabilities, such as Spectre and Meltdown, which exploited speculative execution: a technique used to optimize CPU performance by executing code before being explicitly asked to, because the scheduler anticipates the next steps (with very good, but imperfect, accuracy).

9.5 Logistics Warehouse Analogy

The problem is analogous to a logistics warehouse (persistent data) which needs to package up orders (processor instructions). There’s a conveyor belt of items being constantly routed to the packaging station. In order to keep the packing station working at full capacity, the intermediate systems (RAM and CPU caches) funnel items they think will be needed to the packager (data that’s expected to be used by the processor). Most of the time, the necessary item (data) is optimally brought to the packaging station (processor) or a nearby holding spot (CPU cache).

This system has grown very efficient, but sometimes the predictions miss or a never-before-ordered item needs to be picked from the far side of the warehouse and this causes significant delays to the system. Sometimes a package will start to be assembled before the packager has even gotten to that order (branch prediction) which can make the system faster most of the time, but if the predicted package isn’t actually what the customer ordered, then the work is lost and has to be redone (branch mis-predict).

There are a lot more optimizations along the way:

  • Since the items are already mostly arranged so that related items are next to each other, the conveyor belt will bring nearby items at the same time it brings the requested item (memory blocks).

  • If one item is usually ordered after another, the conveyor system will start to bring that second item as soon as the first one is ordered (prefetching).

  • Different types of packaging stations might be used for specialized items (e.g. vector processing or cryptography instructions in the CPU).

9.6 Speed of Computer Actions

In a financial model, even small delays (such as main memory references vs. L1 cache access) can accumulate quickly in high-frequency trading or risk calculation routines. Understanding these timings can guide decisions on structuring data access patterns and deciding what data to cache or load in memory for optimal performance.

Representative times are given in Table 9.2 for a variety of common actions performed on a computer. It’s clear that the closer the data is to the processor, the better for performance!

Table 9.2: How long different actions take on a computer. As an interesting yardstick, the distance a beam of light travels is also given to provide a sense of scale (this comparison originally comes from Admiral Grace Hopper). Source for the timings comes from: https://cs.brown.edu/courses/csci0300/2022/assign/labs/lab4.html
| Operation | Time (ns) | Distance Light Traveled |
|---|---|---|
| Single CPU cycle (e.g. one ADD or OR operation on a register) | 0.3 | 9 centimeters |
| L1 cache reference | 1 | 30 centimeters |
| Branch mispredict | 5 | 150 centimeters |
| L2 cache reference | 5 | 150 centimeters |
| Main memory reference | 100 | 30 meters |
| Read 1 MB sequentially from RAM | 250,000 | 75 km (~2 marathons) |
| Round trip within a datacenter | 500,000 | 150 km (the thickness of Earth’s atmosphere) |
| Read 1 MB sequentially from SSD | 1,000,000 | 300 km (distance from Washington D.C. to New York City) |
| Hard disk seek | 10,000,000 | 3,000 km (width of the continental United States) |
| Send packet CA->Netherlands->CA | 150,000,000 | 45,000 km (circumference of the Earth) |
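The effect of the memory hierarchy is easy to observe from Julia. The sketch below (assuming the BenchmarkTools package is installed) sums the same vector twice: once walking memory sequentially, which is cache and prefetcher friendly, and once in a shuffled order, which causes many cache misses and is typically several times slower even though the arithmetic is identical:

using Random, BenchmarkTools   # BenchmarkTools assumed to be installed

v = rand(10_000_000)
seq_idx  = collect(eachindex(v))
rand_idx = shuffle(seq_idx)

function sum_by_index(v, idx)
    s = 0.0
    for i in idx
        s += v[i]
    end
    return s
end

@btime sum_by_index($v, $seq_idx)   # sequential access: cache friendly
@btime sum_by_index($v, $rand_idx)  # random access: frequent cache misses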

As introduced in Section 9.4, the cost of calculations on the CPU varies widely. For example, Table 9.3 shows runtimes for common mathematical operations. Despite all being mathematically elementary operations, their compute intensity differs considerably:

Table 9.3: Different operations require different amounts of compute. On different hardware, the results may differ depending on the fundamental computational complexity of an operation and which hardware instructions are built into the CPU.

| Mathematical Operation | Julia Function | Runtime in nanoseconds (Float64 arguments) |
|---|---|---|
| \(a+b\) | + | 1.7 |
| \(a-b\) | - | 1.7 |
| \(a/b\) | / | 1.7 |
| \(a \times b\) | * | 1.7 |
| \(a^b\) | ^ | 10.8 |
| \(\exp(a)\) | exp | 3.9 |
| \(\log_{10}(a)\) | log10 | 5.4 |
| \(\log(a)\) | log | 3.7 |
| \(\mathrm{abs}(a)\) | abs | 1.7 |
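Timings like those in Table 9.3 can be reproduced (approximately; results depend on hardware and Julia version) with the BenchmarkTools package, which is assumed to be installed here:

using BenchmarkTools

a, b = 1.5, 2.5
@btime $a + $b   # cheap: a single floating point add
@btime $a ^ $b   # much more expensive: general exponentiation
@btime exp($a)   # cheaper than ^, as suggested in the tip below
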
Tip

One way to speed up financial models is to use techniques that avoid the ^ operation, as it is one of the most expensive basic operations but unfortunately also one of the most common when dealing with rates and compounding.

Some strategies:

  • Instead of using interest rates which require exponentiation, utilize continuously compounded rates and use the exp and log functions instead.
  • Say you are running a monthly model with inputs that are annual rates. Before running through the ‘hot loop’, pre-compute the annual rates into a vector of monthly rates to avoid re-computing this transformation at the cell/seriatim level (see the sketch following this list).
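A minimal sketch of the second strategy (the variable and function names here are hypothetical): convert annual rates to monthly rates once, up front, so that the inner projection loop only multiplies and adds instead of calling ^ for every cell:

annual_rates  = [0.03, 0.035, 0.04]                  # assumed annual effective rates
monthly_rates = @. (1 + annual_rates)^(1 / 12) - 1   # exponentiation done once, outside the hot loop

# Inside the hot loop, accumulation is now just multiplication:
function project_balance(balance, monthly_rates; months_per_year=12)
    for r in monthly_rates, _ in 1:months_per_year
        balance *= 1 + r
    end
    return balance
end

project_balance(100.0, monthly_rates)
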

  1. In fact, a single type of logic gate suffices to reproduce all Boolean logic: either NAND (Not AND) gates or NOR gates alone can be composed to create AND, OR, NOT, and all other gates.↩︎