Cache lines are precious for a programmer

Dwspandey
3 min read · Apr 10, 2021

When it comes to high-performance computation, the hierarchy of caches, together with the prefetcher, plays a very important role. Processors keep getting faster while memories remain comparatively slow, so execution time is largely determined by how fast data can be fetched from memory; in other words, performance is bound by data access. In the best case, the whole working set of an application fits in the L1 cache, avoiding cache misses altogether. Cache misses are very costly: the CPU needs only a couple of cycles to access data in the L1 cache, the latency grows to roughly 10 cycles when data is fetched from the L2 cache, and if the data is not present even in the L2 cache, it is fetched from main memory. This is when all hell breaks loose: the latency climbs to more than 100 cycles!

The following picture illustrates the round-trip time of a data access between the CPU and the different levels of the memory hierarchy, assuming only two levels of cache (L1 and L2).

Data access speed between CPU and different memories in the hierarchy

The CPU and memory architecture remains more or less the same across von Neumann-based computers. It typically contains an L1 cache, which is further divided into a data cache and an instruction cache, and most architectures have two or three levels of cache.

CPU and memory hierarchy

Caching and prefetching work in tandem, where caching relies on the following two key aspects:

  1. Temporal locality: data that was accessed recently is likely to be accessed again soon. This largely determines what to keep in the cache and what to evict from it, and much depends on the access pattern of the data and the applied cache replacement policy.
  2. Spatial locality: the application will probably need the neighboring bytes as well.

Prefetching: the job of the prefetcher is to load data into the cache preemptively, before it is explicitly requested.

Data is fetched from memory in units of cache lines, which typically have a size of 64 bytes. Even if only a single byte is required, the whole surrounding cache line is fetched, so a block of closely located data is brought into the cache together.

Cacheline

How can cache lines be leveraged by a programmer?

As we found out, cache lines give us a lot of neighboring data for free, and this can be leveraged while writing a program. Most importantly, the choice of data structure directly impacts how well cache lines are used. An array has very good spatial locality because its elements are stored in contiguous memory: accessing a single element automatically brings additional elements into the cache, which avoids cache misses when the neighboring elements are fetched next. For small data sizes, a linear array traversal can even beat a heap-allocated map with logarithmic lookup complexity.

Therefore, it is usually better to use arrays or dynamic arrays (vectors in C++) for high performance, and pretty much everything can be implemented using them. Additionally, keeping objects small helps: more of them fit into a single cache line, and even when they don't, fewer cycles are required to fetch them from main memory.

The next important thing to look at is the data access pattern. Regular patterns are good for fetching data into the cache quickly: the prefetcher detects such patterns and fetches data preemptively. Thanks to this regularity, a row-major traversal of a 2D array is faster than a column-major one. https://en.wikipedia.org/wiki/Row-_and_column-major_order

I hope this article gives you a gist of how cache lines operate and how they can be better utilized. There are many things not considered here, such as the behavior of caches in a multi-core environment, the impact of branch prediction, etc.


Working in high-performance computing, combinatorial optimization, and computational geometry.