What every programmer should know about memory, Part 1, by Ulrich Drepper
Editor’s introduction: Ulrich Drepper recently approached us about publishing this document.
|Published (Last):||28 August 2007|
The original document prints out at over 100 pages. We will be splitting it into about seven segments, each run some weeks after its predecessor. Once the entire series is out, Ulrich will be releasing the full text. Reformatting the text from the original LaTeX has been a bit of a challenge, but the results, hopefully, will be good. Hyperlinked cross-references and [bibliography references] will not be possible until the full series is published.
Many thanks to Ulrich for allowing LWN to publish this material; we hope that it will lead to more memory-efficient software across our systems in the near future.
In the early days computers were much simpler. The various components of a system, such as the CPU, memory, mass storage, and network interfaces, were developed together and, as a result, were quite balanced in their performance. For example, the memory and network interfaces were not much faster than the CPU at providing data.
This situation changed once the basic structure of computers stabilized and hardware developers concentrated on optimizing individual subsystems. Suddenly the performance of some components of the computer fell significantly behind and bottlenecks developed.
This was especially true for mass storage and memory subsystems which, for cost reasons, improved more slowly relative to other components. The slowness of mass storage has mostly been dealt with using software techniques: operating systems keep the most often used (and most likely to be used) data in main memory, which can be accessed orders of magnitude faster than the hard disk. In addition, cache storage was added to the storage devices themselves, which requires no changes in the operating system to increase performance.
Unlike storage subsystems, removing the main memory as a bottleneck has proven much more difficult and almost all solutions require changes to the hardware. Today these changes mainly come in the following forms: RAM hardware design (speed and parallelism), memory controller designs, CPU caches, and direct memory access (DMA) for devices.
For the most part, this document will deal with CPU caches and some effects of memory controller design. In the process of exploring these topics, we will explore DMA and bring it into the larger picture. However, we will start with an overview of the design for today’s commodity hardware.
This is a prerequisite to understanding the problems and the limitations of efficiently using memory subsystems. We will also learn about, in some detail, the different types of RAM and illustrate why these differences still exist.
This document is in no way all inclusive and final. It is limited to commodity hardware and further limited to a subset of that hardware. Also, many topics will be discussed in just enough detail for the goals of this paper. For such topics, readers are recommended to find more detailed documentation. When it comes to operating-system-specific details and solutions, the text exclusively describes Linux.
At no time will it contain any information about other OSes. The author has no interest in discussing the implications for other OSes. One last comment before the start.
The text contains a number of occurrences of the term “usually” and other, similar qualifiers. The technology discussed here exists in many, many variations in the real world and this paper only addresses the most common, mainstream versions.
It is rare that absolute statements can be made about this technology, thus the qualifiers.
This document is mostly for software developers. It does not go into enough technical details of the hardware to be useful for hardware-oriented readers. But before we can go into the practical information for developers, a lot of groundwork must be laid.
To that end, the second section describes random-access memory (RAM) in technical detail. This section’s content is nice to know but not absolutely critical for understanding the later sections. Appropriate back references to the section are added in places where the content is required, so that the anxious reader can skip most of this section at first.
The third section goes into a lot of detail about CPU cache behavior. Graphs have been used to keep the text from being as dry as it would otherwise be. This content is essential for an understanding of the rest of the document.
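Section 4 describes briefly how virtual memory is implemented. This is also required groundwork for the rest. Section 5 goes into a lot of detail about Non-Uniform Memory Access (NUMA) systems. Section 6 is the central section of this paper. It brings together all the previous sections’ information and gives programmers advice on how to write code which performs well in the various situations.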
The very impatient reader could start with this section and, if necessary, go back to the earlier sections to freshen up the knowledge of the underlying technology. Section 7 introduces tools which can help the programmer do a better job. Even with a complete understanding of the technology it is far from obvious where in a non-trivial software project the problems are.
Some tools are necessary.
In section 8 we finally give an outlook of technology which can be expected in the near future or which might just simply be good to have.
The author intends to update this document for some time. This includes updates made necessary by advances in technology but also to correct mistakes. Readers willing to report problems are encouraged to send email.
Markus Armbruster provided a lot of valuable input on problems and omissions in the text.
The title pays homage to David Goldberg’s classic paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic”. Goldberg’s paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.
Scaling these days is most often achieved horizontally instead of vertically, meaning today it is more cost-effective to use many smaller, connected commodity computers instead of a few really large and exceptionally fast and expensive systems. This is the case because fast and inexpensive network hardware is widely available. There are still situations where the large specialized systems have their place and these systems still provide a business opportunity, but the overall market is dwarfed by the commodity hardware market.
Bigger machines will be supported, but the quad socket, quad CPU core case is currently thought to be the sweet spot and most optimizations are targeted for such machines. Large differences exist in the structure of commodity computers. Note that these technical details tend to change rapidly, so the reader is advised to take the date of this writing into account.
Over the years, personal computers and smaller servers standardized on a chipset with two parts: the Northbridge and the Southbridge. All CPUs are connected via a common bus (the Front Side Bus, FSB) to the Northbridge. The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer. To reach all other system devices, the Northbridge must communicate with the Southbridge.
Older systems had AGP slots which were attached to the Northbridge. This was done for performance reasons related to insufficiently fast connections between the Northbridge and Southbridge. Such a system structure has a number of noteworthy consequences. All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge. All communication with RAM must pass through the Northbridge. The RAM has only a single port; multi-port RAM is not found in commodity hardware, although it can be found in specialized hardware such as network routers, which depend on utmost speed. Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge.
A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance.
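To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU. Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge, as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account. A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels, as they are called for DDR2), which doubles the available bandwidth. The Northbridge interleaves memory access across the channels.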
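The interleaving of memory access across channels can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `channel_for_address`, the 64-byte block size, and the simple modulo mapping are all hypothetical, and the mapping a real memory controller uses is hardware specific.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical size of the block that is kept together on one channel.
   Chosen only for illustration; real controllers differ. */
#define BLOCK_SIZE 64u

/* Assumed mapping: consecutive BLOCK_SIZE-byte blocks of the physical
   address space are assigned to the channels in round-robin order. */
unsigned channel_for_address(uint64_t addr, unsigned num_channels)
{
    assert(num_channels > 0);
    return (unsigned)((addr / BLOCK_SIZE) % num_channels);
}
```

With two channels under this assumed mapping, addresses 0 through 63 land on channel 0 and addresses 64 through 127 on channel 1, so a sequential read of a large buffer alternates between the two buses and can keep both busy.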
With limited bandwidth available, it is important to schedule memory access in ways that minimize delays. As we will see, processors are much faster and must wait to access memory, despite the use of CPU caches. If multiple hyper-threads, cores, or processors access memory at the same time, the wait times for memory access are even longer. This is also true for DMA operations. There is more to accessing memory than concurrency, however. Access patterns themselves also greatly influence the performance of the memory subsystem, especially with multiple memory channels.
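To make the point about access patterns concrete, the following sketch shows two loops that perform exactly the same additions on the same array and differ only in the order in which they touch memory. The function names and the `stride` parameter are illustrative assumptions, not taken from the text, and no timings are claimed here; how large the difference is on a given machine depends on the caches and the memory subsystem.

```c
#include <assert.h>
#include <stddef.h>

/* Visit every element exactly once, in increasing address order. */
long sum_sequential(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Visit every element exactly once, but jump `stride` elements between
   consecutive accesses, so neighbouring accesses are far apart in memory. */
long sum_strided(const int *data, size_t n, size_t stride)
{
    assert(stride > 0);
    long total = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            total += data[i];
    return total;
}
```

Both functions return the same result for the same input. Timing them on an array much larger than the CPU caches, with a stride that skips well past the data fetched by one memory transfer, is one way to observe the effect of the access pattern in isolation.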
More details of RAM access patterns follow later in Section 2. On some more expensive systems, the Northbridge does not actually contain the memory controller. Instead the Northbridge can be connected to a number of external memory controllers (in the following example, four of them).
[Figure: Northbridge with External Controllers]
The advantage of this architecture is that more than one memory bus exists and therefore total bandwidth increases.
This design also supports more memory. Concurrent memory access patterns reduce delays by simultaneously accessing different memory banks.
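This is especially true when multiple processors are directly connected to the Northbridge, as in the design with external memory controllers. For such a design, the primary limitation is the internal bandwidth of the Northbridge, which is phenomenal for this architecture from Intel.
Using multiple external memory controllers is not the only way to increase memory bandwidth. Another increasingly popular approach is to integrate memory controllers into the CPUs and attach memory to each CPU.
[Figure: Integrated Memory Controller]
With an architecture like this there are as many memory banks available as there are processors. On a quad-CPU machine the memory bandwidth is quadrupled without the need for a complicated Northbridge with enormous bandwidth.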
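The quadrupling is simple arithmetic, sketched below. The function name `aggregate_bandwidth` and its parameters are hypothetical, no real hardware figures are used, and the result is only an upper bound that assumes every processor works exclusively on the memory attached to it.

```c
#include <assert.h>

/* Upper bound on total memory bandwidth when each processor has its own
   memory controller and all controllers are busy at the same time.
   Both arguments are illustrative inputs in arbitrary but equal units. */
double aggregate_bandwidth(unsigned num_cpus, double per_controller_bw)
{
    assert(per_controller_bw >= 0.0);
    return (double)num_cpus * per_controller_bw;
}
```

For four processors the bound is four times the bandwidth of a single controller, whatever that single figure happens to be.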
Having a memory controller integrated into the CPU has some additional advantages; we will not dig deeper into this technology here. There are disadvantages to this architecture, too.
First of all, because the machine still has to make all the memory of the system accessible to all processors, the memory is not uniform anymore (hence the name NUMA, Non-Uniform Memory Architecture, for such an architecture).