ARMv7 Common Memory System Architecture
Introduction
This article provides a concise summary of the ARM version 7 Common Memory System Architecture, which is a very convoluted name for the ARM Caching and Branch Prediction architecture. ARM has three main "Memory System Architectures" which form the components of its memory access scheme:
- CMSA: Common Memory System Architecture: Caches and Branch Prediction. CMSA is common to all profiles, with adjustments as needed for the particular implementation and intended use of the processor.
- VMSA: Virtual Memory System Architecture: TLBs and Virtual Memory, used as a memory protection, isolation and linearization scheme. VMSA is generally implemented by ARM profile-A processor implementations.
- PMSA: Protected Memory System Architecture: a simplified memory protection scheme, also known in general CPU design parlance as an MPU (Memory Protection Unit), often held in contrast to an MMU (Memory Management Unit). In very resource-constrained or performance-critical microcontrollers, an MPU is generally preferred because it avoids the memory and speed overheads of an MMU. PMSA is generally implemented by ARM profile-R processor implementations.
Any particular ARM implementation may have one or more of these "Memory Systems" (CMSA, VMSA and PMSA).
This article is not current with the ARM version 8 CMSA.
Strong warning: The ARM reference manuals are subject to change at the whims and discretion of ARM, and this article does not claim to be a current description of the behaviour of architected ARM features, on any given date. This is a concise summary of critical points for quick recollection, for a person who has already read the manuals. Do not use this as a primary learning source.
Components of the CMSA
Caches
Cache policies
ARMv7 CMSA-compliant processors have buffering and caching attributes which, when combined on a processor that is not using LPAE, give the following cache policies:
- Non-cacheable
- Write-through cacheable
- Write-back Write-allocate cacheable (aka, read-around or read-behind caching)
- Write-back no-write-allocate cacheable (aka write-around or write-behind caching)
An ARMv7 processor with the LPAE extensions enabled allows for more cache policies, as well as a cache-allocation hint.
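As a rough illustration of the four non-LPAE policies listed above: in the short-descriptor translation table format with TEX remap disabled, a TEX[2:0] value with TEX[2] set selects Normal cacheable memory, where TEX[1:0] carries the outer policy and the C and B bits carry the inner policy, each as a 2-bit code. The sketch below decodes such a field. The enum and function names are made up for this article, and the encoding is given from memory of the ARMv7-A manual, so check it against the manual before relying on it.

```c
#include <assert.h>

/* 2-bit inner/outer cache attribute codes (names are illustrative). */
enum cache_policy {
    POLICY_NON_CACHEABLE = 0, /* 0b00 */
    POLICY_WB_WA         = 1, /* 0b01: write-back, write-allocate */
    POLICY_WT_NO_WA      = 2, /* 0b10: write-through, no write-allocate */
    POLICY_WB_NO_WA      = 3  /* 0b11: write-back, no write-allocate */
};

/*
 * Decode TEX[2:0], C and B taken from a short-descriptor entry
 * (assumes TEX remap is off). Returns 0 and fills *inner and *outer
 * when TEX[2] == 1. Returns -1 for the TEX[2] == 0 encodings, which
 * this sketch does not attempt to cover.
 */
static int decode_tex_cb(unsigned tex, unsigned c, unsigned b,
                         enum cache_policy *inner, enum cache_policy *outer)
{
    assert(inner != 0 && outer != 0);
    if ((tex & 0x4u) == 0)
        return -1;
    *outer = (enum cache_policy)(tex & 0x3u);
    *inner = (enum cache_policy)(((c & 1u) << 1) | (b & 1u));
    return 0;
}
```

For example, TEX = 0b101 with C = 1 and B = 1 decodes to an outer policy of write-back write-allocate and an inner policy of write-back no-write-allocate.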
Cache allocation hints
ARM processors' memory attributes include an allocation hint for the processor. This lets software tell the processor whether or not it should allocate cache lines for accesses to that memory area. The ARM specification does not require implementations to respect these hints. The hints are:
- Allocate
- Do not allocate
Cache topology
In ARM processors, the instruction and data/unified caches can be separately enabled and disabled. The data/unified cache(s) are enabled as a group, and the instruction cache(s) are separately enabled as their own group. ARMv7 provides a way for software to reason about caches in a uniform manner across implementations. Prior to ARMv7, ARM only architecturally specified one level of cache; support for, and management of, all other levels of cache was <IMPLEMENTATION DEFINED>. ARMv7 introduces the concepts of the Point of Unification (PoU) and Point of Coherency (PoC), along with the corresponding Level of Unification and Level of Coherency, so that software can interact robustly with diverse caches.
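To make the separate enables concrete, here is a small sketch that computes a new System Control Register (SCTLR) value. It assumes the ARMv7-A bit positions as remembered from the manual (C at bit 2 for the data/unified caches, I at bit 12 for the instruction caches); the function name is hypothetical, and actually reading and writing SCTLR (CP15 c1, c0, 0) is left out so the sketch can run on any host.

```c
#include <assert.h>
#include <stdint.h>

#define SCTLR_C (UINT32_C(1) << 2)   /* data and unified cache enable */
#define SCTLR_I (UINT32_C(1) << 12)  /* instruction cache enable */

/*
 * Return sctlr with the two cache-enable bits set or cleared as requested.
 * All other bits are passed through untouched.
 */
static uint32_t sctlr_set_caches(uint32_t sctlr, int dcache_on, int icache_on)
{
    uint32_t out = sctlr & ~(SCTLR_C | SCTLR_I);

    if (dcache_on)
        out |= SCTLR_C;
    if (icache_on)
        out |= SCTLR_I;

    /* Postcondition: nothing except C and I was modified. */
    assert((out & ~(SCTLR_C | SCTLR_I)) == (sctlr & ~(SCTLR_C | SCTLR_I)));
    return out;
}
```

On real hardware the caller would read SCTLR, pass it through a helper like this, and write the result back followed by the appropriate barriers.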
Point of Unification, PoU (verbatim)
The PoU for a processor is the point by which the instruction and data caches and the translation table walks of that processor are guaranteed to see the same copy of a memory location. In many cases, the point of unification is the point in a uniprocessor memory system by which the instruction and data caches and the translation table walks have merged. The PoU for an Inner Shareable shareability domain is the point by which the instruction and data caches and the translation table walks of all the processors in that Inner Shareable shareability domain are guaranteed to see the same copy of a memory location. Defining this point permits self-modifying software to ensure future instruction fetches are associated with the modified version of the software by using the standard correctness policy of:
- Clean data cache entry by address.
- Invalidate instruction cache entry by address.
The PoU also permits a uniprocessor system that does not implement the Multiprocessing Extensions to use the clean data cache entry operation to ensure that all writes to the translation tables are visible to the translation table walk hardware. In other words, on such a system the translation table walker reads the page tables from the Point of Unification, so page tables need only be cleaned to the Point of Unification, and not to the Point of Coherency.
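The clean-then-invalidate policy above can be sketched as a loop over cache lines. This is only an outline under stated assumptions: the `cache_hooks` structure and all function names are hypothetical, a single line size is assumed for both caches (take the smaller one), the address range must not wrap, and the barriers and branch predictor invalidate follow the commonly recommended sequence rather than anything stated in this article. The mock hooks just count calls so the ordering can be exercised off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical hooks. On real ARMv7 hardware they would wrap:
 *   clean_dline_pou -> DCCMVAU (mcr p15, 0, Rt, c7, c11, 1)
 *   inval_iline_pou -> ICIMVAU (mcr p15, 0, Rt, c7, c5, 1)
 *   inval_bp_all    -> BPIALL  (mcr p15, 0, Rt, c7, c5, 6)
 *   dsb / isb       -> the DSB and ISB barrier instructions
 */
struct cache_hooks {
    void (*clean_dline_pou)(uintptr_t mva);
    void (*inval_iline_pou)(uintptr_t mva);
    void (*inval_bp_all)(void);
    void (*dsb)(void);
    void (*isb)(void);
};

/* Make newly written instructions in [start, start + len) fetchable. */
static void sync_code_range(const struct cache_hooks *h, uintptr_t start,
                            size_t len, size_t line_bytes)
{
    /* line_bytes must be a non-zero power of two */
    assert(line_bytes != 0 && (line_bytes & (line_bytes - 1)) == 0);
    if (len == 0)
        return;

    uintptr_t first = start & ~(uintptr_t)(line_bytes - 1);
    uintptr_t end = start + len;

    for (uintptr_t a = first; a < end; a += line_bytes)
        h->clean_dline_pou(a);      /* 1. clean data cache by address */
    h->dsb();                       /* wait for the cleans to complete */
    for (uintptr_t a = first; a < end; a += line_bytes)
        h->inval_iline_pou(a);      /* 2. invalidate instruction cache */
    h->inval_bp_all();              /* drop stale branch predictions */
    h->dsb();
    h->isb();                       /* refetch from the new instructions */
}

/* Host-side mock hooks that only count calls. */
static unsigned mock_cleans, mock_invals, mock_barriers;
static void mock_clean(uintptr_t mva) { (void)mva; mock_cleans++; }
static void mock_inval(uintptr_t mva) { (void)mva; mock_invals++; }
static void mock_bp(void) { }
static void mock_barrier(void) { mock_barriers++; }
static const struct cache_hooks mock_hooks = {
    mock_clean, mock_inval, mock_bp, mock_barrier, mock_barrier
};
```

With 32-byte lines, a range starting at 0x1010 and 0x50 bytes long touches the three lines at 0x1000, 0x1020 and 0x1040.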
Point of Coherency, PoC (verbatim)
For a particular MVA, the PoC is the point at which all agents that can access memory are guaranteed to see the same copy of a memory location. In many cases, this is effectively the main system memory, although the architecture does not prohibit the implementation of caches beyond the PoC that have no effect on the coherence between memory system agents.
"All agents" here includes not only the processors but also non-processor masters such as DMA-capable devices.
Enumerating processor-controlled caches
Note that this section is titled "Enumerating processor-controlled caches" because, on any ARM platform, some caches may not be under the control of the processor. Where an ARMv7 implementation does place a cache under the control of the processor, ARM specifies how software discovers and interacts with that cache.
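As a sketch of what enumeration looks like, the following decodes a Cache Level ID Register (CLIDR) value into the cache type at each level and the Level of Unification and Level of Coherency fields. The field positions are given from memory of the ARMv7-A manual and should be verified there; the function names are made up for this article, and reading the register itself (CP15, opc1 = 1, c0, c0, 1) is left out so the code runs on any host.

```c
#include <assert.h>
#include <stdint.h>

/* CLIDR Ctype<n> values (enum names are illustrative). */
enum clidr_ctype {
    CTYPE_NONE     = 0, /* no cache at this level */
    CTYPE_I_ONLY   = 1, /* instruction cache only */
    CTYPE_D_ONLY   = 2, /* data cache only */
    CTYPE_SEPARATE = 3, /* separate instruction and data caches */
    CTYPE_UNIFIED  = 4  /* unified cache */
};

/* Cache type at the given level, 1 to 7. Ctype1 sits in bits [2:0]. */
static unsigned clidr_ctype(uint32_t clidr, unsigned level)
{
    assert(level >= 1 && level <= 7);
    return (clidr >> (3u * (level - 1u))) & 0x7u;
}

/* Level of Unification, Inner Shareable: bits [23:21]. */
static unsigned clidr_louis(uint32_t clidr) { return (clidr >> 21) & 0x7u; }

/* Level of Coherency: bits [26:24]. */
static unsigned clidr_loc(uint32_t clidr) { return (clidr >> 24) & 0x7u; }

/* Level of Unification, uniprocessor: bits [29:27]. */
static unsigned clidr_louu(uint32_t clidr) { return (clidr >> 27) & 0x7u; }

/* Count consecutive implemented levels, stopping at the first empty one. */
static unsigned clidr_levels_present(uint32_t clidr)
{
    unsigned level = 1;

    while (level <= 7 && clidr_ctype(clidr, level) != CTYPE_NONE)
        level++;
    return level - 1;
}
```

For the example value 0x09200003 this reports separate instruction and data caches at level 1, nothing at level 2, and a value of 1 for each of LoUIS, LoC and LoUU. Per-level geometry would then come from selecting a level in CSSELR and reading CCSIDR.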
Architected cache behaviour, caveats and guarantees
Properties of a cache system
- Cache line size: The size of cache lines as allocated and evicted according to the cache policy. ARM allows cache-line sizes from 16 bytes to 2KiB (Section B2.2.2).
- Cache granule: The write-back size of the processor when a write-back policy is in use. The processor may write multiple cache lines back at once; the size of the burst transaction that is written at once is the cache granule.
- Number of levels of caches: The number of cache levels which must be kept coherent in the cache system.
- The cache allocation algorithm: what policy dictates whether or not the processor will allocate a line into the cache for an access to a particular memory location. The cache eviction algorithm is not architecturally specified.
- Behaviour at major junctions in software execution such as exception entry.
- The presence or absence of speculative caching and what its behaviour is like.
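Several of the properties above can be read back from ID registers. The sketch below decodes the line size, associativity and number of sets from a Cache Size ID Register (CCSIDR) value, and the Cache Writeback Granule (CWG) field from the Cache Type Register (CTR). Field layouts are given from memory of the ARMv7-A manual and should be double-checked; the struct and function names are hypothetical. Note how the 3-bit LineSize field yields exactly the 16-byte to 2KiB range mentioned in the list.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geometry {
    uint32_t line_bytes; /* bytes per cache line */
    uint32_t ways;       /* associativity */
    uint32_t sets;       /* number of sets */
};

/*
 * CCSIDR: LineSize [2:0] = log2(words per line) - 2,
 *         Associativity [12:3] = ways - 1,
 *         NumSets [27:13] = sets - 1.
 */
static struct cache_geometry ccsidr_decode(uint32_t ccsidr)
{
    struct cache_geometry g;

    g.line_bytes = UINT32_C(1) << ((ccsidr & 0x7u) + 4u);
    g.ways = ((ccsidr >> 3) & 0x3FFu) + 1u;
    g.sets = ((ccsidr >> 13) & 0x7FFFu) + 1u;
    assert(g.line_bytes >= 16u && g.line_bytes <= 2048u);
    return g;
}

/* Total capacity of the cache described by g, in bytes. */
static uint64_t cache_size_bytes(struct cache_geometry g)
{
    return (uint64_t)g.line_bytes * g.ways * g.sets;
}

/*
 * CTR.CWG [27:24] = log2(words) of the writeback granule.
 * Returns the granule in bytes, or 0 if the field gives no information.
 */
static uint32_t ctr_cwg_bytes(uint32_t ctr)
{
    unsigned cwg = (ctr >> 24) & 0xFu;

    return cwg ? (UINT32_C(4) << cwg) : 0u;
}
```

For instance, the value 0x001FE019 describes a 4-way cache with 256 sets of 32-byte lines, 32KiB in total.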
Cache lockdown
Because of all the properties of caches that factor into their behaviour (perhaps especially the last of those listed above), ARM cannot guarantee:
- Whether or not a particular memory location present in the cache will remain in the cache.
- Whether or not a particular memory location not allocated into the cache will be allocated into the cache.
Instead, ARM provides a "Locked Entry" mechanism, which pins a line into the cache. The implementation details behind a particular platform's cache lockdown are <IMPLEMENTATION DEFINED>; the mechanism is not mandatory, so an implementation may not support it. Additionally, there is no architected control over the feature; it is to be described by the implementation documentation.
- A memory location cannot be relied upon to remain in the cache unless locked into the cache.
- If an unlocked entry remains in the cache for some duration, it cannot be relied on to remain incoherent with memory; it will be written back eventually, and software should not rely on it remaining dirty (why would anyone do this?).
- If an entry is locked, it can be relied on to remain in the cache; but it cannot be relied on to remain incoherent with memory, and again, software should not rely on it remaining dirty (again, why?).
In ARMv6 the Cache Type Register held information about the Cache lockdown mechanism; in ARMv7 this is no longer the case.
General
- Even if a memory location is accessible in the current translation scheme (the manual said scheme here, but probably meant "regime") and at the current privilege level, and is also marked cacheable for the current translation regime, there is "no mechanism that can guarantee" that the memory location cannot be allocated into an enabled cache at any point. This basically means that there is no architected secondary means of preventing a memory location from being cached if it is marked cacheable (this should be obvious).
- If the cache is disabled, it is guaranteed that no new memory location will be allocated into the cache. Stale cache entries may still cause cache hits when the cache is disabled.
- If the cache is enabled, it is guaranteed that no memory location that does not have the Cacheable attribute set will be allocated into the cache. If a memory location was already in the cache before being marked non-cacheable, there is no guarantee that it will be evicted.
- If the cache is enabled, it is guaranteed that no memory location that is not accessible to software at the current translation regime and privilege level (or higher) will be allocated into the cache. Again, stale entries already in the cache may cause cache hits.
- For data accesses, if a memory location is marked as Normal Shareable, it is guaranteed to be coherent with all masters in that shareability domain.
- The eviction (author's interpretation of "eviction" here is a clean operation, not an invalidate) of a cache line from cache to memory cannot overwrite a memory location written by another observer in the cache system unless the two observers are in the same shareability domain.
- Verbatim: "The allocation of a memory location into a cache cannot cause the most recent value of that memory location to become invisible to an observer, if it had previously been visible to that observer". This appears to be stating that cache coherency is a given.
Additionally, in the same vein as the fact that stale entries in the cache might generate cache hits even if caches are disabled, the following two caveats are given. In both cases, there is no guarantee of whether the memory access will be returned from the caches or from memory:
- If a location is marked not-Cacheable, but exists in the cache.
- If a location is in the cache and is marked cacheable, but the cache is disabled.
Cache state on #RESET
On #RESET, all caches are disabled. Before the cache system is responsive to the ARMv7 architected cache controls, there may be an <IMPLEMENTATION DEFINED> initialization routine that software needs to execute. Furthermore:
- It is <IMPLEMENTATION DEFINED> whether or not an access can cause a cache hit when caches are disabled. If cache hits to stale/garbage cache lines are possible while caches are disabled, the implementation must clearly document the correct way to initialize the platform caches.
- ARM recommends that invalidation routines conform to the ARM architected form.
If software enables the caches using the ARM architected mechanism before executing the <IMPLEMENTATION DEFINED> platform cache initialization routine, the behaviour of the caches is <UNPREDICTABLE>.
Before ARMv7, caches were specified to be invalidated by the assertion of #RESET.
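For reference, an invalidation routine in the architected form walks every set and way of a cache level and issues an invalidate-by-set/way operation for each. The sketch below builds the operand for such an operation and iterates over a cache. The operand layout (way in the top bits, set shifted by log2 of the line size, level in bits [3:1]) is given from memory of the ARMv7-A manual; all names are hypothetical, and the hook stands in for the real operation (DCISW, mcr p15, 0, Rt, c7, c6, 2). Any additional implementation-specific steps a platform requires are not covered.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest n such that (1 << n) >= x, for 1 <= x <= 65536. */
static unsigned ceil_log2(uint32_t x)
{
    unsigned n = 0;

    assert(x >= 1u && x <= 65536u);
    while ((UINT32_C(1) << n) < x)
        n++;
    return n;
}

/* Build a set/way operand. level is 1-based, as in "L1", "L2". */
static uint32_t setway_operand(unsigned level, uint32_t way, uint32_t set,
                               uint32_t ways, uint32_t line_bytes)
{
    assert(level >= 1 && level <= 7 && way < ways);

    unsigned a = ceil_log2(ways);        /* bits needed for the way index */
    unsigned l = ceil_log2(line_bytes);  /* set index starts at this bit */
    uint32_t op = (set << l) | ((uint32_t)(level - 1u) << 1);

    if (a != 0)                          /* direct-mapped: no way field */
        op |= way << (32u - a);
    return op;
}

/* Call op once per set/way pair; returns the number of operations issued. */
static unsigned long for_each_setway(unsigned level, uint32_t ways,
                                     uint32_t sets, uint32_t line_bytes,
                                     void (*op)(uint32_t operand))
{
    unsigned long n = 0;

    for (uint32_t way = 0; way < ways; way++) {
        for (uint32_t set = 0; set < sets; set++) {
            op(setway_operand(level, way, set, ways, line_bytes));
            n++;
        }
    }
    return n;
}

/* Host-side mock standing in for DCISW; it records the last operand. */
static uint32_t mock_last_operand;
static void mock_dcisw(uint32_t operand) { mock_last_operand = operand; }
```

On hardware, the ways, sets and line size would come from CCSIDR for the selected level, and the loop would be followed by a DSB before the caches are enabled.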
Branch predictors
The architecturally visible ARM branch prediction architecture consists of the Branch Target Buffer (BTB), the Pipeline, and the Instruction Prefetch Buffer (IPB). The following sections explain why software may need to care about these components. ARM permits the branch predictor to be visible to software rather than requiring it to be architecturally hidden, so software must, under certain circumstances, perform maintenance operations to ensure the desired behaviour.
A branch-predictor invalidate operation has no functional effect on software execution. The Invalidate BTB by MVA operation must use the address of the branch-target.
If branch prediction is architecturally visible, an Instruction-cache Invalidate-ALL operation also invalidates all branch predictors.
In general, if, for a given translation regime, VMID and ASID (each where appropriate), the instructions at a virtual address change, then invalidation is necessary to ensure that the change is visible to subsequent execution. The following events may require a BTB invalidate:
- Enabling or disabling the MMU (VMSA).
- Writing new mappings to the translation tables (changing page table entries).
- Changes to TTBR0, TTBR1 and TTBCR, except if accompanied by a change of VMID or ContextID.
- Changes to VTTBR or VTCR, unless accompanied by a change of VMID.
Failure to perform appropriate BTB maintenance operations may result in <UNPREDICTABLE> behaviour as old stale branches may be executed.
Software does not need to invalidate the BTB in the following cases however:
- When changing the page table entry attributes for a page with instructions, where the change does not cause the virtual address to point to new instructions; i.e., only the permissions are changed, for the current translation regime, VMID and ASID.
- After changing the ContextID, ASID or FCSE ProcessID.
- After executing an operation that is stated to also invalidate the BTB.
However, on ARMv6, software must invalidate the BTB after changing the ContextID or FCSE ProcessID.
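The rules above can be condensed into a small decision helper. This is only a restatement of the two lists in this section, not an architected interface: the event names and the function are invented for illustration, and the helper errs on the side of invalidating when it is handed an event it does not recognise.

```c
#include <assert.h>
#include <stdbool.h>

/* Events taken from the lists above (names are illustrative). */
enum btb_event {
    EV_MMU_TOGGLED,                   /* MMU enabled or disabled */
    EV_MAPPING_CHANGED,               /* new mappings written */
    EV_TTBR_OR_TTBCR_CHANGED,         /* TTBR0, TTBR1 or TTBCR changed */
    EV_VTTBR_OR_VTCR_CHANGED,         /* VTTBR or VTCR changed */
    EV_PERMISSIONS_ONLY_CHANGED,      /* same instructions, new permissions */
    EV_CONTEXTID_OR_FCSE_PID_CHANGED, /* ContextID or FCSE ProcessID changed */
    EV_ASID_CHANGED                   /* ASID changed */
};

/*
 * Returns true when the article's rules call for a BTB invalidate.
 * vmid_changed and contextid_changed describe what accompanied the event.
 */
static bool btb_invalidate_needed(enum btb_event ev, bool vmid_changed,
                                  bool contextid_changed, bool is_armv6)
{
    switch (ev) {
    case EV_MMU_TOGGLED:
    case EV_MAPPING_CHANGED:
        return true;
    case EV_TTBR_OR_TTBCR_CHANGED:
        return !(vmid_changed || contextid_changed);
    case EV_VTTBR_OR_VTCR_CHANGED:
        return !vmid_changed;
    case EV_PERMISSIONS_ONLY_CHANGED:
    case EV_ASID_CHANGED:
        return false;
    case EV_CONTEXTID_OR_FCSE_PID_CHANGED:
        return is_armv6;  /* only ARMv6 requires it */
    default:
        assert(0 && "unknown event");
        break;
    }
    return true;  /* conservative fallback */
}
```

A caller that has just executed an operation which is itself stated to invalidate the BTB can simply skip the check.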
The IPB
Translation Regimes
In ARM lingo, the current translation regime of the processor refers to whether the processor is doing single-stage or 2-stage MMU translation. In other words, it refers to whether or not the processor is using a two-dimensional address space lookup from Guest virtual addresses to Host physical addresses. In x86 parlance this is called "Extended Page Tables" (Intel) or "Nested Page Tables" (AMD).
2-stage translation schemes are used by Hypervisors to support multiple Guest Operating Systems.
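A toy model may help picture the two stages. In the sketch below, stage 1 (owned by the Guest) maps a virtual address to an intermediate physical address (IPA), and stage 2 (owned by the Hypervisor) maps that IPA to a physical address. Everything here is a simplification made up for illustration: real translation tables are multi-level structures walked by hardware, whereas these are flat four-entry arrays indexed by page number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u
#define TOY_PAGES  4u
#define NO_MAPPING UINT32_MAX

/* One translation stage: output page number for each input page number. */
struct toy_stage {
    uint32_t page_map[TOY_PAGES];
};

/* Translate through a single stage. Returns 0 on success, -1 on a fault. */
static int toy_stage_lookup(const struct toy_stage *s, uint32_t in,
                            uint32_t *out)
{
    uint32_t page = in >> PAGE_SHIFT;

    if (page >= TOY_PAGES || s->page_map[page] == NO_MAPPING)
        return -1;
    *out = (s->page_map[page] << PAGE_SHIFT)
         | (in & ((UINT32_C(1) << PAGE_SHIFT) - 1u));
    return 0;
}

/*
 * VA -> IPA via stage 1, then IPA -> PA via stage 2.
 * Pass s2 == NULL to model a single-stage regime, where the stage 1
 * output is already the physical address.
 * Returns 0 on success, -1 on a stage 1 fault, -2 on a stage 2 fault.
 */
static int toy_translate(const struct toy_stage *s1,
                         const struct toy_stage *s2,
                         uint32_t va, uint32_t *pa)
{
    uint32_t ipa;

    assert(s1 != NULL && pa != NULL);
    if (toy_stage_lookup(s1, va, &ipa) != 0)
        return -1;
    if (s2 == NULL) {
        *pa = ipa;
        return 0;
    }
    return toy_stage_lookup(s2, ipa, pa) != 0 ? -2 : 0;
}
```

The Guest only ever sees and edits its own stage 1 table; the Hypervisor gives each Guest a different stage 2 table, so that identical IPAs in different Guests land on different physical pages.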