Non-Uniform Memory Access

Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Their commercial development came in work by Burroughs (later Unisys), Convex Computer (later Hewlett-Packard), Silicon Graphics (later Silicon Graphics International), Sequent Computer Systems (later IBM), Data General (later EMC) and Digital (later Compaq, now HP) during the 1990s. Techniques developed by these companies later featured in a variety of Unix-like operating systems and, to some extent, in Windows NT.

Contents

1 Basic concept

2 Cache coherent NUMA (ccNUMA)

3 NUMA vs. cluster computing

4 See also

5 References

6 External links

Basic concept

[Figure: One possible architecture of a NUMA system. The processors are connected to the bus or crossbar by connections of varying thickness/number, showing that different CPUs have different priorities for memory access based on their location.]

Modern CPUs operate considerably faster than the main memory to which they are attached. In the early days of computing and data processing the CPU generally ran slower than its memory. The performance lines crossed in the 1960s with the advent of the first supercomputers and high-speed computing. Since then, CPUs, increasingly "starved for data", have had to stall while they wait for memory accesses to complete. Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing them to work on large data sets at speeds other systems could not approach.

Limiting the number of memory accesses provides the key to extracting high performance from a modern computer. For commodity processors, this means installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid "cache misses". But the dramatic increase in the size of operating systems and of the applications run on them has generally overwhelmed these cache-processing improvements. Multi-processor systems make the problem considerably worse: a system can now starve several processors at the same time, notably because only one processor can access memory at a time.

NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For workloads whose data is spread across largely independent tasks (common for servers and similar applications), NUMA can improve performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).
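The "factor of roughly the number of processors" can be illustrated with a deliberately idealized model. The sketch below assumes that a single shared memory serves exactly one access at a time, that every NUMA access is local, and that each access costs the same fixed time; the function names and the 100 ns figure are illustrative assumptions, not measurements of any real system.

```python
def shared_memory_time_ns(num_procs, accesses_per_proc, access_ns):
    """Total time when one shared memory serves one access at a time."""
    # Assumption: every access from every processor is serialized.
    return num_procs * accesses_per_proc * access_ns


def numa_local_time_ns(num_procs, accesses_per_proc, access_ns):
    """Total time when each processor only touches its own local bank."""
    # Assumption: banks work fully in parallel, so the elapsed time is that
    # of a single processor. num_procs is kept only for a symmetric signature.
    return accesses_per_proc * access_ns


def ideal_speedup(num_procs, accesses_per_proc=1_000_000, access_ns=100):
    """Ratio of the serialized time to the fully local NUMA time."""
    shared = shared_memory_time_ns(num_procs, accesses_per_proc, access_ns)
    local = numa_local_time_ns(num_procs, accesses_per_proc, access_ns)
    return shared / local


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} processors: ideal speedup {ideal_speedup(n):.1f}x")
```

Under these assumptions the speedup equals the processor count exactly; real systems fall short of this bound because caches already hide part of the contention and because some accesses are not local.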

Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between banks. This operation has the effect of slowing down the processors attached to those banks, so the overall speed increase due to NUMA will depend heavily on the exact nature of the tasks run on the system at any given time.
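How strongly the benefit depends on the workload can be seen from a simple weighted average of local and remote access times. This is a minimal sketch: the latencies of 100 ns (local) and 300 ns (remote) are hypothetical, and the model ignores the slowdown that remote traffic imposes on the processors attached to the remote bank.

```python
def effective_access_ns(local_ns, remote_ns, remote_fraction):
    """Average access time when a given fraction of accesses is remote."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies, chosen only to show the trend.
    for fraction in (0.0, 0.1, 0.5, 1.0):
        avg = effective_access_ns(100, 300, fraction)
        print(f"{fraction:.0%} remote accesses: {avg:.0f} ns on average")
```

With these numbers, a task that keeps all of its data local sees the local latency, while a task that finds half of its data in another bank sees twice that latency on average.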

Cache coherent NUMA (ccNUMA)

Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead.

Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model. As a result, all NUMA computers sold to the market use special-purpose hardware to maintain cache coherence, and are thus classed as "cache-coherent NUMA", or ccNUMA.

Typically, this takes place by using inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when multiple processors attempt to access the same memory area in rapid succession. Operating-system support for NUMA attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary. Alternatively, cache coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency. Scalable Coherent Interface (SCI) is an IEEE standard defining a directory-based cache coherency protocol to avoid scalability limitations found in earlier multiprocessor systems. SCI is used as the basis for the Numascale NumaConnect technology.
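The cost of several processors accessing the same memory area in rapid succession can be sketched with a toy write-invalidate model. It is not MESIF, SCI, or any real protocol: it merely assumes that a write must send one invalidation message to every other cache holding the line, and it counts those messages. The class and function names are hypothetical.

```python
class ToyCoherentCaches:
    """Toy write-invalidate model that only counts coherence messages."""

    def __init__(self):
        self.holders = {}   # cache line -> set of CPU ids holding a copy
        self.messages = 0   # invalidation messages sent so far

    def write(self, cpu, line):
        others = self.holders.get(line, set()) - {cpu}
        # Assumption: one invalidation message per other cache with a copy.
        self.messages += len(others)
        # After the write, the writer holds the only valid copy.
        self.holders[line] = {cpu}


def count_messages(accesses):
    """Replay (cpu, line) writes and return the number of invalidations."""
    caches = ToyCoherentCaches()
    for cpu, line in accesses:
        caches.write(cpu, line)
    return caches.messages


# Two CPUs alternately writing the same line: the line "ping-pongs".
shared_line = [(i % 2, 0) for i in range(200)]
# Two CPUs each writing only their own line: no coherence traffic.
private_lines = [(i % 2, i % 2) for i in range(200)]

if __name__ == "__main__":
    print("same line:     ", count_messages(shared_line), "invalidations")
    print("separate lines:", count_messages(private_lines), "invalidations")
```

In this model every write after the first one to the contended line triggers an invalidation, whereas writes to per-processor lines trigger none, which is the access pattern that NUMA-aware allocation and scheduling try to encourage.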

Current ccNUMA systems are multiprocessor systems based on the AMD Opteron, which can be implemented without external logic, and the Intel Itanium, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in recent NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor.

In late 2007, Intel announced the introduction of NUMA to its x86 and Itanium servers with the Nehalem and Tukwila CPUs.[1] Both CPU families share a common chipset; the interconnect is called Intel QuickPath Interconnect (QPI).[2]

NUMA vs. cluster computing

One can view NUMA as a very tightly coupled form of cluster computing. The addition of virtual memory paging to a cluster architecture can allow the implementation of NUMA entirely in software where no NUMA hardware exists. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater than that of hardware-based NUMA.
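A software implementation of this kind can be pictured as page-based distributed shared memory: an access to a page that is not resident on the local node raises a page fault, and the fault handler fetches the page from the node that currently holds it. The toy model below only counts hits and simulated faults; the class name, the single-owner migration policy, and the 4096-byte page size are assumptions made for illustration, not a description of any particular system.

```python
class ToyPagedCluster:
    """Toy page-based shared memory: each page lives on exactly one node."""

    def __init__(self, num_nodes, page_size=4096):
        self.num_nodes = num_nodes
        self.page_size = page_size
        self.owner = {}   # page number -> node currently holding the page
        self.hits = 0     # accesses served from the local node
        self.faults = 0   # simulated page faults (first touch or migration)

    def access(self, node, address):
        if not 0 <= node < self.num_nodes:
            raise ValueError("unknown node")
        page = address // self.page_size
        if self.owner.get(page) == node:
            self.hits += 1
        else:
            # Simulated page fault: the page is allocated here on first
            # touch, or migrated over the network from its current owner.
            self.faults += 1
            self.owner[page] = node


if __name__ == "__main__":
    cluster = ToyPagedCluster(num_nodes=2)
    for node, address in [(0, 0), (0, 100), (1, 200), (0, 5000), (1, 4096)]:
        cluster.access(node, address)
    print("local hits:", cluster.hits, "page faults:", cluster.faults)
```

Each simulated fault stands for a network round trip plus a page transfer, which is where the much higher inter-node latency of software-based NUMA comes from.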

See also

Uniform Memory Access (UMA)

Cluster computing

Symmetric multiprocessing (SMP)

Cache only memory architecture (COMA)

Scratchpad RAM (SPM)

Supercomputer

Silicon Graphics, SGI

HiperDispatch

Intel QuickPath Interconnect (QPI)

References

This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.

[1] Intel Corp. (2008). Intel QuickPath Architecture [White paper]. Retrieved from http://www.intel.com/pressroom/archive/reference/whitepaper_QuickPath.pdf

[2] Intel Corporation. (September 18, 2007). Gelsinger Speaks To Intel And High-Tech Industry's Rapid Technology Cadence [Press release]. Retrieved from http://www.intel.com/pressroom/archive/releases/2007/20070918corp_b.htm

External links

NUMA FAQ

Page-based distributed shared memory

OpenSolaris NUMA Project

Introduction video for the Alpha EV7 system architecture

More videos related to EV7 systems: CPU, IO, etc

NUMA optimization in Windows Applications

NUMA Support in Linux at SGI

Intel Tukwila

Intel QPI (CSI) explained

current Itanium NUMA systems


