Shared Memory vs. Distributed Memory

Shared Memory:

Advantages

  • Simpler to program than message passing
  • Implicit communication

Disadvantages

  • Controlling locality is difficult but important for performance
  • Race conditions (see the sketch below)
  • False sharing
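
To make the race-condition hazard concrete, here is a minimal sketch (assumed C++ with std::thread; the section itself names no language) in which two threads increment a shared counter with no synchronization:

#include <iostream>
#include <thread>

int counter = 0;                    // shared, unprotected

void work() {
    for (int i = 0; i < 100000; ++i)
        ++counter;                  // read-modify-write: a data race
}

int main() {
    std::thread a(work), b(work);
    a.join(); b.join();
    // Typically prints less than 200000: the two threads' increments
    // interleave and overwrite each other. std::atomic<int> would fix it.
    std::cout << counter << '\n';
}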

Distributed Memory (message passing):

Advantages

  • Natural, automatic locality control
  • Runs everywhere (on both shared-memory and distributed-memory hardware)
  • Scalability and performance
  • No data races

Disadvantages

  • Low-level, complex programming (a minimal message-passing sketch follows below)
  • Large memory consumption due to replication of data
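
To illustrate the explicit, lower-level style, here is a minimal message-passing sketch (MPI is assumed here as the canonical example; the section does not prescribe it) in which rank 0 sends one integer to rank 1:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int value = 42;
    if (rank == 0) {
        // Communication is explicit: the sender names the receiver.
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}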

UMA and NUMA

UMA Architecture

Access to memory is uniform:

  • There is one global physical memory
  • Each processor has essentially the same performance characteristics (same latency and bandwidth) when accessing any memory location in the system
  • UMA systems in the past (and NUMA systems now) dominate the server market and are becoming more common at the desktop
  • The shared address space makes UMA attractive for parallel programming (easier than parallel programming for distributed-memory systems)
  • They are important building blocks for larger-scale systems (e.g., InfiniBand clusters)

In effect, a single hub manages the memory: every processor sees the same performance (the same latency and bandwidth) because the path through the hub is the same for all of them.

Bus-Based UMA Design

A bus-based UMA design connects the caches and main memory with a single shared bus.

Remember: we need caches for good performance

  • Caches allow us to exploit temporal and spatial locality
  • Instead of fetching a datum from memory, get it from the fast cache (main memory is much slower than cache)
  • In a UMA system we'll want a separate cache for each processor, since the cache needs to be close to the processor and access has to be fast
  • In multicore systems some of the caches might be shared

Caches are essential:

  • Temporal locality: the program accesses the same memory location repeatedly at different times
  • Spatial locality: one access is likely to be followed by accesses to nearby memory locations (see the traversal sketch below)
  • In a UMA system we want each processor to have its own cache; the closer the cache is to the processor, the faster the access
  • In multicore systems some caches may be shared (e.g., private L1/L2 per core, a shared L3)
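
A small sketch (not from the lecture) of the two kinds of locality: summing a matrix in row order touches consecutive addresses, so every cache line fetched is fully used, while column order uses only one element per fetched line:

#include <vector>

constexpr int N = 1024;

long sum_row_major(const std::vector<int>& m) {
    long s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i * N + j];      // consecutive addresses: good spatial locality
    return s;
}

long sum_col_major(const std::vector<int>& m) {
    long s = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i * N + j];      // stride-N accesses: one element per fetched line
    return s;
}

On typical hardware the row-order version tends to run several times faster, purely because of cache behavior.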

Problem: caches lead to replication of data items in several locations

  • These multiple copies of data items in caches must be kept coherent, somehow

In other words, with multiple cores the same datum may sit in several caches at once, and all of those copies must somehow be kept coherent.

Cache Coherence

Caches improve speed, but if we modify data held in a private cache, copies of that data may also exist elsewhere, so we need a write policy:

  1. Write-through cache: each update to a cache also updates the original memory location. Data changed in the cache is written back to memory immediately, which is slow.
  2. Write-back cache: updated data is not written back right away; it is written back only when the cache line is evicted (when the slot is reused for some other value).

With a write-through cache: when P3 writes u, P3's cache and main memory are both updated, but P1's cached copy is not. P2, which had not cached u before, requests the new value of u from memory.

With a write-back cache: only P3's cache is updated. Because u has not been evicted (replaced by some other variable such as x or y), memory still holds the old value 5, so P1 reads 5 from its own cache and P2 reads 5 from memory; both see the stale value.
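
The u = 5 scenario can be sketched in a few lines (hypothetical code; the three processors' caches are reduced to plain structs):

#include <cstdio>

struct Memory { int u = 5; };
struct Cache  { bool has_u = false; int u = 0; };

int main() {
    Memory mem;
    Cache p1, p3;                     // P2 has no cached copy of u yet

    p1 = {true, mem.u};               // P1 reads u, caches 5
    p3 = {true, mem.u};               // P3 reads u, caches 5

    bool write_through = true;        // flip to false for write-back
    p3.u = 7;                         // P3 writes u = 7 into its cache
    if (write_through) mem.u = 7;     // write-through updates memory now;
                                      // write-back waits until eviction

    std::printf("P1 reads u = %d (stale cached copy)\n", p1.u); // 5 either way
    std::printf("P2 reads u = %d (from memory)\n", mem.u);      // 7 or 5
}

Either way P1 keeps a stale 5, which is exactly the coherence problem the rest of this section addresses.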

A way to resolve these conflicts: snooping

Key properties:

  • Cache controllers can snoop on the bus (observe transactions)
  • All bus transactions are visible to all cache controllers (the bus is a broadcast medium)
  • All controllers see the transactions in the same order
  • A controller can take action if a bus transaction is relevant, i.e., if it involves a memory block in its cache
  • Additional state information is associated with each cache line (e.g., valid, dirty, ...)
  • Coherence is maintained at the granularity of a cache block (a whole line at a time)
  • State information for uncached blocks is implicitly defined (e.g., invalid or not present)
  • Controllers can take appropriate action for relevant transactions: invalidate, update, or supply the value

Snooping protocols come in two flavors, invalidation protocols and update protocols:

Invalidation Protocols

Basic idea: there can be multiple readers but only one writer at a time.

Initially, a line may be shared among several caches for reading purposes. When one of the caches wants to perform a write to the line, it first issues a notice that invalidates that line in the other caches, making the line exclusive to the writing cache.
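
A minimal sketch of an invalidation protocol's per-line state machine (an MSI-style simplification; the names and transitions here are illustrative, not the lecture's):

enum class State { Invalid, Shared, Modified };

// Local accesses by this processor:
State on_local_read(State s)  { return s == State::Invalid ? State::Shared : s; }
State on_local_write(State)   { return State::Modified; } // after broadcasting
                                                          // an invalidate on the bus

// Transactions snooped from other caches on the shared bus:
State on_bus_read(State s) {
    // Another cache reads the line; if we hold it Modified we supply the
    // value (writing it back) and drop to a read-only Shared copy.
    return s == State::Modified ? State::Shared : s;
}
State on_bus_invalidate(State) { return State::Invalid; } // another cache is writing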

Update Protocols

There can be multiple writers as well as multiple readers.

When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it.
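
By contrast, an update protocol patches the other copies instead of destroying them. A hypothetical sketch:

#include <vector>

struct CacheLine { bool valid = false; int word = 0; };

// The writer broadcasts the new word on the bus; every cache holding a
// valid copy of the line (including the writer's own) updates it in place.
void write_update(std::vector<CacheLine>& caches, int new_word) {
    for (CacheLine& c : caches)
        if (c.valid)
            c.word = new_word;
}

Updates avoid the re-fetch that follows an invalidation, but every write consumes bus bandwidth even for copies nobody will read again, which is one reason invalidation protocols are more common in practice.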

Problems

True sharing

  • A variable genuinely shared by two cores is true sharing
  • Frequent reads and writes of these shared variables hurt performance
  • One remedy is to give each thread its own local variable (see the sketch below)
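
A sketch of the per-thread local variable remedy (assumed C++ with std::thread; the names are illustrative): each thread accumulates into its own local, touching shared memory only once at the end:

#include <numeric>
#include <thread>
#include <vector>

long parallel_sum(const std::vector<int>& data, int nthreads) {
    std::vector<long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = (t == nthreads - 1) ? data.size() : lo + chunk;
            long local = 0;                  // the hot updates stay thread-local
            for (std::size_t i = lo; i < hi; ++i)
                local += data[i];
            partial[t] = local;              // shared memory is written once
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}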

False sharing

  • Caches operate on whole cache lines. Suppose one line holds a, b, and c, and that line is cached by both core 1 and core 2. If core 1 modifies a, core 2's copies of b and c are invalidated along with it; if core 2 then wants to read b, it must go back to a higher-level cache or to main memory, which wastes a lot of time.
  • When multiple threads modify mutually independent variables, we tend to assume the variables occupy separate lines that the processors can work on in parallel. In the cache they may actually occupy a single line, so the processors keep issuing invalidate and revalidate traffic and are effectively serialized on that one line.
  • The fix is to add padding so that each variable gets a cache line of its own (see the sketch below).
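
A sketch of the padding fix (assuming 64-byte cache lines, the common size on x86): aligning each counter to its own line keeps one core's writes from invalidating the other's data:

#include <atomic>

// alignas(64) rounds sizeof(PaddedCounter) up to a full cache line, so
// adjacent array elements can never share a line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[2];   // counters[0] and counters[1] live on separate
                             // lines: no false sharing between two cores

C++17 also defines std::hardware_destructive_interference_size for exactly this purpose, where the implementation supports it.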

Directory-Based Cache Coherence (cache coherence for NUMA)

Bus-based designs work well for a small number of nodes. With many nodes, a single bus carrying all the traffic becomes a bottleneck: only one node may use the bus at a time, so the rest sit idle, and every modification has to be broadcast to all nodes, which is slow and costs the machine much of its overall performance.

  • Serve coherence requests through a special directory instead of a bus
  • The directory tracks the status of every cached block (state, time, location)
  • The directory can send notifications of modifications point-to-point; broadcasting is no longer needed (a sketch of a directory entry follows the link below)

https://en.wikipedia.org/wiki/Directory-based_coherence
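
A hypothetical sketch of one directory entry per memory block, using a full bit vector of sharers in the style described on the page linked above (sizes and names are illustrative):

#include <bitset>

constexpr int kNodes = 64;                 // number of nodes being tracked

enum class DirState { Uncached, Shared, Exclusive };

struct DirectoryEntry {
    DirState            state = DirState::Uncached;
    std::bitset<kNodes> sharers;           // which nodes hold a cached copy
};

// On a write from some node, the home directory consults `sharers` and sends
// point-to-point invalidations only to the nodes whose bit is set; no
// broadcast over a shared bus is needed.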

Sequential Consistency

Cache coherence alone may not be enough.

Sometimes the compiler reorganizes the code. For example:

int a, b;
a = 5;
b = 3;
a = 6;

The compiler might move a = 6 up and b = 3 down: for consecutive writes to the same variable the cache cost of such reordering is small, and for a single thread it is harmless. But consider the classic two-processor situation: P1 writes data and then sets flag, while P2 waits for flag and then reads data.

If P1's write to flag is moved ahead of its write to data, P2 may read data as 0, which is not what we intended.
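
Here is the flag/data scenario as a C++ sketch (assumed; the original presents it only in figure form). With std::atomic, the default memory_order_seq_cst ordering gives exactly sequential consistency, so the troublesome reordering cannot happen:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void p1() {
    data.store(1);                // seq_cst: may not be reordered past...
    flag.store(1);                // ...this store to flag
}

void p2() {
    while (flag.load() == 0) {}   // spin until P1 raises the flag
    assert(data.load() == 1);     // under sequential consistency this holds
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
}

With plain non-atomic variables (or relaxed ordering) the compiler and hardware would be free to reorder the two stores, and the assert could fire.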

Idea

We want memory operations to follow the order given by each program.

  1. A read should return the value of the last write to that location (by any processor).
  2. Operations should be executed in some order that respects the individual program order of each thread/process.

Model

  1. Processors issue memory operations one at a time, in order.
  2. Program order must not be changed, and each access to memory should be atomic. Some basic operations are defined on this model.
