Synchronization¶
The Need for Synchronization¶
Synchronization: the need to coordinate access to shared resources.

Thread Interactions¶
We want to avoid two extremes: everyone grabs the shared resource at once (no locking or coordination, which causes race conditions), and each thread grabs it strictly one at a time (very slow, poor parallelism).
- Concurrent threads interact through shared access:
- All threads in the same program can share access to:
- Program code
- Global variables
- Heap
- Open files
- Threads also share access (via the kernel) to peripheral devices, such as:
- Secondary storage
- Display
- Other I/O devices
- The part of a concurrent program in which a shared object is accessed is called a critical section.
- This coordination of access to a shared resource is called synchronization.
- A race condition is when the program's result depends on the order of execution.
- Race conditions (a.k.a. data races) occur when access to shared data is not synchronized.
Sequential Consistency¶

When one core updates a variable, other cores may not see that update right away due to caching, reordering, and propagation delays. This is why we need synchronization primitives to ensure memory visibility and consistency across cores.
Program A - a write followed by a read¶
    #include <pthread.h>

    void critical_section_1(void);   /* placeholders used by the example */
    void critical_section_2(void);

    int flag1 = 0, flag2 = 0;

    void *p1(void *ignored) {
        flag1 = 1;
        if (!flag2) {
            critical_section_1();
        }
        return NULL;
    }

    void *p2(void *ignored) {
        flag2 = 1;
        if (!flag1) {
            critical_section_2();
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, p1, NULL);
        p2(NULL);
        pthread_join(tid, NULL);
        return 0;
    }
Modern CPUs use write buffers to delay writing to memory for performance. Each thread sees its own writes immediately, but not the writes of other threads until they’re flushed to RAM. In multi-core systems, this means one thread may read stale values from another, leading both to enter critical sections at the same time. This behaviour breaks mutual exclusion and causes race conditions, even if the code logic appears correct.
We have to ensure the write to flag1 is seen by other threads before reading flag2. We have to ensure the write to flag2 is seen by other threads before reading flag1.
The system must maintain program order when a write is followed by a read, i.e. for Thread 1, make sure the other threads have access to the updated value of flag1 before it reads the value of flag2.
The system must enforce write-before-read ordering across threads, or else stale values may be seen and both threads could enter the critical section. This issue comes from hardware-level memory reordering or delays, not just incorrect logic.
Program B - Overlapping Writes and Non-blocking Reads¶
    int data = 0, ready = 0;

    void p1(void *ignored) {
        data = 350;
        ready = 1;
    }

    void p2(void *ignored) {
        while (!ready)
            ;
        process(data);
    }
The assumption is: once ready == 1, data is guaranteed to be set and safe to use.
When one thread (p1) writes data and then sets a ready flag, another thread (p2) might see ready == 1 before data is fully updated, due to memory delays or hardware reordering. This can cause p2 to process stale data. The issue is worse if data and ready are stored in different memory modules or cache lines. To prevent this, the system must preserve the order of write operations — it must not reorder data and ready writes in a way that breaks program logic across threads.
The system must maintain program order between two write operations. I.e., the system cannot reorder write operations.
The same problem can occur with non-blocking reads: a non-blocking read issues the first read and does not wait for it to complete before issuing the second read.
The system must maintain program order between two read operations.
I.e. cannot speculatively prefetch data. The system must wait until ready is 1 before fetching data.
A cache line is the unit of data transfer between main memory and core’s caches. An optimization may be to reorder writes so that multiple writes to the same cache line occur at the same time.
Cache coherence is ensuring that in a multicore or multiprocessor environment, if two or more caches are storing at the same address, then the value stored for that address in each cache must be the same. I.e. changes in one cache must be propagated to other caches.
A write should not be considered complete until all cached copies have been updated (or marked as invalid so that they will not be used).
Sequential Consistency¶
Definition: Sequential consistency means:
- All operations appear to be executed in some sequential order, and
- The operations of each processor/core occur in the order specified by the program.
Requirements¶
- Program Order:
- Each core must complete one memory operation before starting the next, preserving the program's operation order.
- The operation must update the result in cache and RAM before the core proceeds to the next one
- Write Atomicity:
- All writes to a memory address are seen in the same order by all cores.
- All subsequent reads by any processor core must see the same result.
Without Sequential Consistency¶
- Different cores or threads may observe inconsistent results.
- This can happen with interleaving on a single processor.
Optimization and Sequential Consistency¶
- Sequential consistency restricts how hardware can optimize execution.
- It prevents the use of performance-improving techniques like write buffers and write reordering.
Write Buffers
- A write buffer is fast memory where data is temporarily held before being written to RAM.
- It allows the CPU to proceed with instructions without waiting for writes to RAM to finish.
- But sequential consistency requires waiting until each write is complete to preserve order.
- Example: In a program, a thread should wait until `flag1` is written to RAM before reading `flag2`; write buffering would break this ordering.
Write Reordering
- Write reordering can improve performance when writing to different memory modules (MMs).
- Example sequence:
- Write to MM1
- Another write to MM1
- Write to MM2
- Another write to MM2
- Reordering 2) and 3) allows parallelism across MM1 and MM2.
- But sequential consistency forbids such reordering, reducing performance potential.
- The reason why 1 and 2 cannot happen at the same time is they are writes to the same memory module (MM1). Overlapping writes to the same memory location or module must be serialized (done one after the other). Otherwise, you risk race conditions, inconsistencies, or partial updates.
Complicates Non-blocking Reads
- If the next instruction is to load `data` from main memory, the core could prefetch `data` before it determines that `ready` is true. But this may be the old value of `data`, so the core must wait to read it and must not prefetch it.
SC makes cache coherence more expensive
- Must delay write completion until its values in other caches are marked as invalid or updated.
x64¶
- x64 memory types:
- Write-back (WB) is the default: it writes to cache first and updates main memory later, improving performance but adding complexity.
- Write-through (WT) writes to both cache and memory simultaneously, ensuring consistency but making writes slower.
- Uncacheable (UC) disables caching entirely, forcing every access to go directly to memory; useful for real-time device data like mouse input.
- Write-combining (WC) buffers writes and sends them in bursts to optimize throughput, especially for streaming large amounts of data, but may delay visibility of updates.
- Write-combining has weak consistency
- weak consistency means no guarantee that reads and writes will have sequential consistency.
- Store-to-load forwarding is a hardware optimization that allows a CPU to forward a value from a recent store (write) to a subsequent load (read) — even before the store has been written to memory or cache.
- Store-to-load forwarding = reading from the store buffer, not the cache.
- `volatile` keyword: tells the compiler not to optimize or cache this variable; always reload it from memory every time it is accessed.
- An atomic instruction is an instruction that cannot be interrupted, i.e. no other core can access the memory value until the instruction has completed. It is as if the instruction occurred in a single step.
- `lock` makes the memory instruction that follows it atomic. All `lock`ed instructions are totally ordered (they appear to happen in one sequential order), and other memory instructions cannot be reordered with the locked ones.
- `xchg` exchanges the value stored in a register with the value stored in memory. It is always locked.
- `cmpxchg` compares a value in a register with a value in memory and swaps in a new value if they are equal (compare-and-swap). With the `lock` prefix it is atomic.
- Fence assembly language instructions can prevent reordering from one side of the fence to the other (a rough example follows).
- `lfence` (load fence): all loads (i.e. reads) from memory prior to the `lfence` instruction will complete (i.e. are globally visible) before any loads after the `lfence` instruction are executed. No loads that occur after the `lfence` will be executed before the `lfence`.
- `sfence` (store fence): all stores (i.e. writes) to memory prior to the `sfence` instruction will complete (i.e. are globally visible) before any stores after the `sfence` instruction are executed.
- `mfence` (memory fence): a combination of an `lfence` and an `sfence`.
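As a rough illustration of where such fences would go in Program B above (GCC-style inline assembly on x86; `process()` is the routine assumed by Program B):

```c
void process(int value);   /* assumed to exist, as in Program B */

int data = 0, ready = 0;

void writer(void) {
    data = 350;
    __asm__ volatile("sfence" ::: "memory");  /* prior stores complete before later stores */
    ready = 1;
}

void reader(void) {
    while (!ready)
        ;
    __asm__ volatile("lfence" ::: "memory");  /* no later loads run before this point */
    process(data);
}
```

Program A's write-followed-by-read problem would instead need an `mfence` between the write to `flag1` and the read of `flag2` (and likewise in `p2`), since store-load reordering is the case the weaker fences do not cover.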
Synchronization Primitives¶
- Recall: a critical section is the code that accesses a shared variable (or more generally, a shared resource) and must not be concurrently executed by more than one thread.
- mutexes (aka locks)
- API: `void mutex_init(mutex_t *m, ...);`, `void mutex_lock(mutex_t *m);`, `int mutex_trylock(mutex_t *m);`, `void mutex_unlock(mutex_t *m);`
- Only one thread can acquire `m` at a time; other threads must wait.
- If a thread cannot lock the mutex immediately, it must wait until the mutex becomes unlocked. Then its call to `mutex_lock(&mutex)` will return.
- Thread API contract:
- All global data should be protected by a mutex.
- Conditions: global variable + accessed by more than one thread + at least one write.
- Protection is the responsibility of the application writer.
- If mutexes are used correctly, the program's behaviour should be indistinguishable from sequential consistency.
- The OS also needs synchronization, but interrupts are incompatible with mutexes, since we do not want the interrupt handler to block waiting for a mutex.
- `pthread_mutex_lock` blocks the calling thread until the mutex is available.
- `pthread_mutex_trylock` does not block. If the mutex is already locked, it returns immediately with an error (EBUSY). Otherwise, it locks the mutex and returns 0.
- If `count` is 0 (or the buffer is full), release the mutex and call `thread_yield` to allow another thread to run.
- While waiting for the mutex, the thread does not use the processor; it waits in a wait queue and the kernel wakes it up when the other thread has unlocked the mutex.
NOTE
- The difference when using condition variables with a mutex is that the condition variable only wakes up when it is signalled, so the thread sleeps without using the processor. A mutex also has its own wait queue. But for the pure-mutex version sketched below, since `thread_yield` is used:
- The thread voluntarily gives up the CPU.
- The kernel scheduler picks another runnable thread.
- The yielding thread is placed in the ready queue.
- Eventually, the scheduler picks it again (based on its scheduling policy, e.g., round-robin).
- The thread runs again, re-checking the condition (`count`).
- So this version does use CPU time, because the thread wakes up just to check the condition again, even if nothing has changed. If it does not get the mutex right away it will block and not use processor time, but when it wakes up it will check the count again, unlock the mutex, and yield. That part of the while loop uses the processor; you could call it "busy waiting light": the thread occasionally blocks on the mutex lock, but it also busy-waits, checking the condition, unlocking the mutex, and yielding.

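A minimal sketch of that pure-mutex pattern, assuming a bounded buffer with hypothetical helpers (`buffer_put`, `BUFFER_SIZE`) and the `mutex_t`/`thread_yield` API used in these notes:

```c
mutex_t m;                 /* protects count and the buffer */
int count = 0;             /* number of items currently in the buffer */

void producer_insert(int item) {
    mutex_lock(&m);
    while (count == BUFFER_SIZE) {   /* buffer full: cannot insert yet */
        mutex_unlock(&m);            /* let a consumer make progress */
        thread_yield();              /* give up the CPU ("busy waiting light") */
        mutex_lock(&m);              /* re-acquire and re-check the count */
    }
    buffer_put(item);                /* hypothetical helper */
    count++;
    mutex_unlock(&m);
}
```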
- Busy waiting: the thread uses processor time while it waits. Busy waiting is inefficient.
- A semaphore is an object that has a non-negative value
- `sem_wait`: if the semaphore value is > 0, decrement the value by one. Otherwise, wait until the value is greater than 0 and then decrement it.
- `sem_post` (or `sem_signal`): increment the value by 1.
- By definition, these two operations are atomic.
- Semaphores do not keep track of ownership (they have no owners); mutexes do.
- `sem_post` does not have to follow `sem_wait`.
- A thread will wait until the semaphore count is greater than 0 before continuing.
- Two common uses of semaphores include
- binary semaphore: a semaphore with a single resource; behaves like a mutex but does not keep track of ownership
- counting semaphore: a semaphore with an arbitrary number of resources
- Differences:
- `sem_post` does not have to follow `sem_wait`.
- A semaphore can start with a count of 0.
- Calling `sem_post` increments the semaphore by 1.
- A thread will wait until the semaphore count is greater than 0 before continuing (see the sketch below).
- Semaphores do not have owners. Mutex does
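For instance, with POSIX semaphores a counting semaphore can start at 0 and `sem_post` never has to follow a `sem_wait`. A rough sketch (the buffer helpers `put`/`get` are hypothetical, and the buffer itself would still need a mutex in a real program):

```c
#include <semaphore.h>

sem_t items;   /* counts available items; no owner, starts at 0 */

void init(void)      { sem_init(&items, 0, 0); }
void producer(int x) { put(x); sem_post(&items); }        /* put() is hypothetical */
int  consumer(void)  { sem_wait(&items); return get(); }  /* get() is hypothetical */
```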
- Condition variables allow threads to wait for arbitrary conditions to become true while the thread is waiting inside of a critical section
- When the thread blocks, the kernel unlocks the mutex and allows another thread into the critical section, possibly to make the condition true.
- When another thread detects the condition is true, it uses `cond_signal` or `cond_broadcast` to notify any blocked threads and then unlocks the mutex.
- When the kernel wakes up the thread, it ensures that the thread holds the mutex again.
- `int pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m);`: atomically unlock `m` and sleep until `c` is signalled by another thread, then re-acquire `m` and resume executing.
- `pthread_cond_signal`: wake up one thread waiting for condition `c`.
- `pthread_cond_broadcast`: wake up all threads waiting for condition `c`.
- Signalling or broadcasting to a condition variable that has no blocked threads has no effect ⇒ signals do not accumulate.
- Recheck the condition on wake-up when using `cond_wait`. Otherwise, another thread may acquire the mutex first and consume the item; the waiting thread would then wake up believing there is still an item in the buffer to consume when there is not. `cond_wait` does not guarantee the condition is still true after waking up (see the sketch below).
- The reason `cond_wait` both unlocks the mutex and sleeps (instead of two separate calls, one to unlock the mutex and one to wait for the condition) is bad interleaving: with separate calls the producer could end up waiting even though the condition is true, because it missed the consumer's signal.
- With separate calls the two steps are not atomic: the mutex is unlocked before the thread goes to sleep, leaving a window in which the signal can be missed.
- Why `pthread_cond_wait(&cond, &mutex)` exists:
- This function does three things atomically:
- Unlocks the mutex
- Puts the thread to sleep on the condition
- When woken up, relocks the mutex before returning
- No gap where the thread is not holding the lock but not sleeping yet
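A minimal sketch of this recheck-on-wakeup pattern with pthreads (the bounded buffer and its `buffer_get`/`buffer_put` helpers are hypothetical):

```c
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
int count = 0;                             /* items in the (hypothetical) buffer */

int consume(void) {
    pthread_mutex_lock(&m);
    while (count == 0)                     /* while, not if: recheck after every wakeup */
        pthread_cond_wait(&not_empty, &m); /* atomically unlocks m and sleeps */
    int item = buffer_get();               /* hypothetical helper */
    count--;
    pthread_mutex_unlock(&m);
    return item;
}

void produce(int item) {
    pthread_mutex_lock(&m);
    buffer_put(item);                      /* hypothetical helper */
    count++;
    pthread_cond_signal(&not_empty);       /* wake one waiting consumer */
    pthread_mutex_unlock(&m);
}
```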

- Locks / Mutexes
- Have two states.
- `mutex_lock`: a thread uses this to enter a critical section and gain exclusive access to a resource.
- `mutex_unlock`: a thread uses this to leave the critical section, allowing another thread to enter.
- Semaphores
- Have many states (an integer value, counting from 0 to n).
- `sem_wait`: blocks the thread if the semaphore value is 0.
- `sem_post` (or `sem_signal`): increments the semaphore value, potentially unblocking a waiting thread.
- Condition Variables
- Used to wait for arbitrary conditions to become true (e.g., complex boolean conditions).
- `cond_wait`: blocks the thread until a condition becomes true.
- `cond_signal` or `cond_broadcast`: wake up one or all threads waiting for that condition.
- Sometimes allowing race conditions gives greater efficiency, e.g. when we only need an approximate count rather than an exact one.
- Use lockset algorithm to detect data races:
- For each global memory location, keep a lockset: the set of locks that has been held on every access so far (initially, all locks are candidates).
- On each access to that location, intersect its lockset with the set of locks currently held, i.e. remove any locks that are not held at this access.
- If the lockset becomes empty (all 0 in a bitmask representation), no common lock protects this memory location → potential data race.
- If a shared variable is properly protected, then there should be at least one lock that is always held during every access to that variable.
- Correct locking means there’s at least one lock consistently held on every access.
- An empty lockset only indicates a potential data race, not a guaranteed one: it tells us that at least one access happened without holding any lock that was held during other accesses, but it does not prove the accesses happened at the same time or caused corruption. (A small sketch of the intersection follows.)
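A small sketch of the lockset intersection for a single shared variable, with locksets represented as bitmasks (all names here are hypothetical):

```c
#include <stdint.h>

uint64_t lockset = ~(uint64_t)0;        /* initially every lock is a candidate */

void on_access(uint64_t locks_held)     /* bitmask of locks held at this access */
{
    lockset &= locks_held;              /* keep only locks held on every access so far */
    if (lockset == 0)
        report_potential_data_race();   /* hypothetical reporting hook */
}
```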
- With coarse-grained locking we create one lock for the whole data structure. Use this if the data structure is not used that often, since it needs less space and less complexity: a thread only has to acquire one lock.
- With fine-grained locking we create one lock for each part of the data structure (e.g., one lock per linked list in a hash table). Use this when the data structure is used often and maximizing concurrency and performance is the goal.
- Lots of contention means threads are often waiting for a particular resource. Identify places with lots of contention and use fine-grained locks in those areas.
Implementing Synchronization Primitives¶
- `test-and-set`

        while (*mutex == true) {}  // test if mutex is locked
        *mutex = true;             // set it to true, i.e. locked

- Note: this naive test-then-set is not atomic; another thread can slip in between the test and the set.
- Use `xchg` to attempt to lock the mutex; `xchg` returns the old value that was in `*mutex`.

        bool test_and_set(bool *mutex) {
            return xchg(mutex, true) == true;  // test and set
        }

- A spinlock is a lock that spins: it repeatedly tests the lock's availability in a loop until the lock is acquired. When threads use a spinlock they busy-wait, i.e. they execute code on a processor core while they wait for the lock (a sketch follows).
- Efficient for short waiting times.
- Implemented with `atomic_swap_uint64`, which in turn uses the assembly language instruction `xchgq`, a variant of `xchg` that swaps quad words.
- When a thread acquires a spinlock, interrupts are disabled on that CPU. This ensures the thread won't be interrupted and will quickly finish its critical section. As a result, other threads spinning for the lock will only wait a short time, making the spinlock efficient for short tasks.
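A rough sketch of a spinlock built on the `test_and_set` above (a real implementation would also disable interrupts and use a memory barrier on release, as described in these notes):

```c
#include <stdbool.h>   /* for bool, as used by test_and_set */

void spin_lock(bool *mutex) {
    while (test_and_set(mutex))   /* keep retrying until the old value was false (unlocked) */
        ;                         /* busy-wait: the thread spins on the CPU */
}

void spin_unlock(bool *mutex) {
    *mutex = false;               /* release the lock */
}
```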
C11 Atomics¶
- New atomic type `_Atomic(...)`, which can wrap the basic types to make them atomic. All standard ops (e.g., +, −, /, ∗) on them become sequentially consistent.
- `atomic_flag` is a special type. It behaves like a boolean, but you can't directly load/store its value; use `atomic_flag_test_and_set` and `atomic_flag_clear` instead (a minimal example follows).
- The standard requires that `atomic_flag` be implemented without using locks.
- lock-free: never waits for another thread to release a lock or finish its operation.
- This means it must complete without the thread being interrupted/preempted.
- All other types might require locks.
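For example, `atomic_flag` and its two operations are enough to build a minimal spinlock in portable C11:

```c
#include <stdatomic.h>

atomic_flag flag = ATOMIC_FLAG_INIT;

void acquire(void) {
    while (atomic_flag_test_and_set(&flag))  /* returns the previous value; spin while it was set */
        ;
}

void release(void) {
    atomic_flag_clear(&flag);
}
```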
Cache Coherence¶
- Caches are divided into slices of bytes called cache lines
- When data is loaded, it loads an entire cache line, not just a single byte or word.
- Reason: to exploit locality — when you access one byte, nearby bytes are likely to be used soon.
- Each cache line is in one of three states:
- Shared
- One or more caches (and memory) have a valid copy.
- The data is valid and identical in multiple caches (and possibly in memory).
- Modified (aka Exclusive)
- Only one core has the latest, valid copy.
- That copy is called dirty
- That cache has changed the data (dirty) and memory is not yet updated.
- Other cores’ copies are stale or invalid.
- Before another core can read/write, this core must:
- Write back to memory (if needed)
- Invalidate other copies
- Invalid:
- The data is either not present in cache or is stale and must be updated
- Changing a cache line’s state (e.g., from Shared to Modified) involves coordination and can take time:
- 100–2000 clock cycles, depending on architecture and cache hierarchy.
- Example flow using MSI protocol:
- Initial load
- x = 42 is loaded by Core 0 → x is in Shared state in Core 0’s cache
- Core 1 also loads x
- Now both caches have x in Shared state
- Core 0 writes to x
- It must change x to Modified state
- The hardware:
- Invalidates x in Core 1’s cache
- Core 0 now owns the only valid copy
- Memory is marked out of date until write-back
- If Core 1 reads x again
- It triggers a coherence transaction:
- Core 0 must write back x to memory
- Or transfer it directly to Core 1
- Core 1 can then cache it again

- Implications for Multithreaded Design
- Lesson #1: Avoid false sharing
- False sharing occurs when different threads access different variables that happen to be on the same cache line.
- Even if threads don’t share the same data logically, the processor will treat the whole cache line as shared, causing coherence overhead.
- Tip: Separate data used by different threads. Example: pad arrays so each element starts at a new 64-byte boundary (typical cache line size).
- Lesson #2: Align structures to cache lines
- Keep data that is used together in the same cache line to minimize memory traffic.
- In C11/C++11, use `alignas(64)` to tell the compiler to place a structure on a 64-byte boundary, aligning it with a cache line (see the sketch after this list).
- Lesson #3: Pad data structures
- If you have arrays of structures, and each structure is used by a different thread, they may accidentally share cache lines.
- Fix: Add unused fields (padding) to force structures to not share cache lines.
- Lesson #4: Avoid contending on cache lines
- Multiple threads accessing the same cache line, even if only for reading/writing different variables, causes cache coherence traffic.
- Advanced locking algorithms like MCS locks let each thread spin on a local cache line (private to its core), reducing this overhead.
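A sketch combining lessons #2 and #3, assuming 64-byte cache lines (the counter type and thread count are made up for illustration):

```c
#include <stdalign.h>

#define CACHE_LINE  64
#define NUM_THREADS 4            /* hypothetical thread count */

/* Pad each per-thread counter to a full cache line so two threads
 * never write to the same line (avoids false sharing). */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* alignas(64) places the array itself on a cache-line boundary. */
static alignas(CACHE_LINE) struct padded_counter counters[NUM_THREADS];
```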
Deadlock¶
- Lesson: it is dangerous to hold locks when crossing abstraction barriers, e.g. a top-level function holds lock A and then calls a lower-level function that uses condition variables and needs locks A and B to signal a thread to wake up.
- Deadlock conditions (All four of these conditions are necessary for deadlock to occur):
- Limited access (mutual exclusion): resources can only be shared with a finite number of threads
- No preemption of resources: once a resource is granted, it cannot be taken away
- Multiple independent requests (hold and wait): DO NOT HOLD AND WAIT (otherwise, the lower level functions you call may require the locks you are holding. So if you’re holding lock A, don’t try to acquire lock B while still holding A.)
- A thread is holding one or more resources and also waiting for another.
- Eg: Thread A holds lock X and wants lock Y; Thread B holds lock Y and wants lock X
- If threads never held anything while waiting, deadlock wouldn’t happen.
- Circularity in graph of requests
- If we prevent any one of these conditions, deadlock cannot happen
- Prevention:
- Buy more resources, split into pieces, or virtualize to make "infinite" copies.
- Physical memory: virtualized with VM, can take physical page away and give to another process
- A process holds a physical memory page.
- Another process needs that page to continue.
- Normally, it would be stuck waiting (classic deadlock).
- But with VM, the OS can reclaim that page from the first process and give it to the second, avoiding the deadlock.
- Wait on all resources at once (must know in advance), i.e. do not hold and wait.
- A thread should request all the resources it will need at the same time, before doing any work — instead of asking for them one at a time while holding others.
lock_both(a, b); // wait until *both* A and B are available
- Partial ordering of resources
- Processes, threads, and resources are nodes; resource requests and assignments are edges. No cycles → no deadlock. If the graph contains a cycle, it is definitely a deadlock when there is only one instance per resource type (e.g. one mutex `m1`); otherwise it may or may not be a deadlock (e.g. multiple pages in RAM).
- Prevent deadlock with a partial order on resources: e.g., always acquire mutexes in a specific order, such as `m1` before `m2`. Then the resource graph can never contain a cycle → no deadlock (a sketch follows below).
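A common way to impose such a partial order in practice is to sort by address before locking; a sketch with pthread mutexes:

```c
#include <pthread.h>
#include <stdint.h>

/* Always acquire any pair of mutexes in the same global order
 * (here: by address), so the resource graph can never contain a cycle. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a; a = b; b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}
```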
Synchronization Implementation¶
- Mutexes, semaphores, and condition variables (but not spinlocks) use wait channels. They manage a list of sleeping threads. Each mutex, etc, gets its own wait channel.
- Hand-over-hand locking allows for fine-grained locking, i.e. many locks. Useful for concurrent access to a single data structure.
- It's different from `lock(A); lock(B);`, i.e. holding A while blocking on B.
- Hold at most two locks at a time, the previous lock and the next one, as you work your way through a sequence of steps (see the sketch below).
- Locks must be ordered in sequence
- At any point, the thread only locks the current and next node
- Why don’t we unlock A before locking B (example on page 191)? Because doing that would create a race condition — another thread could jump in and modify or delete B before you lock it, causing data corruption or crashes.
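A sketch of hand-over-hand (lock coupling) traversal of a singly linked list with one pthread mutex per node (the `node_t` type here is hypothetical):

```c
#include <pthread.h>
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
    pthread_mutex_t lock;     /* one lock per node: fine-grained locking */
} node_t;

node_t *find(node_t *head, int target) {
    pthread_mutex_lock(&head->lock);
    node_t *prev = head;
    node_t *cur = head->next;
    while (cur != NULL) {
        pthread_mutex_lock(&cur->lock);     /* lock the next node before releasing the previous */
        if (cur->value == target) {
            pthread_mutex_unlock(&prev->lock);
            pthread_mutex_unlock(&cur->lock);
            return cur;
        }
        pthread_mutex_unlock(&prev->lock);  /* now safe to drop the earlier lock */
        prev = cur;
        cur = cur->next;
    }
    pthread_mutex_unlock(&prev->lock);
    return NULL;
}
```

Keeping the previous node locked until the next one is held is exactly what prevents another thread from modifying or deleting the node you are about to lock.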
- Lock order:
- first acquire
Spinlock sem_lock - then
WaitChannel_lock - then release
Spinlock sem_lock - While the
spinlock sem_lockis being held, no other thread can access this struct. AfterWaitChannel_Lockis called, no other thread can access this wait channel until the calling thread is put to sleep and the kernel releases the WaitChannel’s lock.
- first acquire
- Why do we need to lock the wait channel in the semaphore if we already have a spinlock protecting the struct?
- We can't call sleep while still holding the spinlock, because that would block other threads from calling `Post()` and waking us up, leading to deadlock.
- But if we didn't lock the wait channel before going to sleep, another thread might `Post()` and try to wake us before we're actually sleeping. That wake-up would be lost, and we'd end up sleeping forever.
- To avoid both problems, we first lock the wait channel (to block wake-ups), then release the spinlock, and finally go to sleep. This guarantees safe sleeping and prevents lost wake-ups.
-
wait channel also uses spinlock, like it has its own internal
spinlock, not shared with semaphore. BothWaitChannel_Sleep()andWaitChannel_Wake()unlock the wait channel as part of their operation.


