Synchronization¶
The Need for Synchronization¶
Synchronization: the need to coordinate access to shared resources.

Thread Interactions¶
We want to avoid two extremes: everyone grabs the shared resource at once (no locking or coordination, which causes race conditions), and each thread grabs it strictly one at a time (very slow, poor parallelism).
- Concurrent threads interact through shared access:
- All threads in the same program can share access to:
- Program code
- Global variables
- Heap
- Open files
- Threads also share access (via the kernel) to peripheral devices, such as:
- Secondary storage
- Display
- Other I/O devices
- The part of a concurrent program in which a shared object is accessed is called a critical section.
- This coordination of access to a shared resource is called synchronization.
- A race condition is when the program's result depends on the order of execution.
- Race conditions (a.k.a. data races) occur when access to shared data is not synchronized.
Sequential Consistency¶

When one core updates a variable, other cores may not see that update right away due to caching, reordering, and propagation delays. This is why we need synchronization primitives to ensure memory visibility and consistency across cores.
Program A - a write followed by a read¶
    #include <pthread.h>

    void critical_section_1(void);   /* placeholders used by the example */
    void critical_section_2(void);

    int flag1 = 0, flag2 = 0;

    void *p1(void *ignored) {
        flag1 = 1;
        if (!flag2) {
            critical_section_1();
        }
        return NULL;
    }

    void *p2(void *ignored) {
        flag2 = 1;
        if (!flag1) {
            critical_section_2();
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, p1, NULL);
        p2(NULL);
        pthread_join(tid, NULL);
        return 0;
    }
Modern CPUs use write buffers to delay writing to memory for performance. Each thread sees its own writes immediately, but not the writes of other threads until they’re flushed to RAM. In multi-core systems, this means one thread may read stale values from another, leading both to enter critical sections at the same time. This behaviour breaks mutual exclusion and causes race conditions, even if the code logic appears correct.
We have to ensure the write to flag1 is seen by other threads before reading flag2. We have to ensure the write to flag2 is seen by other threads before reading flag1.
The system must maintain program order when a write is followed by a read, i.e. for Thread 1, make sure the other threads have access to the updated value of flag1 before it reads the value of flag2.
The system must enforce write-before-read ordering across threads, or else stale values may be seen and both threads could enter the critical section. This issue comes from hardware-level memory reordering or delays, not just incorrect logic.
Program B - Overlapping Writes and Non-blocking Reads¶
    int data = 0, ready = 0;

    void p1(void *ignored) {
        data = 350;
        ready = 1;
    }

    void p2(void *ignored) {
        while (!ready)
            ;
        process(data);
    }
The assumption is: once ready == 1, data is guaranteed to be set and safe to use.
When one thread (p1) writes data and then sets a ready flag, another thread (p2) might see ready == 1 before data is fully updated, due to memory delays or hardware reordering. This can cause p2 to process stale data. The issue is worse if data and ready are stored in different memory modules or cache lines. To prevent this, the system must preserve the order of write operations — it must not reorder data and ready writes in a way that breaks program logic across threads.
The system must maintain program order between two write operations. I.e., the system cannot reorder write operations.
The same problem can occur with non-blocking reads: a non-blocking read issues the first read and does not wait for it to complete before issuing the second read.
The system must maintain program order between two read operations.
I.e. cannot speculatively prefetch data. The system must wait until ready is 1 before fetching data.
A cache line is the unit of data transfer between main memory and core’s caches. An optimization may be to reorder writes so that multiple writes to the same cache line occur at the same time.
Cache coherence is ensuring that in a multicore or multiprocessor environment, if two or more caches are storing at the same address, then the value stored for that address in each cache must be the same. I.e. changes in one cache must be propagated to other caches.
A write should not be considered complete until all cached copies have been updated (or marked as invalid so that they will not be used).
Sequential Consistency¶
Definition: Sequential consistency means:
- All operations appear to be executed in some sequential order, and
- The operations of each processor/core occur in the order specified by the program.
Requirements¶
- Program Order:
- Each core must complete one memory operation before starting the next, preserving the program's operation order.
- The operation must update the result in cache and RAM before the core proceeds to the next one
- Write Atomicity:
- All writes to a memory address are seen in the same order by all cores.
- All subsequent reads by any processor core must see the same result.
Without Sequential Consistency¶
- Different cores or threads may observe inconsistent results.
- This can happen with interleaving on a single processor.
Optimization and Sequential Consistency¶
- Sequential consistency restricts how hardware can optimize execution.
- It prevents the use of performance-improving techniques like write buffers and write reordering.
Write Buffers
- A write buffer is fast memory where data is temporarily held before being written to RAM.
- It allows the CPU to proceed with instructions without waiting for writes to RAM to finish.
- But sequential consistency requires waiting until each write is complete to preserve order.
- Example: In a program, a thread should wait until `flag1` is written to RAM before reading `flag2`; write buffering would break this ordering.
Write Reordering
- Write reordering can improve performance when writing to different memory modules (MMs).
- Example sequence:
- Write to MM1
- Another write to MM1
- Write to MM2
- Another write to MM2
- Reordering 2) and 3) allows parallelism across MM1 and MM2.
- But sequential consistency forbids such reordering, reducing performance potential.
- The reason why 1 and 2 cannot happen at the same time is they are writes to the same memory module (MM1). Overlapping writes to the same memory location or module must be serialized (done one after the other). Otherwise, you risk race conditions, inconsistencies, or partial updates.
Complicates Non-blocking Reads
- If the next instruction is to load `data` from main memory, the core could prefetch `data` before it determines that `ready` is true. But this may be the old value of `data`, so the core must wait to read it and must not prefetch it.
SC makes cache coherence more expensive
- Must delay write completion until its values in other caches are marked as invalid or updated.
x64¶
- x64 memory types:
- Write-back (WB) is the default: it writes to cache first and updates main memory later, improving performance but adding complexity.
- Write-through (WT) writes to both cache and memory simultaneously, ensuring consistency but making writes slower.
- Uncacheable (UC) disables caching entirely, forcing every access to go directly to memory; useful for real-time device data like mouse input.
- Write-combining (WC) buffers writes and sends them in bursts to optimize throughput, especially for streaming large amounts of data, but may delay visibility of updates.
- Write-combining has weak consistency
- weak consistency means no guarantee that reads and writes will have sequential consistency.
- Store-to-load forwarding is a hardware optimization that allows a CPU to forward a value from a recent store (write) to a subsequent load (read) — even before the store has been written to memory or cache.
- Store-to-load forwarding = reading from the store buffer, not the cache.
- `volatile` keyword: tells the compiler not to optimize or cache this variable; always reload it from memory every time it is accessed.
- An atomic instruction is an instruction that cannot be interrupted, i.e. no other core can access the memory value until the instruction has completed. It is as if the instruction occurred in a single step.
- `lock` makes the memory instruction that follows it atomic. All `lock`ed instructions are totally ordered (they appear to happen in one sequential order), and other memory instructions cannot be reordered with the locked ones.
- `xchg` exchanges the value stored in a register with the value stored in memory. It is always locked.
- `cmpxchg` compares a value in a register with a value in memory and swaps in a new value if they are equal (compare-and-swap). With the `lock` prefix it is atomic.
- Fence assembly language instructions can prevent reordering from one side of the fence to the other (a rough example follows).
- `lfence` (load fence): all loads (i.e. reads) from memory prior to the `lfence` instruction will complete (i.e. are globally visible) before any loads after the `lfence` instruction are executed. No loads that occur after the `lfence` will be executed before the `lfence`.
- `sfence` (store fence): all stores (i.e. writes) to memory prior to the `sfence` instruction will complete (i.e. are globally visible) before any stores after the `sfence` instruction are executed.
- `mfence` (memory fence): a combination of an `lfence` and an `sfence`.
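As a rough illustration of where such fences would go in Program B above (GCC-style inline assembly on x86; `process()` is the routine assumed by Program B):

```c
void process(int value);   /* assumed to exist, as in Program B */

int data = 0, ready = 0;

void writer(void) {
    data = 350;
    __asm__ volatile("sfence" ::: "memory");  /* prior stores complete before later stores */
    ready = 1;
}

void reader(void) {
    while (!ready)
        ;
    __asm__ volatile("lfence" ::: "memory");  /* no later loads run before this point */
    process(data);
}
```

Program A's write-followed-by-read problem would instead need an `mfence` between the write to `flag1` and the read of `flag2` (and likewise in `p2`), since store-load reordering is the case the weaker fences do not cover.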
Synchronization Primitives¶
- Recall: a critical section is the code that accesses a shared variable (or more generally, a shared resource) and must not be concurrently executed by more than one thread.
- mutexes (aka locks)
- API: `void mutex_init(mutex_t *m, ...);`, `void mutex_lock(mutex_t *m);`, `int mutex_trylock(mutex_t *m);`, `void mutex_unlock(mutex_t *m);`
- Only one thread can acquire `m` at a time; other threads must wait.
- If a thread cannot lock the mutex immediately, it must wait until the mutex becomes unlocked. Then its call to `mutex_lock(&mutex)` will return.
- Thread API contract:
- All global data should be protected by a mutex.
- Conditions: global variable + accessed by more than one thread + at least one write.
- Protection is the responsibility of the application writer.
- If mutexes are used correctly, the program's behaviour should be indistinguishable from sequential consistency.
- The OS also needs synchronization, but interrupts are incompatible with mutexes, since we do not want the interrupt handler to block waiting for a mutex.
- `pthread_mutex_lock` blocks the calling thread until the mutex is available.
- `pthread_mutex_trylock` does not block. If the mutex is already locked, it returns immediately with an error (EBUSY). Otherwise, it locks the mutex and returns 0.
- If `count` is 0 (or the buffer is full), release the mutex and call `thread_yield` to allow another thread to run.
- While waiting for the mutex, the thread does not use the processor; it waits in a wait queue and the kernel wakes it up when the other thread has unlocked the mutex.
NOTE
- The difference when using condition variables with a mutex is that the condition variable only wakes up when it is signalled, so the thread sleeps without using the processor. A mutex also has its own wait queue. But for the pure-mutex version sketched below, since `thread_yield` is used:
- The thread voluntarily gives up the CPU.
- The kernel scheduler picks another runnable thread.
- The yielding thread is placed in the ready queue.
- Eventually, the scheduler picks it again (based on its scheduling policy, e.g., round-robin).
- The thread runs again, re-checking the condition (`count`).
- So this version does use CPU time, because the thread wakes up just to check the condition again, even if nothing has changed. If it does not get the mutex right away it will block and not use processor time, but when it wakes up it will check the count again, unlock the mutex, and yield. That part of the while loop uses the processor; you could call it "busy waiting light": the thread occasionally blocks on the mutex lock, but it also busy-waits, checking the condition, unlocking the mutex, and yielding.

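A minimal sketch of that pure-mutex pattern, assuming a bounded buffer with hypothetical helpers (`buffer_put`, `BUFFER_SIZE`) and the `mutex_t`/`thread_yield` API used in these notes:

```c
mutex_t m;                 /* protects count and the buffer */
int count = 0;             /* number of items currently in the buffer */

void producer_insert(int item) {
    mutex_lock(&m);
    while (count == BUFFER_SIZE) {   /* buffer full: cannot insert yet */
        mutex_unlock(&m);            /* let a consumer make progress */
        thread_yield();              /* give up the CPU ("busy waiting light") */
        mutex_lock(&m);              /* re-acquire and re-check the count */
    }
    buffer_put(item);                /* hypothetical helper */
    count++;
    mutex_unlock(&m);
}
```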
- Busy waiting: the thread uses processor time while it waits. Busy waiting is inefficient.
- A semaphore is an object that has a non-negative value
- `sem_wait`: if the semaphore value is > 0, decrement the value by one. Otherwise, wait until the value is greater than 0 and then decrement it.
- `sem_post` (or `sem_signal`): increment the value by 1.
- By definition, these two operations are atomic.
- Semaphores do not keep track of ownership (they have no owners); mutexes do.
- `sem_post` does not have to follow `sem_wait`.
- A thread will wait until the semaphore count is greater than 0 before continuing.
- Two common uses of semaphores include
- binary semaphore: a semaphore with a single resource; behaves like a mutex but does not keep track of ownership
- counting semaphore: a semaphore with an arbitrary number of resources
- Differences:
- `sem_post` does not have to follow `sem_wait`.
- A semaphore can start with a count of 0.
- Calling `sem_post` increments the semaphore by 1.
- A thread will wait until the semaphore count is greater than 0 before continuing (see the sketch below).
- Semaphores do not have owners. Mutex does
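For instance, with POSIX semaphores a counting semaphore can start at 0 and `sem_post` never has to follow a `sem_wait`. A rough sketch (the buffer helpers `put`/`get` are hypothetical, and the buffer itself would still need a mutex in a real program):

```c
#include <semaphore.h>

sem_t items;   /* counts available items; no owner, starts at 0 */

void init(void)      { sem_init(&items, 0, 0); }
void producer(int x) { put(x); sem_post(&items); }        /* put() is hypothetical */
int  consumer(void)  { sem_wait(&items); return get(); }  /* get() is hypothetical */
```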
- Condition variables allow threads to wait for arbitrary conditions to become true while the thread is waiting inside of a critical section
- When the thread blocks, the kernel unlocks the mutex and allows another thread into the critical section, possibly to make the condition true.
- When another thread detects the condition is true, it uses `cond_signal` or `cond_broadcast` to notify any blocked threads and then unlocks the mutex.
- When the kernel wakes up the thread, it ensures that the thread holds the mutex again.
- `int pthread_cond_wait(pthread_cond_t *c, pthread_mutex_t *m);`: atomically unlock `m` and sleep until `c` is signalled by another thread, then re-acquire `m` and resume executing.
- `pthread_cond_signal`: wake up one thread waiting for condition `c`.
- `pthread_cond_broadcast`: wake up all threads waiting for condition `c`.
- Signalling or broadcasting to a condition variable that has no blocked threads has no effect ⇒ signals do not accumulate.
- Recheck the condition on wake-up when using `cond_wait`. Otherwise, another thread may acquire the mutex first and consume the item; the waiting thread would then wake up believing there is still an item in the buffer to consume when there is not. `cond_wait` does not guarantee the condition is still true after waking up (see the sketch below).
- The reason `cond_wait` both unlocks the mutex and sleeps (instead of two separate calls, one to unlock the mutex and one to wait for the condition) is bad interleaving: with separate calls the producer could end up waiting even though the condition is true, because it missed the consumer's signal.
- With separate calls the two steps are not atomic: the mutex is unlocked before the thread goes to sleep, leaving a window in which the signal can be missed.
- Why `pthread_cond_wait(&cond, &mutex)` exists:
- This function does three things atomically:
- Unlocks the mutex
- Puts the thread to sleep on the condition
- When woken up, relocks the mutex before returning
- No gap where the thread is not holding the lock but not sleeping yet
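A minimal sketch of this recheck-on-wakeup pattern with pthreads (the bounded buffer and its `buffer_get`/`buffer_put` helpers are hypothetical):

```c
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
int count = 0;                             /* items in the (hypothetical) buffer */

int consume(void) {
    pthread_mutex_lock(&m);
    while (count == 0)                     /* while, not if: recheck after every wakeup */
        pthread_cond_wait(&not_empty, &m); /* atomically unlocks m and sleeps */
    int item = buffer_get();               /* hypothetical helper */
    count--;
    pthread_mutex_unlock(&m);
    return item;
}

void produce(int item) {
    pthread_mutex_lock(&m);
    buffer_put(item);                      /* hypothetical helper */
    count++;
    pthread_cond_signal(&not_empty);       /* wake one waiting consumer */
    pthread_mutex_unlock(&m);
}
```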

- Locks / Mutexes
- Have two states.
- `mutex_lock`: a thread uses this to enter a critical section and gain exclusive access to a resource.
- `mutex_unlock`: a thread uses this to leave the critical section, allowing another thread to enter.
- Semaphores
- Have many states (an integer value, counting from 0 to n).
- `sem_wait`: blocks the thread if the semaphore value is 0.
- `sem_post` (or `sem_signal`): increments the semaphore value, potentially unblocking a waiting thread.
- Condition Variables
- Used to wait for arbitrary conditions to become true (e.g., complex boolean conditions).
- `cond_wait`: blocks the thread until a condition becomes true.
- `cond_signal` or `cond_broadcast`: wake up one or all threads waiting for that condition.
- Sometimes allowing race conditions gives greater efficiency, e.g. when we only need an approximate count rather than an exact one.
- Use lockset algorithm to detect data races:
- For each global memory location, keep a lockset: the set of locks that has been held on every access so far (initially, all locks are candidates).
- On each access to that location, intersect its lockset with the set of locks currently held, i.e. remove any locks that are not held at this access.
- If the lockset becomes empty (all 0 in a bitmask representation), no common lock protects this memory location → potential data race.
- If a shared variable is properly protected, then there should be at least one lock that is always held during every access to that variable.
- Correct locking means there’s at least one lock consistently held on every access.
- An empty lockset only indicates a potential data race, not a guaranteed one: it tells us that at least one access happened without holding any lock that was held during other accesses, but it does not prove the accesses happened at the same time or caused corruption. (A small sketch of the intersection follows.)
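A small sketch of the lockset intersection for a single shared variable, with locksets represented as bitmasks (all names here are hypothetical):

```c
#include <stdint.h>

uint64_t lockset = ~(uint64_t)0;        /* initially every lock is a candidate */

void on_access(uint64_t locks_held)     /* bitmask of locks held at this access */
{
    lockset &= locks_held;              /* keep only locks held on every access so far */
    if (lockset == 0)
        report_potential_data_race();   /* hypothetical reporting hook */
}
```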
- With coarse-grained locking we create one lock for the whole data structure. Use this if the data structure is not used that often, since it needs less space and less complexity: a thread only has to acquire one lock.
- With fine-grained locking we create one lock for each part of the data structure (e.g., one lock per linked list in a hash table). Use this when the data structure is used often and maximizing concurrency and performance is the goal.
- Lots of contention means threads are often waiting for a particular resource. Identify places with lots of contention and use fine-grained locks in those areas.
Implementing Synchronization Primitives¶
- `test-and-set`

        while (*mutex == true) {}  // test if mutex is locked
        *mutex = true;             // set it to true, i.e. locked

- Note: this naive test-then-set is not atomic; another thread can slip in between the test and the set.
- Use `xchg` to attempt to lock the mutex; `xchg` returns the old value that was in `*mutex`.

        bool test_and_set(bool *mutex) {
            return xchg(mutex, true) == true;  // test and set
        }

- A spinlock is a lock that spins: it repeatedly tests the lock's availability in a loop until the lock is acquired. When threads use a spinlock they busy-wait, i.e. they execute code on a processor core while they wait for the lock (a sketch follows).
- Efficient for short waiting times.
- Implemented with `atomic_swap_uint64`, which in turn uses the assembly language instruction `xchgq`, a variant of `xchg` that swaps quad words.
- When a thread acquires a spinlock, interrupts are disabled on that CPU. This ensures the thread won't be interrupted and will quickly finish its critical section. As a result, other threads spinning for the lock will only wait a short time, making the spinlock efficient for short tasks.
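A rough sketch of a spinlock built on the `test_and_set` above (a real implementation would also disable interrupts and use a memory barrier on release, as described in these notes):

```c
#include <stdbool.h>   /* for bool, as used by test_and_set */

void spin_lock(bool *mutex) {
    while (test_and_set(mutex))   /* keep retrying until the old value was false (unlocked) */
        ;                         /* busy-wait: the thread spins on the CPU */
}

void spin_unlock(bool *mutex) {
    *mutex = false;               /* release the lock */
}
```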
C11 Atomics¶
- New atomic type `_Atomic(...)`, which can wrap the basic types to make them atomic. All standard ops (e.g., +, −, /, ∗) on them become sequentially consistent.
- `atomic_flag` is a special type. It behaves like a boolean, but you can't directly load/store its value; use `atomic_flag_test_and_set` and `atomic_flag_clear` instead (a minimal example follows).
- The standard requires that `atomic_flag` be implemented without using locks.
- lock-free: never waits for another thread to release a lock or finish its operation.
- This means it must complete without the thread being interrupted/preempted.
- All other types might require locks.
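For example, `atomic_flag` and its two operations are enough to build a minimal spinlock in portable C11:

```c
#include <stdatomic.h>

atomic_flag flag = ATOMIC_FLAG_INIT;

void acquire(void) {
    while (atomic_flag_test_and_set(&flag))  /* returns the previous value; spin while it was set */
        ;
}

void release(void) {
    atomic_flag_clear(&flag);
}
```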
Cache Coherence¶
- Caches are divided into slices of bytes called cache lines
- When data is loaded, it loads an entire cache line, not just a single byte or word.
- Reason: to exploit locality — when you access one byte, nearby bytes are likely to be used soon.
- Each cache line is in one of three states:
- Shared
- One or more caches (and memory) have a valid copy.
- The data is valid and identical in multiple caches (and possibly in memory).
- Modified (aka Exclusive)
- Only one core has the latest, valid copy.
- That copy is called dirty
- That cache has changed the data (dirty) and memory is not yet updated.
- Other cores’ copies are stale or invalid.
- Before another core can read/write, this core must:
- Write back to memory (if needed)
- Invalidate other copies
- Invalid:
- The data is either not present in cache or is stale and must be updated
- Changing a cache line’s state (e.g., from Shared to Modified) involves coordination and can take time:
- 100–2000 clock cycles, depending on architecture and cache hierarchy.
- Example flow using MSI protocol:
- Initial load
- x = 42 is loaded by Core 0 → x is in Shared state in Core 0’s cache
- Core 1 also loads x
- Now both caches have x in Shared state
- Core 0 writes to x
- It must change x to Modified state
- The hardware:
- Invalidates x in Core 1’s cache
- Core 0 now owns the only valid copy
- Memory is marked out of date until write-back
- If Core 1 reads x again
- It triggers a coherence transaction:
- Core 0 must write back x to memory
- Or transfer it directly to Core 1
- Core 1 can then cache it again

- Implications for Multithreaded Design
- Lesson #1: Avoid false sharing
- False sharing occurs when different threads access different variables that happen to be on the same cache line.
- Even if threads don’t share the same data logically, the processor will treat the whole cache line as shared, causing coherence overhead.
- Tip: Separate data used by different threads. Example: pad arrays so each element starts at a new 64-byte boundary (typical cache line size).
- Lesson #2: Align structures to cache lines
- Keep data that is used together in the same cache line to minimize memory traffic.
- In C11/C++11, use `alignas(64)` to tell the compiler to place a structure on a 64-byte boundary, aligning it with a cache line (see the sketch after this list).
- Lesson #3: Pad data structures
- If you have arrays of structures, and each structure is used by a different thread, they may accidentally share cache lines.
- Fix: Add unused fields (padding) to force structures to not share cache lines.
- Lesson #4: Avoid contending on cache lines
- Multiple threads accessing the same cache line, even if only for reading/writing different variables, causes cache coherence traffic.
- Advanced locking algorithms like MCS locks let each thread spin on a local cache line (private to its core), reducing this overhead.
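A sketch combining lessons #2 and #3, assuming 64-byte cache lines (the counter type and thread count are made up for illustration):

```c
#include <stdalign.h>

#define CACHE_LINE  64
#define NUM_THREADS 4            /* hypothetical thread count */

/* Pad each per-thread counter to a full cache line so two threads
 * never write to the same line (avoids false sharing). */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* alignas(64) places the array itself on a cache-line boundary. */
static alignas(CACHE_LINE) struct padded_counter counters[NUM_THREADS];
```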
Deadlock¶
- Lesson: it is dangerous to hold locks when crossing abstraction barriers, e.g. a top-level function holds lock A and then calls a lower-level function that uses condition variables and needs locks A and B to signal a thread to wake up.
- Deadlock conditions (All four of these conditions are necessary for deadlock to occur):
- Limited access (mutual exclusion): resources can only be shared with a finite number of threads
- No preemption of resources: once a resource is granted, it cannot be taken away
- Multiple independent requests (hold and wait): DO NOT HOLD AND WAIT (otherwise, the lower level functions you call may require the locks you are holding. So if you’re holding lock A, don’t try to acquire lock B while still holding A.)
- A thread is holding one or more resources and also waiting for another.
- Eg: Thread A holds lock X and wants lock Y; Thread B holds lock Y and wants lock X
- If threads never held anything while waiting, deadlock wouldn’t happen.
- Circularity in graph of requests
- If we prevent any one of these conditions, deadlock cannot happen
- Prevention:
- Buy more resources, split into pieces, or virtualize to make "infinite" copies.
- Physical memory: virtualized with VM, can take physical page away and give to another process
- A process holds a physical memory page.
- Another process needs that page to continue.
- Normally, it would be stuck waiting (classic deadlock).
- But with VM, the OS can reclaim that page from the first process and give it to the second, avoiding the deadlock.
- Wait on all resources at once (must know in advance), i.e. do not hold and wait.
- A thread should request all the resources it will need at the same time, before doing any work — instead of asking for them one at a time while holding others.
lock_both(a, b); // wait until *both* A and B are available
- Partial ordering of resources
- Processes, threads, and resources are nodes; resource requests and assignments are edges. No cycles → no deadlock. If the graph contains a cycle, it is definitely a deadlock when there is only one instance per resource type (e.g. one mutex `m1`); otherwise it may or may not be a deadlock (e.g. multiple pages in RAM).
- Prevent deadlock with a partial order on resources: e.g., always acquire mutexes in a specific order, such as `m1` before `m2`. Then the resource graph can never contain a cycle → no deadlock (a sketch follows below).
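A common way to impose such a partial order in practice is to sort by address before locking; a sketch with pthread mutexes:

```c
#include <pthread.h>
#include <stdint.h>

/* Always acquire any pair of mutexes in the same global order
 * (here: by address), so the resource graph can never contain a cycle. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a; a = b; b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}
```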
Synchronization Implementation¶
- Mutexes, semaphores, and condition variables (but not spinlocks) use wait channels. They manage a list of sleeping threads. Each mutex, etc, gets its own wait channel.
- Hand-over-hand locking allows for fine-grained locking, i.e. many locks. Useful for concurrent access to a single data structure.
- It's different from `lock(A); lock(B);`, i.e. holding A while blocking on B.
- Hold at most two locks at a time, the previous lock and the next one, as you work your way through a sequence of steps (see the sketch below).
- Locks must be ordered in sequence
- At any point, the thread only locks the current and next node
- Why don’t we unlock A before locking B (example on page 191)? Because doing that would create a race condition — another thread could jump in and modify or delete B before you lock it, causing data corruption or crashes.
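A sketch of hand-over-hand (lock coupling) traversal of a singly linked list with one pthread mutex per node (the `node_t` type here is hypothetical):

```c
#include <pthread.h>
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
    pthread_mutex_t lock;     /* one lock per node: fine-grained locking */
} node_t;

node_t *find(node_t *head, int target) {
    pthread_mutex_lock(&head->lock);
    node_t *prev = head;
    node_t *cur = head->next;
    while (cur != NULL) {
        pthread_mutex_lock(&cur->lock);     /* lock the next node before releasing the previous */
        if (cur->value == target) {
            pthread_mutex_unlock(&prev->lock);
            pthread_mutex_unlock(&cur->lock);
            return cur;
        }
        pthread_mutex_unlock(&prev->lock);  /* now safe to drop the earlier lock */
        prev = cur;
        cur = cur->next;
    }
    pthread_mutex_unlock(&prev->lock);
    return NULL;
}
```

Keeping the previous node locked until the next one is held is exactly what prevents another thread from modifying or deleting the node you are about to lock.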
- Lock order:
- first acquire
Spinlock sem_lock - then
WaitChannel_lock - then release
Spinlock sem_lock - While the
spinlock sem_lockis being held, no other thread can access this struct. AfterWaitChannel_Lockis called, no other thread can access this wait channel until the calling thread is put to sleep and the kernel releases the WaitChannel’s lock.
- first acquire
- Why do we need to lock the wait channel in the semaphore if we already have a spinlock protecting the struct?
- We can't call sleep while still holding the spinlock, because that would block other threads from calling `Post()` and waking us up, leading to deadlock.
- But if we didn't lock the wait channel before going to sleep, another thread might `Post()` and try to wake us before we're actually sleeping. That wake-up would be lost, and we'd end up sleeping forever.
- To avoid both problems, we first lock the wait channel (to block wake-ups), then release the spinlock, and finally go to sleep. This guarantees safe sleeping and prevents lost wake-ups.
-
wait channel also uses spinlock, like it has its own internal
spinlock, not shared with semaphore. BothWaitChannel_Sleep()andWaitChannel_Wake()unlock the wait channel as part of their operation.


