Threads and Concurrency Problems
Problem 1: Thread vs. Process
Briefly explain how a thread differs from a process.
Instructor Solution
A process is an independent container with its own address space storing code, static variables, and a heap for dynamically allocated memory, plus other resources such as open files, sockets, and semaphores. A thread is a unit of execution that exists inside a process. Each process starts with one thread but can have any number of threads. Each thread has its own stack for local variables but shares all other resources of the process with its sibling threads. While communication between processes requires IPC (shared memory, pipes, message queues), threads can communicate directly via shared variables.
Elaboration (by LLM)
Process as a Container:
A process is a heavyweight abstraction that the OS uses to isolate program execution. Each process has:
- Its own memory address space (code, data, heap)
- Its own file descriptor table
- Its own process ID (PID)
- Complete isolation from other processes
Thread as Execution Within a Container:
A thread is a lightweight execution unit that exists within a process. All threads in the same process:
- Share the same memory address space
- Can directly access each other’s data without IPC
- Have individual execution states (program counter, registers)
- Are scheduled independently by the OS
Resource Comparison Table:
| Resource | Process | Thread (in same process) |
|---|---|---|
| Address space | Own | Shared |
| File descriptors | Own | Shared |
| Heap | Own | Shared |
| Stack | Own | Own |
| Registers | Own | Own |
| Program counter | Own | Own |
Cost Implications:
- Process creation: Expensive (allocate memory, set up page tables, initialize resources)
- Thread creation: Cheap (just need a stack and TCB)
- Context switch between processes: Slow (flush TLB, switch memory maps)
- Context switch between threads: Fast (same memory, just switch registers and stack)
Communication Overhead:
- Between processes: Requires explicit IPC (pipes, sockets, shared memory) with kernel involvement
- Between threads: Direct memory access, just like function calls within a program
Problem 2: Thread State Sharing
What state (information) does a thread share with other threads in the same process and what state (information) is private/specific to a thread? Be explicit in your answer.
Instructor Solution
Shared State:
All threads in the same process share the same address space and system resources:
- Code Segment: The actual machine instructions.
- Data Segment: Global and static variables.
- Heap: Dynamically allocated memory (via malloc or new).
- OS Resources: Open file descriptors, sockets, signal handlers, and semaphores.
Private State:
To execute its own flow of control, each thread maintains its own execution context:
- Stack: Stores local variables, function parameters, and return addresses.
- Execution Context: Program Counter, Stack Pointer, general purpose registers.
- Thread ID (TID): A unique identifier for the thread.
Elaboration (by LLM)
Visual Memory Layout:
```
┌─────────────────────────────┐
│ Code Segment (shared)       │ ← All threads execute same code
├─────────────────────────────┤
│ Data/Global (shared)        │ ← All threads access same globals
├─────────────────────────────┤
│ Heap (shared)               │ ← malloc/new allocations
├─────────────────────────────┤
│ Stack (Thread 1 - private)  │ ← Local vars of thread 1
├─────────────────────────────┤
│ Stack (Thread 2 - private)  │ ← Local vars of thread 2
├─────────────────────────────┤
│ Stack (Thread 3 - private)  │ ← Local vars of thread 3
└─────────────────────────────┘
```
Why Each Thread Needs Its Own Stack:
- Function calls need their own stack frames
- Local variables must be independent (otherwise threads interfere)
- Return addresses must be tracked separately per thread
- Each thread has its own execution path through the same code
Why Shared Heap is Dangerous:
When threads share the heap:
```c
int *ptr = malloc(sizeof(int));
*ptr = 42;            // Thread 1
printf("%d", *ptr);   // Thread 2 - might see 42 or a different value
```
Race conditions occur when multiple threads access the same memory location simultaneously. Synchronization (mutexes, locks) is required.
Why File Descriptors are Shared:
```c
int fd = open("file.txt", O_RDONLY);
// Thread 1
read(fd, buf1, 100);  // Reads from current file position
// Thread 2 (simultaneously)
read(fd, buf2, 100);  // Reads from updated file position
```
Both threads affect the same file offset, requiring careful synchronization.
Problem 3: Thread Creation Resources
What resources does the OS allocate for a thread during thread creation? How do they differ from those allocated when a process is created?
Instructor Solution
The OS allocates only the bare essentials for a new thread:
- A Private Stack: For local variables, function parameters, and return addresses.
- Thread Control Block (TCB): To store register states (including the Program Counter and Stack Pointer), priority, and Thread ID.
Process creation is far more resource-intensive because the OS must establish a new container:
- Address Space: A process gets a brand new, isolated memory space with code, data and heap; a thread simply uses the parent’s existing map.
- Resource Duplication: Processes get their own copies of file descriptor tables; threads share these globally with all other threads in the process.
Elaboration (by LLM)
Thread Creation Overhead:
When creating a new thread, the OS:
- Allocates a stack (typically 1-8 MB)
- Creates a TCB storing:
- Thread ID
- Program Counter (initially points to thread function)
- Stack Pointer (points to new stack)
- Register values
- Scheduling priority
- State (ready, running, blocked)
- Adds thread to the process’s thread list
- Total time: microseconds to milliseconds
Process Creation Overhead:
When creating a new process, the OS must:
- Allocate new memory (separate address space)
- Copy or set up page tables
- Initialize code segment
- Initialize data segment
- Create empty heap
- Duplicate file descriptor table
- Create process control block (PCB)
- Initialize signal handlers
- Set up working directory
- Total time: milliseconds to seconds
Cost Comparison:
```
Thread creation:  ~1 microsecond
Process creation: ~1-10 milliseconds
Ratio: process creation is ~1000-10000x slower
```
Memory Overhead Comparison:
```
Thread:
  Stack: 1-8 MB
  TCB:   ~1 KB
  Total: ~1-8 MB per thread

Process:
  Address space:         varies (minimal ~10 MB)
  Page tables:           ~100s of KB
  File descriptor table: 1-10 KB
  PCB:                   ~1 KB
  Total: ~10-100+ MB per process
```
Why Threads Share Address Space:
Threads don’t need separate address spaces because:
- They execute the same code
- They benefit from shared data structures
- Isolation isn’t required (all threads are part of same application)
- Memory is precious on embedded systems
Why Processes Have Separate Address Spaces:
Processes need isolation because:
- Different programs shouldn’t crash each other
- Security: one program shouldn’t access another’s data
- Containment: buffer overflows in one process don’t affect others
- The cost of isolation is worth the protection gained
Problem 4: Multi-Threading Advantages
What are the advantages of multi-threading? Give an example when multi-threading is preferable to a single-threaded process.
Instructor Solution
Advantages of Multi-Threading:
- Responsiveness: Keeps an application interactive by running background tasks (like saving a file) without freezing the User Interface.
- Resource Sharing: Threads share memory and resources by default, making communication faster and more efficient than Inter-Process Communication (IPC).
- Economy: Thread creation and context switching are significantly “cheaper” (faster and use less RAM) than process creation.
- Scalability: Allows a single application to utilize multiple CPU cores effectively by executing tasks in parallel.
Example: Web Browser
Multi-threading is preferable especially for I/O-bound applications with a Graphical User Interface such as a Web browser. In a single-threaded browser, if you started downloading a large file, the entire window would freeze and become unresponsive until the download finished.
In a multi-threaded browser:
- Thread 1 handles user input (scrolling, clicking)
- Thread 2 handles the network download
- Thread 3 renders the images and text on the screen
Elaboration (by LLM)
Responsiveness in Depth:
Consider a text editor:
```
Single-threaded:
  User types 'H' → System searches dictionary → User waits 5 seconds
  (Everything freezes until search completes)

Multi-threaded:
  Thread 1: User types 'H' → Character appears immediately
  Thread 2: (concurrently) Searches dictionary in background
  (User never experiences freezing)
```
Resource Sharing Benefit:

Multiple single-threaded processes (one per task):
```
Process 1: GUI
Process 2: Download
Process 3: Rendering

Problem: They need to share downloaded data → requires IPC
         (pipes, sockets, or shared memory, all comparatively slow)
```
Multi-threaded approach:
```
Shared memory:
  ┌───────────────┐
  │ Downloaded    │
  │ HTML buffer   │ ← All threads access directly
  └───────────────┘

Thread 1: Reads from network → writes to buffer
Thread 2: Reads from buffer → renders HTML
Thread 3: Handles user input
```
CPU Utilization (Single vs Multi-core):
Single-core processor:
- Only one thread runs at a time
- Multi-threading still helps (while one thread waits for I/O, another runs)
Multi-core processor:
- Multiple threads run simultaneously
- True parallelism: download + rendering at exact same time
- Perfect for CPU-bound and I/O-bound tasks
Real-World Browser Example:
Google Chrome threads:
- Main thread: handles UI events
- IO thread: network operations
- Worker threads: JavaScript execution, image decoding
- Renderer threads: HTML/CSS rendering

Comparison: Single-threaded vs Multi-threaded Download:
```
Single-threaded (sequential):
  Download chunk 1 (100ms)
  Download chunk 2 (100ms)
  Download chunk 3 (100ms)
  Render HTML      (200ms)
  Handle input     (frozen until done)
  Total: 400ms frozen

Multi-threaded (concurrent):
  While downloading, rendering happens
  While downloading, input is processed
  User sees the progress bar updating and can scroll
  Total: user gets instant feedback
```
When Single-Threading is Sufficient:
- Batch processing with no user interaction
- CPU-bound computation (single processor anyway)
- Simple synchronous operations
When Multi-Threading is Essential:
- User interfaces (event handlers)
- Network servers (handling multiple clients)
- Real-time systems (logging, monitoring)
- Any application where responsiveness matters
Problem 5: Shared vs. Private Items
List 1 item that is shared by all threads, and 1 item that is private to each thread.
Instructor Solution
Shared: Code, global variables, heap
Private: Stack, Program Counter, Stack Pointer
Elaboration (by LLM)
Why Code is Shared:
```c
void printThread(int id) {
    printf("Thread %d\n", id);
}

// Both Thread 1 and Thread 2 execute the SAME function code
thread1 = CreateThread(printThread, 1);
thread2 = CreateThread(printThread, 2);
```
Both threads execute identical machine instructions from the code segment. Sharing the code segment saves memory and makes sense since the code doesn’t change.
Why Global Variables are Shared:
```c
int counter = 0;   // Global variable

void increment() {
    counter++;     // All threads modify the SAME counter
}

// Thread 1
increment();       // counter = 1
// Thread 2 (simultaneously)
increment();       // counter = ?
```
The address of counter is the same for all threads. Without synchronization, this causes race conditions.
Why Heap is Shared:
```c
int *ptr = malloc(sizeof(int));
*ptr = 42;

// Pass to another thread
Thread2UpdateValue(ptr);   // Thread 2 modifies same memory
printf("%d", *ptr);        // Thread 1 sees modified value
```
Heap allocations use addresses that are valid in the shared address space. All threads can access the same allocated memory.
Why Stack is Private:
```c
void function() {
    int local_var = 10;  // On THIS thread's stack
    // Thread 1's local_var at address 0x7FFF1000
    // Thread 2's local_var at address 0x7FFF2000
}

// Thread 1
function();  // local_var at 0x7FFF1000
// Thread 2
function();  // local_var at 0x7FFF2000 (different address!)
```
Each thread needs independent local variables, so each gets its own stack with separate memory locations.
Why Program Counter is Private:
```
Thread 1:              Thread 2:
line 10: x = 5         line 10: x = 5
line 11: y = x + 1     line 15: z = 100
line 12: ...           line 16: ...
```
Each thread is executing different lines of code (or the same line at different times). The program counter tracks where each thread is.
Why Stack Pointer is Private:
```
Thread 1:              Thread 2:
SP: 0x7FFF1000         SP: 0x7FFF2000
(points to own         (points to own
 stack frame)           stack frame)
```
Each thread’s stack grows independently, so each needs its own stack pointer.
Memory Layout Example:
```
╔═══════════════════════════════╗
║ Code (SHARED)                 ║ ← function() at 0x4000
║ Global Data (SHARED)          ║ ← counter at 0x6000
║ Heap (SHARED)                 ║ ← malloc() results
║                               ║
║ Thread 1 Stack (PRIVATE)      ║ ← SP1: 0x7FFF1000
║   local_var_T1 at 0x7FFF0F0   ║
║   param_T1 at 0x7FFF0F8       ║
║                               ║
║ Thread 2 Stack (PRIVATE)      ║ ← SP2: 0x7FFF2000
║   local_var_T2 at 0x7FFF1F0   ║
║   param_T2 at 0x7FFF1F8       ║
╚═══════════════════════════════╝
```
Implications for Programming:
- Shared data needs locks: `counter++` must be made atomic
- Passing stack variables is dangerous: passing the address of a local variable to another thread risks use-after-return
- Passing heap pointers is safe: heap data persists as long as it is not freed
- Global state is a liability: avoid globals; use thread-local storage if needed
Problem 6: Purpose of pthread_detach()
What is the purpose of the pthread_detach() function?
Instructor Solution
The purpose of pthread_detach() is to tell the operating system that a thread’s resources (like its stack and Thread Control Block) should be automatically reclaimed immediately upon its termination. Normally, a thread is “joinable,” meaning it stays in a “zombie” state after finishing until another thread calls pthread_join() to collect its exit status.
Using pthread_detach():
- Prevents Resource Leaks: It ensures the system doesn’t run out of memory if you create many threads but don’t intend to “join” them.
- Enables “Fire and Forget”: It allows the thread to run independently without the parent process needing to monitor its completion.
Elaboration (by LLM)
Thread Lifecycle:
```
Created → Runnable → Running → Blocked → Exiting → Zombie → Reaped
                                                      ↑
                                     Still consuming resources!
```
Joinable Thread (Default):
```c
pthread_t tid;
pthread_create(&tid, NULL, thread_function, NULL);

// ... thread runs and calls pthread_exit()
// Thread is now in ZOMBIE state, waiting to be joined

pthread_join(tid, NULL);  // Reap the thread, free its resources
```
Problems with Always Joining:
```c
for (int i = 0; i < 1000; i++) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    // BUG: We create 1000 threads but never join them!
    // All 1000 threads become zombies, consuming memory
}
```
Detached Thread:
```c
pthread_t tid;
pthread_create(&tid, NULL, thread_function, NULL);
pthread_detach(tid);  // Mark as detached

// ... thread runs and calls pthread_exit()
// Resources are automatically freed - no zombie state!
```
When to Use Detached Threads:
- Worker Thread Pool:
```c
for (int i = 0; i < 100; i++) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_detach(tid);  // Fire and forget
}
```
- Background Tasks:
```c
void save_file_async(const char *filename) {
    pthread_t tid;
    pthread_create(&tid, NULL, save_file_worker, (void *)filename);
    pthread_detach(tid);  // Don't wait for completion
}
```
- Server Handling Multiple Clients:
```c
while (accept_connection(&client_socket)) {
    pthread_t tid;
    pthread_create(&tid, NULL, handle_client, (void *)(intptr_t)client_socket);
    pthread_detach(tid);  // Handle each client independently
}
```
When to Use Joinable Threads:
- When you need the return value:
```c
void *result;
pthread_join(tid, &result);  // Get the thread's return value
printf("Thread returned: %ld\n", (long)result);
```
- When you need to wait for completion:
```c
pthread_create(&tid, NULL, computation_thread, NULL);
// ... do other work ...
pthread_join(tid, NULL);  // Wait for computation to finish
printf("Computation done!\n");
```
- Small, bounded number of threads:
```c
int num_threads = 4;
pthread_t tids[4];

for (int i = 0; i < num_threads; i++) {
    pthread_create(&tids[i], NULL, worker, NULL);
}
for (int i = 0; i < num_threads; i++) {
    pthread_join(tids[i], NULL);  // Wait for all workers
}
```
Resource Leak Example:
```c
// WRONG - Resource leak!
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    // Thread becomes a zombie when it exits
    // Resources never freed - leak!
}

// CORRECT - Detached thread
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_detach(tid);
    // Resources auto-freed when thread exits
}

// ALSO CORRECT - Joinable with explicit join
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_join(tid, NULL);  // Resources freed after join
}
```
Key Differences:
| Aspect | Joinable | Detached |
|---|---|---|
| Return value | Can get it | Lost |
| Synchronization | Explicit join | None |
| Resource cleanup | Manual (join) | Automatic |
| When to use | Small teams, need results | Worker pools, background tasks |
| Overhead | Stays in zombie state | Clean exit |
Problem 7: Multi-Threaded Process Termination
List 2 conditions under which a multi-threaded process terminates.
Instructor Solution
A multi-threaded process terminates under any of the following conditions:
(1) A thread calls exit(): If any thread calls the exit() system call, the entire process and all its associated threads are immediately destroyed by the kernel.
(2) Last Thread Exits: The process naturally terminates when the last remaining thread calls pthread_exit().
(3) Fatal Signal: If any single thread performs an illegal operation (like a Segmentation Fault) that results in a fatal signal, the operating system terminates the entire process.
Elaboration (by LLM)
Key Insight: Process Termination is Process-Wide:
Unlike individual thread termination, when a process terminates, ALL threads die instantly, regardless of their state.
Condition 1: exit() Immediately Kills All Threads:
```c
void *thread_func(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        printf("Thread working...\n");
        sleep(1);
    }
    pthread_exit(NULL);  // Gentle exit
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, NULL);
    sleep(2);
    exit(0);  // KILLS the thread immediately, even mid-loop!
              // Process ends, thread gets no cleanup chance
}
```
Why exit() is Dangerous:
- Bypasses thread cleanup
- Doesn’t call destructors (in C++)
- Closes file descriptors immediately
- Could corrupt files if thread was writing
- Doesn’t allow graceful shutdown
Condition 2: Last Thread Exiting (Graceful Termination):
```c
void *worker(void *arg) {
    printf("Thread starting\n");
    sleep(2);
    printf("Thread ending\n");
    pthread_exit(NULL);  // Graceful exit
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    printf("Main waiting...\n");
    pthread_join(tid, NULL);  // Wait for the thread
    printf("Thread done, main exiting\n");
    return 0;  // Process terminates
}
```
Timeline:
```
0ms:    Main creates thread
0ms:    Main prints "Main waiting..."
0ms:    Thread starts, prints "Thread starting"
2000ms: Thread exits, prints "Thread ending"
2000ms: Main resumes from join()
2000ms: Main prints "Thread done, main exiting"
2000ms: Process terminates (last thread exited)
```
Condition 3: Fatal Signal (Crash):
```c
void *bad_thread(void *arg) {
    int *ptr = NULL;
    *ptr = 42;  // Segmentation fault!
    return NULL;
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, bad_thread, NULL);
    pthread_join(tid, NULL);
    return 0;
}
```
When the thread dereferences NULL:
- OS sends SIGSEGV signal to process
- Default handler terminates entire process
- All threads die immediately
- Process never reaches pthread_join()
Comparing Termination Methods:
```c
// METHOD 1: Graceful (main returns normally)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);  // Wait for worker
    return 0;  // Clean exit
}

// METHOD 2: Abrupt (main calls exit)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);
    exit(0);  // KILLS the process immediately
}

// METHOD 3: Crash (signal)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, bad_thread, NULL);
    sleep(1);
    // If bad_thread crashes, the entire process dies
    return 0;
}
```
Best Practices:
- Always join threads before exiting:
```c
int main() {
    pthread_t tids[10];
    for (int i = 0; i < 10; i++) {
        pthread_create(&tids[i], NULL, worker, NULL);
    }
    for (int i = 0; i < 10; i++) {
        pthread_join(tids[i], NULL);  // IMPORTANT!
    }
    return 0;  // Clean exit
}
```
- Or call pthread_exit() in main:
```c
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    // Main thread exits, but the process stays alive
    pthread_exit(NULL);  // until all remaining threads finish
}
```
- Avoid calling exit() in threads:
```c
// WRONG
void *thread_func(void *arg) {
    // do work
    exit(0);  // KILLS the entire process!
}

// RIGHT
void *thread_func(void *arg) {
    // do work
    pthread_exit(NULL);  // Only this thread exits
}
```
Problem 8: Concurrency vs. Parallelism
Define concurrency and parallelism and briefly explain the difference between the two.
Instructor Solution
Concurrency is the art of managing multiple tasks at once by interleaving their execution, essentially “dealing with” many things at the same time on a single or multiple cores.
Parallelism is the act of physically executing multiple tasks simultaneously, which requires hardware with multiple processing units (like a multi-core CPU).
The core difference lies in execution: concurrency is about structural design and multitasking (making progress on Task A, then Task B), while parallelism is about simultaneous execution (Task A and Task B running at the exact same millisecond).
Elaboration (by LLM)
Concurrency: Illusion of Parallelism
Concurrency is a software concept—the arrangement of code to handle multiple tasks. It does NOT require multiple cores.
Single-Core Concurrency Example:
```
Time   CPU Core 1
─────────────────
0ms    Task A [----]
4ms    Task B [----]
8ms    Task A [----]
12ms   Task C [----]
16ms   Task B [----]
20ms   Done
```
Result: All tasks "appear" to run at once, but only one runs at a time; the OS rapidly switches between them.

Parallelism: True Simultaneous Execution
Parallelism is a hardware reality—multiple cores literally running code at the same instant. It REQUIRES multiple cores.
Dual-Core Parallelism Example:
```
Time   CPU Core 1      CPU Core 2
─────────────────────────────────
0ms    Task A [----]   Task B [----]
4ms    Task A [----]   Task B [----]
8ms    Task A [----]   Task B [----]
12ms   Task C [----]   Task B [----]
```
Result: Tasks A and B run simultaneously; Core 1 switches between A and C, while Core 2 runs B uninterrupted.

Key Differences Table:
| Aspect | Concurrency | Parallelism |
|---|---|---|
| Definition | Structural arrangement | Physical execution |
| Requirements | Software only | Requires multiple cores |
| CPU cores needed | 1+ | 2+ |
| Context switching | Yes | May have switching too |
| When speedup occurs | When waiting for I/O | Always (CPU-bound work) |
| Example | Web server handling requests | Matrix multiplication on 4 cores |
Visual Comparison:
```
Single-core Concurrency:
  [A][B][C][A][B][C]  ← One core time-slicing
  Sequential at any instant, but concurrent across time

Multi-core Parallelism:
  Core 1: [A][A][C][C]
  Core 2: [B][B][B][B]
  Simultaneous execution at the same instant
```
Why Concurrency Helps Even Without Parallelism:
```c
// Single-threaded (sequential) I/O
void fetch_websites() {
    download_site_a();  // 1000ms (blocked waiting for network)
    download_site_b();  // 1000ms (blocked waiting for network)
    download_site_c();  // 1000ms (blocked waiting for network)
    // Total: 3000ms
}

// Concurrent (multi-threaded) I/O
void fetch_websites_concurrent() {
    // Thread A: download_site_a()  // 1000ms
    // Thread B: download_site_b()  // 1000ms (simultaneous)
    // Thread C: download_site_c()  // 1000ms (simultaneous)
    // Total: ~1000ms (the waiting overlaps!)
}
```
Even on a single core:
- While Thread A waits for network, CPU switches to Thread B
- While Thread B waits, CPU switches to Thread C
- All three downloads overlap in time
Why Parallelism Requires Multiple Cores:
You cannot achieve true parallelism on a single core because:
- Only one instruction sequence can execute at any instant
- Context switching creates the illusion, but still sequential
- CPU has only one execution pipeline per core
Real-World Example: Web Server
Concurrency without parallelism (single core):
```
Client 1: [request] [wait] [respond]
Client 2:   [request] [wait] [respond]
Client 3:     [request] [respond]

All clients "feel" like the server is responsive;
the server is actually context-switching rapidly.
```
Parallelism (multi-core):
```
Core 1: Client 1: [request] [wait] [respond]
Core 2: Client 2: [request] [wait] [respond]
Core 3: Client 3: [request] [respond]

All clients run truly in parallel; no time-sharing needed.
```
Why Both Matter:
- Concurrency helps with responsiveness and I/O-bound work
- Parallelism helps with pure computation on multiple cores
- Multi-threaded app on multi-core: Best of both worlds!
Problem 9: Concurrency Without Multiple Cores
If you have a concurrent system, i.e., a multi-tasking or multi-threading OS, does the underlying hardware have to have multiple CPU cores? Why or why not?
Instructor Solution
No, the underlying hardware does not need multiple CPU cores to support a concurrent system. Concurrency is a software property managed by the operating system through context switching and time-slicing, where the CPU rapidly alternates between different threads or tasks. By giving each process a few milliseconds of execution time before switching to the next, the OS creates the illusion that multiple tasks are running simultaneously, even if only one instruction is being physically processed at any given moment on a single core.
Elaboration (by LLM)
Time-Slicing: The Core of Concurrency
On a single core, the OS divides CPU time into small chunks (time slices or time quanta, typically 10-100ms).
Timeline Example: Single Core Running 3 Threads
```
Thread A    Thread B    Thread C
[10ms]      [10ms]      [10ms]
[10ms]      [10ms]      [10ms]
[10ms]      [10ms]      [10ms]
```
Each thread gets a turn to run for 10ms, then the OS saves its state and switches to the next thread.
Why This Works:
- Save State: Program counter, registers, stack pointer
- Switch: OS scheduler picks next thread
- Restore State: New thread resumes from where it left off
- Repeat: Rapid cycling creates illusion of parallelism
Context Switch Overhead:
```
3 threads × 10ms per slice = 30ms per full cycle
The user doesn't notice the switching (it happens ~100 times per second)
All 3 threads appear to run "simultaneously"
```
Why Multiple Cores Would Help (But Aren’t Required):
```
Single core (sequential time-slicing):
  Thread A: [exec 10ms] [wait] [exec 10ms]
  Thread B: [exec 10ms] [wait] [exec 10ms]
  Thread C: [exec 10ms] [exec 10ms]
Result: All threads make progress, but not truly in parallel

Multi-core (true parallelism):
  Core 1 - Thread A: [exec continuous]
  Core 2 - Thread B: [exec continuous]
  Core 3 - Thread C: [exec continuous]
Result: All threads run simultaneously; faster execution
```
I/O-Bound Concurrency on Single Core:
Even with single core, concurrency excels with I/O:
```
Thread A (network request):
  [execute code] → [network I/O - BLOCKS] → [process response]

Thread B (disk read):
  [execute code] → [disk I/O - BLOCKS] → [process result]

Timeline:
0ms:   A executes code
5ms:   A calls read() for network → BLOCKS
5ms:   OS switches to B
8ms:   B executes code
10ms:  B calls read() for disk → BLOCKS
10ms:  Both blocked; the CPU is idle or runs other work
100ms: Network response arrives, A unblocks
100ms: OS switches to A
105ms: A processes network data
110ms: A does more work...
200ms: Disk response arrives, B unblocks
200ms: OS switches to B
205ms: B processes disk data
```
Both I/O operations happen concurrently (overlapped in time) even though only one core is doing computation.
Historical Context:
Early computers (1970s-2000s) had single-core processors, yet supported:
- Multi-tasking (multiple programs)
- Concurrency (multiple threads within programs)
- Responsiveness (background tasks while user interacts)
All achieved through context switching, not parallelism!
The Takeaway:
```
Concurrency = Software design property (can work with 1 core)
Parallelism = Hardware feature (requires 2+ cores)

A single core can support concurrent code
A single core cannot support parallel execution
```
When Single-Core Concurrency Works Well:
- I/O-bound applications (network, disk, user input)
- Applications where responsiveness matters more than speed
- Systems with limited hardware resources (embedded systems)
When Single-Core Concurrency Falls Short:
- CPU-bound computation (heavy math, processing)
- Real-time systems needing deterministic performance
- Applications that need true speedup
Problem 10: Parallelism and Hardware Requirements
To achieve parallelism, does the underlying hardware have to have multiple CPU cores? Why or why not?
Instructor Solution
Yes, to achieve true parallelism, the hardware must have multiple CPU cores (or multiple processors). Unlike concurrency, which can be faked on a single core through rapid switching, parallelism requires the physical ability to execute two or more instructions at the exact same moment. Without at least two independent processing units, tasks cannot run simultaneously; they can only ever be interleaved.
Elaboration (by LLM)
The Fundamental Requirement: Multiple Execution Pipelines
A single CPU core has one instruction pipeline. It can only execute one instruction sequence at any given nanosecond.
```
Single Core (1 Pipeline):
  ┌─────────────────────────┐
  │ Fetch → Decode → Exec   │ ← Only one instruction stream at a time
  └─────────────────────────┘

  Can't execute Task A and Task B simultaneously
  Can only switch back and forth (context switching)
```
Multi-Core (Multiple Pipelines):
```
Core 1 (Pipeline 1):
  ┌─────────────────────────┐
  │ Fetch → Decode → Exec   │ ← Task A running
  └─────────────────────────┘

Core 2 (Pipeline 2):
  ┌─────────────────────────┐
  │ Fetch → Decode → Exec   │ ← Task B running (simultaneously!)
  └─────────────────────────┘

Result: At the same nanosecond:
  Core 1 executes an instruction from Task A
  Core 2 executes an instruction from Task B
  TRUE PARALLELISM
```
Why Concurrency Alone Doesn’t Give Parallelism:
```c
// Two CPU-bound tasks
void task_a() {
    for (int i = 0; i < 1000000; i++) {
        result += i * i;  // Pure computation
    }
}

void task_b() {
    for (int i = 0; i < 1000000; i++) {
        result += i * i;  // Pure computation
    }
}
```
```
Single-core execution (concurrency, NO parallelism):
Time:  0ms    10ms    20ms    30ms    40ms
      [Task A][Task B][Task A][Task B][Task A][Task B]

Both tasks make progress, but never truly simultaneously
Total time: ~40ms (both tasks' times summed)

Dual-core execution (parallelism):
Core 1: [Task A____________][Task A____]
Core 2: [Task B____________][Task B____]

Both tasks run truly simultaneously
Total time: ~20ms (only the max time needed) - 2x faster!
```
The Speedup Formula:
```
Single-core: Total time = Task A time + Task B time
Multi-core:  Total time = max(Task A time, Task B time)

With 2 cores: Speedup = (A + B) / max(A, B)
If A ≈ B:  Speedup ≈ 2x
If A >> B: Speedup ≈ 1x
```
Hardware Parallel Execution Example:
```c
int a[1000], b[1000], c[1000];

// Core 1               Core 2
a[0] = 1;               a[1] = 1;
a[0] = a[0] + 2;        a[1] = a[1] + 2;
b[0] = a[0];            b[1] = a[1];
c[0] = b[0];            c[1] = b[1];

// ↓ Each row happens at the SAME TIME on different cores
```
Misconception: Hyper-Threading Counts as Extra Cores
Intel’s hyper-threading creates the ILLUSION of 2 cores per physical core:
```
Physical Core 1 (hyper-threaded):
  Pipeline 1: Thread A instructions
  Pipeline 2: Thread B instructions

BUT: Both share the same execution resources
NOT true parallelism, but better than pure time-slicing
Can exploit instruction-level parallelism

Result: ~1.2x-1.5x speedup (not 2x like a real dual-core)
```
Why You Need Multiple Cores for Parallelism:
- One instruction stream per core: a single core cannot run two independent instruction sequences at once; even a superscalar core that issues several instructions per cycle draws them all from one stream
- Physical limitation: the fetch/decode front end follows a single program counter, no matter how many ALUs the core has
- Memory bandwidth: Additional cores have their own caches and paths to RAM
- No alternative: You MUST have 2+ cores for true parallelism
Parallelism in Modern Processors:
```
Quad-core CPU:
┌─────────────────────┐
│ Core 1: [Pipeline]  │
│ Core 2: [Pipeline]  │
│ Core 3: [Pipeline]  │
│ Core 4: [Pipeline]  │
└─────────────────────┘
```

Each core can run a different thread/task simultaneously. Useful for:

- Image processing (4 threads processing 4 regions)
- Web server (4 cores handling 4 clients)
- Parallel algorithms (divide work across cores)

Key Limitation: Amdahl's Law

Even with parallelism, speedup is limited:

If 10% of a program is sequential (must run on 1 core) and 90% is parallelizable (can run on N cores):

Speedup with 4 cores ≈ 1 / (0.1 + 0.9/4) ≈ 3.1x (not 4x!)

Summary:
| Feature | Single Core | Multi-Core |
|---|---|---|
| Concurrency | Yes (context switching) | Yes (still needed) |
| Parallelism | No (impossible) | Yes |
| True speedup on CPU-bound work | No | Yes |
| Hardware cost | Low | Higher |
| Synchronization complexity | Lower | Higher |
Problem 11: Multi-Threading on Single-Core
Section titled “Problem 11: Multi-Threading on Single-Core”Give one reason why you will want to implement a multi-threaded application in a single-core machine.
Instructor Solution
The primary reason is to improve responsiveness during I/O-bound tasks. In a single-threaded application, if the program waits for a slow resource (like a disk read, a network response, or user input), the entire process—including the UI—blocks and freezes. In a multi-threaded system, while one thread is “blocked” waiting for I/O, the OS can switch the CPU to another thread. This allows the application to remain interactive or perform background calculations while the hardware handles the data transfer in the background.
Elaboration (by LLM)
The I/O Blocking Problem:
Modern applications spend most time waiting for I/O, not computing:
```c
// Single-threaded approach
int main() {
    printf("Fetching data...\n");
    data = network_read();   // BLOCKS for 2 seconds
    // During these 2 seconds: the process CANNOT respond to user input
    // and the CPU sits idle, waiting for the network

    process_data(data);      // Computation: 100ms
    printf("Done!\n");
}
```

The user experiences a 2-second freeze while the program waits for the network.
Multi-Threaded Solution:
```c
void *network_thread(void *arg) {
    data = network_read();   // BLOCKS for 2 seconds
    return NULL;
}

void *ui_thread(void *arg) {
    while (true) {
        handle_user_input();  // ALWAYS responsive
        update_display();     // Show progress
        usleep(10000);        // Poll every 10 ms (sleep(10) would pause 10 s!)
    }
    return NULL;
}

int main() {
    pthread_t net_tid, ui_tid;

    pthread_create(&net_tid, NULL, network_thread, NULL);
    pthread_create(&ui_tid, NULL, ui_thread, NULL);

    pthread_join(net_tid, NULL);
    pthread_cancel(ui_tid);

    return 0;
}
```

Timeline Comparison:
Single-threaded (FREEZES):

```
0ms:    Network request starts
0ms:    UI frozen (cannot handle input)
2000ms: Network response arrives
2000ms: UI becomes responsive
2000ms: Process data (100ms)
2100ms: Done
```

User experience: a 2-second freeze!

Multi-threaded (RESPONSIVE):

```
0ms:    Network thread starts request
0ms:    UI thread immediately handles input
0ms:    User can interact with UI
50ms:   User sees "Downloading..." message
100ms:  User clicks "Cancel" button (UI immediately responds)
500ms:  UI shows progress (20% downloaded)
1000ms: UI shows progress (50% downloaded)
2000ms: Network response arrives
2000ms: Process data (100ms)
2100ms: UI shows results
```

User experience: smooth and responsive!

Real-World Example: Email Client
Single-threaded:
Single-threaded:

```
User clicks "Send Email"
→ Compose code finishes
→ Network code starts, blocks while sending...
→ User CANNOT check mail, CANNOT type a new email
→ After 30 seconds: finally sent, UI responsive again
```

Multi-threaded:

```
User clicks "Send Email"
→ Send thread spawned (handles the I/O)
→ UI thread continues immediately
→ User can check mail while sending
→ User can draft a new email while sending
→ All while the network thread waits for the server
```

Why Single-Core is Still Useful:
When network blocks on Core 0:
```
Time     Thread A (UI)      Thread B (Network)
────────────────────────────────────────────
0ms      [handles input]    [network request]
5ms      [renders]          [blocked waiting...]
10ms     [handles input]    [blocked waiting...]
15ms     [renders]          [blocked waiting...]
...
2000ms   [handles input]    [response arrived!]
```

While Thread B waits (the CPU would otherwise be 100% idle), Thread A uses the CPU for the UI.
This is the KEY benefit of concurrency on single-core: overlapping I/O with computation.
I/O Wait Time Hierarchy:
```
Memory access:   ~100 nanoseconds
Disk seek:       ~10 milliseconds   (100,000x slower!)
Network request: ~100 milliseconds  (1,000,000x slower!)
User think time: ~3 seconds         (30,000,000x slower!)
```

A single thread waiting 100ms for the network is a huge opportunity for other threads!
Common I/O-Bound Scenarios:
1. Web Server:
   - Thread 1: Handles Client A (blocked on network read)
   - Thread 2: Handles Client B (computing response)
   - Thread 3: Handles Client C (blocked on disk read)

   Each client thinks the server is responsive, even on a single core!

2. Database Client:
   - Thread 1: UI (handles mouse clicks, keyboard)
   - Thread 2: Database query (blocked on DB server response)

   The user can cancel the query or start a new one while the previous one blocks.

3. File Downloader:
   - Thread 1: Download 1 (blocked waiting for network)
   - Thread 2: Download 2 (blocked waiting for network)
   - Thread 3: Download 3 (blocked waiting for network)

   All three downloads overlap in time (concurrent I/O).
The Golden Rule:
For I/O-bound applications: Even 1 core + multi-threading > 1 core + single thread
Why? Because:

- N threads can have N I/O operations in flight
- When one blocks, the CPU switches to another
- The CPU is never idle (unless all threads are blocked)

For CPU-bound applications: multi-threading on 1 core helps responsiveness, but NOT computational performance.

Practical Example with Numbers:
Task: Download 3 files (each 10MB, 1 second over network)
```
Single-threaded:
  Download file 1: [1000ms network I/O]
  Download file 2: [1000ms network I/O]
  Download file 3: [1000ms network I/O]
  Total: 3000ms

Multi-threaded (1 core):
  Thread 1: Download file 1 [1000ms I/O]
  Thread 2: Download file 2 [1000ms I/O, concurrent]
  Thread 3: Download file 3 [1000ms I/O, concurrent]
  Total: ~1000ms (the network operations overlap!)
```

The network doesn't care about cores; it operates in parallel by nature!
Problem 12: Many-to-One Thread Mapping and I/O
Section titled “Problem 12: Many-to-One Thread Mapping and I/O”Consider an OS where the threads are implemented by a user-level thread library and all threads of a process are mapped to just one kernel thread (many-to-one mapping). Consider a process that has two threads, T1 and T2, and T1 is running.
True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.
Instructor Solution
Answer: False.
In a many-to-one mapping, the operating system only sees the single kernel thread, not the individual user-level threads. When T1 executes a blocking I/O system call, the entire kernel thread (and thus the entire process) is placed into a “Blocked” state by the OS. Because the kernel is unaware that T2 even exists or is ready to run, it cannot schedule it; the user-level library never gets the chance to swap T1 for T2 until the kernel thread becomes “Ready” again.
Elaboration (by LLM)
The Kernel’s Limited Visibility:
The problem hinges on what the kernel can “see”:
Many-to-One Thread Mapping:
```
User-Level View:
┌─────────────────────────────────────────┐
│ Single Process (User-level Library)     │
│   ┌─────────┐   ┌─────────┐             │
│   │ Thread1 │   │ Thread2 │  (Managed   │
│   │ (Ready) │   │ (Ready) │  by library)│
│   └─────────┘   └─────────┘             │
└─────────────────────────────────────────┘
              ↓ (hidden from kernel)
Kernel's View:
┌─────────────────────────────────────────┐
│ Single Schedulable Entity               │
│ (One Kernel Thread = LWP)               │
│ (Kernel thread is "Blocked")            │
└─────────────────────────────────────────┘
```

The kernel only manages the single kernel thread. User threads are merely an abstraction—like library bookmarks that track which function is executing.
The Blocking Disaster:
Scenario: T1 makes a blocking read
```c
// Thread 1 in user space
int data[100];
int bytes = read(fd, data, sizeof(data));  // Blocking I/O!
```

What happens:

```
Time    Kernel View                   User Library View
────────────────────────────────────────────────────
0ms     KLT executing read()          T1 executing read()
        (dispatched to CPU)           T2 waiting (Ready)

5ms     read() syscall issued         OS takes control
        OS blocks waiting for I/O
        (KLT blocked)

10ms    KLT still blocked             T2 TRAPPED
        (waiting for disk/network)    Cannot run!

100ms   I/O complete                  T1 can resume
        KLT becomes Ready             (but not yet)

105ms   Kernel reschedules KLT        User library gets control
        (only resource it knows)      Could now run T2
```

But the OS doesn't know about T2's existence!

Why the User Library Can't Help:
The user-level library (like pthreads on many-to-one) cannot intercept kernel system calls:
```c
// The user library cannot see this:
void *thread_func(void *arg) {
    read(fd, buf, 100);  // ← Direct syscall to the kernel
                         // ← The user library has NO control here
    return NULL;
}
```

When T1 calls read():
- CPU switches from user mode to kernel mode
- User library is paused (no execution)
- Only kernel code runs
- Kernel doesn’t know about T2
- Entire process blocked
Result: T2 cannot run!
Comparison Table:
```
┌──────────────────┬──────────────┬──────────────┐
│ Scenario         │ Many-to-One  │ One-to-One   │
├──────────────────┼──────────────┼──────────────┤
│ T1 blocks on I/O │ T2 CANNOT    │ T2 CAN       │
│                  │ run          │ run          │
├──────────────────┼──────────────┼──────────────┤
│ Kernel sees      │ 1 entity     │ 2 entities   │
│                  │ (blocked)    │ (one ready)  │
├──────────────────┼──────────────┼──────────────┤
│ Scheduler can... │ Switch to    │ Switch to T2 │
│                  │ other        │ directly     │
│                  │ processes    │              │
├──────────────────┼──────────────┼──────────────┤
│ Used in          │ Green threads│ Linux, Unix  │
│                  │ (obsolete)   │ (modern)     │
└──────────────────┴──────────────┴──────────────┘
```

Real-World Example: Web Server
Many-to-One (FAILS):
```
Thread 1: Handle Client A request
→ read() from network (blocking)
→ OS blocks the entire process
→ Thread 2 CANNOT handle Client B
→ Client B waits (poor responsiveness)
→ After network I/O completes: Thread 1 resumes
```

One-to-One (WORKS):

```
Thread 1: Handle Client A request
→ read() from network (blocking)
→ OS blocks only Thread 1's kernel thread
→ Scheduler immediately picks Thread 2
→ Thread 2 handles Client B (responsive!)
→ Client B doesn't wait for Client A
```

Why Many-to-One Fails:
The fundamental problem: The OS controls scheduling, not the user library.
Control Flow with Many-to-One:
```
User Library: [Init] → [decide to switch T1→T2] → ...
              (but this decision is useless)

Kernel: [receives syscall]
        → [KLT blocked]
        → [User library frozen, cannot execute]
        → [User library cannot switch threads]
        → [Only the kernel can resume execution]
```

Result: the user library's thread-switching logic never runs!

When Blocking Occurs:
Safe (Non-blocking) I/O Scenarios:
```
1. Polling:
   T1: data = read_nonblocking(fd);  // Returns immediately
       if (data) { process(data); }
   T2: can run between poll() calls

2. Asynchronous I/O:
   T1: aio_read(fd, &cb);   // Async, returns immediately
   T2: can run while the kernel handles I/O in the background

But with standard blocking I/O:
   T1: read(fd, buf, 100);  // BLOCKS
   T2: TRAPPED (user library powerless)
```

Historical Note: Green Threads
Java and some older systems used many-to-one:
```
Java 1.0 - 1.2 (green threads):
  ❌ All user threads → 1 OS thread
  ❌ Blocking I/O in one thread froze the entire app
  ❌ Limited to a single core (no parallelism)

Java 1.3+ (native threads):
  ✅ One user thread → 1 OS thread
  ✅ Blocking doesn't freeze the app
  ✅ Can utilize multiple cores
```

Why the change? Because many-to-one is fundamentally broken for blocking I/O!

The Key Lesson:
```
Blocking System Call + Many-to-One Thread Mapping
        = Entire Process Freezes
(The user library has no control once the kernel takes over)
```

This is why modern OSes (Linux, Windows, macOS) use one-to-one mapping: the kernel must be aware of each thread for true concurrency with blocking I/O.
Problem 13: One-to-One Thread Mapping and I/O
Section titled “Problem 13: One-to-One Thread Mapping and I/O”Consider an OS where each user-level thread has a corresponding kernel-level thread (one-to-one mapping) as in Linux. Consider a process that has two threads, T1 and T2 and T1 is running.
True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.
Instructor Solution
Answer: True.
In a one-to-one mapping, every user-level thread is a distinct schedulable entity in the eyes of the kernel. Because the OS kernel manages each thread individually, it recognizes that only T1 is waiting for I/O. While T1 is in the “Blocked” state, the kernel’s scheduler can independently pick T2 from the “Ready” queue and dispatch it to a CPU core.
Elaboration (by LLM)
Kernel Awareness of Each Thread:
In one-to-one mapping, the kernel maintains separate TCBs for each thread:
One-to-One Mapping:
```
User Process:
┌─────────────────┐     ┌─────────────────┐
│    Thread 1     │     │    Thread 2     │
└────────┬────────┘     └────────┬────────┘
         │                       │
         ↓                       ↓
Kernel Thread Manager:
┌─────────────────────────────────────────────────┐
│ TCB 1: (T1's context)                           │
│        State: Blocked (I/O)                     │
├─────────────────────────────────────────────────┤
│ TCB 2: (T2's context)                           │
│        State: Ready                             │
└─────────────────────────────────────────────────┘
         ↓
Scheduler: picks KLT2 freely! T2 runs!
```

Unlike many-to-one, where the kernel sees one blocked entity, here it sees two independent schedulable entities.
Key Difference from Problem 12:
When T1 blocks on I/O in one-to-one mapping:
- Kernel knows about TCB1 AND TCB2 (both visible)
- Kernel blocks only TCB1 (puts it in I/O wait queue)
- TCB2 remains in Ready queue (still schedulable)
- Scheduler freely picks TCB2 to run
- T2 executes while T1 waits on network/disk
Real-World Impact - Web Server Example:
With 1000 clients and 1 core:
Many-to-One (Problem 12):
- Client 1 → Thread 1 (blocked on read)
- Process blocked
- Clients 2-1000 wait sequentially
- Response time: ~1000 seconds
One-to-One (Problem 13):
- Thread 1: Client A (blocked)
- Thread 2: Client B (executing)
- Thread 3: Client C (executing)
- CPU time-shares among 1000 threads
- Network handles I/O in parallel
- Response time: ~10 seconds
The kernel schedules based on what it can SEE—and in one-to-one, it sees all threads!
Problem 14: fork() with Multiple Threads
Section titled “Problem 14: fork() with Multiple Threads”Consider a process with 4 threads in Linux. When a thread invokes fork() system call to create a new process, how many threads will the new process have? Briefly explain.
Instructor Solution
In Linux, the new process will have only one thread. When a thread calls fork(), the operating system creates a child process that is a duplicate of the caller’s address space. However, only the calling thread is duplicated into the child process; the other three threads simply cease to exist in the new process.
This behavior prevents complex issues, such as a child process inheriting threads that were in the middle of a sensitive operation or holding a mutex that will never be released (since the thread holding it wasn’t copied).
Elaboration (by LLM)
Why Only the Calling Thread?
When all threads were hypothetically copied:
```c
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *thread2_func(void *arg) {
    pthread_mutex_lock(&lock);
    // Thread 2 is HERE (middle of critical section)
    // ... doing critical work ...
    pthread_mutex_unlock(&lock);
}

int main() {
    pthread_t tid2, tid3, tid4;
    pthread_create(&tid2, NULL, thread2_func, NULL);
    pthread_create(&tid3, NULL, other_work, NULL);
    pthread_create(&tid4, NULL, other_work, NULL);

    fork();  // DISASTER if all threads were copied!
}
```

If all threads were duplicated:
Child Process After fork():
- Thread 2 (copied): Has lock, in middle of critical section
- Cannot unlock (state undefined)
- Mutex is corrupted in child!
- Thread 3 (copied): Tied to parent’s I/O operations
- Thread 4 (copied): Managing memory
- Result: Complete chaos, deadlocks, data corruption!
The Safe Solution: Copy Only Calling Thread
```c
int main() {
    pthread_t tid2, tid3, tid4;

    pthread_create(&tid2, NULL, worker, NULL);
    pthread_create(&tid3, NULL, worker, NULL);
    pthread_create(&tid4, NULL, worker, NULL);

    pid_t child_pid = fork();  // Safe!

    if (child_pid == 0) {
        // Child process
        printf("Child has 1 thread\n");  // Only the calling thread

        // Safe to do child-specific work
        close(shared_fd);  // Don't interfere with the parent
        _exit(0);
    } else {
        // Parent process
        printf("Parent still has 4 threads\n");

        pthread_join(tid2, NULL);
        pthread_join(tid3, NULL);
        pthread_join(tid4, NULL);
    }
}
```

Thread Count After fork():
```
Before fork(): 4 threads (1 main, 3 workers)
        ↓
After fork():  Parent has 4 threads (unchanged)
               Child has 1 thread (calling thread only)
```

Common Problem: Lost Worker Threads
```c
void *worker_thread(void *arg) {
    while (true) {
        job_t *job = get_next_job();
        process_job(job);
    }
}

int main() {
    // Create a thread pool (8 worker threads)
    for (int i = 0; i < 8; i++) {
        pthread_create(&workers[i], NULL, worker_thread, NULL);
    }

    fork();  // Oops!
}
```

In the child process:
- All 8 workers are GONE
- Only main thread remains
- If child tries to add jobs: queue fills up
- No workers to process them: DEADLOCK
Best Practice: Call fork() before creating threads!
Problem 15: exec() with Multiple Threads
Section titled “Problem 15: exec() with Multiple Threads”Consider a process with 4 threads in Linux. When a thread invokes exec() system call to load a new executable on top of the current process, how many threads will the new process have? Briefly explain.
Instructor Solution
The new process will have only one thread. When a thread calls exec(), the entire address space of the process is replaced with the new executable. Because the old code, data, and stacks are wiped out to make room for the new program, all existing threads—including the one that called exec()—are terminated. The new program then begins execution from its main() function with a single, fresh thread of control.
Elaboration (by LLM)
Complete Address Space Replacement:
The exec() syscall doesn’t “update” the process—it completely replaces it:
```c
// Old program - 4 threads
int main() {
    pthread_t tid2, tid3;

    pthread_create(&tid2, NULL, work, NULL);
    pthread_create(&tid3, NULL, work, NULL);

    sleep(10);  // Simulate work

    execv("/usr/bin/new_program", args);  // REPLACED!
    // Nothing after this executes in the old program
}
```

What exec() does:
- Flush all state: Old code, old data, old stacks
- Release all threads: Threads 2, 3, 4 are killed
- Load new executable: Read new binary from disk
- Initialize new process: Fresh data, fresh heap
- Start new main(): Single thread control
Thread Termination Timeline:
```
Time    Thread 1           Thread 2         Thread 3
──────────────────────────────────────────────────────────
0ms     main()             Starting         Starting
        pthread_create()

5ms     sleep(10)          Running          Running

10ms    execv() called     Running          Running

11ms    BANG!              TERMINATED       TERMINATED
        (old code          (killed          (killed
        destroyed)         instantly)       instantly)

11ms    ┌──────────────────┐
        │ NEW EXECUTABLE   │
        │ CODE LOADED      │
        │ main() running   │
        │ 1 thread now     │
        └──────────────────┘
```

Key Difference from fork():

```
fork():
  ├─ Parent: unchanged, all threads continue
  └─ Child: 1 thread (copy of the caller)
  Result: two processes

exec():
  ├─ Replaces the current process image entirely
  ├─ All other threads are destroyed; the caller continues as the new program's single thread
  ├─ New code loaded
  └─ New main() starts with 1 thread
  Result: same process, different program
```

Standard Pattern: fork() + exec()
This is the standard way to run new programs:
```c
pid_t pid = fork();

if (pid == 0) {
    // Child process
    // Child has 1 thread (safe to manipulate)

    close(STDIN_FILENO);
    open("/input.txt", O_RDONLY);

    execv("/bin/cat", args);  // Never returns if successful

    perror("execv failed");
    exit(1);
} else {
    // Parent process
    // Still has all its threads
    waitpid(pid, NULL, 0);  // Wait for the child's program
}
```

Resource Leaks After exec():
```c
int main() {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int fd = open("data.txt", O_RDONLY);

    pthread_create(&tid, NULL, work, NULL);

    execv("/bin/true", args);

    // exec() doesn't return on success; conceptually:
    // - Thread tid: GONE
    // - Mutex m: LOST (never unlocked)
    // - File fd: stays open across exec (use O_CLOEXEC if that's unwanted)
}
```

Solution: use fork() + exec() carefully to isolate the new program!
Problem 16: Heap Sharing Between Threads
Section titled “Problem 16: Heap Sharing Between Threads”A process has two threads, T1 and T2. Assume that T1 allocates some memory on the heap as follows:
```cpp
int *ptr = new int[100];
```

T1 then passes this pointer "ptr" to T2. Can T2 legally access the array pointed to by ptr? Justify your answer.
Instructor Solution
Yes. The array pointed to by “ptr” is located in the heap, and the threads that belong to the same process share the same heap. If T2 has the pointer, it can read/write that memory easily. This is the standard way threads share large amounts of data: by storing them either in static global variables or in the heap.
Elaboration (by LLM)
Why Heap Works (Unlike Stack):
Stack (Problem 17 - dangerous):
- Thread 1 Stack: 0x5000 - 0x6000 (private)
- Thread 2 Stack: 0x3000 - 0x4000 (private)
- Each thread has separate stack
- When thread exits, stack frame destroyed
- Addresses become invalid (dangling pointers)
Heap (Safe):
- Single shared heap: 0x8000 - 0x20000
- All threads allocate from same memory region
- When thread exits, heap memory persists
- Other threads can still access it
Pointer Transfer Mechanism:
```c
void *thread1_func(void *arg) {
    int *ptr = (int *)malloc(100 * sizeof(int));  // Heap alloc

    // T1's local variable 'ptr' contains an address, e.g., 0x8000
    // This address points into the SHARED heap

    // Pass to T2 via a shared variable
    shared_ptr = ptr;  // Copy the ADDRESS (just the value!)

    return NULL;
}

void *thread2_func(void *arg) {
    // T2 receives the same address from shared_ptr
    int *ptr = shared_ptr;  // Same address as T1!

    // Now T2 can access T1's heap allocation
    ptr[0] = 100;             // OK - writing to the shared heap
    printf("%d\n", ptr[50]);  // OK - reading from the shared heap

    return NULL;
}
```

Memory Lifetime Guarantee:

```c
void *thread1_func(void *arg) {
    int *ptr = malloc(100 * sizeof(int));  // Heap allocation
    *(int **)arg = ptr;  // Share with T2 (store the address)
    return NULL;         // T1 exits
}   // T1's STACK frame is destroyed
    // BUT the heap block at ptr is still valid!

void *thread2_func(void *arg) {
    int *data = *(int **)arg;  // Get T1's allocation
    printf("%d\n", data[0]);   // ✅ Still accessible!
    return NULL;
}
```

The heap allocation outlives the thread that created it!
Common Heap Sharing Patterns:
Pattern 1: Global Pointer
```c
int *global_ptr = NULL;

void *thread1_func(void *arg) {
    global_ptr = malloc(1000);  // T1 allocates
    global_ptr[0] = 42;
}

void *thread2_func(void *arg) {
    printf("%d\n", global_ptr[0]);  // T2 uses T1's allocation ✅
}
```

Pattern 2: Function Argument

```c
void *thread_func(void *arg) {
    int *buffer = (int *)arg;  // Buffer passed from main
    buffer[0] = 42;            // ✅ Access the shared heap
}

int main() {
    int *buffer = malloc(100 * sizeof(int));

    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, (void *)buffer);

    buffer[0] = 99;  // Main also uses it ✅

    pthread_join(tid, NULL);
    free(buffer);
}
```

Pattern 3: Thread-Safe Queue

```c
struct Node {
    int data;
    struct Node *next;
};

void enqueue(int value) {
    struct Node *new = malloc(sizeof(struct Node));  // Heap!
    new->data = value;
    new->next = NULL;

    pthread_mutex_lock(&lock);
    // Link into the queue
    pthread_mutex_unlock(&lock);
}

void *worker_thread(void *arg) {
    struct Node *node = dequeue();  // T2 accesses T1's heap alloc
    printf("%d\n", node->data);     // ✅ Works
    free(node);
}
```

Why This is Safe:
The heap is a single region of the process's shared address space, managed by the user-space allocator (e.g., glibc's malloc) on top of memory the kernel grants via brk/mmap:

- The allocator tracks all allocations for the whole process
- When any thread calls free(), the allocator updates its shared bookkeeping
- All threads see a consistent heap state (modern allocators are thread-safe)
- Addresses remain valid until an explicit free()
Problem 17: Stack Access Between Threads
Section titled “Problem 17: Stack Access Between Threads”A process has two threads, T1 and T2. Assume that T1 declares a local variable x as follows:
```c
int x = 10;
```

T1 then passes the address of this variable "x" to T2. Can T2 legally access "x"? Justify your answer.
Instructor Solution
Recall that local variables are on the stack, and every thread within the same process has their own private stack. While T2 can technically access T1’s stack if it has the memory address (since they share the same address space), it is extremely dangerous. If T1 returns from its current function, its stack frame is destroyed, and T2 will be pointing to “garbage” memory, likely causing a crash.
Elaboration (by LLM)
Stack Layout: Thread-Private Regions
Each thread gets its own stack:
Process Address Space:
```
Text (Shared)            All threads execute the same code
Data + BSS (Shared)      All threads share globals
Heap (Shared)            All threads allocate from the same heap
STACKS (THREAD-PRIVATE!)
  ├─ Stack for Thread 1    0x7000: [x = 10]  ← T1's local var
  └─ Stack for Thread 2    0x4000: [locals]  ← T2's locals
```

Stacks grow in opposite directions; the thread-private regions are isolated from each other.
The Danger: Stack Frame Destruction
```c
void function_with_local(void *shared_ptr_arg) {
    int x = 10;  // Local variable on T1's stack at 0x7000

    // Pass the address to T2
    int **ptr_holder = (int **)shared_ptr_arg;
    *ptr_holder = &x;  // T2 now holds 0x7000

    return;  // CRITICAL MOMENT!
             // The stack frame is DESTROYED here
             // Address 0x7000 is no longer valid
}
```

Timeline:

```
Time    Thread 1                    Thread 2
──────────────────────────────────────────────────────
0ms     function_with_local()
        x = 10
        ptr = &x (passes 0x7000)    (waiting)

5ms     (doing work)                Receives 0x7000
                                    printf(*ptr);  // 10 ✅

10ms    return;                     Still holding 0x7000
        [Stack destroyed]           [0x7000 now invalid]

15ms    another_function()          printf(*ptr);  // CRASH!
        [Reuses stack at 0x7000]    (Undefined behavior)
        int y = 999;  [0x7000=999]  Might print garbage
```

Race Condition - Timing Problem:
```c
int *shared_ptr = NULL;

void *thread1_func(void *arg) {
    int x = 10;
    shared_ptr = &x;  // T2 now has the address

    return NULL;  // Stack destroyed!
}

void *thread2_func(void *arg) {
    sleep(1);  // Wait 1 second

    // By now, T1 has exited
    if (shared_ptr) {
        printf("%d\n", *shared_ptr);  // CRASH! Dangling pointer!
    }
    return NULL;
}
```

Comparison: Why Heap Works vs. Stack Fails
```c
// HEAP ALLOCATION (Safe):
void *thread1_func() {
    int *ptr = malloc(100);  // Allocated on the heap
    return ptr;              // Can return safely
}   // Stack destroyed, heap survives!

void *thread2_func() {
    int *data = result;
    printf("%d", *data);  // ✅ Works! (heap still allocated)
}

// STACK ALLOCATION (Dangerous):
void *thread1_func() {
    int x = 10;
    return &x;  // RETURNS A STACK ADDRESS!
}   // Stack destroyed, &x invalid!

void *thread2_func() {
    int *data = result;
    printf("%d", *data);  // ❌ CRASH! (stack freed)
}
```

Best Practices:
```
❌ DON'T: Pass stack addresses between threads
❌ DON'T: Return &local_variable from a thread function
❌ DON'T: Store stack pointers in shared variables

✅ DO: Use heap allocations (malloc)
✅ DO: Use global variables (for shared state)
✅ DO: Copy values instead of addresses (when safe)
```

Correct Pattern:
```c
void *thread1_func(void *arg) {
    // Allocate on the HEAP instead
    int *data = malloc(sizeof(int));
    *data = 10;

    // Share the address
    *(int **)arg = data;  // Safe!

    return NULL;  // Stack destroyed, but the heap survives
}

void *thread2_func(void *arg) {
    int *data = *(int **)arg;

    sleep(10);  // T1 has long since exited

    printf("%d\n", *data);  // ✅ Works! (heap still allocated)

    free(data);  // Proper cleanup

    return NULL;
}
```

Problem 18: File Descriptor Sharing
Section titled “Problem 18: File Descriptor Sharing”A process has two threads, T1 and T2. Assume that T1 opens a file as follows:
```c
int fd = open("file.txt", O_RDONLY);
```

Assume that "fd" is a global variable and T2 has access to it. Can T2 call read(fd, …) to read data from "file.txt"? Justify your answer.
Instructor Solution
Yes. File descriptors are part of the Process Resources. The File Descriptor Table is shared by all threads within the same process. Once any thread opens a file, the resulting integer ID is valid for every other thread in that process. This means that any other thread in the same process that has access to the file descriptor can read from the file.
Elaboration (by LLM)
File Descriptor Table: Process-Wide Resource
File descriptors are stored in a kernel structure that’s shared across all threads:
```
Process Resources (Shared):
  File Descriptor Table
    FD 0: stdin
    FD 1: stdout
    FD 2: stderr
    FD 3: file.txt     ← (opened by T1)
    FD 4: socket
    FD 5: /dev/random

Threads:
  T1: Can use FD 3 ✅
  T2: Can use FD 3 ✅ (T1 opened it)
```

Both threads reference the same kernel open-file entry!

Unlike stacks (thread-private) and heaps (process-shared but allocated separately), file descriptors are explicitly process-wide resources.
File Descriptor Operations Across Threads:
```c
int fd = -1;  // Global variable

void *thread1_func(void *arg) {
    // T1 opens the file
    fd = open("file.txt", O_RDONLY);

    printf("[T1] Opened file with fd=%d\n", fd);

    // Read the first 100 bytes
    char buf1[100];
    ssize_t n = read(fd, buf1, 100);
    printf("[T1] Read %ld bytes\n", n);

    // The file position is now at byte 100
    // This position is SHARED with T2!

    sleep(1);

    return NULL;
}

void *thread2_func(void *arg) {
    usleep(500000);  // Let T1 read first (sleep() only takes whole seconds)

    // T2 can use the FD that T1 opened
    // Important: the file position is SHARED!

    char buf2[50];
    ssize_t n = read(fd, buf2, 50);  // ✅ Works!

    printf("[T2] Read %ld bytes from fd=%d\n", n, fd);
    printf("[T2] Got data from byte position 100 onwards\n");

    return NULL;
}

int main() {
    pthread_t tid1, tid2;

    pthread_create(&tid1, NULL, thread1_func, NULL);
    pthread_create(&tid2, NULL, thread2_func, NULL);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    if (fd >= 0) close(fd);

    return 0;
}
```

Output:
```
[T1] Opened file with fd=3
[T1] Read 100 bytes
[T2] Read 50 bytes from fd=3
[T2] Got data from byte position 100 onwards
```

Critical Issue: Shared File Position
File descriptors don’t just refer to files—they track the current read/write position:
```
File: file.txt (1000 bytes)

Initially:
[0][1][2]...[100]...[200]...[1000]
 ↑
 Position 0

After T1 reads 100 bytes:
[0][1][2]...[100]...[200]...[1000]
             ↑
             Position 100 (shared!)

After T2 reads 50 bytes:
[0][1][2]...[100]...[150]...[1000]
                     ↑
                     Position 150
```

Race Condition Danger:
```c
int fd = open("data.txt", O_RDONLY);

void *thread1_func(void *arg) {
    char buf[10];
    read(fd, buf, 10);  // Reads bytes 0-9
    // The scheduler might switch to T2 here!

    printf("T1: %s\n", buf);
}

void *thread2_func(void *arg) {
    char buf[10];
    read(fd, buf, 10);  // Reads which bytes?

    printf("T2: %s\n", buf);
}
```

The file position is shared, so multiple threads reading concurrently can cause data loss or confusion!
Proper Synchronization:
```c
pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;
int fd = -1;

void *thread1_func(void *arg) {
    pthread_mutex_lock(&fd_lock);

    // Only T1 is reading; the file position is safe
    char buf[10];
    ssize_t n = read(fd, buf, 10);
    printf("T1 read: %.*s\n", (int)n, buf);

    pthread_mutex_unlock(&fd_lock);
}

void *thread2_func(void *arg) {
    pthread_mutex_lock(&fd_lock);  // Wait if T1 is reading

    char buf[10];
    ssize_t n = read(fd, buf, 10);
    printf("T2 read: %.*s\n", (int)n, buf);

    pthread_mutex_unlock(&fd_lock);
}
```

With the mutex, each read() is atomic and the file position updates predictably.
Common FD Sharing Patterns:
Pattern 1: Read-Only Safe
```c
// Config file read by multiple threads
int config_fd = open("config.txt", O_RDONLY);

void *worker_thread(void *arg) {
    // Safe only if the seek+read pair cannot interleave with another
    // thread's (or use pread(), which takes an explicit offset)
    lseek(config_fd, THREAD_OFFSET, SEEK_SET);
    read(config_fd, local_buf, SIZE);
}
```

Pattern 2: Append-Only (Requires Mutex)

```c
int log_fd = open("log.txt", O_WRONLY | O_APPEND);
pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

void *thread_func(void *arg) {
    pthread_mutex_lock(&log_lock);
    write(log_fd, "message\n", 8);  // Atomic
    pthread_mutex_unlock(&log_lock);
}
```

Comparison: FDs vs. Heap vs. Stack
```
Resource      | Shared?   | Access   | Issues
──────────────┼───────────┼──────────┼─────────────────
File Descr.   | Yes       | Via FD   | Shared position
              | (global)  |          | (races!)
──────────────┼───────────┼──────────┼─────────────────
Heap Memory   | Yes       | Via ptr  | Race conditions
              | (shared)  |          | on data
──────────────┼───────────┼──────────┼─────────────────
Stack Memory  | No        | Via addr | Dangling pointers
              | (private) |          |
```

File descriptors are meant for sharing—but be careful with concurrent operations!
Problem 19: Thread Output Analysis
What’s the output of the following code when run in a Unix system? Assume you have 2 functions:
- (1) CreateThread(F, arg) creates a thread that starts running at the function “F”, takes the argument “arg”, and returns the id of the created thread
- (2) JoinThread(int threadId) is used to wait for the termination of a thread
int g = 0;
int tids[2];

void T(int arg) {
    if (arg == 2) {
        g++;
        printf("[MainThread] g: %d\n", g);
    }
    if (arg == 1) {
        g++;
        printf("[Thread1] g: %d\n", g);
    }
    if (arg == 0) {
        JoinThread(tids[1]);
        g++;
        printf("[Thread0] g: %d\n", g);
    }
} // end-T

main() {
    tids[1] = CreateThread(T, 1);
    tids[0] = CreateThread(T, 0);
    JoinThread(tids[0]);
    T(2);
} // end-main

Instructor Solution
Output:
[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3

Elaboration (by LLM)
Execution Flow Analysis:
The key to understanding this is tracking the order of thread creation, execution, and synchronization:
Timeline:
main():
    tids[1] = CreateThread(T, 1);   // Create Thread1 (arg=1)
    tids[0] = CreateThread(T, 0);   // Create Thread0 (arg=0)
    JoinThread(tids[0]);            // WAIT for Thread0 to finish
    T(2);                           // Execute main's function

Concurrent Execution:
Main Thread           Thread1 (arg=1)          Thread0 (arg=0)
───────────           ───────────────          ───────────────
CreateThread(T,1)
CreateThread(T,0)
JoinThread(tids[0])                            JoinThread(tids[1])
(blocked on T0)                                (blocked on T1)
                      g++        (g = 1)
                      printf "[Thread1] g: 1"
                      return (T1 finishes)
                                               g++        (g = 2)
                                               printf "[Thread0] g: 2"
                                               return (T0 finishes)
(unblocked)
T(2): g++  (g = 3)
printf "[MainThread] g: 3"
return

Detailed Execution Order:
- Main: Creates Thread1 with T(1)
- Main: Creates Thread0 with T(0)
- Main: Calls JoinThread(tids[0]) → Main blocks, waiting for Thread0 to finish
- Thread1: Starts executing T(1)
- arg == 1 is true
- g++ (g becomes 1)
- printf(“[Thread1] g: 1\n”)
- Thread1 finishes
- Thread0: Starts executing T(0)
- arg == 0 is true
- Calls JoinThread(tids[1]) → Thread0 blocks, waiting for Thread1
- Thread1 already finished, so this returns immediately
- g++ (g becomes 2)
- printf(“[Thread0] g: 2\n”)
- Thread0 finishes
- Main: JoinThread returns (Thread0 finished)
- Calls T(2)
- arg == 2 is true
- g++ (g becomes 3)
- printf(“[MainThread] g: 3\n”)
- Main finishes
Critical Insight: Shared Global Variable g
All three function invocations increment the same global variable g:
Initial:       g = 0
After Thread1: g = 0 + 1 = 1
After Thread0: g = 1 + 1 = 2
After Main:    g = 2 + 1 = 3

Note: This works predictably because:
- Thread1 finishes before Thread0 even starts checking it
- Thread0 waits for Thread1 explicitly
- Main waits for Thread0 explicitly
- No race conditions here (synchronization prevents them)
Key Learning Points:
- Thread Creation Order: Threads are created in order (Thread1 first, Thread0 second)
- Thread Start Order: May be different from creation order (depends on scheduler)
- JoinThread Synchronization: Blocks until the thread finishes
- Global Variable Access: All threads access the same g variable
- Execution Order: Determined by JoinThread calls, not creation order
Problem 20: Parallel Summation
Consider parallel summation: you are given an array A of N integers, and you must write a multi-threaded program that computes the sum of its elements using 4 threads.
Assume you have 2 functions:
- (1) CreateThread(F, arg) creates a thread that starts running at the function “F”, takes the argument “arg”, and returns the id of the created thread
- (2) JoinThread(int threadId) is used to wait for the termination of a thread
Assume that array A and its size N are defined globally.
Instructor Solution
#include <stdio.h>

#define N ...

int A[N];          // Array to be summed
int localSums[4];

void doSum(int tid) {
    int start = tid * N / 4;
    int end = start + N / 4;
    int sum = 0;

    for (int i = start; i < end; i++)
        sum += A[i];

    localSums[tid] = sum;
}

void main() {
    int tids[4];

    for (int i = 0; i < 4; i++)
        tids[i] = CreateThread(doSum, i);

    int sum = 0;
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);
        sum += localSums[i];
    }

    printf("Sum: %d\n", sum);
}

Elaboration (by LLM)
Problem Decomposition:
To parallelize summation with 4 threads, divide the array into 4 equal parts:
Array A[0..N-1]:

[0] ... [N/4-1] | [N/4] ... [N/2-1] | [N/2] ... [3N/4-1] | [3N/4] ... [N-1]
   Thread 0          Thread 1            Thread 2             Thread 3
→ localSums[0]    → localSums[1]      → localSums[2]       → localSums[3]

Final Sum = localSums[0] + localSums[1] + localSums[2] + localSums[3]

Parallel Execution Model:
Main Thread: Create 4 threads
  ├─ Thread 0: Sum A[0 .. N/4-1]
  ├─ Thread 1: Sum A[N/4 .. N/2-1]
  ├─ Thread 2: Sum A[N/2 .. 3N/4-1]
  └─ Thread 3: Sum A[3N/4 .. N-1]
       (All 4 compute in parallel)
            ↓
  Wait for all to finish (JoinThread for each)
            ↓
  Aggregate results: sum all localSums[i]
            ↓
  Print final sum

Key Design Patterns:
1. Work Distribution:
Each thread gets a unique thread ID (tid = 0, 1, 2, 3):
void doSum(int tid) {
    int start = tid * N / 4;   // Where this thread starts
    int end = start + N / 4;   // Where this thread ends

    // Thread 0: start=0,    end=N/4
    // Thread 1: start=N/4,  end=N/2
    // Thread 2: start=N/2,  end=3N/4
    // Thread 3: start=3N/4, end=N
}

2. Local Computation:
Each thread computes a LOCAL sum without synchronization:
int sum = 0;
for (int i = start; i < end; i++)
    sum += A[i];
localSums[tid] = sum;   // Store result for main thread

No race conditions because:
- Each thread writes to unique localSums[tid]
- No shared write access during computation
3. Aggregation:
Main thread combines results AFTER all threads finish:
int sum = 0;
for (int i = 0; i < 4; i++) {
    JoinThread(tids[i]);     // Wait for thread i
    sum += localSums[i];     // Add its result
}

Example with Real Numbers:
Array: A = [1, 2, 3, 4, 5, 6, 7, 8] (N=8)
Thread Distribution:
  Thread 0: A[0..1] = [1, 2] → localSums[0] = 3
  Thread 1: A[2..3] = [3, 4] → localSums[1] = 7
  Thread 2: A[4..5] = [5, 6] → localSums[2] = 11
  Thread 3: A[6..7] = [7, 8] → localSums[3] = 15

Aggregation:
  sum = 0
  sum += localSums[0] = 3
  sum += localSums[1] = 7   (sum = 10)
  sum += localSums[2] = 11  (sum = 21)
  sum += localSums[3] = 15  (sum = 36)

Final Output: Sum: 36

Performance Analysis:
Without Parallelization:
  A single thread sums N elements: Time = O(N)

With 4-Thread Parallelization:
  Each thread sums N/4 elements: Time = O(N/4) for computation,
  plus O(1) for aggregation.
  Total = O(N/4)  (speedup of ~4x on 4 cores)

Critical Sections:
This code has NO critical sections!
- Each thread writes to different localSums[tid] (no race)
- Main thread only reads after all threads finish (no race)
- All synchronization explicit (JoinThread)
Why This Design is Robust:
// Good: No races, clean synchronization
int localSums[4];   // Each thread writes a unique index

void doSum(int tid) {
    localSums[tid] = sum;   // Safe! tid is unique per thread
}

void main() {
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);   // Explicit wait
        sum += localSums[i];   // Read after join
    }
}

// Bad: Would cause race conditions
int globalSum = 0;   // Shared!

void doSum(int tid) {
    globalSum += localSum;   // RACE! Multiple threads writing
}

Scalability:
To parallelize across K cores instead of 4, parameterize the thread count (THREADS below is the number of threads, kept distinct from the array size N):

#define THREADS 8   // e.g., 8 threads

int localSums[THREADS];

void doSum(int tid) {
    int start = tid * N / THREADS;
    int end = start + N / THREADS;
    // ... same logic ...
}

void main() {
    int tids[THREADS];

    for (int i = 0; i < THREADS; i++)
        tids[i] = CreateThread(doSum, i);

    int sum = 0;
    for (int i = 0; i < THREADS; i++) {
        JoinThread(tids[i]);
        sum += localSums[i];
    }

    printf("Sum: %d\n", sum);
}

The logic remains identical, just parameterized!
Problem 21: Parallel Minimum Finding
Write a multi-threaded program that finds the minimum element of an array using 2 threads.
Assume that array A and its size N are defined globally as:

#include <stdio.h>

#define N ...

int A[N];   // Array for which the minimum is to be computed

Instructor Solution
#include <stdio.h>
#include <limits.h>   // for INT_MAX

#define N ...

int A[N];   // Array for which the minimum is to be computed
int localMins[2];

void findMin(int tid) {
    int start = tid * N / 2;
    int end = start + N / 2;
    int min = INT_MAX;

    for (int i = start; i < end; i++)
        if (A[i] < min)
            min = A[i];

    localMins[tid] = min;
}

void main() {
    int tids[2];

    for (int i = 0; i < 2; i++)
        tids[i] = CreateThread(findMin, i);

    for (int i = 0; i < 2; i++)
        JoinThread(tids[i]);

    if (localMins[0] < localMins[1])
        printf("Min: %d\n", localMins[0]);
    else
        printf("Min: %d\n", localMins[1]);
}

Elaboration (by LLM)
Array Partitioning for Min-Finding:
Split array into 2 halves for 2 threads:
Array A[0..N-1]:

[0] ... [N/2-1]   |   [N/2] ... [N-1]
   Thread 0              Thread 1
 finds min of          finds min of
 the first half        the second half
 → localMins[0]        → localMins[1]

Final Min = min(localMins[0], localMins[1])

Key Differences from Problem 20 (Summation):
Problem 20 (Sum)                    Problem 21 (Min)
- Combine: ADD                      - Combine: COMPARE
- sum += A[i]                       - if (A[i] < min) min = A[i]
- Final: add all parts              - Final: min(part1, part2)
- Operation: + (associative)        - Operation: min() (associative)

Both parallelize the SAME WAY:
1. Partition the array
2. Each thread processes its partition
3. Main thread combines the results

Parallel Execution Model:
Main Thread: Create 2 threads
  ├─ Thread 0: Find min in A[0 .. N/2-1]
  └─ Thread 1: Find min in A[N/2 .. N-1]
       (Both compute in parallel)
            ↓
  Wait for both to finish (JoinThread for each)
            ↓
  Compare results: min(localMins[0], localMins[1])
            ↓
  Print overall minimum

Algorithm Details:
1. Initialization:
int min = INT_MAX;   // Start with the largest possible value

Why INT_MAX?
- Every array element is at most INT_MAX, so the scan always records a valid minimum
- Ensures first element becomes the min
2. Comparison:
for (int i = start; i < end; i++)
    if (A[i] < min)
        min = A[i];

Each thread finds the minimum in its range.
3. Storage:
localMins[tid] = min;   // Thread tid stores its minimum

Each thread writes to a unique index (no race).
4. Final Comparison:
if (localMins[0] < localMins[1])
    printf("Min: %d\n", localMins[0]);
else
    printf("Min: %d\n", localMins[1]);

The main thread compares the two local minima.
Example Execution:
Array: A = [5, 3, 9, 1, 7, 2, 8, 4] (N=8)
Thread Distribution:
  Thread 0: A[0..3] = [5, 3, 9, 1]
    min = INT_MAX
    min = 5  (A[0] < INT_MAX)
    min = 3  (A[1] < 5)
    min = 3  (A[2] = 9, not < 3)
    min = 1  (A[3] < 3)
    localMins[0] = 1

  Thread 1: A[4..7] = [7, 2, 8, 4]
    min = INT_MAX
    min = 7  (A[4] < INT_MAX)
    min = 2  (A[5] < 7)
    min = 2  (A[6] = 8, not < 2)
    min = 2  (A[7] = 4, not < 2)
    localMins[1] = 2

Final Comparison:
  localMins[0] = 1
  localMins[1] = 2
  1 < 2 → Print "Min: 1"

Why Use INT_MAX Instead of the First Element?
// Approach 1: Using INT_MAX (current solution)
int min = INT_MAX;
for (int i = start; i < end; i++)
    if (A[i] < min) min = A[i];

Advantages:
✅ Works even if the first element is the largest
✅ Cleaner logic
✅ No special-case handling

// Approach 2: Using the first element
int min = A[start];
for (int i = start + 1; i < end; i++)
    if (A[i] < min) min = A[i];

Disadvantages:
❌ More complex (skips the first iteration)
❌ Requires special handling

Generalization to K Threads:
For K threads finding minimum:
#define K 4   // Number of threads

int localMins[K];

void findMin(int tid) {
    int start = tid * N / K;
    int end = start + N / K;
    int min = INT_MAX;

    for (int i = start; i < end; i++)
        if (A[i] < min)
            min = A[i];

    localMins[tid] = min;
}

void main() {
    int tids[K];

    for (int i = 0; i < K; i++)
        tids[i] = CreateThread(findMin, i);

    for (int i = 0; i < K; i++)
        JoinThread(tids[i]);

    // Find the minimum of all local minima
    int globalMin = INT_MAX;
    for (int i = 0; i < K; i++) {
        if (localMins[i] < globalMin)
            globalMin = localMins[i];
    }

    printf("Min: %d\n", globalMin);
}

Time Complexity Analysis:
Sequential (1 thread):
Time = O(N) - iterate through all N elements

Parallel (2 threads on 2 cores):
  Each thread: O(N/2)
  Total: O(N/2), a speedup of 2x

Parallel (K threads on K cores):
  Each thread: O(N/K)
  Total: O(N/K), a speedup of Kx

(Ignoring aggregation time O(K), which is negligible for large N)
Critical Pattern:
This is the Fork-Join Pattern:
1. FORK:  Create multiple threads
2. WORK:  Each thread processes independently
3. JOIN:  Wait for all threads
4. MERGE: Combine results

This pattern applies to ANY associative operation:
- Addition (Problem 20)
- Minimum (Problem 21)
- Maximum (variant)
- Multiplication (variant)
- Logical operations (AND, OR)
The key requirement: Operation must be associative
Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

Examples:
  min(min(5, 3), 9) = min(5, min(3, 9)) ✅
  sum(sum(5, 3), 9) = sum(5, sum(3, 9)) ✅
  max(max(5, 3), 9) = max(5, max(3, 9)) ✅

Problem 22: Parallel Array Search
You are given an unsorted array A of N integers and a key to search for. Using 4 threads, write a multi-threaded program that prints the index of the first occurrence of the key, or reports that the key does not exist in the array.
Assume you have 2 functions:
- (1) CreateThread(F, arg) creates a thread that starts running at the function “F”, takes the argument “arg”, and returns the id of the created thread
- (2) JoinThread(int threadId) is used to wait for the termination of a thread
Assume that array A, its size N, and the key are defined globally as:

#include <stdio.h>

#define N ...

int A[N];   // Array to be searched
int key;    // Key to be searched for

Instructor Solution
#include <stdio.h>

#define N ...

int A[N];   // Array to be searched
int key;    // Key to be searched for
int localResult[4] = {N, N, N, N};   // every slot starts at the "not found"
                                     // sentinel N ({N} alone would zero-fill
                                     // slots 1-3, a subtle bug)

void doFind(int tid) {
    int start = tid * N / 4;
    int end = start + N / 4;

    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            localResult[tid] = i;
            return;
        }
    }
}

void main() {
    int tids[4];

    for (int i = 0; i < 4; i++)
        tids[i] = CreateThread(doFind, i);

    int idx = N;
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);
        if (localResult[i] < idx)
            idx = localResult[i];
    }

    if (idx == N)
        printf("Key does not exist in the array.");
    else
        printf("Key exists at index: %d\n", idx);
}

Elaboration (by LLM)
Parallel Search Strategy:
Unlike summation and min-finding (which process all elements), search looks for FIRST occurrence:
Array A[0..N-1] with key = 7:

[3][1][7][5][2][7][9][4]

Thread Distribution:
  Thread 0: A[0..1] = [3, 1] → localResult[0] = N (not found)
  Thread 1: A[2..3] = [7, 5] → localResult[1] = 2 (found at index 2!)
  Thread 2: A[4..5] = [2, 7] → localResult[2] = 5 (found at index 5)
  Thread 3: A[6..7] = [9, 4] → localResult[3] = N (not found)

Result Aggregation:
  Overall minimum index = min(N, 2, 5, N) = 2
  Output: "Key exists at index: 2" (first occurrence)

Key Design Decisions:
1. Initialization to N (not found sentinel):
int localResult[4] = {N, N, N, N};   // all four slots start at N

(Note: in C, writing = {N} initializes only element 0 to N and zero-fills the rest, which would make index 0 look like a match; every slot must be set to the sentinel explicitly.)

Why N?
- N is the array size
- No valid index equals N (indices are 0 to N-1)
- Represents “not found” for this thread’s range
- Easy to check:
if (idx == N) → not found
2. Early Exit on Success:
for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        return;   // Early exit! Don't search the rest of the range
    }
}

Once found:
- Store the index
- RETURN immediately (don’t continue searching)
- This is efficient: once we find it, no need to search further
3. Finding the FIRST occurrence:
Since we want the first index:
int idx = N;
for (int i = 0; i < 4; i++) {
    JoinThread(tids[i]);
    if (localResult[i] < idx)
        idx = localResult[i];   // Keep the minimum index
}

Take the MINIMUM index from all threads:
- The thread whose range contains the earliest occurrence reports the smallest index
- Smaller indices override larger ones during aggregation
- Threads that didn’t find the key have localResult = N (effectively ignored)
Example Detailed Execution:
Array: A = [10, 20, 30, 7, 40, 7, 50, 7] (N=8)
Key: 7

Initial: localResult = [8, 8, 8, 8]

Thread 0: A[0..1] = [10, 20]
  i=0: A[0]=10 != 7
  i=1: A[1]=20 != 7
  (loop ends, localResult[0] stays 8)
  localResult = [8, 8, 8, 8]

Thread 1: A[2..3] = [30, 7]
  i=2: A[2]=30 != 7
  i=3: A[3]=7 == 7 ✅
  localResult[1] = 3
  return (early exit!)
  localResult = [8, 3, 8, 8]

Thread 2: A[4..5] = [40, 7]
  i=4: A[4]=40 != 7
  i=5: A[5]=7 == 7 ✅
  localResult[2] = 5
  return (early exit!)
  localResult = [8, 3, 5, 8]

Thread 3: A[6..7] = [50, 7]
  i=6: A[6]=50 != 7
  i=7: A[7]=7 == 7 ✅
  localResult[3] = 7
  return (early exit!)
  localResult = [8, 3, 5, 7]

Aggregation (after all joins):
  idx = 8
  localResult[0]=8: 8 < 8? No
  localResult[1]=3: 3 < 8? Yes → idx = 3
  localResult[2]=5: 5 < 3? No
  localResult[3]=7: 7 < 3? No

Final: idx = 3
Output: "Key exists at index: 3"

Why No Race Conditions?
// Thread 0 writes to localResult[0]
// Thread 1 writes to localResult[1]
// etc.

// NO overlapping writes → NO race conditions
// Each thread has a unique localResult index

Each thread writes to a different element of the array!
Early Exit Pattern:
This is a key optimization for search:
// Inefficient: keeps scanning even after finding the key
for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        // Continue searching (wasted work!)
    }
}

// Efficient: stop immediately on the first find
for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        return;   // Early exit
    }
}

The return statement:
- Exits the function immediately
- Thread terminates
- Other threads may still be searching
- Main thread waits for all with JoinThread
Handling “Not Found” Case:
// All threads searched their range// No thread found the key
localResult = [N, N, N, N] (all initialized value)
Aggregation: idx = N for (i=0 to 3): localResult[i] < idx? No (all equal N)
idx stays N
Output: if (idx == N) printf("Key does not exist in the array.");Comparison with Sequential Search:
Sequential (1 thread):
  for (int i = 0; i < N; i++)
      if (A[i] == key) return i;
  Worst case: O(N), key at the end or not found

Parallel (4 threads):
  Best case:  the finding thread exits early, but overall time is still
              bounded by the threads that scan their full N/4 range
  Worst case: O(N/4), all threads search their ranges fully

Expected case: O(N/4) if the key is distributed randomly

Generalization to K Threads:
#define K 4

int localResult[K];

void doFind(int tid) {
    int start = tid * N / K;
    int end = start + N / K;

    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            localResult[tid] = i;
            return;
        }
    }
}

void main() {
    int tids[K];

    for (int i = 0; i < K; i++)
        localResult[i] = N;   // set every "not found" sentinel before starting

    for (int i = 0; i < K; i++)
        tids[i] = CreateThread(doFind, i);

    int idx = N;
    for (int i = 0; i < K; i++) {
        JoinThread(tids[i]);
        if (localResult[i] < idx)
            idx = localResult[i];
    }

    if (idx == N)
        printf("Key does not exist in the array.");
    else
        printf("Key exists at index: %d\n", idx);
}

The logic scales to any number of threads!
Pattern Summary: Fork-Join-Merge
All three parallel problems (summation, minimum, search) follow the same pattern:
1. FORK:  Create K threads
2. WORK:  Each thread processes its partition independently
3. JOIN:  Wait for all to complete
4. MERGE: Combine results

Applied to different operations:
  ├─ Sum:    merge = ADD all results
  ├─ Min:    merge = COMPARE all results
  └─ Search: merge = take the MINIMUM index

This pattern is universal for embarrassingly parallel problems!