
Threads and Concurrency Problems

Briefly explain how a thread differs from a process.

Instructor Solution

A process is an independent container with its own address space storing code, static variables, a heap for dynamically allocated memory, and other resources such as open files, sockets, and semaphores. A thread is a unit of execution that exists inside a process. Each process starts with one thread but can have any number of threads. Each thread has its own stack for local variables but shares all other resources of the process with the process's other threads. While communication between processes requires IPC (shared memory, pipes, message queues), threads can communicate directly via shared variables.

Elaboration (by LLM)

Process as a Container:

A process is a heavyweight abstraction that the OS uses to isolate program execution. Each process has:

  • Its own memory address space (code, data, heap)
  • Its own file descriptor table
  • Its own process ID (PID)
  • Complete isolation from other processes

Thread as Execution Within a Container:

A thread is a lightweight execution unit that exists within a process. All threads in the same process:

  • Share the same memory address space
  • Can directly access each other’s data without IPC
  • Have individual execution states (program counter, registers)
  • Are scheduled independently by the OS

Resource Comparison Table:

Resource          Process   Thread (in same process)
Address space     Own       Shared
File descriptors  Own       Shared
Heap              Own       Shared
Stack             Own       Own
Registers         Own       Own
Program counter   Own       Own

Cost Implications:

  • Process creation: Expensive (allocate memory, set up page tables, initialize resources)
  • Thread creation: Cheap (just need a stack and TCB)
  • Context switch between processes: Slow (flush TLB, switch memory maps)
  • Context switch between threads: Fast (same memory, just switch registers and stack)

Communication Overhead:

  • Between processes: Requires explicit IPC (pipes, sockets, shared memory) with kernel involvement
  • Between threads: Direct memory access, just like function calls within a program

What state (information) does a thread share with other threads in the same process and what state (information) is private/specific to a thread? Be explicit in your answer.

Instructor Solution

Shared State:

All threads in the same process share the same address space and system resources:

  • Code Segment: The actual machine instructions.
  • Data Segment: Global and static variables.
  • Heap: Dynamically allocated memory (via malloc or new).
  • OS Resources: Open file descriptors, sockets, signal handlers, and semaphores.

Private State:

To execute its own flow of control, each thread maintains its own execution context:

  • Stack: Stores local variables, function parameters, and return addresses.
  • Execution Context: Program Counter, Stack Pointer, general purpose registers.
  • Thread ID (TID): A unique identifier for the thread.

Elaboration (by LLM)

Visual Memory Layout:

┌────────────────────────────┐
│ Code Segment (shared)      │ ← All threads execute same code
├────────────────────────────┤
│ Data/Global (shared)       │ ← All threads access same globals
├────────────────────────────┤
│ Heap (shared)              │ ← malloc/new allocations
├────────────────────────────┤
│ Stack (Thread 1 - private) │ ← Local vars of thread 1
├────────────────────────────┤
│ Stack (Thread 2 - private) │ ← Local vars of thread 2
├────────────────────────────┤
│ Stack (Thread 3 - private) │ ← Local vars of thread 3
└────────────────────────────┘

Why Each Thread Needs Its Own Stack:

  • Function calls need their own stack frames
  • Local variables must be independent (otherwise threads interfere)
  • Return addresses must be tracked separately per thread
  • Each thread has its own execution path through the same code

Why Shared Heap is Dangerous:

When threads share the heap:

int *ptr = malloc(sizeof(int));
*ptr = 42; // Thread 1
printf("%d", *ptr); // Thread 2 - might see 42 or a different value

Race conditions occur when multiple threads access the same memory location simultaneously. Synchronization (mutexes, locks) is required.

Why File Descriptors are Shared:

int fd = open("file.txt", O_RDONLY);
// Thread 1
read(fd, buf1, 100); // Reads from current file position
// Thread 2 (simultaneously)
read(fd, buf2, 100); // Reads from updated file position

Both threads affect the same file offset, requiring careful synchronization.

What resources does the OS allocate for a thread during thread creation? How do they differ from those allocated when a process is created?

Instructor Solution

The OS allocates only the bare essentials for a new thread:

  • A Private Stack: For local variables, function parameters, and return addresses.
  • Thread Control Block (TCB): To store register states (including the Program Counter and Stack Pointer), priority, and Thread ID.

Process creation is far more resource-intensive because the OS must establish a new container:

  • Address Space: A process gets a brand new, isolated memory space with code, data and heap; a thread simply uses the parent’s existing map.
  • Resource Duplication: Processes get their own copies of file descriptor tables; threads share these globally with all other threads in the process.

Elaboration (by LLM)

Thread Creation Overhead:

When creating a new thread, the OS:

  1. Allocates a stack (typically 1-8 MB)
  2. Creates a TCB storing:
    • Thread ID
    • Program Counter (initially points to thread function)
    • Stack Pointer (points to new stack)
    • Register values
    • Scheduling priority
    • State (ready, running, blocked)
  3. Adds thread to the process’s thread list
  4. Total time: microseconds to milliseconds

Process Creation Overhead:

When creating a new process, the OS must:

  1. Allocate new memory (separate address space)
  2. Copy or set up page tables
  3. Initialize code segment
  4. Initialize data segment
  5. Create empty heap
  6. Duplicate file descriptor table
  7. Create process control block (PCB)
  8. Initialize signal handlers
  9. Set up working directory
  10. Total time: milliseconds to seconds

Cost Comparison:

Thread creation: ~1 microsecond
Process creation: ~1-10 milliseconds
Ratio: 1000-10000x slower for process creation

Memory Overhead Comparison:

Thread:
  Stack: 1-8 MB
  TCB:   ~1 KB
  Total: ~1-8 MB per thread

Process:
  Address space: varies (minimal ~10 MB)
  Page tables: ~100s of KB
  File descriptor table: 1-10 KB
  PCB: ~1 KB
  Total: ~10-100+ MB per process

Why Threads Share Address Space:

Threads don’t need separate address spaces because:

  • They execute the same code
  • They benefit from shared data structures
  • Isolation isn’t required (all threads are part of same application)
  • Memory is precious on embedded systems

Why Processes Have Separate Address Spaces:

Processes need isolation because:

  • Different programs shouldn’t crash each other
  • Security: one program shouldn’t access another’s data
  • Containment: buffer overflows in one process don’t affect others
  • The cost of isolation is worth the protection gained

What are the advantages of multi-threading? Give an example where multi-threading is preferable to a single-threaded process.

Instructor Solution

Advantages of Multi-Threading:

  • Responsiveness: Keeps an application interactive by running background tasks (like saving a file) without freezing the User Interface.
  • Resource Sharing: Threads share memory and resources by default, making communication faster and more efficient than Inter-Process Communication (IPC).
  • Economy: Thread creation and context switching are significantly “cheaper” (faster and use less RAM) than process creation.
  • Scalability: Allows a single application to utilize multiple CPU cores effectively by executing tasks in parallel.

Example: Web Browser

Multi-threading is especially preferable for I/O-bound applications with a graphical user interface, such as a web browser. In a single-threaded browser, if you started downloading a large file, the entire window would freeze and become unresponsive until the download finished.

In a multi-threaded browser:

  1. Thread 1 handles user input (scrolling, clicking)
  2. Thread 2 handles the network download
  3. Thread 3 renders the images and text on the screen

Elaboration (by LLM)

Responsiveness in Depth:

Consider a text editor:

Single-threaded:
  User types 'H' → System searches dictionary → User waits 5 seconds
  (Everything freezes until search completes)

Multi-threaded:
  Thread 1: User types 'H' → Character appears immediately
  Thread 2: (concurrently) Searches dictionary in background
  (User never experiences freezing)

Resource Sharing Benefit:

Single-threaded process approach (if possible):

Process 1: GUI
Process 2: Download
Process 3: Rendering
Problem: Need to share downloaded data → requires IPC
Solution: Slow pipes, sockets, or shared memory

Multi-threaded approach:

Shared memory:
┌───────────────┐
│ Downloaded │
│ HTML buffer │ ← All threads access directly
└───────────────┘
Thread 1: Reads from network → writes to buffer
Thread 2: Reads from buffer → renders HTML
Thread 3: Handles user input

CPU Utilization (Single vs Multi-core):

Single-core processor:

  • Only one thread runs at a time
  • Multi-threading still helps (while one thread waits for I/O, another runs)

Multi-core processor:

  • Multiple threads run simultaneously
  • True parallelism: download + rendering at exact same time
  • Perfect for CPU-bound and I/O-bound tasks

Real-World Browser Example:

Google Chrome threads:
- Main thread: handles UI events
- IO thread: network operations
- Worker threads: JavaScript execution, image decoding
- Renderer threads: HTML/CSS rendering

Comparison: Single-threaded vs Multi-threaded Download:

Single-threaded (sequential):
  Download chunk 1 (100ms)
  Download chunk 2 (100ms)
  Download chunk 3 (100ms)
  Render HTML (200ms)
  Handle input (0ms - frozen until done)
  Total: 500ms frozen

Multi-threaded (concurrent):
  While downloading, rendering happens
  While downloading, input is processed
  User sees the progress bar updating and can scroll
  Total: user sees instant feedback

When Single-Threading is Sufficient:

  • Batch processing with no user interaction
  • CPU-bound computation (single processor anyway)
  • Simple synchronous operations

When Multi-Threading is Essential:

  • User interfaces (event handlers)
  • Network servers (handling multiple clients)
  • Real-time systems (logging, monitoring)
  • Any application where responsiveness matters

List 1 item that is shared by all threads, and 1 item that is private to each thread.

Instructor Solution

Shared: Code, global variables, heap

Private: Stack, Program Counter, Stack Pointer

Elaboration (by LLM)

Why Code is Shared:

void *printThread(void *arg) {
    printf("Thread %d\n", *(int *)arg);
    return NULL;
}

// Both Thread 1 and Thread 2 execute the SAME function code
int id1 = 1, id2 = 2;
pthread_create(&thread1, NULL, printThread, &id1);
pthread_create(&thread2, NULL, printThread, &id2);

Both threads execute identical machine instructions from the code segment. Sharing the code segment saves memory and makes sense since the code doesn’t change.

Why Global Variables are Shared:

int counter = 0;   // global variable

void increment() {
    counter++;     // all threads modify the SAME counter
}

// Thread 1
increment();       // counter = 1
// Thread 2 (simultaneously)
increment();       // counter = ?

The address of counter is the same for all threads. Without synchronization, this causes race conditions.

Why Heap is Shared:

int *ptr = malloc(sizeof(int));
*ptr = 42;
// Pass to another thread
Thread2UpdateValue(ptr); // Thread 2 modifies same memory
printf("%d", *ptr); // Thread 1 sees modified value

Heap allocations use addresses that are valid in the shared address space. All threads can access the same allocated memory.

Why Stack is Private:

void function() {
    int local_var = 10;  // on THIS thread's stack
    // Thread 1's local_var at address 0x7FFF1000
    // Thread 2's local_var at address 0x7FFF2000
}

// Thread 1
function();  // local_var at 0x7FFF1000
// Thread 2
function();  // local_var at 0x7FFF2000 (different address!)

Each thread needs independent local variables, so each gets its own stack with separate memory locations.

Why Program Counter is Private:

Thread 1:            Thread 2:
line 10: x = 5       line 10: x = 5
line 11: y = x + 1   line 15: z = 100
line 12: ...         line 16: ...

Each thread is executing different lines of code (or same line at different times). The program counter tracks where each thread is.

Why Stack Pointer is Private:

Thread 1:              Thread 2:
SP: 0x7FFF1000         SP: 0x7FFF2000
(points to own         (points to own
 stack frame)           stack frame)

Each thread’s stack grows independently, so each needs its own stack pointer.

Memory Layout Example:

╔═════════════════════════════╗
║ Code (SHARED)               ║ ← function() at 0x4000
║ Global Data (SHARED)        ║ ← counter at 0x6000
║ Heap (SHARED)               ║ ← malloc() results
║                             ║
║ Thread 1 Stack (PRIVATE)    ║ ← SP1: 0x7FFF1000
║   local_var_T1 at 0x7FFF0F0 ║
║   param_T1 at 0x7FFF0F8     ║
║                             ║
║ Thread 2 Stack (PRIVATE)    ║ ← SP2: 0x7FFF2000
║   local_var_T2 at 0x7FFF1F0 ║
║   param_T2 at 0x7FFF1F8     ║
╚═════════════════════════════╝

Implications for Programming:

  • Shared data needs locks: counter++ must be atomic
  • Passing stack variables is dangerous: Passing address of local variable to another thread risks use-after-free
  • Passing heap pointers is safe: Heap data persists as long as not freed
  • Global state is a liability: Avoid globals; use thread-local storage if needed

What’s the purpose of pthread_detach() function?

Instructor Solution

The purpose of pthread_detach() is to tell the operating system that a thread’s resources (like its stack and Thread Control Block) should be automatically reclaimed immediately upon its termination. Normally, a thread is “joinable,” meaning it stays in a “zombie” state after finishing until another thread calls pthread_join() to collect its exit status.

Using pthread_detach():

  • Prevents Resource Leaks: It ensures the system doesn’t run out of memory if you create many threads but don’t intend to “join” them.
  • Enables “Fire and Forget”: It allows the thread to run independently without the parent process needing to monitor its completion.

Elaboration (by LLM)

Thread Lifecycle:

Created → Runnable → Running → Blocked → Exiting → Zombie → Reaped
                                                   └─ still consuming resources!

Joinable Thread (Default):

pthread_t tid;
pthread_create(&tid, NULL, thread_function, NULL);
// ... thread runs and calls pthread_exit()
// Thread is now in ZOMBIE state, waiting to be joined
pthread_join(tid, NULL); // Reap the thread, free resources

Problems with Always Joining:

for (int i = 0; i < 1000; i++) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    // BUG: we create 1000 threads but never join them!
    // All 1000 threads become zombies, consuming memory
}

Detached Thread:

pthread_t tid;
pthread_create(&tid, NULL, thread_function, NULL);
pthread_detach(tid); // Mark as detached
// ... thread runs and calls pthread_exit()
// Resources are automatically freed - no zombie state!

When to Use Detached Threads:

  1. Worker Thread Pool:

     for (int i = 0; i < 100; i++) {
         pthread_t tid;
         pthread_create(&tid, NULL, worker_thread, NULL);
         pthread_detach(tid);  // fire and forget
     }

  2. Background Tasks:

     void save_file_async(const char *filename) {
         pthread_t tid;
         pthread_create(&tid, NULL, save_file_worker, (void *)filename);
         pthread_detach(tid);  // don't wait for completion
     }

  3. Server Handling Multiple Clients:

     while (accept_connection(&client_socket)) {
         pthread_t tid;
         pthread_create(&tid, NULL, handle_client, (void *)(intptr_t)client_socket);
         pthread_detach(tid);  // handle each client independently
     }

When to Use Joinable Threads:

  1. When you need the return value:

     void *result;
     pthread_join(tid, &result);  // get the thread's return value
     printf("Thread returned: %ld\n", (long)result);

  2. When you need to wait for completion:

     pthread_create(&tid, NULL, computation_thread, NULL);
     // ... do other work ...
     pthread_join(tid, NULL);  // wait for computation to finish
     printf("Computation done!\n");

  3. Small, bounded number of threads:

     int num_threads = 4;
     pthread_t tids[4];
     for (int i = 0; i < num_threads; i++) {
         pthread_create(&tids[i], NULL, worker, NULL);
     }
     for (int i = 0; i < num_threads; i++) {
         pthread_join(tids[i], NULL);  // wait for all workers
     }

Resource Leak Example:

// WRONG - resource leak!
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    // thread becomes a zombie when it exits
    // resources never freed - leak!
}

// CORRECT - detached thread
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_detach(tid);
    // resources auto-freed when thread exits
}

// ALSO CORRECT - joinable with explicit join
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_join(tid, NULL);
    // resources freed after join
}

Key Differences:

Aspect            Joinable                   Detached
Return value      Can get it                 Lost
Synchronization   Explicit join              None
Resource cleanup  Manual (join)              Automatic
When to use       Few threads, need results  Worker pools, background tasks
Overhead          Stays in zombie state      Clean exit

Problem 7: Multi-Threaded Process Termination


List 2 conditions under which a multi-threaded process terminates.

Instructor Solution

A multi-threaded process terminates under any of the following conditions:

(1) A thread calls exit(): If any thread calls the exit() system call, the entire process and all its associated threads are immediately destroyed by the kernel.

(2) Last Thread Exits: The process naturally terminates when the last remaining thread calls pthread_exit().

(3) Fatal Signal: If any single thread performs an illegal operation (like a Segmentation Fault) that results in a fatal signal, the operating system terminates the entire process.

Elaboration (by LLM)

Key Insight: Process Termination is Process-Wide:

Unlike individual thread termination, when a process terminates, ALL threads die instantly, regardless of their state.

Condition 1: exit() Immediately Kills All Threads:

void *thread_func(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        printf("Thread working...\n");
        sleep(1);
    }
    pthread_exit(NULL);  // gentle exit
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, NULL);
    sleep(2);
    exit(0);   // KILLS the thread immediately, even mid-loop!
               // Process ends; thread gets no cleanup chance
    return 0;  // never reached
}

Why exit() is Dangerous:

  • Bypasses thread cleanup
  • Doesn’t call destructors (in C++)
  • Closes file descriptors immediately
  • Could corrupt files if thread was writing
  • Doesn’t allow graceful shutdown

Condition 2: Last Thread Exiting (Graceful Termination):

void *worker(void *arg) {
    printf("Thread starting\n");
    sleep(2);
    printf("Thread ending\n");
    pthread_exit(NULL);  // graceful exit
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    printf("Main waiting...\n");
    pthread_join(tid, NULL);  // wait for thread
    printf("Thread done, main exiting\n");
    return 0;  // process terminates
}

Timeline:

0ms: Main creates thread
0ms: Main prints "Main waiting..."
0ms: Thread starts, prints "Thread starting"
2000ms: Thread exits, prints "Thread ending"
2000ms: Main resumes from join()
2000ms: Main prints "Thread done, main exiting"
2000ms: Process terminates (last thread exited)

Condition 3: Fatal Signal (Crash):

void *bad_thread(void *arg) {
    int *ptr = NULL;
    *ptr = 42;    // segmentation fault!
    return NULL;  // never reached
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, bad_thread, NULL);
    pthread_join(tid, NULL);
    return 0;
}

When the thread dereferences NULL:

  • OS sends SIGSEGV signal to process
  • Default handler terminates entire process
  • All threads die immediately
  • Process never reaches pthread_join()

Comparing Termination Methods:

// METHOD 1: Graceful (main returns normally)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);  // wait for worker
    return 0;                 // clean exit
}

// METHOD 2: Abrupt (main calls exit)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);
    exit(0);  // KILLS process immediately
}

// METHOD 3: Crash (signal)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, bad_thread, NULL);
    sleep(1);
    // if bad_thread crashes, the entire process dies
}

Best Practices:

  1. Always join threads before exiting:

     int main() {
         pthread_t tids[10];
         for (int i = 0; i < 10; i++) {
             pthread_create(&tids[i], NULL, worker, NULL);
         }
         for (int i = 0; i < 10; i++) {
             pthread_join(tids[i], NULL);  // IMPORTANT!
         }
         return 0;  // clean exit
     }

  2. Or use pthread_exit in main:

     int main() {
         pthread_t tid;
         pthread_create(&tid, NULL, worker, NULL);
         // main's thread exits, but the process lives on
         pthread_exit(NULL);  // process ends when all threads finish
     }

  3. Avoid calling exit() in threads:

     // WRONG
     void *thread_func(void *arg) {
         // do work
         exit(0);  // KILLS entire process!
     }

     // RIGHT
     void *thread_func(void *arg) {
         // do work
         pthread_exit(NULL);  // only this thread exits
     }

Define concurrency and parallelism and briefly explain the difference between the two.

Instructor Solution

Concurrency is the art of managing multiple tasks at once by interleaving their execution, essentially “dealing with” many things at the same time on a single or multiple cores.

Parallelism is the act of physically executing multiple tasks simultaneously, which requires hardware with multiple processing units (like a multi-core CPU).

The core difference lies in execution: concurrency is about structural design and multitasking (making progress on Task A, then Task B), while parallelism is about simultaneous execution (Task A and Task B running at the exact same millisecond).

Elaboration (by LLM)

Concurrency: Illusion of Parallelism

Concurrency is a software concept—the arrangement of code to handle multiple tasks. It does NOT require multiple cores.

Single-Core Concurrency Example:

Time    CPU Core 1
──────────────────
0ms     Task A [----]
4ms     Task B [----]
8ms     Task A [----]
12ms    Task C [----]
16ms    Task B [----]
20ms    Done

Result: All tasks "appear" to run at once, but only one runs at a time.
The OS rapidly switches between them.

Parallelism: True Simultaneous Execution

Parallelism is a hardware reality—multiple cores literally running code at the same instant. It REQUIRES multiple cores.

Dual-Core Parallelism Example:

Time    CPU Core 1      CPU Core 2
──────────────────────────────────
0ms     Task A [----]   Task B [----]
4ms     Task A [----]   Task B [----]
8ms     Task A [----]   Task B [----]
12ms    Task C [----]   Task B [----]

Result: Tasks A and B run simultaneously.
Core 1 switches between A and C; Core 2 runs B uninterrupted.

Key Differences Table:

Aspect               Concurrency                    Parallelism
Definition           Structural arrangement         Physical execution
Requirements         Software only                  Requires multiple cores
CPU cores needed     1+                             2+
Context switching    Yes                            May have switching too
When speedup occurs  When waiting for I/O           Always (CPU-bound work)
Example              Web server handling requests   Matrix multiplication on 4 cores

Visual Comparison:

Single-core Concurrency:
[A][B][C][A][B][C] ← One core time-slicing
Sequential at any instant, but concurrent across time
Multi-core Parallelism:
Core 1: [A][A][C][C]
Core 2: [B][B][B][B]
Simultaneous execution at same instant

Why Concurrency Helps Even Without Parallelism:

// Single-threaded (sequential) I/O
void fetch_websites() {
    download_site_a();  // 1000ms (blocked waiting for network)
    download_site_b();  // 1000ms (blocked waiting for network)
    download_site_c();  // 1000ms (blocked waiting for network)
    // Total: 3000ms
}

// Concurrent (multi-threaded) I/O -- pseudocode
void fetch_websites_concurrent() {
    // Thread A: download_site_a()  // 1000ms
    // Thread B: download_site_b()  // 1000ms (overlapped)
    // Thread C: download_site_c()  // 1000ms (overlapped)
    // Total: ~1000ms (the waits overlap)
}

Even on single core:

  • While Thread A waits for network, CPU switches to Thread B
  • While Thread B waits, CPU switches to Thread C
  • All three downloads overlap in time

Why Parallelism Requires Multiple Cores:

You cannot achieve true parallelism on a single core because:

  • Only one instruction sequence can execute at any instant
  • Context switching creates the illusion, but still sequential
  • CPU has only one execution pipeline per core

Real-World Example: Web Server

Concurrency without parallelism (single core):

Client 1: [request] [wait] [respond]
Client 2: [request] [wait] [respond]
Client 3: [request] [respond]
All clients "feel" like server is responsive
Server is actually context-switching rapidly

Parallelism (multi-core):

Core 1: Client 1: [request] [wait] [respond]
Core 2: Client 2: [request] [wait] [respond]
Core 3: Client 3: [request] [wait] [respond]
All clients running truly in parallel
No time-sharing needed

Why Both Matter:

  • Concurrency helps with responsiveness and I/O-bound work
  • Parallelism helps with pure computation on multiple cores
  • Multi-threaded app on multi-core: Best of both worlds!

Problem 9: Concurrency Without Multiple Cores


If you have a concurrent system, i.e., multi-tasking or multi-threading OS, does the underlying hardware have to have multiple CPU cores? Why or why not?

Instructor Solution

No, the underlying hardware does not need multiple CPU cores to support a concurrent system. Concurrency is a software property managed by the operating system through context switching and time-slicing, where the CPU rapidly alternates between different threads or tasks. By giving each process a few milliseconds of execution time before switching to the next, the OS creates the illusion that multiple tasks are running simultaneously, even if only one instruction is being physically processed at any given moment on a single core.

Elaboration (by LLM)

Time-Slicing: The Core of Concurrency

On a single core, the OS divides CPU time into small chunks (time slices or time quanta, typically 10-100ms).

Timeline Example: Single Core Running 3 Threads

Time      Thread A   Thread B   Thread C
0-10ms    [10ms]
10-20ms              [10ms]
20-30ms                         [10ms]
30-40ms   [10ms]
40-50ms              [10ms]
50-60ms                         [10ms]
60-70ms   [10ms]
70-80ms              [10ms]
80-90ms                         [10ms]

Each thread gets a turn to run for 10ms, then the OS saves its state and switches to the next thread.

Why This Works:

  1. Save State: Program counter, registers, stack pointer
  2. Switch: OS scheduler picks next thread
  3. Restore State: New thread resumes from where it left off
  4. Repeat: Rapid cycling creates illusion of parallelism

Context Switch Overhead:

3 threads × 10ms per thread = 30ms per full cycle
The OS switches threads ~100 times per second
User doesn't notice the switching; all 3 threads appear to run "simultaneously"

Why Multiple Cores Would Help (But Aren’t Required):

Single core (sequential time-slicing):
Thread A: [exec 10ms] [wait] [exec 10ms]
Thread B: [exec 10ms] [wait] [exec 10ms]
Thread C: [exec 10ms] [exec 10ms]
Result: All threads make progress, but not truly parallel
Multi-core (true parallelism):
Core 1 - Thread A: [exec continuous]
Core 2 - Thread B: [exec continuous]
Core 3 - Thread C: [exec continuous]
Result: All threads run simultaneously, faster execution

I/O-Bound Concurrency on Single Core:

Even with single core, concurrency excels with I/O:

Thread A (network request):
[execute code] → [network I/O - BLOCKS] → [process response]
Thread B (disk read):
[execute code] → [disk I/O - BLOCKS] → [process result]
Timeline:
0ms: A executes code
5ms: A calls read() for network → BLOCKS
5ms: OS switches to B
8ms: B executes code
10ms: B calls read() for disk → BLOCKS
10ms: OS switches back to A
100ms: Network response arrives, A unblocks
100ms: OS switches to A
105ms: A processes network data
110ms: A does more work...
200ms: Disk response arrives, B unblocks
200ms: OS switches to B
205ms: B processes disk data

Both I/O operations happen concurrently (overlapped in time) even though only one core is doing computation.

Historical Context:

Early computers (1970s-2000s) had single-core processors, yet supported:

  • Multi-tasking (multiple programs)
  • Concurrency (multiple threads within programs)
  • Responsiveness (background tasks while user interacts)

All achieved through context switching, not parallelism!

The Takeaway:

Concurrency = Software design property (can work with 1 core)
Parallelism = Hardware feature (requires 2+ cores)
A single core can support concurrent code
A single core cannot support parallel execution

When Single-Core Concurrency Works Well:

  • I/O-bound applications (network, disk, user input)
  • Applications where responsiveness matters more than speed
  • Systems with limited hardware resources (embedded systems)

When Single-Core Concurrency Falls Short:

  • CPU-bound computation (heavy math, processing)
  • Real-time systems needing deterministic performance
  • Applications that need true speedup

Problem 10: Parallelism and Hardware Requirements


To achieve parallelism, does the underlying hardware have to have multiple CPU cores? Why or why not?

Instructor Solution

Yes, to achieve true parallelism, the hardware must have multiple CPU cores (or multiple processors). Unlike concurrency, which can be faked on a single core through rapid switching, parallelism requires the physical ability to execute two or more instructions at the exact same moment. Without at least two independent processing units, tasks cannot run simultaneously; they can only ever be interleaved.

Elaboration (by LLM)

The Fundamental Requirement: Multiple Execution Pipelines

A single CPU core has one instruction pipeline. It can only execute one instruction sequence at any given nanosecond.

Single Core (1 Pipeline):
┌─────────────────────────┐
│ Fetch → Decode → Exec │ ← Only one instruction at a time
└─────────────────────────┘
Can't execute Task A and Task B simultaneously
Can only switch back and forth (context switching)

Multi-Core (Multiple Pipelines):

Core 1 (Pipeline 1):
┌─────────────────────────┐
│ Fetch → Decode → Exec │ ← Task A running
└─────────────────────────┘
Core 2 (Pipeline 2):
┌─────────────────────────┐
│ Fetch → Decode → Exec │ ← Task B running (simultaneously!)
└─────────────────────────┘
Result: At the same nanosecond:
Core 1 executes instruction from Task A
Core 2 executes instruction from Task B
TRUE PARALLELISM

Why Concurrency Alone Doesn’t Give Parallelism:

long result = 0;

// Two CPU-bound tasks
void task_a() {
    for (int i = 0; i < 1000000; i++) {
        result += (long)i * i;  // pure computation
    }
}

void task_b() {
    for (int i = 0; i < 1000000; i++) {
        result += (long)i * i;  // pure computation
    }
}

// Single-core execution (concurrency, NO parallelism):
//   Time:  0ms    10ms    20ms    30ms    40ms
//   [Task A][Task B][Task A][Task B][Task A][Task B]
//   Both tasks make progress, but never truly simultaneous
//   Total time: ~40ms (both tasks' times summed)

// Dual-core execution (parallelism):
//   Core 1: [Task A____________]
//   Core 2: [Task B____________]
//   Both tasks run truly simultaneously
//   Total time: ~20ms (the max of the two)
//   => 2x faster!

The Speedup Formula:

Single-core: Total time = Task A time + Task B time
Multi-core: Total time = max(Task A time, Task B time)
With 2 cores: Speedup = (A + B) / max(A, B)
If A ≈ B: Speedup ≈ 2x
If A >> B: Speedup ≈ 1x

Hardware Parallel Execution Example:

int a[1000], b[1000], c[1000];

// Core 1               Core 2
a[0] = 1;               a[1] = 1;
a[0] = a[0] + 2;        a[1] = a[1] + 2;
b[0] = a[0];            b[1] = a[1];
c[0] = b[0];            c[1] = b[1];
// ↓ each row executes at the SAME instant, one column per core

Misconception: Hyper-Threading Equals Two Cores

Intel’s hyper-threading creates the ILLUSION of 2 cores per physical core:

Physical Core 1 (hyper-threaded):
  Logical CPU 1: Thread A instructions
  Logical CPU 2: Thread B instructions
BUT: Both logical CPUs share the same execution resources
NOT true parallelism, but better than pure time-slicing
Can keep the pipeline busy while one thread stalls
Result: ~1.2x-1.5x speedup (not 2x like a real dual-core)

Why You Need Multiple Cores for Parallelism:

  1. One instruction stream per core: a single core cannot run two independent instruction sequences at once
  2. Shared execution resources: even a superscalar core's multiple ALUs and functional units all serve that one stream
  3. Memory bandwidth: additional cores bring their own caches and paths to RAM
  4. No alternative: you MUST have 2+ cores for true thread-level parallelism

Parallelism in Modern Processors:

Quad-core CPU:
┌─────────────────────┐
│ Core 1: [Pipeline] │
│ Core 2: [Pipeline] │
│ Core 3: [Pipeline] │
│ Core 4: [Pipeline] │
└─────────────────────┘
Each core can run a different thread/task simultaneously
Useful for:
- Image processing (4 threads processing 4 regions)
- Web server (4 cores handling 4 clients)
- Parallel algorithms (divide work across cores)

Key Limitation: Amdahl’s Law

Even with parallelism, speedup is limited:

If 10% of program is sequential (must run on 1 core)
And 90% is parallelizable (can run on N cores)
Speedup with 4 cores ≈ 1/(0.1 + 0.9/4) ≈ 3.1x (not 4x!)

Summary:

Feature                    | Single Core              | Multi-Core
───────────────────────────┼──────────────────────────┼───────────────────
Concurrency                | Yes (context switching)  | Yes (still needed)
Parallelism                | No (impossible)          | Yes
Speedup on CPU-bound work  | No                       | Yes
Hardware cost              | Low                      | Higher
Synchronization complexity | Lower                    | Higher

Problem 11: Multi-Threading on Single-Core

Section titled “Problem 11: Multi-Threading on Single-Core”

Give one reason why you would want to implement a multi-threaded application in a single-core machine.

Instructor Solution

The primary reason is to improve responsiveness during I/O-bound tasks. In a single-threaded application, if the program waits for a slow resource (like a disk read, a network response, or user input), the entire process—including the UI—blocks and freezes. In a multi-threaded system, while one thread is “blocked” waiting for I/O, the OS can switch the CPU to another thread. This allows the application to remain interactive or perform background calculations while the hardware handles the data transfer in the background.

Elaboration (by LLM)

The I/O Blocking Problem:

Modern applications spend most time waiting for I/O, not computing:

// Single-threaded approach (network_read/process_data are placeholders)
int main() {
    printf("Fetching data...\n");
    data = network_read();  // BLOCKS for 2 seconds
    // During these 2 seconds: process CANNOT respond to user input
    // CPU sits idle, waiting for network
    process_data(data);     // Computation: 100ms
    printf("Done!\n");
}

User experiences: 2 second freeze while waiting for network

Multi-Threaded Solution:

void *network_thread(void *arg) {
    data = network_read();  // BLOCKS for 2 seconds
    return NULL;
}

void *ui_thread(void *arg) {
    while (true) {
        handle_user_input();  // ALWAYS responsive
        update_display();     // Show progress
        usleep(10 * 1000);    // Yield for ~10 ms between updates
    }
    return NULL;
}

int main() {
    pthread_t net_tid, ui_tid;
    pthread_create(&net_tid, NULL, network_thread, NULL);
    pthread_create(&ui_tid, NULL, ui_thread, NULL);
    pthread_join(net_tid, NULL);  // Wait for the download to finish
    pthread_cancel(ui_tid);       // Then stop the UI loop
    return 0;
}

Timeline Comparison:

Single-threaded (FREEZES):
0ms: Network request starts
0ms: UI frozen (cannot handle input)
2000ms: Network response arrives
2000ms: UI becomes responsive
2000ms: Process data (100ms)
2100ms: Done
User experience: 2 second freeze!
Multi-threaded (RESPONSIVE):
0ms: Network thread starts request
0ms: UI thread immediately handles input
0ms: User can interact with UI
50ms: User sees "Downloading..." message
100ms: User clicks "Cancel" button (UI immediately responds)
500ms: UI shows progress (20% downloaded)
1000ms: UI shows progress (50% downloaded)
2000ms: Network response arrives
2000ms: Process data (100ms)
2100ms: UI shows results
User experience: Smooth, responsive!

Real-World Example: Email Client

Single-threaded:

User clicks "Send Email"
→ Compose thread exits
→ Network thread starts, blocks sending...
→ User CANNOT check mail, CANNOT type new email
→ After 30 seconds: finally sent, UI responsive again

Multi-threaded:

User clicks "Send Email"
→ Send thread spawned (handles the I/O)
→ UI thread continues immediately
→ User can check mail while sending
→ User can draft new email while sending
→ All while network thread waits for server

Why Single-Core is Still Useful:

When network blocks on Core 0:

Time Thread A (UI) Thread B (Network)
────────────────────────────────────────
0ms [handles input] [network request]
5ms [renders] [blocked waiting...]
10ms [handles input] [blocked waiting...]
15ms [renders] [blocked waiting...]
...
2000ms [handles input] [response arrived!]

While Thread B waits (100% CPU idle), Thread A uses CPU for UI.

This is the KEY benefit of concurrency on single-core: overlapping I/O with computation.

I/O Wait Time Hierarchy:

Memory access: ~100 nanoseconds
Disk seek: ~10 milliseconds (100,000x slower!)
Network request: ~100 milliseconds (1,000,000x slower!)
User think time: ~3 seconds (30,000,000x slower!)

A single thread waiting 100ms for network = huge opportunity for other threads!

Common I/O-Bound Scenarios:

  1. Web Server:

    • Thread 1: Handles Client A (blocked on network read)
    • Thread 2: Handles Client B (computing response)
    • Thread 3: Handles Client C (blocked on disk read)

    Each client thinks server is responsive, even on single core!

  2. Database Client:

    • Thread 1: UI (handles mouse clicks, keyboard)
    • Thread 2: Database query (blocked on DB server response)

    User can cancel query or start new one while previous blocks

  3. File Downloader:

    • Thread 1: Download thread 1 (blocked waiting for network)
    • Thread 2: Download thread 2 (blocked waiting for network)
    • Thread 3: Download thread 3 (blocked waiting for network)

    All three downloads overlap in time (concurrent I/O)

The Golden Rule:

For I/O-bound applications:
Even 1 core + multi-threading > 1 core + single thread
Why? Because:
- N threads can have N I/O operations in flight
- When one blocks, CPU switches to another
- CPU is never idle (unless all threads are blocked)
For CPU-bound applications:
Multi-threading on 1 core helps responsiveness
But NOT computational performance

Practical Example with Numbers:

Task: Download 3 files (each 10MB, 1 second over network)
Single-threaded:
Download file 1: [1000ms network I/O]
Download file 2: [1000ms network I/O]
Download file 3: [1000ms network I/O]
Total: 3000ms
Multi-threaded (1 core):
Thread 1: Download file 1 [1000ms I/O]
Thread 2: Download file 2 [1000ms I/O, concurrent]
Thread 3: Download file 3 [1000ms I/O, concurrent]
Total: ~1000ms (network operations overlap!)

The network doesn’t care about cores—it operates in parallel by nature!

Problem 12: Many-to-One Thread Mapping and I/O

Section titled “Problem 12: Many-to-One Thread Mapping and I/O”

Consider an OS where the threads are implemented by a user-level thread library and all threads of a process are mapped to just one kernel thread (many-to-one mapping). Consider a process that has two threads, T1 and T2, and T1 is running.

True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.

Instructor Solution

Answer: False.

In a many-to-one mapping, the operating system only sees the single kernel thread, not the individual user-level threads. When T1 executes a blocking I/O system call, the entire kernel thread (and thus the entire process) is placed into a “Blocked” state by the OS. Because the kernel is unaware that T2 even exists or is ready to run, it cannot schedule it; the user-level library never gets the chance to swap T1 for T2 until the kernel thread becomes “Ready” again.

Elaboration (by LLM)

The Kernel’s Limited Visibility:

The problem hinges on what the kernel can “see”:

Many-to-One Thread Mapping:
User-Level View:
┌─────────────────────────────────────────┐
│ Single Process (User-level Library) │
│ ┌─────────┐ ┌─────────┐ │
│ │ Thread1 │ │ Thread2 │ (Managed │
│ │ (Ready) │ │ (Ready) │ by library) │
│ └─────────┘ └─────────┘ │
└─────────────────────────────────────────┘
↓ (hidden from kernel)
Kernel's View:
┌─────────────────────────────────────────┐
│ Single Schedulable Entity │
│ (One Kernel Thread = LWP) │
│ (Kernel thread is "Blocked") │
└─────────────────────────────────────────┘

The kernel only manages the single kernel thread. User threads are merely an abstraction—like library bookmarks that track which function is executing.

The Blocking Disaster:

Scenario: T1 makes a blocking read

// Thread 1 in user space
int data[100];
int bytes = read(fd, data, sizeof(data)); // Blocking I/O!

What happens:

Time Kernel View User Library View
────────────────────────────────────────────────────
0ms KLT executing read() T1 executing read()
(dispatched to CPU) T2 waiting (Ready)
5ms read() syscall issued OS takes control
OS blocks waiting for I/O (KLT blocked)
10ms KLT still blocked T2 TRAPPED
(waiting for disk/network) Cannot run!
100ms I/O complete T1 can resume
KLT becomes Ready (but not yet)
105ms Kernel reschedules KLT User library gets control
(only resource it knows) Could now run T2
But OS doesn't know about
T2's existence!

Why User Library Can’t Help:

A many-to-one user-level thread library (e.g., old green-thread libraries) cannot intercept blocking kernel system calls:

// User library cannot see this:
void *thread_func(void *arg) {
    read(fd, buf, 100); // ← Direct syscall to kernel
    // ← User library has NO control here
    return NULL;
}

When T1 calls read():

  1. CPU switches from user mode to kernel mode
  2. User library is paused (no execution)
  3. Only kernel code runs
  4. Kernel doesn’t know about T2
  5. Entire process blocked

Result: T2 cannot run!

Comparison Table:

┌──────────────────┬──────────────┬──────────────┐
│ Scenario │ Many-to-One │ One-to-One │
├──────────────────┼──────────────┼──────────────┤
│ T1 blocks on I/O │ T2 CANNOT │ T2 CAN │
│ │ run │ run │
├──────────────────┼──────────────┼──────────────┤
│ Kernel sees │ 1 entity │ 2 entities │
│ │ (blocked) │ (one ready) │
├──────────────────┼──────────────┼──────────────┤
│ Scheduler can... │ Switch to │ Switch to T2 │
│ │ other │ directly │
│ │ processes │ │
├──────────────────┼──────────────┼──────────────┤
│ Used in │ Green threads│ Linux, Unix │
│ │ (obsolete) │ (modern) │
└──────────────────┴──────────────┴──────────────┘

Real-World Example: Web Server

Many-to-One (FAILS):
Thread 1: Handle Client A request
→ read() from network (blocking)
→ OS blocks entire process
→ Thread 2 CANNOT handle Client B
→ Client B waits (poor responsiveness)
→ After network I/O completes: Thread 1 resumes
One-to-One (WORKS):
Thread 1: Handle Client A request
→ read() from network (blocking)
→ OS blocks only Thread 1 kernel thread
→ Scheduler immediately picks Thread 2
→ Thread 2 handles Client B (responsive!)
→ Client B doesn't wait for Client A

Why Many-to-One Fails:

The fundamental problem: The OS controls scheduling, not the user library.

Control Flow with Many-to-One:
User Library: [Init] → [decide to switch T1→T2] → ...
(but this decision is useless)
Kernel: [receives syscall] → [KLT blocked]
→ [User Library frozen, cannot execute]
→ [User Library cannot switch threads]
→ [Only kernel can resume execution]
Result: User library's thread switching logic never runs!

When Blocking Occurs:

Safe (Non-blocking) I/O Scenarios:
1. Polling:
   T1: data = read_nonblocking(fd); // Returns immediately
       if (data) { process(data); }
   T2: can run between polls
2. Asynchronous I/O:
   T1: aio_read(&cb); // cb.aio_fildes = fd; returns immediately
   T2: can run while kernel handles I/O in background
But with standard blocking I/O:
   T1: read(fd, buf, 100); // BLOCKS
   T2: TRAPPED (user library powerless)

Historical Note: Green Threads

Java and some older systems used many-to-one:

Java 1.0 - 1.2:
❌ All user threads → 1 OS thread
❌ Blocking I/O in one thread froze entire app
❌ Limited to single core (no parallelism)
Java 1.3+:
✅ One user thread → 1 OS thread
✅ Blocking doesn't freeze app
✅ Can utilize multiple cores
Why the change? Because many-to-one is fundamentally broken!

The Key Lesson:

Blocking System Call + Many-to-One Thread Mapping
=
Entire Process Freezes
(User library has no control once kernel takes over)

This is why modern OSes (Linux, Windows, macOS) use one-to-one mapping: the kernel must be aware of each thread for true concurrency with blocking I/O.

Problem 13: One-to-One Thread Mapping and I/O

Section titled “Problem 13: One-to-One Thread Mapping and I/O”

Consider an OS where each user-level thread has a corresponding kernel-level thread (one-to-one mapping) as in Linux. Consider a process that has two threads, T1 and T2 and T1 is running.

True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.

Instructor Solution

Answer: True.

In a one-to-one mapping, every user-level thread is a distinct schedulable entity in the eyes of the kernel. Because the OS kernel manages each thread individually, it recognizes that only T1 is waiting for I/O. While T1 is in the “Blocked” state, the kernel’s scheduler can independently pick T2 from the “Ready” queue and dispatch it to a CPU core.

Elaboration (by LLM)

Kernel Awareness of Each Thread:

In one-to-one mapping, the kernel maintains separate TCBs for each thread:

One-to-One Mapping:
User Process:
┌─────────────────┐ ┌─────────────────┐
│ Thread 1 │ │ Thread 2 │
└────────┬────────┘ └────────┬────────┘
│ │
↓ ↓
Kernel Thread Manager:
┌─────────────────────────────────────────────────┐
│ TCB 1: (T1's context) │
│ State: Blocked (I/O) │
├─────────────────────────────────────────────────┤
│ TCB 2: (T2's context) │
│ State: Ready │
└─────────────────────────────────────────────────┘
Scheduler: Pick KLT2 freely!
T2 runs!

Unlike many-to-one where the kernel sees one blocked entity, here it sees two independent schedulable entities.

Key Difference from Problem 12:

When T1 blocks on I/O in one-to-one mapping:

  • Kernel knows about TCB1 AND TCB2 (both visible)
  • Kernel blocks only TCB1 (puts it in I/O wait queue)
  • TCB2 remains in Ready queue (still schedulable)
  • Scheduler freely picks TCB2 to run
  • T2 executes while T1 waits on network/disk

Real-World Impact - Web Server Example:

With 1000 clients and 1 core:

Many-to-One (Problem 12):

  • Client 1 → Thread 1 (blocked on read)
  • Process blocked
  • Clients 2-1000 wait sequentially
  • Response time: ~1000 seconds

One-to-One (Problem 13):

  • Thread 1: Client A (blocked)
  • Thread 2: Client B (executing)
  • Thread 3: Client C (executing)
  • CPU time-shares among 1000 threads
  • Network handles I/O in parallel
  • Response time: ~10 seconds

The kernel schedules based on what it can SEE—and in one-to-one, it sees all threads!

Consider a process with 4 threads in Linux. When a thread invokes fork() system call to create a new process, how many threads will the new process have? Briefly explain.

Instructor Solution

In Linux, the new process will have only one thread. When a thread calls fork(), the operating system creates a child process that is a duplicate of the caller’s address space. However, only the calling thread is duplicated into the child process; the other three threads simply cease to exist in the new process.

This behavior prevents complex issues, such as a child process inheriting threads that were in the middle of a sensitive operation or holding a mutex that will never be released (since the thread holding it wasn’t copied).

Elaboration (by LLM)

Why Only the Calling Thread?

When all threads were hypothetically copied:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *thread2_func(void *arg) {
    pthread_mutex_lock(&lock);  // Thread 2 is HERE (middle of critical section)
    // ... doing critical work ...
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t tid2, tid3, tid4;
    pthread_create(&tid2, NULL, thread2_func, NULL);
    pthread_create(&tid3, NULL, other_work, NULL);
    pthread_create(&tid4, NULL, other_work, NULL);
    fork();  // DISASTER if all threads were copied!
}

If all threads were duplicated:

Child Process After fork():

  • Thread 2 (copied): Has lock, in middle of critical section
    • Cannot unlock (state undefined)
    • Mutex is corrupted in child!
  • Thread 3 (copied): Tied to parent’s I/O operations
  • Thread 4 (copied): Managing memory
  • Result: Complete chaos, deadlocks, data corruption!

The Safe Solution: Copy Only Calling Thread

int main() {
    pthread_t tid2, tid3, tid4;
    pthread_create(&tid2, NULL, worker, NULL);
    pthread_create(&tid3, NULL, worker, NULL);
    pthread_create(&tid4, NULL, worker, NULL);
    pid_t child_pid = fork();  // Safe!
    if (child_pid == 0) {
        // Child process
        printf("Child has 1 thread\n");  // Only the calling thread
        // Safe to do child-specific work
        close(shared_fd);  // Don't interfere with parent
        _exit(0);
    } else {
        // Parent process
        printf("Parent still has 4 threads\n");
        pthread_join(tid2, NULL);
        pthread_join(tid3, NULL);
        pthread_join(tid4, NULL);
    }
}

Thread Count After fork():

Before fork(): 4 threads (1 main, 3 workers)
After fork(): Parent has 4 threads (unchanged)
Child has 1 thread (calling thread only)

Common Problem: Lost Worker Threads

void *worker_thread(void *arg) {
    while (true) {
        job_t *job = get_next_job();
        process_job(job);
    }
}

int main() {
    // Create thread pool (8 worker threads)
    for (int i = 0; i < 8; i++) {
        pthread_create(&workers[i], NULL, worker_thread, NULL);
    }
    fork();  // Oops!
}

In child process:

  • All 8 workers are GONE
  • Only main thread remains
  • If child tries to add jobs: queue fills up
  • No workers to process them: DEADLOCK

Best Practice: Call fork() before creating threads!

Consider a process with 4 threads in Linux. When a thread invokes exec() system call to load a new executable on top of the current process, how many threads will the new process have? Briefly explain.

Instructor Solution

The new process will have only one thread. When a thread calls exec(), the entire address space of the process is replaced with the new executable. Because the old code, data, and stacks are wiped out to make room for the new program, all existing threads—including the one that called exec()—are terminated. The new program then begins execution from its main() function with a single, fresh thread of control.

Elaboration (by LLM)

Complete Address Space Replacement:

The exec() syscall doesn’t “update” the process—it completely replaces it:

// Old program - 4 threads (main + 3 workers)
int main() {
    pthread_t tid2, tid3, tid4;
    pthread_create(&tid2, NULL, work, NULL);
    pthread_create(&tid3, NULL, work, NULL);
    pthread_create(&tid4, NULL, work, NULL);
    sleep(10); // Simulate work
    execv("/usr/bin/new_program", args); // REPLACED!
    // Nothing after this executes in old program
}

What exec() does:

  1. Flush all state: Old code, old data, old stacks
  2. Release all threads: Threads 2, 3, 4 are killed
  3. Load new executable: Read new binary from disk
  4. Initialize new process: Fresh data, fresh heap
  5. Start new main(): Single thread control

Thread Termination Timeline:

Time Thread 1 Thread 2 Thread 3
──────────────────────────────────────────────────────────
0ms main() Starting Starting
pthread_create()
5ms sleep(10) Running Running
10ms execv() called Running Running
11ms BANG! TERMINATED TERMINATED
(old code destroyed) (killed instantly) (killed instantly)
11ms ┌──────────────────┐
│ NEW EXECUTABLE │
│ CODE LOADED │
│ main() running │
│ 1 thread now │
└──────────────────┘

Key Difference from fork():

fork():
├─ Parent: Unchanged, all threads continue
└─ Child: 1 thread (copy of caller)
Result: Two processes
exec():
├─ Replace current process entirely
├─ All threads (except caller) killed
├─ New code loaded
└─ New main() starts with 1 thread
Result: Same process, different program

Standard Pattern: fork() + exec()

This is the standard way to run new programs:

pid_t pid = fork();
if (pid == 0) {
    // Child process has 1 thread (safe to manipulate)
    close(STDIN_FILENO);
    open("/input.txt", O_RDONLY);
    execv("/bin/cat", args); // Never returns if successful
    perror("execv failed");
    exit(1);
} else {
    // Parent process still has all threads
    waitpid(pid, NULL, 0); // Wait for child's program
}

Resource Leaks After exec():

int main() {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int fd = open("data.txt", O_RDONLY);
    pthread_create(&tid, NULL, work, NULL);
    execv("/bin/true", args);
    // exec() doesn't return on success; note what happens to each resource:
    // - Thread tid: GONE
    // - Mutex m: LOST (never unlocked)
    // - File fd: stays open across exec (unless marked O_CLOEXEC)
}

Solution: Use fork() + exec() carefully to isolate the new program!

A process has two threads, T1 and T2. Assume that T1 allocates some memory on the heap as follows:

int *ptr = new int[100];

T1 then passes this pointer “ptr” to T2. Can T2 legally access the array pointed to by ptr? Justify your answer.

Instructor Solution

Yes. The array pointed to by “ptr” is located in the heap, and the threads that belong to the same process share the same heap. If T2 has the pointer, it can read/write that memory easily. This is the standard way threads share large amounts of data: by storing them either in static global variables or in the heap.

Elaboration (by LLM)

Why Heap Works (Unlike Stack):

Stack (Problem 17 - dangerous):

  • Thread 1 Stack: 0x5000 - 0x6000 (private)
  • Thread 2 Stack: 0x3000 - 0x4000 (private)
  • Each thread has separate stack
  • When thread exits, stack frame destroyed
  • Addresses become invalid (dangling pointers)

Heap (Safe):

  • Single shared heap: 0x8000 - 0x20000
  • All threads allocate from same memory region
  • When thread exits, heap memory persists
  • Other threads can still access it

Pointer Transfer Mechanism:

int *shared_ptr = NULL;  // Shared (global) variable

void *thread1_func(void *arg) {
    int *ptr = (int *)malloc(100 * sizeof(int));  // Heap alloc
    // T1's local variable 'ptr' holds an address, e.g., 0x8000,
    // that points into the SHARED heap
    shared_ptr = ptr;  // Pass to T2: copy the ADDRESS (just a value!)
    return NULL;
}

void *thread2_func(void *arg) {
    // T2 receives the same address from shared_ptr
    int *ptr = shared_ptr;   // Same address as T1!
    // Now T2 can access T1's heap allocation
    ptr[0] = 100;            // OK - writing to shared heap
    printf("%d\n", ptr[50]); // OK - reading from shared heap
    return NULL;
}

Memory Lifetime Guarantee:

void *thread1_func(void *arg) {
    int *ptr = malloc(100 * sizeof(int)); // Heap allocation
    *(int **)arg = ptr;  // Share with T2 (store address)
    return NULL;         // T1 exits
}   // T1's STACK frame destroyed
    // BUT heap at ptr still valid!

void *thread2_func(void *arg) {
    int *data = *(int **)arg; // Get T1's allocation
    printf("%d\n", data[0]);  // ✅ Still accessible!
    return NULL;
}

The heap allocation outlives the thread that created it!

Common Heap Sharing Patterns:

Pattern 1: Global Pointer

int *global_ptr = NULL;

void *thread1_func(void *arg) {
    global_ptr = malloc(1000); // T1 allocates
    global_ptr[0] = 42;
    return NULL;
}

void *thread2_func(void *arg) {
    printf("%d\n", global_ptr[0]); // T2 uses T1's allocation ✅
    return NULL;
}

Pattern 2: Function Argument

void *thread_func(void *arg) {
    int *buffer = (int *)arg; // Buffer passed from main
    buffer[0] = 42;           // ✅ Access shared heap
    return NULL;
}

int main() {
    int *buffer = malloc(100 * sizeof(int));
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, (void *)buffer);
    buffer[0] = 99;           // Main also uses it ✅
    pthread_join(tid, NULL);
    free(buffer);
}

Pattern 3: Thread-Safe Queue

struct Node {
    int data;
    struct Node *next;
};

void enqueue(int value) {
    struct Node *node = malloc(sizeof(struct Node)); // Heap!
    node->data = value;
    node->next = NULL;
    pthread_mutex_lock(&lock);
    // Link into queue
    pthread_mutex_unlock(&lock);
}

void *worker_thread(void *arg) {
    struct Node *node = dequeue(); // T2 accesses T1's heap alloc
    printf("%d\n", node->data);    // ✅ Works
    free(node);
    return NULL;
}

Why This is Safe:

The heap is a single, process-wide resource:

  • One allocator (e.g., glibc malloc) manages the heap for all threads
  • The kernel maps the heap pages into the process's shared address space
  • All threads see a consistent view of that memory
  • An address remains valid until some thread explicitly calls free()

A process has two threads, T1 and T2. Assume that T1 declares a local variable x as follows:

int x = 10;

T1 then passes the address of this variable “x” to T2. Can T2 legally access “x”? Justify your answer.

Instructor Solution

Recall that local variables are on the stack, and every thread within the same process has their own private stack. While T2 can technically access T1’s stack if it has the memory address (since they share the same address space), it is extremely dangerous. If T1 returns from its current function, its stack frame is destroyed, and T2 will be pointing to “garbage” memory, likely causing a crash.

Elaboration (by LLM)

Stack Layout: Thread-Private Regions

Each thread gets its own stack:

Process Address Space:
Text (Shared) All threads execute same code
Data + BSS (Shared) All threads share globals
Heap (Shared) All threads allocate from same heap
STACKS (THREAD-PRIVATE!)
├─ Stack for Thread 1 0x7000: [x = 10] ← T1's local var
└─ Stack for Thread 2 0x4000: [locals] ← T2's locals

Stacks grow in opposite directions, thread-private regions isolated.

The Danger: Stack Frame Destruction

void function_with_local(void *shared_ptr_arg) {
    int x = 10; // Local variable on T1's stack at 0x7000
    // Pass address to T2
    int **ptr_holder = (int **)shared_ptr_arg;
    *ptr_holder = &x; // T2 now holds 0x7000
    return; // CRITICAL MOMENT!
    // Stack frame is DESTROYED here
    // Address 0x7000 is no longer valid
}

Timeline:

Time Thread 1 Thread 2
──────────────────────────────────────────────────────
0ms function_with_local()
x = 10
ptr = &x (passes 0x7000) (waiting)
5ms (doing work) Receives 0x7000
printf(*ptr); // 10 ✅
10ms return; Still holding 0x7000
[Stack destroyed]
[0x7000 now invalid]
[0x7000 becomes garbage]
15ms Another_function() printf(*ptr); // CRASH!
[Reuses stack at 0x7000] (Undefined behavior)
int y = 999; [0x7000=999] Might print garbage

Race Condition - Timing Problem:

int *shared_ptr = NULL;

void *thread1_func(void *arg) {
    int x = 10;
    shared_ptr = &x; // T2 now has the address
    return NULL;     // Stack destroyed!
}

void *thread2_func(void *arg) {
    sleep(1); // Wait 1 second; by now, T1 has exited
    if (shared_ptr) {
        printf("%d\n", *shared_ptr); // CRASH! Dangling pointer!
    }
    return NULL;
}

Comparison: Why Heap Works vs. Stack Fails

HEAP ALLOCATION (Safe):

void *thread1_func(void *arg) {
    int *ptr = malloc(100); // Allocated on heap
    return ptr;             // Can return a heap pointer safely
}   // Stack destroyed, heap survives!

void *thread2_func(void *arg) {
    int *data = result;     // 'result' obtained via pthread_join()
    printf("%d", *data);    // ✅ Works! (heap still allocated)
}

STACK ALLOCATION (Dangerous):

void *thread1_func(void *arg) {
    int x = 10;
    return &x;              // RETURNS STACK ADDRESS!
}   // Stack destroyed, &x invalid!

void *thread2_func(void *arg) {
    int *data = result;     // 'result' obtained via pthread_join()
    printf("%d", *data);    // ❌ CRASH! (stack freed)
}

Best Practices:

❌ DON'T: Pass stack addresses between threads
❌ DON'T: Return &local_variable from thread function
❌ DON'T: Store stack pointers in shared variables
✅ DO: Use heap allocations (malloc)
✅ DO: Use global variables (for constants)
✅ DO: Copy values instead of addresses (when safe)

Correct Pattern:

void *thread1_func(void *arg) {
    // Allocate on HEAP instead
    int *data = malloc(sizeof(int));
    *data = 10;
    // Share the address
    *(int **)arg = data; // Safe!
    return NULL; // Stack destroyed, but heap survives
}

void *thread2_func(void *arg) {
    int *data = *(int **)arg;
    sleep(10); // T1 has exited by now
    printf("%d\n", *data); // ✅ Works! (heap still allocated)
    free(data); // Proper cleanup
    return NULL;
}

A process has two threads, T1 and T2. Assume that T1 opens a file as follows:

int fd = open("file.txt", O_RDONLY);

Assume that “fd” is a global variable and T2 has access to it. Can T2 call read(fd, …) to read data from “file.txt”? Justify your answer.

Instructor Solution

Yes. File descriptors are part of the process's resources. The file descriptor table is shared by all threads within the same process. Once any thread opens a file, the resulting integer ID is valid for every other thread in that process, so any thread that has access to the file descriptor can read from the file.

Elaboration (by LLM)

File Descriptor Table: Process-Wide Resource

File descriptors are stored in a kernel structure that’s shared across all threads:

Process Resources (Shared):
File Descriptor Table
FD 0: stdin
FD 1: stdout
FD 2: stderr
FD 3: file.txt ← (opened by T1)
FD 4: socket
FD 5: /dev/random
Threads:
T1: Can use FD 3 ✅
T2: Can use FD 3 ✅ (T1 opened it)
Both threads reference same kernel file entry!

Unlike stacks (thread-private) and heaps (process-shared but allocated separately), file descriptors are explicitly process-wide resources.

File Descriptor Operations Across Threads:

int fd = -1; // Global variable

void *thread1_func(void *arg) {
    // T1 opens the file
    fd = open("file.txt", O_RDONLY);
    printf("[T1] Opened file with fd=%d\n", fd);
    // Read first 100 bytes
    char buf1[100];
    ssize_t n = read(fd, buf1, 100);
    printf("[T1] Read %ld bytes\n", n);
    // File position is now at byte 100
    // This position is SHARED with T2!
    sleep(1);
    return NULL;
}

void *thread2_func(void *arg) {
    usleep(500000); // Let T1 read first (sleep() only takes whole seconds)
    // T2 can use the FD that T1 opened
    // Important: File position is SHARED!
    char buf2[50];
    ssize_t n = read(fd, buf2, 50); // ✅ Works!
    printf("[T2] Read %ld bytes from fd=%d\n", n, fd);
    printf("[T2] Got data from byte position 100 onwards\n");
    return NULL;
}

int main() {
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, thread1_func, NULL);
    pthread_create(&tid2, NULL, thread2_func, NULL);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    if (fd >= 0) close(fd);
    return 0;
}

Output:

[T1] Opened file with fd=3
[T1] Read 100 bytes
[T2] Read 50 bytes from fd=3
[T2] Got data from byte position 100 onwards

Critical Issue: Shared File Position

File descriptors don’t just refer to files—they track the current read/write position:

File: file.txt (1000 bytes)
Initially:
[0][1][2]...[100]...[200]...[1000]
Position 0
After T1 reads 100 bytes:
[0][1][2]...[100]...[200]...[1000]
Position 100 (shared!)
T2 calls read():
[0][1][2]...[100]...[150]...[1000]
Position 150 (after reading 50)

Race Condition Danger:

int fd = open("data.txt", O_RDONLY);

void *thread1_func(void *arg) {
    char buf[10];
    read(fd, buf, 10); // Reads bytes 0-9
    // Scheduling might switch to T2 here!
    printf("T1: %s\n", buf);
    return NULL;
}

void *thread2_func(void *arg) {
    char buf[10];
    read(fd, buf, 10); // Reads which bytes?
    printf("T2: %s\n", buf);
    return NULL;
}

File position is shared, so multiple threads reading concurrently can cause data loss or confusion!

Proper Synchronization:

pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;
int fd = -1;

void *thread1_func(void *arg) {
    pthread_mutex_lock(&fd_lock);
    // Only T1 is reading, file position safe
    char buf[10];
    ssize_t n = read(fd, buf, 10);
    printf("T1 read: %.*s\n", (int)n, buf);
    pthread_mutex_unlock(&fd_lock);
    return NULL;
}

void *thread2_func(void *arg) {
    pthread_mutex_lock(&fd_lock); // Wait if T1 is reading
    char buf[10];
    ssize_t n = read(fd, buf, 10);
    printf("T2 read: %.*s\n", (int)n, buf);
    pthread_mutex_unlock(&fd_lock);
    return NULL;
}

With mutex: Each read() is atomic, file position updates predictably.

Common FD Sharing Patterns:

Pattern 1: Read-Only with Positioned Reads

// Config file read by multiple threads
int config_fd = open("config.txt", O_RDONLY);

void *worker_thread(void *arg) {
    // pread() takes an explicit offset and never touches the shared
    // file position, so concurrent readers cannot race on it
    pread(config_fd, local_buf, SIZE, THREAD_OFFSET);
    return NULL;
}

(Note: lseek() followed by read() would NOT be safe here, because the seek and the read are not atomic with respect to other threads.)

Pattern 2: Append-Only (Requires Mutex)

int log_fd = open("log.txt", O_WRONLY | O_APPEND);
pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

void *thread_func(void *arg) {
    pthread_mutex_lock(&log_lock);
    write(log_fd, "message\n", 8); // O_APPEND: each write lands at EOF
    pthread_mutex_unlock(&log_lock);
    return NULL;
}

Comparison: FDs vs. Heap vs. Stack

Resource     | Shared?      | Access   | Issues
─────────────┼──────────────┼──────────┼──────────────────────────
File Descr.  | Yes (global) | Via FD   | Shared position (races!)
Heap Memory  | Yes (shared) | Via ptr  | Race conditions on data
Stack Memory | No (private) | Via addr | Dangling pointers

File descriptors are meant for sharing—but be careful with concurrent operations!

What’s the output of the following code when run in a Unix system? Assume you have 2 functions:

  • (1) CreateThread(F, arg) creates a thread that starts running at the function “F” and takes the argument “arg” and returns the id of the created thread
  • (2) JoinThread(int threadId) is used to wait for the termination of a thread
int g = 0;
int tids[2];

void T(int arg) {
    if (arg == 2) {
        g++;
        printf("[MainThread] g: %d\n", g);
    }
    if (arg == 1) {
        g++;
        printf("[Thread1] g: %d\n", g);
    }
    if (arg == 0) {
        JoinThread(tids[1]);
        g++;
        printf("[Thread0] g: %d\n", g);
    }
} // end-T

main() {
    tids[1] = CreateThread(T, 1);
    tids[0] = CreateThread(T, 0);
    JoinThread(tids[0]);
    T(2);
} // end-main
Instructor Solution

Output:

[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3

Elaboration (by LLM)

Execution Flow Analysis:

The key to understanding this is tracking the order of thread creation, execution, and synchronization:

Timeline:

main():
    tids[1] = CreateThread(T, 1);   // Create Thread1 (arg=1)
    tids[0] = CreateThread(T, 0);   // Create Thread0 (arg=0)
    JoinThread(tids[0]);            // WAIT for Thread0 to finish
    T(2);                           // Execute main's function

Concurrent Execution:

Main:    creates Thread1 and Thread0, then blocks in JoinThread(tids[0])
Thread0: calls JoinThread(tids[1]) and blocks until Thread1 terminates
Thread1: g++ → prints "[Thread1] g: 1" → terminates
Thread0: unblocks → g++ → prints "[Thread0] g: 2" → terminates
Main:    unblocks → calls T(2) → g++ → prints "[MainThread] g: 3" → returns

Detailed Execution Order:

  1. Main: Creates Thread1 with T(1)
  2. Main: Creates Thread0 with T(0)
  3. Main: Calls JoinThread(tids[0]) → Main blocks, waiting for Thread0 to finish
  4. Thread1: Starts executing T(1)
    • arg == 1 is true
    • g++ (g becomes 1)
    • printf(“[Thread1] g: 1\n”)
    • Thread1 finishes
  5. Thread0: Starts executing T(0)
    • arg == 0 is true
    • Calls JoinThread(tids[1]) → Thread0 blocks until Thread1 finishes (if Thread1 has already finished, the call returns immediately)
    • g++ (g becomes 2)
    • printf(“[Thread0] g: 2\n”)
    • Thread0 finishes
  6. Main: JoinThread returns (Thread0 finished)
    • Calls T(2)
    • arg == 2 is true
    • g++ (g becomes 3)
    • printf(“[MainThread] g: 3\n”)
    • Main finishes

Critical Insight: Shared Global Variable g

All three function invocations increment the same global variable g:

Initial: g = 0
After Thread1: g = 0 + 1 = 1
After Thread0: g = 1 + 1 = 2
After Main: g = 2 + 1 = 3

Note: This works predictably because:

  • Thread1 finishes before Thread0 even starts checking it
  • Thread0 waits for Thread1 explicitly
  • Main waits for Thread0 explicitly
  • No race conditions here (synchronization prevents them)

Key Learning Points:

  1. Thread Creation Order: Threads are created in order (Thread1 first, Thread0 second)
  2. Thread Start Order: May be different from creation order (depends on scheduler)
  3. JoinThread Synchronization: Blocks until the thread finishes
  4. Global Variable Access: All threads access the same g variable
  5. Execution Order: Determined by JoinThread calls, not creation order

Consider parallel summation: You are given an array A of N integers, and you are asked to compute the sum of the elements of this array in parallel as follows: The main thread will create 4 threads, each thread will compute the sum of N/4 of the elements of the array and terminate. The main thread will then wait for the termination of the worker threads, compute the overall sum, print it on the screen and terminate.

Assume you have 2 functions:

  • (1) CreateThread(F, arg) creates a thread that starts running at the function “F” and takes the argument “arg” and returns the id of the created thread
  • (2) JoinThread(int threadId) is used to wait for the termination of a thread

Assume that the array A is global and that N is divisible by 4.

Instructor Solution
#include <stdio.h>

#define N ...

int A[N];          // Array to be summed
int localSums[4];

void doSum(int tid) {
    int start = tid * N / 4;
    int end = start + N / 4;
    int sum = 0;
    for (int i = start; i < end; i++)
        sum += A[i];
    localSums[tid] = sum;
}

void main() {
    int tids[4];
    for (int i = 0; i < 4; i++)
        tids[i] = CreateThread(doSum, i);
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);
        sum += localSums[i];
    }
    printf("Sum: %d\n", sum);
}

Elaboration (by LLM)

Problem Decomposition:

To parallelize summation with 4 threads, divide the array into 4 equal parts:

Array A[0..N-1]:
[0][1]...[N/4-1] | [N/4]...[N/2-1] | [N/2]...[3N/4-1] | [3N/4]...[N-1]
├─ Thread 0 ────┤ ├─ Thread 1 ────┤ ├─ Thread 2 ────┤ ├─ Thread 3 ────┤
└─ Sum A[0:N/4] ─┘ └─ Sum A[N/4:N/2]└─ Sum A[N/2:3N/4]└─ Sum A[3N/4:N]─┘
→ localSums[0] → localSums[1] → localSums[2] → localSums[3]
Final Sum = localSums[0] + localSums[1] + localSums[2] + localSums[3]

Parallel Execution Model:

Main Thread:
Create 4 threads
├─ Thread 0: Sum A[0..N/4-1]
├─ Thread 1: Sum A[N/4..N/2-1]
├─ Thread 2: Sum A[N/2..3N/4-1]
└─ Thread 3: Sum A[3N/4..N-1]
(All 4 compute in parallel)
Wait for all to finish
(JoinThread for each)
Aggregate results: Sum all localSums[i]
Print final sum

Key Design Patterns:

1. Work Distribution:

Each thread gets a unique thread ID (tid = 0, 1, 2, 3):

void doSum(int tid) {
    int start = tid * N / 4;   // Where this thread starts
    int end = start + N / 4;   // Where this thread ends
    // Thread 0: start=0,    end=N/4
    // Thread 1: start=N/4,  end=N/2
    // Thread 2: start=N/2,  end=3N/4
    // Thread 3: start=3N/4, end=N
}

2. Local Computation:

Each thread computes a LOCAL sum without synchronization:

int sum = 0;
for (int i = start; i < end; i++)
    sum += A[i];
localSums[tid] = sum;   // Store result for main thread

No race conditions because:

  • Each thread writes to unique localSums[tid]
  • No shared write access during computation

3. Aggregation:

Main thread combines results AFTER all threads finish:

int sum = 0;
for (int i = 0; i < 4; i++) {
    JoinThread(tids[i]);    // Wait for thread i
    sum += localSums[i];    // Add its result
}

Example with Real Numbers:

Array: A = [1, 2, 3, 4, 5, 6, 7, 8] (N=8)
Thread Distribution:
Thread 0: A[0..1] = [1, 2] → localSums[0] = 3
Thread 1: A[2..3] = [3, 4] → localSums[1] = 7
Thread 2: A[4..5] = [5, 6] → localSums[2] = 11
Thread 3: A[6..7] = [7, 8] → localSums[3] = 15
Aggregation:
sum = 0
sum += localSums[0] = 3
sum += localSums[1] = 7 (sum = 10)
sum += localSums[2] = 11 (sum = 21)
sum += localSums[3] = 15 (sum = 36)
Final Output: Sum: 36

Performance Analysis:

Without Parallelization:

Single thread sums N elements:
Time = O(N)

With 4-Thread Parallelization:

Each thread sums N/4 elements:
Time = O(N/4) for computation
Plus O(1) for aggregation
Total = O(N/4) (speedup of ~4x on 4 cores)

Critical Sections:

This code has NO critical sections!

  • Each thread writes to different localSums[tid] (no race)
  • Main thread only reads after all threads finish (no race)
  • All synchronization explicit (JoinThread)

Why This Design is Robust:

// Good: No races, clean synchronization
int localSums[4];                 // Each thread writes a unique index

void doSum(int tid) {
    localSums[tid] = sum;         // Safe! tid is unique per thread
}

void main() {
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);      // Explicit wait
        sum += localSums[i];      // Read only after join
    }
}

// Bad: would cause race conditions
int globalSum = 0;                // Shared!

void doSum(int tid) {
    globalSum += sum;             // RACE! Unsynchronized read-modify-write
}                                 // by multiple threads loses updates

Scalability:

To parallelize across K threads (say, one per core) instead of 4:

#define THREADS K   // a separate name; reusing N would collide with the array size

int localSums[THREADS];

void doSum(int tid) {
    int start = tid * N / THREADS;
    int end = start + N / THREADS;
    // ... same logic ...
}

void main() {
    int tids[THREADS];
    for (int i = 0; i < THREADS; i++)
        tids[i] = CreateThread(doSum, i);
    int sum = 0;
    for (int i = 0; i < THREADS; i++) {
        JoinThread(tids[i]);
        sum += localSums[i];
    }
    printf("Sum: %d\n", sum);
}

The logic remains identical, just parameterized!

Write a multi-threaded program to find the minimum element of an array with two threads. The first thread should find the minimum of the first half of the array and the second thread should find the minimum of the second half. Each thread should start immediately upon being created and each must execute the same function to find the minimum. After the threads have terminated, the minimum element of the array should be printed by the main thread.

Assume that the array A is global and that N is divisible by 2.

#include <stdio.h>
#define N ...
int A[N]; // Array for which the minimum is to be computed
Instructor Solution
#include <stdio.h>
#include <limits.h>   // for INT_MAX

#define N ...

int A[N];             // Array for which the minimum is to be computed
int localMins[2];

void findMin(int tid) {
    int start = tid * N / 2;
    int end = start + N / 2;
    int min = INT_MAX;
    for (int i = start; i < end; i++)
        if (A[i] < min)
            min = A[i];
    localMins[tid] = min;
}

void main() {
    int tids[2];
    for (int i = 0; i < 2; i++)
        tids[i] = CreateThread(findMin, i);
    for (int i = 0; i < 2; i++)
        JoinThread(tids[i]);
    if (localMins[0] < localMins[1])
        printf("Min: %d\n", localMins[0]);
    else
        printf("Min: %d\n", localMins[1]);
}

Elaboration (by LLM)

Array Partitioning for Min-Finding:

Split array into 2 halves for 2 threads:

Array A[0..N-1]:
[0][1]...[N/2-1] | [N/2]...[N-1]
├─ Thread 0 ─────┤ ├─ Thread 1 ──┤
└─ Find min of ──┘ └─ Find min of─┘
first half second half
localMins[0] localMins[1]
Final Min = min(localMins[0], localMins[1])

Key Differences from Problem 20 (Summation):

Problem 20 (Sum):                  Problem 21 (Min):
- Combine step: ADD                - Combine step: COMPARE
- sum += A[i]                      - if (A[i] < min) min = A[i]
- Final: add all partial sums      - Final: min(part1, part2)
- Operation: + (associative)       - Operation: min() (associative)

Both parallelize the SAME WAY:
1. Partition the array
2. Each thread processes its partition
3. Main thread combines the results

Parallel Execution Model:

Main Thread:
Create 2 threads
├─ Thread 0: Find min in A[0..N/2-1]
└─ Thread 1: Find min in A[N/2..N-1]
(Both compute in parallel)
Wait for both to finish
(JoinThread for each)
Compare results: min(localMins[0], localMins[1])
Print overall minimum

Algorithm Details:

1. Initialization:

int min = INT_MAX; // Start with largest possible value

Why INT_MAX?

  • Any array element will be less than INT_MAX
  • Ensures first element becomes the min

2. Comparison:

for (int i = start; i < end; i++)
if (A[i] < min)
min = A[i];

Each thread finds the minimum in its range.

3. Storage:

localMins[tid] = min; // Thread tid stores its minimum

Each thread writes to unique index (no race).

4. Final Comparison:

if (localMins[0] < localMins[1])
printf("Min: %d\n", localMins[0]);
else
printf("Min: %d\n", localMins[1]);

Main thread compares the two local minima.

Example Execution:

Array: A = [5, 3, 9, 1, 7, 2, 8, 4] (N=8)
Thread Distribution:
Thread 0: A[0..3] = [5, 3, 9, 1]
min = INT_MAX
min = 5 (A[0] < INT_MAX)
min = 3 (A[1] < 5)
min = 3 (A[2]=9 not < 3)
min = 1 (A[3] < 3)
localMins[0] = 1
Thread 1: A[4..7] = [7, 2, 8, 4]
min = INT_MAX
min = 7 (A[4] < INT_MAX)
min = 2 (A[5] < 7)
min = 2 (A[6]=8 not < 2)
min = 2 (A[7]=4 not < 2)
localMins[1] = 2
Final Comparison:
localMins[0] = 1
localMins[1] = 2
1 < 2 → Print "Min: 1"

Why Use INT_MAX Instead of First Element?

// Approach 1: Using INT_MAX (current solution)
int min = INT_MAX;
for (int i = start; i < end; i++)
    if (A[i] < min)
        min = A[i];

Advantages:
✅ Works even if the first element is the largest
✅ Cleaner logic
✅ No special-case handling

// Approach 2: Using the first element
int min = A[start];
for (int i = start + 1; i < end; i++)
    if (A[i] < min)
        min = A[i];

Disadvantages:
❌ Slightly more complex (must skip the first element)
❌ Requires special handling if a thread's range could be empty

Generalization to K Threads:

For K threads finding minimum:

#define K 4   // Number of threads

int localMins[K];

void findMin(int tid) {
    int start = tid * N / K;
    int end = start + N / K;
    int min = INT_MAX;
    for (int i = start; i < end; i++)
        if (A[i] < min)
            min = A[i];
    localMins[tid] = min;
}

void main() {
    int tids[K];
    for (int i = 0; i < K; i++)
        tids[i] = CreateThread(findMin, i);
    for (int i = 0; i < K; i++)
        JoinThread(tids[i]);

    // Find minimum of all local minima
    int globalMin = INT_MAX;
    for (int i = 0; i < K; i++)
        if (localMins[i] < globalMin)
            globalMin = localMins[i];
    printf("Min: %d\n", globalMin);
}

Time Complexity Analysis:

Sequential (1 thread):

Time = O(N) - iterate through all N elements

Parallel (2 threads on 2 cores):

Each thread: O(N/2)
Total: O(N/2) - speedup of 2x

Parallel (K threads on K cores):

Each thread: O(N/K)
Total: O(N/K) - speedup of Kx

(Ignoring aggregation time O(K), which is negligible for large N)

Critical Pattern:

This is the Fork-Join Pattern:

1. FORK: Create multiple threads
2. WORK: Each thread processes independently
3. JOIN: Wait for all threads
4. MERGE: Combine results

This pattern applies to ANY associative operation:

  • Addition (Problem 20)
  • Minimum (Problem 21)
  • Maximum (variant)
  • Multiplication (variant)
  • Logical operations (AND, OR)

The key requirement: Operation must be associative

Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
Examples:
min(min(5, 3), 9) = min(5, min(3, 9)) ✅
sum(sum(5, 3), 9) = sum(5, sum(3, 9)) ✅
max(max(5, 3), 9) = max(5, max(3, 9)) ✅

You are given an unsorted array and a “key”, and you are asked to find the first index where the key is stored, in parallel, as follows: The main thread will create 4 threads; each thread will search one quarter of the elements and terminate. The main thread will then wait for the termination of the worker threads, compute the overall search result, print it on the screen and terminate. If the key does not exist in the array, then print “Key does not exist in the array”.

Assume you have 2 functions:

  • (1) CreateThread(F, arg) creates a thread that starts running at the function “F” and takes the argument “arg” and returns the id of the created thread
  • (2) JoinThread(int threadId) is used to wait for the termination of a thread

Assume that the array A and “key” are both global and that N is divisible by 4.

#include <stdio.h>
#define N ...
int A[N]; // Array to be searched
int key; // Key to be searched
Instructor Solution
#include <stdio.h>

#define N ...

int A[N];   // Array to be searched
int key;    // Key to be searched
int localResult[4] = {N, N, N, N};   // note: {N} alone would set only
                                     // element 0 to N and zero the rest

void doFind(int tid) {
    int start = tid * N / 4;
    int end = start + N / 4;
    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            localResult[tid] = i;
            return;
        }
    }
}

void main() {
    int tids[4];
    for (int i = 0; i < 4; i++)
        tids[i] = CreateThread(doFind, i);
    int idx = N;
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);
        if (localResult[i] < idx)
            idx = localResult[i];
    }
    if (idx == N)
        printf("Key does not exist in the array.\n");
    else
        printf("Key exists at index: %d\n", idx);
}

Elaboration (by LLM)

Parallel Search Strategy:

Unlike summation and min-finding (which process all elements), search looks for FIRST occurrence:

Array A[0..N-1] and key=7:
[3][1][7][5][2][7][9][4]
Thread Distribution:
Thread 0: A[0..1] = [3, 1] → localResult[0] = N (not found)
Thread 1: A[2..3] = [7, 5] → localResult[1] = 2 (found at index 2!)
Thread 2: A[4..5] = [2, 7] → localResult[2] = 5 (found at index 5)
Thread 3: A[6..7] = [9, 4] → localResult[3] = N (not found)
Result Aggregation:
Overall minimum index = min(N, 2, 5, N) = 2
Output: "Key exists at index: 2" (first occurrence)

Key Design Decisions:

1. Initialization to N (not-found sentinel):

int localResult[4] = {N, N, N, N};   // every slot starts at N

(Careful: in C, = {N} initializes only element 0 to N and zero-fills the rest, and 0 is a valid index, so every slot must be set to N explicitly.)

Why N?

  • N is the array size
  • No valid index equals N (indices run 0 to N-1)
  • Represents “not found” for this thread’s range
  • Easy to check: if (idx == N) → not found

2. Early Exit on Success:

for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        return;   // Early exit! Don't search the rest of the range
    }
}

Once found:

  • Store the index
  • RETURN immediately (don’t continue searching)
  • This is efficient: once we find it, no need to search further

3. Finding the FIRST occurrence:

Since we want the first index:

int idx = N;
for (int i = 0; i < 4; i++) {
    JoinThread(tids[i]);
    if (localResult[i] < idx)
        idx = localResult[i];   // Keep the minimum index
}

Take the MINIMUM index over all threads:

  • The first occurrence of the key has the smallest index, so the minimum wins
  • A hit in an earlier partition overrides hits at larger indices
  • Threads that didn’t find the key keep localResult = N, which never wins the comparison

Example Detailed Execution:

Array: A = [10, 20, 30, 7, 40, 7, 50, 7] (N=8)
Key: 7
Initial: localResult = [8, 8, 8, 8]
Thread 0: A[0..1] = [10, 20]
i=0: A[0]=10 != 7
i=1: A[1]=20 != 7
(loop ends, localResult[0] stays 8)
localResult = [8, 8, 8, 8]
Thread 1: A[2..3] = [30, 7]
i=2: A[2]=30 != 7
i=3: A[3]=7 == 7 ✅
localResult[1] = 3
return (early exit!)
localResult = [8, 3, 8, 8]
Thread 2: A[4..5] = [40, 7]
i=4: A[4]=40 != 7
i=5: A[5]=7 == 7 ✅
localResult[2] = 5
return (early exit!)
localResult = [8, 3, 8, 8]
Thread 3: A[6..7] = [50, 7]
i=6: A[6]=50 != 7
i=7: A[7]=7 == 7 ✅
localResult[3] = 7
return (early exit!)
localResult = [8, 3, 5, 7]
Aggregation (after all joins):
idx = 8
localResult[0]=8: 8 < 8? No
localResult[1]=3: 3 < 8? Yes → idx = 3
localResult[2]=5: 5 < 3? No
localResult[3]=7: 7 < 3? No
Final: idx = 3
Output: "Key exists at index: 3"

Why No Race Conditions?

// Thread 0 writes to localResult[0]
// Thread 1 writes to localResult[1]
// etc.
// NO overlapping writes → NO race conditions
// Each thread has unique localResult index

Each thread writes to a different element of the localResult array!

Early Exit Pattern:

This is a key optimization for search:

// Flawed: keeps scanning after a hit
for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        // Keeps searching: wasted work, and later hits overwrite
        // earlier ones, so this stores the LAST occurrence in the range!
    }
}

// Correct and efficient: stop on the first hit
for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        return;   // Early exit
    }
}

The return statement:

  • Exits the function immediately
  • Thread terminates
  • Other threads may still be searching
  • Main thread waits for all with JoinThread

Handling “Not Found” Case:

// All threads searched their range
// No thread found the key
localResult = [N, N, N, N]   (all still the initial sentinel)

Aggregation:
    idx = N
    for (i = 0 to 3):
        localResult[i] < idx?  No (all equal N)
    idx stays N

Output:
    if (idx == N)
        printf("Key does not exist in the array.\n");

Comparison with Sequential Search:

Sequential (1 thread):
    for (int i = 0; i < N; i++)
        if (A[i] == key) return i;
    Worst case: O(N) - key at the end or not present

Parallel (4 threads):
    Best case:  the key sits near the start of some thread's range
    Worst case: O(N/4) - every thread scans its full range
    Either way, no thread examines more than N/4 elements

Generalization to K Threads:

#define K 4

int localResult[K];   // set every entry to N before creating the threads

void doFind(int tid) {
    int start = tid * N / K;
    int end = start + N / K;
    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            localResult[tid] = i;
            return;
        }
    }
}

void main() {
    int tids[K];
    for (int i = 0; i < K; i++) {
        localResult[i] = N;   // sentinel ({N} would zero elements 1..K-1)
        tids[i] = CreateThread(doFind, i);
    }
    int idx = N;
    for (int i = 0; i < K; i++) {
        JoinThread(tids[i]);
        if (localResult[i] < idx)
            idx = localResult[i];
    }
    if (idx == N)
        printf("Key does not exist in the array.\n");
    else
        printf("Key exists at index: %d\n", idx);
}

The logic scales to any number of threads!

Pattern Summary: Fork-Join-Merge

All three parallel problems (summation, min-finding, search) follow the same pattern:

1. FORK: Create K threads
2. WORK: Each processes partition independently
3. JOIN: Wait for all to complete
4. MERGE: Combine results
Applied to different operations:
├─ Sum: merge = ADD all results
├─ Min: merge = COMPARE all results
└─ Search: merge = FIND MINIMUM index

This pattern is universal for embarrassingly parallel problems!