
Threads and Concurrency Problems

Briefly explain how a thread differs from a process.

Instructor Solution

A process is an independent container with its own address space storing code, static variables, a heap for dynamically allocated memory, and other resources such as open files, sockets, and semaphores. A thread is a unit of execution that exists inside a process. Each process starts with one thread but can have any number of threads. Each thread has its own stack for local variables but shares all other resources of the process with the process's other threads. While communication between processes requires IPC (shared memory, pipes, message queues), threads can communicate directly via shared variables.

Elaboration (by LLM)

Process as a Container:

A process is a heavyweight abstraction that the OS uses to isolate program execution. Each process has:

  • Its own memory address space (code, data, heap)
  • Its own file descriptor table
  • Its own process ID (PID)
  • Complete isolation from other processes

Thread as Execution Within a Container:

A thread is a lightweight execution unit that exists within a process. All threads in the same process:

  • Share the same memory address space
  • Can directly access each other’s data without IPC
  • Have individual execution states (program counter, registers)
  • Are scheduled independently by the OS

Resource Comparison Table:

Resource          Process   Thread (in same process)
Address space     Own       Shared
File descriptors  Own       Shared
Heap              Own       Shared
Stack             Own       Own
Registers         Own       Own
Program counter   Own       Own

Cost Implications:

  • Process creation: Expensive (allocate memory, set up page tables, initialize resources)
  • Thread creation: Cheap (just need a stack and TCB)
  • Context switch between processes: Slow (flush TLB, switch memory maps)
  • Context switch between threads: Fast (same memory, just switch registers and stack)

Communication Overhead:

  • Between processes: Requires explicit IPC (pipes, sockets, shared memory) with kernel involvement
  • Between threads: Direct memory access, just like function calls within a program

What state (information) does a thread share with other threads in the same process and what state (information) is private/specific to a thread? Be explicit in your answer.

Instructor Solution

Shared State:

All threads in the same process share the same address space and system resources:

  • Code Segment: The actual machine instructions.
  • Data Segment: Global and static variables.
  • Heap: Dynamically allocated memory (via malloc or new).
  • OS Resources: Open file descriptors, sockets, signal handlers, and semaphores.

Private State:

To execute its own flow of control, each thread maintains its own execution context:

  • Stack: Stores local variables, function parameters, and return addresses.
  • Execution Context: Program Counter, Stack Pointer, general purpose registers.
  • Thread ID (TID): A unique identifier for the thread.

Elaboration (by LLM)

Visual Memory Layout:

┌────────────────────────────┐
│ Code Segment (shared)      │ ← All threads execute same code
├────────────────────────────┤
│ Data/Global (shared)       │ ← All threads access same globals
├────────────────────────────┤
│ Heap (shared)              │ ← malloc/new allocations
├────────────────────────────┤
│ Stack (Thread 1 - private) │ ← Local vars of thread 1
├────────────────────────────┤
│ Stack (Thread 2 - private) │ ← Local vars of thread 2
├────────────────────────────┤
│ Stack (Thread 3 - private) │ ← Local vars of thread 3
└────────────────────────────┘

Why Each Thread Needs Its Own Stack:

  • Function calls need their own stack frames
  • Local variables must be independent (otherwise threads interfere)
  • Return addresses must be tracked separately per thread
  • Each thread has its own execution path through the same code

Why Shared Heap is Dangerous:

When threads share the heap:

int *ptr = malloc(sizeof(int));
*ptr = 42; // Thread 1
printf("%d", *ptr); // Thread 2 - might see 42 or a different value

Race conditions occur when multiple threads access the same memory location simultaneously. Synchronization (mutexes, locks) is required.

Why File Descriptors are Shared:

int fd = open("file.txt", O_RDONLY);
// Thread 1
read(fd, buf1, 100); // Reads from current file position
// Thread 2 (simultaneously)
read(fd, buf2, 100); // Reads from updated file position

Both threads affect the same file offset, requiring careful synchronization.

What resources does the OS allocate for a thread during thread creation? How do they differ from those allocated when a process is created?

Instructor Solution

The OS allocates only the bare essentials for a new thread:

  • A Private Stack: For local variables, function parameters, and return addresses.
  • Thread Control Block (TCB): To store register states (including the Program Counter and Stack Pointer), priority, and Thread ID.

Process creation is far more resource-intensive because the OS must establish a new container:

  • Address Space: A process gets a brand new, isolated memory space with code, data and heap; a thread simply uses the parent’s existing map.
  • Resource Duplication: Processes get their own copies of file descriptor tables; threads share these globally with all other threads in the process.

Elaboration (by LLM)

Thread Creation Overhead:

When creating a new thread, the OS:

  1. Allocates a stack (typically 1-8 MB)
  2. Creates a TCB storing:
    • Thread ID
    • Program Counter (initially points to thread function)
    • Stack Pointer (points to new stack)
    • Register values
    • Scheduling priority
    • State (ready, running, blocked)
  3. Adds thread to the process’s thread list
  4. Total time: microseconds to milliseconds

Process Creation Overhead:

When creating a new process, the OS must:

  1. Allocate new memory (separate address space)
  2. Copy or set up page tables
  3. Initialize code segment
  4. Initialize data segment
  5. Create empty heap
  6. Duplicate file descriptor table
  7. Create process control block (PCB)
  8. Initialize signal handlers
  9. Set up working directory
  10. Total time: milliseconds to seconds

Cost Comparison:

Thread creation: ~1 microsecond
Process creation: ~1-10 milliseconds
Ratio: 1000-10000x slower for process creation

Memory Overhead Comparison:

Thread:
  Stack: 1-8 MB
  TCB:   ~1 KB
  Total: ~1-8 MB per thread

Process:
  Address space: varies (minimal ~10 MB)
  Page tables: ~100s of KB
  File descriptor table: 1-10 KB
  PCB: ~1 KB
  Total: ~10-100+ MB per process

Why Threads Share Address Space:

Threads don’t need separate address spaces because:

  • They execute the same code
  • They benefit from shared data structures
  • Isolation isn’t required (all threads are part of same application)
  • Memory is precious on embedded systems

Why Processes Have Separate Address Spaces:

Processes need isolation because:

  • Different programs shouldn’t crash each other
  • Security: one program shouldn’t access another’s data
  • Containment: buffer overflows in one process don’t affect others
  • The cost of isolation is worth the protection gained

What are the advantages of multi-threading? Give an example where multi-threading is preferable to a single-threaded process.

Instructor Solution

Advantages of Multi-Threading:

  • Responsiveness: Keeps an application interactive by running background tasks (like saving a file) without freezing the User Interface.
  • Resource Sharing: Threads share memory and resources by default, making communication faster and more efficient than Inter-Process Communication (IPC).
  • Economy: Thread creation and context switching are significantly “cheaper” (faster and use less RAM) than process creation.
  • Scalability: Allows a single application to utilize multiple CPU cores effectively by executing tasks in parallel.

Example: Web Browser

Multi-threading is especially preferable for I/O-bound applications with a graphical user interface, such as a web browser. In a single-threaded browser, if you started downloading a large file, the entire window would freeze and become unresponsive until the download finished.

In a multi-threaded browser:

  1. Thread 1 handles user input (scrolling, clicking)
  2. Thread 2 handles the network download
  3. Thread 3 renders the images and text on the screen

Elaboration (by LLM)

Responsiveness in Depth:

Consider a text editor:

Single-threaded:
  User types 'H' → System searches dictionary → User waits 5 seconds
  (Everything freezes until search completes)

Multi-threaded:
  Thread 1: User types 'H' → Character appears immediately
  Thread 2: (concurrently) Searches dictionary in background
  (User never experiences freezing)

Resource Sharing Benefit:

Single-threaded process approach (if possible):

Process 1: GUI
Process 2: Download
Process 3: Rendering
Problem: Need to share downloaded data → requires IPC
Solution: Slow pipes, sockets, or shared memory

Multi-threaded approach:

Shared memory:
┌───────────────┐
│ Downloaded │
│ HTML buffer │ ← All threads access directly
└───────────────┘
Thread 1: Reads from network → writes to buffer
Thread 2: Reads from buffer → renders HTML
Thread 3: Handles user input

CPU Utilization (Single vs Multi-core):

Single-core processor:

  • Only one thread runs at a time
  • Multi-threading still helps (while one thread waits for I/O, another runs)

Multi-core processor:

  • Multiple threads run simultaneously
  • True parallelism: download + rendering at exact same time
  • Perfect for CPU-bound and I/O-bound tasks

Real-World Browser Example:

Google Chrome threads:
- Main thread: handles UI events
- IO thread: network operations
- Worker threads: JavaScript execution, image decoding
- Renderer threads: HTML/CSS rendering

Comparison: Single-threaded vs Multi-threaded Download:

Single-threaded (sequential):
  Download chunk 1 (100ms)
  Download chunk 2 (100ms)
  Download chunk 3 (100ms)
  Render HTML (200ms)
  Handle input (0ms - frozen until done)
  Total: 500ms frozen

Multi-threaded (concurrent):
  While downloading, rendering happens
  While downloading, input is processed
  User sees the progress bar updating and can scroll
  Total: user sees instant feedback

When Single-Threading is Sufficient:

  • Batch processing with no user interaction
  • CPU-bound computation (single processor anyway)
  • Simple synchronous operations

When Multi-Threading is Essential:

  • User interfaces (event handlers)
  • Network servers (handling multiple clients)
  • Real-time systems (logging, monitoring)
  • Any application where responsiveness matters

List 1 item that is shared by all threads, and 1 item that is private to each thread.

Instructor Solution

Shared: Code, global variables, heap

Private: Stack, Program Counter, Stack Pointer

Elaboration (by LLM)

Why Code is Shared:

void *printThread(void *arg) {
    printf("Thread %d\n", *(int *)arg);
    return NULL;
}

// Both Thread 1 and Thread 2 execute the SAME function code
int id1 = 1, id2 = 2;
pthread_create(&thread1, NULL, printThread, &id1);
pthread_create(&thread2, NULL, printThread, &id2);

Both threads execute identical machine instructions from the code segment. Sharing the code segment saves memory and makes sense since the code doesn’t change.

Why Global Variables are Shared:

int counter = 0;   // global variable

void increment() {
    counter++;     // all threads modify the SAME counter
}

// Thread 1
increment();       // counter = 1
// Thread 2 (simultaneously)
increment();       // counter = ?

The address of counter is the same for all threads. Without synchronization, this causes race conditions.

Why Heap is Shared:

int *ptr = malloc(sizeof(int));
*ptr = 42;
// Pass to another thread
Thread2UpdateValue(ptr); // Thread 2 modifies same memory
printf("%d", *ptr); // Thread 1 sees modified value

Heap allocations use addresses that are valid in the shared address space. All threads can access the same allocated memory.

Why Stack is Private:

void function() {
    int local_var = 10;  // on THIS thread's stack
    // Thread 1's local_var at address 0x7FFF1000
    // Thread 2's local_var at address 0x7FFF2000
}

// Thread 1
function();  // local_var at 0x7FFF1000
// Thread 2
function();  // local_var at 0x7FFF2000 (different address!)

Each thread needs independent local variables, so each gets its own stack with separate memory locations.

Why Program Counter is Private:

Thread 1:            Thread 2:
line 10: x = 5       line 10: x = 5
line 11: y = x + 1   line 15: z = 100
line 12: ...         line 16: ...

Each thread is executing different lines of code (or same line at different times). The program counter tracks where each thread is.

Why Stack Pointer is Private:

Thread 1:              Thread 2:
SP: 0x7FFF1000         SP: 0x7FFF2000
(points to own         (points to own
 stack frame)           stack frame)

Each thread’s stack grows independently, so each needs its own stack pointer.

Memory Layout Example:

╔═════════════════════════════╗
║ Code (SHARED)               ║ ← function() at 0x4000
║ Global Data (SHARED)        ║ ← counter at 0x6000
║ Heap (SHARED)               ║ ← malloc() results
║                             ║
║ Thread 1 Stack (PRIVATE)    ║ ← SP1: 0x7FFF1000
║   local_var_T1 at 0x7FFF0F0 ║
║   param_T1 at 0x7FFF0F8     ║
║                             ║
║ Thread 2 Stack (PRIVATE)    ║ ← SP2: 0x7FFF2000
║   local_var_T2 at 0x7FFF1F0 ║
║   param_T2 at 0x7FFF1F8     ║
╚═════════════════════════════╝

Implications for Programming:

  • Shared data needs locks: counter++ must be atomic
  • Passing stack variables is dangerous: Passing address of local variable to another thread risks use-after-free
  • Passing heap pointers is safe: Heap data persists as long as not freed
  • Global state is a liability: Avoid globals; use thread-local storage if needed

What’s the purpose of pthread_detach() function?

Instructor Solution

The purpose of pthread_detach() is to tell the operating system that a thread’s resources (like its stack and Thread Control Block) should be automatically reclaimed immediately upon its termination. Normally, a thread is “joinable,” meaning it stays in a “zombie” state after finishing until another thread calls pthread_join() to collect its exit status.

Using pthread_detach():

  • Prevents Resource Leaks: It ensures the system doesn’t run out of memory if you create many threads but don’t intend to “join” them.
  • Enables “Fire and Forget”: It allows the thread to run independently without the parent process needing to monitor its completion.

Elaboration (by LLM)

Thread Lifecycle:

Created → Runnable → Running → Blocked → Exiting → Zombie → Reaped
                                                   └─ still consuming resources!

Joinable Thread (Default):

pthread_t tid;
pthread_create(&tid, NULL, thread_function, NULL);
// ... thread runs and calls pthread_exit()
// Thread is now in ZOMBIE state, waiting to be joined
pthread_join(tid, NULL); // Reap the thread, free resources

Problems with Always Joining:

for (int i = 0; i < 1000; i++) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    // BUG: we create 1000 threads but never join them!
    // All 1000 threads become zombies, consuming memory
}

Detached Thread:

pthread_t tid;
pthread_create(&tid, NULL, thread_function, NULL);
pthread_detach(tid); // Mark as detached
// ... thread runs and calls pthread_exit()
// Resources are automatically freed - no zombie state!

When to Use Detached Threads:

  1. Worker Thread Pool:

     for (int i = 0; i < 100; i++) {
         pthread_t tid;
         pthread_create(&tid, NULL, worker_thread, NULL);
         pthread_detach(tid);  // fire and forget
     }

  2. Background Tasks:

     void save_file_async(const char *filename) {
         pthread_t tid;
         pthread_create(&tid, NULL, save_file_worker, (void *)filename);
         pthread_detach(tid);  // don't wait for completion
     }

  3. Server Handling Multiple Clients:

     while (accept_connection(&client_socket)) {
         pthread_t tid;
         pthread_create(&tid, NULL, handle_client, (void *)(intptr_t)client_socket);
         pthread_detach(tid);  // handle each client independently
     }

When to Use Joinable Threads:

  1. When you need the return value:

     void *result;
     pthread_join(tid, &result);  // get the thread's return value
     printf("Thread returned: %ld\n", (long)result);

  2. When you need to wait for completion:

     pthread_create(&tid, NULL, computation_thread, NULL);
     // ... do other work ...
     pthread_join(tid, NULL);  // wait for computation to finish
     printf("Computation done!\n");

  3. Small, bounded number of threads:

     int num_threads = 4;
     pthread_t tids[4];
     for (int i = 0; i < num_threads; i++) {
         pthread_create(&tids[i], NULL, worker, NULL);
     }
     for (int i = 0; i < num_threads; i++) {
         pthread_join(tids[i], NULL);  // wait for all workers
     }

Resource Leak Example:

// WRONG - resource leak!
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    // thread becomes a zombie when it exits
    // resources never freed - leak!
}

// CORRECT - detached thread
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_detach(tid);
    // resources auto-freed when thread exits
}

// ALSO CORRECT - joinable with explicit join
void create_worker() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_join(tid, NULL);
    // resources freed after join
}

Key Differences:

Aspect            Joinable                   Detached
Return value      Can get it                 Lost
Synchronization   Explicit join              None
Resource cleanup  Manual (join)              Automatic
When to use       Few threads, need results  Worker pools, background tasks
Overhead          Stays in zombie state      Clean exit

Problem 7: Multi-Threaded Process Termination


List 2 conditions under which a multi-threaded process terminates.

Instructor Solution

A multi-threaded process terminates under any of the following conditions:

(1) A thread calls exit(): If any thread calls the exit() system call, the entire process and all its associated threads are immediately destroyed by the kernel.

(2) Last Thread Exits: The process naturally terminates when the last remaining thread calls pthread_exit().

(3) Fatal Signal: If any single thread performs an illegal operation (like a Segmentation Fault) that results in a fatal signal, the operating system terminates the entire process.

Elaboration (by LLM)

Key Insight: Process Termination is Process-Wide:

Unlike individual thread termination, when a process terminates, ALL threads die instantly, regardless of their state.

Condition 1: exit() Immediately Kills All Threads:

void *thread_func(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        printf("Thread working...\n");
        sleep(1);
    }
    pthread_exit(NULL);  // gentle exit
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, NULL);
    sleep(2);
    exit(0);   // KILLS the thread immediately, even mid-loop!
               // Process ends; thread gets no cleanup chance
    return 0;  // never reached
}

Why exit() is Dangerous:

  • Bypasses thread cleanup
  • Doesn’t call destructors (in C++)
  • Closes file descriptors immediately
  • Could corrupt files if thread was writing
  • Doesn’t allow graceful shutdown

Condition 2: Last Thread Exiting (Graceful Termination):

void *worker(void *arg) {
    printf("Thread starting\n");
    sleep(2);
    printf("Thread ending\n");
    pthread_exit(NULL);  // graceful exit
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    printf("Main waiting...\n");
    pthread_join(tid, NULL);  // wait for thread
    printf("Thread done, main exiting\n");
    return 0;  // process terminates
}

Timeline:

0ms: Main creates thread
0ms: Main prints "Main waiting..."
0ms: Thread starts, prints "Thread starting"
2000ms: Thread exits, prints "Thread ending"
2000ms: Main resumes from join()
2000ms: Main prints "Thread done, main exiting"
2000ms: Process terminates (last thread exited)

Condition 3: Fatal Signal (Crash):

void *bad_thread(void *arg) {
    int *ptr = NULL;
    *ptr = 42;    // segmentation fault!
    return NULL;  // never reached
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, bad_thread, NULL);
    pthread_join(tid, NULL);
    return 0;
}

When the thread dereferences NULL:

  • OS sends SIGSEGV signal to process
  • Default handler terminates entire process
  • All threads die immediately
  • Process never reaches pthread_join()

Comparing Termination Methods:

// METHOD 1: Graceful (main returns normally)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);  // wait for worker
    return 0;                 // clean exit
}

// METHOD 2: Abrupt (main calls exit)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);
    exit(0);  // KILLS process immediately
}

// METHOD 3: Crash (signal)
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, bad_thread, NULL);
    sleep(1);
    // if bad_thread crashes, the entire process dies
}

Best Practices:

  1. Always join threads before exiting:

     int main() {
         pthread_t tids[10];
         for (int i = 0; i < 10; i++) {
             pthread_create(&tids[i], NULL, worker, NULL);
         }
         for (int i = 0; i < 10; i++) {
             pthread_join(tids[i], NULL);  // IMPORTANT!
         }
         return 0;  // clean exit
     }

  2. Or use pthread_exit in main:

     int main() {
         pthread_t tid;
         pthread_create(&tid, NULL, worker, NULL);
         // main's thread exits, but the process lives on
         pthread_exit(NULL);  // process ends when all threads finish
     }

  3. Avoid calling exit() in threads:

     // WRONG
     void *thread_func(void *arg) {
         // do work
         exit(0);  // KILLS entire process!
     }

     // RIGHT
     void *thread_func(void *arg) {
         // do work
         pthread_exit(NULL);  // only this thread exits
     }

Define concurrency and parallelism and briefly explain the difference between the two.

Instructor Solution

Concurrency is the art of managing multiple tasks at once by interleaving their execution, essentially “dealing with” many things at the same time on a single or multiple cores.

Parallelism is the act of physically executing multiple tasks simultaneously, which requires hardware with multiple processing units (like a multi-core CPU).

The core difference lies in execution: concurrency is about structural design and multitasking (making progress on Task A, then Task B), while parallelism is about simultaneous execution (Task A and Task B running at the exact same millisecond).

Elaboration (by LLM)

Concurrency: Illusion of Parallelism

Concurrency is a software concept—the arrangement of code to handle multiple tasks. It does NOT require multiple cores.

Single-Core Concurrency Example:

Time    CPU Core 1
──────────────────
0ms     Task A [----]
4ms     Task B [----]
8ms     Task A [----]
12ms    Task C [----]
16ms    Task B [----]
20ms    Done

Result: All tasks "appear" to run at once, but only one runs at a time.
The OS rapidly switches between them.

Parallelism: True Simultaneous Execution

Parallelism is a hardware reality—multiple cores literally running code at the same instant. It REQUIRES multiple cores.

Dual-Core Parallelism Example:

Time    CPU Core 1      CPU Core 2
──────────────────────────────────
0ms     Task A [----]   Task B [----]
4ms     Task A [----]   Task B [----]
8ms     Task A [----]   Task B [----]
12ms    Task C [----]   Task B [----]

Result: Tasks A and B run simultaneously.
Core 1 switches between A and C; Core 2 runs B uninterrupted.

Key Differences Table:

Aspect               Concurrency                    Parallelism
Definition           Structural arrangement         Physical execution
Requirements         Software only                  Requires multiple cores
CPU cores needed     1+                             2+
Context switching    Yes                            May have switching too
When speedup occurs  When waiting for I/O           Always (CPU-bound work)
Example              Web server handling requests   Matrix multiplication on 4 cores

Visual Comparison:

Single-core Concurrency:
[A][B][C][A][B][C] ← One core time-slicing
Sequential at any instant, but concurrent across time
Multi-core Parallelism:
Core 1: [A][A][C][C]
Core 2: [B][B][B][B]
Simultaneous execution at same instant

Why Concurrency Helps Even Without Parallelism:

// Single-threaded (sequential) I/O
void fetch_websites() {
    download_site_a();  // 1000ms (blocked waiting for network)
    download_site_b();  // 1000ms (blocked waiting for network)
    download_site_c();  // 1000ms (blocked waiting for network)
    // Total: 3000ms
}

// Concurrent (multi-threaded) I/O -- pseudocode
void fetch_websites_concurrent() {
    // Thread A: download_site_a()  // 1000ms
    // Thread B: download_site_b()  // 1000ms (overlapped)
    // Thread C: download_site_c()  // 1000ms (overlapped)
    // Total: ~1000ms (the waits overlap)
}

Even on single core:

  • While Thread A waits for network, CPU switches to Thread B
  • While Thread B waits, CPU switches to Thread C
  • All three downloads overlap in time

Why Parallelism Requires Multiple Cores:

You cannot achieve true parallelism on a single core because:

  • Only one instruction sequence can execute at any instant
  • Context switching creates the illusion, but still sequential
  • CPU has only one execution pipeline per core

Real-World Example: Web Server

Concurrency without parallelism (single core):

Client 1: [request] [wait] [respond]
Client 2: [request] [wait] [respond]
Client 3: [request] [respond]
All clients "feel" like server is responsive
Server is actually context-switching rapidly

Parallelism (multi-core):

Core 1: Client 1: [request] [wait] [respond]
Core 2: Client 2: [request] [wait] [respond]
Core 3: Client 3: [request] [wait] [respond]
All clients running truly in parallel
No time-sharing needed

Why Both Matter:

  • Concurrency helps with responsiveness and I/O-bound work
  • Parallelism helps with pure computation on multiple cores
  • Multi-threaded app on multi-core: Best of both worlds!

Problem 9: Concurrency Without Multiple Cores


If you have a concurrent system, i.e., multi-tasking or multi-threading OS, does the underlying hardware have to have multiple CPU cores? Why or why not?

Instructor Solution

No, the underlying hardware does not need multiple CPU cores to support a concurrent system. Concurrency is a software property managed by the operating system through context switching and time-slicing, where the CPU rapidly alternates between different threads or tasks. By giving each process a few milliseconds of execution time before switching to the next, the OS creates the illusion that multiple tasks are running simultaneously, even if only one instruction is being physically processed at any given moment on a single core.

Elaboration (by LLM)

Time-Slicing: The Core of Concurrency

On a single core, the OS divides CPU time into small chunks (time slices or time quanta, typically 10-100ms).

Timeline Example: Single Core Running 3 Threads

Time      Thread A   Thread B   Thread C
0-10ms    [10ms]
10-20ms              [10ms]
20-30ms                         [10ms]
30-40ms   [10ms]
40-50ms              [10ms]
50-60ms                         [10ms]
60-70ms   [10ms]
70-80ms              [10ms]
80-90ms                         [10ms]

Each thread gets a turn to run for 10ms, then the OS saves its state and switches to the next thread.

Why This Works:

  1. Save State: Program counter, registers, stack pointer
  2. Switch: OS scheduler picks next thread
  3. Restore State: New thread resumes from where it left off
  4. Repeat: Rapid cycling creates illusion of parallelism

Context Switch Overhead:

3 threads × 10ms per thread = 30ms per full cycle
The OS switches threads ~100 times per second
User doesn't notice the switching; all 3 threads appear to run "simultaneously"

Why Multiple Cores Would Help (But Aren’t Required):

Single core (sequential time-slicing):
Thread A: [exec 10ms] [wait] [exec 10ms]
Thread B: [exec 10ms] [wait] [exec 10ms]
Thread C: [exec 10ms] [exec 10ms]
Result: All threads make progress, but not truly parallel
Multi-core (true parallelism):
Core 1 - Thread A: [exec continuous]
Core 2 - Thread B: [exec continuous]
Core 3 - Thread C: [exec continuous]
Result: All threads run simultaneously, faster execution

I/O-Bound Concurrency on Single Core:

Even with single core, concurrency excels with I/O:

Thread A (network request):
[execute code] → [network I/O - BLOCKS] → [process response]
Thread B (disk read):
[execute code] → [disk I/O - BLOCKS] → [process result]
Timeline:
0ms: A executes code
5ms: A calls read() for network → BLOCKS
5ms: OS switches to B
8ms: B executes code
10ms: B calls read() for disk → BLOCKS
10ms: OS switches back to A
100ms: Network response arrives, A unblocks
100ms: OS switches to A
105ms: A processes network data
110ms: A does more work...
200ms: Disk response arrives, B unblocks
200ms: OS switches to B
205ms: B processes disk data

Both I/O operations happen concurrently (overlapped in time) even though only one core is doing computation.

Historical Context:

Early computers (1970s-2000s) had single-core processors, yet supported:

  • Multi-tasking (multiple programs)
  • Concurrency (multiple threads within programs)
  • Responsiveness (background tasks while user interacts)

All achieved through context switching, not parallelism!

The Takeaway:

Concurrency = Software design property (can work with 1 core)
Parallelism = Hardware feature (requires 2+ cores)
A single core can support concurrent code
A single core cannot support parallel execution

When Single-Core Concurrency Works Well:

  • I/O-bound applications (network, disk, user input)
  • Applications where responsiveness matters more than speed
  • Systems with limited hardware resources (embedded systems)

When Single-Core Concurrency Falls Short:

  • CPU-bound computation (heavy math, processing)
  • Real-time systems needing deterministic performance
  • Applications that need true speedup

Problem 10: Parallelism and Hardware Requirements


To achieve parallelism, does the underlying hardware have to have multiple CPU cores? Why or why not?

Instructor Solution

Yes, to achieve true parallelism, the hardware must have multiple CPU cores (or multiple processors). Unlike concurrency, which can be faked on a single core through rapid switching, parallelism requires the physical ability to execute two or more instructions at the exact same moment. Without at least two independent processing units, tasks cannot run simultaneously; they can only ever be interleaved.

Elaboration (by LLM)

The Fundamental Requirement: Multiple Execution Pipelines

A single CPU core has one instruction pipeline. It can only execute one instruction sequence at any given nanosecond.

Single Core (1 Pipeline):
┌─────────────────────────┐
│ Fetch → Decode → Exec │ ← Only one instruction at a time
└─────────────────────────┘
Can't execute Task A and Task B simultaneously
Can only switch back and forth (context switching)

Multi-Core (Multiple Pipelines):

Core 1 (Pipeline 1):
┌─────────────────────────┐
│ Fetch → Decode → Exec │ ← Task A running
└─────────────────────────┘
Core 2 (Pipeline 2):
┌─────────────────────────┐
│ Fetch → Decode → Exec │ ← Task B running (simultaneously!)
└─────────────────────────┘
Result: At the same nanosecond:
Core 1 executes instruction from Task A
Core 2 executes instruction from Task B
TRUE PARALLELISM

Why Concurrency Alone Doesn’t Give Parallelism:

long result = 0;

// Two CPU-bound tasks
void task_a() {
    for (int i = 0; i < 1000000; i++) {
        result += (long)i * i;  // pure computation
    }
}

void task_b() {
    for (int i = 0; i < 1000000; i++) {
        result += (long)i * i;  // pure computation
    }
}

// Single-core execution (concurrency, NO parallelism):
//   Time:  0ms    10ms    20ms    30ms    40ms
//   [Task A][Task B][Task A][Task B][Task A][Task B]
//   Both tasks make progress, but never truly simultaneous
//   Total time: ~40ms (both tasks' times summed)

// Dual-core execution (parallelism):
//   Core 1: [Task A____________]
//   Core 2: [Task B____________]
//   Both tasks run truly simultaneously
//   Total time: ~20ms (the max of the two)
//   => 2x faster!

The Speedup Formula:

Single-core: Total time = Task A time + Task B time
Multi-core: Total time = max(Task A time, Task B time)
With 2 cores: Speedup = (A + B) / max(A, B)
If A ≈ B: Speedup ≈ 2x
If A >> B: Speedup ≈ 1x

Hardware Parallel Execution Example:

int a[1000], b[1000], c[1000];

// Core 1               Core 2
a[0] = 1;               a[1] = 1;
a[0] = a[0] + 2;        a[1] = a[1] + 2;
b[0] = a[0];            b[1] = a[1];
c[0] = b[0];            c[1] = b[1];
// ↓ each row executes at the SAME instant, one column per core

Misconception: Hyper-Threading Equals Two Cores

Intel’s hyper-threading creates the ILLUSION of 2 cores per physical core:

Physical Core 1 (hyper-threaded):
  Logical CPU 1: Thread A instructions
  Logical CPU 2: Thread B instructions
BUT: Both logical CPUs share the same execution resources
NOT true parallelism, but better than pure time-slicing
Can keep the pipeline busy while one thread stalls
Result: ~1.2x-1.5x speedup (not 2x like a real dual-core)

Why You Need Multiple Cores for Parallelism:

  1. One instruction stream per core: a single core cannot run two independent instruction sequences at once
  2. Shared execution resources: even a superscalar core's multiple ALUs and functional units all serve that one stream
  3. Memory bandwidth: additional cores bring their own caches and paths to RAM
  4. No alternative: you MUST have 2+ cores for true thread-level parallelism

Parallelism in Modern Processors:

Quad-core CPU:
┌─────────────────────┐
│ Core 1: [Pipeline] │
│ Core 2: [Pipeline] │
│ Core 3: [Pipeline] │
│ Core 4: [Pipeline] │
└─────────────────────┘
Each core can run a different thread/task simultaneously
Useful for:
- Image processing (4 threads processing 4 regions)
- Web server (4 cores handling 4 clients)
- Parallel algorithms (divide work across cores)

Key Limitation: Amdahl’s Law

Even with parallelism, speedup is limited:

If 10% of program is sequential (must run on 1 core)
And 90% is parallelizable (can run on N cores)
Speedup with 4 cores ≈ 1/(0.1 + 0.9/4) ≈ 3.1x (not 4x!)

Summary:

Feature                    | Single Core              | Multi-Core
───────────────────────────┼──────────────────────────┼───────────────────
Concurrency                | Yes (context switching)  | Yes (still needed)
Parallelism                | No (impossible)          | Yes
Speedup on CPU-bound work  | No                       | Yes
Hardware cost              | Low                      | Higher
Synchronization complexity | Lower                    | Higher

Problem 11: Multi-Threading on Single-Core

Section titled “Problem 11: Multi-Threading on Single-Core”

Give one reason why you would want to implement a multi-threaded application in a single-core machine.

Instructor Solution

The primary reason is to improve responsiveness during I/O-bound tasks. In a single-threaded application, if the program waits for a slow resource (like a disk read, a network response, or user input), the entire process—including the UI—blocks and freezes. In a multi-threaded system, while one thread is “blocked” waiting for I/O, the OS can switch the CPU to another thread. This allows the application to remain interactive or perform background calculations while the hardware handles the data transfer in the background.

Elaboration (by LLM)

The I/O Blocking Problem:

Modern applications spend most time waiting for I/O, not computing:

// Single-threaded approach (network_read/process_data are placeholders)
int main() {
    printf("Fetching data...\n");
    data = network_read();  // BLOCKS for 2 seconds
    // During these 2 seconds: process CANNOT respond to user input
    // CPU sits idle, waiting for network
    process_data(data);     // Computation: 100ms
    printf("Done!\n");
}

User experiences: 2 second freeze while waiting for network

Multi-Threaded Solution:

void *network_thread(void *arg) {
    data = network_read();  // BLOCKS for 2 seconds
    return NULL;
}

void *ui_thread(void *arg) {
    while (true) {
        handle_user_input();  // ALWAYS responsive
        update_display();     // Show progress
        usleep(10 * 1000);    // Yield for ~10 ms between updates
    }
    return NULL;
}

int main() {
    pthread_t net_tid, ui_tid;
    pthread_create(&net_tid, NULL, network_thread, NULL);
    pthread_create(&ui_tid, NULL, ui_thread, NULL);
    pthread_join(net_tid, NULL);  // Wait for the download to finish
    pthread_cancel(ui_tid);       // Then stop the UI loop
    return 0;
}

Timeline Comparison:

Single-threaded (FREEZES):
0ms: Network request starts
0ms: UI frozen (cannot handle input)
2000ms: Network response arrives
2000ms: UI becomes responsive
2000ms: Process data (100ms)
2100ms: Done
User experience: 2 second freeze!
Multi-threaded (RESPONSIVE):
0ms: Network thread starts request
0ms: UI thread immediately handles input
0ms: User can interact with UI
50ms: User sees "Downloading..." message
100ms: User clicks "Cancel" button (UI immediately responds)
500ms: UI shows progress (20% downloaded)
1000ms: UI shows progress (50% downloaded)
2000ms: Network response arrives
2000ms: Process data (100ms)
2100ms: UI shows results
User experience: Smooth, responsive!

Real-World Example: Email Client

Single-threaded:

User clicks "Send Email"
→ Compose thread exits
→ Network thread starts, blocks sending...
→ User CANNOT check mail, CANNOT type new email
→ After 30 seconds: finally sent, UI responsive again

Multi-threaded:

User clicks "Send Email"
→ Send thread spawned (handles the I/O)
→ UI thread continues immediately
→ User can check mail while sending
→ User can draft new email while sending
→ All while network thread waits for server

Why Single-Core is Still Useful:

When network blocks on Core 0:

Time Thread A (UI) Thread B (Network)
────────────────────────────────────────
0ms [handles input] [network request]
5ms [renders] [blocked waiting...]
10ms [handles input] [blocked waiting...]
15ms [renders] [blocked waiting...]
...
2000ms [handles input] [response arrived!]

While Thread B waits (100% CPU idle), Thread A uses CPU for UI.

This is the KEY benefit of concurrency on single-core: overlapping I/O with computation.

I/O Wait Time Hierarchy:

Memory access: ~100 nanoseconds
Disk seek: ~10 milliseconds (100,000x slower!)
Network request: ~100 milliseconds (1,000,000x slower!)
User think time: ~3 seconds (30,000,000x slower!)

A single thread waiting 100ms for network = huge opportunity for other threads!

Common I/O-Bound Scenarios:

  1. Web Server:

    • Thread 1: Handles Client A (blocked on network read)
    • Thread 2: Handles Client B (computing response)
    • Thread 3: Handles Client C (blocked on disk read)

    Each client thinks server is responsive, even on single core!

  2. Database Client:

    • Thread 1: UI (handles mouse clicks, keyboard)
    • Thread 2: Database query (blocked on DB server response)

    User can cancel query or start new one while previous blocks

  3. File Downloader:

    • Thread 1: Download thread 1 (blocked waiting for network)
    • Thread 2: Download thread 2 (blocked waiting for network)
    • Thread 3: Download thread 3 (blocked waiting for network)

    All three downloads overlap in time (concurrent I/O)

The Golden Rule:

For I/O-bound applications:
Even 1 core + multi-threading > 1 core + single thread
Why? Because:
- N threads can have N I/O operations in flight
- When one blocks, CPU switches to another
- CPU is never idle (unless all threads are blocked)
For CPU-bound applications:
Multi-threading on 1 core helps responsiveness
But NOT computational performance

Practical Example with Numbers:

Task: Download 3 files (each 10MB, 1 second over network)
Single-threaded:
Download file 1: [1000ms network I/O]
Download file 2: [1000ms network I/O]
Download file 3: [1000ms network I/O]
Total: 3000ms
Multi-threaded (1 core):
Thread 1: Download file 1 [1000ms I/O]
Thread 2: Download file 2 [1000ms I/O, concurrent]
Thread 3: Download file 3 [1000ms I/O, concurrent]
Total: ~1000ms (network operations overlap!)

The network doesn’t care about cores—it operates in parallel by nature!

Problem 12: Many-to-One Thread Mapping and I/O

Section titled “Problem 12: Many-to-One Thread Mapping and I/O”

Consider an OS where the threads are implemented by a user-level thread library and all threads of a process are mapped to just one kernel thread (many-to-one mapping). Consider a process that has two threads, T1 and T2, and T1 is running.

True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.

Instructor Solution

Answer: False.

In a many-to-one mapping, the operating system only sees the single kernel thread, not the individual user-level threads. When T1 executes a blocking I/O system call, the entire kernel thread (and thus the entire process) is placed into a “Blocked” state by the OS. Because the kernel is unaware that T2 even exists or is ready to run, it cannot schedule it; the user-level library never gets the chance to swap T1 for T2 until the kernel thread becomes “Ready” again.

Elaboration (by LLM)

The Kernel’s Limited Visibility:

The problem hinges on what the kernel can “see”:

Many-to-One Thread Mapping:
User-Level View:
┌─────────────────────────────────────────┐
│ Single Process (User-level Library) │
│ ┌─────────┐ ┌─────────┐ │
│ │ Thread1 │ │ Thread2 │ (Managed │
│ │ (Ready) │ │ (Ready) │ by library) │
│ └─────────┘ └─────────┘ │
└─────────────────────────────────────────┘
↓ (hidden from kernel)
Kernel's View:
┌─────────────────────────────────────────┐
│ Single Schedulable Entity │
│ (One Kernel Thread = LWP) │
│ (Kernel thread is "Blocked") │
└─────────────────────────────────────────┘

The kernel only manages the single kernel thread. User threads are merely an abstraction—like library bookmarks that track which function is executing.

The Blocking Disaster:

Scenario: T1 makes a blocking read

// Thread 1 in user space
int data[100];
int bytes = read(fd, data, sizeof(data)); // Blocking I/O!

What happens:

Time Kernel View User Library View
────────────────────────────────────────────────────
0ms KLT executing read() T1 executing read()
(dispatched to CPU) T2 waiting (Ready)
5ms read() syscall issued OS takes control
OS blocks waiting for I/O (KLT blocked)
10ms KLT still blocked T2 TRAPPED
(waiting for disk/network) Cannot run!
100ms I/O complete T1 can resume
KLT becomes Ready (but not yet)
105ms Kernel reschedules KLT User library gets control
(only resource it knows) Could now run T2
But OS doesn't know about
T2's existence!

Why User Library Can’t Help:

A many-to-one user-level thread library (e.g., old green-thread libraries) cannot intercept blocking kernel system calls:

// User library cannot see this:
void *thread_func(void *arg) {
    read(fd, buf, 100); // ← Direct syscall to kernel
    // ← User library has NO control here
    return NULL;
}

When T1 calls read():

  1. CPU switches from user mode to kernel mode
  2. User library is paused (no execution)
  3. Only kernel code runs
  4. Kernel doesn’t know about T2
  5. Entire process blocked

Result: T2 cannot run!

Comparison Table:

┌──────────────────┬──────────────┬──────────────┐
│ Scenario │ Many-to-One │ One-to-One │
├──────────────────┼──────────────┼──────────────┤
│ T1 blocks on I/O │ T2 CANNOT │ T2 CAN │
│ │ run │ run │
├──────────────────┼──────────────┼──────────────┤
│ Kernel sees │ 1 entity │ 2 entities │
│ │ (blocked) │ (one ready) │
├──────────────────┼──────────────┼──────────────┤
│ Scheduler can... │ Switch to │ Switch to T2 │
│ │ other │ directly │
│ │ processes │ │
├──────────────────┼──────────────┼──────────────┤
│ Used in │ Green threads│ Linux, Unix │
│ │ (obsolete) │ (modern) │
└──────────────────┴──────────────┴──────────────┘

Real-World Example: Web Server

Many-to-One (FAILS):
Thread 1: Handle Client A request
→ read() from network (blocking)
→ OS blocks entire process
→ Thread 2 CANNOT handle Client B
→ Client B waits (poor responsiveness)
→ After network I/O completes: Thread 1 resumes
One-to-One (WORKS):
Thread 1: Handle Client A request
→ read() from network (blocking)
→ OS blocks only Thread 1 kernel thread
→ Scheduler immediately picks Thread 2
→ Thread 2 handles Client B (responsive!)
→ Client B doesn't wait for Client A

Why Many-to-One Fails:

The fundamental problem: The OS controls scheduling, not the user library.

Control Flow with Many-to-One:
User Library: [Init] → [decide to switch T1→T2] → ...
(but this decision is useless)
Kernel: [receives syscall] → [KLT blocked]
→ [User Library frozen, cannot execute]
→ [User Library cannot switch threads]
→ [Only kernel can resume execution]
Result: User library's thread switching logic never runs!

When Blocking Occurs:

Safe (Non-blocking) I/O Scenarios:
1. Polling:
   T1: data = read_nonblocking(fd); // Returns immediately
       if (data) { process(data); }
   T2: can run between polls
2. Asynchronous I/O:
   T1: aio_read(&cb); // cb.aio_fildes = fd; returns immediately
   T2: can run while kernel handles I/O in background
But with standard blocking I/O:
   T1: read(fd, buf, 100); // BLOCKS
   T2: TRAPPED (user library powerless)

Historical Note: Green Threads

Java and some older systems used many-to-one:

Java 1.0 - 1.2:
❌ All user threads → 1 OS thread
❌ Blocking I/O in one thread froze entire app
❌ Limited to single core (no parallelism)
Java 1.3+:
✅ One user thread → 1 OS thread
✅ Blocking doesn't freeze app
✅ Can utilize multiple cores
Why the change? Because many-to-one is fundamentally broken!

The Key Lesson:

Blocking System Call + Many-to-One Thread Mapping
=
Entire Process Freezes
(User library has no control once kernel takes over)

This is why modern OSes (Linux, Windows, macOS) use one-to-one mapping: the kernel must be aware of each thread for true concurrency with blocking I/O.

Problem 13: One-to-One Thread Mapping and I/O

Section titled “Problem 13: One-to-One Thread Mapping and I/O”

Consider an OS where each user-level thread has a corresponding kernel-level thread (one-to-one mapping) as in Linux. Consider a process that has two threads, T1 and T2 and T1 is running.

True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.

Instructor Solution

Answer: True.

In a one-to-one mapping, every user-level thread is a distinct schedulable entity in the eyes of the kernel. Because the OS kernel manages each thread individually, it recognizes that only T1 is waiting for I/O. While T1 is in the “Blocked” state, the kernel’s scheduler can independently pick T2 from the “Ready” queue and dispatch it to a CPU core.

Elaboration (by LLM)

Kernel Awareness of Each Thread:

In one-to-one mapping, the kernel maintains separate TCBs for each thread:

One-to-One Mapping:
User Process:
┌─────────────────┐ ┌─────────────────┐
│ Thread 1 │ │ Thread 2 │
└────────┬────────┘ └────────┬────────┘
│ │
↓ ↓
Kernel Thread Manager:
┌─────────────────────────────────────────────────┐
│ TCB 1: (T1's context) │
│ State: Blocked (I/O) │
├─────────────────────────────────────────────────┤
│ TCB 2: (T2's context) │
│ State: Ready │
└─────────────────────────────────────────────────┘
Scheduler: Pick KLT2 freely!
T2 runs!

Unlike many-to-one where the kernel sees one blocked entity, here it sees two independent schedulable entities.

Key Difference from Problem 12:

When T1 blocks on I/O in one-to-one mapping:

  • Kernel knows about TCB1 AND TCB2 (both visible)
  • Kernel blocks only TCB1 (puts it in I/O wait queue)
  • TCB2 remains in Ready queue (still schedulable)
  • Scheduler freely picks TCB2 to run
  • T2 executes while T1 waits on network/disk

Real-World Impact - Web Server Example:

With 1000 clients and 1 core:

Many-to-One (Problem 12):

  • Client 1 → Thread 1 (blocked on read)
  • Process blocked
  • Clients 2-1000 wait sequentially
  • Response time: ~1000 seconds

One-to-One (Problem 13):

  • Thread 1: Client A (blocked)
  • Thread 2: Client B (executing)
  • Thread 3: Client C (executing)
  • CPU time-shares among 1000 threads
  • Network handles I/O in parallel
  • Response time: ~10 seconds

The kernel schedules based on what it can SEE—and in one-to-one, it sees all threads!

Consider a process with 4 threads in Linux. When a thread invokes fork() system call to create a new process, how many threads will the new process have? Briefly explain.

Instructor Solution

In Linux, the new process will have only one thread. When a thread calls fork(), the operating system creates a child process that is a duplicate of the caller’s address space. However, only the calling thread is duplicated into the child process; the other three threads simply cease to exist in the new process.

This behavior prevents complex issues, such as a child process inheriting threads that were in the middle of a sensitive operation or holding a mutex that will never be released (since the thread holding it wasn’t copied).

Elaboration (by LLM)

Why Only the Calling Thread?

When all threads were hypothetically copied:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *thread2_func(void *arg) {
    pthread_mutex_lock(&lock);  // Thread 2 is HERE (middle of critical section)
    // ... doing critical work ...
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t tid2, tid3, tid4;
    pthread_create(&tid2, NULL, thread2_func, NULL);
    pthread_create(&tid3, NULL, other_work, NULL);
    pthread_create(&tid4, NULL, other_work, NULL);
    fork();  // DISASTER if all threads were copied!
}

If all threads were duplicated:

Child Process After fork():

  • Thread 2 (copied): Has lock, in middle of critical section
    • Cannot unlock (state undefined)
    • Mutex is corrupted in child!
  • Thread 3 (copied): Tied to parent’s I/O operations
  • Thread 4 (copied): Managing memory
  • Result: Complete chaos, deadlocks, data corruption!

The Safe Solution: Copy Only Calling Thread

int main() {
    pthread_t tid2, tid3, tid4;
    pthread_create(&tid2, NULL, worker, NULL);
    pthread_create(&tid3, NULL, worker, NULL);
    pthread_create(&tid4, NULL, worker, NULL);
    pid_t child_pid = fork();  // Safe!
    if (child_pid == 0) {
        // Child process
        printf("Child has 1 thread\n");  // Only the calling thread
        // Safe to do child-specific work
        close(shared_fd);  // Don't interfere with parent
        _exit(0);
    } else {
        // Parent process
        printf("Parent still has 4 threads\n");
        pthread_join(tid2, NULL);
        pthread_join(tid3, NULL);
        pthread_join(tid4, NULL);
    }
}

Thread Count After fork():

Before fork(): 4 threads (1 main, 3 workers)
After fork(): Parent has 4 threads (unchanged)
Child has 1 thread (calling thread only)

Common Problem: Lost Worker Threads

void *worker_thread(void *arg) {
    while (true) {
        job_t *job = get_next_job();
        process_job(job);
    }
}

int main() {
    // Create thread pool (8 worker threads)
    for (int i = 0; i < 8; i++) {
        pthread_create(&workers[i], NULL, worker_thread, NULL);
    }
    fork();  // Oops!
}

In child process:

  • All 8 workers are GONE
  • Only main thread remains
  • If child tries to add jobs: queue fills up
  • No workers to process them: DEADLOCK

Best Practice: Call fork() before creating threads!

Consider a process with 4 threads in Linux. When a thread invokes exec() system call to load a new executable on top of the current process, how many threads will the new process have? Briefly explain.

Instructor Solution

The new process will have only one thread. When a thread calls exec(), the entire address space of the process is replaced with the new executable. Because the old code, data, and stacks are wiped out to make room for the new program, all existing threads—including the one that called exec()—are terminated. The new program then begins execution from its main() function with a single, fresh thread of control.

Elaboration (by LLM)

Complete Address Space Replacement:

The exec() syscall doesn’t “update” the process—it completely replaces it:

// Old program - 4 threads (main + 3 workers)
int main() {
    pthread_t tid2, tid3, tid4;
    pthread_create(&tid2, NULL, work, NULL);
    pthread_create(&tid3, NULL, work, NULL);
    pthread_create(&tid4, NULL, work, NULL);
    sleep(10); // Simulate work
    execv("/usr/bin/new_program", args); // REPLACED!
    // Nothing after this executes in old program
}

What exec() does:

  1. Flush all state: Old code, old data, old stacks
  2. Release all threads: Threads 2, 3, 4 are killed
  3. Load new executable: Read new binary from disk
  4. Initialize new process: Fresh data, fresh heap
  5. Start new main(): Single thread control

Thread Termination Timeline:

Time Thread 1 Thread 2 Thread 3
──────────────────────────────────────────────────────────
0ms main() Starting Starting
pthread_create()
5ms sleep(10) Running Running
10ms execv() called Running Running
11ms BANG! TERMINATED TERMINATED
(old code destroyed) (killed instantly) (killed instantly)
11ms ┌──────────────────┐
│ NEW EXECUTABLE │
│ CODE LOADED │
│ main() running │
│ 1 thread now │
└──────────────────┘

Key Difference from fork():

fork():
├─ Parent: Unchanged, all threads continue
└─ Child: 1 thread (copy of caller)
Result: Two processes
exec():
├─ Replace current process entirely
├─ All threads (except caller) killed
├─ New code loaded
└─ New main() starts with 1 thread
Result: Same process, different program

Standard Pattern: fork() + exec()

This is the standard way to run new programs:

pid_t pid = fork();
if (pid == 0) {
    // Child process has 1 thread (safe to manipulate)
    close(STDIN_FILENO);
    open("/input.txt", O_RDONLY);
    execv("/bin/cat", args); // Never returns if successful
    perror("execv failed");
    exit(1);
} else {
    // Parent process still has all threads
    waitpid(pid, NULL, 0); // Wait for child's program
}

Resource Leaks After exec():

int main() {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int fd = open("data.txt", O_RDONLY);
    pthread_create(&tid, NULL, work, NULL);
    execv("/bin/true", args);
    // exec() doesn't return on success; note what happens to each resource:
    // - Thread tid: GONE
    // - Mutex m: LOST (never unlocked)
    // - File fd: stays open across exec (unless marked O_CLOEXEC)
}

Solution: Use fork() + exec() carefully to isolate the new program!

A process has two threads, T1 and T2. Assume that T1 allocates some memory on the heap as follows:

int *ptr = new int[100];

T1 then passes this pointer “ptr” to T2. Can T2 legally access the array pointed to by ptr? Justify your answer.

Instructor Solution

Yes. The array pointed to by “ptr” is located in the heap, and the threads that belong to the same process share the same heap. If T2 has the pointer, it can read/write that memory easily. This is the standard way threads share large amounts of data: by storing them either in static global variables or in the heap.

Elaboration (by LLM)

Why Heap Works (Unlike Stack):

Stack (Problem 17 - dangerous):

  • Thread 1 Stack: 0x5000 - 0x6000 (private)
  • Thread 2 Stack: 0x3000 - 0x4000 (private)
  • Each thread has separate stack
  • When thread exits, stack frame destroyed
  • Addresses become invalid (dangling pointers)

Heap (Safe):

  • Single shared heap: 0x8000 - 0x20000
  • All threads allocate from same memory region
  • When thread exits, heap memory persists
  • Other threads can still access it

Pointer Transfer Mechanism:

int *shared_ptr = NULL;  // Shared (global) variable

void *thread1_func(void *arg) {
    int *ptr = (int *)malloc(100 * sizeof(int));  // Heap alloc
    // T1's local variable 'ptr' holds an address, e.g., 0x8000,
    // that points into the SHARED heap
    shared_ptr = ptr;  // Pass to T2: copy the ADDRESS (just a value!)
    return NULL;
}

void *thread2_func(void *arg) {
    // T2 receives the same address from shared_ptr
    int *ptr = shared_ptr;   // Same address as T1!
    // Now T2 can access T1's heap allocation
    ptr[0] = 100;            // OK - writing to shared heap
    printf("%d\n", ptr[50]); // OK - reading from shared heap
    return NULL;
}

Memory Lifetime Guarantee:

void *thread1_func(void *arg) {
    int *ptr = malloc(100 * sizeof(int)); // Heap allocation
    *(int **)arg = ptr;  // Share with T2 (store address)
    return NULL;         // T1 exits
}   // T1's STACK frame destroyed
    // BUT heap at ptr still valid!

void *thread2_func(void *arg) {
    int *data = *(int **)arg; // Get T1's allocation
    printf("%d\n", data[0]);  // ✅ Still accessible!
    return NULL;
}

The heap allocation outlives the thread that created it!

Common Heap Sharing Patterns:

Pattern 1: Global Pointer

int *global_ptr = NULL;

void *thread1_func(void *arg) {
    global_ptr = malloc(1000); // T1 allocates
    global_ptr[0] = 42;
    return NULL;
}

void *thread2_func(void *arg) {
    printf("%d\n", global_ptr[0]); // T2 uses T1's allocation ✅
    return NULL;
}

Pattern 2: Function Argument

void *thread_func(void *arg) {
    int *buffer = (int *)arg; // Buffer passed from main
    buffer[0] = 42;           // ✅ Access shared heap
    return NULL;
}

int main() {
    int *buffer = malloc(100 * sizeof(int));
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, (void *)buffer);
    buffer[0] = 99;           // Main also uses it ✅
    pthread_join(tid, NULL);
    free(buffer);
}

Pattern 3: Thread-Safe Queue

struct Node {
    int data;
    struct Node *next;
};

void enqueue(int value) {
    struct Node *node = malloc(sizeof(struct Node)); // Heap!
    node->data = value;
    node->next = NULL;
    pthread_mutex_lock(&lock);
    // Link into queue
    pthread_mutex_unlock(&lock);
}

void *worker_thread(void *arg) {
    struct Node *node = dequeue(); // T2 accesses T1's heap alloc
    printf("%d\n", node->data);    // ✅ Works
    free(node);
    return NULL;
}

Why This is Safe:

The heap is a single, process-wide resource:

  • One allocator (e.g., glibc malloc) manages the heap for all threads
  • The kernel maps the heap pages into the process's shared address space
  • All threads see a consistent view of that memory
  • An address remains valid until some thread explicitly calls free()

A process has two threads, T1 and T2. Assume that T1 declares a local variable x as follows:

int x = 10;

T1 then passes the address of this variable “x” to T2. Can T2 legally access “x”? Justify your answer.

Instructor Solution

Recall that local variables are on the stack, and every thread within the same process has their own private stack. While T2 can technically access T1’s stack if it has the memory address (since they share the same address space), it is extremely dangerous. If T1 returns from its current function, its stack frame is destroyed, and T2 will be pointing to “garbage” memory, likely causing a crash.

Elaboration (by LLM)

Stack Layout: Thread-Private Regions

Each thread gets its own stack:

Process Address Space:
Text (Shared) All threads execute same code
Data + BSS (Shared) All threads share globals
Heap (Shared) All threads allocate from same heap
STACKS (THREAD-PRIVATE!)
├─ Stack for Thread 1 0x7000: [x = 10] ← T1's local var
└─ Stack for Thread 2 0x4000: [locals] ← T2's locals

Stacks grow in opposite directions, thread-private regions isolated.

The Danger: Stack Frame Destruction

void function_with_local(void *shared_ptr_arg) {
    int x = 10; // Local variable on T1's stack at 0x7000
    // Pass address to T2
    int **ptr_holder = (int **)shared_ptr_arg;
    *ptr_holder = &x; // T2 now holds 0x7000
    return; // CRITICAL MOMENT!
    // Stack frame is DESTROYED here
    // Address 0x7000 is no longer valid
}

Timeline:

Time Thread 1 Thread 2
──────────────────────────────────────────────────────
0ms function_with_local()
x = 10
ptr = &x (passes 0x7000) (waiting)
5ms (doing work) Receives 0x7000
printf(*ptr); // 10 ✅
10ms return; Still holding 0x7000
[Stack destroyed]
[0x7000 now invalid]
[0x7000 becomes garbage]
15ms Another_function() printf(*ptr); // CRASH!
[Reuses stack at 0x7000] (Undefined behavior)
int y = 999; [0x7000=999] Might print garbage

Race Condition - Timing Problem:

int *shared_ptr = NULL;

void *thread1_func(void *arg) {
    int x = 10;
    shared_ptr = &x; // T2 now has the address
    return NULL;     // Stack destroyed!
}

void *thread2_func(void *arg) {
    sleep(1); // Wait 1 second; by now, T1 has exited
    if (shared_ptr) {
        printf("%d\n", *shared_ptr); // CRASH! Dangling pointer!
    }
    return NULL;
}

Comparison: Why Heap Works vs. Stack Fails

HEAP ALLOCATION (Safe):

void *thread1_func(void *arg) {
    int *ptr = malloc(100); // Allocated on heap
    return ptr;             // Can return a heap pointer safely
}   // Stack destroyed, heap survives!

void *thread2_func(void *arg) {
    int *data = result;     // 'result' obtained via pthread_join()
    printf("%d", *data);    // ✅ Works! (heap still allocated)
}

STACK ALLOCATION (Dangerous):

void *thread1_func(void *arg) {
    int x = 10;
    return &x;              // RETURNS STACK ADDRESS!
}   // Stack destroyed, &x invalid!

void *thread2_func(void *arg) {
    int *data = result;     // 'result' obtained via pthread_join()
    printf("%d", *data);    // ❌ CRASH! (stack freed)
}

Best Practices:

❌ DON'T: Pass stack addresses between threads
❌ DON'T: Return &local_variable from thread function
❌ DON'T: Store stack pointers in shared variables
✅ DO: Use heap allocations (malloc)
✅ DO: Use global variables (for constants)
✅ DO: Copy values instead of addresses (when safe)

Correct Pattern:

void *thread1_func(void *arg) {
    // Allocate on HEAP instead
    int *data = malloc(sizeof(int));
    *data = 10;
    // Share the address
    *(int **)arg = data; // Safe!
    return NULL; // Stack destroyed, but heap survives
}

void *thread2_func(void *arg) {
    int *data = *(int **)arg;
    sleep(10); // T1 has exited by now
    printf("%d\n", *data); // ✅ Works! (heap still allocated)
    free(data); // Proper cleanup
    return NULL;
}

A process has two threads, T1 and T2. Assume that T1 opens a file as follows:

int fd = open("file.txt", O_RDONLY);

Assume that “fd” is a global variable and T2 has access to it. Can T2 call read(fd, …) to read data from “file.txt”? Justify your answer.

Instructor Solution

Yes. File descriptors are part of the process's resources. The file descriptor table is shared by all threads within the same process. Once any thread opens a file, the resulting integer ID is valid for every other thread in that process, so any thread that has access to the file descriptor can read from the file.

Elaboration (by LLM)

File Descriptor Table: Process-Wide Resource

File descriptors are stored in a kernel structure that’s shared across all threads:

Process Resources (Shared):
File Descriptor Table
FD 0: stdin
FD 1: stdout
FD 2: stderr
FD 3: file.txt ← (opened by T1)
FD 4: socket
FD 5: /dev/random
Threads:
T1: Can use FD 3 ✅
T2: Can use FD 3 ✅ (T1 opened it)
Both threads reference same kernel file entry!

Unlike stacks (thread-private) and heaps (process-shared but allocated separately), file descriptors are explicitly process-wide resources.

File Descriptor Operations Across Threads:

int fd = -1; // Global variable

void *thread1_func(void *arg) {
    // T1 opens the file
    fd = open("file.txt", O_RDONLY);
    printf("[T1] Opened file with fd=%d\n", fd);
    // Read first 100 bytes
    char buf1[100];
    ssize_t n = read(fd, buf1, 100);
    printf("[T1] Read %ld bytes\n", n);
    // File position is now at byte 100
    // This position is SHARED with T2!
    sleep(1);
    return NULL;
}

void *thread2_func(void *arg) {
    usleep(500000); // Let T1 read first (sleep() only takes whole seconds)
    // T2 can use the FD that T1 opened
    // Important: File position is SHARED!
    char buf2[50];
    ssize_t n = read(fd, buf2, 50); // ✅ Works!
    printf("[T2] Read %ld bytes from fd=%d\n", n, fd);
    printf("[T2] Got data from byte position 100 onwards\n");
    return NULL;
}

int main() {
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, thread1_func, NULL);
    pthread_create(&tid2, NULL, thread2_func, NULL);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    if (fd >= 0) close(fd);
    return 0;
}

Output:

[T1] Opened file with fd=3
[T1] Read 100 bytes
[T2] Read 50 bytes from fd=3
[T2] Got data from byte position 100 onwards

Critical Issue: Shared File Position

File descriptors don’t just refer to files—they track the current read/write position:

File: file.txt (1000 bytes)
Initially:
[0][1][2]...[100]...[200]...[1000]
Position 0
After T1 reads 100 bytes:
[0][1][2]...[100]...[200]...[1000]
Position 100 (shared!)
T2 calls read():
[0][1][2]...[100]...[150]...[1000]
Position 150 (after reading 50)

Race Condition Danger:

int fd = open("data.txt", O_RDONLY);

void *thread1_func(void *arg) {
    char buf[10];
    read(fd, buf, 10); // Reads bytes 0-9
    // Scheduling might switch to T2 here!
    printf("T1: %s\n", buf);
    return NULL;
}

void *thread2_func(void *arg) {
    char buf[10];
    read(fd, buf, 10); // Reads which bytes?
    printf("T2: %s\n", buf);
    return NULL;
}

File position is shared, so multiple threads reading concurrently can cause data loss or confusion!

Proper Synchronization:

pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;
int fd = -1;

void *thread1_func(void *arg) {
    pthread_mutex_lock(&fd_lock);
    // Only T1 is reading, file position safe
    char buf[10];
    ssize_t n = read(fd, buf, 10);
    printf("T1 read: %.*s\n", (int)n, buf);
    pthread_mutex_unlock(&fd_lock);
    return NULL;
}

void *thread2_func(void *arg) {
    pthread_mutex_lock(&fd_lock); // Wait if T1 is reading
    char buf[10];
    ssize_t n = read(fd, buf, 10);
    printf("T2 read: %.*s\n", (int)n, buf);
    pthread_mutex_unlock(&fd_lock);
    return NULL;
}

With mutex: Each read() is atomic, file position updates predictably.

Common FD Sharing Patterns:

Pattern 1: Read-Only with Positioned Reads

// Config file read by multiple threads
int config_fd = open("config.txt", O_RDONLY);

void *worker_thread(void *arg) {
    // pread() takes an explicit offset and never touches the shared
    // file position, so concurrent readers cannot race on it
    pread(config_fd, local_buf, SIZE, THREAD_OFFSET);
    return NULL;
}

(Note: lseek() followed by read() would NOT be safe here, because the seek and the read are not atomic with respect to other threads.)

Pattern 2: Append-Only (Requires Mutex)

int log_fd = open("log.txt", O_WRONLY | O_APPEND);
pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

void *thread_func(void *arg) {
    pthread_mutex_lock(&log_lock);
    write(log_fd, "message\n", 8); // O_APPEND: each write lands at EOF
    pthread_mutex_unlock(&log_lock);
    return NULL;
}

Comparison: FDs vs. Heap vs. Stack

Resource     | Shared?      | Access   | Issues
─────────────┼──────────────┼──────────┼──────────────────────────
File Descr.  | Yes (global) | Via FD   | Shared position (races!)
Heap Memory  | Yes (shared) | Via ptr  | Race conditions on data
Stack Memory | No (private) | Via addr | Dangling pointers

File descriptors are meant for sharing—but be careful with concurrent operations!

What’s the output of the following code when run in a Unix system? Assume you have 2 functions:

  • (1) CreateThread(F, arg) creates a thread that starts running at the function “F” and takes the argument “arg” and returns the id of the created thread
  • (2) JoinThread(int threadId) is used to wait for the termination of a thread
int g = 0;
int tids[2];

void T(int arg) {
    if (arg == 2) {
        g++;
        printf("[MainThread] g: %d\n", g);
    }
    if (arg == 1) {
        g++;
        printf("[Thread1] g: %d\n", g);
    }
    if (arg == 0) {
        JoinThread(tids[1]);
        g++;
        printf("[Thread0] g: %d\n", g);
    }
} // end-T

main() {
    tids[1] = CreateThread(T, 1);
    tids[0] = CreateThread(T, 0);
    JoinThread(tids[0]);
    T(2);
} // end-main
Instructor Solution

Output:

[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3

Elaboration (by LLM)

Execution Flow Analysis:

The key to understanding this is tracking the order of thread creation, execution, and synchronization:

Timeline:

main():
    tids[1] = CreateThread(T, 1);   // Create Thread1 (arg=1)
    tids[0] = CreateThread(T, 0);   // Create Thread0 (arg=0)
    JoinThread(tids[0]);            // WAIT for Thread0 to finish
    T(2);                           // Execute main's function

Concurrent Execution:

Main:    creates Thread1 and Thread0, then blocks in JoinThread(tids[0])
Thread0: calls JoinThread(tids[1]) and blocks until Thread1 terminates
Thread1: g++ → prints "[Thread1] g: 1" → terminates
Thread0: unblocks → g++ → prints "[Thread0] g: 2" → terminates
Main:    unblocks → calls T(2) → g++ → prints "[MainThread] g: 3" → returns

Detailed Execution Order:

  1. Main: Creates Thread1 with T(1)
  2. Main: Creates Thread0 with T(0)
  3. Main: Calls JoinThread(tids[0]) → Main blocks, waiting for Thread0 to finish
  4. Thread1: Starts executing T(1)
    • arg == 1 is true
    • g++ (g becomes 1)
    • printf(“[Thread1] g: 1\n”)
    • Thread1 finishes
  5. Thread0: Starts executing T(0)
    • arg == 0 is true
    • Calls JoinThread(tids[1]) → Thread0 blocks until Thread1 finishes (if Thread1 has already finished, the call returns immediately)
    • g++ (g becomes 2)
    • printf(“[Thread0] g: 2\n”)
    • Thread0 finishes
  6. Main: JoinThread returns (Thread0 finished)
    • Calls T(2)
    • arg == 2 is true
    • g++ (g becomes 3)
    • printf(“[MainThread] g: 3\n”)
    • Main finishes

Critical Insight: Shared Global Variable g

All three function invocations increment the same global variable g:

Initial: g = 0
After Thread1: g = 0 + 1 = 1
After Thread0: g = 1 + 1 = 2
After Main: g = 2 + 1 = 3

Note: This works predictably because:

  • Thread1 finishes before Thread0 even starts checking it
  • Thread0 waits for Thread1 explicitly
  • Main waits for Thread0 explicitly
  • No race conditions here (synchronization prevents them)

Key Learning Points:

  1. Thread Creation Order: Threads are created in order (Thread1 first, Thread0 second)
  2. Thread Start Order: May be different from creation order (depends on scheduler)
  3. JoinThread Synchronization: Blocks until the thread finishes
  4. Global Variable Access: All threads access the same g variable
  5. Execution Order: Determined by JoinThread calls, not creation order

Consider parallel summation: You are given an array A of N integers, and you are asked to compute the sum of the elements of this array in parallel as follows: The main thread will create 4 threads, each thread will compute the sum of N/4 of the elements of the array and terminate. The main thread will then wait for the termination of the worker threads, compute the overall sum, print it on the screen and terminate.

Assume you have 2 functions:

  • (1) CreateThread(F, arg) creates a thread that starts running at the function “F” and takes the argument “arg” and returns the id of the created thread
  • (2) JoinThread(int threadId) is used to wait for the termination of a thread

Assume that the array A is global and that N is divisible by 4.

Instructor Solution
#include <stdio.h>

#define N ...

int A[N];          // Array to be summed
int localSums[4];

void doSum(int tid) {
    int start = tid * N / 4;
    int end = start + N / 4;
    int sum = 0;
    for (int i = start; i < end; i++)
        sum += A[i];
    localSums[tid] = sum;
}

void main() {
    int tids[4];
    for (int i = 0; i < 4; i++)
        tids[i] = CreateThread(doSum, i);
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);
        sum += localSums[i];
    }
    printf("Sum: %d\n", sum);
}

Elaboration (by LLM)

Problem Decomposition:

To parallelize summation with 4 threads, divide the array into 4 equal parts:

Array A[0..N-1]:
[0][1]...[N/4-1] | [N/4]...[N/2-1] | [N/2]...[3N/4-1] | [3N/4]...[N-1]
├─ Thread 0 ────┤ ├─ Thread 1 ────┤ ├─ Thread 2 ────┤ ├─ Thread 3 ────┤
└─ Sum A[0:N/4] ─┘ └─ Sum A[N/4:N/2]└─ Sum A[N/2:3N/4]└─ Sum A[3N/4:N]─┘
→ localSums[0] → localSums[1] → localSums[2] → localSums[3]
Final Sum = localSums[0] + localSums[1] + localSums[2] + localSums[3]

Parallel Execution Model:

Main Thread:
Create 4 threads
├─ Thread 0: Sum A[0..N/4-1]
├─ Thread 1: Sum A[N/4..N/2-1]
├─ Thread 2: Sum A[N/2..3N/4-1]
└─ Thread 3: Sum A[3N/4..N-1]
(All 4 compute in parallel)
Wait for all to finish
(JoinThread for each)
Aggregate results: Sum all localSums[i]
Print final sum

Key Design Patterns:

1. Work Distribution:

Each thread gets a unique thread ID (tid = 0, 1, 2, 3):

void doSum(int tid) {
    int start = tid * N / 4;   // Where this thread starts
    int end = start + N / 4;   // Where this thread ends
    // Thread 0: start=0,    end=N/4
    // Thread 1: start=N/4,  end=N/2
    // Thread 2: start=N/2,  end=3N/4
    // Thread 3: start=3N/4, end=N
}

2. Local Computation:

Each thread computes a LOCAL sum without synchronization:

int sum = 0;
for (int i = start; i < end; i++)
    sum += A[i];
localSums[tid] = sum;   // Store result for main thread

No race conditions because:

  • Each thread writes to unique localSums[tid]
  • No shared write access during computation

3. Aggregation:

Main thread combines results AFTER all threads finish:

int sum = 0;
for (int i = 0; i < 4; i++) {
    JoinThread(tids[i]);    // Wait for thread i
    sum += localSums[i];    // Add its result
}

Example with Real Numbers:

Array: A = [1, 2, 3, 4, 5, 6, 7, 8] (N=8)
Thread Distribution:
Thread 0: A[0..1] = [1, 2] → localSums[0] = 3
Thread 1: A[2..3] = [3, 4] → localSums[1] = 7
Thread 2: A[4..5] = [5, 6] → localSums[2] = 11
Thread 3: A[6..7] = [7, 8] → localSums[3] = 15
Aggregation:
sum = 0
sum += localSums[0] = 3
sum += localSums[1] = 7 (sum = 10)
sum += localSums[2] = 11 (sum = 21)
sum += localSums[3] = 15 (sum = 36)
Final Output: Sum: 36

Performance Analysis:

Without Parallelization:

Single thread sums N elements:
Time = O(N)

With 4-Thread Parallelization:

Each thread sums N/4 elements:
Time = O(N/4) for computation
Plus O(1) for aggregation
Total = O(N/4) (speedup of ~4x on 4 cores)

Critical Sections:

This code has NO critical sections!

  • Each thread writes to different localSums[tid] (no race)
  • Main thread only reads after all threads finish (no race)
  • All synchronization explicit (JoinThread)

Why This Design is Robust:

// Good: No races, clean synchronization
int localSums[4];                 // Each thread writes a unique index

void doSum(int tid) {
    localSums[tid] = sum;         // Safe! tid is unique per thread
}

void main() {
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);      // Explicit wait
        sum += localSums[i];      // Read only after join
    }
}

// Bad: would cause race conditions
int globalSum = 0;                // Shared!

void doSum(int tid) {
    globalSum += sum;             // RACE! Unsynchronized read-modify-write
}                                 // by multiple threads loses updates

Scalability:

To parallelize across K threads (say, one per core) instead of 4:

#define THREADS K   // a separate name; reusing N would collide with the array size

int localSums[THREADS];

void doSum(int tid) {
    int start = tid * N / THREADS;
    int end = start + N / THREADS;
    // ... same logic ...
}

void main() {
    int tids[THREADS];
    for (int i = 0; i < THREADS; i++)
        tids[i] = CreateThread(doSum, i);
    int sum = 0;
    for (int i = 0; i < THREADS; i++) {
        JoinThread(tids[i]);
        sum += localSums[i];
    }
    printf("Sum: %d\n", sum);
}

The logic remains identical, just parameterized!

Write a multi-threaded program to find the minimum element of an array with two threads. The first thread should find the minimum of the first half of the array and the second thread should find the minimum of the second half. Each thread should start immediately upon being created and each must execute the same function to find the minimum. After the threads have terminated, the minimum element of the array should be printed by the main thread.

Assume that the array A is global and that N is divisible by 2.

#include <stdio.h>
#define N ...
int A[N]; // Array for which the minimum is to be computed
Instructor Solution
#include <stdio.h>
#include <limits.h>   // for INT_MAX

#define N ...

int A[N];             // Array for which the minimum is to be computed
int localMins[2];

void findMin(int tid) {
    int start = tid * N / 2;
    int end = start + N / 2;
    int min = INT_MAX;
    for (int i = start; i < end; i++)
        if (A[i] < min)
            min = A[i];
    localMins[tid] = min;
}

void main() {
    int tids[2];
    for (int i = 0; i < 2; i++)
        tids[i] = CreateThread(findMin, i);
    for (int i = 0; i < 2; i++)
        JoinThread(tids[i]);
    if (localMins[0] < localMins[1])
        printf("Min: %d\n", localMins[0]);
    else
        printf("Min: %d\n", localMins[1]);
}

Elaboration (by LLM)

Array Partitioning for Min-Finding:

Split array into 2 halves for 2 threads:

Array A[0..N-1]:
[0][1]...[N/2-1] | [N/2]...[N-1]
├─ Thread 0 ─────┤ ├─ Thread 1 ──┤
└─ Find min of ──┘ └─ Find min of─┘
first half second half
localMins[0] localMins[1]
Final Min = min(localMins[0], localMins[1])

Key Differences from Problem 20 (Summation):

Problem 20 (Sum):                  Problem 21 (Min):
- Combine step: ADD                - Combine step: COMPARE
- sum += A[i]                      - if (A[i] < min) min = A[i]
- Final: add all partial sums      - Final: min(part1, part2)
- Operation: + (associative)       - Operation: min() (associative)

Both parallelize the SAME WAY:
1. Partition the array
2. Each thread processes its partition
3. Main thread combines the results

Parallel Execution Model:

Main Thread:
Create 2 threads
├─ Thread 0: Find min in A[0..N/2-1]
└─ Thread 1: Find min in A[N/2..N-1]
(Both compute in parallel)
Wait for both to finish
(JoinThread for each)
Compare results: min(localMins[0], localMins[1])
Print overall minimum

Algorithm Details:

1. Initialization:

int min = INT_MAX; // Start with largest possible value

Why INT_MAX?

  • Any array element will be less than INT_MAX
  • Ensures first element becomes the min

2. Comparison:

for (int i = start; i < end; i++)
if (A[i] < min)
min = A[i];

Each thread finds the minimum in its range.

3. Storage:

localMins[tid] = min; // Thread tid stores its minimum

Each thread writes to unique index (no race).

4. Final Comparison:

if (localMins[0] < localMins[1])
printf("Min: %d\n", localMins[0]);
else
printf("Min: %d\n", localMins[1]);

Main thread compares the two local minima.

Example Execution:

Array: A = [5, 3, 9, 1, 7, 2, 8, 4] (N=8)
Thread Distribution:
Thread 0: A[0..3] = [5, 3, 9, 1]
min = INT_MAX
min = 5 (A[0] < INT_MAX)
min = 3 (A[1] < 5)
min = 3 (A[2]=9 not < 3)
min = 1 (A[3] < 3)
localMins[0] = 1
Thread 1: A[4..7] = [7, 2, 8, 4]
min = INT_MAX
min = 7 (A[4] < INT_MAX)
min = 2 (A[5] < 7)
min = 2 (A[6]=8 not < 2)
min = 2 (A[7]=4 not < 2)
localMins[1] = 2
Final Comparison:
localMins[0] = 1
localMins[1] = 2
1 < 2 → Print "Min: 1"

Why Use INT_MAX Instead of First Element?

// Approach 1: Using INT_MAX (current solution)
int min = INT_MAX;
for (int i = start; i < end; i++)
    if (A[i] < min)
        min = A[i];

Advantages:
✅ Works even if the first element is the largest
✅ Cleaner logic
✅ No special-case handling

// Approach 2: Using the first element
int min = A[start];
for (int i = start + 1; i < end; i++)
    if (A[i] < min)
        min = A[i];

Disadvantages:
❌ Slightly more complex (must skip the first element)
❌ Requires special handling if a thread's range could be empty

Generalization to K Threads:

For K threads finding minimum:

#define K 4   // Number of threads

int localMins[K];

void findMin(int tid) {
    int start = tid * N / K;
    int end = start + N / K;
    int min = INT_MAX;
    for (int i = start; i < end; i++)
        if (A[i] < min)
            min = A[i];
    localMins[tid] = min;
}

void main() {
    int tids[K];
    for (int i = 0; i < K; i++)
        tids[i] = CreateThread(findMin, i);
    for (int i = 0; i < K; i++)
        JoinThread(tids[i]);

    // Find minimum of all local minima
    int globalMin = INT_MAX;
    for (int i = 0; i < K; i++)
        if (localMins[i] < globalMin)
            globalMin = localMins[i];
    printf("Min: %d\n", globalMin);
}

Time Complexity Analysis:

Sequential (1 thread):

Time = O(N) - iterate through all N elements

Parallel (2 threads on 2 cores):

Each thread: O(N/2)
Total: O(N/2) - speedup of 2x

Parallel (K threads on K cores):

Each thread: O(N/K)
Total: O(N/K) - speedup of Kx

(Ignoring aggregation time O(K), which is negligible for large N)

Critical Pattern:

This is the Fork-Join Pattern:

1. FORK: Create multiple threads
2. WORK: Each thread processes independently
3. JOIN: Wait for all threads
4. MERGE: Combine results

This pattern applies to ANY associative operation:

  • Addition (Problem 20)
  • Minimum (Problem 21)
  • Maximum (variant)
  • Multiplication (variant)
  • Logical operations (AND, OR)

The key requirement: Operation must be associative

Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
Examples:
min(min(5, 3), 9) = min(5, min(3, 9)) ✅
sum(sum(5, 3), 9) = sum(5, sum(3, 9)) ✅
max(max(5, 3), 9) = max(5, max(3, 9)) ✅

You are given an unsorted array and a “key”, and you are asked to find the first index where the key is stored, in parallel, as follows: The main thread will create 4 threads; each thread will search one quarter of the elements and terminate. The main thread will then wait for the termination of the worker threads, compute the overall search result, print it on the screen and terminate. If the key does not exist in the array, then print “Key does not exist in the array”.

Assume you have 2 functions:

  • (1) CreateThread(F, arg) creates a thread that starts running at the function “F” and takes the argument “arg” and returns the id of the created thread
  • (2) JoinThread(int threadId) is used to wait for the termination of a thread

Assume that the array A and “key” are both global and that N is divisible by 4.

#include <stdio.h>
#define N ...
int A[N]; // Array to be searched
int key; // Key to be searched
Instructor Solution
#include <stdio.h>

#define N ...

int A[N];   // Array to be searched
int key;    // Key to be searched
int localResult[4] = {N, N, N, N};   // note: {N} alone would set only
                                     // element 0 to N and zero the rest

void doFind(int tid) {
    int start = tid * N / 4;
    int end = start + N / 4;
    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            localResult[tid] = i;
            return;
        }
    }
}

void main() {
    int tids[4];
    for (int i = 0; i < 4; i++)
        tids[i] = CreateThread(doFind, i);
    int idx = N;
    for (int i = 0; i < 4; i++) {
        JoinThread(tids[i]);
        if (localResult[i] < idx)
            idx = localResult[i];
    }
    if (idx == N)
        printf("Key does not exist in the array.\n");
    else
        printf("Key exists at index: %d\n", idx);
}

Elaboration (by LLM)

Parallel Search Strategy:

Unlike summation and min-finding (which process all elements), search looks for FIRST occurrence:

Array A[0..N-1] and key=7:
[3][1][7][5][2][7][9][4]
Thread Distribution:
Thread 0: A[0..1] = [3, 1] → localResult[0] = N (not found)
Thread 1: A[2..3] = [7, 5] → localResult[1] = 2 (found at index 2!)
Thread 2: A[4..5] = [2, 7] → localResult[2] = 5 (found at index 5)
Thread 3: A[6..7] = [9, 4] → localResult[3] = N (not found)
Result Aggregation:
Overall minimum index = min(N, 2, 5, N) = 2
Output: "Key exists at index: 2" (first occurrence)

Key Design Decisions:

1. Initialization to N (not-found sentinel):

int localResult[4] = {N, N, N, N};   // every slot starts at N

(Careful: in C, = {N} initializes only element 0 to N and zero-fills the rest, and 0 is a valid index, so every slot must be set to N explicitly.)

Why N?

  • N is the array size
  • No valid index equals N (indices run 0 to N-1)
  • Represents “not found” for this thread’s range
  • Easy to check: if (idx == N) → not found

2. Early Exit on Success:

for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        return;   // Early exit! Don't search the rest of the range
    }
}

Once found:

  • Store the index
  • RETURN immediately (don’t continue searching)
  • This is efficient: once we find it, no need to search further

3. Finding the FIRST occurrence:

Since we want the first index:

int idx = N;
for (int i = 0; i < 4; i++) {
    JoinThread(tids[i]);
    if (localResult[i] < idx)
        idx = localResult[i];   // Keep the minimum index
}

Take the MINIMUM index over all threads:

  • The first occurrence of the key has the smallest index, so the minimum wins
  • A hit in an earlier partition overrides hits at larger indices
  • Threads that didn’t find the key keep localResult = N, which never wins the comparison

Example Detailed Execution:

Array: A = [10, 20, 30, 7, 40, 7, 50, 7] (N=8)
Key: 7
Initial: localResult = [8, 8, 8, 8]
Thread 0: A[0..1] = [10, 20]
i=0: A[0]=10 != 7
i=1: A[1]=20 != 7
(loop ends, localResult[0] stays 8)
localResult = [8, 8, 8, 8]
Thread 1: A[2..3] = [30, 7]
i=2: A[2]=30 != 7
i=3: A[3]=7 == 7 ✅
localResult[1] = 3
return (early exit!)
localResult = [8, 3, 8, 8]
Thread 2: A[4..5] = [40, 7]
i=4: A[4]=40 != 7
i=5: A[5]=7 == 7 ✅
localResult[2] = 5
return (early exit!)
localResult = [8, 3, 8, 8]
Thread 3: A[6..7] = [50, 7]
i=6: A[6]=50 != 7
i=7: A[7]=7 == 7 ✅
localResult[3] = 7
return (early exit!)
localResult = [8, 3, 5, 7]
Aggregation (after all joins):
idx = 8
localResult[0]=8: 8 < 8? No
localResult[1]=3: 3 < 8? Yes → idx = 3
localResult[2]=5: 5 < 3? No
localResult[3]=7: 7 < 3? No
Final: idx = 3
Output: "Key exists at index: 3"

Why No Race Conditions?

// Thread 0 writes to localResult[0]
// Thread 1 writes to localResult[1]
// etc.
// NO overlapping writes → NO race conditions
// Each thread has unique localResult index

Each thread writes to a different element of the localResult array!

Early Exit Pattern:

This is a key optimization for search:

// Flawed: keeps scanning after a hit
for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        // Keeps searching: wasted work, and later hits overwrite
        // earlier ones, so this stores the LAST occurrence in the range!
    }
}

// Correct and efficient: stop on the first hit
for (int i = start; i < end; i++) {
    if (A[i] == key) {
        localResult[tid] = i;
        return;   // Early exit
    }
}

The return statement:

  • Exits the function immediately
  • Thread terminates
  • Other threads may still be searching
  • Main thread waits for all with JoinThread

Handling “Not Found” Case:

// All threads searched their range
// No thread found the key
localResult = [N, N, N, N]   (all still the initial sentinel)

Aggregation:
    idx = N
    for (i = 0 to 3):
        localResult[i] < idx?  No (all equal N)
    idx stays N

Output:
    if (idx == N)
        printf("Key does not exist in the array.\n");

Comparison with Sequential Search:

Sequential (1 thread):
    for (int i = 0; i < N; i++)
        if (A[i] == key) return i;
    Worst case: O(N) - key at the end or not present

Parallel (4 threads):
    Best case:  the key sits near the start of some thread's range
    Worst case: O(N/4) - every thread scans its full range
    Either way, no thread examines more than N/4 elements

Generalization to K Threads:

#define K 4

int localResult[K];   // set every entry to N before creating the threads

void doFind(int tid) {
    int start = tid * N / K;
    int end = start + N / K;
    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            localResult[tid] = i;
            return;
        }
    }
}

void main() {
    int tids[K];
    for (int i = 0; i < K; i++) {
        localResult[i] = N;   // sentinel ({N} would zero elements 1..K-1)
        tids[i] = CreateThread(doFind, i);
    }
    int idx = N;
    for (int i = 0; i < K; i++) {
        JoinThread(tids[i]);
        if (localResult[i] < idx)
            idx = localResult[i];
    }
    if (idx == N)
        printf("Key does not exist in the array.\n");
    else
        printf("Key exists at index: %d\n", idx);
}

The logic scales to any number of threads!

Pattern Summary: Fork-Join-Merge

All three parallel problems (summation, min-finding, search) follow the same pattern:

1. FORK: Create K threads
2. WORK: Each processes partition independently
3. JOIN: Wait for all to complete
4. MERGE: Combine results
Applied to different operations:
├─ Sum: merge = ADD all results
├─ Min: merge = COMPARE all results
└─ Search: merge = FIND MINIMUM index

This pattern is universal for embarrassingly parallel problems!