Parallel

## Performance Models

  • Review the typical performance models covered in this lecture
  • Pay attention to the in-class examples as well as your homework; one question in Section 2 is from this lecture.
  • Karp-Flatt metric (*), the speedup formula, and isoefficiency formulas and examples
## Memory Hierarchy and Locality

  • Memory hierarchies
  • Spatial and temporal locality
  • False sharing/true sharing: causes and how to fix them

## Processes

  • Concept and usage of fork():
  • Who is the child process? How is fork() used?
  • What are zombie processes? What situations cause them, and what are the solutions (e.g., waitpid(), wait(), or terminating the parent process)? Pay attention to the exercises on this in the lecture slides (see the sketch at the end of this section).
  • Signal communication between processes: pending signals of the same type are not queued
  • What are the differences between processes and threads?
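A minimal sketch of fork() and zombie reaping, assuming a POSIX system; the parent collects the child's exit status with waitpid(), so the child does not linger as a zombie:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* child sees 0; parent sees child's PID */
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {                  /* child process */
        printf("child: pid=%d\n", (int)getpid());
        exit(0);                     /* child is a zombie until reaped */
    }
    int status;                      /* parent: reap the child */
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("parent: child exited with %d\n", WEXITSTATUS(status));
    return 0;
}
```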
## Threads

  • Threads vs Processes: mechanisms and sharing
  • Pthreads basics
  • Variable spaces (shared? private? etc.)
  • Unintended sharing
  • Thread detached mode
  • Understanding race conditions and data races, what may cause them, and potential solutions (e.g., the data race example in the slides; see the sketch after this list)
  • Enforcing mutual exclusion: semaphores, mutex locks, condition variables, and atomic operations (e.g., spinlocks)
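A minimal sketch of the data race discussed above, in the spirit of the cnt program from the slides (thread and iteration counts are illustrative). Without the lock, cnt++ compiles to a non-atomic load-add-store, and the final count is usually wrong:

```c
#include <pthread.h>
#include <stdio.h>

#define NITERS 1000000

long cnt = 0;                               /* shared by both threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (long i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);          /* remove this pair to see the race */
        cnt++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("cnt = %ld (expected %d)\n", cnt, 2 * NITERS);
    return 0;
}
```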
## Synchronization

  • Semaphores, mutexes, condition variables, and atomic operations: coding, mechanisms, and when to use what.
  • Questions on how to use semaphores: P (sem_wait) and V (sem_post) operations, binary semaphores, producer-consumer questions using semaphores (see the sketch at the end of this section), and deadlocks caused by semaphores.
  • Questions on deadlocks: what causes deadlock? Examples that cause deadlocks; the four necessary conditions for deadlock.
  • Questions on condition variables: they always go together with a mutex; condition variable usage examples and the mechanism (how threads wake up and then reacquire the lock).
  • Questions on atomic operations: what are they? Why are they fast? When to use atomic operations versus mutexes/semaphores/condition variables?
  • Understanding barriers
  • Understanding spinlocks and the conditions for using them (similar to atomic operations); how to use CAS (compare-and-swap) to implement spinlocks
  • The cnt program
  • Quiz 2 review and other synchronization questions
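A minimal producer-consumer sketch with a one-slot buffer, assuming Linux-style unnamed POSIX semaphores (sem_init is deprecated on macOS); empty counts free slots and full counts filled slots, so the P/V pairs enforce the ordering:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

int buf;                         /* one-slot buffer */
sem_t empty, full;               /* counting semaphores for free/filled slots */

void *producer(void *arg) {
    for (int i = 0; i < 5; i++) {
        sem_wait(&empty);        /* P: wait for a free slot */
        buf = i;                 /* produce an item */
        sem_post(&full);         /* V: announce a filled slot */
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 5; i++) {
        sem_wait(&full);         /* P: wait for a filled slot */
        printf("consumed %d\n", buf);
        sem_post(&empty);        /* V: free the slot */
    }
    return NULL;
}

int main(void) {
    sem_init(&empty, 0, 1);      /* one free slot initially */
    sem_init(&full, 0, 0);       /* no filled slots initially */
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```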
## OpenMP

  • Basic OpenMP directives, runtime APIs, and environment variables (see the sketch after this list)
  • OpenMP example with false sharing (see the false-sharing sketch in the Locality notes below)
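A minimal sketch covering the three pieces named above: a directive, a runtime API call, and an environment variable (the loop bound and arithmetic are illustrative). Compile with -fopenmp:

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* Directive: split the loop across the team; reduction avoids a race on sum.
       The team size comes from OMP_NUM_THREADS (env var) or omp_set_num_threads(). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    /* Runtime API: query how many threads are available */
    printf("max threads = %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}
```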
## MPI

  • MPI blocking send and receive (see the sketch below)
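A minimal sketch of blocking point-to-point MPI (the message contents are illustrative); run it with at least two ranks, e.g. mpirun -np 2:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int msg = 42;
        /* Blocking send: may wait until the message is buffered or received */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        /* Blocking receive: returns only after a matching message arrives */
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```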

  • Strong scaling concerns the speedup for a fixed problem size with respect to the number of processors, and is governed by Amdahl’s law.
  • Weak scaling concerns the speedup for a scaled problem size with respect to the number of processors, and is governed by Gustafson’s law.

Scaled speedup (Gustafson's law): ψ = p + (1 - p)s, where p is the number of processors and s is the serial fraction.
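Worked example (illustrative numbers): with p = 16 processors and serial fraction s = 0.1, the scaled speedup is 16 + (1 - 16)(0.1) = 14.5.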

Sources of parallel overhead:

  1. Process startup time
  2. Process synchronization time
  3. Imbalanced workload
  4. Architectural overhead

Karp-Flatt metric (the experimentally determined serial fraction): e = (1/ψ - 1/p)/(1 - 1/p), where ψ is the speedup achieved on p processors.

If e stays roughly constant as p grows, a large serial fraction is the primary reason for limited speedup; if e increases steadily with p, parallel overhead is the primary reason.
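Worked example (illustrative numbers): if a program achieves speedup ψ = 4 on p = 8 processors, then e = (1/4 - 1/8)/(1 - 1/8) = 0.125/0.875 ≈ 0.14; measuring e again at larger p shows whether it is constant (serial fraction) or growing (parallel overhead).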

C programming

Big issues for long-running data center applications, scientific simulations, daemons (background processes), and servers.

Locality

Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

  • Temporal locality: Recently referenced items are likely to be referenced again in the near future
  • Spatial locality: Items with nearby addresses tend to be referenced close together in time (see the loop-order sketch below)
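A minimal sketch of how loop order affects spatial locality for a row-major C array (the array size is illustrative):

```c
#include <stdio.h>

#define N 1024
static double a[N][N];           /* row-major: a[i][0..N-1] are contiguous */

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: stride-1 accesses use every byte of each
       cache line before moving on. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: stride-N accesses touch one double per
       cache line, so most of each fetched line is wasted. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```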

  • Memory Hierarchies
    • The “memory wall” is the growing disparity of speed between CPU and memory outside the CPU chip.
    • CPU speed improved at an annual rate of 55% while memory speed only improved at 10%.
    • CPU speed improvements slowed significantly partly due to major physical barriers and partly because current CPU designs have already hit the memory wall in some sense.
  • True sharing: threads genuinely access the same data, which requires programmatic synchronization constructs to ensure ordered data access.
  • False sharing: the frequent coordination required between processors when cache lines are marked 'Invalid' forces cache lines to be written to memory and subsequently reloaded. False sharing increases this coordination and can significantly degrade application performance (see the sketch after this list).
  • Week 4: 16 floating-point registers for MxM.
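A minimal sketch of false sharing with OpenMP, along the lines of the "OpenMP example with false sharing" item above (thread and iteration counts are illustrative); the per-thread counters sit on the same cache line, so every increment invalidates the other threads' copies:

```c
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4

int main(void) {
    long hits[NTHREADS] = {0};   /* adjacent elements share a cache line */

    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            hits[tid]++;         /* each write dirties the shared line */
    }

    /* Common fixes: pad each counter to its own cache line, or accumulate
       into a private local variable and write it back once at the end. */
    long total = 0;
    for (int t = 0; t < NTHREADS; t++)
        total += hits[t];
    printf("total = %ld\n", total);
    return 0;
}
```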

  • Processes and Threads
    • process = memory state + machine state
  • How do we communicate with processes?
    • Receiving a signal
    • A destination process receives a signal when it is forced by the kernel to react in some way.
    • The process can either ignore the signal, terminate, or catch the signal by executing user-level functions called signal handlers (see the sketch after this list).
  • Process = context + code, data, and stack
  • 20K cycles to create and reap a process
  • 10K cycles (or less) to create and reap a thread
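A minimal sketch of catching a signal with a handler, assuming a POSIX system; recall from above that pending signals of one type are not queued, so several rapid SIGINTs may be delivered as one:

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

volatile sig_atomic_t got_signal = 0;

/* Runs when the kernel delivers SIGINT; do only async-signal-safe work here */
void handler(int sig) {
    got_signal = 1;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;         /* catch instead of terminating */
    sigaction(SIGINT, &sa, NULL);

    printf("press Ctrl-C...\n");
    while (!got_signal)
        pause();                     /* sleep until any signal arrives */
    printf("caught SIGINT\n");
    return 0;
}
```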

  • OpenMP does not parallelize dependencies
    • Often does not detect dependencies
    • Nasty race conditions still exist!
  • Across processes (e.g., after fork()): global variables (address spaces) are not shared
  • File tables (file descriptors) are shared across fork()
  • Program context:
    • Data registers
    • Condition codes
    • Stack pointer (SP)
    • Program counter (PC)
  • Kernel context:
    • VM structures
    • Descriptor table
  • Semaphore: non-negative global integer synchronization variable.
    • Unlike a mutex (a locking mechanism), it is a signaling mechanism; it guarantees both ordering and atomicity.
    • Unlike a mutex, any thread can release a semaphore, not only the one that has acquired it in the first place.
  • Deadlock is when two or more tasks never make progress because each is waiting for some resource held by another process/thread.

  • A condition variable allows a thread to block itself until a specified condition becomes true.
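A minimal sketch of that mechanism: pthread_cond_wait() atomically releases the mutex and blocks, and a woken thread reacquires the mutex before returning; the while loop re-checks the condition to guard against spurious wakeups (names are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;                        /* the condition, protected by m */

void *waiter(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)                    /* re-check after every wakeup */
        pthread_cond_wait(&cond, &m); /* releases m, sleeps, reacquires m */
    printf("condition became true\n");
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);

    pthread_mutex_lock(&m);
    ready = 1;                        /* make the condition true... */
    pthread_cond_signal(&cond);       /* ...then wake one waiting thread */
    pthread_mutex_unlock(&m);

    pthread_join(t, NULL);
    return 0;
}
```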

  • Atomic operations (see the CAS spinlock sketch below)
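A minimal sketch of a spinlock built from compare-and-swap, using C11 <stdatomic.h> as a stand-in for the atomic operations covered in lecture; because a waiting thread burns CPU while spinning, this only pays off for very short critical sections:

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_int lk = 0;                   /* 0 = free, 1 = held */

/* Acquire: atomically flip 0 -> 1; if the CAS fails, reset and retry */
void spin_lock(atomic_int *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(l, &expected, 1))
        expected = 0;                /* CAS wrote the current value here */
}

void spin_unlock(atomic_int *l) {
    atomic_store(l, 0);              /* release: mark the lock free */
}

int main(void) {
    spin_lock(&lk);
    puts("in critical section");
    spin_unlock(&lk);
    return 0;
}
```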
Written on December 1, 2020