Geeks With Blogs
Josh Reuben

HPC Job Types

HPC Server has 3 types of jobs:

· Task Flow: a plain sequence of tasks

· Parametric Sweep: concurrently run multiple instances of the same program, each with a different work unit input

· MPI: message passing between master & slave tasks
But when you try to go outside the box, with job tasks that spawn jobs and block the parent task while they wait, you run the risk of resource starvation, deadlocks, and recursive, non-converging or exponential blow-up.
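To get a feel for how quickly this goes wrong, here is a rough back-of-the-envelope model in Python. It is not HPC Server code: `total_jobs` and `parents_can_deadlock` are hypothetical helpers, and the sketch assumes every job spawns the same number of children and every blocked parent task holds exactly one core while it waits.

```python
def total_jobs(branching_factor: int, depth: int) -> int:
    """Count jobs in a spawn tree where every job spawns `branching_factor`
    children, down to `depth` levels below the root job."""
    if branching_factor < 0 or depth < 0:
        raise ValueError("branching_factor and depth must be non-negative")
    # Level k holds branching_factor ** k jobs; sum levels 0..depth.
    return sum(branching_factor ** level for level in range(depth + 1))


def parents_can_deadlock(cores: int, blocked_parents: int) -> bool:
    """Assumption: each blocked parent holds one core while waiting.
    If blocked parents hold every core, no child can ever start."""
    return blocked_parents >= cores


if __name__ == "__main__":
    # Job count grows exponentially with spawn depth.
    for depth in range(1, 6):
        print(depth, total_jobs(4, depth))
    # 8 cores, 8 parents all waiting on children: nothing can make progress.
    print(parents_can_deadlock(cores=8, blocked_parents=8))
```

Even a modest branching factor of 4 reaches hundreds of jobs within a few levels, which is why the rest of this post is about keeping spawning under control.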

The solution is to write some performance monitoring and job scheduling code. You can do this in 2 ways:

  1. Manually: control scheduling yourself - allocate / de-allocate resources, change job priorities, pause & resume tasks, restrict long-running tasks to specific compute clusters
  2. Semi-automatically: set threshold params for scheduling

How – Control Job Scheduling

In order to manage the tasks and resources that are associated with a job, you will need to access the ISchedulerJob interface.

This lets you control how a job is run. You can access & tweak the following features:

  • Max / min resource values
  • Whether job resources can grow / shrink, whether jobs can be pre-empted, and whether the job is exclusive per node
  • The creator process id & the job pool
  • Timestamp of job creation & completion
  • Job priority, hold time & run time limit
  • Re-queue count
  • Job progress
  • Max / min number of cores, nodes, sockets, RAM
  • Dynamic task list: can add / cancel tasks on the fly
  • Job counters
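To make the grow / shrink and max / min settings concrete, here is a minimal sketch of how they interact. It is a plain-Python stand-in, not the HPC API: `JobLimits`, `next_allocation` and all the field names are hypothetical and only loosely mirror the kind of settings listed above.

```python
from dataclasses import dataclass


@dataclass
class JobLimits:
    """Hypothetical stand-in for a few job settings.
    Field names are illustrative, not the real ISchedulerJob members."""
    min_cores: int
    max_cores: int
    can_grow: bool = True
    can_shrink: bool = True

    def __post_init__(self) -> None:
        if not 1 <= self.min_cores <= self.max_cores:
            raise ValueError("need 1 <= min_cores <= max_cores")


def next_allocation(limits: JobLimits, current: int, requested: int) -> int:
    """Return the core count the job may move to, honouring the limits."""
    if requested > current and not limits.can_grow:
        return current  # growing is disabled: stay where we are
    if requested < current and not limits.can_shrink:
        return current  # shrinking is disabled: stay where we are
    # Otherwise clamp the request into the [min_cores, max_cores] range.
    return max(limits.min_cores, min(limits.max_cores, requested))


if __name__ == "__main__":
    limits = JobLimits(min_cores=2, max_cores=16)
    print(next_allocation(limits, current=4, requested=32))
    print(next_allocation(limits, current=4, requested=1))
```

The point of the design is that a scheduler decision is always a request, and the job's own limits get the final say.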

When – poll perf counters

Tweaking the job scheduler should be done on the basis of resource utilization according to PerfMon counters. HPC exposes 2 Perf objects: Compute Clusters and Compute Nodes.

You can monitor running jobs against dynamic thresholds on the following counters; use your own discretion when picking the values:

  • Percentage processor time
  • Number of running jobs
  • Number of running tasks
  • Total number of processors
  • Number of processors in use
  • Number of processors idle
  • Number of serial tasks
  • Number of parallel tasks
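A semi-automatic policy boils down to comparing polled readings against thresholds. The sketch below shows one possible shape for that decision. It assumes the readings have already been polled by some other code; `ClusterSample`, `decide` and the 85% / 30% thresholds are all hypothetical and should be tuned for your own cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSample:
    """One polled reading; fields mirror some of the counters listed above."""
    processor_time_pct: float
    processors_total: int
    processors_in_use: int
    running_tasks: int


def decide(sample: ClusterSample,
           high_pct: float = 85.0,
           low_pct: float = 30.0) -> str:
    """Map a reading to a coarse action: 'throttle', 'grow' or 'steady'."""
    if sample.processors_in_use > sample.processors_total:
        raise ValueError("processors_in_use exceeds processors_total")
    idle = sample.processors_total - sample.processors_in_use
    # Saturated cluster: stop spawning new work until load drops.
    if sample.processor_time_pct >= high_pct or idle == 0:
        return "throttle"
    # Plenty of headroom: safe to let jobs grow or spawn children.
    if sample.processor_time_pct <= low_pct:
        return "grow"
    return "steady"


if __name__ == "__main__":
    print(decide(ClusterSample(95.0, 16, 16, 40)))
    print(decide(ClusterSample(10.0, 16, 4, 4)))
    print(decide(ClusterSample(50.0, 16, 8, 8)))
```

Keeping a gap between the high and low thresholds stops the policy from flip-flopping between actions on every poll.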

Design your algorithms correctly

Finally, don't assume you have unlimited compute resources in your cluster. Design your algorithms with the following factors in mind:

· Branching factor: dynamically optimize the number of children per node

· Cutoffs to prevent explosions: not all functions converge after n attempts. You also need a "good enough" threshold for diminishing returns

· Heuristic shortcuts: sometimes an exhaustive search is impractical and shortcuts are suitable

· Pruning: remove / de-prioritize unnecessary tree branches

· Avoid local minima / maxima: sometimes an algorithm can't converge on the global optimum because it gets stuck in a local optimum or saddle point. Try simulated annealing, random-restart hill climbing or genetic algorithms to get out of these ruts

· Watch out for rounding errors: multiple iterations in parallel can quickly amplify them & blow up your algo! Compare against an epsilon, and limit floating point errors, truncations and approximations
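Two of these points, cutoffs and rounding errors, are easy to show in a few lines. This is a minimal sketch, not a recipe: `iterate_with_cutoff` and `newton_sqrt_step` are hypothetical helpers, and the epsilon and iteration cap are arbitrary values picked for illustration.

```python
import math


def iterate_with_cutoff(step, x0, epsilon=1e-9, max_iterations=100):
    """Apply `step` repeatedly until the change drops below `epsilon`
    (good enough) or `max_iterations` is reached (cutoff).
    Returns (value, iterations_used, converged)."""
    x = x0
    for i in range(1, max_iterations + 1):
        nxt = step(x)
        if abs(nxt - x) < epsilon:
            return nxt, i, True
        x = nxt
    return x, max_iterations, False


def newton_sqrt_step(target):
    """Newton's update for sqrt(target); used here only as a test function."""
    return lambda x: 0.5 * (x + target / x)


if __name__ == "__main__":
    # Converging case: stops as soon as the improvement is below epsilon.
    value, used, ok = iterate_with_cutoff(newton_sqrt_step(2.0), x0=1.0)
    print(value, used, ok)

    # Non-converging case: the cutoff prevents an endless loop.
    _, used, ok = iterate_with_cutoff(lambda x: -x, x0=1.0, max_iterations=50)
    print(used, ok)

    # Rounding: naive summation drifts, math.fsum tracks the lost bits.
    parts = [0.1] * 10
    print(sum(parts) == 1.0, math.fsum(parts) == 1.0)
    # Compare with a tolerance instead of exact equality.
    print(math.isclose(sum(parts), 1.0, rel_tol=1e-9))
```

Returning the `converged` flag alongside the value lets the calling task decide whether a cut-off result is still usable or whether the branch should be pruned.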

Happy coding!

Posted on Wednesday, October 10, 2012 2:34 PM | Parallelism
