
IBM eServer pSeries
How to control resource allocation on a pSeries multi-MCM system
Pascal Vezolle
Deep Computing EMEA
ATS-P.S.S.C/ Montpellier FRANCE
SCICOMP9 - CINECA 2004
© 2004 IBM Corporation
Agenda
ƒ AIX Resource Management Tools
– WorkLoad Manager (WLM)
– Affinity Services
– Bind command and Resource Set
– Memory Affinity
ƒ Impact of the number of MCMs on parallel job performance
ƒ Versatile System Resource Allocation and Control
• PSSC tool integrating AIX resource capabilities in an HPC environment
Customer resource control requirements versus AIX capabilities
ƒ Dedicated physical resources for applications or user groups
– Solution 1: Partitioning
– Solution 2: WorkLoad Manager (WLM) based on Resource Set
ƒ Control resource allocation on a standalone system
– Control resource consumption (CPU, memory, I/O)
• Solution: WLM classes (Consumable features in LoadLeveler)
– Control the ratio of interactive to batch jobs
• Solution: WLM & LoadLeveler
– Control job priority: preemption (i.e. free resources to run high-priority jobs)
• Solution: WLM class tiers & shares, or the LoadLeveler Gang scheduler (2005 with the backfill scheduler)
ƒ Optimize performance for HPC applications
– Memory affinity
– Binding and Resource Set attachment to guarantee MCM Affinity
Dedicated physical resources
ƒ What Tools are Available?
– Partitioning
• Subdivides a larger machine into several smaller servers
• Allows multiple OS instances to peacefully coexist
• Static - AIX 5.1, Dynamic - AIX 5.2
• Resources include: CPU, memory, I/O adapter (granularity: 1 CPU, 256 MB memory, 1 PCI slot)
• Drawbacks in HPC: no Memory Affinity (except for Affinity LPAR), no shared I/O and network adapters, limited SMP capabilities for multithreaded applications
– AIX Workload Manager
• Controls access to the resources of a single AIX instance
• Allows multiple workloads to peacefully coexist on one AIX image
• Limits the CPU, memory and disk I/O bandwidth consumption
• Granularity: percentage of CPU time, physical memory, and disk I/O bandwidth
Elements of WLM
[Diagram: elements of WLM: classes, classification rules (users, groups, applications, types, tags), tiers, priority, preemption, shares, limits and RSET, all serving resource control (consumption and location)]
WLM shares and limits
ƒ Very useful to control interactive usage while ensuring batch resources (shares) and job priority
– Example on a 32-processor p690
• Requirements (Slovakian, Tunisian and Hungarian met):
– guarantee 8 CPUs for interactive and 24 CPUs for batch
– 1 ultra-priority batch class for urgent jobs
One WLM solution: 3 WLM classes (default, batchUrgent and batch) with shares and limits
Class Name     Shares
batchUrgent    1000
batch            75
default          25
(batch + default = 100 shares, i.e. 100%)
Hard CPU limit = 75% for the batchUrgent class (leaves 8 CPUs free for interactive work)
No urgent job: active shares = 100; 75 (24 CPUs) for batch and 25 (8 CPUs) for interactive
Urgent batch jobs running: active shares = 1100; 1000 parts go to the priority job, which is limited to 24 CPUs
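The share arithmetic of this example can be checked with a short script. This is only a sketch under a simplified model, assumed here for illustration: an active class is entitled to its shares divided by the sum of active shares, capped by its hard limit. The real WLM adjusts priorities dynamically, and the function name cpu_target is hypothetical.

```shell
#!/bin/sh
# Simplified model (an assumption, not WLM's actual algorithm):
# entitlement = ncpu * class_shares / sum_of_active_shares, capped by the hard limit.

# cpu_target NCPU SHARES TOTAL_ACTIVE_SHARES [HARD_LIMIT_PERCENT]
cpu_target() {
    ncpu=$1; shares=$2; total=$3; limit=${4:-100}
    target=$(( ncpu * shares / total ))   # proportional entitlement, in whole CPUs
    cap=$(( ncpu * limit / 100 ))         # hard limit expressed in CPUs
    if [ "$target" -gt "$cap" ]; then target=$cap; fi
    echo "$target"
}

# No urgent job: active shares = 75 + 25 = 100
echo "batch:       $(cpu_target 32 75 100) CPUs"
echo "default:     $(cpu_target 32 25 100) CPUs"
# Urgent job running: active shares = 1000 + 75 + 25 = 1100, hard limit 75%
echo "batchUrgent: $(cpu_target 32 1000 1100 75) CPUs"
```

Under this model the 1000 shares alone would entitle batchUrgent to 29 of the 32 CPUs; it is the 75% hard limit that holds it to 24 and keeps 8 CPUs available.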
AIX Affinity Services
ƒ Processor Affinity
– 2 programming models for CPU binding
1) Attach a process to a specific CPU ID (with the bindprocessor command or API; root and non-root users)
2) Attach a process to a Resource Set (with RSET APIs and commands)
ƒ Memory Affinity
– AIX tries to allocate the memory on the same MCM that contains the CPU
1) bindprocessor Process [ ProcessorNum ] | -q | -u Process
• -q Displays the processors which are available.
• -u Unbinds the threads of the specified process.
To bind process pid=999 to processor 1:
bindprocessor 999 1
To display the processor number to which the process is bound:
ps -o bnd -p 999
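When several processes have to be bound, the bindprocessor calls can be generated rather than typed. The sketch below only prints the commands (a dry run) so they can be inspected first; the function name bind_plan and the PIDs are hypothetical, and assigning CPUs in simple rotation is an assumption made for illustration.

```shell
#!/bin/sh
# bind_plan NCPU PID...
# Prints one bindprocessor command per PID, assigning logical CPUs 0..NCPU-1 in turn.
# Nothing is executed: the output is a plan to review.
bind_plan() {
    ncpu=$1; shift
    cpu=0
    for pid in "$@"; do
        echo "bindprocessor $pid $cpu"
        cpu=$(( (cpu + 1) % ncpu ))   # wrap around after the last CPU
    done
}

bind_plan 8 1001 1002 1003
```

On an AIX system the printed lines could then be piped to sh once they look right.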
Resource set
ƒ A resource set (rset) is a structure describing a set of physical resources:
– CPU (CPUs are identified by a CPU ID created at boot time)
– Memory pool (current AIX supports only one pool)
ƒ Attaching a process to an rset restricts the process to the physical resources contained in the rset (no thread-level feature in AIX 5.2B)
ƒ Available with partitioning
– System rset ‘sys/sys’ contains the available CPU and memory pool
– How to display the current rset configuration: lsrset -a -v
ƒ AIX 5.2: dynamic management capabilities for root and non-root users with APIs and AIX commands
ƒ 2 types of rset: partition rset and effective rset
• partition rset: restricted to user root + only one rset per process
• effective rset: can be used by a non-root user with the CAP_NUMA_ATTACH capability:
> chuser capabilities=CAP_NUMA_ATTACH,CAP_PROPAGATE username
> or add it in the /etc/security/user file
AIX 5.2 Resource Set commands
http://publib16.boulder.ibm.com/pseries
//publib16.boulder.ibm.com/cgi-bin
ƒ Create a rset:
• mkrset -c CPUlist [ -m MEMlist ] rsetname
(create a rset for MCM0: mkrset -c 0-7 test/mcm0)
ƒ Remove a rset:
• rmrset rsetname
ƒ Display information about rset:
• lsrset [ -f ] [ -v | -o ] [ -r rsetname | -n namespace | -a ]
ƒ Attach (detach) a process to a rset:
• attachrset [ -P ] [ -F ] rsetname pid
• or attachrset [ -P ] [ -F ] [ -c CPUlist ] [ -m MEMlist ] pid
• detachrset [ -P ] pid
ƒ Execute a command in a rset:
• execrset [ -P ] [ -F ] -c CPUlist [ -m MEMlist ] -e command [ parameters ]
• or execrset [ -P ] [ -F ] rsetname [ -e ] command [ parameters ]
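Extending the MCM0 example to every MCM of a system is mechanical. The sketch below prints one mkrset command per MCM without running it. It assumes CPU IDs are numbered contiguously per MCM (0-7 on MCM0, 8-15 on MCM1, and so on), which should be checked with lsrset on the target machine. The function name mcm_rset_cmds is hypothetical, and the test/mcmN names simply follow the example above.

```shell
#!/bin/sh
# mcm_rset_cmds N_MCM CPUS_PER_MCM
# Prints the mkrset command for each MCM, assuming contiguous CPU numbering per MCM.
# Commands are printed only (dry run), not executed.
mcm_rset_cmds() {
    n_mcm=$1; per_mcm=$2
    m=0
    while [ "$m" -lt "$n_mcm" ]; do
        first=$(( m * per_mcm ))
        last=$(( first + per_mcm - 1 ))
        echo "mkrset -c ${first}-${last} test/mcm${m}"
        m=$(( m + 1 ))
    done
}

# A 32-way system with 4 MCMs of 8 CPUs each
mcm_rset_cmds 4 8
```

Once such rsets exist, a job could be confined to one MCM with the execrset form shown above, for instance execrset test/mcm0 -e followed by the program name.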
Memory Affinity
ƒ On a pSeries MCM system the memory is attached to the MCMs
ƒ Local memory access is faster
ƒ The goal is to improve the performance of HPC applications by backing the data in memory that is attached to the MCM containing the CPU
ƒ If memory affinity is enabled, AIX manages memory as a set of pools (one per MCM or SCM)
– Pools can be monitored by the following commands
– kdb
> mempool *
> vmpool
> free
How to set Memory Affinity
ƒ In AIX 5.1:
• vmtune -y 1 or 0 (default), (+ bosboot -a, reboot)
• Global setting for all processes: on or off
ƒ In AIX 5.2:
• vmo -o memory_affinity=1 or 0 (default), (+ bosboot -a, reboot)
• + the MEMORY_AFFINITY environment variable provides memory affinity for selected processes (also available with AIX 5.1G)
– Two valid settings for MEMORY_AFFINITY
> MEMORY_AFFINITY=MCM
memory allocation is local per MCM and paging is global
> MEMORY_AFFINITY=MCM@LRU=EARLY
both memory allocation and paging are local per MCM
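Because the environment variable acts per process, it can be set for a single command instead of being exported for the whole session. The sketch below shows one way to do that; the wrapper name with_mcm_affinity is hypothetical, and the demonstration uses a portable command in place of a real HPC application.

```shell
#!/bin/sh
# with_mcm_affinity COMMAND [ARGS...]
# Runs one command with MEMORY_AFFINITY=MCM set only in that command's environment,
# leaving the calling shell's environment untouched.
with_mcm_affinity() {
    env MEMORY_AFFINITY=MCM "$@"
}

# Demonstration: the child process sees the variable.
with_mcm_affinity sh -c 'echo "MEMORY_AFFINITY=$MEMORY_AFFINITY"'
```

The variable only has an effect on a system where memory affinity has been enabled globally as described above.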
No Memory Affinity 0: vmo -o memory_affinity=0 (AIX 5.2)
Memory allocation is random, by 4 KB page (or 16 MB large page)
[Figure: pages of Job 1, Job 2 and Job 3 in the memory of four MCMs]
Memory Affinity 1: vmo -o memory_affinity=1 (AIX 5.2)
MEMORY_AFFINITY not set
⇓
4 KB pages (or 16 MB large pages) are allocated in round-robin fashion across MCMs
[Figure: memory of four MCMs]
Memory Affinity 2: vmo -o memory_affinity=1 (AIX 5.2)
MEMORY_AFFINITY=MCM
0) process pages are assigned locally, in the pool of the MCM containing the CPU
1) if the process is rescheduled, memory is not moved
2) new process pages are assigned locally
[Figure: memory of four MCMs]
Default AIX 5.1: vmtune -y 1
New in 5.1G: vmtune -y 2
Memory Affinity 3: vmo -o memory_affinity=1 (AIX 5.2)
MEMORY_AFFINITY=MCM
Not enough memory in the local pool: memory is taken from any pool
[Figure: memory used in the pools of four MCMs]
Memory Affinity 4: vmo -o memory_affinity=1 (AIX 5.2, AIX 5.1G)
MEMORY_AFFINITY=MCM@LRU=EARLY (AIX 5.2B or AIX 5.1G)
Not enough memory in the local pool: pages are replaced from the local pool (page in/out to disk) as soon as the local pool has reached its low threshold (minfree)
(set maxperm to a low value)
[Figure: memory used in the pools of four MCMs, with paging to disk]
Impact of Memory Affinity versus the number of MCMs
Elapsed time fluctuations for parallel jobs versus the number of MCMs involved
On an idle system, the AIX scheduler spreads processes or threads across MCMs
[Figure: differences between 1, 2 and 4 MCM runs on a p690 32-way 1.3 GHz. Left axis: elapsed time (sec), 0 to 700. Right axis: difference ratio, -100% to 200%. Series: 1 MCM, 2 MCM, 4 MCM, % 4/2MCM, % 4/1MCM. Codes (MPI and Pthreads): DYNAMO 8, DYNAMO 16, NTMIX 16, NSMD 16, CPMD 16, NSI 8, SDDNS 8, U1 8, U1 16. Ratio labels shown range from 0.0% to 155.0%.]
VSRAC: a tool integrating AIX resource capabilities in an HPC environment
ƒ AIX does not provide an automatic tool to guarantee MCM Affinity
ƒ In production, because of process rescheduling, memory bandwidth can be limited by the inter-MCM bandwidth (loss of Memory Affinity)
ƒ AIX Affinity services (WLM, binding, rset, Memory Affinity) are only partially used by the HPC environment (LoadLeveler, Parallel Environment)
ƒ …
ƒ PSSC Solution: VSRAC interface
– VSRAC allows multi-MCM resource allocation control in a single interface including
• AIX resource capabilities: WLM, binding, rset, Memory Affinity
• Standard resource allocation policies (process placement)
• Internal workload management
• Interactive commands
• Interfaces with LoadLeveler and Parallel Environment
– With environment variables, users or administrators can apply a resource allocation policy
– Available at http://tonga.mop.ibm.com/vsrac
[Diagram: VSRAC in the software stack of an MCM system]
Cluster level: user jobs (batch jobs, interactive jobs); batch software (LoadLeveler, PBS); Parallel Environment (POE, MPICH, …); vsrac commands
Resource control interface (VSRAC): system resource affinity control
OS level: UNIX operating system; system resource allocation capabilities provided by the OS (Workload Manager, resource sets, CPU binding, …); manual resource allocation (WLM classes, rset, binding); no global allocation; memory affinity; the default UNIX dispatcher balances processes across MCMs
Hardware level: MCM system hardware architecture (MCMs, each with CPUs and memory, linked by the MCM interconnect)
VSRAC resource allocation policies (POLICY_RAC)
ƒ 3 types of policies
– RSET policies: jobs are attached to rsets defined on MCMs (MCM Affinity)
• rset_mcm: processes are placed sequentially, minimizing the number of MCMs used => reduces execution time fluctuations
• rset_mcm_r: processes are placed using a round-robin allocation scheme (MPI jobs: task 1 on MCM 0, task 2 on MCM 2, …)
– WLM policies: jobs are assigned to WLM classes (not compatible with LoadLeveler ConsumableCpus)
• ll_wlm: assigns LoadLeveler classes to WLM classes
• ll_wlm_rset or ll_wlm_rset_r: ll_wlm + task placement on MCMs
– Bind policies: processes are bound to CPUs
• bind_pr: sequentially, by CPU number
• bind_pr_r: round robin, by MCM number
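The difference between sequential and round-robin placement can be illustrated with a small model. This is not VSRAC code: the function names are hypothetical, tasks and MCMs are numbered from 0, and the round-robin variant visits MCMs in plain numerical order, whereas the order VSRAC actually uses may differ.

```shell
#!/bin/sh
# Both functions print one "task N -> MCM M" line per task.

# Sequential: fill one MCM completely before moving to the next
# (in the spirit of rset_mcm and bind_pr).
place_sequential() {
    ntasks=$1; per_mcm=$2
    t=0
    while [ "$t" -lt "$ntasks" ]; do
        echo "task $t -> MCM $(( t / per_mcm ))"
        t=$(( t + 1 ))
    done
}

# Round robin: deal tasks out across the MCMs in turn
# (in the spirit of rset_mcm_r and bind_pr_r).
place_round_robin() {
    ntasks=$1; n_mcm=$2
    t=0
    while [ "$t" -lt "$ntasks" ]; do
        echo "task $t -> MCM $(( t % n_mcm ))"
        t=$(( t + 1 ))
    done
}

place_sequential 8 8     # 8 tasks, 8 CPUs per MCM
place_round_robin 8 4    # 8 tasks, 4 MCMs
```

With 8 tasks on a 4-MCM system of 8 CPUs each, the sequential model keeps every task on MCM 0, while round robin puts two tasks on each MCM.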
VSRAC Environment variables
ƒ MCM_AFFINITY [on|off]: activates VSRAC tools
ƒ POLICY_RAC: sets the resource allocation policy
ƒ JOBTYPE_RAC: specifies the job type (serial, mpi, OpenMP, …)
ƒ THREADS_TASK_RAC: number of threads per process for multithreaded jobs
ƒ WORKLOAD_RAC [on|off]: activates VSRAC internal workload management
ƒ TARGET_RAC: specifies a list of MCMs or CPUs to limit the job to these physical resources only
ƒ MP_PMDSUFFIX=vsrac: ppe variable that selects the vsrac interface for MPI jobs.
VSRAC usage
ƒ Interactive jobs can be managed by VSRAC under the following conditions:
– Serial, OpenMP or multithreaded jobs must be started with the vsrac driver command
> vsrac program_name argument1 argument2 …
– MPI jobs must be started with the mpp command. mpp is a wrapper script around the poe command that accepts all poe options.
ƒ LoadLeveler jobs
– LoadLeveler configuration
• the system administrator must add a user job prolog and epilog to the LoadL_config file.
– JOB_USER_PROLOG = /opt/vsrac/bin/prolog_vsrac.sh
– JOB_USER_EPILOG = /opt/vsrac/bin/epilog_vsrac.sh
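For an interactive serial job, the VSRAC environment variables and the vsrac driver might be combined as sketched below. The variable names are those listed in this presentation; the values chosen, in particular the spelling "serial" for JOBTYPE_RAC, and the program ./my_app are assumptions made for illustration.

```shell
#!/bin/sh
# Variable names are taken from the presentation; the values are assumptions.
export MCM_AFFINITY=on        # activate the VSRAC tools
export POLICY_RAC=rset_mcm    # sequential placement on as few MCMs as possible
export JOBTYPE_RAC=serial     # assumed spelling of the job type value
export WORKLOAD_RAC=off       # VSRAC internal workload management not used

# Start the job through the vsrac driver if it is installed.
# ./my_app is a hypothetical program name.
if command -v vsrac >/dev/null 2>&1; then
    vsrac ./my_app
else
    echo "vsrac not found; would run: vsrac ./my_app"
fi
```

The guard around the launch only keeps the script harmless on a machine where VSRAC is not installed.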
MCM AFFINITY benefits
Each application is launched several times with a policy of 1 process per CPU (p690 32-way 1.3 GHz)
[Figure, two panels. Left axis: elapsed time (sec); right axis: difference %. Codes: DYNAMO 8, DYNAMO 16, NTMIX 16, NSMD 16, CPMD 16, NSI 8, SDDNS 8, U1 8, U1 16]
Without VSRAC control: differences between free throughput and free runs, and ratio (series: throughput average (free), runs free, ratio); difference labels shown range from 24 to 121
With VSRAC control (rset_mcm policy): differences between WLM-controlled throughput and WLM-controlled runs, and ratio (series: throughput average (with WLM), runs controlled with WLM, ratio); difference labels shown range from 0 to 16