Comet

From InterSciWiki

Comet Access

Comet User Guide

Technical Summary

Comet is a dedicated XSEDE cluster designed by Dell and SDSC delivering ~2.0 petaflops of peak performance. It features Intel next-generation processors with AVX2 support, Mellanox FDR InfiniBand interconnects, and Aeon storage.

The standard compute nodes consist of Intel Xeon E5-2680v3 (formerly codenamed Haswell) processors, 128 GB of DDR4 DRAM (64 GB per socket), and 320 GB of local SSD scratch storage. The GPU nodes each contain four NVIDIA GPUs. The large-memory nodes each contain 1.5 TB of DRAM and four Haswell processors. The network topology is 56 Gbps FDR InfiniBand, with full bisection bandwidth at the rack level and 4:1 oversubscription of cross-rack bandwidth. Comet has 7 petabytes of performance storage delivering 200 GB/s and 6 petabytes of durable storage delivering 100 GB/s. It also has dedicated gateway/portal hosting nodes and a virtual image repository. External connectivity to Internet2 and ESnet is 100 Gbps.

COMING SOON: 36 additional GPU nodes will be added to Comet, each of which features 4 NVIDIA P100 GPUs, for a total of 144 additional GPUs. The nodes are expected to be in production effective July 1, but users can request them starting with the current XSEDE allocations cycle, which is open through April 15. Please see the GPU Nodes section for additional information on making GPU allocation requests.
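Under SLURM, requesting a whole GPU node typically looks like the sketch below. The partition name `gpu` and the application binary are assumptions for illustration, not confirmed Comet settings; consult the Comet user guide for the actual partition and GRES names.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu        # hypothetical GPU partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4           # request all four GPUs on one node
#SBATCH --time=04:00:00

# Launch a GPU-enabled application (hypothetical binary name)
srun ./my_gpu_app
```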

Serving the Long Tail

Comet was designed, and is operated, on the principle that the majority of computational research is performed at modest scale. Comet also supports science gateways, which are web-based applications that simplify access to HPC resources on behalf of a diverse range of research communities and domains, typically with hundreds to thousands of users. Comet is an NSF-funded system operated by the San Diego Supercomputer Center at UC San Diego, and is available through the Extreme Science and Engineering Discovery Environment (XSEDE) program.

Comet System Configuration

Intel Haswell Standard Compute Nodes
  Node count: 1,944
  Clock speed: 2.5 GHz
  Cores/node: 24
  DRAM/node: 128 GB
  SSD/node: 320 GB

NVIDIA K80 GPU Nodes
  Node count: 36
  CPU cores : GPUs per node: 24 : 4
  CPU : GPU DRAM/node: 128 GB : 40 GB

Large-Memory Haswell Nodes
  Node count: 4
  Clock speed: 2.2 GHz
  Cores/node: 64
  DRAM/node: 1.5 TB
  SSD/node: 400 GB

Storage Systems
  File systems: Lustre, NFS
  Performance storage: 7 PB
  Home file system: 280 TB

Resource allocation policies are designed to serve more users than traditional HPC systems

The maximum allocation for a Principal Investigator is 10M core-hours. Limiting the allocation size means that Comet can support more projects, even if each individual project is smaller. Science gateways, which generally serve hundreds to thousands of users, can request more than the 10M SU cap. Comet also provides rapid-access Trial Accounts that give users 1,000 SUs within 24 hours of requesting them.

Job scheduling policies are designed for user productivity

The maximum allowable job size on Comet is 1,728 cores, a limit that helps shorten wait times since fewer nodes sit idle waiting for a large number of nodes to become free. In practice, the average job size, weighted by core-hours, is about 400 cores (20 cores unweighted, which reflects the large fraction of single-node jobs). Comet supports long-running jobs, up to one week by special request. Comet also supports shared-node jobs (more than one job on a single node). Many applications are serial or can only scale to a few cores; allowing shared nodes improves job throughput, provides higher overall system utilization, and allows more users to run on Comet.

Comet's system architecture is designed for user productivity

Each rack of Comet standard compute nodes provides 1,728 cores on a fully non-blocking FDR fat-tree network. Jobs can be run within a single rack to minimize latency, or allowed to span racks to minimize wait time. This ensures maximum interconnect performance when it is critical to the application, without penalizing throughput when that is the more important factor. Each compute node features 128 GB of DDR4 memory, which is important both for shared-node jobs and for users with serial and threaded applications. Each compute node also features up to 400 GB of local SSD storage, which can accelerate I/O performance for some applications. Comet features 36 GPU nodes and supports many community-developed applications whose GPU versions typically run much faster than their CPU counterparts. Comet's 4 large-memory nodes are well suited to applications such as those in genomics. Comet's storage system, Data Oasis, provides high performance and high capacity, with added protection via ZFS and a Durable Storage partition for periodic replication of critical project data.

Trial Accounts

Trial Accounts give potential users rapid access to Comet so they can evaluate it for their research. This allows them to compile, run, and do initial benchmarking of their applications before submitting a larger Startup or Research allocation request. Trial Accounts are for 1,000 core-hours, and requests are fulfilled within 1 working day.

REQUEST TRIAL ACCOUNT
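Once an allocation or Trial Account is in hand, the shared-node scheduling policy described above translates into an ordinary SLURM submission. The sketch below shows a single-core shared-node job; the partition name `shared`, the memory request, and the application binary are assumptions for illustration, not confirmed Comet settings.

```shell
#!/bin/bash
#SBATCH --job-name=serial-test
#SBATCH --partition=shared     # hypothetical shared-node partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1             # one core; the rest of the node remains
#SBATCH --mem=5G               #   available to other users' jobs
#SBATCH --time=01:00:00

# Run a serial application on the single allocated core (hypothetical binary)
./my_serial_app input.dat
```

Submitting with `sbatch job.sh` and checking the queue with `squeue -u $USER` is the usual workflow; requesting only the cores and memory actually needed is what makes node sharing effective.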

Technical Details

1,944 Standard Compute Nodes
  Processor type: Intel Xeon E5-2680v3
  Sockets: 2
  Cores/socket: 12
  Clock speed: 2.5 GHz
  Flop speed: 960 GFlop/s
  Memory capacity: 128 GB DDR4 DRAM
  Flash memory: 320 GB SSD
  Memory bandwidth: 120 GB/s
  STREAM Triad bandwidth: 104 GB/s

36 GPU Nodes
  GPUs: 4 NVIDIA
  Sockets: 2
  Cores/socket: 12
  Clock speed: 2.5 GHz
  Memory capacity: 128 GB DDR4 DRAM
  Flash memory: 400 GB SSD
  Memory bandwidth: 120 GB/s
  STREAM Triad bandwidth: 104 GB/s

4 Large-Memory Nodes
  Sockets: 4
  Cores/socket: 16
  Clock speed: 2.2 GHz
  Memory capacity: 1.5 TB
  Flash memory: 400 GB SSD
  STREAM Triad bandwidth: 142 GB/s

Full System
  Total compute nodes: 1,984
  Total compute cores: 47,776
  Peak performance: ~2.0 PFlop/s
  Total memory: 247 TB
  Total memory bandwidth: 228 TB/s
  Total flash memory: 634 TB

FDR InfiniBand Interconnect
  Topology: Hybrid fat-tree
  Link bandwidth: 56 Gb/s (bidirectional)
  Peak bisection bandwidth: TBD Gb/s
  MPI latency: 1.03-1.97 µs

Disk I/O Subsystem
  File systems: NFS, Lustre
  Storage capacity (durable): 6 PB
  Storage capacity (performance): 7 PB
  I/O bandwidth (performance disk): 200 GB/s

Comet supports the XSEDE core software stack, which includes remote login, remote computation, data movement, science workflow support, and science gateway support toolkits.
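The node-local flash listed above is most useful as fast job scratch: stage input data onto the SSD at the start of a job and copy results back before it ends. The sketch below illustrates the pattern; the scratch path, file names, and application binary are assumptions for illustration, so check the Comet user guide for the actual local scratch location.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --time=02:00:00

# Hypothetical node-local SSD path; verify against the Comet user guide
LOCAL_SCRATCH=/scratch/$USER/$SLURM_JOB_ID

cp input.dat "$LOCAL_SCRATCH/"        # stage input onto fast local flash
cd "$LOCAL_SCRATCH"
srun ./my_io_heavy_app input.dat      # I/O now hits the SSD, not Lustre
cp results.dat "$SLURM_SUBMIT_DIR/"   # copy results back before the job ends
```

Local scratch is typically purged when the job ends, which is why results must be copied back to the submit directory (or to Data Oasis) as the final step.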

Systems Software Environment

Software Function: Software
  Cluster management: Rocks
  Operating system: CentOS
  File systems: NFS, Lustre
  Scheduler and resource manager: SLURM
  XSEDE software: CTSS
  User environment: Modules
  Compilers: Intel and PGI (Fortran, C, C++)
  Message passing: Intel MPI, MVAPICH, Open MPI
  Debugger: DDT
  Performance: IPM, mpiP, PAPI, TAU

Supported Application Software

by Domain of Science

Biochemistry

APBS

Bioinformatics

BamTools, BEAGLE, BEAST, BEAST 2, bedtools, Bismark, BLAST, BLAT, Bowtie, Bowtie 2, BWA, Cufflinks, DPPDiv, Edena, FastQC, FastTree, FASTX-Toolkit, FSA, GARLI, GATK, GMAP-GSNAP, IDBA-UD, MAFFT, MrBayes, PhyloBayes, Picard, PLINK, QIIME, RAxML, SAMtools, SOAPdenovo2, SOAPsnp, SPAdes, TopHat, Trimmomatic, Trinity, Velvet

Compilers

GNU, Intel, Mono, PGI

File format libraries

HDF4, HDF5, NetCDF

Interpreted languages

MATLAB, Octave, R

Large-scale data-analysis frameworks

Hadoop 1, Hadoop 2 (with YARN), Spark, RDMA-Spark

Molecular dynamics

Amber, Gromacs, LAMMPS, NAMD

MPI libraries

MPICH2, MVAPICH2, Open MPI

Numerical libraries

ATLAS, FFTW, GSL, LAPACK, MKL, ParMETIS, PETSc, ScaLAPACK, SPRNG, Sundials, SuperLU, Trilinos

Predictive analytics

KNIME, Mahout, Weka

Profiling and debugging

DDT, IDB, IPM, mpiP, PAPI, TAU, Valgrind

Quantum chemistry

CPMD, CP2K, GAMESS, Gaussian, MOPAC, NWChem, Q-Chem, VASP

Structural mechanics

Abaqus

Visualization

IDL, VisIt
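The compilers, MPI stacks, and applications listed above are accessed through the Modules user environment. A hedged sketch of a typical compile-and-run session follows; the module names are assumptions, since the actual names are reported by `module avail` on the system.

```shell
# List available software, then set up a clean environment
module avail
module purge
module load intel          # Intel compiler suite (name assumed)
module load mvapich2_ib    # InfiniBand MVAPICH2 build (name assumed)

# Compile and launch an MPI C program
mpicc -O2 -o hello_mpi hello_mpi.c
mpirun -np 24 ./hello_mpi  # within a batch job, srun is the usual launcher
```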

System Access

As an XSEDE computing resource, Comet is accessible to XSEDE users who are given time on the system. To obtain an account, users may submit a proposal through the XSEDE Allocation Request System or request a Trial Account.

Nancy: Your Comet and Gordon allocation doesn't expire until 3/13/17. You'll want to get a renewal request in by Jan 15 so you can keep your time going continuously. If you log in at http://portal.xsede.org you should be able to see everything you need about your allocation. You have about 4700 CPU hours remaining on Comet and 5000 on Gordon.

https://www.sdsc.edu/services/hpc/hpc_systems.html

Comet HPC

Gordon HPC - Sinkovits