Calculate pi program

6/20/2023

Jordan Ranous from StorageReview has just flexed a system that matched Google's 100 trillion digit world record in just 59 days. I'm not sure if Google's record used SSDs or hard drives, but if the latter, this would be the first large computation done entirely on SSDs. It's probably safe to say that since StorageReview is able to match the world record in a fraction of the time, they are more than capable of beating it.

Intel Optimizations (or lack of): (April 17, 2023)

I've been asked a number of times why I haven't done any optimizations for recent Intel processors. The latest Intel processor which y-cruncher has optimizations for is Tiger Lake, which is 2 generations behind the latest (Raptor Lake). And because Raptor Lake lacks AVX512, it can only run a binary going all the way back to Skylake client (circa 2015).

There are three main reasons:

Removal of AVX512 starting from Alder Lake.
Heterogeneous computing (splitting of P and E-cores) is difficult to optimize for.
Memory bandwidth remains the biggest bottleneck.

Removing AVX512 is a huge step back in more ways than just the instruction width. It also removes all the other (non-width) functionality exclusive to AVX512 such as masking, all-to-all permutes, and increased register count. From a developer perspective, this is very discouraging since most of the algorithms I've been working on since 2016 have been heavily influenced by (if not outright designed for) AVX512. The lack of AVX512 is likely why Tiger Lake and Rocket Lake outperform Alder Lake in single-threaded benchmarks where memory bandwidth and core count are not a factor.

The split of P and E cores is quite frankly a nightmare to optimize for at all levels:

At the lowest level, different instruction sequences run differently on different core types, thus no single sequence is optimal on both.

At the mid level, the different core types have different cache layouts, thus requiring different cache blocking mechanisms depending on which core you're running on. In most cases, cache blocking parameters for a task cannot be changed while it is running. Thus there is no way to adapt should the thread be moved to a different core type - and that's assuming there is a way for the thread to even know it was moved.

And at the highest level, efficient parallelization is almost hopeless as you can no longer assume that tasks of similar size will finish in a similar amount of time. Any attempts to steer certain threads to certain core types (say to prevent a thread running code that's optimized for a specific core type from moving to the wrong core) will inevitably lead to complications with load balancing.

This is not to say it's impossible to optimize for heterogeneous computing, but it is not a direction that I would like to move y-cruncher towards. Obviously, Intel had their reasons to do this. Client processors are not generally used for HPC, they are used for desktop applications - like gaming.
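To make the load-balancing concern about P and E cores concrete, here is a minimal Python sketch. It is only a toy model, not anything taken from y-cruncher: the function name static_finish_times, the 4+4 core layout, and the relative core speeds (1.0 for a P-core, 0.5 for an E-core) are all assumptions made up for illustration. It splits equal-size tasks evenly across cores up front and reports when each core finishes.

```python
def static_finish_times(num_tasks, core_speeds):
    """Split equal-size tasks evenly across cores and return each core's finish time.

    core_speeds holds hypothetical relative speeds (tasks per unit time),
    not measured values for any real processor.
    """
    per_core, extra = divmod(num_tasks, len(core_speeds))
    finish_times = []
    for index, speed in enumerate(core_speeds):
        # The first `extra` cores take one leftover task each.
        count = per_core + (1 if index < extra else 0)
        finish_times.append(count / speed)
    return finish_times


if __name__ == "__main__":
    homogeneous = [1.0] * 8                # eight identical cores
    heterogeneous = [1.0] * 4 + [0.5] * 4  # assumed: 4 fast cores + 4 slow cores

    for name, speeds in (("homogeneous", homogeneous), ("heterogeneous", heterogeneous)):
        times = static_finish_times(96, speeds)
        # The whole job is only done when the slowest core finishes.
        print(name, "per-core finish times:", times, "job done at:", max(times))
```

With identical cores every share finishes together. With mixed cores the fast ones sit idle while the slow ones are still working, which is the broken assumption described above.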

The first scalable multi-threaded Pi-benchmark for multi-core systems.

Y-cruncher is a program that can compute Pi and other constants to trillions of digits. It is the first of its kind that is multi-threaded and scalable to multi-core systems. Ever since its launch in 2009, it has become a common benchmarking and stress-testing application for overclockers and hardware enthusiasts.

Y-cruncher has been used to set several world records for the most digits of Pi ever computed:

5 trillion digits - August 2010 (Shigeru Kondo).
10 trillion digits - October 2011 (Shigeru Kondo).
12.1 trillion digits - December 2013 (Shigeru Kondo).
13.3 trillion digits - October 2014 (Sandon Van Ness "houkouonchi").
22.4 trillion digits - November 2016 (Peter Trueb).
31.4 trillion digits - January 2019 (Emma Haruka Iwao).
50 trillion digits - January 2020 (Timothy Mullican).
62.8 trillion digits - August 2021 (UAS Grisons).
100 trillion digits - June 2022 (Emma Haruka Iwao).
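For a sense of what computing digits of Pi looks like at a tiny scale, here is a short Python sketch using Machin's formula (pi = 16*arctan(1/5) - 4*arctan(1/239)) with plain integer arithmetic. This is a swapped-in textbook technique chosen for brevity, not y-cruncher's method, and it is nowhere near fast enough for record-sized runs. The names arctan_inverse and pi_digits, and the choice of 10 guard digits, are assumptions for this example.

```python
def arctan_inverse(x, scale):
    """Return arctan(1/x) * scale, truncated, using the Taylor series in integers."""
    term = scale // x          # scale / x^(2k+1), starting with k = 0
    total = term
    x_squared = x * x
    divisor = 1
    sign = -1
    while term:
        term //= x_squared
        divisor += 2
        total += sign * (term // divisor)
        sign = -sign
    return total


def pi_digits(digits):
    """Return Pi as a string with `digits` decimal places (truncated, not rounded)."""
    guard = 10                 # extra digits to absorb truncation error (an assumption)
    scale = 10 ** (digits + guard)
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 16 * arctan_inverse(5, scale) - 4 * arctan_inverse(239, scale)
    text = str(pi_scaled // 10 ** guard)
    return text[0] + "." + text[1:]


if __name__ == "__main__":
    print(pi_digits(50))
```

The running time of this approach grows roughly quadratically with the number of digits, which is why programs aimed at trillions of digits rely on much faster algorithms, multi-threading, and disk storage.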
