There was an interesting optimization task.
The bottleneck is memory system. So you try to fit accurately in L1, missing L1 hurts much performance. A bottleneck task for me was to sort small arrays of 14 bits (100-200 elements on average). For my core i7 6700 I found optimized 14-bit radix sort the fastest. It also scales well
with multithreading. Also I tried parallel odd-even merge sort by sorting 16 shorts with avx2 at a time, but it didn't give win over radix sort. Overall it was fun, some lessons I learned: make automated tests as early as you can (when you are trying squeezing every bit - high possibility you will break something), simd versions of algorithms do not scale well if done wrong, also it's that competition feeling - when you squeezed everything from hardware - you see from leaderboard that another competitor submitted solution that outperformed yours by 2x-10x times
The bottleneck is memory system. So you try to fit accurately in L1, missing L1 hurts much performance. A bottleneck task for me was to sort small arrays of 14 bits (100-200 elements on average). For my core i7 6700 I found optimized 14-bit radix sort the fastest. It also scales well
with multithreading. Also I tried parallel odd-even merge sort by sorting 16 shorts with avx2 at a time, but it didn't give win over radix sort. Overall it was fun, some lessons I learned: make automated tests as early as you can (when you are trying squeezing every bit - high possibility you will break something), simd versions of algorithms do not scale well if done wrong, also it's that competition feeling - when you squeezed everything from hardware - you see from leaderboard that another competitor submitted solution that outperformed yours by 2x-10x times
No comments:
Post a Comment