Performance Principles
C++ is known for its performance capabilities. Understanding how to write efficient code requires knowledge of memory management, CPU architecture, and compiler optimizations.
Memory Optimization
Cache-Friendly Data Structures
// Bad: Array of Structures (AoS) - poor cache locality
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};
std::vector<Particle> particles;

// Good: Structure of Arrays (SoA) - better cache locality
struct ParticleSystem {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> mass;
};
Avoiding Unnecessary Allocations
// Reserve capacity to avoid reallocations
std::vector<int> v;
v.reserve(1000); // Pre-allocate memory

// Use string_view to avoid copies
void process(std::string_view sv) {
    // No allocation, just a view
}
// Small Buffer Optimization (SBO)
// std::string uses it automatically for small strings
std::string small = "Hello"; // No heap allocation
Move Semantics
Avoid unnecessary copies with move semantics:
class Buffer {
    std::unique_ptr<char[]> data_;
    size_t size_;
public:
    // Move constructor - transfer ownership
    Buffer(Buffer&& other) noexcept
        : data_(std::move(other.data_))
        , size_(std::exchange(other.size_, 0))
    {}

    // Move assignment
    Buffer& operator=(Buffer&& other) noexcept {
        if (this != &other) {
            data_ = std::move(other.data_);
            size_ = std::exchange(other.size_, 0);
        }
        return *this;
    }
};
// Use std::move when transferring
std::vector<Buffer> buffers;
buffers.push_back(std::move(myBuffer));
Compiler Optimizations
- -O2/-O3: Enable general optimizations (-O3 adds more aggressive vectorization and inlining)
- -march=native: Use instructions specific to the build machine's CPU (the binary may not run on older CPUs)
- inline: A hint that may reduce call overhead; modern compilers make their own inlining decisions
- constexpr: Compute values at compile time
// constexpr - compile-time computation
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}
constexpr int result = factorial(10); // Computed at compile time

// Force inlining for performance-critical code (GCC/Clang attribute)
[[gnu::always_inline]] inline void hot_function() {
    // Critical code
}
SIMD and Parallelism
#include <execution>
#include <algorithm>
#include <vector>

std::vector<int> v(1000000);

// Parallel algorithms (C++17)
std::sort(std::execution::par, v.begin(), v.end());
std::for_each(std::execution::par_unseq, v.begin(), v.end(),
              [](int& x) { x *= 2; });
// Manual SIMD (with intrinsics)
#include <immintrin.h>
void add_vectors(const float* a, const float* b, float* c, size_t n) {
    size_t i = 0;
    // Unaligned loads/stores (loadu/storeu) avoid the 32-byte
    // alignment requirement of _mm256_load_ps
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) // Scalar tail when n is not a multiple of 8
        c[i] = a[i] + b[i];
}
Profiling Tools
- perf: Linux performance profiler
- Valgrind (Cachegrind/Callgrind): Cache-miss, branch-prediction, and call-graph simulation
- Intel VTune: Comprehensive performance analysis
- Google Benchmark: Micro-benchmarking library