My notes on performance engineering

AOS Layout (Array of Structs)

In the Array of Structs (AOS) layout, each element in the array is a fully-formed struct. This means all the fields (a, b, and c) are grouped together for each instance.

// Array of Structs (AOS) layout
struct AOS {
    int a;          // Single integer field a
    int b;          // Single integer field b
    int c;          // Single integer field c
    // Other fields can be added similarly
};

const int N = 1000;  // Example array size (assumed for illustration)
AOS s[N];            // Array of N instances of AOS

In this layout, the struct AOS contains individual integer fields (a, b, c). The array s[N] holds N instances of AOS, and each instance of AOS contains all its fields close together in memory. When iterating over s[i], you can access all three fields (a, b, and c) for a given index i.

SOA Layout (Struct of Arrays)

In the Struct of Arrays (SOA) layout, each field from the struct is treated as a separate array. For example, instead of storing a, b, and c together in each struct, we store them in separate arrays.

// Struct of Arrays (SOA) layout
const int N = 1000;  // Example array size (assumed for illustration)
struct SOA {
    int a[N];       // Array of integers for field a
    int b[N];       // Array of integers for field b
    int c[N];       // Array of integers for field c
    // Other arrays can be added similarly
};

SOA s;  // A single instance of SOA

In this layout, the struct SOA contains three separate arrays: a[N], b[N], and c[N]. These arrays are each contiguous blocks of memory. So, for example, a will contain all the a values for all instances, b will contain all the b values, and so on.

Access Patterns and Comparison

  • In AOS (Array of Structs):

    • Each instance of AOS has all its fields (a, b, c) together in memory. This layout is useful when accessing multiple fields of a struct at once.
    • For example, if you have a loop iterating over s[i] (where s is an array of AOS), you can access all the fields of s[i] together, which might be beneficial if you need to use all fields at once.
    for (int i = 0; i < N; ++i) {
        // Accessing all fields for the ith instance of AOS
        int result = s[i].a + s[i].b + s[i].c;
    }
    
  • In SOA (Struct of Arrays):

    • Each field is an independent array, which makes it easier to access all elements of a single field in a contiguous block of memory.
    • For example, if you need to iterate over the array of as, bs, or cs separately, this layout provides better memory locality for that purpose.
    for (int i = 0; i < N; ++i) {
        // Accessing the ith element of each field in SOA
        int result = s.a[i] + s.b[i] + s.c[i];
    }
    

Performance Considerations

  • AOS is better when you need to access multiple fields of the same struct together. This is because all the fields of the struct are stored contiguously in memory, which can lead to fewer cache misses when accessing multiple fields at once.
  • SOA is better when you access a single field across all structs in the array. Since each field is stored in its own array, accessing a single field across all instances is highly cache-friendly due to better memory locality. Additionally, it enables better vectorization when operating on a single field across multiple data points.

Use Case Example for AOS and SOA:

  1. AOS Use Case:

    • If you have an operation that needs to access multiple fields of each struct simultaneously (e.g., in physics simulations or matrix operations where each struct represents a vector of components), the AOS layout can provide better cache locality for those operations.
  2. SOA Use Case:

    • If you’re processing a large dataset and you mostly perform operations on a single field across all structs (e.g., summing all a values or multiplying all b values), SOA will likely offer better performance due to the continuous memory access patterns and more efficient use of cache.

Conclusion

  • AOS is generally better when you need to access all fields of a struct at once, since all the fields are stored together.
  • SOA is ideal when your operations access a single field across all structs, as each field is stored in a separate array, leading to better cache locality for those types of access patterns.

Compiler Memory Models (Cygwin64 Clang)

The Cygwin64 build of Clang uses the medium memory model by default. This is wasteful because it uses 64-bit absolute addresses rather than 32-bit relative addresses for static variables and constants. You can improve performance by specifying -mcmodel=small. The medium memory model is needed only if you link directly to a variable inside an external DLL (which is bad programming practice anyway). Another disadvantage of the Cygwin version is that you have to ship the Cygwin DLL along with the executable.

Pointer Safety Guidelines

  1. Using references instead of pointers:

    • References are generally safer because once a reference is bound to an object, it cannot be changed to refer to another object (unlike pointers). This prevents issues where a pointer could become "wild" or point to an invalid location.
  2. Initializing pointers to zero (NULL or nullptr):

    • Uninitialized pointers are a common source of bugs, especially if they end up pointing to random memory locations. Initializing pointers to nullptr (in C++11 and later) or NULL (in older C++) ensures that you know when a pointer is intentionally not pointing to anything. This makes it easier to check for nullity and avoids dereferencing invalid addresses.
  3. Setting pointers to zero when objects become invalid:

    • After deleting or deallocating memory, setting the pointer to nullptr ensures that it doesn’t point to a deallocated memory block. This prevents accidental dereferencing of invalid memory, which can lead to crashes or undefined behavior (see the sketch after this list).
  4. Avoiding pointer arithmetic and pointer type casting:

    • Pointer arithmetic can easily lead to accessing memory out of bounds, causing segmentation faults or memory corruption. Similarly, type casting pointers (e.g., from one type to another) can confuse the compiler and lead to invalid memory access if not done carefully. By avoiding these, you reduce the risk of pointer-related errors.
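
A minimal sketch illustrating these guidelines (the names here are hypothetical):

void Example() {
    int value = 42;
    int& ref = value;        // 1. A reference is always bound to a valid object
    ref += 1;                //    and cannot be re-seated or become "wild"

    int* p = nullptr;        // 2. Initialize pointers so they never hold garbage

    p = new int(7);
    *p += 1;                 // ... use the object ...
    delete p;
    p = nullptr;             // 3. Reset after deallocation

    if (p != nullptr) {      //    The nullity check is now meaningful
        *p = 0;              //    Never reached
    }
}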

Stack Storage

Variables and objects defined within a function are stored on the stack. The stack also holds function return addresses (indicating where the function was called from), function parameters, local variables, and registers that need to be restored before the function exits. The stack is an efficient memory region because it reuses the same memory addresses repeatedly; in the absence of large arrays, this part of memory is typically cached in the level-1 data cache.

Global or Static Storage in Memory

Global Variables: A global variable is a variable that is declared outside of any function, making it accessible to all functions within a program. These variables exist for the duration of the program’s execution, meaning they are stored in a special area of memory called static memory.

Static Memory: The static memory is a reserved section of memory that holds variables and data that exist for the lifetime of the program. It is used to store:

  • Global variables (those declared outside functions)
  • Static variables (those declared with the static keyword within functions or classes)
  • Floating-point constants
  • String constants
  • Array initializer lists
  • Switch statement jump tables (used for efficient case switching)
  • Virtual function tables (for polymorphic behavior in C++)

The static memory is generally divided into three main parts:

  1. For constants: Data that is never modified during the program’s execution (e.g., string literals).
  2. For initialized variables: Variables that are initialized with a value and can be modified during execution.
  3. For uninitialized variables: Variables that are not initialized when declared but may later be modified.

Advantages and Disadvantages of Static Memory:

  • Advantages:
    • Initialization before execution: Data in static memory is initialized before the program starts running, so it’s readily available when needed.
    • Persistence: Global variables or static variables retain their values throughout the program’s execution, making them accessible at any time.
  • Disadvantages:
    • Memory inefficiency: Static memory is allocated for the entire program runtime, even if only a small part of the program uses the global or static variables. This can result in unused memory taking up space, which may affect the efficiency of the program’s memory usage.
    • Reduced caching efficiency: Because the memory space is reserved and static, the operating system or hardware’s cache may not be as efficient as it could be. The memory can’t be reused for other purposes.

Global Variables – Should They Be Used?

  • While global variables are useful in certain scenarios (e.g., when communication between threads is needed), they should be avoided when possible.
  • Using global variables can make the code harder to maintain and debug, especially when multiple functions modify the same variable.
  • In most cases, it’s better to pass data between functions as parameters or encapsulate shared data within a class.
  • Better alternatives: You could make the functions that share a common variable members of the same class and store the shared variable inside the class, reducing reliance on globals.
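
A minimal sketch of that class-based alternative (the names are hypothetical):

class Counter {
public:
    void Increment() { ++count; }      // The functions that share the variable...
    int  Get() const { return count; }
private:
    int count = 0;                     // ...and the shared variable itself, no global needed
};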

Example of Lookup Table with Static Storage:

int GetValue(int x) {
    static const int values[] = {10, 25, -7, 3, 18};
    return values[x];  // Assumes 0 <= x < 5
}

  • Explanation:
    • The values array is declared static, meaning that it’s initialized only once when the function is first called and is reused in subsequent calls.
    • The const keyword ensures the array elements are not changed, helping the compiler optimize memory usage and performance by preventing unnecessary checks for modification.

This approach minimizes overhead because the array doesn’t need to be reinitialized each time the function is called, and the const declaration lets the compiler know the data won’t change, enabling further optimizations.

String and Floating-Point Constants in Static Memory:

  • String constants and floating-point constants (like 6.9) are typically stored in static memory because they don’t change during the program’s execution.
  • If multiple instances of the same constant appear in the program, the compiler will often optimize this by storing just one instance of the constant, thereby reducing memory usage.

Integer Constants:

  • Integer constants, on the other hand, are often part of the instruction code itself, and are not stored in static memory. Therefore, there is typically no issue with caching for integer constants.

Register Storage for Variables:

  • Registers are small, high-speed memory locations within the CPU used for temporary data storage. Accessing data from registers is significantly faster than accessing main memory.
  • Optimizing compilers automatically store frequently used variables in registers for faster access. Registers can be reused for different variables as long as their usage periods (live ranges) do not overlap.
  • Local variables are ideal for register storage because they are often used within a limited scope, making them easier to manage.

Register Availability:

  • In 32-bit x86 systems, there are about six general-purpose integer registers.
  • In 64-bit systems, there are approximately fourteen general-purpose integer registers.
  • Floating-point registers are separate, with eight available in 32-bit systems, sixteen in 64-bit systems, and up to thirty-two with AVX512 enabled.

Note: Some compilers may struggle with using floating-point registers in 32-bit mode unless certain instruction sets like SSE are enabled.

Key Takeaway:

  • Registers offer the fastest access for variables, but their number is limited. Optimizing compilers manage their use efficiently, prioritizing local variables and optimizing performance.

Volatile

The volatile Keyword Explained:

Further reading:

  • https://stackoverflow.com/questions/246127/why-is-volatile-needed-in-c/
  • https://stackoverflow.com/questions/72552/why-does-volatile-exist
  • https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/
  • https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code/

  1. Purpose of volatile:

    • The volatile keyword tells the compiler that a variable's value can change unexpectedly, typically due to external factors (e.g., another thread or hardware).
    • It prevents the compiler from making optimizations that assume the variable will always hold its previously assigned value. Without volatile, the compiler may assume the value remains unchanged, leading to incorrect behavior.
  2. Example Scenario:

    volatile int seconds; // Incremented by another thread
    void DelayFiveSeconds() {
        seconds = 0;
        while (seconds < 5) {
            // Do nothing while seconds count to 5
        }
    }
    
    • In this case, seconds is being updated by another thread. Without volatile, the compiler might optimize the loop as while (0 < 5), which would cause an infinite loop, assuming seconds never changes.
  3. How volatile Works:

    • Prevents optimizations: Forces the compiler to always fetch the variable from memory (not from registers) to ensure the value is up-to-date.
    • Ensures that the variable is not optimized out or cached in a register, which is crucial when dealing with variables that are altered outside the current code scope (like in multi-threading).
  4. Not Atomic:

    • Important: volatile does not make the variable atomic. It doesn't prevent multiple threads from modifying it at the same time.
    • Thread safety: In the example, if one thread sets seconds to 0 while another increments it, a race condition could occur, causing inconsistent behavior.
  5. Better Approach for Thread Safety:

    • To safely manage the variable between threads, consider using proper synchronization (e.g., mutexes or atomic operations) to avoid race conditions; see the std::atomic sketch after this list.
  6. Real-World Usage:

    • volatile is often used in hardware programming (e.g., interacting with hardware registers or flags) and in multithreading scenarios where variables are shared between threads.
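
A minimal sketch of the delay loop from the earlier example rewritten with std::atomic (assuming C++11), which fixes the visibility problem and makes the individual reads and writes atomic:

#include <atomic>

std::atomic<int> seconds{0};   // Safely shared between threads

void DelayFiveSeconds() {
    seconds.store(0);
    while (seconds.load() < 5) {
        // Busy-wait; each load is guaranteed to observe updates from other threads
    }
}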

Key Takeaways:

  • volatile ensures the variable is always fetched from memory, preventing incorrect compiler optimizations.

  • It’s not atomic—use synchronization mechanisms for safe concurrent access.

  • Essential for situations like multi-threaded programming or interacting with hardware that may modify variables outside the program's control.

  • Unrelated, but an interesting read: https://github.com/Rust-for-Linux/linux/issues/2

Integer Size and Performance Recommendations:

  1. Default Integer Size:

    • Use the default integer size (usually int) when the size doesn’t matter and there’s no risk of overflow. Examples include simple variables, loop counters, etc. This is typically efficient and simple.
  2. Optimizing for Large Arrays:

    • For large arrays, consider using the smallest integer size that is sufficient for your needs. This can improve data cache efficiency by reducing memory usage.
  3. Bit-fields:

    • Avoid using bit-fields with sizes other than 8, 16, 32, or 64 bits. Non-standard sizes can lead to inefficiency due to alignment issues and extra overhead in memory storage.
  4. 64-bit Integers:

    • In 64-bit systems, you can use 64-bit integers if the extra bits are useful for your application. This can help when working with large data sets, though it’s not always necessary for typical calculations.
  5. size_t Type:

    • size_t is unsigned and designed for array sizes and indices. It’s 32 bits in 32-bit systems and 64 bits in 64-bit systems. This type ensures that overflow won’t occur even with large arrays (over 2 GB), which makes it ideal for managing large data structures.
  6. Avoiding Integer Overflow:

    • When selecting an integer size, consider potential intermediate overflow in expressions. For example, in a = (b * c) / d, the product b * c can overflow even if b, c, and d are within safe limits individually (see the sketch after this list).
    • Important: There is no automatic check for integer overflow. Always ensure the selected type can handle the expected range of values throughout the calculation.
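
A minimal sketch of such an intermediate overflow and one way to avoid it (values chosen for illustration):

#include <cstdint>

void OverflowDemo() {
    int32_t b = 100000, c = 100000, d = 1000;

    // int32_t bad = (b * c) / d;   // b * c = 10,000,000,000 overflows int32_t;
    //                              // signed overflow is undefined behavior

    // Widening the intermediate product to 64 bits avoids the overflow;
    // the final result (10,000,000) fits comfortably in 32 bits:
    int32_t good = (int32_t)(((int64_t)b * c) / d);
    (void)good;
}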

Key Takeaways:

  • Default size is fine for small, simple data.
  • Use the smallest appropriate integer for large data structures to optimize cache usage.
  • Be careful with bit-fields and avoid sizes other than 8, 16, 32, or 64 bits.
  • Consider using size_t for array sizes to avoid overflow.
  • Watch out for overflow in intermediate calculations, as there's no automatic overflow detection.

1. Division by a constant: Unsigned integers are faster than signed integers

When performing a division operation by a constant, unsigned integers (integers that can't hold negative values) are often processed more efficiently than signed integers (integers that can hold both positive and negative values).

This performance difference stems from the way the CPU handles signed and unsigned division. For unsigned integers, the hardware can use simpler, faster operations because it doesn't need to account for negative numbers. For signed integers, extra logic is required to handle negative numbers, which can slow things down.

Example: Let’s assume you're dividing a number by a constant. If a is an unsigned integer, and you’re dividing it by 5, the processor can optimize this operation more efficiently than if a is signed, which requires handling both positive and negative numbers.

unsigned int a = 20;
unsigned int result = a / 5;  // Faster than if 'a' were signed

The modulo operator (%) behaves similarly. For unsigned integers, the modulo operation is generally faster compared to signed integers.


2. Conversion to floating-point is faster with signed integers

When converting integers to floating-point numbers (e.g., from int to float), signed integers tend to be faster to convert than unsigned integers on most instruction sets.

This is because floating-point operations are usually optimized for signed integers (which include both positive and negative numbers). In contrast, unsigned integers don't have a sign bit, and converting them to floating-point numbers can involve more complex processing to handle the lack of a sign bit.

Example:

int signedInt = 10;
float floatValue = (float)signedInt;  // Conversion from signed to float is fast

unsigned int unsignedInt = 10;
float floatValue2 = (float)unsignedInt;  // Conversion might be slower due to handling of the unsigned type

3. Overflow behavior

Overflow is when a variable exceeds the maximum or minimum value it can hold.

  • For unsigned integers, overflow causes the value to "wrap around" and continue from the minimum (typically 0). In other words, if you try to store a number that’s too large for an unsigned integer, the result will be a "low positive" value. For example, if an 8-bit unsigned integer (with a max value of 255) overflows, it will wrap around to 0.

Example (Unsigned overflow):

unsigned char u = 255;
u = u + 1;  // Overflow! The result is 0 (wrap-around)
  • For signed integers, overflow is officially undefined in C and many other languages. Typically, when a signed integer overflows, it "wraps around" as well, but it results in a negative value. For example, a signed integer could overflow from its maximum positive value to a negative number. However, the behavior isn't guaranteed and can vary by compiler and platform. Some compilers may optimize the code in such a way that the overflow scenario is not even considered, leading to undefined results if the overflow occurs.

Example (Signed overflow):

int s = 2147483647;  // Assuming a 32-bit signed integer
s = s + 1;  // Overflow! This could result in undefined behavior, but in many cases, it wraps around to a negative number

4. Mixing signed and unsigned integers in comparisons

When you compare a signed integer with an unsigned integer (e.g., using <, >, ==), the result can be unpredictable or ambiguous. This happens because the signed integer may be negative, but the unsigned integer cannot be. So, comparing these two types directly could produce unexpected results.

In many compilers, the signed integer gets converted to an unsigned integer during the comparison, which could lead to incorrect results. For example, a negative signed integer might be interpreted as a large positive number when compared to an unsigned integer.

Example:

int signedNum = -1;
unsigned int unsignedNum = 10;

if (signedNum < unsignedNum) {
    // This may not behave as expected, because the signedNum will be interpreted as a large positive number
}

Key Takeaways:

  • Unsigned integers are generally faster for division and modulo operations.
  • Signed integers are typically faster to convert to floating-point numbers.
  • Overflow behavior is different: unsigned integers wrap around positively, but signed integers may behave unpredictably, especially with compilers optimizing away overflow checks.
  • Comparing signed with unsigned integers is risky and should be avoided, as the result can be ambiguous and lead to incorrect behavior.

Pre-Increment (++i) vs. Post-Increment (i++) Efficiency

The pre-increment (++i) and post-increment (i++) operators are both extremely efficient and behave similarly to addition in terms of performance. When you're simply incrementing an integer variable without using the result of the operation, there’s essentially no performance difference between the two. This means that in a simple loop, for example, using either form of the increment operator will produce the same result in terms of speed.

Example:

// Both are equally efficient
for (i = 0; i < n; i++)   // Post-increment
for (i = 0; i < n; ++i)   // Pre-increment

In the above example, both loops will behave identically. The loop will execute n times, incrementing i by 1 in each iteration, and there is no noticeable performance difference.


When the Result of the Expression Matters

While pre-increment and post-increment are equivalent in simple use cases like a loop counter, the situation changes when the result of the increment expression is used, such as in an assignment. In these cases, the efficiency can differ.

  • With post-increment (i++), the original value of i is used in the expression, and the increment happens afterwards. Since the original value is already available, the rest of the expression does not have to wait for the increment.
  • With pre-increment (++i), the value of i is incremented first, so any expression that uses the result must wait for the increment to complete before it can proceed.

Example:

x = fruits[i++];   // Post-increment: uses the current value of i first, then increments it

In this case, the address of the array element is calculated using the original value of i, which is already available, so the array access can start immediately; the increment happens independently afterwards.

x = fruits[++i];   // Pre-increment: increments i first, then uses the new value of i

Here, the address calculation must wait for the updated value of i, which can delay the read of x by a few clock cycles. For array indexing like this, the post-increment form is therefore slightly more efficient (this matches the pointer example x = *(p++) versus x = *(++p) discussed later).


Efficiency in Other Cases

There are cases where pre-increment can actually be more efficient than post-increment. For example, when you use pre-increment in an assignment:

a = ++b;   // Pre-increment

In this case, the value of b is incremented, and then assigned to a. Since both a and b are assigned the same value, the compiler can optimize this by using the same register for both, making it faster.

On the other hand, with post-increment:

a = b++;   // Post-increment

Here, the value of b is assigned to a first, then b is incremented. Because a and b end up with different values, the compiler can't use the same register for both, which could lead to a small performance hit.


General Rule:

  • Whether pre- or post-increment is more efficient depends on how the result is used: prefer i++ when the old value feeds a further calculation (such as an array index), because the old value is available immediately, and prefer ++i in an assignment like a = ++b, where the compiler can share a register between a and b.
  • In simple loops where the result of the increment is not used directly, both pre-increment and post-increment have identical performance.

Decrement Operators (--i and i--)

The behavior and efficiency rules discussed for the increment operators also apply to the decrement operators (--i and i--). Whether you're using them in loops or expressions, the same principles hold true:

  • The same dependency considerations apply: post-decrement (i--) is preferable when the old value feeds a further calculation, while pre-decrement (--i) is preferable when the decremented value is assigned onward.

Example:

for (int i = n; i > 0; --i)  // Pre-decrement
for (int i = n; i > 0; i--)  // Post-decrement

Both loops are functionally identical in terms of how many times they run; if the result of the decrement is used in an expression, the same dependency considerations as for the increment operators apply.


Conclusion

In summary:

  • For simple increments or decrements (e.g., in loops), pre-increment (++i) and post-increment (i++) are equally fast.
  • When the result of the increment/decrement expression is used, choose based on the data dependency: the post forms are better when the old value feeds a further calculation (such as an array index), and the pre forms are better when the new value is assigned onward (such as a = ++b).
  • The performance difference arises because an expression that needs the incremented value must wait for the increment to complete, which can introduce a small delay in some situations.


Floating Point Registers in Modern x86 Microprocessors

Modern x86 processors have two types of floating point registers and, as a result, two different methods of handling floating point calculations: x87 registers (older method) and vector registers (newer method). Each method has its own pros and cons.


x87 Registers

The x87 floating point registers are part of the older method of handling floating point operations. These registers are organized as a stack and each register holds long double precision (80-bit) values.

Advantages of x87 Registers:

  1. Long double precision: All calculations are done with 80-bit precision, which is higher than the standard double precision (64-bit), providing more accuracy.
  2. No extra time for conversions between different precisions: If you mix precision types (e.g., float, double), x87 registers handle conversions between them without extra overhead.
  3. Intrinsic support for complex math functions: There are built-in hardware instructions for mathematical functions like logarithms and trigonometric functions.
  4. Compact code: The code is relatively smaller and more efficient, which helps with caching and can improve overall performance for some workloads.

Disadvantages of x87 Registers:

  1. Difficult for compiler optimization: Due to the way the x87 register stack is organized, it's difficult for compilers to treat floating point variables as regular registers, leading to inefficient use of resources.
  2. Slow floating point comparisons: Operations that compare floating point numbers are relatively slow with x87 registers.
  3. Inefficient integer-to-floating point conversions: Conversions between integers and floating point numbers are not as fast as with vector registers.
  4. Slower math functions with long double precision: Calculations such as division, square root, and other mathematical functions are slower when using 80-bit long double precision.

Vector Registers (XMM, YMM, ZMM)

In more modern processors, a newer method for floating point operations involves vector registers. These include registers like XMM (128 bits), YMM (256 bits), and ZMM (512 bits) which are part of the SSE, AVX, and AVX512 instruction sets, respectively. Vector registers can hold single precision (32-bit) or double precision (64-bit) floating point values, and they allow for parallel processing.

Advantages of Vector Registers:

  1. Easier for compiler optimization: Vector registers make it easier for compilers to handle floating point variables, allowing more efficient use of CPU resources.
  2. Parallel calculations: Vector operations allow for parallel processing of multiple floating point values at once, which can greatly speed up computations when working with large arrays or matrices (see the sketch after the disadvantages list below).
  3. Same precision for all operands: Floating point operations maintain the same precision as the operands, which can be more efficient in many cases because no extra precision conversion is required during intermediate steps.

Disadvantages of Vector Registers:

  1. No long double precision: Unlike x87 registers, vector registers don't support 80-bit long double precision. Only single (32-bit) and double (64-bit) precision are supported.
  2. Mixed precision calculations are slower: If you have calculations that involve operands with different precisions (e.g., mixing single and double precision), converting between these precisions can take time and reduce performance.
  3. Dependence on function libraries for complex math: Unlike x87 registers that have built-in support for certain math functions, vector registers require function libraries for complex math (logarithms, trigonometry, etc.), which can sometimes be slower than the hardware-supported functions in x87.
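
A minimal sketch of such a parallel calculation using SSE intrinsics (assuming a processor and compiler with SSE enabled):

#include <xmmintrin.h>

void AddFour(const float* a, const float* b, float* result) {
    __m128 va = _mm_loadu_ps(a);       // Load a[0..3] into a 128-bit XMM register
    __m128 vb = _mm_loadu_ps(b);       // Load b[0..3]
    __m128 vr = _mm_add_ps(va, vb);    // Four single-precision additions in one instruction
    _mm_storeu_ps(result, vr);         // Store result[0..3]
}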

Using Floating Point in Modern Compilers

  • Compilers tend to use vector registers for floating point calculations when they are available. This is usually in 64-bit mode or when the SSE (Streaming SIMD Extensions) or higher instruction sets are enabled.
  • Vector operations are often more efficient when performing calculations on large datasets due to their ability to process multiple values simultaneously (parallelism).
  • Double precision calculations are often as fast as single precision calculations, as long as they don't involve vector operations. When vector operations are used, single precision (32-bit) tends to be faster for mathematical functions like division and square roots.

Precision Considerations:

  • Double precision (64-bit) and single precision (32-bit) have different performance characteristics. If you're working with big arrays, using single precision can help improve data cache usage since more data fits into memory.
  • Double precision may be necessary if higher accuracy is required, but it won't usually be significantly slower than single precision unless vector operations are involved.
  • Half precision (16-bit) is supported on processors with the AVX512-FP16 instruction set extension, but it’s not as widely used.

Floating Point Operations and Performance

  • Floating point addition typically takes 2 - 6 clock cycles.
  • Multiplication takes 3 - 8 clock cycles.
  • Division can be much slower, taking anywhere from 14 - 45 clock cycles.
  • Floating point comparisons and conversions between floating point and integer types are inefficient when using x87 registers.

Best Practices for Floating Point Operations

  • Avoid mixing single and double precision in the same expression because it requires conversions that can slow down the calculation.

  • Minimize conversions between integer and floating point types when possible, as these conversions are inefficient.

  • Use flush-to-zero mode for handling floating point underflow, which helps avoid the performance cost of dealing with subnormal numbers. Subnormal numbers can introduce inefficiencies and potential inaccuracies.

    Example of setting flush-to-zero mode:

    #include <xmmintrin.h>
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // Enable flush-to-zero mode
    
  • If vector registers are available, it’s often beneficial to set both flush-to-zero and denormals-are-zero modes to prevent performance issues with subnormal numbers.

    Example for both flush-to-zero and denormals-are-zero modes:

    #include <xmmintrin.h>
    _mm_setcsr(_mm_getcsr() | 0x8040); // Enable both flush-to-zero and denormals-are-zero modes
    

Conclusion

In modern processors:

  • The x87 registers offer more precision (long double) but are harder to optimize and slower for certain operations.
  • The vector registers (XMM, YMM, ZMM) are more efficient for parallel operations and easier for compilers to optimize, but they lack long double precision and can incur penalties for mixed precision calculations.
  • Using single precision is often beneficial for large arrays or when leveraging vector operations, while double precision should be used when higher accuracy is required without vector operations.

Understanding Enums in C/C++

An enum (short for enumeration) is essentially a set of named integer constants. In most programming languages, including C and C++, enums are treated as integers under the hood. The key thing to understand is that enum values are just integers with labels attached to them, making them as efficient as using regular integers.

Efficiency of Enums

Enums are as efficient as integers in terms of performance. This means that operations involving enums (such as assignments or comparisons) are just as fast as operations involving regular integers. The compiler often handles enums internally as simple integers, so there's no significant performance penalty for using them.

Potential Name Clashes

One thing to be aware of when using enums is that the names of enum values (enumerators) could clash with other identifiers in your code, like variables or function names. This can lead to ambiguity or errors because the compiler won't know whether you're referring to the enum value or another variable/function with the same name.

To prevent these name clashes, you should:

  1. Use unique names for enum values: This will help avoid conflicts with other identifiers.
  2. Use namespaces: By placing the enum inside a namespace, you can ensure that the enumerators are scoped within that namespace, preventing name clashes with variables or functions outside the namespace.

Example of Name Clashes

Here's an example where a name clash might occur:

enum Color {
    Red,
    Green,
    Blue
};

int Red = 5;  // This causes a clash with the enum value Red

// Now we can't use the enum value Red without confusion

Solution using Namespaces

To avoid this, you can wrap the enum in a namespace:

namespace Colors {
    enum Color {
        Red,
        Green,
        Blue
    };
}

int Red = 5;  // No clash with Colors::Red

By putting the enum inside a namespace (Colors), you ensure that the enumerators (like Red) are scoped specifically to the Colors namespace, avoiding conflicts with other Red variables or functions outside that namespace.


Summary:

  • Enums are just integers in disguise: They don’t affect performance and are just labeled integers for better readability and maintainability.
  • Avoid name clashes: If you define enums in header files, make sure their names are unique or put them in namespaces to prevent conflicts with variables or functions that have the same name.

Optimizing Operand Order in Logical Expressions

When writing logical expressions, especially with operators like && (AND) or || (OR), the order of operands can influence performance. By understanding how each operand behaves, you can optimize your code for better efficiency. Here are two general guidelines:

  1. Most Predictable Operand First: If one operand is more predictable (i.e., its result is easier to determine based on the previous computations), place that operand first in the expression. This can help the compiler and processor make decisions faster, especially for short-circuiting.

  2. Faster Operand First: If one operand is faster to compute than the other, place the faster operand first. This way, the more costly operation is performed only when necessary, potentially saving time if the first operand resolves the entire condition.


Careful Consideration for Boolean Expressions

However, there are cases where swapping operands in logical expressions is not safe. Specifically, when the evaluation of the operands has side effects or when the first operand influences the validity of the second, the order must remain fixed. Here’s why:

  • Side Effects: If evaluating the first operand changes something that affects the second operand (like modifying a variable or triggering a function), changing the order could lead to unintended behavior.
  • Operand Validity: In logical expressions using && (AND), the first operand can determine whether the second operand should even be evaluated. For example, in an && operation, if the first operand is false, the second operand will not be evaluated due to short-circuiting. Swapping the order of operands could change the logic of the expression.

Example 1: Array Indexing with Bounds Check

Let’s say we are checking if an index i is within the bounds of an array and then performing a check on the array value. Here's the original code:

unsigned int index;
const int MAX_SIZE = 100;
float data[MAX_SIZE];

if (index < MAX_SIZE && data[index] > 1.0) {
    // Process the array element
}

Here, you cannot swap the operands in the condition because if index >= MAX_SIZE, the expression data[index] > 1.0 would try to access an invalid memory location, causing undefined behavior. The check index < MAX_SIZE must be evaluated first to ensure the array access is valid.


Example 2: Validating a Handle Before Calling a Function

Consider the following code that checks if a file handle is valid before calling a function to write data:

if (fileHandle != INVALID_HANDLE && WriteToFile(fileHandle, ...)) {
    // Write to the file
}

Here, the operands cannot be swapped because if fileHandle is invalid, you should not call WriteToFile at all. Swapping the order could lead to attempting to write to an invalid file handle, which would cause errors.


Optimizing Boolean Expressions with Bitwise Operators

Boolean operations can become much more efficient if we are certain that the operands only contain the values 0 and 1. The reason compilers typically avoid making this assumption is because variables might have other values if they are uninitialized or come from unreliable sources. However, if the variables a and b are initialized to valid Boolean values or derived from operators that produce Boolean results, we can optimize the code.

Here’s an optimized version of the code:

// Straightforward version using logical operators
bool a, b, c, d;
c = a && b;
d = a || b;

// Optimized version using bitwise operators
char a = 0, b = 0, c, d;
c = a & b;
d = a | b;

In this case, I've used char (or int) instead of bool, so I can use the bitwise operators (& and |) rather than logical operators (&& and ||). Bitwise operations are executed much faster since they typically take only one clock cycle. The OR operator (|) works even if a and b contain values other than 0 or 1. However, the AND operator (&) and XOR operator (^) can give inconsistent results if their operands are not strictly Boolean (i.e., values other than 0 or 1).

Some important notes:

  1. Bitwise NOT (~) vs. Boolean NOT (!): You cannot use ~ for negation in a Boolean context. Instead, if you know that a variable is 0 or 1, you can achieve the Boolean NOT by XOR’ing it with 1:

    bool a, b;
    b = !a; // Logical NOT
    
    // Optimized version:
    char a = 0, b;
    b = a ^ 1; // XOR with 1 as a substitute for NOT
    
  2. Using & and | for short-circuiting: You cannot replace logical && with & or || with | if the second operand (b) is an expression that should only be evaluated based on the first operand’s value. For instance:

    • a && b ensures b is evaluated only if a is true.
    • a || b ensures b is evaluated only if a is false. Using & or | could lead to unwanted evaluations of b even when a would make it unnecessary.

Where is the trick most useful?

The use of bitwise operators is more beneficial when you're working with variables rather than comparisons. For example:

bool a;
float x, y, z;
a = (x > y) && (z != 0);

In most cases, the logical AND (&&) here is optimal. Avoid changing it to & unless you anticipate many branch mispredictions that could slow down execution due to the evaluation of z != 0.

Boolean Vector Operations

In addition, integers can be used as Boolean vectors. For instance, if a and b are 32-bit integers, the expression y = a & b; will perform 32 AND operations in just a single clock cycle, which is highly efficient for manipulating multiple Boolean values packed into an integer.

In these cases, the bitwise operators (&, |, ^, and ~) are extremely useful for performing Boolean vector operations on large sets of binary data.
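
A minimal sketch of packing Boolean flags into a single integer (the flag names are hypothetical):

#include <cstdint>

const uint32_t FLAG_VISIBLE = 1u << 0;
const uint32_t FLAG_ACTIVE  = 1u << 1;
const uint32_t FLAG_DIRTY   = 1u << 2;

void FlagDemo() {
    uint32_t state = FLAG_VISIBLE | FLAG_DIRTY;   // Set two flags in one operation
    bool dirty = (state & FLAG_DIRTY) != 0;       // Test a single flag
    state &= ~FLAG_DIRTY;                         // Clear a single flag
    state ^= FLAG_ACTIVE;                         // Flip a single flag
    (void)dirty;
}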


Pointers and Pointer Arithmetic

A pointer is essentially an integer that holds the memory address of another variable or object in memory. Since a pointer is just a type of integer, performing arithmetic on pointers is as fast as doing arithmetic on regular integers.

When you add an integer to a pointer, the result is not simply the address increased by that integer's value. Instead, the pointer is adjusted by the size of the data type it points to. In other words, when you add an integer to a pointer, it multiplies the integer by the size of the type the pointer is pointing to (for example, the size of an int, float, etc.).

Example:

Consider the following example where we have an array of integers:

int arr[5] = {10, 20, 30, 40, 50};
int *ptr = arr;  // Pointer points to the first element of the array

If we then add 2 to the pointer:

ptr = ptr + 2;

The pointer ptr now points to the third element of the array (arr[2]), which has the value 30.

Here’s why:

  • Initially, ptr points to the first element of the array (arr[0]).
  • When we add 2 to ptr, the pointer moves by 2 * sizeof(int) bytes in memory.
  • Since an int typically takes 4 bytes (on most systems), ptr moves forward by 2 * 4 = 8 bytes, effectively pointing to the third element (arr[2]).

Key Takeaways:

  • Pointer arithmetic takes into account the size of the data type the pointer is pointing to.
  • When you add an integer to a pointer, the pointer moves by the number of bytes equal to the integer multiplied by the size of the type it’s pointing to.
  • This makes pointer arithmetic a fast and efficient way to navigate through memory, especially when working with arrays or data structures.

Code Example:

#include <stdio.h>

int main() {
    int arr[5] = {10, 20, 30, 40, 50};
    int *ptr = arr;

    printf("First element: %d\n", *ptr);  // Outputs 10
    ptr = ptr + 2;  // Move pointer forward by 2 elements
    printf("Third element: %d\n", *ptr);  // Outputs 30

    return 0;
}

In this example, adding 2 to ptr advances the pointer by two int-sized steps (each step being 4 bytes), making it point to the third element of the array, arr[2]. This kind of pointer arithmetic is crucial for efficiently working with arrays and data structures in low-level programming.

The object pointed to can be accessed approximately two clock cycles after the value of the pointer has been calculated. Therefore, it is recommended to calculate the value of a pointer well before the pointer is used. For example, x = *(p++) is more efficient than x = *(++p), because in the latter case the reading of x must wait until a few clock cycles after the pointer p has been incremented, while in the former case x can be read before p is incremented.


Function Pointers

Calling a function through a function pointer can introduce extra overhead compared to calling the function directly. This is because the CPU needs to predict the target address. If the function pointer value is the same as the last time it was used, the CPU can predict the target address accurately, making the call faster. However, if the function pointer changes, the prediction may fail, causing a delay due to misprediction and requiring the CPU to correct itself.

Example:

#include <stdio.h>

void foo() {
    printf("Function foo called\n");
}

void bar() {
    printf("Function bar called\n");
}

int main() {
    void (*func_ptr)();

    // Function pointer initially points to foo
    func_ptr = foo;
    func_ptr();  // Calls foo

    // Function pointer changes to bar
    func_ptr = bar;
    func_ptr();  // Calls bar

    return 0;
}

Here:

  • The function pointer func_ptr initially points to foo, so the call is predictable.
  • When func_ptr changes to bar, the CPU may mispredict the address, causing a delay.

Smart Pointers

Smart pointers in C++ are a feature that automatically manage the memory of dynamically allocated objects. When a smart pointer is deleted, it ensures that the object it points to is also deleted, preventing memory leaks. They are particularly useful for managing objects created with new and ensuring that memory is properly freed when the object is no longer needed.

Types of Smart Pointers:

  • std::unique_ptr: Ensures that only one pointer owns a dynamically allocated object at a time. Ownership can be transferred, but not shared.
  • std::shared_ptr: Allows multiple pointers to share ownership of the same object. The object is deleted when all shared_ptr instances pointing to it are destroyed.

Performance Considerations:

  • Accessing an object through a smart pointer (using *p or p->member) has no extra performance cost compared to a regular pointer.
  • However, creating, deleting, copying, or transferring ownership of a smart pointer adds some overhead, especially for shared_ptr compared to unique_ptr.
  • In simple cases, modern compilers (like Clang or GCC) can optimize the overhead of unique_ptr to the point where it behaves similarly to a regular pointer.

Use Cases:

  • Smart pointers are helpful when an object is dynamically created in one function and must be deleted in another unrelated function. They handle memory management automatically in such cases.
  • If a function or class is responsible for both creating and deleting an object, smart pointers are unnecessary.

Trade-offs:

  • If many small objects are dynamically allocated with individual smart pointers, this could introduce significant overhead. In such cases, it may be more efficient to group these objects into a container with contiguous memory (like a vector) to minimize memory allocation costs.
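
A minimal sketch of the create-here, delete-elsewhere use case described above (the names are hypothetical; std::make_unique assumes C++14):

#include <memory>
#include <utility>

struct Widget { int value = 0; };

// One function creates the object...
std::unique_ptr<Widget> MakeWidget() {
    return std::make_unique<Widget>();
}

// ...and an unrelated function consumes it; no explicit delete anywhere.
void Consume(std::unique_ptr<Widget> w) {
    w->value = 42;
}   // The Widget is destroyed automatically here

void Demo() {
    auto w = MakeWidget();
    Consume(std::move(w));   // Ownership transferred, not shared
}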

An array can be initialized to zero by using memset (this works for float because the all-zero bit pattern represents 0.0 in IEEE 754):

#include <string.h>

float list[69];
memset(list, 0, sizeof(list));

The idea behind organizing a multidimensional array is to ensure that the elements are accessed in the most efficient way, particularly with respect to cache locality.


In C++, multidimensional arrays are stored in row-major order. This means that elements of each row are stored in contiguous memory locations. Specifically, in the example:

const int rows = 20, columns = 50;
float matrix[rows][columns];
int i, j;
float x;
for (i = 0; i < rows; i++)
    for (j = 0; j < columns; j++)
        matrix[i][j] += x;
  • The array matrix is stored such that the elements of row 0 are placed in contiguous memory locations, followed by the elements of row 1, then row 2, and so on.
  • When you access the array as matrix[i][j], the last index (j) changes fastest, meaning that you move across the elements of a row sequentially in memory.

Cache Efficiency:

  • Sequential access is beneficial because accessing consecutive elements of the array (like matrix[i][j], then matrix[i][j+1]) maximizes the likelihood that the data will be in the cache, as cache lines generally store contiguous memory locations.
  • In this example, the outer loop (on i, representing rows) runs slower than the inner loop (on j, representing columns), so each row of the matrix is accessed in order, and the elements within a row are accessed in contiguous memory locations.

Opposite Order:

If you were to reverse the order of the loops, like this:

for (j = 0; j < columns; j++)
    for (i = 0; i < rows; i++)
        matrix[i][j] += x;
  • This would cause the first index (i) to change fastest, which means you're accessing elements column by column. Since the array is stored row-by-row in memory, this leads to non-sequential access. In this case, each access might be a cache miss, which significantly reduces the efficiency of data caching.

When working with multidimensional arrays or arrays of objects, the efficiency of address calculation and data access can be improved by making certain dimensions a power of 2, especially when accessing the array in a non-sequential manner (like skipping rows or columns).

Example Breakdown:

In the following example:

// https://www.agner.org/optimize/optimizing_cpp.pdf
int FuncRow(int); int FuncCol(int);
const int rows = 20, columns = 32;
float matrix[rows][columns];
int i; float x;
for (i = 0; i < 100; i++)
    matrix[FuncRow(i)][FuncCol(i)] += x;
  1. Address Calculation:

    • To access matrix[FuncRow(i)][FuncCol(i)], the address of the element needs to be calculated. Normally, for a 2D array, the address calculation would be like:
      Address = base_address + (FuncRow(i) * columns + FuncCol(i)) * sizeof(float)
      
    • The expression FuncRow(i) * columns + FuncCol(i) calculates the index in a linearized memory model (since arrays are stored in row-major order).
  2. Power of 2 Optimization:

    • The important part is that multiplying by a power of 2 (like 32, which is 2^5) can be more efficient than multiplying by a non-power of 2. This is because multiplying by powers of 2 can be implemented with bit-shifting, which is much faster than regular multiplication.
    • Specifically, multiplying by columns (which is 32) can be optimized as a left bit-shift operation. For example, instead of multiplying by 32, you can shift the value left by 5 bits (because 2^5 = 32). This is more efficient in terms of CPU cycles (see the sketch after this list).
  3. Compiler Optimization:

    • In the case where the rows are accessed sequentially (e.g., accessing each row one after the other), an optimizing compiler can often "see" this and calculate the address more efficiently by simply adding the length of each row to the address of the preceding row, without needing to explicitly calculate the full address.
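
A minimal sketch of the power-of-2 address calculation (the shift amount 5 corresponds to columns = 32):

int LinearIndex(int row, int col) {
    const int columns = 32;            // 2^5
    // The compiler implements this multiplication as row << 5,
    // because 32 is a power of 2; no general multiply is needed:
    return row * columns + col;
}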

Considerations:

  • Arrays of Structures or Objects: The advice of making the second dimension (e.g., columns in the matrix) a power of 2 also applies to arrays of structures or classes. If the structure size (in bytes) is a power of 2, the address calculation becomes more efficient when accessing elements non-sequentially.

  • Non-Sequential Access and Cache Efficiency:

    • When arrays are accessed non-sequentially, such as in a random order, ensuring that the dimension size is a power of 2 can help with address calculation efficiency.
    • However, this strategy may backfire in cases where the array is large enough to be beyond the L1 data cache, especially if there are frequent cache misses. If the array is large and accessed non-sequentially, forcing the dimension size to be a power of 2 could result in cache contention and slow down the program due to inefficient cache utilization.

Conclusion:

  1. Optimizing Address Calculation:

    • For better address calculation efficiency in cases of non-sequential access, make the second dimension (or the dimension that is multiplied) a power of 2. This makes multiplication faster by leveraging bit-shifting.
  2. Consider Cache Implications:

    • While using a power of 2 can improve address calculation, it may lead to cache inefficiencies for large arrays accessed non-sequentially. In these cases, optimizing for cache locality (by grouping elements into contiguous memory blocks) might be a better approach than strictly following the "power of 2" rule.

Integer Size Conversions

Converting an integer to a smaller size is done simply by ignoring the higher bits; there is no check for overflow, so there is no extra overhead. Converting to a larger size usually takes one or two clock cycles (example 7.22 in the optimizing_cpp.pdf referenced above).
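
A minimal sketch of both conversions (values chosen for illustration):

#include <cstdint>

int32_t wide   = 0x12345678;
int16_t narrow = (int16_t)wide;   // Higher bits ignored: keeps 0x5678, no overflow check
int32_t back   = narrow;          // Widening conversion: typically one or two clock cycles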

Converting Floating-Point to Integer

When converting a floating-point number to an integer, the process can be relatively slow.

Why It's Slow:

  • Truncation: When you convert a floating point number (like 3.14) to an integer (e.g., int), the fractional part is discarded, and only the whole part is retained. This requires the processor to change the rounding mode to truncation, perform the conversion, and then revert to the original rounding mode. This process can take around 50-100 clock cycles.
  • SSE2 Instruction Set: In modern processors (those supporting SSE2 or later instruction sets), the conversion process is much faster and can be optimized. If SSE2 or similar instructions are not enabled, the conversion might be slower.

Solutions to Optimize Floating-Point to Integer Conversion:

1. Avoid the Conversions by Using Different Types:

  • If possible, avoid converting floating-point numbers to integers in the critical parts of your code.
  • You can use integers throughout the code if fractional precision is not needed.
  • Alternatively, use fixed-point arithmetic (where you store the fractional part as an integer with a defined scaling factor).

2. Move the Conversions Out of the Innermost Loop:

  • If conversions cannot be avoided, try to move them outside of performance-critical loops.
  • For example, instead of converting a floating point value to an integer in every iteration of the loop, store intermediate results as floating point values and only convert them when necessary.

Example:

float arr[100];
int result[100];

// Instead of converting each value in a loop:
// for (int i = 0; i < 100; i++) {
//     result[i] = (int)arr[i]; // slow conversion
// }

// Convert the entire array outside of the innermost loop:
for (int i = 0; i < 100; i++) {
    arr[i] = arr[i] * 2.5f; // float operations
}

for (int i = 0; i < 100; i++) {
    result[i] = (int)arr[i]; // convert only once
}

3. Use 64-bit Mode or Enable SSE2:

  • 64-bit mode: Some processors and compilers can optimize certain operations, including floating-point to integer conversion, in 64-bit mode. This may be an option to explore depending on your platform.
  • SSE2: SSE2 (Streaming SIMD Extensions 2) is a set of SIMD (Single Instruction, Multiple Data) instructions available in modern processors that can perform floating-point to integer conversion much faster. Enabling this instruction set may significantly speed up the conversion process.

You can enable SSE2 in your compiler by specifying appropriate flags (e.g., -msse2 for GCC or Clang) and ensuring your processor supports it.

4. Use Rounding Instead of Truncation (Custom Round Function):

  • Rounding involves rounding the floating-point value to the nearest integer rather than just truncating it (removing the decimal part). This can sometimes be more efficient, especially when using custom assembly language functions optimized for specific architectures.
  • A custom rounding function written in assembly can optimize this operation, especially if the processor has specific instructions to perform rounding quickly (e.g., roundss in x86 processors with SSE).

Example using rounding (if available):

float f = 3.14f;
int i = (int)(f + 0.5f);  // Rounds to the nearest integer (correct only for non-negative values)

Using Assembly for Round Function: If rounding is required instead of truncation, you can write a custom round function using assembly language to take advantage of hardware optimizations for rounding operations. This requires knowledge of the target architecture and assembly instructions but can significantly reduce conversion time in critical sections.


Summary of Recommendations:

  1. Avoid conversions in performance-critical areas by using appropriate data types (e.g., use integers if you don't need the fractional part).
  2. Move conversions out of loops when possible to avoid performing them repeatedly.
  3. Use SSE2 or 64-bit mode to speed up conversions by enabling the right hardware instructions.
  4. Use rounding instead of truncation, and if necessary, write custom assembly code to optimize the rounding process on the target platform.

Type Punning

The expression *(int*)&x is a type punning operation in C or C++ that involves casting and dereferencing a pointer, and it can be broken down into the following parts:

  1. (int*)&x:

    • &x: This takes the address of the variable x. It gives a pointer to x.
    • (int*)&x: This is a cast operation that converts the pointer to the type int*. In other words, you're treating the address of x as if it points to an int instead of its actual type. This is an example of casting a pointer.
  2. *(int*)&x:

    • *: This dereferences the pointer, meaning it accesses the value stored at the address the pointer points to.
    • Since the pointer was cast to an int* type ((int*)&x), dereferencing it means accessing the data at that memory location as an integer (even if x is of another type).

What it does:

  • The expression *(int*)&x will treat the bytes of x as if they were an integer. It casts the address of x to an int* and then dereferences that pointer to obtain the value as an int.
  • This can be used to reinterpret the raw memory of a variable (which may be of a different type, such as float, double, etc.) as an integer.

Example:

Let’s assume x is a float:

float x = 3.14f;
int value = *(int*)&x;

  • Step 1: &x gives the address of x.
  • Step 2: (int*)&x casts the address of x to a pointer to int, which tells the program to treat the memory where x is stored as if it were an integer.
  • Step 3: *(int*)&x dereferences that pointer to retrieve the value at the memory location, but interpreted as an int rather than a float.

Why use this?

  • This is an example of type punning, which allows you to reinterpret the binary data of a variable as a different type. It is sometimes used for operations like bit-level manipulation or for cases where you need to understand how a variable is stored in memory (e.g., inspecting the raw binary representation of a float as an int).
  • Example use case: You might use this to examine the raw bit pattern of a floating-point number for tasks like serialization, network protocols, or debugging.
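
For instance, a well-defined way to inspect those raw bits (a minimal sketch using memcpy, which avoids the aliasing problems discussed below) is:

#include <stdio.h>
#include <string.h>

int main(void) {
    float x = 3.14f;
    unsigned int bits;
    memcpy(&bits, &x, sizeof bits);  // copy the raw bytes of the float
    printf("0x%08X\n", bits);        // prints 0x4048F5C3, the IEEE 754 encoding of 3.14f
    return 0;
}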

The example *(int*)&x |= 0x80000000; demonstrates a technique where a type-cast is applied to the address of a variable in order to treat it as if it were a different type—in this case, treating a float as an int to manipulate its raw bit pattern. Let's break this down in detail.

1. Type-casting the address of x:

  • &x: The address-of operator (&) is used to get the memory address of the variable x. This returns a pointer to the float x.
  • (int*)&x: The address of x (which is a pointer to float) is then type-cast to a pointer to int. This means that the compiler will treat the memory location of x as if it points to an int, even though x is a float.

2. Dereferencing the pointer:

  • *(int*)&x: The * operator dereferences the pointer (int*)&x, meaning that the value at that memory address is accessed and treated as an int. Instead of interpreting the value as a float, the program now treats the same memory as an integer. This step gives you access to the raw bit pattern of the float value.

3. Bitwise manipulation:

  • |= 0x80000000: The bitwise OR assignment (|=) is then used to set the sign bit of the float. The hexadecimal value 0x80000000 corresponds to a 32-bit number where the most significant bit (the sign bit in the IEEE 754 representation of a float) is set to 1. In the IEEE 754 standard:
    • Sign bit: 1 represents negative, 0 represents positive.
    • By using |= 0x80000000, you're setting the sign bit to 1, effectively making the float negative, regardless of its original value.

Explanation of the Effect:

  • This technique is often used to manipulate the bits of a float directly. It can be faster than a higher-level expression like x = -fabsf(x); because it modifies the bits in place without calling an absolute-value routine or performing floating-point arithmetic.

Example:

Suppose x is a float with the value 3.14f. The code *(int*)&x |= 0x80000000; directly changes the bit pattern to represent -3.14f, without any arithmetic, by setting the sign bit to 1, which corresponds to a negative value.
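
Concretely (a non-portable sketch of exactly what the text describes; see the dangers below):

float x = 3.14f;
*(int*)&x |= 0x80000000;  // set the IEEE 754 sign bit in place (violates strict aliasing)
// x now compares equal to -3.14f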

Potential Dangers and Issues:

  1. Violation of Strict Aliasing Rule:

    • The strict aliasing rule in C/C++ lets the compiler assume that pointers to incompatible types (with narrow exceptions such as character pointers) never refer to the same object.
    • Here the code accesses a float through an int* lvalue, which is type punning (reading the same memory as a different type). This violates the strict aliasing rule, so the compiler may optimize on the assumption that the write never happens, resulting in undefined behavior.
    • For example, the compiler might keep the float in a register while the integer write goes to memory, so the modification is silently lost. Bugs like this are hard to trace because they appear or disappear with optimization level.

    Safe Alternative: You can use a union to access the same memory as different types. In C (C99 and later), reading a union member other than the one last written is well-defined; in standard C++ it is technically undefined (though most compilers support it as an extension), so memcpy or C++20's std::bit_cast is the portable choice there:

    union FloatIntUnion {
        float f;
        unsigned int i;   // assumes 32-bit float and int
    };
    
    FloatIntUnion u;
    u.f = 3.14f;
    u.i |= 0x80000000u; // set the sign bit: u.f now reads back as -3.14f
    
  2. Incompatibility with Larger Types:

    • If the sizes of the types don't match (for example, if int is wider than float), the cast reads or writes the wrong number of bytes and produces incorrect results. The code above assumes both int and float are 32 bits (typical on x86 systems).
    • On platforms where the sizes differ, use fixed-width types such as uint32_t from <stdint.h> instead of int.
  3. Endianness Issues:

    • Endianness refers to the byte order in which data is stored in memory. If you access only part of a variable (such as the upper 32 bits of a 64-bit double), which half you get depends on whether the system is little-endian or big-endian.
    • On a big-endian system the halves are swapped relative to little-endian x86, so code that indexes into the bytes of a variable is not portable across hardware platforms (see the sketch after this list).
  4. CPU Optimization and Performance:

    • When you read or write only part of a variable, the CPU may not handle it as efficiently as accessing the whole variable at once.
    • For example, writing 32 bits of a 64-bit double and then reading the full 64-bit value can cause a store-forwarding stall: the pending narrow store cannot be forwarded directly to the wider load, so the load must wait for the store to reach the cache. This costs many extra cycles on most CPUs.
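
To make the endianness point from item 3 concrete, here is a small sketch (using memcpy to sidestep the aliasing issue) showing that which 32-bit half of a double you get depends on the platform's byte order:

#include <stdint.h>
#include <string.h>

double d = 1.0;                 // IEEE 754 encoding: 0x3FF0000000000000
uint32_t halves[2];
memcpy(halves, &d, sizeof d);
// On little-endian x86: halves[0] == 0x00000000, halves[1] == 0x3FF00000.
// On a big-endian machine the two halves are swapped.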

Summary:

The code *(int*)&x |= 0x80000000; is a low-level, performance-oriented technique used to manipulate the bits of a floating-point number directly by treating the memory of a float as if it were an int. It sets the sign bit of x to 1, making the number negative.

However, as detailed above, this approach carries several risks: it violates the strict aliasing rule (undefined behavior), it assumes int and float are both 32 bits, its behavior when accessing part of a value depends on endianness, and partial accesses can trigger store-forwarding stalls. For portable code, prefer a union (in C) or memcpy.

A switch statement allows multiple branches. It is most efficient when the case labels are sequential (each label incrementing by one), because the compiler can implement it as a jump table: a single indexed jump replaces a chain of comparisons. If the case labels are widely spaced, the compiler must instead generate a branch tree, which is less efficient, so in that case the number of branches should be kept small.
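
As an illustrative sketch (code generation varies by compiler), dense labels like these typically become a jump table, while widely spaced labels fall back to a compare-and-branch tree:

// Labels 0..3 are sequential: most compilers emit a jump table.
int dispatch(int op, int x) {
    switch (op) {
        case 0: return x + 1;
        case 1: return x - 1;
        case 2: return x * 2;
        case 3: return x / 2;
        default: return x;
    }
}
// Labels such as 1, 100, 10000 are sparse: the compiler emits
// a tree of comparisons and branches instead.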

  • Use memset and memcpy for filling and copying blocks of memory: library implementations are heavily optimized for the target CPU and usually beat hand-written loops (see the sketch after this list).

  • __fastcall changes the function calling convention in 32-bit mode so that the first two integer parameters are passed in registers rather than on the stack. This can improve the speed of functions with integer parameters (https://www.agner.org/optimize/optimizing_cpp.pdf).

  • The return type of a function should preferably be a simple type, a pointer, a reference, or void. Returning objects of a composite type is more complex and often inefficient.

  • Unions can also be used to access the same data in different ways, much like the type-punning examples above (https://stackoverflow.com/questions/252552/why-do-we-need-c-unions).
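
A minimal sketch of the memset/memcpy point above (array names and sizes are arbitrary for illustration):

#include <string.h>

float src[1024], dst[1024];   // assume src has been filled with data

memset(dst, 0, sizeof dst);    // zero-fill the whole block in one optimized call
memcpy(dst, src, sizeof src);  // bulk copy instead of an element-by-element loop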