Lua - Performance & Optimization

Overview

Estimated time: 35–40 minutes

Writing efficient Lua code requires understanding how Lua works internally and applying optimization techniques. This tutorial covers performance analysis, optimization strategies, profiling tools, and the LuaJIT compiler for maximum performance.

Learning Objectives

  • Understand Lua's performance characteristics and bottlenecks
  • Apply optimization techniques for tables, functions, and loops
  • Use profiling tools to identify performance issues
  • Leverage LuaJIT for high-performance applications
  • Implement best practices for memory and CPU efficiency

Prerequisites

  • Strong understanding of Lua tables, functions, and control flow
  • Knowledge of Lua's garbage collection concepts
  • Basic understanding of algorithm complexity

Performance Fundamentals

Understanding Lua's execution model is key to optimization:

-- Local vs Global Variable Access
local start_time = os.clock()

-- Slow: Global variable access
function slow_global_access()
    for i = 1, 1000000 do
        math.sin(i)  -- Global lookup for 'math' every time
    end
end

-- Fast: Local variable access
function fast_local_access()
    local sin = math.sin  -- Cache the function locally
    for i = 1, 1000000 do
        sin(i)  -- Direct local access
    end
end

-- Timing test
local function time_function(func, name)
    local start = os.clock()
    func()
    local duration = os.clock() - start
    print(string.format("%s took %.4f seconds", name, duration))
end

time_function(slow_global_access, "Global access")
time_function(fast_local_access, "Local access")

Expected Output (representative; absolute timings vary by machine and Lua version):

Global access took 0.1234 seconds
Local access took 0.0678 seconds

Table Optimization

Tables are fundamental to Lua performance:

Array vs Hash Performance

-- Array part vs hash part performance
local function test_array_vs_hash()
    local iterations = 1000000
    
    -- Array part (sequential integer keys starting from 1)
    local array = {}
    local start_time = os.clock()
    for i = 1, iterations do
        array[i] = i * 2
    end
    local array_time = os.clock() - start_time
    
    -- Hash part (non-sequential or non-integer keys)
    local hash = {}
    start_time = os.clock()
    for i = 1, iterations do
        hash["key" .. i] = i * 2
    end
    local hash_time = os.clock() - start_time
    
    print(string.format("Array insertion: %.4f seconds", array_time))
    print(string.format("Hash insertion: %.4f seconds", hash_time))
    print(string.format("Array is %.2fx faster", hash_time / array_time))
end

test_array_vs_hash()

Table Preallocation

-- Efficient table preallocation
local function compare_table_growth()
    local size = 100000
    
    -- Growing table (inefficient)
    local start_time = os.clock()
    local growing_table = {}
    for i = 1, size do
        growing_table[i] = i
    end
    local grow_time = os.clock() - start_time
    
    -- Stock Lua has no public preallocation API, and two popular "tricks"
    -- do not work: {table.unpack({}, 1, size)} fails for large sizes
    -- ("too many results to unpack"), and assigning nil never grows a
    -- table.  A portable workaround is to prefill with a placeholder,
    -- which sizes the array part before the real data is written:
    start_time = os.clock()
    local prefilled = {}
    for i = 1, size do
        prefilled[i] = false  -- placeholder; nil would not grow the table
    end
    for i = 1, size do
        prefilled[i] = i
    end
    local prefill_time = os.clock() - start_time
    
    print(string.format("Growing table: %.4f seconds", grow_time))
    print(string.format("Prefilled: %.4f seconds", prefill_time))
    -- The prefill pass costs time itself, so this pays off mainly when a
    -- table is refilled repeatedly or filled in non-sequential order.
    -- LuaJIT offers true preallocation via the table.new extension.
end

compare_table_growth()
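On LuaJIT, true preallocation is available through the `table.new` extension module (this sketch assumes LuaJIT 2.1+, where the module ships by default; the require is guarded so the code still runs on standard Lua):

```lua
-- LuaJIT-only: table.new(narray, nhash) preallocates both table parts.
-- On standard Lua the require fails and we fall back to a plain table.
local ok, table_new = pcall(require, "table.new")

local function make_squares(n)
    local t = ok and table_new(n, 0) or {}
    for i = 1, n do
        t[i] = i * i
    end
    return t
end

local pts = make_squares(1000)
print(#pts, pts[10])  -- 1000    100
```

The fallback keeps the function portable: the preallocation is purely an optimization hint, not a behavioral change.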

Function Call Optimization

Minimize function call overhead:

-- Function call overhead
local function compare_function_calls()
    local iterations = 1000000
    
    -- Regular function calls
    local function add(a, b)
        return a + b
    end
    
    local start_time = os.clock()
    local result = 0
    for i = 1, iterations do
        result = add(result, i)
    end
    local func_time = os.clock() - start_time
    
    -- Inlined operations
    start_time = os.clock()
    result = 0
    for i = 1, iterations do
        result = result + i  -- Inlined
    end
    local inline_time = os.clock() - start_time
    
    print(string.format("Function calls: %.4f seconds", func_time))
    print(string.format("Inlined code: %.4f seconds", inline_time))
    print(string.format("Inline is %.2fx faster", func_time / inline_time))
end

compare_function_calls()

-- Avoiding closures in loops
local function closure_performance()
    local iterations = 100000
    local functions = {}
    
    -- Inefficient: creating a fresh closure per iteration
    -- (this measures allocation cost; the closures are never called)
    local start_time = os.clock()
    for i = 1, iterations do
        functions[i] = function() return i * 2 end
    end
    local closure_time = os.clock() - start_time
    
    -- Efficient: Reuse function with parameter
    local function multiplier(x)
        return x * 2
    end
    
    start_time = os.clock()
    local values = {}
    for i = 1, iterations do
        values[i] = multiplier(i)
    end
    local reuse_time = os.clock() - start_time
    
    print(string.format("Closures: %.4f seconds", closure_time))
    print(string.format("Reused function: %.4f seconds", reuse_time))
end

closure_performance()

Loop Optimization

Optimize common loop patterns:

-- Loop optimization techniques
local function loop_optimizations()
    local data = {}
    for i = 1, 10000 do
        data[i] = i
    end
    
    -- Note: in a numeric for, the limit expression is evaluated only once,
    -- so "for i = 1, #data" is already fine.  The repeated-evaluation trap
    -- applies to while loops and to using # inside the loop body:
    local start_time = os.clock()
    local sum = 0
    local i = 1
    while i <= #data do  -- #data re-evaluated every iteration
        sum = sum + data[i]
        i = i + 1
    end
    local slow_time = os.clock() - start_time
    
    -- Efficient: cache the length once
    start_time = os.clock()
    sum = 0
    local n = #data  -- Calculate once
    for i = 1, n do
        sum = sum + data[i]
    end
    local fast_time = os.clock() - start_time
    
    -- Idiomatic: ipairs (LuaJIT compiles it as fast as a numeric for;
    -- on standard Lua a numeric for is usually at least as fast)
    start_time = os.clock()
    sum = 0
    for _, value in ipairs(data) do
        sum = sum + value
    end
    local ipairs_time = os.clock() - start_time
    
    print(string.format("Length in condition: %.6f seconds", slow_time))
    print(string.format("Cached length: %.6f seconds", fast_time))
    print(string.format("Using ipairs: %.6f seconds", ipairs_time))
end

loop_optimizations()

-- String concatenation optimization
local function string_concat_optimization()
    local pieces = {}
    for i = 1, 1000 do
        pieces[i] = "part" .. i
    end
    
    -- Inefficient: String concatenation in loop
    local start_time = os.clock()
    local result = ""
    for i = 1, #pieces do
        result = result .. pieces[i]  -- Creates new string each time
    end
    local concat_time = os.clock() - start_time
    
    -- Efficient: table.concat
    start_time = os.clock()
    result = table.concat(pieces)  -- Single operation
    local table_concat_time = os.clock() - start_time
    
    print(string.format("String concat: %.6f seconds", concat_time))
    print(string.format("table.concat: %.6f seconds", table_concat_time))
    print(string.format("table.concat is %.0fx faster", 
          concat_time / table_concat_time))
end

string_concat_optimization()
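When the pieces themselves are produced incrementally, the same principle applies: append to a buffer table as you go and join once at the end. A short sketch (the CSV layout is just an illustration):

```lua
-- Build a string incrementally without O(n^2) concatenation:
-- collect the pieces in a table, then join them in one pass.
local buffer = {}
for i = 1, 5 do
    buffer[#buffer + 1] = tostring(i * i)
end
local csv = table.concat(buffer, ",")
print(csv)  -- 1,4,9,16,25
```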

Memory Optimization

Manage memory efficiently:

-- Memory usage optimization
local function memory_optimization_demo()
    -- Inefficient: Storing unnecessary data
    local inefficient_data = {}
    for i = 1, 1000 do
        inefficient_data[i] = {
            id = i,
            name = "Item " .. i,
            description = "This is item number " .. i,
            timestamp = os.time(),
            metadata = {
                created_by = "system",
                version = 1.0,
                tags = {"tag1", "tag2", "tag3"}
            }
        }
    end
    
    -- Efficient: Store only necessary data
    local efficient_data = {}
    for i = 1, 1000 do
        efficient_data[i] = {
            i,  -- id (position 1)
            "Item " .. i,  -- name (position 2)
            os.time()  -- timestamp (position 3)
            -- Store metadata separately if needed
        }
    end
    
    -- Even more efficient: cache repeated strings.  Lua already interns
    -- strings internally, so identical strings share storage; what a cache
    -- avoids is re-running the concatenation that rebuilds the same string:
    local name_cache = {}
    local function item_name(k)
        local s = name_cache[k]
        if not s then
            s = "Item " .. k
            name_cache[k] = s
        end
        return s
    end
    
    local cached_data = {}
    for i = 1, 1000 do
        cached_data[i] = {
            i,
            item_name(i % 10),  -- only 10 distinct names are ever built
            os.time()
        }
    end
    
    print("Memory optimization examples created")
    print("Inefficient data uses more memory per record")
    print("Caching repeated strings avoids redundant concatenation work")
end

memory_optimization_demo()
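A cache like the one above grows without bound. If cached objects should remain collectible, a weak-valued table (via the standard `__mode` metafield) lets the garbage collector reclaim entries nothing else references. A sketch; note that strings are never removed from weak tables, so this pattern applies to tables and userdata:

```lua
-- Weak-valued cache: a value can be collected once no other reference
-- to it exists, so the cache cannot leak indefinitely.
local cache = setmetatable({}, {__mode = "v"})

local function get_point(x, y)
    local key = x .. ":" .. y
    local p = cache[key]
    if not p then
        p = {x = x, y = y}
        cache[key] = p
    end
    return p
end

local a = get_point(1, 2)
local b = get_point(1, 2)
print(a == b)  -- true: both calls return the same cached object
```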

-- Garbage collection hints
local function gc_optimization()
    print("Before optimization:", collectgarbage("count"), "KB")
    
    -- Create some temporary data
    local temp_data = {}
    for i = 1, 100000 do
        temp_data[i] = "temporary data " .. i
    end
    
    print("After creating data:", collectgarbage("count"), "KB")
    
    -- Clear references
    temp_data = nil
    
    -- Suggest garbage collection (don't force it frequently)
    collectgarbage("collect")
    
    print("After cleanup:", collectgarbage("count"), "KB")
end

gc_optimization()
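Beyond one-off collections, the collector's pacing can be tuned with the standard "setpause" and "setstepmul" options of collectgarbage. The values below are illustrative, not recommendations; always benchmark against your own workload:

```lua
-- "setpause" sets how long the GC waits (as a % of memory in use)
-- before starting a new cycle; "setstepmul" sets how much work each
-- incremental step does.  Both return the previous value, so the old
-- settings can be restored afterwards.
local old_pause = collectgarbage("setpause", 150)    -- start cycles sooner
local old_step  = collectgarbage("setstepmul", 300)  -- do more work per step

-- Allocation-heavy workload runs under the tuned collector
local junk = {}
for i = 1, 10000 do junk[i] = {i} end
junk = nil
collectgarbage("collect")

-- Restore the previous configuration
collectgarbage("setpause", old_pause)
collectgarbage("setstepmul", old_step)
print("GC tuned and restored")
```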

Profiling and Measurement

Tools and techniques for performance analysis:

-- Simple profiler
local Profiler = {}
Profiler.__index = Profiler

function Profiler:new()
    local obj = {
        times = {},
        counts = {},
        start_times = {}
    }
    setmetatable(obj, self)
    return obj
end

function Profiler:start(name)
    self.start_times[name] = os.clock()
end

function Profiler:stop(name)
    if not self.start_times[name] then
        error("Profiler: No start time for " .. name)
    end
    
    local duration = os.clock() - self.start_times[name]
    self.times[name] = (self.times[name] or 0) + duration
    self.counts[name] = (self.counts[name] or 0) + 1
    self.start_times[name] = nil
end

function Profiler:report()
    print("\n=== Profiler Report ===")
    local sorted_names = {}
    for name in pairs(self.times) do
        table.insert(sorted_names, name)
    end
    
    table.sort(sorted_names, function(a, b)
        return self.times[a] > self.times[b]
    end)
    
    for _, name in ipairs(sorted_names) do
        local total_time = self.times[name]
        local count = self.counts[name]
        local avg_time = total_time / count
        
        print(string.format("%-20s: %8.4fs total, %6d calls, %8.6fs avg",
              name, total_time, count, avg_time))
    end
    print("========================\n")
end

-- Usage example
local profiler = Profiler:new()

local function expensive_operation(n)
    profiler:start("expensive_operation")
    local result = 0
    for i = 1, n do
        result = result + math.sin(i) * math.cos(i)
    end
    profiler:stop("expensive_operation")
    return result
end

local function fast_operation(n)
    profiler:start("fast_operation")
    local result = n * (n + 1) / 2  -- Simple arithmetic
    profiler:stop("fast_operation")
    return result
end

-- Profile different operations
for i = 1, 5 do
    expensive_operation(10000)
    fast_operation(10000)
end

profiler:report()
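The start/stop profiler above requires instrumenting each function by hand. A lighter alternative is a sampling profiler built on the standard `debug.sethook` count hook: every N VM instructions, record which function is currently executing. A sketch; sample counts are approximate by nature:

```lua
-- Sampling profiler: the hook fires every 1000 VM instructions and
-- records the interrupted function (stack level 2 from inside the hook).
local samples = {}

local function hook()
    local info = debug.getinfo(2, "Sn")
    if info then
        local key = (info.name or "?") .. " @ " ..
                    info.short_src .. ":" .. (info.linedefined or 0)
        samples[key] = (samples[key] or 0) + 1
    end
end

debug.sethook(hook, "", 1000)

-- Workload to sample
local total = 0
for i = 1, 200000 do
    total = total + math.sqrt(i)
end

debug.sethook()  -- remove the hook

for key, count in pairs(samples) do
    print(string.format("%6d samples  %s", count, key))
end
```

Because the hook only fires periodically, the overhead is far lower than instrumenting every call, at the cost of statistical rather than exact timings.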

LuaJIT Optimization

Leverage LuaJIT's just-in-time compilation:

-- LuaJIT-specific optimizations
-- Guard the require: standard Lua has no ffi module, so an unguarded
-- require("ffi") would error before any availability check can run
local ok, ffi = pcall(require, "ffi")
if not ok then ffi = nil end

if ffi then
    -- Define a C structure for compact, unboxed storage
    ffi.cdef[[
    typedef struct {
        double x, y, z;
    } point3d_t;
    ]]
end

local function luajit_optimization_demo()
    -- Regular Lua table approach
    local function create_points_lua(n)
        local points = {}
        for i = 1, n do
            points[i] = {x = i, y = i * 2, z = i * 3}
        end
        return points
    end
    
    -- LuaJIT FFI approach (much faster)
    local function create_points_ffi(n)
        local points = ffi.new("point3d_t[?]", n)
        for i = 0, n - 1 do
            points[i].x = i + 1
            points[i].y = (i + 1) * 2
            points[i].z = (i + 1) * 3
        end
        return points
    end
    
    -- Vector operations - LuaJIT optimizable
    local function vector_operations(points, n)
        local sum = 0
        for i = 1, n do
            local p = points[i]
            sum = sum + p.x * p.x + p.y * p.y + p.z * p.z
        end
        return sum
    end
    
    local function vector_operations_ffi(points, n)
        local sum = 0
        for i = 0, n - 1 do
            local p = points[i]
            sum = sum + p.x * p.x + p.y * p.y + p.z * p.z
        end
        return sum
    end
    
    local n = 100000
    
    -- Benchmark Lua tables
    local start_time = os.clock()
    local lua_points = create_points_lua(n)
    local lua_result = vector_operations(lua_points, n)
    local lua_time = os.clock() - start_time
    
    -- Benchmark FFI structures (LuaJIT only)
    if ffi then
        start_time = os.clock()
        local ffi_points = create_points_ffi(n)
        local ffi_result = vector_operations_ffi(ffi_points, n)
        local ffi_time = os.clock() - start_time
        
        print(string.format("Lua tables: %.4f seconds", lua_time))
        print(string.format("FFI structs: %.4f seconds", ffi_time))
        print(string.format("FFI is %.2fx faster", lua_time / ffi_time))
    end
end

-- Only run the LuaJIT demo if FFI is available
if ffi then
    luajit_optimization_demo()
else
    print("FFI not available (not running LuaJIT)")
end

Best Practices Summary

Do's

  • Cache frequently accessed globals: Store math.sin, table.insert in locals
  • Use local variables: Much faster than global access
  • Preallocate tables: When you know the approximate size
  • Use table.concat: For string building instead of .. operator
  • Profile your code: Measure before optimizing
  • Use ipairs or a numeric for over arrays: Both are fast (LuaJIT compiles both well); avoid pairs for sequential data
  • Minimize function call overhead: Inline simple operations when performance critical
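Several of the do's above combine naturally in ordinary code. A small sketch applying local caching of globals, sequential array writes, and table.concat together (the `render_ids` helper and its "id-" format are illustrative, not from any library):

```lua
-- Several best practices in one function:
local floor  = math.floor    -- cache globals in locals
local concat = table.concat

local function render_ids(n)
    local parts = {}
    for i = 1, n do
        parts[i] = "id-" .. floor(i / 2)  -- sequential writes use the array part
    end
    return concat(parts, ",")  -- one join instead of n concatenations
end

print(render_ids(4))  -- id-0,id-1,id-1,id-2
```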

Don'ts

  • Don't optimize prematurely: Profile first, then optimize the bottlenecks
  • Don't call collectgarbage() frequently: Let Lua manage memory
  • Don't create functions in tight loops: Reuse functions when possible
  • Don't use string concatenation in loops: Use table.concat instead
  • Don't ignore table array/hash distinction: Sequential integer keys are faster

Performance Testing Framework

-- Micro-benchmark framework
local function benchmark(name, func, iterations)
    iterations = iterations or 1000000
    
    -- Warm up
    for i = 1, 100 do
        func()
    end
    
    -- Measure
    local start_time = os.clock()
    for i = 1, iterations do
        func()
    end
    local duration = os.clock() - start_time
    
    print(string.format("%-25s: %.6f seconds (%d iterations)", 
          name, duration, iterations))
    return duration
end

-- Example usage
local data = {}
for i = 1, 1000 do
    data[i] = i
end

benchmark("table access", function()
    local sum = 0
    for i = 1, #data do
        sum = sum + data[i]
    end
end)

benchmark("ipairs iteration", function()
    local sum = 0
    for i, v in ipairs(data) do
        sum = sum + v
    end
end)

Common Pitfalls

  • Optimizing before profiling - measure first, optimize the bottlenecks
  • Over-optimizing non-critical code - focus on hot paths
  • Ignoring memory allocation patterns - preallocate when possible
  • Using wrong table access patterns - arrays vs hash tables
  • Not considering LuaJIT-specific optimizations if using LuaJIT
  • Micro-optimizations that hurt readability without significant gain

Checks for Understanding

  1. Why are local variables faster than global variables in Lua?
  2. What's the difference between array part and hash part of a Lua table?
  3. When should you use table.concat instead of string concatenation?
  4. What are the benefits of preallocating tables?
  5. How does LuaJIT improve performance compared to standard Lua?
  6. What's the first step before optimizing any code?
Answers
  1. Local variables are stored in registers/stack, while globals require hash table lookup in the global environment.
  2. Array part uses direct indexing (faster) for sequential integer keys starting from 1. Hash part uses hash table lookup for other keys.
  3. When building strings in loops or concatenating many strings - table.concat is O(n) while repeated .. is O(n²).
  4. Avoids table resizing overhead during growth, reduces memory fragmentation, and improves cache locality.
  5. LuaJIT uses just-in-time compilation to machine code, FFI for C struct access, and specialized optimizations for number-heavy code.
  6. Profile the code to identify actual bottlenecks - don't optimize based on assumptions.

Next Steps

Performance optimization is an iterative process. Start with profiling to identify bottlenecks, apply appropriate optimizations, and measure results. Remember that readable, maintainable code is often more valuable than micro-optimizations. Next, learn debugging techniques to troubleshoot performance and correctness issues.