Overview
Chapter 34 surveyed the GPU execution model, address spaces, and dispatch planning; now we build an end-to-end workload that starts from Zig source and ends with a verified binary dump, ready to submit to a Vulkan or WebGPU queue family. The project stitches together three pieces: a SPIR-V kernel written in pure Zig, a host-side orchestration CLI with a CPU fallback, and a diff utility that compares captured GPU buffers against expected results.
Learning Goals
- Translate a Zig compute kernel to SPIR-V with the self-hosted backend and understand the resource layout it expects.
- Orchestrate buffers, dispatch geometry, and verification paths from a host application that can run with or without GPU access.
- Build lightweight diagnostics that judge GPU output against a deterministic CPU reference.
Building the Compute Pipeline
Our workload squares the elements of a vector. The host creates the submission metadata and data buffers, the kernel squares each lane, and the diff tool validates a device capture. A static lane capacity mirrors the GPU storage-buffer layout, while the host enforces the logical bounds so the kernel can bail out when extra threads are dispatched.
Topology and Data Flow
The dispatch is deliberately modest (1,000 elements in 64-thread workgroups) so you can focus on correctness rather than occupancy tuning: ceil(1000 / 64) = 16 groups, or 1,024 invocations with a 24-lane idle tail. The host seeds random floating-point values, records a checksum for observability, and emits binary blobs that downstream tooling or a real GPU driver can reuse. Because storage buffers operate on raw bytes, each pointer parameter is paired with an extern struct to guarantee layout parity with the descriptor tables.
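One inexpensive way to enforce that parity is a comptime layout check compiled on the host side. The following is a minimal sketch, assuming the structs mirror the Submission and VectorPayload declarations introduced in the next section; the asserted sizes follow from the 1,024-lane capacity.
// Layout-parity sketch: these declarations are assumed to mirror the kernel's
// `Submission` and `VectorPayload`; adjust the expected sizes if yours differ.
const std = @import("std");

const lane_capacity: u32 = 1024;

const Submission = extern struct {
    len: u32,
    _padding: u32 = 0,
};

const VectorPayload = extern struct {
    values: [lane_capacity]f32,
};

comptime {
    // 4-byte len + 4-byte padding = 8 bytes; the payload is 1024 × 4 bytes.
    // A failed assert aborts compilation instead of producing a binary whose
    // buffers silently disagree with the descriptor table.
    std.debug.assert(@offsetOf(Submission, "len") == 0);
    std.debug.assert(@sizeOf(Submission) == 8);
    std.debug.assert(@sizeOf(VectorPayload) == lane_capacity * @sizeOf(f32));
}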
Writing the SPIR-V Kernel
The kernel receives three storage buffers: a submission header describing the logical length, the input vector, and the output vector. Each invocation reads one lane, squares it, and writes the result back when it is within bounds. The defensive check keeps stray workgroups from touching memory beyond the logical length, a common hazard when the host pads the dispatch to match wavefront granularity.
//! SPIR-V Kernel: element-wise vector squaring
//!
//! This kernel expects three storage buffers: an input vector (`in_values`), an
//! output vector (`out_values`), and a descriptor struct `Submission` that
//! communicates the logical element count. Each invocation squares one element
//! and writes the result back to `out_values`.
const builtin = @import("builtin");
/// Maximum number of elements the kernel will touch.
pub const lane_capacity: u32 = 1024;
/// Submission header shared between the host and the kernel.
///
/// The `extern` layout ensures the struct matches bindings created by Vulkan or
/// WebGPU descriptor tables.
const Submission = extern struct {
/// Logical element count requested by the host.
len: u32,
_padding: u32 = 0,
};
/// Storage buffer layout expected by the kernel.
const VectorPayload = extern struct {
values: [lane_capacity]f32,
};
/// Squares each element of `in_values` and writes the result to `out_values`.
///
/// The kernel is written defensively: it checks both the logical length passed
/// by the host and the static `lane_capacity` to avoid out-of-bounds writes when
/// dispatched with more threads than necessary.
pub export fn squareVector(
submission: *addrspace(.storage_buffer) const Submission,
in_values: *addrspace(.storage_buffer) const VectorPayload,
out_values: *addrspace(.storage_buffer) VectorPayload,
) callconv(.kernel) void {
const group_index = @workGroupId(0);
const group_width = @workGroupSize(0);
const local_index = @workItemId(0);
const linear = group_index * group_width + local_index;
const logical_len = submission.len;
if (linear >= logical_len or linear >= lane_capacity) return;
const value = in_values.*.values[linear];
out_values.*.values[linear] = value * value;
}
// Guard compilation so this file is only compiled when targeting SPIR-V.
comptime {
switch (builtin.target.cpu.arch) {
.spirv32, .spirv64 => {},
else => @compileError("squareVector must be compiled with a SPIR-V target, e.g. -target spirv32-vulkan-none"),
}
}
$ zig build-obj -fno-llvm -O ReleaseSmall -target spirv32-vulkan-none \
-femit-bin=kernels/vector_square.spv \
01_vector_square_kernel.zig
no output (binary module generated)
Delete kernels/vector_square.spv when you finish experimenting so that repeated runs always rebuild the shader from source.
Host Orchestration and the CPU Fallback
The host CLI plans the dispatch, seeds deterministic inputs, runs the CPU fallback, and, on request, writes a reference dump to out/reference.bin. It also validates the SPIR-V header (0x07230203) so a corrupted build surfaces immediately instead of failing deep inside a graphics API. An optional hook lets you drop in a captured GPU buffer (out/gpu_result.bin) for comparison.
// Project host pipeline for the vector-square kernel.
//
// This program demonstrates the CPU orchestration that pairs with the
// `squareVector` SPIR-V kernel. It prepares input data, plans a dispatch,
// validates the compiled shader module, and runs a CPU fallback that mirrors the
// GPU algorithm. When requested via `--emit-binary`, it also writes the CPU
// output to `out/reference.bin` so external GPU runs can be compared bit-for-bit.
const std = @import("std");
/// Must match `lane_capacity` in 01_vector_square_kernel.zig.
const lane_capacity: u32 = 1024;
const default_problem_len: u32 = 1000;
const workgroup_size: u32 = 64;
const spirv_path = "kernels/vector_square.spv";
const gpu_dump_path = "out/gpu_result.bin";
const cpu_dump_path = "out/reference.bin";
/// Encapsulates the GPU workgroup dispatch geometry, accounting for padding
/// when the total workload doesn't evenly divide into workgroup boundaries.
const DispatchPlan = struct {
workgroup_size: u32,
group_count: u32,
/// Total invocations including padding to fill complete workgroups
padded_invocations: u32,
/// Number of unused lanes in the final workgroup
tail: u32,
};
/// Tracks a validated SPIR-V module alongside its filesystem path for diagnostics.
const ModuleInfo = struct {
path: []const u8,
bytes: []u8,
};
pub fn main() !void {
// Initialize allocator with leak detection for development builds
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer switch (gpa.deinit()) {
.ok => {},
.leak => std.log.err("general-purpose allocator detected a leak", .{}),
};
const allocator = gpa.allocator();
// Parse command-line arguments for optional flags
var args = try std.process.argsWithAllocator(allocator);
defer args.deinit();
_ = args.next(); // skip program name
var emit_binary = false;
var logical_len: u32 = default_problem_len;
while (args.next()) |arg| {
if (std.mem.eql(u8, arg, "--emit-binary")) {
emit_binary = true;
} else if (std.mem.eql(u8, arg, "--length")) {
const value = args.next() orelse return error.MissingLengthValue;
logical_len = try std.fmt.parseInt(u32, value, 10);
} else {
return error.UnknownFlag;
}
}
// Clamp user-provided length to prevent buffer overruns in the kernel
if (logical_len == 0 or logical_len > lane_capacity) {
std.log.warn("clamping problem length to lane capacity ({d})", .{lane_capacity});
logical_len = @min(lane_capacity, logical_len);
if (logical_len == 0) logical_len = @min(lane_capacity, default_problem_len);
}
// Calculate how many workgroups we need to process this many elements
const plan = computeDispatch(logical_len, workgroup_size);
std.debug.print(
"launch plan: {d} groups × {d} lanes => {d} invocations (tail {d})\n",
.{ plan.group_count, plan.workgroup_size, plan.padded_invocations, plan.tail },
);
// Use deterministic PRNG for reproducible test runs across environments
var prng = std.Random.DefaultPrng.init(0xBEEFFACE);
const random = prng.random();
// Generate input data with a predictable pattern plus random noise
var input = try allocator.alloc(f32, logical_len);
defer allocator.free(input);
for (input, 0..input.len) |*slot, idx| {
const base: f32 = @floatFromInt(idx);
slot.* = base * 0.5 + random.float(f32);
}
// Execute CPU reference implementation to produce expected results
var cpu_output = try allocator.alloc(f32, logical_len);
defer allocator.free(cpu_output);
runCpuFallback(input, cpu_output);
// Compute simple checksum for quick sanity verification
const checksum = checksumSlice(cpu_output);
std.debug.print("cpu fallback checksum: {d:.6}\n", .{checksum});
// Attempt to load and validate the compiled SPIR-V shader module
const module = try loadSpirvIfPresent(allocator, spirv_path);
defer if (module) |info| allocator.free(info.bytes);
if (module) |info| {
std.debug.print(
"gpu module: {s} ({d} bytes, header ok)\n",
.{ info.path, info.bytes.len },
);
} else {
std.debug.print(
"gpu module: missing ({s}); run kernel build command to generate it\n",
.{spirv_path},
);
}
// Check if a GPU execution captured output for comparison
const maybe_gpu_dump = try loadBinaryIfPresent(allocator, gpu_dump_path);
defer if (maybe_gpu_dump) |blob| allocator.free(blob);
if (maybe_gpu_dump) |blob| {
// Compare GPU results against CPU reference lane-by-lane
const mismatches = compareF32Slices(cpu_output, blob);
std.debug.print(
"gpu capture diff: {d} mismatched lanes\n",
.{mismatches},
);
} else {
std.debug.print(
"gpu capture diff: skipped (no {s} file found)\n",
.{gpu_dump_path},
);
}
// Display first few lanes for manual inspection
const sample_count = @min(input.len, 6);
for (input[0..sample_count], cpu_output[0..sample_count], 0..) |original, squared, idx| {
std.debug.print(
"lane {d:>3}: in={d:.5} out={d:.5}\n",
.{ idx, original, squared },
);
}
// Write reference dump if requested for external GPU validation tools
if (emit_binary) {
try emitCpuDump(cpu_output);
std.debug.print("cpu reference written to {s}\n", .{cpu_dump_path});
}
}
/// Computes dispatch geometry by rounding up to complete workgroups.
/// Returns the number of groups, total padded invocations, and unused tail lanes.
fn computeDispatch(total_items: u32, group_size: u32) DispatchPlan {
std.debug.assert(group_size > 0);
// Divide ceiling to ensure all items are covered
const groups = std.math.divCeil(u32, total_items, group_size) catch unreachable;
const padded = groups * group_size;
return .{
.workgroup_size = group_size,
.group_count = groups,
.padded_invocations = padded,
.tail = padded - total_items,
};
}
/// Executes the squaring operation on the CPU, mirroring the GPU kernel logic.
/// Each output element is the square of its corresponding input.
fn runCpuFallback(input: []const f32, output: []f32) void {
std.debug.assert(input.len == output.len);
for (input, output) |value, *slot| {
slot.* = value * value;
}
}
/// Calculates a simple sum of all f32 values in double precision for observability.
fn checksumSlice(values: []const f32) f64 {
var total: f64 = 0.0;
for (values) |value| {
total += @as(f64, @floatCast(value));
}
return total;
}
/// Attempts to read and validate a SPIR-V binary module from disk.
/// Returns null if the file doesn't exist; validates the magic number (0x07230203).
fn loadSpirvIfPresent(allocator: std.mem.Allocator, path: []const u8) !?ModuleInfo {
var file = std.fs.cwd().openFile(path, .{}) catch |err| switch (err) {
error.FileNotFound => return null,
else => return err,
};
defer file.close();
const bytes = try file.readToEndAlloc(allocator, 1 << 20);
errdefer allocator.free(bytes);
// Validate minimum size for SPIR-V header
if (bytes.len < 4) return error.SpirvTooSmall;
// Check little-endian magic number
const magic = std.mem.readInt(u32, bytes[0..4], .little);
if (magic != 0x0723_0203) return error.InvalidSpirvMagic;
return ModuleInfo{ .path = path, .bytes = bytes };
}
/// Loads raw binary data if the file exists; returns null for missing files.
fn loadBinaryIfPresent(allocator: std.mem.Allocator, path: []const u8) !?[]u8 {
var file = std.fs.cwd().openFile(path, .{}) catch |err| switch (err) {
error.FileNotFound => return null,
else => return err,
};
defer file.close();
const bytes = try file.readToEndAlloc(allocator, 1 << 24);
return bytes;
}
/// Compares two f32 slices for approximate equality within 1e-6 tolerance.
/// Returns the count of mismatched lanes; returns expected.len if sizes differ.
fn compareF32Slices(expected: []const f32, blob_bytes: []const u8) usize {
// Ensure blob size aligns with f32 boundaries
if (blob_bytes.len % @sizeOf(f32) != 0) return expected.len;
const actual = std.mem.bytesAsSlice(f32, blob_bytes);
if (actual.len != expected.len) return expected.len;
var mismatches: usize = 0;
for (expected, actual) |lhs, rhs| {
// Use floating-point tolerance to account for minor GPU precision differences
if (!std.math.approxEqAbs(f32, lhs, rhs, 1e-6)) {
mismatches += 1;
}
}
return mismatches;
}
/// Writes CPU-computed f32 array to disk as raw bytes for external comparison tools.
fn emitCpuDump(values: []const f32) !void {
// Ensure output directory exists before writing
try std.fs.cwd().makePath("out");
var file = try std.fs.cwd().createFile(cpu_dump_path, .{ .truncate = true });
defer file.close();
// Convert f32 slice to raw bytes for binary serialization
const bytes = std.mem.sliceAsBytes(values);
try file.writeAll(bytes);
}
$ zig build-obj -fno-llvm -O ReleaseSmall -target spirv32-vulkan-none \
-femit-bin=kernels/vector_square.spv \
01_vector_square_kernel.zig
$ zig run 02_host_pipeline.zig -- --emit-binary
launch plan: 16 groups × 64 lanes => 1024 invocations (tail 24)
cpu fallback checksum: 83467485.758038
gpu module: kernels/vector_square.spv (5368 bytes, header ok)
gpu capture diff: skipped (no out/gpu_result.bin file found)
lane 0: in=0.10821 out=0.01171
lane 1: in=1.07972 out=1.16579
lane 2: in=1.03577 out=1.07281
lane 3: in=2.33225 out=5.43938
lane 4: in=2.92146 out=8.53491
lane 5: in=2.89332 out=8.37133
cpu reference written to out/reference.bin
If you plan to capture GPU buffers, keep the generated out/reference.bin; otherwise delete it to keep the workspace clean.
Validating Device Dumps
The diff tool consumes two binary dumps (expected versus captured) and reports the mismatched lanes, previewing the first few differences to help you spot data-dependent bugs quickly. It assumes little-endian f32 values, matching how most host APIs expose raw mapped buffers.
// Utility to compare two float32 binary dumps.
//
// The files are expected to be raw little-endian 32-bit float arrays. The
// program prints the number of mismatched lanes (based on absolute tolerance)
// and highlights the first few differences for quick diagnostics.
const std = @import("std");
/// Maximum number of mismatched differences to display in diagnostic output
const max_preview = 5;
pub fn main() !void {
// Initialize allocator with leak detection for development builds
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer switch (gpa.deinit()) {
.ok => {},
.leak => std.log.warn("compare_dump leaked memory", .{}),
};
const allocator = gpa.allocator();
// Parse command-line arguments expecting exactly two file paths
var args = try std.process.argsWithAllocator(allocator);
defer args.deinit();
_ = args.next(); // Skip program name
const expected_path = args.next() orelse return usageError();
const actual_path = args.next() orelse return usageError();
if (args.next()) |_| return usageError(); // Reject extra arguments
// Load both binary dumps into memory for comparison
const expected_bytes = try readAll(allocator, expected_path);
defer allocator.free(expected_bytes);
const actual_bytes = try readAll(allocator, actual_path);
defer allocator.free(actual_bytes);
// Reinterpret raw bytes as f32 slices for element-wise comparison
const expected = std.mem.bytesAsSlice(f32, expected_bytes);
const actual = std.mem.bytesAsSlice(f32, actual_bytes);
// Early exit if array lengths differ
if (expected.len != actual.len) {
std.debug.print(
"length mismatch: expected {d} elements, actual {d} elements\n",
.{ expected.len, actual.len },
);
return;
}
// Track total mismatches and collect first few for detailed reporting
var mismatches: usize = 0;
var first_few: [max_preview]?Diff = .{null} ** max_preview;
// Compare each lane using floating-point tolerance to account for minor precision differences
for (expected, actual, 0..) |lhs, rhs, idx| {
if (!std.math.approxEqAbs(f32, lhs, rhs, 1e-6)) {
// Store first N differences for diagnostic display
if (mismatches < max_preview) {
first_few[mismatches] = Diff{ .index = idx, .expected = lhs, .actual = rhs };
}
mismatches += 1;
}
}
// Print summary of comparison results
std.debug.print("mismatched lanes: {d}\n", .{mismatches});
// Display detailed information for first few mismatches to aid debugging
for (first_few) |maybe_diff| {
if (maybe_diff) |diff| {
std.debug.print(
" lane {d}: expected={d:.6} actual={d:.6}\n",
.{ diff.index, diff.expected, diff.actual },
);
}
}
}
/// Prints usage information and returns an error when invocation is invalid
fn usageError() !void {
std.debug.print("usage: compare_dump <expected.bin> <actual.bin>\n", .{});
return error.InvalidInvocation;
}
/// Reads entire file contents into allocated memory with a 64 MiB size limit
fn readAll(allocator: std.mem.Allocator, path: []const u8) ![]u8 {
var file = try std.fs.cwd().openFile(path, .{});
defer file.close();
return try file.readToEndAlloc(allocator, 1 << 26);
}
/// Captures a single floating-point mismatch with its location and values
const Diff = struct {
index: usize,
expected: f32,
actual: f32,
};
$ zig run 03_compare_dump.zig -- out/reference.bin out/reference.bin
mismatched lanes: 0
To validate a real GPU run, save the device buffer as out/gpu_result.bin and rerun 03_compare_dump.zig against that file to surface any differences.
Notes and Caveats
- Storage buffers need explicit alignment; keep your extern struct definitions in sync with the host's descriptor bindings to avoid silent padding bugs.
- The self-hosted SPIR-V backend rejects the unsupported address spaces on CPU targets, so keep the kernel source file isolated from host builds (do not @import it from a CPU binary).
- Deterministic PRNG seeding keeps CPU and GPU executions comparable across runs and CI environments.
Exercises
- Extend the kernel to a fused multiply-add (a * a + b) by binding a second input buffer; update the host and the diff tool accordingly.
- Teach the host CLI to emit JSON metadata describing the dispatch plan so external analyzers can pick up the run configuration; one possible starting point is sketched after this list.
- Capture real GPU output (via Vulkan, WebGPU, or wgpu-native) and feed the binary into 03_compare_dump.zig, noting any tolerance adjustments your hardware requires.
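For the JSON exercise, a possible starting point is sketched below. It formats the dispatch plan by hand with std.fmt so the sketch does not depend on a particular std.json API; the out/dispatch_plan.json path and the field names are illustrative choices, not part of the project.
// Sketch: emit the dispatch plan as JSON so external analyzers can read the
// run configuration. Output path and field names are placeholder choices.
const std = @import("std");

const DispatchPlan = struct {
    workgroup_size: u32,
    group_count: u32,
    padded_invocations: u32,
    tail: u32,
};

fn emitPlanJson(allocator: std.mem.Allocator, plan: DispatchPlan, path: []const u8) !void {
    // Build the JSON document in memory; the shape is flat, so manual
    // formatting keeps the sketch independent of std.json details.
    const json = try std.fmt.allocPrint(
        allocator,
        "{{\"workgroup_size\":{d},\"group_count\":{d},\"padded_invocations\":{d},\"tail\":{d}}}\n",
        .{ plan.workgroup_size, plan.group_count, plan.padded_invocations, plan.tail },
    );
    defer allocator.free(json);

    try std.fs.cwd().makePath("out");
    var file = try std.fs.cwd().createFile(path, .{ .truncate = true });
    defer file.close();
    try file.writeAll(json);
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    // Values taken from the launch plan printed earlier in this chapter.
    const plan = DispatchPlan{
        .workgroup_size = 64,
        .group_count = 16,
        .padded_invocations = 1024,
        .tail = 24,
    };
    try emitPlanJson(gpa.allocator(), plan, "out/dispatch_plan.json");
}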
Alternatives and Edge Cases
- Vendors map storage buffers differently; check the required minimum alignment (for example, 16 bytes on some drivers) before assuming f32 arrays are tightly packed.
- For very large buffers, stream the comparison instead of loading entire dumps into memory, so low-end machines are not pressured by the allocator; a sketch of this approach follows the list.
- When targeting CUDA (nvptx64), swap the calling convention to .kernel and adjust the address spaces (.global/.shared) to satisfy PTX expectations.
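A minimal sketch of the streaming idea follows: it compares two raw f32 dumps in fixed-size chunks so neither file is loaded whole. The 64 KiB chunk size and the file paths in main are assumptions; the tolerance mirrors the in-memory tool.
// Streaming-comparison sketch: walk two raw little-endian f32 dumps in
// fixed-size chunks so neither file has to fit in memory at once.
const std = @import("std");

fn countMismatchesStreaming(expected_path: []const u8, actual_path: []const u8) !usize {
    var expected_file = try std.fs.cwd().openFile(expected_path, .{});
    defer expected_file.close();
    var actual_file = try std.fs.cwd().openFile(actual_path, .{});
    defer actual_file.close();

    // 64 KiB per side keeps the working set small; the size is a multiple of
    // @sizeOf(f32), so lanes never straddle a chunk boundary.
    var expected_buf: [64 * 1024]u8 = undefined;
    var actual_buf: [64 * 1024]u8 = undefined;
    var mismatches: usize = 0;

    while (true) {
        const expected_read = try expected_file.readAll(&expected_buf);
        const actual_read = try actual_file.readAll(&actual_buf);
        if (expected_read != actual_read) return error.LengthMismatch;
        if (expected_read == 0) break;
        if (expected_read % @sizeOf(f32) != 0) return error.TruncatedLane;

        const expected_lanes = std.mem.bytesAsSlice(f32, expected_buf[0..expected_read]);
        const actual_lanes = std.mem.bytesAsSlice(f32, actual_buf[0..actual_read]);
        for (expected_lanes, actual_lanes) |lhs, rhs| {
            // Same absolute tolerance as 03_compare_dump.zig.
            if (!std.math.approxEqAbs(f32, lhs, rhs, 1e-6)) mismatches += 1;
        }
    }
    return mismatches;
}

pub fn main() !void {
    const mismatches = try countMismatchesStreaming("out/reference.bin", "out/gpu_result.bin");
    std.debug.print("mismatched lanes: {d}\n", .{mismatches});
}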