Combine Metal 4 machine learning and graphics
Learn how to integrate machine learning seamlessly into your graphics applications with Metal 4. We'll introduce the tensor resources and ML encoder used to run models on the GPU timeline alongside your rendering and compute work, and show how to use Shader ML to embed neural networks directly in your shaders for advanced effects and performance gains. We'll also use a sample app to show the new debugging tools for Metal 4 ML workloads in action.
Chapters
- 0:00 - Introduction
- 2:52 - Meet tensors
- 6:21 - Encode ML networks
- 12:51 - Embed ML in your shader
- 20:26 - Debug your ML workloads
-
Hi! My name is Preston Provins and I am an engineer on the Metal Framework team at Apple, and I'll be joined later by my colleague Scott. I'll share the additions coming to Metal that combine machine learning and games, and Scott will introduce the GPU tools additions designed to enhance your debugging experience for machine learning in Metal 4. I'm excited to share how to combine Metal 4 machine learning and graphics in this session. If you're interested in everything Metal 4 has to offer, check out the Metal 4 foundations talk to learn what else is new in Metal 4.

Machine learning is transforming games and graphics with techniques like upscaling, asset compression, animation blending, and neural shading. These techniques push the frontier of creativity and immersion: they simulate complex phenomena, enhance visual fidelity, and streamline the exploration of new styles and effects. CoreML is fantastic for a wide range of machine learning tasks, such as segmentation, classification, generative AI, and more, and it makes it easy to author machine learning models. If your application of machine learning requires tight integration with the GPU timeline, Metal 4 has you covered.

In a typical frame, a game may perform vertex skinning in a compute pass, rasterize the scene in a render pass, and apply antialiasing in another compute pass. Antialiasing is typically done using image-processing techniques such as temporal antialiasing. Cutting-edge techniques replace these traditional methods with a machine learning network.
This network upscales the image, allowing the rest of the rendering to happen at lower resolution, improving performance.
It’s also becoming more common to execute tiny neural networks inside a shader. For example, a traditional fragment shader would sample material textures, but groundbreaking techniques use small neural networks to decompress textures on the fly and achieve higher compression ratios.
This neural rendering technique compresses material sets to 50% of the block-compressed footprint. In this session, we'll meet MTLTensors, Metal 4's new resource for machine learning workflows. We'll dive into the new MTL4MachineLearningCommandEncoder, which runs entire networks on the GPU timeline alongside your other draws and dispatches. We'll introduce Shader ML, which lets you embed machine learning operations inside your own shaders. Finally, we'll show how you can seamlessly integrate ML into your application with the Metal Debugger.

You're already familiar with MTLBuffers and MTLTextures. This year, Metal 4 introduces MTLTensor, a new resource that lets you apply machine learning to data with unprecedented ease. The MTLTensor is a fundamental machine learning data type that can be used in compute, graphics, and machine learning contexts, and machine learning workloads make extensive use of tensors: MTL4MachineLearningCommandEncoder uses MTLTensors to represent inputs and outputs, and Shader ML uses MTLTensors to represent weights as well as inputs and outputs.
MTLTensors are multi-dimensional containers for data, described by a rank and an extent for each dimension. MTLTensors can extend well beyond two dimensions, providing the flexibility to describe any data layout you need for practical machine learning usage. MTLTextures, for example, are limited to four channels at most and have strict limits on their extents that depend on the texture format. Machine learning also commonly works with data that has more than two dimensions, such as the inputs and outputs of convolution operations, and using a flat representation of data like a MTLBuffer would require complicated indexing schemes for data with multiple dimensions. Indexing multi-dimensional data in a MTLTensor is much simpler, because the stride and extent of each dimension are baked into the MTLTensor object and automatically used in the indexing calculations. Let's work through the process of creating a MTLTensor. The MTLTensor's rank describes how many axes it has; this MTLTensor has a rank of two, containing rows and columns of data. The extent of each dimension describes how many data points lie along that axis. The dataType property defines what format of data the MTLTensor is wrapping.
The usage property indicates how the MTLTensor will be used: specify MTLTensorUsageMachineLearning for use with MTL4MachineLearningCommandEncoder, and MTLTensorUsageCompute or MTLTensorUsageRender for use inside your shader programs. It's also possible to combine usages, just like the usage property for textures. Those are the important MTLTensor properties to populate on a MTLTensorDescriptor object; now let's make a MTLTensor in code. With the descriptor properties filled in, create a new MTLTensor by calling newTensorWithDescriptor:offset:error: on a MTLDevice object. MTLTensors are created from either a MTLDevice or a MTLBuffer object; however, MTLTensors created from a device offer the best performance. Similar to how MTLTextures can be swizzled, creating a MTLTensor from a MTLDevice object results in an opaque layout that is optimized for reading and writing. Now, let's focus on creating MTLTensors from a pre-existing MTLBuffer. Unlike MTLTensors created from a MTLDevice, MTLTensors created from a MTLBuffer aren't assumed to be tightly packed, so you need to specify their strides. The innermost stride should always be one.
The second stride indicates how many elements are jumped over when the row index is incremented. It's possible the source MTLBuffer contains padding, such as unused columns at the end of each row, and that padding needs to be accounted for so the MTLTensor wraps the appropriate elements. For example, if each row holds eight useful values but the buffer stores ten elements per row, the row stride should be ten, not eight.
To create a MTLTensor from an underlying buffer, set the dataType and usage properties just as you would for a device-allocated tensor, and fill out the strides property of the MTLTensorDescriptor so the resulting MTLTensor appropriately wraps the contents of the MTLBuffer. Finally, call newTensorWithDescriptor:offset:error: on the source MTLBuffer.
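To make that concrete, here is a minimal sketch of the buffer-backed path just described. The property names follow the descriptions above, but the MTLTensorExtents initializer, the innermost-first ordering of the extents, and the half-precision element type are illustrative assumptions rather than confirmed API details, and latentBuffer, rows, columns, and rowStride are placeholder variables.

// Wrap an existing MTLBuffer holding rows x columns of half-precision values,
// where each row in the buffer is padded out to rowStride elements.
NSInteger extentValues[] = { columns, rows };   // innermost dimension listed first (assumed ordering)
NSInteger strideValues[] = { 1, rowStride };    // the innermost stride is always one

MTLTensorDescriptor *descriptor = [MTLTensorDescriptor new];
descriptor.dataType   = MTLTensorDataTypeFloat16;                  // element format (illustrative)
descriptor.usage      = MTLTensorUsageCompute | MTLTensorUsageRender;
descriptor.dimensions = [[MTLTensorExtents alloc] initWithRank:2 values:extentValues];  // assumed initializer
descriptor.strides    = [[MTLTensorExtents alloc] initWithRank:2 values:strideValues];

NSError *error = nil;
id<MTLTensor> latentTensor = [latentBuffer newTensorWithDescriptor:descriptor
                                                             offset:0
                                                              error:&error];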
Now that we know how to allocate and create MTLTensors, let's dive into the new machine learning encoder that adds ML work to the GPU timeline. Metal 4 already makes it easy to add compute and render commands to the GPU timeline with the MTL4ComputeCommandEncoder and MTL4RenderCommandEncoder, respectively; this year, we're taking unification even further by adding machine learning work to the GPU timeline. The MTL4MachineLearningCommandEncoder enables full models to run alongside, and synchronize with, other Metal commands on the GPU, ensuring seamless integration with the rest of your Metal work. It's a new encoder for encoding machine learning commands, with an interface similar to the compute and render encoders. The Metal 4 synchronization primitives also operate with machine learning commands, just like with compute and render; synchronization enables control over work orchestration and facilitates parallelization to maintain high performance.
The MTL4MachineLearningCommandEncoder workflow can be separated into two parts: offline and runtime. The offline portion takes place prior to application launch, and the runtime portion happens during the application's lifetime, such as in the middle of a frame. Let's start with the offline portion of the workflow: creating a MTLPackage.
A MTLPackage is a container for one or more functions, each representing an ML network, that you can use in Metal to execute machine learning work. This format is optimized for loading and execution with Metal.
To create a MTLPackage, you first need a CoreML package. Here, we use the CoreML converter to convert the network from the ML framework it was authored in, such as PyTorch or TensorFlow, into a CoreML package.
Here is an example of exporting a PyTorch model using the CoreML Tools library in Python. Simply import the tools and run convert on the model to generate an export, then save that export as an ML package. One thing to highlight here: not every CoreML package is an ML program, and only ML programs are supported. If the CoreML package was exported on an older OS, check out this article to learn more about exporting those CoreML model files as an ML package. With the CoreML package created, it's as simple as running the metal-package-builder command-line tool on the saved model to produce a MTLPackage; this converts the CoreML package into a format that can be efficiently loaded at runtime. So, that's it for creating a MTLPackage. The offline portion of the workflow is complete, and the rest is carried out at runtime.
To compile the network, first open the MTLPackage as an MTLLibrary. Create a function descriptor using the name of the function that represents the network in the package; in this case, the main function.
Next, create a MTL4MachineLearningPipelineState. This is done using a MTL4MachineLearningPipelineStateDescriptor configured with the function descriptor. If the network has dynamic inputs, specify the size of each input on the MTL4MachineLearningPipelineStateDescriptor.
Compile the network for the specific device by creating the MTL4MachineLearningPipelineState with the MTL4MachineLearningPipelineStateDescriptor.
That is how an MTL4MachineLearningPipelineState object is created. The next step is creating the MTL4MachineLearningCommandEncoder and encoding work.
Let's take a deeper look at using the MTL4MachineLearningCommandEncoder object to dispatch work on the GPU timeline.
Simply create the MTL4MachineLearningCommandEncoder object, just like creating an encoder for compute or render. Set the MTL4MachineLearningPipelineState object you created, and bind the inputs and outputs being used. Finally, dispatch the work using the dispatchNetworkWithIntermediatesHeap method.
The machine learning encoder uses the heap to store intermediate data between operations; instead of creating and releasing buffers, it enables the reuse of resources across different dispatches.
To create this MTLHeap, create a MTLHeapDescriptor and set the type property to MTLHeapTypePlacement. You can get the minimum heap size for the network by querying the intermediatesHeapSize property of the pipeline; set the size property of the heap to be greater than or equal to that value.
After encoding your network dispatches, end encoding and submit your commands to run them on the GPU timeline.
As previously mentioned, Metal 4 synchronization primitives also operate with machine learning commands, just like with compute and render.
Work that doesn't depend on machine learning output can happen in tandem if synchronized correctly.
Only the work consuming the network output needs to wait for the scheduled machine learning work to conclude.
To synchronize MTL4MachineLearningCommandEncoder dispatches, you can use standard Metal 4 synchronization primitives such as MTLBarriers and MTLFences. The new MTLStageMachineLearning stage is used to identify ML workloads in barriers. For example, to make your rendering work wait on outputs produced by a network, you could use a barrier between the appropriate render stage and the machine learning stage.

Let's look at MTL4MachineLearningCommandEncoder in action. In this example, MTL4MachineLearningCommandEncoder is used to dispatch a fully convolutional network to predict occlusion values per pixel. Evaluating this requires careful synchronization: the depth buffer and view-space normals are populated prior to launching the ML workload, and while the network is processing the data, the renderer dispatches other render-related tasks in parallel and waits for the neural results before compositing the final frame. MTL4MachineLearningCommandEncoder isn't limited to processing full-frame information for games; you can use it for any network that fits into a real-time budget, and leverage Metal 4 synchronization primitives however best suits your integration needs.

That's how Metal 4's MTL4MachineLearningCommandEncoder makes it easy to run large machine learning workloads on the GPU timeline. To summarize: machine learning joins compute and render in Metal 4 through the MTL4MachineLearningCommandEncoder, which enables entire networks to run on the GPU timeline. Resources are shareable with other GPU commands, and the robust set of Metal 4 synchronization primitives enables high-performance machine learning capabilities.

Metal 4 also introduces Shader ML for embedding smaller machine learning operations inside your existing kernels and shaders. Cutting-edge games are adopting machine learning to replace traditional rendering algorithms. ML-based techniques can offer solutions for global illumination, material shading, geometry compression, material compression, and more, and they can often improve performance or decrease memory footprint. As a motivating example, let's consider neural material compression, a technique that enables up to 50% compression when compared to block-compressed formats.
With traditional materials, you sample material textures, such as albedo and normal maps. Then you use the sampled values to perform shading.
With neural material compression, you’ll sample latent texture data, perform inference using the sampled values, and use the network’s output to perform shading.
Splitting each step into its own pipeline is inefficient, since each step needs to sync tensors to device memory, operate on them, and sync the outputs back for later operations.
To get the best performance, your app should combine these steps into a single shader dispatch. With Shader ML, Metal enables you to run your ML network directly within your fragment shader, without having to go through device memory between steps. You can initialize input tensors, run your network, and shade only the necessary pixels each frame. That reduces your runtime memory footprint and the disk space your game needs.
Let's take a look at neural material evaluation in greater detail.
Initializing input MTLTensors can be split into two parts: loading the network's weights and building the input feature MTLTensor. The input feature MTLTensor is made by sampling bound textures with a UV coordinate for the fragment.
Inference is where the input feature MTLTensor is transformed by learned weight matrices to extract features, compute activations, and propagate information through layers. This evaluation is repeated for multiple layers and the result is a decompressed material. Finally, the decompressed materials are used for the shading calculations of the fragment.
Let's see how to initialize our input MTLTensors with Shader ML. First, let's declare a fragment shader that will utilize Shader ML and take the network weights as input. Start by including the new metal_tensor header. We'll use the MTLTensor type to access the network weights. MTLTensors are bound to the shader using buffer binding slots; it's also possible to pass in MTLTensors using argument buffers. The MTLTensor type is templated: the first template argument is the MTLTensor's dataType. These MTLTensors were created in device memory, so we use the device address space qualifier. The second argument represents the MTLTensor's dimensions and the type used for indexing into the MTLTensor. Here, we're using dextents to define a rank-two tensor with dynamic extents. With that, our fragment shader is set up. Let's implement the neural material compression algorithm.
With the weights of the network passed in, we can create the input MTLTensor by sampling four latent textures. MTLTensors aren't just resources that can be bound: you can also create inline MTLTensors directly within your shaders. Create a MTLTensor wrapping the sampled values, and use it for the evaluation of the network. Inline MTLTensors are assumed to be tightly packed, so there is no need to pass strides at creation.
With that, initializing the input MTLTensors is complete and we are all set up to infer values from the neural network. Evaluation transforms inputs using learned parameters, which are then activated. The activations are passed to subsequent layers, and the final layer's activations form the decompressed material.
This year, Metal introduces Metal Performance Primitives to make MTLTensor operations accessible in the shading language. This library is a set of high-performance APIs that enable performance-portable solutions on MTLTensors, providing matrix multiplication and convolution operations.
Matrix multiplication is at the heart of neural network evaluation. We'll use the matmul2d implementation provided by Metal Performance Primitives to implement a performance-portable network evaluation routine. To get started, include the new MetalPerformancePrimitives header inside your Metal shader. The parameters of your matrix multiplication are configured using the matmul2d_descriptor object: the first set of parameters specifies the problem size of the matrix multiplication, the next set controls whether the inputs to the matrix multiplication need to be transposed when performing the operation, and the last parameter controls your precision requirements.
In addition to the descriptor, the matmul2d operation must be specialized with the number of threads that will be participating in the operation. Here, since we are within a fragment shader, we’ll use execution_thread to indicate that the full matrix multiplication will be performed by this thread. Then, run the matrix multiplication with that configuration.
Finally, activate each element of the result of our matrix multiplication using the ReLU activation function. This process is repeated for the second layer to fully evaluate our network right in our shader. After evaluation is complete, the decompressed materials are available to be used for shading.
The output MTLTensor holds channel information which can then be used like any other value sampled from a texture. Here’s a realtime demo of neural material compression compared to traditional materials. There’s no perceived quality loss from using neural materials, especially when shaded. Here’s the base color in isolation. It’s still very difficult to notice any differences between neural materials and traditional ones, and yet neural materials use half the memory and take up half the disk space.
MTLTensor operations aren’t exclusive to just fragment shaders. They can be used inside of all functions and all shader stages. If an entire simdgroup or threadgroup will be doing the same operations on the same data, you can leverage the hardware to your advantage by choosing a larger execution group. But if your MTLTensor operations are divergent with respect to data or exhibit non-uniform control flow at the MTLTensor operation call site, you must use a single thread execution group. Other execution schemes assume there is no divergence and uniform control flow for the execution group.
To summarize, you can now perform ML operations like matrix multiplication and convolution in your own shaders. Shader ML makes it easy to perform multiple ML operations in a single shader. This is cache-friendly, requires fewer dispatches, and uses less memory bandwidth, especially when using smaller networks. And Shader ML gives you the fine-grained control you need to create custom operations. It's never been easier to implement cutting-edge ML techniques in your Metal apps. And that's how you can use Shader ML to embed neural networks into your shader program. Now, I'll turn things over to my colleague Scott to show how Metal 4's new debug tools make debugging machine learning workloads a breeze. Hi everyone, I'm Scott Moyers and I'm a software engineer on the GPU Tools team.
Earlier, Preston showed you an application that uses machine learning to calculate ambient occlusion. The app encodes a machine learning network directly into its Metal rendering pipeline.
While helping develop this app, I hit an issue where the output had some severe artifacts. Let me enable just the ambient occlusion pass to highlight the problem I had.
There should be shadows in the corners of objects, but instead there's lots of noise and the structure of the scene is barely visible.
I’ll show you how I used the new tools to find and fix this issue. First let’s capture a GPU trace of the app in Xcode. To do that I’ll click on the Metal icon at the bottom of the screen, then the capture button.
Once the capture completes I can find my captured frame available in the summary.
The debug navigator on the left provides the list of commands that the application used to construct the frame.
For instance, the offscreen command buffer contains many encoders including the G-buffer pass. And the next command buffer contains my MTL4MachineLearningCommandEncoder. Using Metal 4 allowed me to have fine grained control over synchronization, and whilst I was careful about setting up barriers and events between dependent passes, I wondered if a synchronization problem could be causing these issues. To check this, I turned to the Dependency Viewer, which is a useful tool to get an overview of the structure of your Metal application. I’ll click on the Dependencies icon at the top left.
With this interface, I can see all of the application’s commands along with any synchronization primitives such as barriers and events. Zooming into a command encoder reveals even more detail. There’s the completion of my first command buffer.
The command below it copies the normals into a MTLTensor. Then there’s a barrier followed by a MTL4MachineLearningCommandEncoder. I’ll zoom back out so I can review the overall structure. My new ambient occlusion pass is in the command buffer on the right. Before I added this pass the application was working fine, so I can assume the dependencies within the top and bottom command buffer are correct.
I’ll inspect the new command buffer containing the MTL4MachineLearningCommandEncoder.
Before the command buffer can start there’s a wait for a shared event signal.
Then at the end of the command buffer there's a signal to unblock the next one, so there can't be any other commands running in parallel with this command buffer. And within the command buffer there are barriers between each encoder, ensuring that the commands execute one after the other. I was fairly confident at this stage that there weren't any synchronization issues, at least within this frame. With that ruled out, I decided to check the MTL4MachineLearningCommandEncoder directly. Clicking on the dispatch call for the ambient occlusion network takes me to its bound resources.
On the right the assistant editor is displaying the output MTLTensor. I can see it has the same artifacts as the running application, so clearly it’s not correct. I’ll double click the input MTLTensor to display it next to the output. The input has what I would expect for view space normals; the objects facing a different direction do have different component intensities. So the problem must be inside my machine learning network. Let’s go back to our bound resources view and this time I’ll double click Network to open it in the new ML Network Debugger. This tool is essential for understanding what's happening inside the model.
This graph represents the structure of my ambient occlusion network. I wrote it in PyTorch, and in my target's build phases I do what Preston suggested earlier, I export it as a CoreML package, then convert to an MTLPackage. The boxes are the operations and the connections show the data flow through the model from left to right. I wanted to find out which operation was responsible for introducing the artifacts. I knew the final output was bad and that the input was good, so I decided to bisect the graph to narrow it down. Let’s pick an operation roughly in the middle.
Selecting an operation shows its description on the right, along with its attributes, inputs, and outputs. What’s more is that I am able to inspect the intermediate MTLTensor data that any operation outputs. I’ll click on the preview to open it in the MTLTensor viewer.
I can see the artifacts are already present here, so I’ll check an earlier operation.
This operation also has artifacts in the output. Let’s inspect its input.
This MTLTensor, however, appears to be highlighting edges in the scene, which is expected; the input to our network is the edges extracted from our depth buffer. So something must be going wrong within this region of the network.
This stitched region can be expanded by clicking on the arrows in the top left of the operation.
From the order and types of these operations, I recognize this as my SignedSmoothstep function. It first takes the absolute value of the input, then clamps the value between 0 and 1. But then it's raising the result to the power of itself, which doesn't seem right to me; I don't remember there being a power operation in the SignedSmoothstep function. Let's jump into the Python code to find out what's going on. I'll stop the debug session and go back to the source code.
The model I'm running is in this class called LightUNet. I’ll navigate to its forward propagation function to check it's doing what I expect.
The first custom operation it's performing is SignedSmoothstep, which is the stitched region I saw in the ML network debugger. I’ll jump to its forward propagation function.
This should be a straightforward smoothstep operation in which I maintain the sign of the input. But on this line I can see the bug: I typed too many asterisks, turning my multiply into a power operator. Let's delete that extra one and try running it again.
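For illustration, here is a minimal sketch of what a sign-preserving smoothstep module might look like in PyTorch; the demo's actual source isn't shown in the session, so the class and variable names here are assumptions.

import torch

class SignedSmoothstep(torch.nn.Module):
    # Smoothstep that preserves the sign of the input.
    def forward(self, x):
        s = torch.sign(x)
        t = torch.clamp(torch.abs(x), 0.0, 1.0)
        # The bug described above: typing "t ** t" instead of "t * t" turns
        # the first multiply into a power operation, i.e. raising t to itself.
        return s * t * t * (3.0 - 2.0 * t)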
And there you have it, a working implementation of neural ambient occlusion using Metal 4’s built in MTL4MachineLearningCommandEncoder.
In this demo I showed you how I used the Metal Debugger to debug a Metal 4 machine learning application. First, the Dependency Viewer helped me validate synchronization. After that, I inspected the inputs and outputs of the network using the MTLTensor viewer, which verified that the problem was inside the network.
Finally I used the ML network debugger to step through the operations in the network and pinpoint the issue.
These tools are part of a larger family of tools available for debugging and optimizing Metal apps. Now let's recap what we covered today.

Metal 4 introduces MTLTensor, a new multi-dimensional resource designed specifically for machine learning data. MTLTensors provide flexibility for complex data layouts beyond two dimensions, and with baked-in stride and dimension information, they greatly simplify indexing. New features in Metal 4 make it possible to combine machine learning workloads with your Metal pipelines. The MTL4MachineLearningCommandEncoder enables entire machine learning networks to run directly on the GPU timeline, allowing seamless integration and synchronization with your compute and render work. For smaller networks, Shader ML and the Metal Performance Primitives library let you embed machine learning operations directly into your shaders. Lastly, the Metal Debugger gives you incredible visibility into what's happening in your Metal 4 application. The new ML network debugger makes it easy to understand your network and how it executes on device; this kind of insight is essential for ensuring correctness and optimizing performance.

For some next steps, try out Metal 4's MTL4MachineLearningCommandEncoder and Shader ML for yourself by installing the latest OS and Xcode. To find out more about how the Metal developer tools can help you, head over to the Apple Developer website. And to get the most out of your Metal 4 application, make sure you check out the other Metal 4 talks. We're truly excited to see what you'll build with these new capabilities. Thank you.
-
-
8:13 - Exporting a Core ML package with PyTorch
import coremltools as ct

# define model in PyTorch

# export model to an mlpackage
model_from_export = ct.convert(
    custom_traced_model,
    inputs=[...],
    outputs=[...],
    convert_to='mlprogram',
    minimum_deployment_target=ct.target.macOS16,
)
model_from_export.save('model.mlpackage')
-
9:10 - Identifying a network in a Metal package
library = [device newLibraryWithURL:@"myNetwork.mtlpackage"];

functionDescriptor = [MTL4LibraryFunctionDescriptor new];
functionDescriptor.name = @"main";
functionDescriptor.library = library;
-
9:21 - Creating a pipeline state
descriptor = [MTL4MachineLearningPipelineDescriptor new];
descriptor.machineLearningFunctionDescriptor = functionDescriptor;
[descriptor setInputDimensions:dimensions atBufferIndex:1];

pipeline = [compiler newMachineLearningPipelineStateWithDescriptor:descriptor error:&error];
-
9:58 - Dispatching a network
commands = [device newCommandBuffer];
[commands beginCommandBufferWithAllocator:cmdAllocator];
[commands useResidencySet:residencySet];

/* Create intermediate heap */
/* Configure argument table */

encoder = [commands machineLearningCommandEncoder];
[encoder setPipelineState:pipeline];
[encoder setArgumentTable:argTable];
[encoder dispatchNetworkWithIntermediatesHeap:heap];
-
10:30 - Creating a heap for intermediate storage
heapDescriptor = [MTLHeapDescriptor new];
heapDescriptor.type = MTLHeapTypePlacement;
heapDescriptor.size = pipeline.intermediatesHeapSize;

heap = [device newHeapWithDescriptor:heapDescriptor];
-
10:46 - Submitting commands to the GPU timeline
commands = [device newCommandBuffer];
[commands beginCommandBufferWithAllocator:cmdAllocator];
[commands useResidencySet:residencySet];

/* Create intermediate heap */
/* Configure argument table */

encoder = [commands machineLearningCommandEncoder];
[encoder setPipelineState:pipeline];
[encoder setArgumentTable:argTable];
[encoder dispatchNetworkWithIntermediatesHeap:heap];

[commands endCommandBuffer];
[queue commit:&commands count:1];
-
11:18 - Synchronization
[encoder barrierAfterStages:MTLStageMachineLearning beforeQueueStages:MTLStageVertex visibilityOptions:MTL4VisibilityOptionDevice];
-
15:17 - Declaring a fragment shader with tensor inputs
// Metal Shading Language 4
#include <metal_tensor>

using namespace metal;

[[fragment]]
float4 shade_frag(tensor<device half, dextents<int, 2>> layer0Weights [[ buffer(0) ]],
                  tensor<device half, dextents<int, 2>> layer1Weights [[ buffer(1) ]],
                  /* other bindings */)
{
    // Creating input tensor
    half inputs[INPUT_WIDTH] = { /* four latent texture samples + UV data */ };

    auto inputTensor = tensor(inputs, extents<int, INPUT_WIDTH, 1>());
    ...
}
-
17:12 - Operating on tensors in shaders
// Metal Shading Language 4
#include <MetalPerformancePrimitives/MetalPerformancePrimitives.h>

using namespace mpp;

constexpr tensor_ops::matmul2d_descriptor desc(
    /* M, N, K */           1, HIDDEN_WIDTH, INPUT_WIDTH,
    /* left transpose */    false,
    /* right transpose */   true,
    /* reduced precision */ true);

tensor_ops::matmul2d<desc, execution_thread> op;
op.run(inputTensor, layerN, intermediateN);

for (auto intermediateIndex = 0; intermediateIndex < intermediateN(0); ++intermediateIndex)
{
    intermediateN[intermediateIndex, 0] = max(0.0f, intermediateN[intermediateIndex, 0]);
}
-
18:38 - Render using network evaluation
half3 baseColor          = half3(outputTensor[0,0], outputTensor[1,0], outputTensor[2,0]);
half3 tangentSpaceNormal = half3(outputTensor[3,0], outputTensor[4,0], outputTensor[5,0]);

half3 worldSpaceNormal = worldSpaceTBN * tangentSpaceNormal;

return baseColor * saturate(dot(worldSpaceNormal, worldSpaceLightDir));
-
-
- 0:00 - Introduction
Introducing Metal 4, which enhances machine learning (ML) integration in games and graphics. Metal 4 enables techniques like upscaling, asset compression, and animation blending using ML networks, improving performance and visual fidelity. Key features include 'MTLTensors' for ML workflows, 'MTL4MachineLearningCommandEncoder' for running networks on the GPU timeline, Shader ML for embedding ML operations in shaders, and improved debugging tools. CoreML is optimal for authoring ML models, and you can achieve seamless ML integration into applications with help from the Metal Debugger.
- 2:52 - Meet tensors
Metal 4 introduces 'MTLTensor', a new resource specifically designed for machine learning workloads. 'MTLTensors' are multi-dimensional data containers, enabling efficient representation of the complex data layouts commonly used in machine learning, such as those required for convolution operations. They simplify indexing multidimensional data compared to flat representations like 'MTLBuffers'. An 'MTLTensor' is defined by its rank (number of axes), extents (number of data points along each axis), data type, and usage properties. These properties are specified in an 'MTLTensorDescriptor' object. You can create 'MTLTensors' from either an 'MTLDevice' object, which offers optimized performance with an opaque layout, or from an existing 'MTLBuffer' object, where you need to specify the strides to account for potential padding.
- 6:21 - Encode ML networks
The latest Metal also introduces the 'MTL4MachineLearningCommandEncoder', allowing machine learning work to be integrated seamlessly into the GPU timeline alongside compute and render commands. This new encoder enables full ML models to run on the GPU, synchronizing with other Metal commands using standard synchronization primitives like barriers and fences. The workflow involves two main parts: offline and runtime. Offline, the system converts a CoreML package into an optimized 'MTLPackage' using the 'metal-package-builder' command-line tool. At runtime, the system compiles the 'MTLPackage' into an 'MTL4MachineLearningPipelineState', creates the 'MTL4MachineLearningCommandEncoder' set up with the pipeline state, inputs, and outputs, and then dispatches the encoded commands to the GPU. The encoder utilizes an 'MTLHeap' to store intermediate data, optimizing resource usage. This allows for parallel execution of nondependent tasks, enhancing performance. Metal 4's synchronization capabilities ensure that work consuming ML outputs waits for the network to complete, making it suitable for various real-time applications, not just games.
- 12:51 - Embed ML in your shader
Metal 4 introduces Shader ML, enabling developers to embed machine learning operations directly into shaders. This enhances performance and reduces memory footprint by eliminating the need to sync tensors between device memory and shaders. Neural material compression, a specific ML technique, is an example. It compresses material textures by up to 50% compared to traditional block-compressed formats. With Shader ML, the entire neural material evaluation process — from initializing input tensors to performing inference and shading — can be combined into a single shader dispatch. Metal Performance Primitives is integrated into Shader ML, providing high-performance APIs like matrix multiplication and convolution. This allows you to implement neural network evaluation routines efficiently within fragment shaders, resulting in real-time applications with no perceived quality loss but significantly reduced memory usage and disk space.
- 20:26 - Debug your ML workloads
In the example provided, using the new GPU Tools in Xcode, you can debug a machine learning workload that is causing severe artifacts in an application's ambient occlusion calculation. You can capture a GPU trace and utilize the Dependency Viewer to inspect the synchronization of command buffers, ruling out any synchronization issues. You then examine the input and output tensors of the 'MTL4MachineLearningCommandEncoder', confirming that the problem is within the machine learning network itself. Next, you can open the network in the new ML Network Debugger, a visual tool that represents the structure of the model (written in PyTorch and converted to CoreML and then a MTLPackage), enabling you to pinpoint the specific operation responsible for introducing the artifacts. Upon inspecting the graph, the artifacts are already present in the output of an earlier operation while that operation's input looks as expected, indicating an issue within that region of the network; the 'SignedSmoothstep' function is identified as the problem area. Upon reviewing the Python code, a bug is discovered: an extra asterisk caused a multiplication to be interpreted as a power operation. Correcting this error resolves the issue, and the neural ambient occlusion implementation using Metal 4's 'MTL4MachineLearningCommandEncoder' is successfully debugged.