
burn

Model Serving & MLOps

Burn is a next‑generation tensor library and deep learning framework that offers high flexibility, efficiency, and portability across many hardware backends.

Rust Latest v0.21.0 · 27d ago Security brief →

Features

  • Supports multiple GPU (CUDA, ROCm, Metal, Vulkan, WebGPU, LibTorch) and CPU backends with generic trait‑based design.
  • Provides backend decorators for autodifferentiation (Autodiff) and kernel fusion (Fusion).
  • Enables remote execution via a Beta “Remote” decorator for distributed computations.

Recent releases

v0.21.0 Breaking risk
⚠ Upgrade required
  • Update any code that directly accessed `~/.cache` for Burn datasets to call the new platform‑aware cache directory API.
  • Re-save existing binary model records (`BinFileRecorder`, `BinBytesRecorder`) in a forward-compatible format before upgrading to v0.21.0; they cannot be loaded afterwards.
  • Backend implementors: adjust tensor creation and conversion ops (e.g., `bool_empty`, `bool_into_int`) to accept an explicit output dtype argument.
Breaking changes
  • Dataset cache directory switched from hardcoded `~/.cache` to platform‑specific locations (`$XDG_CACHE_HOME`, `~/Library/Caches`, `{FOLDERPATH_LOCAL_APPDATA}`); code must use the new API.
  • `TensorData::shape` now stores a private `Shape` type instead of `Vec<usize>`; existing binary records using `BinFileRecorder` or `BinBytesRecorder` are not forward‑compatible and require conversion before upgrading.
  • Removed support for `powf` on integer tensors; operations must first cast to float (e.g., `tensor_int.float().powf(tensor_float)`).
Notable features
  • Added lightweight eager CPU backend **burn-flex** for WebAssembly and embedded targets, replacing `burn-ndarray`.
  • Implemented early off‑policy reinforcement learning support in `burn-rl` and related training utilities.
  • Introduced new kernel work: GEMV performance improvements, top‑k operations (`argtopk`), FFT implementations (rFFT/irFFT) with fusion support.
Full changelog

Summary

Burn 0.21.0 brings 4 months of improvements that make the framework significantly faster and more reliable across the board. The gains span distributed workflows for training large models all the way down to small-model inference, where the reduced framework overhead becomes especially noticeable.

We rethought our distributed computing stack around differentiable collective operations. Kernel selection is now more reliable thanks to better autotuning and a new validation layer, and a project-level burn.toml file lets you tweak those internals (and many others) without recompiling. A reworked device handle reduces framework overhead, and a new burn-dispatch crate simplifies backend selection while paving the way for faster compile times. The release also ships burn-flex, a lightweight eager CPU backend for WebAssembly and embedded targets that replaces burn-ndarray. Finally, we added early off-policy reinforcement learning support and a fresh round of kernel work on GEMV, top-k, and FFT.

For more details, check out the release post on our website.

Changelog

Breaking

We've introduced several breaking changes with this release. The affected areas are detailed in the sections below.

burn-dataset cache directory

To respect platform conventions, we switched from a hardcoded ~/.cache directory root to a platform-specific cache directory for downloaded artifacts.

| Platform | Path |
|----------|------|
| Linux | $XDG_CACHE_HOME or ~/.cache |
| macOS | ~/Library/Caches |
| Windows | {FOLDERPATH_LOCAL_APPDATA} |

For Linux users without $XDG_CACHE_HOME configured, this change has no effect. The cache directory is still ~/.cache.
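
The lookup in the table can be sketched with the standard library alone. This is a minimal illustration, not Burn's implementation: the `Platform` enum, the `cache_root` function, and its parameters are hypothetical names, and the inputs are passed in explicitly instead of being read from the environment so the logic stays easy to test.

```rust
use std::path::PathBuf;

/// Operating systems covered by the table (hypothetical helper type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Platform {
    Linux,
    MacOs,
    Windows,
}

/// Resolve the cache root from explicit inputs.
///
/// `home`, `xdg_cache_home`, and `local_app_data` stand in for values a real
/// implementation would obtain from the environment or the operating system.
pub fn cache_root(
    platform: Platform,
    home: &str,
    xdg_cache_home: Option<&str>,
    local_app_data: Option<&str>,
) -> Option<PathBuf> {
    match platform {
        // Prefer $XDG_CACHE_HOME when it is set and non-empty, else ~/.cache.
        Platform::Linux => Some(match xdg_cache_home.filter(|p| !p.is_empty()) {
            Some(xdg) => PathBuf::from(xdg),
            None => PathBuf::from(home).join(".cache"),
        }),
        Platform::MacOs => Some(PathBuf::from(home).join("Library").join("Caches")),
        // Assumption: no fallback when the local app data path is unavailable.
        Platform::Windows => local_app_data.map(PathBuf::from),
    }
}
```

With `$XDG_CACHE_HOME` unset, `cache_root(Platform::Linux, "/home/u", None, None)` resolves to `/home/u/.cache`, which matches the note that such Linux setups see no change.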

Interface Changes

TensorData::shape now stores a Shape instead of a Vec<usize>. Existing binary records using BinFileRecorder or BinBytesRecorder are not forward-compatible and must be converted before upgrading.

static STATE_ENCODED: &[u8] = include_bytes!("model.bin");

let model: Model<B> = Model::new(&Default::default());

// Old format can still be loaded before upgrade, but must be re-saved in a forward-compatible format.
let record = BinBytesRecorder::<FullPrecisionSettings, &'static [u8]>::default()
    .load(STATE_ENCODED, &Default::default())
    .expect("Failed to decode state");
let model = model.load_record(record);

model.save_file("model.mpk", &NamedMpkFileRecorder::<FullPrecisionSettings>::new()).unwrap();

The module derive macro has been improved, and the Ignored<T> wrapper is now deprecated. For fields that should not be considered modules, use #[module(skip)] instead.

pub struct Conv1d<B: Backend> {
-    pub padding: Ignored<PaddingConfig1d>,
+    #[module(skip)]
+    pub padding: PaddingConfig1d,
}

We added support for explicit asymmetric padding. If you were using explicit padding, you must now specify a value for each side; repeat the same value within each pair to keep the previous symmetric behavior. Note that PaddingConfig3d does not support asymmetric padding yet.

// Symmetric (left, right)
- PaddingConfig1d::Explicit(1)
+ PaddingConfig1d::Explicit(1, 1)
// Symmetric (top, left, bottom, right)
- PaddingConfig2d::Explicit(1, 1)
+ PaddingConfig2d::Explicit(1, 1, 1, 1)
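
To see what per-side padding changes, the sketch below computes the output length of a 1D convolution with the usual formula `floor((len + left + right - dilation * (kernel - 1) - 1) / stride) + 1`. The `Explicit1d` struct and `conv1d_out_len` function are hypothetical stand-ins written for this note, not Burn types.

```rust
/// Hypothetical stand-in for an explicit 1D padding: (left, right).
#[derive(Clone, Copy, Debug)]
pub struct Explicit1d(pub usize, pub usize);

/// Output length of a 1D convolution over `len` input elements.
/// Returns `None` when the arguments are invalid or the kernel does not fit.
pub fn conv1d_out_len(
    len: usize,
    kernel: usize,
    stride: usize,
    dilation: usize,
    pad: Explicit1d,
) -> Option<usize> {
    let padded = len + pad.0 + pad.1;
    // Number of input positions covered by one (dilated) kernel window.
    let span = dilation * kernel.checked_sub(1)? + 1;
    if stride == 0 || padded < span {
        return None;
    }
    Some((padded - span) / stride + 1)
}
```

For a length-10 input with a kernel of 3, `Explicit1d(1, 1)` keeps the output at 10 elements, matching the old single-value form, while `Explicit1d(1, 0)` pads only the left side and yields 9.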

The Gelu activation module can now be configured with tanh approximation. This only affects code that instantiated Gelu directly.

- let activation = Gelu;
+ let activation = Gelu::new(); // or Gelu::default()
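
For reference, the widely used tanh approximation of GELU is `0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))`. The scalar sketch below implements that textbook formula; it is an assumption that Burn's option uses these same constants, and `gelu_tanh` is a name chosen for this example only.

```rust
use std::f32::consts::PI;

/// Tanh approximation of GELU for a single value:
/// 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3))).
pub fn gelu_tanh(x: f32) -> f32 {
    let inner = (2.0 / PI).sqrt() * (x + 0.044715 * x.powi(3));
    0.5 * x * (1.0 + inner.tanh())
}
```

The function passes large positive inputs through almost unchanged and maps large negative inputs to approximately zero, as expected of GELU.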

The position-wise feed-forward module now has a configurable activation function. To keep it backwards compatible with previously saved records, the field is marked as #[module(skip)].

#[derive(Module, Debug)]
pub struct PositionWiseFeedForward<B: Backend> {
    // ...
-   /// GELU activation function.
-   pub gelu: Gelu,
+   /// Activation function.
+   #[module(skip)]
+   pub activation: Activation<B>,
}

The Shape fields are now private and some methods have been renamed. ShapeError has been renamed to MetadataError.

- let b = tensor.shape().dims[0];
+ let b = tensor.shape()[0];

- if let Err(ShapeError::RankMismatch{...}) = lhs.broadcast(&rhs) {
+ if let Err(MetadataError::RankMismatch{...}) = lhs.broadcast(&rhs) {

- let shape = shape.swap(1, 2).unwrap();
+ let shape = shape.swapped(1, 2).unwrap();

- let shape = shape.permute(&[0, 2, 1, 3]).unwrap();
+ let shape = shape.permuted(&[0, 2, 1, 3]).unwrap();
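
The pattern behind these renames can be sketched as a type with private dimensions, index access, and fallible methods that return a new value. Everything below is a hypothetical mirror written for illustration: the local `Shape`, the `SketchError` enum, and the by-value signatures are assumptions, not Burn's definitions.

```rust
use std::ops::Index;

/// Hypothetical shape type with private storage.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Shape {
    dims: Vec<usize>,
}

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    AxisOutOfRange,
    InvalidPermutation,
}

impl Shape {
    pub fn new(dims: Vec<usize>) -> Self {
        Self { dims }
    }

    /// Return a shape with axes `a` and `b` exchanged.
    pub fn swapped(mut self, a: usize, b: usize) -> Result<Self, SketchError> {
        if a >= self.dims.len() || b >= self.dims.len() {
            return Err(SketchError::AxisOutOfRange);
        }
        self.dims.swap(a, b);
        Ok(self)
    }

    /// Return a shape whose i-th dimension is the input's `axes[i]`-th one.
    pub fn permuted(self, axes: &[usize]) -> Result<Self, SketchError> {
        if axes.len() != self.dims.len() {
            return Err(SketchError::InvalidPermutation);
        }
        let mut seen = vec![false; axes.len()];
        let mut dims = Vec::with_capacity(axes.len());
        for &axis in axes {
            // Each axis must be in range and used exactly once.
            if axis >= seen.len() || seen[axis] {
                return Err(SketchError::InvalidPermutation);
            }
            seen[axis] = true;
            dims.push(self.dims[axis]);
        }
        Ok(Self { dims })
    }
}

/// Index access replaces direct reads of the formerly public field.
impl Index<usize> for Shape {
    type Output = usize;

    fn index(&self, index: usize) -> &usize {
        &self.dims[index]
    }
}
```

Keeping the storage private is what allows the backing representation to change later without another breaking release, at the cost of going through accessors such as indexing.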

The boolean data type was expanded to include its storage type.

match bool_tensor.dtype() {
-   DType::Bool => todo!(),
+   DType::Bool(BoolStore::Native) => todo!(),
+   DType::Bool(BoolStore::U8) => todo!(),
+   DType::Bool(BoolStore::U32) => todo!(),
    _ => unreachable!(),
}
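
Code that does not care how booleans are stored can match the payload with a wildcard instead of listing every storage variant. The enums below are local mirrors for illustration only (the `F32` and `I32` variants are placeholders); the point is the `DType::Bool(_)` pattern, which follows from the tuple-variant form shown in the diff.

```rust
/// Local mirror of the storage kinds named in the diff.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BoolStore {
    Native,
    U8,
    U32,
}

/// Local mirror of a dtype enum; non-bool variants are placeholders.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DType {
    F32,
    I32,
    Bool(BoolStore),
}

/// True for any boolean dtype, whatever its storage.
pub fn is_bool(dtype: DType) -> bool {
    matches!(dtype, DType::Bool(_))
}

/// Extract the storage kind when the dtype is boolean.
pub fn bool_store(dtype: DType) -> Option<BoolStore> {
    match dtype {
        DType::Bool(store) => Some(store),
        _ => None,
    }
}
```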

powf is no longer supported for Int tensors, as it previously relied on incorrect implicit truncation. These operations are now only available for Float tensors.

- let tensor_i = tensor_int.powf(tensor_float);
+ let tensor_f = tensor_int.float().powf(tensor_float);

- let tensor_i = tensor_int.powf_scalar(scalar_float);
+ let tensor_f = tensor_int.float().powf_scalar(scalar_float);
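
A scalar example shows the kind of information loss that an integer-typed result implies. This is plain Rust arithmetic, not Burn code, and the two function names are made up for the illustration.

```rust
/// Raise an integer base to a fractional power by casting to float first.
pub fn powf_via_float(base: i32, exponent: f32) -> f32 {
    (base as f32).powf(exponent)
}

/// What an integer-typed result would hold: the fractional part is dropped.
pub fn powf_truncated(base: i32, exponent: f32) -> i32 {
    powf_via_float(base, exponent) as i32
}
```

`powf_via_float(2, 0.5)` is roughly 1.414, whereas `powf_truncated(2, 0.5)` collapses to 1, and any negative exponent applied to a base above 1 collapses to 0. Returning a float tensor makes the cast explicit at the call site.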

Backend tensor creation and conversion ops now take an explicit output dtype. This removes backend-specific dtype inference and ensures consistent behavior across backends. (Backend implementors only.)

impl BoolTensorOps<Self> for MyBackend {
-    fn bool_empty(shape: Shape, device: &Device<Self>) -> BoolTensor<Self> {
+    fn bool_empty(shape: Shape, device: &Device<Self>, dtype: BoolDType) -> BoolTensor<Self> {
        // use `dtype` instead of inferring internally
    }
-    fn bool_into_int(tensor: BoolTensor<Self>) -> IntTensor<Self> {
+    fn bool_into_int(tensor: BoolTensor<Self>, out_dtype: IntDType) -> IntTensor<Self> {
        // use `out_dtype` instead of inferring internally
    }
}

Associated types were moved from Backend to BackendTypes. Prefer the type aliases (Device<B>, FloatTensor<B>, etc.) to avoid type resolution issues.

impl BoolTensorOps<Self> for MyBackend {
-    fn bool_empty(shape: Shape, device: &<Self as Backend>::Device, dtype: BoolDType) -> <Self as Backend>::BoolTensorPrimitive {
+    fn bool_empty(shape: Shape, device: &Device<Self>, dtype: BoolDType) -> BoolTensor<Self> {
    }
}

Module & Tensor

  • Feat/device policy (#4373) @laggui
  • Implement basic RNN module (#4460) @aditya0by0
  • Add deg2rad and rad2deg (#4462) @softmaximalist
  • Implement median tensor operation (#4454) @softmaximalist
  • Add Selu activation function (#4439) @antimora
  • Add CELU activation function (#4441) @antimora
  • Add Elu activation function (#4438) @antimora
  • Add BiGru (bidirectional GRU) module (#4442) @antimora
  • Add ThresholdedRelu activation function (#4440) @antimora
  • Add Softsign activation function (#4437) @antimora
  • [Breaking] Add configurable activation and layer_norm_eps to transformer layers (#4410) @antimora
  • [Breaking] Add asymmetric padding support for conv and pool operations (#4263) @antimora
  • Implement HardShrink, SoftShrink and Shrink Activations (#4556) @aditya0by0
  • feat: add align_corners support to InterpolateOptions (#4518) @antimora
  • feat: support padding on arbitrary dimensions (#4507) @antimora
  • feat: enhance attention() with scale, attn_bias, softcap, and is_causal (#4476) @antimora
  • feat: Introduce Lanczos3 interpolation method (#4601) @ovr
  • Add HannWindow operator to burn-tensor (#4631) @walkinggo
  • [Breaking] Remove int powf and make powi numeric op (#4646) @laggui
  • [Breaking] Add bool store dtype + remove bool elem from fusion (#4649) @laggui
  • [Breaking] Use device settings to provide output dtype (#4653) @laggui
  • feat: add categorical sampling for tensors (#4655) @majiayu000
  • Add HammingWindow operator to burn-tensor (#4698) @RunjiaChen
  • Fix: make module cloning efficient for CPU devices (#4703) @antimora
  • feat: support cross-kind tensor casting via .cast() (#4713) @antimora
  • Add FloatInfo for dtype-aware precision info (#4721) @antimora
  • Fix unsqueeze_dims panic (#4755) @softmaximalist
  • Fix unsqueeze_dims panic on duplicate sorted axes (#4764) @antimora
  • feat(burn-nn): add native LocalResponseNorm module (#4765) @jcwal1516
  • Add det (determinant) tensor operation (#4813) @softmaximalist
  • Add Blackman window function to signal module (#4842) @softmaximalist
  • Add STFT/ISTFT and thread n through FFT backend trait (#4835) @antimora
  • Add linear op to ModuleOps for fused matmul+bias (#4747) @antimora
  • Add native implementations for scatter_nd / gather_nd; provide autodiff for assign & add (#4709) @cu9hue
  • Fix conv x-backward padding_out bug (#4806) @antimora
  • Extract float math ops in a new trait (#4891) @skewballfox
  • linalg::lu: Improve numerical handling and small perf cleanup (#4902) @softmaximalist
  • Adding complex to complex FFT implementation (#4903) @RunjiaChen
  • add autodiff for scatter_nd min/max/mul (#4909) @cu9hue
  • fix: conv_transpose x-backward output size (#4916) @SAY-5
  • Change pwff activation to #[module(skip)] for backward compat (stateless) (#4929)

Datasets & Training

  • Implement SSIM vision metric (#4396) @softmaximalist
  • add KLDivLoss and batch_mean in reduction (#4399) @donjuanplatinum
  • Fix cubek matmul stage size (#4435) @laggui
  • Implement the PSNR vision metric (#4379) @softmaximalist
  • Implement Mean(L(P) Norm Error)Loss (#4341) @softmaximalist
  • Feature flag + Tests for RL in burn-rl and burn-train (#4470) @Charles23R
  • Burn rl (#4447) @Charles23R
  • add AMSgrad support for Adam/AdamW (#4388) @donjuanplatinum
  • add LBFGS optimizer (#4471) @donjuanplatinum
  • Add SequenceOutput struct for sequence prediction outputs (#4474) @softmaximalist
  • fix: OptimSharded strategy validation device mismatch (#4527) @Dreaming-Codes
  • Implement CTC loss (#4529) @softmaximalist
  • Add Smooth L1 loss (#4547) @softmaximalist
  • Implements: LPIPS metrics for Image quality (#4403) @koreaygj
  • feat: Implements DISTS metric (#4574) @koreaygj
  • Add multi-scale SSIM for image quality assessment (#4555) @softmaximalist
  • Add Gram Matrix Loss for vision tasks (#4595) @softmaximalist
  • Add evaluator summary (#4578) @laggui
  • Fix cosine scheduler record in composed scheduler (#4617) @laggui
  • Implement RNNT loss (#4623) @cong-or
  • feat: add FID vision metric (#4644) @cong-or
  • Add Adan optimizer implementation with tests (#4651) @sepcnt
  • [Breaking] Split TrainingStrategy to decouple the DistributedBackend requirement (#4710) @laggui
  • Fix CrossEntropyLoss with probabilities (#4829) @laggui

Backends

  • More explicit global dtype support (#4400) @laggui
  • opt(burn-cubecl): Optimized tensors by default (#4402) @wingertge
  • Add device dtype usage (#4404) @laggui
  • Attention: add autotune gate (#4554) @louisfd
  • Attention autotune (#4552) @louisfd
  • Attention: remove default impl and implement for all backends (#4544) @louisfd
  • Add native sign unary ops for CubeCL float and int (#4513) @yash27-lab
  • [Feat] Global backend Dispatch (#4508) @laggui
  • allow flash attention with causal (#4509) @louisfd
  • Perf: Improve fusion score (#4511) @nathanielsimard
  • Dispatch autodiff checkpointing strategy support (#4629) @laggui
  • Selector/attention (#4648) @louisfd
  • update cubek and fix vecmat autotune (#4682) @louisfd
  • update cubek and cubecl (#4699) @louisfd
  • update cubek & fix gemv autotune (#4726) @louisfd
  • Feat/add rfft (#4707) @Sublime12
  • Feat/add irfft (#4719) @Sublime12
  • Feat/implement fusion for rfft (#4735) @Sublime12
  • Feat/implement fusion for irfft (#4736) @Sublime12
  • Add burn-flex CPU backend (#4761) @antimora
  • burn-flex: enable f16 tests and fix mean overflow, grid_sample and quantization (#4769) @antimora
  • Add softmax and layer_norm backend trait hooks (#4797) @antimora
  • burn-flex: implement softmax and layer_norm backend op (#4805) @antimora
  • Matmul selection (#4773) @nathanielsimard
  • Add native dispatch overrides and native tch ops for softmax, layer_norm (#4834) @antimora
  • [Breaking] Split Associated Types from Backend into BackendTypes (#4868) @skewballfox
  • Add ctc_loss backend trait hook + tch and cubecl impls (#4819) @antimora
  • Update CubeK: tile matmul refactor (#4901) @louisfd
  • Add argtopk for Cubecl backend (#4900) @Sublime12
  • Add fusion integration for argtopk (#4904) @Sublime12
  • Add cubecl integration to topk (#4906) @Sublime12
  • Fusion tests (#4872) @nathanielsimard
  • Enable & fix cubecl tests w/ fusion (#4917) @laggui

Bug Fixes

  • Fix reduce line size parallel and mean accumulator precision (#4467) @laggui
  • fix: default to single device strat when only 1 device (#4463) @Charles23R
  • fix: use all dilation entries in max_pool2d_with_indices_backward (#4466) @fcasal
  • Fix cubek matmul stage size (#4435) @laggui
  • fix: Fix interpolate with NHWC input (#4363) @wingertge
  • fix: Actually implement conv backwards ops for burn-fusion/burn-router (#4360) @wingertge
  • Fix memory growth: use GraphLocator::remove_entry for orphan cleanup (#4342) @jnamika
  • fix: Bool from_data_dtype panics on GPU backends (#4551) @antimora
  • fix: resolve macOS build and test failures (#4545) @antimora
  • Fix too many kernels (#4505) @nathanielsimard
  • Fix quantization non-contiguous input (#4498) @laggui
  • fix overflow in int_abs_elem for i64 min value (#4486) @Olexandr88
  • Fix: create multiple elemwise fused block (#4497) @nathanielsimard
  • Fix fusion cumulative op inputs (#4621) @laggui
  • Fix dispatch autodiff feature propagation (#4592) @laggui
  • Fix conv2d_weight_backward w/ strided channels and unit spatial dims (#4591) @laggui
  • Fix(lpips): load ImageNet backbone weights for pretrained models (#4557) @koreaygj
  • Fix tch int_zeros dtype in sync (#4664) @laggui
  • Fix fusion kernel vector_size mismatch on f16 output writes (#4675) @AdrianEddy
  • Fix fusion consistency checks and binding estimation (#4695) @nathanielsimard
  • Fix attention_fallback NaN for fully-masked rows (#4697) @antimora
  • fix output in attention tuner (#4702) @louisfd
  • fix: use integer arithmetic for nearest-neighbor coordinate scaling (#4687) @wkrettek
  • Fix cubecl cuda all-reduce + remove useless check in distributed server (#4720) @Charles23R
  • Fix fusion scalar broadcasting in write_output_aligned (#4741) @laggui
  • Fix quantization tests and flaky tolerance (#4743) @laggui
  • Fix select_assign OOB (#4760) @nathanielsimard
  • Fix burn-flex bool binary ops to broadcast operands (#4775) @antimora
  • Fix burn-flex attention rejecting broadcasted mask/bias (#4777) @antimora
  • fix(ndarray): grouped conv SIMD clamp + regressions (#4727) @dnvt
  • Fix autotune context, remove unsafe code (#4781) @ArthurBrussee
  • Fix cubecl cross product on non-last dimension (#4850) @dschulmeist
  • Fix burn-flex to_contiguous fast path for prefix views (#4856) @antimora
  • Fix burn-flex sum_dim reading contiguous storage on transposed input (#4861) @antimora
  • Fix burn-flex argmax NaN ordering; tighten expand; precise erf (#4859) @antimora
  • Fix fusion reduce broadcasted when multi block local might be a view (#4867) @laggui
  • Fix select_assign OOB units (#4870) @laggui
  • Update cubecl + cubek: fix matmul, reduce WASM and vector size check on strided tensors (#4874) @laggui
  • Fix fusion read_quantized native type (#4923) @laggui

Documentation & Examples

  • Update Burn Book: metrics and trig functions (#4413) @softmaximalist
  • docs: add DataframeDataset example using Polars (#4298) @SameerVers3
  • doc(notebook) : add more basic operations and some examples (#4542) @Tyooughtul
  • Update documentation link for burn-store (#4619) @softmaximalist
  • Update building-blocks chapter (#4625) @softmaximalist
  • Update ONNX import docs for LoadStrategy and from_bytes (#4607) @antimora
  • Use burn-flex in docs and examples (#4841) @antimora

Fixes

  • Add field docs to generated methods (#4408) @swfsql
  • Fix typo in dataset.md in Burn Book (#4380) @softmaximalist
  • Fix book guide training changes (#4340) @laggui
  • Fix image-classification-web links (#4536) @laggui
  • fix: replace ValidStep with InferenceStep in training.md (#4620) @TsaoLun

Enhancements

  • Add module.train() to move a module back to the autodiff backend (#3975) @laggui
  • Perf/fusion/reduce broadcasted (#4338) @nathanielsimard
  • feat: Enable 64-bit indexing for kernels (#4502) @wingertge
  • Refactor/device handle (#4593) @nathanielsimard
  • All reduce backward (#4650 #4873) @Charles23R
  • Perf/burn fusion overhead (#4645) @nathanielsimard
  • Device service usage (#4839) @nathanielsimard

Refactoring

  • Add Scalar runtime literal (#4337) @laggui
  • Move ONNX crates to burn-onnx repository (#4393) @antimora
  • chore: Update cubecl to runtime config refactor (#4489) @wingertge
  • chore: deprecate burn-candle backend (#4416) @antimora
  • Move ONNX import to burn-onnx crate (#4361) @laggui
  • [Breaking] perf: Make backing storage of Shape more flexible (#4516) @wingertge
  • refactor: Move from CubeOption to Option (#4543) @wingertge
  • [Breaking] refactor: Metadata type/strides refactor (#4534) @wingertge
  • Use shape in TensorData (#4603) @laggui
  • refactor: Vector size generic (#4624) @wingertge
  • refactor: View launch (#4639) @wingertge
  • Refactor backend tests to set device settings at initialization + use Dispatch (#4666) @laggui
  • Prep for Group Multi Optimizers (#4818) @crutcher
  • Cleanup OptimizerAdaptor / GradAdaptor API. (#4822) @crutcher
  • Remove unused M param from SimpleOptimizerMapper. (#4823) @crutcher
  • Move tensor tests from burn-flex to burn-backend-tests (#4812) @antimora
  • Fusion all reduce + refactor collective (#4803) @Charles23R
  • Migrate benchmarks from burn-flex to burn-backend-tests (#4853) @antimora
  • Migrate default test backend from NdArray to Flex (#4854) @antimora
  • Update cubecl: refactor toml config, fix autotune priority and fix persistent memory pool reset (#4858) @nathanielsimard
  • Add burn-std::config runtime configuration with fusion logging and search optimization (#4864) @nathanielsimard
  • Update/cubecl to client (#4866) @Charles23R
  • Centralize internal burn-* deps in [workspace.dependencies] (#4876) @antimora
  • Remove optim::optim (#4924) @crutcher

Miscellaneous

  • Update zip + time (#4468) @laggui
  • Update cubecl wgpu v28 (#4244) @laggui
  • [Breaking] Use cache_dir() instead of hardcoded ~/.cache path (#4372) @antimora
  • Make ElementComparison optional for dtypes (#4255) @skewballfox
  • Performance tweaks to the lp_norm code. (#4318) @crutcher
  • ensure that tensor is owned on iter_dim call (#4309) @tzemanovic
  • Use NodeType to point to unimplemented node (#4334) @laggui
  • Bump burn version 0.21 (#4333) @laggui
  • feat(burn-store): add ModuleAdapter chaining (#4407) @huahuadeliaoliao
  • Replace Vec-based TransitionBuffer with tensor-backed storage (#4504) @arferreira
  • Optional Ordering for NdArrayElement (#4559) @skewballfox
  • Move burn-nn module name checks in burn-store adapter to the test section (#4580) @softmaximalist
  • Expose BurnpackError (#4585) @AdrianEddy
  • Add HalfPrecisionAdapter for F32/F16 mixed-precision storage (#4594) @antimora
  • Improve module derive + add #[module(skip)] attribute (#4618) @laggui
  • Fix SSIM float types to f32 (#4602) @softmaximalist
  • Fix function arg name inconsistencies (#4626) @softmaximalist
  • Make Param Sync for parallel model inference (#4701) @antimora
  • Fix flaky initializer_normal_init test (#4766) @leohenon
  • Add Record<(R0,)> 1-Tuple (#4825) @crutcher
  • Display FlexDevice as Cpu (#4857) @antimora
  • Fix rustls-webpki audit (#4863) @laggui
  • Fix PytorchReader bugs to load legacy files correctly (#4897) @softmaximalist
  • Add Clone + 'static bounds to LrScheduler::Record and derive Clone for scheduler records (#4905) @crutcher
  • Add ParamId::try_deserialize() (#4881) @crutcher
  • Use gather_nd in RNN-T gather_loss (#4895) @antimora
  • Re-enable fusion f16 conv + bn regression tests (#4920) @laggui
  • rnnt.rs: Optimize extract_log_probs and init_alpha (#4922) @softmaximalist
  • Fix some test tolerances (#4926) @laggui

Full Changelog: https://github.com/tracel-ai/burn/compare/v0.20.0...v0.21.0

v0.20.1 Bug fix

Fixed book guide training changes and removed dequantize native debug statements.

Full changelog

Bug Fixes & Improvement

  • Fix book guide training changes (#4340) @laggui
  • Fix dequantize native debug statement (https://github.com/tracel-ai/cubek/pull/69) @laggui
  • Do not point to pinned exact versions to allow pulling patch releases @laggui
v0.20.0 Breaking risk
Breaking changes
  • Replaced `LearnerBuilder` with `SupervisedTraining::new(...).summary()` flow in the training API.
  • Added required `IndexingUpdateOp` argument to `scatter` and `select_assign` operations.
  • Removed const generic `D2` from slice / SliceArg APIs (now use dimension‑agnostic forms).
Notable features
  • Complete overhaul of the ONNX import system with support for new control flow operators (`If`, `Loop`, `Scan`) and memory‑mapped loading.
  • Integration of [`CubeCL`](https://github.com/tracel-ai/cubecl/) for unified CPU/GPU kernel performance across diverse hardware.
Full changelog

Summary

This release marks a major turning point for the ecosystem with the introduction of CubeK. Our goal was to solve a classic challenge in deep learning: achieving peak performance on diverse hardware without maintaining fragmented codebases.

By unifying CPU and GPU kernels through CubeCL, we've managed to squeeze maximum efficiency out of everything from NVIDIA Blackwell GPUs to standard consumer CPUs.

Beyond performance, this release makes the library more robust, flexible, and significantly easier to debug.

This release also features a complete overhaul of the ONNX import system, providing broader support for a wide range of ONNX models. In addition, various bug fixes and new tensor operations enhance stability and usability.

For more details, check out the release post on our website.

Changelog

Breaking

We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.

Training

We refactored burn-train to better support different abstractions and custom training strategies. As part of this,
the LearnerBuilder has been replaced by the LearningParadigm flow:

- let learner = LearnerBuilder::new(ARTIFACT_DIR)
+ let training = SupervisedTraining::new(ARTIFACT_DIR, dataloader_train, dataloader_valid)
        .metrics((AccuracyMetric::new(), LossMetric::new()))
        .num_epochs(config.num_epochs)
-       .learning_strategy(burn::train::LearningStrategy::SingleDevice(device))
-       .build(model, config.optimizer.init(), lr_scheduler.init().unwrap());
+       .summary();
 
- let result = learner.fit(dataloader_train, dataloader_valid);
+ let result = training.launch(Learner::new(
+      model,
+      config.optimizer.init(),
+      lr_scheduler.init().unwrap(),
+ ));

Interface Changes

The scatter and select_assign operations now require an IndexingUpdateOp to specify the update behavior.

- let output = tensor.scatter(0, indices, values);
+ let output = tensor.scatter(0, indices, values, IndexingUpdateOp::Add);
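
The semantics of an additive scatter can be sketched on a flat slice: every value is added to the output position named by its index, so repeated indices accumulate. `UpdateOp` and `scatter_1d` are hypothetical names for this sketch, and only the `Add` behavior shown in the diff is modeled.

```rust
/// Hypothetical mirror of the update selector; only `Add` is modeled here.
#[derive(Clone, Copy, Debug)]
pub enum UpdateOp {
    Add,
}

/// Scatter `values` into a copy of `input` at `indices` along a single axis.
/// Panics if the lengths differ or an index is out of bounds.
pub fn scatter_1d(input: &[f32], indices: &[usize], values: &[f32], op: UpdateOp) -> Vec<f32> {
    assert_eq!(indices.len(), values.len(), "one value per index");
    let mut out = input.to_vec();
    for (&index, &value) in indices.iter().zip(values) {
        match op {
            // Repeated indices accumulate instead of overwriting.
            UpdateOp::Add => out[index] += value,
        }
    }
    out
}
```

Scattering `[1.0, 2.0, 3.0]` at indices `[0, 2, 0]` into three zeros gives `[4.0, 0.0, 2.0]`, because index 0 receives two contributions.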

API calls for slice, slice_assign, and slice_fill no longer require const generics for dimensions, which cleans up the syntax quite a bit:

- let prev_slice = tensor.slice::<[Range<usize>; D]>(slices.try_into().unwrap());
+ let prev_slice = tensor.slice(slices.as_slice());

The grid_sample_2d operation now supports different options.
To preserve the previous behavior, make sure to specify the matching options:

- let output = tensor.grid_sample_2d(grid, InterpolateMode::Bilinear);
+ let options = GridSampleOptions::new(InterpolateMode::Bilinear)
+     .with_padding_mode(GridSamplePaddingMode::Border)
+     .with_align_corners(true);
+ let output = tensor.grid_sample_2d(grid, options);
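
The effect of the two options can be sketched on a single coordinate. The mapping below follows the convention documented for PyTorch's `grid_sample`, where normalized coordinates lie in [-1, 1]; that Burn uses the same convention is an assumption, and both function names are hypothetical.

```rust
/// Map a normalized coordinate in [-1, 1] to a pixel position along an axis
/// of `size` pixels.
pub fn unnormalize(coord: f32, size: usize, align_corners: bool) -> f32 {
    let size = size as f32;
    if align_corners {
        // -1 and 1 refer to the centers of the first and last pixels.
        (coord + 1.0) / 2.0 * (size - 1.0)
    } else {
        // -1 and 1 refer to the outer edges of the first and last pixels.
        ((coord + 1.0) * size - 1.0) / 2.0
    }
}

/// Border padding: positions outside the image are clamped to the nearest
/// valid pixel center. An empty axis is treated as having one pixel.
pub fn clamp_to_border(pixel: f32, size: usize) -> f32 {
    pixel.clamp(0.0, size.max(1) as f32 - 1.0)
}
```

On a 4-pixel axis, a coordinate of 1.0 maps to pixel position 3.0 with `align_corners` enabled and to 3.5 without it; border padding then clamps 3.5 back to 3.0.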

The QuantStore variants used in QuantScheme have been updated to support a packing dimension.

  pub enum QuantStore {
      /// Native quantization doesn't require packing and unpacking.
      Native,
+     /// Store packed quantized values in a natively supported packing format (i.e. e2m1x2).
+     PackedNative(usize),
      /// Store packed quantized values in a 4-byte unsigned integer.
-     U32,
+     PackedU32(usize),
 }
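
As a rough picture of what storing packed values in a 4-byte unsigned integer means, the sketch below packs eight 4-bit values into each `u32`. The bit order (first value in the lowest bits) and the function names are assumptions for illustration, and the packing dimension carried by the new variants is not modeled.

```rust
/// Pack 4-bit values (low nibble of each byte) into 32-bit words,
/// eight values per word, first value in the lowest bits.
pub fn pack_u4(values: &[u8]) -> Vec<u32> {
    values
        .chunks(8)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u32, |word, (i, &v)| word | (u32::from(v & 0x0F) << (4 * i)))
        })
        .collect()
}

/// Recover the first `len` 4-bit values from packed words.
/// Panics if `words` holds fewer than `len` values.
pub fn unpack_u4(words: &[u32], len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| ((words[i / 8] >> (4 * (i % 8))) & 0x0F) as u8)
        .collect()
}
```

Packing `[1, 2, 3]` yields the single word `0x321`, and unpacking with the original length restores the input.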

Finally, Shape no longer implements IntoIterator. If you need to iterate by-value over dimensions, access the dims field directly.

- for s in shape {
+ for s in shape.dims {

Module & Tensor

  • Generalize linalg::outer semantics; add linalg::outer_dim (#3923) @crutcher
  • Use square() where appropriate. (#3900) @crutcher
  • Add linalg matvec (#3967) @huy209vn
  • Add GaussianNoise layer (#4022) @kul-sudo
  • Make TransformerEncoderLayer fields public (#4053) @Mnwa
  • Workaround MPS embedding allocation error in LibTorch (#4073) @antimora
  • Fix Slice operation to handle empty ranges (#4083) @antimora
  • Handle empty tensors in cat and slice_assign ops (#4095) @antimora
  • [Breaking] Add IndexingUpdateOp to scatter and select_assign (#4070) @laggui
  • Add CrossAttention module to burn-nn (#4101) @huy209vn
  • Add reflect and edge padding modes to tensor.pad (#4105 #) @antimora
  • Fix GLU and quiet softmax activations (#4121) @laggui
  • Add ceil_mode support to pooling operations (MaxPool, AvgPool) (#4112) @antimora
  • [Breaking] Remove D2 const generic from slice / SliceArg (#4127) @crutcher
  • Add backend supports_dtype (#4155) @laggui
  • Fix repeat 0 times (#4216) @laggui
  • feat: add hardswish activation (#4209) @mertalev
  • Add more trig ops (#4282) @laggui
  • Add empty/zeros/ones/full TensorCreationOptions (#4285) @laggui
  • feat: nms op (#4246) @mertalev

Datasets & Training

  • Refactor metric loggers (#3895 #4017) @Charles23R
  • Add support for custom learning strategy (#3921) @Charles23R
  • Feat/optim/distributed (#4018) @nathanielsimard
  • Refactor MetricEntry (#4031) @Charles23R
  • Feature muon (#3925) @NewBornRustacean
  • Add warmup epochs to MetricEarlyStoppingStrategy (#4041) @crutcher
  • Log running values (#4199) @Charles23R
  • Fix checkpoint and summary log level (#4201) @J-F-Liu
  • [Breaking] Burn train api refactor (#4223 #4283) @Charles23R
  • Fix checkpointer interrupt (#4268) @Charles23R

Backends

  • Add candle device seeding (#3959) @laggui
  • feat: Enable tuning for MMA matmul (#3961) @wingertge
  • feat: TMA autotuning (#3986) @wingertge
  • feat: Enable tuning specialized matmul (#4026) @wingertge
  • Add CubeCL Flash Attention module (#4089 #4192) @louisfd
  • Zero-copy tensor loading for NdArray backend (#4178) @antimora
  • feat: Implicit GEMM weight gradients for convolution (#4182) @wingertge
  • Perf/reduce cpu + Fix OOB (#4197 #4204) @nathanielsimard
  • feat: Accelerated convolution data gradient (#4220) @wingertge
  • Remove linux-only constraint for cpu (#4233) @louisfd
  • Perf/into contiguous (#4257) @nathanielsimard
  • fix: grid sample using excessive memory (#4236 #4242) @mertalev
  • Add fast-path for batched vector–matrix matmul (#4300) @louisfd

Bug Fixes

  • Fix async barrier & TMA checks (#4007) @nathanielsimard
  • Fix fusion reduce local already registered as output (#4014) @laggui
  • Fix remainder int (#4015) @laggui
  • Fix cuda mem error (#4020) @nathanielsimard
  • Cleanup autodiff unused roots (#4039) @laggui
  • Fix autotuner (#4049) @nathanielsimard
  • Fix scatter values backward (#4064) @khoek
  • More correctness fixes in autodiff ops (#4069) @khoek
  • Fix transaction read (#4074) @laggui
  • Fix tch bf16 kind (#4088 #4142 #4203) @laggui
  • Fix cubecl cuda compilation error/typo (#4092) @BjornTheProgrammer
  • Fix output dtype for argmin / argmax (#4195) @tzemanovic
  • Return slice for each dimension in shape (#4152) @laggui

Documentation & Examples

  • Update raspberry pi pico example (#4034 #4132) @BjornTheProgrammer
  • Contributor Book: Update the "ONNX to Burn" Page (#4229) @softmaximalist
  • docs: add examples for bool tensor operations (#4248) @qburke
  • Update the "Adding New Operation" guide in the contributor book (#4284) @softmaximalist
  • Refactor dop_timer for multiple trials (for warmup). (#4288) @crutcher
  • Added documentation examples for more boolean tensor operations in burn-tensor (#4289) @qburke

Fixes

  • Fix book (#3942) @laggui
  • remove repetitive words in comment (#4029) @black5box
  • Include katex header as symlink (#4118) @laggui
  • Fix quantization docs (make it clear that only PTQ is currently supported) (#4316) @laggui

ONNX Support

  • ONNX IR and import refactor to better support complex graphs (#3872 #4019 #4033 #4094) @antimora
  • Add ONNX control flow operators: If, Loop, and Scan (#3936) @antimora
  • Silero VAD ONNX model verification (#3999) @antimora
  • Add support for yolo12x model variant (#4048) @antimora
  • Remove burn-import abstraction layer and use onnx-ir types directly (#4033) @antimora
  • Fix ConstantOfShape output size determination (#4085) @antimora
  • Specify output rank in squeeze_dims for type inference (#4086) @antimora
  • Fix Expand operation to use ONNX max-semantics (#4082) @antimora
  • [Breaking] Add ONNX GridSample op support and tests (#4084) @antimora
  • Add RF-DETR model check for burn-import (#4087) @antimora
  • Add LSTM operator support with configurable activations (#4106) @antimora
  • Add memory-mapped ONNX loading with tensor data ref (#4097) @antimora
  • Fix outer-scope variable references in ONNX subgraphs (If/Loop/Scan) (#4119) @antimora
  • Add Reshape scalar optimization and Gather scalar input support (#4146) @antimora
  • Update GELU ONNX test to use native op and fix expected values (#4161) @antimora
  • Add ONNX CumSum operator support (#4162) @antimora
  • Remove global ONNX opset version restriction, recommend opset 16 (#4168) @antimora
  • Handle 1D slope when importing prelu from onnx (#4205) @mertalev
  • Fix handling scalar scan outputs in ONNX loop nodes (#4210) @antimora
  • Add ONNX external data support for models >2GB (#4158) @antimora
  • fix: handle negative indices in onnx gather op (#4207) @mertalev
  • Split backend tensor ops tests (#4232) @laggui
  • Do not use alloc import in burn-import codegen (#4286) @laggui
  • Fix ONNX where broadcasted dims (#4315) @laggui
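
The "max-semantics" mentioned for Expand (#4082) refers to the ONNX rule that the output shape follows bidirectional broadcasting, so it can differ from the requested shape when the request contains 1s or has a lower rank than the input. A minimal sketch of that shape rule in plain Rust follows; `expand_shape` is a hypothetical helper written for illustration, not a function from Burn or onnx-ir.

```rust
/// Output shape of a bidirectional (NumPy-style) broadcast between an input
/// shape and a requested shape. Returns `None` when the shapes are incompatible.
/// Hypothetical helper for illustration only; not part of Burn or onnx-ir.
fn expand_shape(input: &[usize], requested: &[usize]) -> Option<Vec<usize>> {
    let rank = input.len().max(requested.len());
    // Right-align both shapes by treating missing leading dimensions as 1.
    let dim_at = |shape: &[usize], i: usize| {
        let pad = rank - shape.len();
        if i < pad { 1 } else { shape[i - pad] }
    };
    (0..rank)
        .map(|i| match (dim_at(input, i), dim_at(requested, i)) {
            // Equal dimensions pass through unchanged.
            (a, b) if a == b => Some(a),
            // A dimension of 1 stretches to match the other side.
            (1, b) => Some(b),
            (a, 1) => Some(a),
            // Anything else cannot be broadcast.
            _ => None,
        })
        .collect()
}
```

With this rule, an input of shape `[3, 4]` expanded with a requested shape of `[1, 1]` keeps the shape `[3, 4]`, because each output dimension is effectively the larger of the two.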

Enhancements

  • Feat/pinned memory staging (#4016) @nathanielsimard
  • burn-store enhancements for troubleshooting and new enum skip flag (#4051) @antimora
  • Feat/runtime error (#4079 #4110) @nathanielsimard
  • Perf/improve reduce autotuning + plane non uniform control flow check (#4208) @nathanielsimard
  • Packed quantized matmul with QuantStore changes (#4310 #4323) @wingertge

Refactoring

  • chore: Update to batch caching PR for cubecl (#3948) @wingertge
  • Refactor IR to define outputs as a function of the operation (#3877) @laggui
  • Chore/update dtypes (#3998) @nathanielsimard
  • Cleanup quantization strategy (CPU ref, ndarray only) (#4023) @laggui
  • Refactor/dtype cubecl (#4032) @nathanielsimard
  • Refactor of burn fusion and burn cubecl fusion (#4044) @nathanielsimard
  • chore: Update to cubecl scalar refactor (#4062) @wingertge
  • refactor: cubecl Runtime trait (#4065) @wingertge
  • Refactor/autotuner (#4068) @nathanielsimard
  • Move types from burn-tensor to burn-std and burn-backend (#4050) @laggui
  • Feat/error handling cubecl (#4076) @nathanielsimard
  • Refactor RemoteDevice and RemoteSender. (#4113 #4108) @crutcher
  • Refactor LocalCollectiveClient and LocalCollectiveServer (#4125 #4126) @crutcher
  • Move backend traits and types to burn-backend (#4111) @laggui
  • Migrate ONNX import to burnpack format (removing Record type) (#4122) @antimora
  • Refactor more basic ops (#4156) @laggui
  • Refactor configurable backend tests (no more testgen macros) (#4129) @laggui
  • Backends no longer depend on burn-tensor, but strictly burn-backend (#4169) @laggui
  • Refactor/cube dim (#4217) @nathanielsimard
  • Update ops subfolder file names (#4271) @softmaximalist
  • refactor: Migrate to usize indexing (#4273) @wingertge
  • Unify ReshapeArgs / Shape.reshape(args) (#4221 #4317) @crutcher @laggui
  • chore: Update to refactor cubecl types and traits (#4297) @wingertge

Miscellaneous

  • Add Shape::ravel_index for row-major raveling of indices. (#3879) @crutcher
  • ci: let CI server dispatch the test-gpu workflow (#3938) @syl20bnr
  • ci: check tag version against Cargo.toml version before publishing (#3939) @syl20bnr
  • Implement error for DataError (#3960) @laggui
  • Pin burn crates version (#4035) @Marc-AnthonyG
  • Implement FromStr for Slice with parsing and error handling (#3983) @crutcher
  • Enable no-std SafeTensors support and update hashbrown (#4071) @antimora
  • Move network utilities to burn-std (#4104) @laggui
  • Add 256-byte tensor alignment to burnpack format for mmap zero-copy support (#4100) @antimora
  • Fix/autotune checks (#4114) @nathanielsimard
  • Add direct tensor snapshot retrieval API to ModuleStore (#4131) @antimora
  • Implement Slice iterator and utility methods. (#4042) @crutcher
  • Shape FromStr/ToString (#4143) @crutcher
  • Add contiguous index mapping for non-contiguous layer indices (#4150) @antimora
  • Zero-copy loading for embedded burnpack weights (#4154) @antimora
  • Add flatten_dims method to Shape and refactor tensor flattening API (#4189) @crutcher
  • Make xtask validate run no-std checks first. (#4198) @crutcher
  • Add tracing::instrument and refactor collective operations. (#4157 #4234) @crutcher
  • Fix dtype preservation when loading tensors in burn-store (#4194) @antimora
  • Fix burn-store quantized tensor storage data length calculation (#4180) @antimora
  • Replace canonicalize_dim with expect_dim (#4196) @crutcher
  • Refactor: Consolidate shape and slice error handling into ExpressionError (#4218) @crutcher
  • Implement TODO tests and validation for Sum operation in onnx-ir (#4251) @softmaximalist
  • Fix burn-store collector tuple modules (#4270) @laggui
  • Fix rand os_rng (#4295) @laggui
  • chore: update xtask to 4.9.0 (#4311) @syl20bnr
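
For readers unfamiliar with the row-major raveling mentioned in the `Shape::ravel_index` entry (#3879), the idea can be sketched in plain Rust. The free function below is a hypothetical stand-in; the real method's signature and error handling are not shown in these notes and may differ.

```rust
/// Converts a multi-dimensional index into a flat offset using row-major
/// (C-order) layout, where the last dimension varies fastest.
/// Hypothetical stand-in for illustration; not Burn's actual API.
fn ravel_index(indices: &[usize], shape: &[usize]) -> Option<usize> {
    // The index must provide exactly one coordinate per dimension.
    if indices.len() != shape.len() {
        return None;
    }
    let mut flat = 0usize;
    for (&i, &d) in indices.iter().zip(shape) {
        // Each coordinate must lie inside its dimension.
        if i >= d {
            return None;
        }
        // Horner-style accumulation: shift by the dimension size, then add.
        flat = flat * d + i;
    }
    Some(flat)
}
```

For a shape of `[3, 4]`, the index `[1, 2]` maps to `1 * 4 + 2 = 6`.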
v0.19.1 Bug fix

Fixed a pickle reader regression that prevented integer dictionary keys from being unpickled correctly.

Full changelog

Bug Fixes & Improvements

  • Autodiff: fixed RAM memory leak with correct graph cleanup (#3957 #3982) @laggui
  • Better memory reuse: improved sliced memory pool implementation (#3941) @nathanielsimard
  • Cuda: update cudarc, auto-detect CUDA version and fix some 12.8 features (CubeCL #1008) @wingertge
  • Quantized Linear: fixed fusion configuration to fuse more precisions (#3941) @nathanielsimard
  • PyTorch import: fixed pickle reader regression with integer dictionary keys (#3978) @laggui
  • Docs: switched to doc_cfg to fix docs.rs builds (#3979) @laggui
  • Tensor API fixes:
    • *_like preserves dtype (#3953) @crutcher
    • RotaryEncoding sum dimension for 3D input (#3954) @laggui
    • squeeze check for output rank > 0 (#3946) @laggui
    • Linear for input/output rank 1 (#3966) @lucasmdjl
v0.19.0 Breaking risk
Breaking changes
  • `.devices(vec![device.clone()])` is replaced by `.learning_strategy(LearningStrategy::SingleDevice(device.clone()))`.
  • `learner.fit(...)` now returns a `TrainingResult` instead of the trained model; read the model from `result.model` and reuse `result.renderer` for evaluation metrics.
  • The `Config` trait now requires a `Debug` implementation.
Notable features
  • Multi-stream execution and optimized device transfers enable true multi‑GPU parallelism.
  • New CPU backend based on MLIR/LLVM providing JIT compilation, autotuning and fusion on CPUs.
  • Comprehensive quantization support with fused dequantization and new quantized operations.
Full changelog

Summary

This release brings major improvements to enable efficient distributed training, quantization, and CPU support in Burn.

To achieve true multi-GPU parallelism, we had to rethink several core systems: we implemented multi-stream execution to keep all GPUs busy, optimized device transfers to avoid unnecessary synchronization, and redesigned our locking strategies to eliminate bottlenecks in autotuning, fusion, and autodiff. We also introduced burn-collective for gradient synchronization and refactored our training loop to support different distributed training strategies.

Additionally, we added comprehensive quantization support, allowing models to use significantly less memory while maintaining performance through fused dequantization and optimized quantized operations.

Finally, we introduced a new CPU backend powered by MLIR and LLVM, bringing the same JIT compilation, autotuning, and fusion capabilities from our GPU backends to CPU execution.

As with previous releases, this version includes various bug fixes, further optimizations and enhanced documentation. Support for ONNX models has also been expanded, with additional operators and bug fixes for better operator coverage.

For more details, check out the release post on our website.

Changelog

Breaking

We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.

Learning Strategy

We refactored the Learner to support better distributed training strategies. Instead of registering a list of device(s), you now specify a training strategy.

  let learner = LearnerBuilder::new(artifact_dir)
      .metric_train_numeric(AccuracyMetric::new())
      .metric_valid_numeric(AccuracyMetric::new())
      .metric_train_numeric(LossMetric::new())
      .metric_valid_numeric(LossMetric::new())
      .with_file_checkpointer(CompactRecorder::new())
-     .devices(vec![device.clone()])
+     .learning_strategy(LearningStrategy::SingleDevice(device.clone()))
      .num_epochs(config.num_epochs)
      .summary()
      .build(
          config.model.init::<B>(&device),
          config.optimizer.init(),
          config.learning_rate,
      );

Learner Training Result

The Learner previously lacked an evaluation loop. To support one, fit now returns a TrainingResult that bundles the training state, including the trained model and a metrics renderer.

- let model_trained = learner.fit(dataloader_train, dataloader_valid);
+ let result = learner.fit(dataloader_train, dataloader_valid);

- model_trained
+ result
+    .model
     .save_file(format!("{artifact_dir}/model"), &CompactRecorder::new())
     .expect("Trained model should be saved successfully");

This enables the renderer to be reused by the new evaluator so that training and evaluation metrics appear together in the TUI dashboard:

let mut renderer = result.renderer;
let evaluator = EvaluatorBuilder::new(artifact_dir)
    .renderer(renderer)
    .metrics((AccuracyMetric::new(), LossMetric::new()))
    .build(result.model.clone());

evaluator.eval(name, dataloader_test);

Interface Changes

Config

The Config trait now requires Debug:

- #[derive(Config)]
+ #[derive(Config, Debug)]
  pub struct TrainingConfig {
      // ...
  }

BatchNorm

BatchNorm no longer requires the spatial dimension generic:

  #[derive(Module, Debug)]
  pub struct ConvBlock<B: Backend> {
      conv: nn::conv::Conv2d<B>,
-     norm: BatchNorm<B, 2>,
+     norm: BatchNorm<B>,
      pool: Option<MaxPool2d>,
      activation: nn::Relu,
  }

Backend::seed

Seeding is now device-specific:

- B::seed(seed);
+ B::seed(&device, seed);

Tensor

For consistency with other method pairs like unsqueeze() / unsqueeze_dim(dim), squeeze(dim) was renamed to squeeze_dim(dim):

- tensor.squeeze(dim)
+ tensor.squeeze_dim(dim)

We've also added a tensor.squeeze() method which squeezes all singleton dimensions.
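
The difference between the two methods is easiest to see on shapes alone. The sketch below models only the shape arithmetic in plain Rust; both function names are hypothetical, and Burn's own rank checks and edge-case behavior (for example when every dimension is 1) are not modeled here.

```rust
/// Models `squeeze_dim(dim)`: removes one dimension, which must have size 1.
/// Hypothetical shape-only model; returns `None` where a real call would be invalid.
fn squeeze_dim_shape(shape: &[usize], dim: usize) -> Option<Vec<usize>> {
    // Only a singleton dimension that actually exists can be removed.
    if shape.get(dim) != Some(&1) {
        return None;
    }
    let mut out = shape.to_vec();
    out.remove(dim);
    Some(out)
}

/// Models the new `squeeze()`: removes every dimension of size 1.
fn squeeze_shape(shape: &[usize]) -> Vec<usize> {
    shape.iter().copied().filter(|&d| d != 1).collect()
}
```

Under this model a `[1, 2, 1, 4]` shape becomes `[2, 4]` with the all-dimensions variant, while the single-dimension variant applied to dimension 0 yields `[2, 1, 4]`.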

Finally, we removed the clunky tensor ^ T transpose syntax in favor of tensor.t():

- use burn::tensor::T;
- tensor ^ T
+ tensor.t()

tensor.t() is a simple alias for tensor.transpose().

Module & Tensor

  • Fix unsqueeze rank check (#3429) @laggui
  • Feat/quant block (#3442) @laggui
  • Kill tensor^T magic transpose marker in favor of tensor.t(). (#3452) @crutcher
  • ADD GLU activation function (#3444) @bn-c
  • Add quantization params precision (#3453) @laggui
  • Improve select_assign check (#3483) @laggui
  • Add grid_sample function (#3495 #3523 #3522) @Cielbird
  • save_tensor_as_image utility (#3520) @Cielbird
  • Add affine_grid_2d (#3526) @Cielbird
  • ADD missing Debug derive for embedding (#3547) @bn-c
  • Dot Product Op (#3537) @kikefdezl
  • Lift .full()/.full_like() into base Tensor - support Tensor<B, D, Bool>::full()/full_like(). (#3562) @crutcher
  • Make Distribution::Default the Default::default(). (#3582) @crutcher
  • Implement int matmul (#3575) @wingertge
  • Feat/quant formats (#3613) @laggui
  • Switch Tensor::swap_dims/permute to AsIndex dim support. (#3619) @crutcher
  • Tensor::flatten() => AsIndex dims support. (#3620) @crutcher
  • Remove D param from BatchNorm<B, D>. (#3625) @crutcher
  • nn.activation; Activation (#3603 #3693) @crutcher
  • Add q4 q2 quantization (#3617) @laggui
  • Introduce NormLayer abstraction for unified normalization layers. (#3630) @crutcher
  • Add dtype to trait creation ops (#3670) @laggui
  • Make Config require Debug (#3689) @crutcher
  • Add NormalizationConfig::with_num_features() and related (#3688) @crutcher
  • Module quantization w/ tests (#3637) @nathanielsimard
  • Add NumPy-like take operation with multi-dimensional index support (#3681) @antimora
  • Added trace and diag with batch support for linalg crate (#3703) @niklund
  • Add step support to tensor slice operations (#3748) @antimora
  • Tensor::unfold(dim, size, step) (#3751 #3782 #3783) @crutcher
  • Slice assign with steps (#3776) @antimora
  • Add bool_xor operation for boolean tensors (#3785) @crutcher
  • [Breaking] Make squeeze/squeeze_dim consistent with other APIs (#3790) @laggui
  • Add cross product (#3743) @SinanGncgl
  • Enable stepped slicing for slice_fill and complete slice API cleanup (#3784) @antimora
  • Tensor::rank() (#3797) @crutcher
  • AsIndex dim handling for Numeric ops (#3795) @crutcher
  • Add outer and outer_batch ops in linalg (#3786) @huy209vn
  • Tensor::_dims() (#3811) @crutcher
  • Add tensor.cumsum(dim) first implementation (#3806) @antimora
  • slice_fill() should pick a compatible dtype (#3826) @crutcher
  • Default LU decomposition implementation (#3816) @DimitriTimoz
  • Add tensor.square and fast-path int-power exponents. (#3847) @crutcher
  • Add cumulative operations: cumprod, cummin, and cummax (#3819) @antimora
  • Add Tensor::sum_dims_squeeze(dims) (#3817) @crutcher
  • Allow linear to use quantized matmul (#3913) @wingertge
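
The cumulative operations listed above (cumsum, cumprod, cummin, cummax) all follow the same running-accumulator pattern. As a rough illustration, here is that pattern on a one-dimensional slice in plain Rust; `cumulative` is a hypothetical helper, whereas the tensor methods operate along a chosen dimension of a multi-dimensional tensor.

```rust
/// Running accumulation over a slice: element `k` of the output combines
/// elements `0..=k` of the input with `op`.
/// Hypothetical 1-D illustration; not Burn's implementation.
fn cumulative<T: Copy>(xs: &[T], op: impl Fn(T, T) -> T) -> Vec<T> {
    let mut out = Vec::with_capacity(xs.len());
    let mut acc: Option<T> = None;
    for &x in xs {
        // The first element starts the accumulator; later ones are combined into it.
        let next = match acc {
            Some(a) => op(a, x),
            None => x,
        };
        out.push(next);
        acc = Some(next);
    }
    out
}
```

Passing addition gives a cumulative sum, multiplication a cumulative product, and `i32::min` or `i32::max` the running minimum or maximum.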

Datasets & Training

  • Pre-Shuffle Multithread DataLoaders on Shuffle (#3390) @crutcher
  • PixelDepth + Copy (#3419) @crutcher
  • Add Dice-Sorenson Coefficient Metric (#3407) @MathijsdeBoer
  • Add SelectionDataset, refactor ShuffledDataset, and add transform tests. (#3406) @crutcher
  • Evenly distribute complete chunks/batches across partial dataset splits (#3476) @laggui
  • Distributed Data Parallel (#3456) @Cielbird
  • Use tensor ops for clip_by_norm (#3485) @laggui
  • SamplerDataset distribution fix; constructors and builder. (#3490) @crutcher
  • Unify transform usage of RngOptions. (#3577) @crutcher
  • Fix bugs with ddp learning (#3581) @Cielbird
  • Add support for CIFAR-10 and CIFAR-100 datasets (#3579) @buttfa
  • Add with_interrupter for LearnerBuilder (#3611) @amfaber
  • Improved Burn Train (#3614 #3935) @nathanielsimard @laggui
  • Add 'TextFolderDataset' struct and AgNewsDataset (#3698) @buttfa
  • Add PerplexityMetric for language model evaluation (#3707) @TheDarkchip
  • Adding CER/WER metrics (#3418) @yazanmashal03
  • Fix/autodiff/multi threads (#3793) @nathanielsimard
  • Add cautious_weight_decay to AdamW optimizer. (#3869) @crutcher
  • Fix evaluator dataloader device (#3893) @laggui
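
For context on the `clip_by_norm` entry (#3485), gradient clipping by norm is commonly defined as rescaling a gradient whose L2 norm exceeds a threshold so that its norm equals the threshold. The sketch below shows that common definition on a plain slice; it is an assumption about the general technique, not a transcription of Burn's tensor-op implementation, which may differ in details such as epsilon handling.

```rust
/// Rescales `grad` so its L2 norm does not exceed `threshold`.
/// Illustrative sketch of the common definition; not Burn's implementation.
fn clip_by_norm(grad: &[f32], threshold: f32) -> Vec<f32> {
    // L2 norm of the gradient vector.
    let norm = grad.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > threshold {
        // Shrink every component by the same factor, preserving direction.
        let scale = threshold / norm;
        grad.iter().map(|g| g * scale).collect()
    } else {
        // Gradients already within the threshold are left untouched.
        grad.to_vec()
    }
}
```

A gradient of `[3.0, 4.0]` has norm 5, so a threshold of 1 scales it to roughly `[0.6, 0.8]`, while a threshold of 10 leaves it unchanged.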

Backends

  • Migrate to new cubecl multi tensor handle changes (#3136) @wingertge
  • More memory control with scoped static memory management (#3410) @nathanielsimard
  • Feat/fusion quant (#3454) @nathanielsimard
  • Expose client utilities (#3559) @allenqm
  • New CPU backend based on MLIR (#3411) @marcantoinem
  • feat: ndarray dynamic tensor types and int tensor cast (#3647) @wingertge
  • Implement optimized bool_select for primary backends (#3710) @TheDarkchip
  • Add backend level is_nan / is_inf implementations (#3809) @laggui
  • Feat/persistent memory (#3842) @nathanielsimard
  • feat: add backend implementations for Trunc op (#3860) @mooori

Bug Fixes

  • Fix ndarray interpolate coord precision at boundaries (#3481) @laggui
  • Fix ndarray conv2d groups channels (#3415) @laggui
  • Fix candle mask broadcasting (#3489) @laggui
  • Update cubecl: fix wgpu vec to scalar cast (#3496) @Cielbird
  • Fix/conv2d groups backward (#3521) @laggui
  • Fix/conv3d backward groups (#3533) @laggui
  • [Fix] Add some missing handling for flex32 (#3551) @wingertge
  • Fix backward scatter dim (#3555) @laggui
  • fix: Use correct datatype when filling boolean tensors (#3593) @wingertge
  • fix: Ensure output layout is the same for non-inplace SIMD ops in ndarray (#3604) @wingertge
  • Fix scalar binop not contiguous (#3636) @laggui
  • Fix dtype dispatch in cubecl module ops (#3658) @laggui
  • Fix wgpu bool and/or (#3664) @laggui
  • Fix tch bool ones and rand int (#3684) @laggui
  • fix: Select assign + bool cast (#3730) @wingertge
  • Fix register_float_tensor to use the correct dtype (#3774) @A2va
  • Fix: autotune errors with fusion (added fallback) (#3778) @nathanielsimard
  • Fix mask_where broadcasted line size (#3823) @laggui
  • Fix adaptive avg pool2d backward line size (#3840) @laggui
  • Fix line size regression bug (#3850) @nathanielsimard
  • Correctly set cubecl::random::seed(seed) (#3878) @laggui
  • Fix indexing for permuted tensors with cumulative ops (#3891) @wingertge
  • Fix quantized reshape and into_contiguous (#3903) @wingertge
  • Fix fusion matmul inputs (#3905) @laggui
  • Fix powf vectorization on WGPU (#3916) @nathanielsimard

Documentation & Examples

  • [Docs] Add python prerequisite disclaimer for HuggingfaceDatasetLoader (#3484) @laggui
  • Mnist example augmented data (#3534) @Cielbird
  • Improve DataLoaderBuilder docs. (#3482) @crutcher
  • Readme + Burn Book performance section (#3686) @nathanielsimard
  • Update README for improved ONNX import documentation (#3738) @antimora
  • Some updates to the book (#3906) @louisfd

Fixes

  • fix: link in examples (#3475) @domenicocinque
  • Fix webassembly description + fusion usage + missing device (#3474) @laggui
  • Fix dataset split docs (#3508) @laggui
  • docs: fix example (#3498) @domenicocinque
  • Fix tensor docs examples (#3525) @laggui
  • Fix MNIST example model (#3549) @Cielbird
  • Fix/conv2d docs display (#3586) @huy209vn
  • Fix KaTeX docs (#3787) @laggui
  • Fix typo in getting-started (#3868) @Charles23R

ONNX Support

  • Add ONNX IsNaN and IsInf ops (#3393) @Friedrich-S
  • Add support onnx bernoulli (#3394) @tye-singwa
  • fix onnx reshape op elem_type inference (#3395) @tye-singwa
  • Adding bitwise ONNX ops (#3120) @AshAnand34
  • Add ONNX Attention op (#3423) @Friedrich-S
  • Add support and tests for ONNX Abs operator (#3536) @antimora
  • Infer conv spatial dims from weight rank (#3538) @laggui
  • Debug log new name during ONNX renames (#3539) @torsteingrindvik
  • Proto conversion: Allow f16 tensors by casting via bytemuck from raw data (#3541) @torsteingrindvik
  • Fix onnx auto_pad and ceil_mode attrs handling (#3542) @laggui
  • Support int min/max types in clip_config (#3544) @antimora
  • Make onnx-ir parse error more informative. Handle more data type variants in TryFrom -> Argument (#3545) @torsteingrindvik
  • Add Identity node support and fix initializer handling (#3543) @antimora
  • Use try_cast_vec with fallback in proto conversion (#3546) @laggui
  • onnx-ir: Infer conv2d kernel shape from weight tensor (#3554) @torsteingrindvik
  • Add comprehensive Shape type support for ONNX operations (#3381) @antimora
  • Extend onnx reduce op support (#3497) @tye-singwa
  • Enhance ConstantOfShape to support static shape input (#3550) @torsteingrindvik
  • Don't panic on allowzero since reshape supports it (#3573) @torsteingrindvik
  • ONNX enhancements to support CLIP ViT-B-32 (#3560) @antimora
  • Use prettyplease to format burn-import output rust files (#3578) @n1ght-hunter
  • Fix ONNX import rank inference for nodes downstream of Shape-type constant conversion (#3564) @antimora
  • Support dynamic shape and tensor sizes in ONNX resize (#3563) @antimora
  • Refactor backend selection for onnx-tests (#3584) @antimora
  • Add broadcasting support for add, sub, mul, and div ops (#3589) @antimora
  • Fix ONNX Slice operation axes parameter handling (#3594) @antimora
  • ONNX model checking: Yolo11x (#3599) @antimora
  • CLIP ViT-B/32 text model ONNX verification & backend fixes (#3623) @antimora
  • clip-vit-b-32-vision model verifications and fixes (#3673) @antimora
  • Implemented MatMulInteger ONNX in burn-import and Uint8/int8 element types (#3672) @huy209vn
  • Fix ONNX import: Integer constants serialization and MatMulInteger broadcasting (#3696) @antimora
  • Add EyeLike ONNX operation support (#3731) @TheDarkchip
  • Support ONNX Squeeze with axes input and no axes (#3736) @antimora
  • Enhance ONNX PRelu config initialization with alpha and num_parameters (#3746) @antimora
  • Add support for negative indices in Gather shape ops (#3749) @antimora
  • Update ONNX dependency to stable version (#3772) @antimora
  • Add NonZero ONNX operation support (#3745) @TheDarkchip
  • Add static shape propagation and broadcasting support for ONNX IR operations (#3763) @antimora
  • trunc, fmod and Mod ONNX ops (#3767) @antimora
  • Add uint16 to onnx-ir (#3791) @TheGhostHuCodes
  • Add YOLO model family check with ONNX import and test (#3750) @antimora
  • ONNX albert model check and bug fix (#3810) @antimora
  • Add ModernBERT-base model check (#3814) @antimora
  • Add all-MiniLM-L6-v2 ONNX model check (#3813) @antimora
  • ONNX: support broadcasting for bool_and (#3829) @mooori
  • Lift constants for ReduceMax and ReduceMean nodes (#3827) @TheGhostHuCodes
  • Burn import refactor to node-based registry architecture (#3825) @antimora
  • ONNX: support broadcasting for bool_or, bool_xor (#3839) @mooori
  • Update ONNX model support version to Opset 16+ (#3870) @jc-cr
  • Handle empty tensor constants in ONNX import (#3904) @antimora
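
Several entries above concern negative indices in Gather. In ONNX, a negative index counts back from the end of the gathered axis, so valid values lie in `[-size, size - 1]`. A minimal sketch of that normalization in plain Rust follows; `normalize_index` is a hypothetical name used for illustration, not a function from burn-import or onnx-ir.

```rust
/// Maps an ONNX-style index (possibly negative) to a non-negative offset
/// along an axis of length `dim_size`. Returns `None` when out of range.
/// Hypothetical helper for illustration only.
fn normalize_index(index: i64, dim_size: usize) -> Option<usize> {
    let size = dim_size as i64;
    // Negative indices count back from the end of the axis.
    let shifted = if index < 0 { index + size } else { index };
    if (0..size).contains(&shifted) {
        Some(shifted as usize)
    } else {
        None
    }
}
```

On an axis of length 5, index `-1` resolves to 4 and `-5` to 0, while `-6` and `5` are rejected.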

Enhancements

  • Add more operations support in fusion (#3552) @nathanielsimard
  • Perf/linear layout (#3587) @nathanielsimard
  • Perf/data transfer (#3695) @nathanielsimard
  • Perf: GPU to CPU Copy (#3708) @nathanielsimard
  • feat: Matmul quant (#3874 #3910) @wingertge
  • Fix/matmul/fusion (#3899) @nathanielsimard

Refactoring

  • Refactor burn-train (#3451) @Cielbird
  • [chore] Migrate to memory management API refactor (#3477) @wingertge
  • Update cubecl: matmul refactor (#3493) @louisfd
  • Refactor/quant (#3500) @nathanielsimard
  • chore: Update cubecl with new changes to Item and layouts (#3626) @wingertge
  • Refactor/seed (#3641) @nathanielsimard
  • Reorganize activation layer sources into nn.activation module (#3627) @crutcher
  • Remove backend QuantizedEncoding type and unused candle/tch impl (#3645) @laggui
  • chore: Update cubecl with stacked view changes (#3687) @wingertge
  • chore: Update cubecl for split traits (#3700) @wingertge
  • Use bytes from cubecl (#3701) @nathanielsimard
  • Update cubecl runtime features (#3711) @wingertge
  • Use ScalarIr to represent scalars generically (#3706) @laggui
  • chore: Update cubecl to tile refactor PR (#3728) @wingertge
  • Refactor/broadcast layout (#3733) @wingertge
  • Add cubecl re-export, root Tensor, doc updates and Noam scheduler fix (#3742) @laggui
  • Move nn components to burn-nn (#3740) @laggui
  • Update cubecl (#3752) @wingertge
  • Chore update cubecl (#3764) @nathanielsimard
  • Chore: update cubecl + fix no-std (#3771) @laggui
  • Move optimizer components to burn-optim (#3773) @laggui
  • Feat/multi streams (#3775) @nathanielsimard
  • chore: Update cubecl for quant refactor and other changes (#3828) @wingertge
  • chore: Update for launch refactor (#3841) @wingertge
  • Refactor Shape manipulations (#3845) @laggui
  • refactor: Refactor matmul to use views for its inputs (#3846) @wingertge
  • Refactor/cubecl client (#3873) @nathanielsimard

Miscellaneous

  • chore: update dependencies (#3389) @reneleonhardt
  • Use member name as filter for wgpu tests (#3405) @laggui
  • Fix fusion no default feat (#3408) @laggui
  • Bump MSRV from 1.85 to latest stable 1.87 (#3424) @Friedrich-S
  • Add benchmarks.toml (#3430 #3457) @syl20bnr
  • Test benchmark execution on an Nvidia A100 (#3435 #3446) @syl20bnr
  • Burn-collective base (#3288) @Cielbird
  • ci: split tests on GitHub runners and on GPU runners (#3382) @syl20bnr
  • ci: bench on multiple machines (#3455) @syl20bnr
  • ci: fix wgpu-info (#3466) @syl20bnr
  • HuggingfaceDatasetLoader automatically check for pip (#3479) @Puranjay-del-Mishra
  • Refactor/collective (#3450) @nathanielsimard
  • cfg-mask ddp constructor (#3488) @crutcher
  • Update MSRV to 1.88 (#3492) @laggui
  • Fix various warnings reported by run-checks (#3512) @crutcher
  • Burn-vision transforms (#3527) @Cielbird
  • Add feature flag to bytemuck due to usage of API extern_crate_alloc (#3556) @torsteingrindvik
  • Fix shape type annotation in test (#3576) @laggui
  • Refactor burn-collective (#3572) @Cielbird
  • fix: Fix bug with scalar tail in morphology op (#3588) @wingertge
  • apply clippy fixes to burn-ndarray (#3618) @torsteingrindvik
  • Derive clone for Record Items (#3601) @amfaber
  • Add From implementations for ActivationConfig and cleanup tests (#3631) @crutcher
  • Fix new stable clippy lints (#3643) @janhohenheim
  • Fix stable clippy lints (#3644) @janhohenheim
  • Fix obvious problems (#3646) @nathanielsimard
  • Limit cubecl cpu target (#3656) @laggui
  • Bump cubecl to use wgpu 26 (#3657) @janhohenheim
  • add some missing default-features = false (#3675) @dcrewi
  • Fix no-std support for burn-no-std-tests and warning clean up (#3671) @antimora
  • Strengthen Doc Lints (#3691) @crutcher
  • From impls for Activation (#3692) @crutcher
  • Remove DimSwappedActivation (#3693) @crutcher
  • Shape: into_iter(), into_ranges(), to_vec(), slice() (#3694) @crutcher
  • Add burn-store crate for model storage with safetensors support (#3666) @antimora
  • Add #[allow(clippy::too_many_arguments)] to config constructor (#3737) @crutcher
  • Remove empty indices tests (#3747) @laggui
  • Fix various clippy lints (#3766) @wingertge
  • chore: remove redundant words (#3770) @juejinyuxitu
  • Fix segfaults from fusion panics with simple workaround (#3777) @wingertge
  • Remove vulkan/mesa no-std CI setup (#3781) @laggui
  • ci: add dispatch trigger publish workflow and bump xtask to 2.1.10 (#3788) @syl20bnr
  • Slice: Copy, full(), default() (#3796) @crutcher
  • Fix tests with hardcoded types (#3805) @wingertge
  • Add PytorchStore for optimized model loading and in-house pickle reader (#3741) @antimora
  • Fix ndarray compilation when cubecl-common enables rayon but ndarray doesn't (#3848) @wingertge
  • PyTorch reader: Add F16, BF16, and unsigned integer support (#3849) @antimora
  • Fix minor typo in POEM.md (#3851) @jc-cr
  • BurnpackStore (#3792) @antimora
  • Expose de/serialize numericentry (#3890) @Charles23R
  • Bump tch to 0.22.0 (#3892) @laggui
  • update cubecl (#3896) @louisfd
  • Disable no-std safetensorsstore (#3902) @antimora

About

Stars
15,299
Forks
925
Languages
Rust Python C++
