microsoft/BitNet: Microsoft’s open-source 1-bit large model inference framework

BitNet is Microsoft’s open-source, ultra-low-bit large model inference framework, optimized for CPU-based local inference and extreme compression (1-bit/1.58-bit quantization). It delivers efficient, low-power execution for models such as BitNet, Llama3-8B-1.58, and Falcon3 without requiring a GPU. Released under the MIT license, it offers both C++ and Python interfaces and is backed by an active community with ongoing updates, making it well suited to embedded, mobile, and edge AI deployments.


Official website: https://bitnet-demo.azurewebsites.net/
Source code: https://github.com/microsoft/BitNet?tab=readme-ov-file

Model Details

  • Architecture: Transformer-based, modified with BitLinear layers (BitNet framework).
    • Uses Rotary Position Embeddings (RoPE).
    • Uses squared ReLU (ReLU²) activation in FFN layers.
    • Employs subln normalization.
    • No bias terms in linear or normalization layers.
  • Quantization: Native 1.58-bit weights and 8-bit activations (W1.58A8).
    • Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
    • Activations are quantized to 8-bit integers using absmax quantization (per-token).
    • Crucially, the model was trained from scratch with this quantization scheme, not post-training quantized.
  • Parameters: ~2 Billion
  • Training Tokens: 4 Trillion
  • Context Length: Maximum sequence length of 4096 tokens.
    • Recommendation: for tasks requiring contexts beyond the pre-training length, or for specialized long-reasoning tasks, perform intermediate long-sequence adaptation/training before the final fine-tuning stage.
  • Training Stages:
    1. Pre-training: Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
    2. Supervised Fine-tuning (SFT): Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
    3. Direct Preference Optimization (DPO): Aligned with human preferences using preference pairs.
  • Tokenizer: LLaMA 3 Tokenizer (vocab size: 128,256).
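The W1.58A8 scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration of absmean weight quantization and per-token absmax activation quantization; the function names and the epsilon guard are mine, and the real bitnet.cpp kernels operate on packed integer formats rather than floats.

```python
import numpy as np

def absmean_quantize_weights(W):
    # Absmean quantization: scale by the mean absolute weight, then
    # round-and-clip to the ternary set {-1, 0, +1} (BitNet b1.58 style).
    gamma = np.mean(np.abs(W)) + 1e-8            # absmean scale (eps avoids /0)
    Wq = np.clip(np.round(W / gamma), -1, 1)
    return Wq, gamma                             # gamma is kept to rescale outputs

def absmax_quantize_activations(x):
    # Per-token absmax quantization: each row (token) is scaled so its
    # largest-magnitude entry maps to 127, then rounded to 8-bit integers.
    scale = 127.0 / (np.max(np.abs(x), axis=-1, keepdims=True) + 1e-8)
    xq = np.clip(np.round(x * scale), -128, 127)
    return xq, scale

W = np.array([[0.5, -1.2, 0.05],
              [2.0, -0.3, 0.00]])
Wq, gamma = absmean_quantize_weights(W)         # ternary weights
xq, scale = absmax_quantize_activations(np.array([[0.1, -2.0, 0.5]]))
```

Because the model is trained from scratch with these quantizers in the forward pass (quantization-aware training, typically with a straight-through estimator for gradients), the ternary weights are not an after-the-fact approximation of full-precision weights.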
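The FFN design noted in the architecture bullets (squared-ReLU activation, no bias terms) is easy to show in isolation. The sketch below, with hypothetical function names, keeps the matrix multiplies in full precision for clarity; in the actual model both projections would go through quantized BitLinear layers.

```python
import numpy as np

def relu_squared(x):
    # Squared ReLU: ReLU(x)^2, the FFN activation used in BitNet.
    return np.square(np.maximum(x, 0.0))

def ffn(x, W_up, W_down):
    # Bias-free feed-forward block: up-project, apply ReLU^2, down-project.
    return relu_squared(x @ W_up) @ W_down

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))                  # one token, model dim 8
W_up = rng.standard_normal((8, 32))              # hidden dim 32 (illustrative)
W_down = rng.standard_normal((32, 8))
y = ffn(x, W_up, W_down)                         # shape (1, 8)
```

Squaring the ReLU output keeps the activation non-negative and exactly zero for negative pre-activations, which tends to produce sparser intermediate activations than GELU-style alternatives.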

Libre Depot original article. Publisher: Libre Depot. Please indicate the source when reprinting: https://www.libredepot.top/5525.html
