microsoft/BitNet: Microsoft’s open-source 1-bit large model inference framework

BitNet is Microsoft’s open-source, ultra-low-bit large model inference framework, optimized for CPU-based local inference and extreme compression (1-bit/1.58-bit quantization). It delivers efficient, low-power execution for models such as BitNet, Llama3-8B-1.58, and Falcon3 without requiring a GPU. Released under the MIT license, it offers both C++ and Python interfaces and is backed by an active community with ongoing updates, making it well suited to embedded, mobile, and edge AI deployments.


Official website: https://bitnet-demo.azurewebsites.net/
Source code: https://github.com/microsoft/BitNet?tab=readme-ov-file

Model Details

  • Architecture: Transformer-based, modified with BitLinear layers (BitNet framework).
    • Uses Rotary Position Embeddings (RoPE).
    • Uses squared ReLU (ReLU²) activation in FFN layers.
    • Employs subln normalization.
    • No bias terms in linear or normalization layers.
  • Quantization: Native 1.58-bit weights and 8-bit activations (W1.58A8).
    • Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass.
    • Activations are quantized to 8-bit integers using absmax quantization (per-token).
    • Crucially, the model was trained from scratch with this quantization scheme, not post-training quantized.
  • Parameters: ~2 Billion
  • Training Tokens: 4 Trillion
  • Context Length: Maximum sequence length of 4096 tokens.
    • Recommendation: for tasks requiring contexts beyond the pre-training length, or for specialized long-reasoning tasks, perform intermediate long-sequence adaptation/training before the final fine-tuning stage.
  • Training Stages:
    1. Pre-training: Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule.
    2. Supervised Fine-tuning (SFT): Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning.
    3. Direct Preference Optimization (DPO): Aligned with human preferences using preference pairs.
  • Tokenizer: LLaMA 3 Tokenizer (vocab size: 128,256).
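The W1.58A8 scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration of absmean weight quantization and per-token absmax activation quantization; the function names and the epsilon guard are mine, and the real bitnet.cpp kernels operate on packed integer formats rather than floats.

```python
import numpy as np

def absmean_quantize_weights(W):
    # Absmean quantization: scale by the mean absolute weight, then
    # round-and-clip to the ternary set {-1, 0, +1} (BitNet b1.58 style).
    gamma = np.mean(np.abs(W)) + 1e-8            # absmean scale (eps avoids /0)
    Wq = np.clip(np.round(W / gamma), -1, 1)
    return Wq, gamma                             # gamma is kept to rescale outputs

def absmax_quantize_activations(x):
    # Per-token absmax quantization: each row (token) is scaled so its
    # largest-magnitude entry maps to 127, then rounded to 8-bit integers.
    scale = 127.0 / (np.max(np.abs(x), axis=-1, keepdims=True) + 1e-8)
    xq = np.clip(np.round(x * scale), -128, 127)
    return xq, scale

W = np.array([[0.5, -1.2, 0.05],
              [2.0, -0.3, 0.00]])
Wq, gamma = absmean_quantize_weights(W)         # ternary weights
xq, scale = absmax_quantize_activations(np.array([[0.1, -2.0, 0.5]]))
```

Because the model is trained from scratch with these quantizers in the forward pass (quantization-aware training, typically with a straight-through estimator for gradients), the ternary weights are not an after-the-fact approximation of full-precision weights.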
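The FFN design noted in the architecture bullets (squared-ReLU activation, no bias terms) is easy to show in isolation. The sketch below, with hypothetical function names, keeps the matrix multiplies in full precision for clarity; in the actual model both projections would go through quantized BitLinear layers.

```python
import numpy as np

def relu_squared(x):
    # Squared ReLU: ReLU(x)^2, the FFN activation used in BitNet.
    return np.square(np.maximum(x, 0.0))

def ffn(x, W_up, W_down):
    # Bias-free feed-forward block: up-project, apply ReLU^2, down-project.
    return relu_squared(x @ W_up) @ W_down

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))                  # one token, model dim 8
W_up = rng.standard_normal((8, 32))              # hidden dim 32 (illustrative)
W_down = rng.standard_normal((32, 8))
y = ffn(x, W_up, W_down)                         # shape (1, 8)
```

Squaring the ReLU output keeps the activation non-negative and exactly zero for negative pre-activations, which tends to produce sparser intermediate activations than GELU-style alternatives.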

Libre Depot original article. Publisher: Libre Depot. Please indicate the source when reprinting: https://www.libredepot.top/5525.html
