  1. On the small model, the actual GPU memory usage of Mamba2 is ... - GitHub

    Jul 2, 2024 · The parameters of the Mamba2 model are d_state=32, d_conv=4, expand=2, and head_dim=32 (using "nn.Conv1d" with padding method, without the constraint of …
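
    A minimal sketch of reproducing such a measurement, assuming the mamba_ssm package's Mamba2 module; the keyword names (notably headdim for the issue's "head_dim") and the d_model and sequence length used here are assumptions, not taken from the issue:

        import torch
        from mamba_ssm import Mamba2  # assumes mamba-ssm is installed

        device = "cuda"
        d_model = 256  # illustrative width, not stated in the issue

        block = Mamba2(
            d_model=d_model,
            d_state=32,
            d_conv=4,
            expand=2,
            headdim=32,
        ).to(device)

        x = torch.randn(1, 1024, d_model, device=device)  # (batch, seqlen, d_model)
        torch.cuda.reset_peak_memory_stats(device)
        y = block(x)
        print(f"peak GPU memory: {torch.cuda.max_memory_allocated(device) / 2**20:.1f} MiB")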

  2. Memory usage for 7M model with 2^20 context too high

    Jan 17, 2024 · File "/mambaforge/envs/mamba-dna2/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance result …

  3. Strange Loss Curve · Issue #199 · state-spaces/mamba - GitHub

    Feb 27, 2024 · However, the loss curve is strange: I wonder, are the implementations of the optimizer and lr-scheduler correct? And can you give me any suggestions for fine-tuning Mamba-1.4B?
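
    For reference, a hedged sketch of the AdamW plus linear-warmup/cosine-decay recipe commonly used to train models of this family; the hyperparameter values below are placeholders, not the ones discussed in the issue:

        import math
        import torch

        def build_optimizer_and_scheduler(model, lr=2e-4, weight_decay=0.1,
                                          warmup_steps=1000, total_steps=100_000):
            optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                          betas=(0.9, 0.95), weight_decay=weight_decay)

            def lr_lambda(step):
                if step < warmup_steps:
                    return step / max(1, warmup_steps)  # linear warmup
                progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
                return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

            scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
            return optimizer, scheduler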

  4. Small datasets · Issue #454 · state-spaces/mamba - GitHub

    Jul 8, 2024 · You can try adding gradient clipping ("gradient truncation") or setting weight decay on the optimizer, which worked for me. In addition, the size of the Mamba model needs to be carefully adjusted …
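
    A minimal sketch of both suggestions in plain PyTorch; the model, loss function, and values are illustrative, not from the thread:

        import torch

        def train_step(model, optimizer, loss_fn, inputs, targets, max_norm=1.0):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            # Gradient clipping (the "gradient truncation" suggested above).
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
            optimizer.step()
            return loss.item()

        # Weight decay is set on the optimizer itself, e.g.:
        # optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)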

  5. GitHub - state-spaces/mamba: Mamba SSM architecture

    On the other hand, other frameworks like DeepSpeed store parameters in float16 and upcast when necessary (e.g. for optimizer accumulation). We've observed that higher precision for …
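
    A hedged sketch of the AMP-style alternative the README contrasts this with: main parameters kept in float32, forward/backward computed in bfloat16 under autocast, so gradients and optimizer accumulation stay in higher precision. Illustrative only, not DeepSpeed's internals:

        import torch

        def amp_step(model, optimizer, loss_fn, inputs, targets):
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = loss_fn(model(inputs), targets)  # compute in bf16
            loss.backward()                             # grads accumulate on fp32 params
            optimizer.step()                            # optimizer math in fp32
            return loss.item()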

  6. bfloat16 overflow during training session #6 - GitHub

    Dec 5, 2023 · For models around 1B-3B on 8xA100s, sharding the optimizer states (e.g. with the PyTorch distributed optimizer, equivalent to ZeRO stage 1) will help reduce the amount of …
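
    A sketch of that suggestion with PyTorch's built-in ZeroRedundancyOptimizer (roughly ZeRO stage 1); it assumes torch.distributed is already initialized and the model is wrapped in DistributedDataParallel:

        import torch
        from torch.distributed.optim import ZeroRedundancyOptimizer

        def build_sharded_optimizer(ddp_model, lr=1e-4, weight_decay=0.1):
            # Each rank keeps only its shard of the AdamW state (exp_avg, exp_avg_sq),
            # cutting per-GPU optimizer memory roughly by the world size.
            return ZeroRedundancyOptimizer(
                ddp_model.parameters(),
                optimizer_class=torch.optim.AdamW,
                lr=lr,
                weight_decay=weight_decay,
            )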

  7. Gradient explosion in Mamba2 training, norm and loss divergence

    Aug 16, 2024 · tq = tqdm(range(0, len(tokens), block)) tloss = 0 model.train() optimizer.zero_grad() for p, i in enumerate(tq): src = torch.tensor([tokens[i:i+block]], dtype=torch. …
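
    A hedged sketch (names are illustrative) of watching the global gradient norm each step so divergence shows up before the loss explodes; clip_grad_norm_ returns the pre-clipping norm, so it doubles as the probe:

        import torch

        def backward_with_norm_check(model, optimizer, loss, max_norm=1.0):
            optimizer.zero_grad()
            loss.backward()
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            if not torch.isfinite(grad_norm):
                raise RuntimeError(f"non-finite gradient norm: {grad_norm}")
            optimizer.step()
            return grad_norm.item()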

  8. Implementation sensitive to token numbering #116 - GitHub

    Jan 19, 2024 · Like when I try to predict the next token as: 12179 4675 11374 1807, ..., after one step of the optimizer the network starts to generate NaNs. When I changed the token …
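
    Two sanity checks that often catch this class of failure (purely illustrative; the issue does not state the actual cause):

        import torch

        def check_batch_and_loss(token_ids: torch.Tensor, vocab_size: int, loss: torch.Tensor):
            # Ids outside the embedding table trigger indexing errors or device-side asserts.
            lo, hi = int(token_ids.min()), int(token_ids.max())
            assert 0 <= lo and hi < vocab_size, f"token id range [{lo}, {hi}] vs vocab {vocab_size}"
            # Stop before the optimizer step propagates NaNs into the weights.
            if not torch.isfinite(loss):
                raise RuntimeError(f"non-finite loss: {loss.item()}")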

  9. Question about Multi-Head Ablations · Issue #619 - GitHub

    Nov 11, 2024 · Any details on learning rate, batch size, training steps, weight decay, gradient clipping, lr schedule, and optimizer values would be awesome! Also what expansion factor is …

  10. Is Context Length dependent on training data's context? #140

    Jan 29, 2024 · I'm currently using the L-BFGS optimizer with multiple iterations per data point and resetting the Hessian on each data point, so overfitting is a concern. Ronan, scaling down the …
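
    A sketch of that setup: torch.optim.LBFGS re-created per data point so its curvature (Hessian) history starts fresh; names and values are illustrative:

        import torch

        def fit_one_point(model, loss_fn, x, y, max_iter=20):
            # A new optimizer per data point means a fresh L-BFGS history (Hessian reset).
            optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=max_iter,
                                          history_size=20, line_search_fn="strong_wolfe")

            def closure():
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                return loss

            return optimizer.step(closure)  # runs up to max_iter L-BFGS iterations internally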