  1. On the small model, the actual GPU memory usage of Mamba2 is ... - GitHub

    Jul 2, 2024 · The parameters of the Mamba2 model are d_state=32, d_conv=4, expand=2, and head_dim=32 (using "nn.Conv1d" with padding method, without the constraint of …
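
    A minimal sketch of reproducing such a measurement, assuming the mamba_ssm package's Mamba2 module; the keyword names (notably headdim for the issue's "head_dim") and the d_model and sequence length used here are assumptions, not taken from the issue:

        import torch
        from mamba_ssm import Mamba2  # assumes mamba-ssm is installed

        device = "cuda"
        d_model = 256  # illustrative width, not stated in the issue

        block = Mamba2(
            d_model=d_model,
            d_state=32,
            d_conv=4,
            expand=2,
            headdim=32,
        ).to(device)

        x = torch.randn(1, 1024, d_model, device=device)  # (batch, seqlen, d_model)
        torch.cuda.reset_peak_memory_stats(device)
        y = block(x)
        print(f"peak GPU memory: {torch.cuda.max_memory_allocated(device) / 2**20:.1f} MiB")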

  2. Memory usage for 7M model with 2^20 context too high

    Jan 17, 2024 · File "/mambaforge/envs/mamba-dna2/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance result …

  3. Strange Loss Curve · Issue #199 · state-spaces/mamba - GitHub

    Feb 27, 2024 · However, the loss curve is strange: I wonder, are the implementations of the optimizer and lr-scheduler correct? And can you give me any suggestions for fine-tuning Mamba-1.4B?
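
    For reference, a hedged sketch of the AdamW plus linear-warmup/cosine-decay recipe commonly used to train models of this family; the hyperparameter values below are placeholders, not the ones discussed in the issue:

        import math
        import torch

        def build_optimizer_and_scheduler(model, lr=2e-4, weight_decay=0.1,
                                          warmup_steps=1000, total_steps=100_000):
            optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                          betas=(0.9, 0.95), weight_decay=weight_decay)

            def lr_lambda(step):
                if step < warmup_steps:
                    return step / max(1, warmup_steps)  # linear warmup
                progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
                return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

            scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
            return optimizer, scheduler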

  4. Small datasets · Issue #454 · state-spaces/mamba - GitHub

    Jul 8, 2024 · You can try adding gradient clipping ("gradient truncation") or setting weight decay on the optimizer, which worked for me. In addition, the size of the Mamba model needs to be carefully adjusted …
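
    A minimal sketch of both suggestions in plain PyTorch; the model, loss function, and values are illustrative, not from the thread:

        import torch

        def train_step(model, optimizer, loss_fn, inputs, targets, max_norm=1.0):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            # Gradient clipping (the "gradient truncation" suggested above).
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
            optimizer.step()
            return loss.item()

        # Weight decay is set on the optimizer itself, e.g.:
        # optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)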

  5. GitHub - state-spaces/mamba: Mamba SSM architecture

    On the other hand, other frameworks like DeepSpeed store parameters in float16 and upcast when necessary (e.g. for optimizer accumulation). We've observed that higher precision for …
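
    A hedged sketch of the AMP-style alternative the README contrasts this with: main parameters kept in float32, forward/backward computed in bfloat16 under autocast, so gradients and optimizer accumulation stay in higher precision. Illustrative only, not DeepSpeed's internals:

        import torch

        def amp_step(model, optimizer, loss_fn, inputs, targets):
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = loss_fn(model(inputs), targets)  # compute in bf16
            loss.backward()                             # grads accumulate on fp32 params
            optimizer.step()                            # optimizer math in fp32
            return loss.item()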

  6. bfloat16 overflow during training session #6 - GitHub

    Dec 5, 2023 · For models around 1B-3B on 8xA100s, sharding the optimizer states (e.g. with the PyTorch distributed optimizer, equivalent to ZeRO stage 1) will help reduce the amount of …
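
    A sketch of that suggestion with PyTorch's built-in ZeroRedundancyOptimizer (roughly ZeRO stage 1); it assumes torch.distributed is already initialized and the model is wrapped in DistributedDataParallel:

        import torch
        from torch.distributed.optim import ZeroRedundancyOptimizer

        def build_sharded_optimizer(ddp_model, lr=1e-4, weight_decay=0.1):
            # Each rank keeps only its shard of the AdamW state (exp_avg, exp_avg_sq),
            # cutting per-GPU optimizer memory roughly by the world size.
            return ZeroRedundancyOptimizer(
                ddp_model.parameters(),
                optimizer_class=torch.optim.AdamW,
                lr=lr,
                weight_decay=weight_decay,
            )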

  7. Gradient explosion in Mamba2 training, norm and loss divergence

    Aug 16, 2024 · tq = tqdm(range(0, len(tokens), block)) tloss = 0 model.train() optimizer.zero_grad() for p, i in enumerate(tq): src = torch.tensor([tokens[i:i+block]], dtype=torch. …
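
    A hedged sketch (names are illustrative) of watching the global gradient norm each step so divergence shows up before the loss explodes; clip_grad_norm_ returns the pre-clipping norm, so it doubles as the probe:

        import torch

        def backward_with_norm_check(model, optimizer, loss, max_norm=1.0):
            optimizer.zero_grad()
            loss.backward()
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            if not torch.isfinite(grad_norm):
                raise RuntimeError(f"non-finite gradient norm: {grad_norm}")
            optimizer.step()
            return grad_norm.item()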

  8. Implementation sensitive to token numbering #116 - GitHub

    Jan 19, 2024 · Like when I try to predict the next token as: 12179 4675 11374 1807, ..., after one step of the optimizer the network starts to generate NaNs. When I changed the token …
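
    Two sanity checks that often catch this class of failure (purely illustrative; the issue does not state the actual cause):

        import torch

        def check_batch_and_loss(token_ids: torch.Tensor, vocab_size: int, loss: torch.Tensor):
            # Ids outside the embedding table trigger indexing errors or device-side asserts.
            lo, hi = int(token_ids.min()), int(token_ids.max())
            assert 0 <= lo and hi < vocab_size, f"token id range [{lo}, {hi}] vs vocab {vocab_size}"
            # Stop before the optimizer step propagates NaNs into the weights.
            if not torch.isfinite(loss):
                raise RuntimeError(f"non-finite loss: {loss.item()}")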

  9. Question about Multi-Head Ablations · Issue #619 - GitHub

    Nov 11, 2024 · Any details on learning rate, batch size, training steps, weight decay, gradient clipping, lr schedule, and optimizer values would be awesome! Also what expansion factor is …

  10. Is Context Length dependent on training data's context? #140

    Jan 29, 2024 · I'm currently using the L-BFGS optimizer with multiple iterations per data point and resetting the Hessian on each data point, so overfitting is a concern. Ronan, scaling down the …
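
    A sketch of that setup: torch.optim.LBFGS re-created per data point so its curvature (Hessian) history starts fresh; names and values are illustrative:

        import torch

        def fit_one_point(model, loss_fn, x, y, max_iter=20):
            # A new optimizer per data point means a fresh L-BFGS history (Hessian reset).
            optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=max_iter,
                                          history_size=20, line_search_fn="strong_wolfe")

            def closure():
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                return loss

            return optimizer.step(closure)  # runs up to max_iter L-BFGS iterations internally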