Issues: pytorch/torchtitan
[Checkpointing] fails out if checkpoint folder does not exist when using keep_latest_k
bug
Something isn't working
module: checkpoint
#911
opened Mar 2, 2025 by
lessw2020
[Possible PR discuss] Will a PR of training HF model be welcomed?
community help wanted
huggingface integration
#903
opened Feb 28, 2025 by
junjzhang
Question about triton in deepseek implementation
question
Further information is requested
#902
opened Feb 28, 2025 by
zqwenn
Moving train.py to torchtitan submodule makes run_train.sh fail with "Can not find module"
#897
opened Feb 27, 2025 by
jianiw25
[Slurm support] New file layout breaks slurm launching...cannot find module 'torchtitan.blah'
#890
opened Feb 26, 2025 by
lessw2020
dcp.load fails on checkpoints prior to AdamW refactor
module: checkpoint
#886
opened Feb 25, 2025 by
eminorhan
[Evaluation] Minimal support for downstream tasks
enhancement
New feature or request
#883
opened Feb 24, 2025 by
K-H-Ismail
[Float8] Rowwise with AsyncTP runs at roughly same perf as vanilla TP
bug
Something isn't working
module: float8
#866
opened Feb 20, 2025 by
lessw2020
[Float8] Float8 rowwise with vanilla TP encounters NaN around 80 iters in...
bug
Something isn't working
module: float8
#865
opened Feb 20, 2025 by
lessw2020
[Float8] Unable to run asyncTP + Float8 row with 'full' AC active, leading dims mismatch
bug
Something isn't working
module: float8
#864
opened Feb 20, 2025 by
lessw2020
How to define Custom Communication Operations for Custom Operators in Distributed Settings
module: dtensor
question
Further information is requested
#852
opened Feb 17, 2025 by
Doraemonzzz
"Universal" Checkpointing
module: checkpoint
question
Further information is requested
#850
opened Feb 17, 2025 by
jeromeku
Mitigation to HuggingFace Trainer
enhancement
New feature or request
huggingface integration
#824
opened Feb 6, 2025 by
huyiwen
HSDP causes loss instability
module: fsdp
question
Further information is requested
#813
opened Jan 31, 2025 by
apkumar
debug model training hangs on NVIDIA B200 with >1 GPU
bug
Something isn't working
module: c10d
#810
opened Jan 28, 2025 by
vkuzo
Gradient Scaling With Pipeline Parallelism
module: pipelining
question
Further information is requested
#803
opened Jan 24, 2025 by
windsornguyen
[Bug] Unexpected performance drop with float8 training + compiling only nn.Linear layers + using selective per op AC
bug
Something isn't working
#786
opened Jan 10, 2025 by
danielvegamyhre
Why use RowwiseParallel for nn.Embedding instead of ColwiseParallel?
question
Further information is requested
#785
opened Jan 10, 2025 by
corey-lambda
BUG: early_step_in_backward with pipeline parallelism and len(model_parts) > 1
bug
Something isn't working
#777
opened Jan 7, 2025 by
cassanof