pytorch / torchtitan Public

Issues: pytorch/torchtitan


71 Open 187 Closed

Issues list

[Checkpointing] fails out if checkpoint folder does not exist when using keep_latest_k

Labels: bug, module: checkpoint

#911 opened Mar 2, 2025 by lessw2020

[Checkpointing] Set reasonable default for keep_latest_k checkpoints and avoid overflowing user drives...currently set to 'infinite' checkpoints

Labels: module: checkpoint

#910 opened Mar 1, 2025 by lessw2020

[Checkpointing] Using keep_latest_k setting results in failure when using external mounted drive

Labels: module: checkpoint

#909 opened Mar 1, 2025 by lessw2020

[Possible PR discuss] Will a PR of training HF model be welcomed?

Labels: community help wanted, huggingface integration

#903 opened Feb 28, 2025 by junjzhang

Question about triton in deepseek implementation

Labels: question

#902 opened Feb 28, 2025 by zqwenn

better handling of tensorwise float8 recipe in configuration

#901 opened Feb 27, 2025 by vkuzo

Moving train.py to torchtitan submodule makes run_train.sh failed with "Can not find module"

#897 opened Feb 27, 2025 by jianiw25

[Slurm support] New file layout breaks slurm launching...cannot find module 'torchtitan.blah'

#890 opened Feb 26, 2025 by lessw2020

[Feature request] Integrate DeepGEMM

#889 opened Feb 26, 2025 by lessw2020

dcp.load fails on checkpoints prior to AdamW refactor

Labels: module: checkpoint

#886 opened Feb 25, 2025 by eminorhan

Possible to integrate DeepEP?

#885 opened Feb 25, 2025 by ericxsun

[Evaluation] Minimal support for downstream tasks

Labels: enhancement

#883 opened Feb 24, 2025 by K-H-Ismail

[Float8] Rowwise with AsyncTP runs at roughly same perf as vanilla TP

Labels: bug, module: float8

#866 opened Feb 20, 2025 by lessw2020

[Float8] Float8 rowwise with vanilla TP encounters NaN around 80 iters in...

Labels: bug, module: float8

#865 opened Feb 20, 2025 by lessw2020

[Float8] Unable to run asyncTP + Float8 row with 'full' AC active, leading dims mismatch

Labels: bug, module: float8

#864 opened Feb 20, 2025 by lessw2020

How to define Custom Communication Operations for Custom Operators in Distributed Settings

Labels: module: dtensor, question

#852 opened Feb 17, 2025 by Doraemonzzz

"Universal" Checkpointing

Labels: module: checkpoint, question

#850 opened Feb 17, 2025 by jeromeku

Mitigation to HuggingFace Trainer

Labels: enhancement, huggingface integration

#824 opened Feb 6, 2025 by huyiwen

HSDP causes loss instability

Labels: module: fsdp, question

#813 opened Jan 31, 2025 by apkumar

debug model training hangs on NVIDIA B200 with >1 GPU

Labels: bug, module: c10d

#810 opened Jan 28, 2025 by vkuzo

unable to run 8b llama

Labels: bug

#807 opened Jan 27, 2025 by asahni04

Gradient Scaling With Pipeline Parallelism

Labels: module: pipelining, question

#803 opened Jan 24, 2025 by windsornguyen

[Bug] Unexpected performance drop with float8 training + compiling only nn.Linear layers + using selective per op AC

Labels: bug

#786 opened Jan 10, 2025 by danielvegamyhre

Why use RowwiseParallel for nn.Embedding instead of ColwiseParallel?

Labels: question

#785 opened Jan 10, 2025 by corey-lambda

BUG: early_step_in_backward with pipeline parallelism and len(model_parts) > 1

Labels: bug

#777 opened Jan 7, 2025 by cassanof
