Sample selection when trying to train a Model #527
-
Hi, sorry to bother you, but I have a question.

**Description:** In Figure 5.9, the demo uses a sample length of 6 and first selects tokens 1-6 for training.

**Problem:** This means tokens 4, 5, and 6 can never see tokens 8 and 9, and when we train the model, we use this method for a whole epoch.

**A solution I've heard of:** Some people use a random starting index to select the context.

**What I want to know:** When you actually train a model in industry, how do you construct the data samples?

Thanks and best regards. (Sorry, I don't know how to include the figure here, so I just give the token numbers.) Finally, I would like to say that this is the best book I have encountered while studying LLMs!
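For anyone reading along, here is a minimal sketch of the fixed-stride sampling pattern the question describes (the token IDs, window size, and variable names are illustrative, not the book's exact code). With stride equal to the window length, each window becomes an independent sample, so tokens in one window never attend to tokens in the next:

```python
# Stand-in token IDs 1..14; in practice these would come from a tokenizer.
tokens = list(range(1, 15))
max_length, stride = 6, 6  # non-overlapping windows, as in the question

# Each sample is an (input, target) pair where the target is the input
# shifted right by one token (next-token prediction).
samples = [
    (tokens[i:i + max_length], tokens[i + 1:i + max_length + 1])
    for i in range(0, len(tokens) - max_length, stride)
]

for inp, tgt in samples:
    print(inp, "->", tgt)
# Tokens 4, 5, 6 only ever appear in the first sample, and tokens 8, 9
# only in the second, so they are never in the same context window.
```

This reproduces the situation from Figure 5.9: the window boundaries are fixed, so the same token pairs are separated on every pass over the data.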
-
That's a really good observation and question.
Yes, you are correct. Nowadays, some companies also add a long-context pre-training stage at the end of the pre-training cycle, where the model is specifically fed whole, long-context documents, e.g., with >100k tokens.
Since it's common to train for only 1 epoch, though, this wouldn't really address the issue. Usually, the random starting index is used more to keep things simple compared to a full data loader, rather than to fix this particular problem.
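To make the random-starting-index idea concrete, here is a hedged sketch (the function name and parameters are my own, not from the book): instead of always chunking from position 0, the whole window grid is shifted by a random offset, e.g., once per epoch, so the boundaries fall between different token pairs on different passes.

```python
import random

def make_samples(tokens, max_length, stride, rng):
    """Chunk tokens into (input, target) pairs, starting from a random
    offset so that window boundaries vary between calls/epochs."""
    offset = rng.randrange(stride)  # fresh offset, e.g., drawn once per epoch
    return [
        (tokens[i:i + max_length], tokens[i + 1:i + max_length + 1])
        for i in range(offset, len(tokens) - max_length, stride)
    ]

tokens = list(range(1, 101))  # stand-in token IDs
rng = random.Random(0)
epoch_samples = make_samples(tokens, max_length=6, stride=6, rng=rng)
```

Note that with single-epoch training this only changes *which* boundaries you get, not the fact that boundaries exist, which matches the point above that it doesn't fully address the issue.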
-
Thank you very much for your reply; it helps me a lot.