
Only one line being read in a 100GB+ gzip file (Wikidata dump) #307

Closed
dipstef opened this issue Jun 10, 2022 · 2 comments
Comments


dipstef commented Jun 10, 2022

Hi all!

Apologies if I missed something here and the error is on my end (though this is no different from standard usage of this library). I am reading a Wikidata dump line by line, and since the dump is one giant JSON array, only the first line, containing the opening square bracket, is ever returned:

The dump is the following: https://dumps.wikimedia.org/wikidatawiki/entities/20220606/wikidata-20220606-all.json.gz

use flate2::read::GzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader};

let path = ".../wikidata-20220606-all.json.gz";
let f = File::open(path).expect("file not found");
let reader = BufReader::new(GzDecoder::new(f));

// Prints only the opening "[" and then stops.
reader.lines().for_each(|l| {
    println!("{}", l.unwrap());
});

Switching to the loop-based form, the second call to read_line returns 0 bytes, which is consistent with the behaviour of the lines iterator.

let mut reader = BufReader::new(GzDecoder::new(f));
let mut buf = String::new();

// read_line returns Ok(0) at end of stream; here the second call
// already reports 0 bytes.
while let Ok(n) = reader.read_line(&mut buf) {
    if n == 0 {
        break;
    }
    println!("{}", buf);
    buf.clear();
}

There are no issues when reading the same file with gzcat or a Python script.

Any idea on how to troubleshoot this?

Thanks in advance for your help!

@alexcrichton (Member)

I believe for Wikipedia dumps you need to use MultiGzDecoder.
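
For context: these dumps are written as many concatenated gzip members, and flate2's GzDecoder stops at the end of the first member, which is why only the opening bracket line came back. MultiGzDecoder keeps decoding across member boundaries, matching what command-line gzip and Python's gzip module do by default. A minimal sketch of the fix, assuming the same placeholder path as in the snippet above:

use flate2::read::MultiGzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let path = ".../wikidata-20220606-all.json.gz";
    let f = File::open(path)?;
    // MultiGzDecoder continues past each gzip member boundary,
    // so the whole concatenated dump is decoded, not just the
    // first member.
    let reader = BufReader::new(MultiGzDecoder::new(f));
    for line in reader.lines() {
        println!("{}", line?);
    }
    Ok(())
}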


dipstef commented Jun 10, 2022

Cheers, that did it!

My follow-up question seems to be already addressed in this issue:

#178

So I will rely on MultiGzDecoder instead for arbitrary files.
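
(Worth noting: flate2's documentation points to MultiGzDecoder whenever the input may contain multiple gzip members; for a single-member file it behaves the same as GzDecoder, so defaulting to it is safe.)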

Cheers,

dipstef closed this as completed Jun 10, 2022