
Only one line being read in a 100GB+ gzip file (Wikidata dump) #307

Closed
dipstef opened this issue Jun 10, 2022 · 2 comments
Comments


dipstef commented Jun 10, 2022

Hi all!

Apologies if I missed something here and the error is on my end (though this is no different from standard usage of this library). I am reading a Wikidata dump line by line, and since the dump is one giant JSON array, only the first line, containing the opening square bracket, is ever returned:

The dump is the following: https://dumps.wikimedia.org/wikidatawiki/entities/20220606/wikidata-20220606-all.json.gz

use flate2::read::GzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader};

let path = ".../wikidata-20220606-all.json.gz";
let f = File::open(path).expect("file not found");
let reader = BufReader::new(GzDecoder::new(f));

// Prints only the opening "[" and then stops.
reader.lines().for_each(|l| {
    println!("{}", l.unwrap());
});

Switching to the loop-based form, the second call to read_line returns 0 bytes, which is consistent with the behaviour of the lines iterator.

let mut reader = BufReader::new(GzDecoder::new(f));
let mut buf = String::new();

// read_line returns Ok(0) at end of stream; here the second call
// already reports 0 bytes.
while let Ok(n) = reader.read_line(&mut buf) {
    if n == 0 {
        break;
    }
    println!("{}", buf);
    buf.clear();
}

There are no issues when reading the same file with gzcat or a Python script.

Any idea on how to troubleshoot this?

Thanks in advance for your help!

@alexcrichton (Member)

I believe for Wikipedia dumps you need to use MultiGzDecoder.
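
For context: these dumps are written as many concatenated gzip members, and flate2's GzDecoder stops at the end of the first member, which is why only the opening bracket line came back. MultiGzDecoder keeps decoding across member boundaries, matching what command-line gzip and Python's gzip module do by default. A minimal sketch of the fix, assuming the same placeholder path as in the snippet above:

use flate2::read::MultiGzDecoder;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let path = ".../wikidata-20220606-all.json.gz";
    let f = File::open(path)?;
    // MultiGzDecoder continues past each gzip member boundary,
    // so the whole concatenated dump is decoded, not just the
    // first member.
    let reader = BufReader::new(MultiGzDecoder::new(f));
    for line in reader.lines() {
        println!("{}", line?);
    }
    Ok(())
}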


dipstef commented Jun 10, 2022

Cheers, that did it!

My follow-up question seems to be already addressed in this issue:

#178

So I will rely on MultiGzDecoder instead for arbitrary files.
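
(Worth noting: flate2's documentation points to MultiGzDecoder whenever the input may contain multiple gzip members; for a single-member file it behaves the same as GzDecoder, so defaulting to it is safe.)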

Cheers,

dipstef closed this as completed Jun 10, 2022