Only one line being read in a 100GB+ gzip file (Wikidata dump) #307
Closed
Description
Hi all!
Apologies if I missed something and the error is on my side, although this is no different from any standard usage of this library. I am reading a Wikidata dump line by line. The dump is one giant JSON array, and only the first line, which contains the opening square bracket, is returned.
The dump is https://dumps.wikimedia.org/wikidatawiki/entities/20220606/wikidata-20220606-all.json.gz and this is the code I am using:
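use std::fs::File;
use std::io::{BufRead, BufReader};
use flate2::read::GzDecoder;

let path = ".../wikidata-20220606-all.json.gz";
let f = File::open(path).expect("file not found");
let reader = BufReader::new(GzDecoder::new(f));
// Print every decompressed line; only the first line ("[") is ever printed.
reader.lines().for_each(|l| {
    println!("{}", l.expect("failed to read line"));
});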
Switching to a loop-based version, the second call to read_line returns 0 bytes, which is consistent with the behaviour of the lines iterator.
let mut reader = BufReader::new(GzDecoder::new(f));
let mut buf = String::new();
// read_line returns Ok(0) at end of stream; here that happens on the second call.
while let Ok(n) = reader.read_line(&mut buf) {
    if n == 0 {
        break;
    }
    print!("{}", buf); // buf already ends with the newline
    buf.clear();
}
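For reference, here is a minimal std-only sketch of the loop semantics I am relying on. It assumes an in-memory `Cursor` as a stand-in for the decompressed stream; `collect_lines` and the sample data are hypothetical and are not part of flate2 or the real dump. It only illustrates that `read_line` returns `Ok(0)` when the underlying reader reports end of stream, so a loop that stops after one line means the reader itself signalled the end there.

```rust
use std::io::{BufRead, Cursor};

/// Hypothetical helper: collects lines with a `read_line` loop,
/// stopping when `read_line` returns `Ok(0)`.
fn collect_lines<R: BufRead>(mut reader: R) -> std::io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut buf = String::new();
    // `read_line` returns Ok(0) only when the underlying reader reports end of stream.
    while reader.read_line(&mut buf)? != 0 {
        // Strip the trailing newline that `read_line` keeps in the buffer.
        lines.push(buf.trim_end_matches('\n').to_string());
        buf.clear();
    }
    Ok(lines)
}

fn main() -> std::io::Result<()> {
    // Made-up sample shaped like the dump: one JSON array, one entity per line.
    let data = Cursor::new("[\n{\"id\":\"Q1\"},\n{\"id\":\"Q2\"}\n]\n");
    let lines = collect_lines(data)?;
    println!("{} lines read", lines.len());
    Ok(())
}
```

With a plain in-memory reader the loop reaches the closing bracket, so the early stop on the real file appears to come from the reader wrapped by `BufReader` rather than from the loop.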
There are no issues when reading the same file with gzcat or a Python script.
Any idea on how to troubleshoot this?
Thanks in advance for your help!