
Only one line being read in a 100GB+ gzip file (Wikidata dump) #307

Closed
@dipstef

Description

Hi all!

Apologies if I missed something and the error is on my side (although this is no different from standard usage of this library). I am reading a Wikidata dump line by line. The dump is one giant JSON array, and only the first line, which contains the opening square bracket, is returned.

The dump is the following: https://dumps.wikimedia.org/wikidatawiki/entities/20220606/wikidata-20220606-all.json.gz

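use std::fs::File;
use std::io::{BufRead, BufReader};
use flate2::read::GzDecoder; // assuming the flate2 crate's read-based decoder

let path = ".../wikidata-20220606-all.json.gz";
let f = File::open(&path).expect("file not found");
let reader = BufReader::new(GzDecoder::new(f));

// Print every decompressed line; only the first one ("[") is ever printed.
reader.lines().for_each(|l| {
    println!("{}", l.unwrap());
});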

Switching to a loop-based version, the second call to read_line returns 0 bytes, which is consistent with the behaviour of the lines iterator.

let mut reader = BufReader::new(GzDecoder::new(f));

let mut buf = String::new();
// Note: `while let Ok(..)` also ends the loop silently on an Err.
while let Ok(n) = reader.read_line(&mut buf) {
    match n {
        // 0 bytes read means end of input.
        0 => {
            buf.clear();
            break;
        }
        _ => {
            println!("{}", buf);
            buf.clear();
        }
    }
}
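
Since `while let Ok(..)` exits the loop on an error exactly as it does on end of input, a variant that reports errors may help to tell the two cases apart while troubleshooting. The following is only a minimal sketch using the standard library; the helper name `for_each_line` is hypothetical and not part of any crate, and it says nothing about why the decoder stops early.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: reads `reader` line by line and calls `on_line` for
/// each line (with the trailing newline removed). Returns the number of
/// lines read, or the first I/O error instead of stopping silently.
fn for_each_line<R: BufRead>(
    mut reader: R,
    mut on_line: impl FnMut(&str),
) -> io::Result<usize> {
    let mut buf = String::new();
    let mut count = 0;
    loop {
        buf.clear();
        // Ok(0) means end of input; `?` propagates any Err
        // (for example invalid UTF-8 or a failure in the underlying reader).
        if reader.read_line(&mut buf)? == 0 {
            return Ok(count);
        }
        on_line(buf.trim_end_matches(&['\r', '\n'][..]));
        count += 1;
    }
}
```

With the reader from the snippets above, this could presumably be called as `for_each_line(BufReader::new(GzDecoder::new(f)), |l| println!("{}", l))`. A returned `Ok(1)` would confirm that the decoder itself reports a clean end of stream after the first line, rather than an error being swallowed.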

There are no issues when reading the same file with gzcat or a Python script.

Any idea on how to troubleshoot this?

Thanks in advance for your help!
