htmlParseEntityRef when parsing gett.com #229

Open
SaraVieira opened this issue Nov 25, 2018 · 1 comment

Hey 👋

Amazing lib. This is the first time I've run into any issues.

I am trying to scrape the https://gett.com/uk/about/ webpage to get all the cities they are in, but I get an error:

❯ yarn seed
yarn run v1.12.3
$ node data/seed.js
(get) loaded [get] https://gett.com/uk/about
(find) no results for "#section3"
[]
(get) stack: 0, requests: 1 (0 queued), RAM: 36.09Mb (+36.09Mb), libxml: 0.0% (44 nodes), heap: 60% of 16.83Mb
✨  Done in 0.70s.

~/Projects/uber-cities master*
❯ yarn seed
yarn run v1.12.3
$ node data/seed.js
(get) loaded [get] https://gett.com/uk/about
Document {
  errors:
   [ { Error: htmlParseEntityRef: expecting ';'

    at Object.module.exports.fromHtml (/Users/saravieira/Projects/uber-cities/node_modules/libxmljs/lib/document.js:143:21)
    at next (/Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:51:31)
    at /Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:99:13
    at done (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:432:14)
    at PassThrough.<anonymous> (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:671:11)
    at PassThrough.emit (events.js:180:13)
    at endReadableNT (_stream_readable.js:1106:12)
    at process._tickCallback (internal/process/next_tick.js:178:19)
       domain: 5,
       code: 23,
       level: 2,
       column: 341,
       file: 'https://gett.com/uk/about',
       line: 1 },
     { Error: htmlParseEntityRef: expecting ';'

    at Object.module.exports.fromHtml (/Users/saravieira/Projects/uber-cities/node_modules/libxmljs/lib/document.js:143:21)
    at next (/Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:51:31)
    at /Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:99:13
    at done (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:432:14)
    at PassThrough.<anonymous> (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:671:11)
    at PassThrough.emit (events.js:180:13)
    at endReadableNT (_stream_readable.js:1106:12)
    at process._tickCallback (internal/process/next_tick.js:178:19)
       domain: 5,
       code: 23,
       level: 2,
       column: 473,
       file: 'https://gett.com/uk/about',
       line: 1 },
     { Error: htmlParseEntityRef: expecting ';'

    at Object.module.exports.fromHtml (/Users/saravieira/Projects/uber-cities/node_modules/libxmljs/lib/document.js:143:21)
    at next (/Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:51:31)
    at /Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:99:13
    at done (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:432:14)
    at PassThrough.<anonymous> (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:671:11)
    at PassThrough.emit (events.js:180:13)
    at endReadableNT (_stream_readable.js:1106:12)
    at process._tickCallback (internal/process/next_tick.js:178:19)
       domain: 5,
       code: 23,
       level: 2,
       column: 516,
       file: 'https://gett.com/uk/about',
       line: 1 },
     { Error: htmlParseEntityRef: expecting ';'

    at Object.module.exports.fromHtml (/Users/saravieira/Projects/uber-cities/node_modules/libxmljs/lib/document.js:143:21)
    at next (/Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:51:31)
    at /Users/saravieira/Projects/uber-cities/node_modules/osmosis/lib/Request.js:99:13
    at done (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:432:14)
    at PassThrough.<anonymous> (/Users/saravieira/Projects/uber-cities/node_modules/needle/lib/needle.js:671:11)
    at PassThrough.emit (events.js:180:13)
    at endReadableNT (_stream_readable.js:1106:12)
    at process._tickCallback (internal/process/next_tick.js:178:19)
       domain: 5,
       code: 23,
       level: 2,
       column: 525,
       file: 'https://gett.com/uk/about',
       line: 1 } ],

I assume this is because the HTML on their page is malformed.

Is there any way to get around this and return the HTML, even if only as a string?

Thank you

rchipka (owner) commented Nov 25, 2018

This is caused by HTML entities that are missing their trailing semicolon, such as:

Editors’ Choice&nbspon&nbspApp&nbspStore

One option would be using the preprocess option to fix these.

The only other option would be to set libxml to ignore HTML entities.
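
For the first option, here is a minimal sketch of the kind of fix-up that could be applied to the raw HTML string before it is parsed. The helper name `fixBareNbsp` is hypothetical and not part of osmosis or libxmljs, and it only handles `&nbsp`, the entity shown in the example above. How to hook it into the preprocess option is not shown here, since that depends on the library's actual API.

```javascript
// Hypothetical helper (not part of osmosis or libxmljs):
// append the missing ';' to every '&nbsp' that is not already followed by one.
function fixBareNbsp(html) {
  // (?!;) is a negative lookahead, so well-formed '&nbsp;' is left untouched.
  return html.replace(/&nbsp(?!;)/g, '&nbsp;');
}

// Example using the fragment quoted above.
const broken = 'Editors’ Choice&nbspon&nbspApp&nbspStore';
const fixed = fixBareNbsp(broken);
console.log(fixed);
```

Restricting the pattern to `&nbsp` keeps it from rewriting other text that merely starts with an entity name. Other bare entities would need their own, equally careful, patterns.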
