Get PDF content #3551

matteocacciola · 2025-02-21T13:04:01Z

If I have the link to a PDF file, how can I get its content using SeleniumBase?


mdmintz (Maintainer) · 2025-02-21T13:36:46Z

There's sb.get_pdf_text(pdf):

SeleniumBase/help_docs/method_summary.md

Lines 329 to 331 in 8317bc8

    self.get_pdf_text(
        pdf, page=None, maxpages=None, password=None,
        codec='utf-8', wrap=False, nav=False, override=False, caching=True)

Example: SeleniumBase/examples/test_get_pdf_text.py

from seleniumbase import BaseCase
BaseCase.main(__name__, __file__)

class PdfTests(BaseCase):
    def test_get_pdf_text(self):
        pdf = "https://nostarch.com/download/Automate_the_Boring_Stuff_sample_ch17.pdf"
        pdf_text = self.get_pdf_text(pdf, page=1)
        print("\n" + pdf_text)

4 replies

matteocacciola Feb 21, 2025
Author

Thank you for your prompt support. Perhaps I did not explain my problem well. I don't want to extract the text with PDFMiner; I want to grab the raw bytes exactly as they come back in the HTTP response. Is that clearer now?

mdmintz Feb 21, 2025
Maintainer

Maybe something from one of these:

matteocacciola Feb 21, 2025
Author

Thank you, but I can't figure out how to simply grab the content, the way you would with the requests library, for instance.

mdmintz Feb 21, 2025
Maintainer

There are entire repos devoted to PDF-parsing for a reason (it's a non-trivial task):

If the SeleniumBase get_pdf_text(pdf) method isn't good enough, and you can't figure out how to retrieve the data via the examples above (such as examples/cdp_mode/raw_xhr_sb.py), then you're going to need to use an external PDF parser repo.
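For readers who land here with the same "raw bytes" question, here is a minimal sketch of one way to do it outside the browser. It is not a SeleniumBase API: the helper names (cookie_header, fetch_pdf_bytes, looks_like_pdf) are hypothetical, it uses only the standard library's urllib, and it assumes the PDF URL is reachable with a plain HTTP GET plus, optionally, the cookies from the browser session (Selenium's driver.get_cookies() returns dicts with "name" and "value" keys).

```python
import urllib.request


def cookie_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def fetch_pdf_bytes(url, cookies=None, timeout=30):
    """Return the undecoded response body for `url` (hypothetical helper)."""
    # Some servers reject requests without a browser-like User-Agent.
    headers = {"User-Agent": "Mozilla/5.0"}
    if cookies:
        # Reuse the browser session's cookies so authenticated links work.
        headers["Cookie"] = cookie_header(cookies)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()  # raw bytes, no text extraction


def looks_like_pdf(data):
    """Quick sanity check: PDF files normally begin with the %PDF- marker."""
    return data.startswith(b"%PDF-")
```

Inside a BaseCase test this could be called as fetch_pdf_bytes(pdf, cookies=self.driver.get_cookies()), and the result written to disk with open(path, "wb"). If the server only serves the file to a real browser (for example after JavaScript checks), this approach will likely not be enough.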
