Java program which counts top Javascript libraries used on web pages found in google
In order to get the result links from google, I decided to parse the web page (I think it's the heaviest operation), but we can improve this by using the Google's custom search API. We will need to get an API Key with a limit rate of 100 daily requests or pay for more $$$.
We're already using threads to retrieve the scripts based on the google links and the result list is sanitized by removing the path and parameters in order to identify duplicated libs.
Regarding the tests, I added some unit tests and ITs, the code coverage isn't 100% but it's an example of what we can do.
You only need to use maven as follows:
mvn clean package
Use the following in the terminal:
java -jar target/webcrawler-1.0.0-SNAPSHOT-jar-with-dependencies.jar
mvn verify
- 1.0.0
- SC01: App creation
- Java 11
- JSoup
- Maven
__ __ _ _____ _
\ \ / / | | / ____| | |
\ \ /\ / /__| |__ | | _ __ __ ___ _| | ___ _ __
\ \/ \/ / _ \ '_ \| | | '__/ _` \ \ /\ / / |/ _ \ '__|
\ /\ / __/ |_) | |____| | | (_| |\ V V /| | __/ |
\/ \/ \___|_.__/ \_____|_| \__,_| \_/\_/ |_|\___|_|
Please, introduce the search term:
java
Processing search page...
Found 15 links
Retrieving scripts...
Found 73 scripts
Generating metrics...
The top five is:
------------------------------------
adsbygoogle.js=11
jquery.js=3
api.js=3
satelliteLib-e598c5b61e39a10b402e048e87dd27b0f1cd2d4c.js=2
stl.js=2
------------------------------------
***Task execution finished***
- Fork it (/~https://github.com/jorgelizandro/web-crawler.git)
- Create your feature branch (
git checkout -b feature/fooBar
) - Commit your changes (
git commit -am 'Add some fooBar'
) - Push to the branch (
git push origin feature/fooBar
) - Create a new Pull Request