
Write a Java program that uses the Jsoup library to scrape the content of a webpage.

Note: the crawler versions later in this walkthrough use a Set to keep track of visited URLs in program memory.

This example retrieves the page title and all links on the page.
Before you start, you'll need to add the Jsoup library to your Java project.
Start by creating a project directory, then download the Jsoup .jar file into it.

You can download the Jsoup .jar file directly from the Jsoup website, and then manually add it to your project's classpath. Here's a step-by-step guide for doing this in an IDE like IntelliJ IDEA or Eclipse:
Download the Jsoup .jar file from the Jsoup download page.
Save the .jar file in your Project Directory: c:\projectDir
For IntelliJ IDEA:
Open your project in IntelliJ IDEA.
Click on File -> Project Structure.
In the Project Structure dialog, select Modules in the left-hand panel, and then select Dependencies tab.
Click on the + button -> JARs or directories.
Navigate to where you saved the Jsoup .jar file, select it, and click OK.
For Eclipse:
Open your project in Eclipse.
Right-click the project name in the Project Explorer, and then click Properties.
In the Properties window, select Java Build Path, and then select the Libraries tab.
Click Add External JARs, navigate to where you saved the Jsoup .jar file, select it, and click Open.
Click OK to close the Properties window.
After doing this, your project will be able to use the Jsoup library, and you can write and run your web scraping program.
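Alternatively, if your project is built with Maven rather than a bare IDE classpath, Jsoup can be pulled in as a dependency instead of downloading the jar by hand. A minimal fragment for the pom.xml, using the same version as the jar referenced below (check the Jsoup site for the latest release):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>
```

With a build tool managing the dependency, the manual classpath steps above are unnecessary.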

To specify the jar file at compile time, you can use the -cp or -classpath option with the javac command. This allows the Java compiler to find the required classes in the specified jar file during the compilation process. Here's how you can do it:


C:\s23JavaSoupProject>javac -cp "jsoup-1.16.1.jar" WebCrawler.java
Explanation:
Use the javac command to compile the Java source file.
The -cp or -classpath option is used to specify the classpath.
jsoup-1.16.1.jar is the jar file containing the Jsoup library.
With this command, the Java compiler will be able to find the classes from the Jsoup library while compiling your WebCrawler class.
After compiling the program, you can run it as mentioned in the previous response using the java command with the -cp option:

C:\s23JavaSoupProject>java -cp "jsoup-1.16.1.jar;." WebCrawler


The above command specifies the classpath for the runtime environment, allowing Java to find the necessary classes and resources, including the Jsoup library, when running your program.
Remember that if you are on a Unix-based system (e.g., Linux or macOS), use : as the path separator instead of ; for both the javac and java commands.
By providing the correct classpath at compile time and runtime, your Java program should work seamlessly with the Jsoup library.
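The separator difference between platforms is exposed by the JDK itself, so you can check which one your JVM expects. A small sketch (the class name is made up for illustration):

```java
import java.io.File;

// Prints the classpath separator for the current platform:
// ";" on Windows, ":" on Unix-like systems (Linux, macOS).
public class PathSeparatorDemo {
    public static String classpathSeparator() {
        return File.pathSeparator;
    }

    public static void main(String[] args) {
        System.out.println("Use '" + classpathSeparator()
                + "' between classpath entries on this OS");
    }
}
```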
WebCrawler.java

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class WebCrawler {

    public static void main(String[] args) {
        try {
            // Get the HTML document
            Document document = Jsoup.connect("http://cnn.com").get();

            // Get the title of the webpage
            String title = document.title();
            System.out.println("Title: " + title);

            // Get all links on the page
            Elements links = document.select("a[href]");
            for (Element link : links) {
                System.out.println("Link: " + link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This program first connects to the webpage at http://cnn.com and retrieves the HTML document. It then prints the title of the webpage and all the hyperlinks contained in the page.

The following code extends the earlier program to follow the links found on the "example.com/index.html" page and print out the title of each linked page. The modification involves adding a method processPage(String URL) which does the work of visiting a given URL, printing its title, and adding new URLs to be visited:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Main {
    private static Set<String> visitedPages = new HashSet<>();

    public static void main(String[] args) {
        try {
            processPage("http://example.com/index.html");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void processPage(String URL) throws IOException {
        // Check if we have already visited this URL
        if (!visitedPages.contains(URL)) {
            visitedPages.add(URL);

            Document document = Jsoup.connect(URL).get();

            System.out.println("Title: " + document.title());

            // Get all links on the page
            Elements linksOnPage = document.select("a[href]");

            for (Element page : linksOnPage) {
                // Only consider URLs which are part of the original site
                if (page.attr("href").contains("example.com")) {
                    processPage(page.attr("abs:href"));
                }
            }
        }
    }
}

Please note that you should replace "http://example.com/index.html" and "example.com" with your desired start URL and domain to crawl, respectively.
Also, be aware that this is a recursive process and can potentially follow links endlessly if not carefully controlled.

The script may require further refinement to avoid falling into traps like following links to the same page or cycling through a series of pages indefinitely. Always make sure to respect the website's robots.txt rules and be mindful not to cause a denial of service by sending too many requests in a short time.

What you need to do for the project:

Modify the provided program so that, instead of recursively following every referenced link from every referenced link, you only crawl out to the first generation of referenced links.


Also, note that you will need to handle various edge cases in real-world use, such as relative links, external links, broken links, etc., which are not fully handled in this simplified example.
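The one-generation requirement boils down to passing a depth counter into processPage and stopping once it exceeds the limit. That control flow can be sketched without Jsoup or network access by using an in-memory map of page-to-links as a stand-in for the web (all page names here are invented for illustration):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedCrawl {
    // Stand-in for the web: page -> links found on that page.
    private static final Map<String, List<String>> FAKE_WEB = Map.of(
            "index", List.of("a", "b"),
            "a", List.of("c"),     // "c" is second generation: must NOT be visited
            "b", List.of("index"), // cycle back to the start page
            "c", List.of("d"));

    // Visit `start`, then its links, down to maxDepth generations.
    public static Set<String> crawl(String start, int maxDepth) {
        Set<String> visited = new HashSet<>();
        visit(start, 0, maxDepth, visited);
        return visited;
    }

    private static void visit(String url, int depth, int maxDepth,
                              Set<String> visited) {
        if (depth > maxDepth || !visited.add(url)) {
            return; // too deep, or already seen
        }
        for (String link : FAKE_WEB.getOrDefault(url, List.of())) {
            visit(link, depth + 1, maxDepth, visited);
        }
    }

    public static void main(String[] args) {
        // Depth 1 = the start page plus its first generation of links only.
        System.out.println(DepthLimitedCrawl.crawl("index", 1));
    }
}
```

In the real crawler, the same depth parameter would be threaded through processPage, with Jsoup supplying the links instead of the map.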

Write an enhanced version of the above program to store the data retrieved from webpages as JSON on the FILE SYSTEM

JSON is JavaScript Object Notation: very commonly used as a “box” into which to put data to move it around between web applications. Think of JSON as a data description language, just as HTML is a page description language.
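For this scraper, one reasonable shape for the stored data (a hypothetical example, not output from a real page) is an object holding the page URL, its title, and the links found on it:

```json
{
  "url": "http://example.com/index.html",
  "title": "Example Domain",
  "links": [
    "http://example.com/about.html",
    "http://example.com/contact.html"
  ]
}
```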
In this updated program, I'll use the org.json library to create JSON objects and the java.nio.file package to write data to the file system.
First, make sure to include the org.json library in your classpath.
Here is the enhanced version of the web scraper that stores the retrieved data as JSON:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.json.JSONObject;

public class WebScraper {
    public static void main(String[] args) throws IOException {
        String url = "http://example.com/index.html";
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        int i = 0;

        for (var link : links) {
            String linkHref = link.attr("abs:href");
            String linkText = link.text();

            // Build a JSON object describing this link
            JSONObject json = new JSONObject();
            json.put("url", linkHref);
            json.put("text", linkText);
            json.put("sourcePage", url);
            json.put("sourceTitle", doc.title());

            // Write each link's data to its own numbered JSON file
            Files.write(Paths.get("link" + i + ".json"),
                    json.toString(2).getBytes());
            i++;
        }

        System.out.println("Wrote " + i + " JSON files.");
    }
}