
Write a Java program that uses the Jsoup library to scrape the content of a webpage.

A later version of this code uses a SET to store visited pages in program space.

This example retrieves the page title and all links on the page.
Before you start, you'll need to add the Jsoup library to your Java project.
Start by making a project directory and downloading the Jsoup .jar file into it.

You can download the Jsoup .jar file directly from the Jsoup website, and then manually add it to your project's classpath. Here's a step-by-step guide for doing this in an IDE like IntelliJ IDEA or Eclipse:
Download the Jsoup .jar file from the Jsoup download page.
Save the .jar file in your Project Directory: c:\projectDir
For IntelliJ IDEA:
Open your project in IntelliJ IDEA.
Click on File -> Project Structure.
In the Project Structure dialog, select Modules in the left-hand panel, and then select Dependencies tab.
Click on the + button -> JARs or directories.
Navigate to where you saved the Jsoup .jar file, select it, and click OK.
For Eclipse:
Open your project in Eclipse.
Right-click the project name in the Project Explorer, and then click Properties.
In the Properties window, select Java Build Path, and then select the Libraries tab.
Click Add External JARs, navigate to where you saved the Jsoup .jar file, select it, and click Open.
Click OK to close the Properties window.
After doing this, your project will be able to use the Jsoup library, and you can write and run your web scraping program.

To specify the jar file at compile time, you can use the -cp or -classpath option with the javac command. This allows the Java compiler to find the required classes in the specified jar file during the compilation process. Here's how you can do it:


C:\s23JavaSoupProject>javac -cp "jsoup-1.16.1.jar" WebCrawler.java

Explanation:
Use the javac command to compile the Java source file.
The -cp or -classpath option is used to specify the classpath.
jsoup-1.16.1.jar is the jar file containing the Jsoup library.
With this command, the Java compiler will be able to find the classes from the Jsoup library while compiling your WebCrawler class.
After compiling the program, you can run it as mentioned in the previous response using the java command with the -cp option:

C:\s23JavaSoupProject>java -cp "jsoup-1.16.1.jar;." WebCrawler


The above command specifies the classpath for the runtime environment, allowing Java to find the necessary classes and resources, including the Jsoup library, when running your program.
Remember that if you are on a Unix-based system (e.g., Linux or macOS), use : as the path separator instead of ; for both the javac and java commands, for example: java -cp "jsoup-1.16.1.jar:." WebCrawler
By providing the correct classpath at compile time and runtime, your Java program should work seamlessly with the Jsoup library.
WebCrawler.java

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class WebCrawler {

    public static void main(String[] args) {

        Document document;
        try {
            // Get the HTML document
            document = Jsoup.connect("http://cnn.com").get();

            // Get the title of the webpage
            String title = document.title();
            System.out.println("Title: " + title);

            // Get all links on the page
            Elements links = document.select("a[href]");
            for (Element link : links) {
                System.out.println("Link: " + link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This program first connects to the webpage at http://cnn.com and retrieves the HTML document. It then prints the title of the webpage and all the hyperlinks contained in the page.

The following code extends the earlier program to follow the links found on the "example.com/index.html" page and print out the title of each linked page. The modification involves adding a method processPage(String URL) which does the work of visiting a given URL, printing its title, and adding new URLs to be visited:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Main {
    private static Set<String> visitedPages = new HashSet<>();

    public static void main(String[] args) {
        try {
            processPage("http://example.com/index.html");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void processPage(String URL) throws IOException {
        // Check if we have already visited this URL
        if (!visitedPages.contains(URL)) {
            visitedPages.add(URL);

            Document document = Jsoup.connect(URL).get();

            System.out.println("Title: " + document.title());

            // Get all links on the page
            Elements linksOnPage = document.select("a[href]");

            for (Element page : linksOnPage) {
                // Only consider URLs which are part of the original site
                if (page.attr("href").contains("example.com")) {
                    processPage(page.attr("abs:href"));
                }
            }
        }
    }
}

Please note that you should replace "http://example.com/index.html" and "example.com" with your desired start URL and domain to crawl, respectively.
Also, be aware that this is a recursive process and can potentially follow links endlessly if not carefully controlled.

The script may require further refinement to avoid falling into traps like following links to the same page or cycling through a series of pages indefinitely. Always make sure to respect the website's robots.txt rules and be mindful not to cause a denial of service by sending too many requests in a short time.
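The visited-set guard that prevents those cycles can be seen on its own, without Jsoup or any network access. This is a minimal sketch using only java.util: Set.add() returns false when the element is already present, which is exactly how a crawler can detect that it is about to revisit a page (the URLs here are hypothetical).

```java
import java.util.HashSet;
import java.util.Set;

public class VisitedDemo {
    public static void main(String[] args) {
        Set<String> visitedPages = new HashSet<>();

        String[] crawlOrder = {
            "http://example.com/index.html",
            "http://example.com/about.html",
            "http://example.com/index.html"   // a link cycling back to the start page
        };

        for (String url : crawlOrder) {
            // add() returns false if the URL was already in the set,
            // so a cycle is detected without a separate contains() call
            if (visitedPages.add(url)) {
                System.out.println("Visiting: " + url);
            } else {
                System.out.println("Skipping already-visited: " + url);
            }
        }
    }
}
```

Because a HashSet holds each URL only once, the third entry is skipped and the crawl cannot loop forever on that pair of pages.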

What you need to do for the project:

Modify the provided program so that, instead of endlessly following every referenced link of every referenced link, you only go out to the first generation of referenced links.


Also, note that you will need to handle various edge cases in real-world use, such as relative links, external links, broken links, etc., which are not fully handled in this simplified example.
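One way to think about the "first generation only" requirement is sketched below against a hypothetical in-memory link graph (a Map standing in for Jsoup.connect), so the control flow is visible without any network access. The page names and the method name are assumptions for illustration; the key point is that the loop visits each child of the start page but never recurses into it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FirstGenerationCrawl {

    // Visit the start page, then each page it links to, and stop there:
    // links found on the children are NOT followed.
    static List<String> crawlFirstGeneration(Map<String, List<String>> linkGraph,
                                             String startUrl) {
        List<String> visited = new ArrayList<>();
        visited.add(startUrl);
        for (String child : linkGraph.getOrDefault(startUrl, List.of())) {
            if (!visited.contains(child)) {
                visited.add(child);   // visit the child, but do not recurse into it
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical site structure: index links to a and b; a links onward to c
        Map<String, List<String>> site = Map.of(
            "index", List.of("a", "b"),
            "a", List.of("c"));
        System.out.println(crawlFirstGeneration(site, "index")); // prints [index, a, b]
    }
}
```

Note that "c" is never visited, because it is a second-generation link; in the real program the same shape applies, with Jsoup fetching the children instead of the Map lookup.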

Write an enhanced version of the above program to store the data retrieved from webpages as JSON on the FILE SYSTEM.

JSON is JavaScript Object Notation: a very commonly used "box" into which to put data in order to move it around a web application. Think of JSON as a data description language, just as HTML is a page description language.
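As a concrete example, the record this program stores for one scraped page might look like the following JSON (the title and body field names are the ones the code uses; the values are placeholders):

```json
{
  "title": "Example Domain",
  "body": "This domain is for use in illustrative examples in documents."
}
```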
In this updated program, I'll use the org.json library to create JSON objects and the java.nio.file package to write data to the file system.
First, make sure to include the org.json library in your classpath.
Here is the enhanced version of the web scraper that stores the retrieved data as JSON:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.json.JSONObject;

public class WebScraper {
    public static void main(String[] args) throws IOException {
        String url = "http://example.com/index.html";
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        int i = 0;

        for (var link : links) {
            String linkHref = link.attr("abs:href");
            Document linkedDoc = Jsoup.connect(linkHref).get();

            // We'll store the title and body of the linked page.
            String title = linkedDoc.title();
            String body = linkedDoc.body().text();

            // Create a JSON object.
            JSONObject jsonObject = new JSONObject();
            jsonObject.put("title", title);
            jsonObject.put("body", body);

            // Write the JSON object to a file.
            String jsonString = jsonObject.toString();
            Files.write(Paths.get("page" + i++ + ".json"), jsonString.getBytes());

            System.out.println("Data from " + linkHref + " stored to JSON file.");
        }
    }
}
```
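The Files.write step can be tried on its own, without Jsoup or org.json, to see how the java.nio.file API persists a string to disk. This is a self-contained sketch: the hand-built JSON string stands in for JSONObject.toString(), and the file name is arbitrary.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JsonFileDemo {
    public static void main(String[] args) throws IOException {
        // A hand-built JSON string standing in for JSONObject.toString()
        String jsonString = "{\"title\": \"Example\", \"body\": \"Some page text\"}";

        // Write the bytes to page0.json in the current working directory
        Path path = Paths.get("page0.json");
        Files.write(path, jsonString.getBytes());

        // Read the file back to confirm the round trip
        String readBack = new String(Files.readAllBytes(path));
        System.out.println(readBack.equals(jsonString)); // prints true
    }
}
```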


To fix the "package org.json does not exist" error, you need to add the JSON jar file to the classpath while compiling your WebScraper.java file. The steps are similar to what you did for jsoup-1.16.1.jar.
Assuming you have the json.jar file, follow these steps:
Place the json.jar file in the same directory as your Java source file (WebScraper.java) or in a separate directory.
Compile your WebScraper.java file with the classpath including both the jsoup-1.16.1.jar and json.jar files:

C:\s23JavaSoupProject>javac -cp "jsoup-1.16.1.jar;json.jar" WebScraper.java

If you have placed the json.jar file in a different directory, provide the full path to the jar file in the classpath:

C:\s23JavaSoupProject>javac -cp "jsoup-1.16.1.jar;C:\path\to\json.jar" WebScraper.java

After successfully compiling the WebScraper class with both jars in the classpath, you can run your program using:

C:\s23JavaSoupProject>java -cp "jsoup-1.16.1.jar;json.jar;." WebScraper

Make sure to replace WebScraper with the actual name of your Java class if it differs from the provided name. With these steps, your program should be able to find the org.json package and resolve the JSONObject class.

In this program, we visit each page linked from the original URL and save the title and body of each page as a JSON object in a separate file.
Please make sure to handle exceptions appropriately in a production environment. Also note that the efficiency and robustness of this approach might not scale well for large amounts of data or very frequent scrapings.
It's important to respect the website's robots.txt file and to not scrape data too frequently in order to avoid being blocked. Finally, make sure you have the rights to scrape and store the data you're interested in. Some websites explicitly forbid web scraping in their terms of service.
How to get the org.json jar
The org.json library is a popular JSON processing library in Java. Here are the steps to download and use the org.json JAR file in your project:
Download the JAR file: You can download the org.json JAR file (it is published, for example, in the Maven Central repository). Choose the version you want and download the JAR file.
Add the JAR file to your project: How you add the JAR file to your project depends on the IDE you're using.
For Eclipse: Right-click on your project > Build Path > Configure Build Path > Libraries > Add External JARs > Select the org.json JAR file you downloaded > Apply and Close.
For IntelliJ IDEA: Right-click on your project > Open Module Settings > Libraries > '+' > Java > Select the org.json JAR file you downloaded > OK.
Now you can use the library in your code by importing it like this:
import org.json.JSONObject; import org.json.JSONArray;
Please note that using a dependency management tool like Maven or Gradle would simplify the process of adding dependencies, as they handle downloading and adding the JAR files to your project automatically. You just need to add the dependencies to your pom.xml or build.gradle file, respectively, and the tool will do the rest.
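For example, with Maven the two libraries used in this document could be declared in pom.xml roughly like this (the jsoup version matches the jar used above; the org.json version shown is one published release, so check for the current one):

```xml
<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
  </dependency>
  <dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20230618</version>
  </dependency>
</dependencies>
```

With this in place, Maven downloads both jars and puts them on the compile and runtime classpath automatically, so the manual -cp flags are no longer needed.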

To extend the Java program to store the retrieved URLs into a Java ArrayList, you can create an ArrayList of Strings to hold the links and then add each link to the ArrayList as you iterate through the links on the page. Here's the modified version of your program:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

public class Main {

    public static void main(String[] args) {

        Document document;
        try {
            // Get the HTML document
            document = Jsoup.connect("http://example.com").get();

            // Get the title of the webpage
            String title = document.title();
            System.out.println("Title: " + title);

            // Get all links on the page
            Elements links = document.select("a[href]");

            // Create an ArrayList to store the links
            ArrayList<String> linkList = new ArrayList<>();

            // Iterate through the links and add them to the ArrayList
            for (Element link : links) {
                String url = link.attr("href");
                System.out.println("Link: " + url);
                linkList.add(url);
            }

            // Print the ArrayList of links
            System.out.println("Retrieved URLs:");
            for (String link : linkList) {
                System.out.println(link);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In this modified version, we added an ArrayList called linkList to store the URLs. As we iterate through the links on the page, we extract the URLs and add them to the linkList. After processing all the links, we print the linkList to display the retrieved URLs.
Now, the program will not only display the links but also store them in the ArrayList linkList. You can use this ArrayList later in your program for further processing or analysis of the retrieved URLs.
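Since the same URL often appears more than once on a page, one simple piece of such further processing is de-duplicating linkList while keeping the original order. A LinkedHashSet does exactly this; the sketch below uses a hard-coded stand-in for the list the crawler would have populated.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class LinkListDemo {
    public static void main(String[] args) {
        // Stand-in for the linkList populated by the crawler
        List<String> linkList = new ArrayList<>(List.of(
            "/about", "/contact", "/about", "/index.html"));

        // LinkedHashSet drops duplicates but preserves insertion order
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(linkList));
        System.out.println(unique); // prints [/about, /contact, /index.html]
    }
}
```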

Next, make another class which takes an input URL (retrieved from the ArrayList) and repeats the page content acquisition process.


To create another class that takes an input URL retrieved from the ArrayList and repeats the page content acquisition process, we'll need to design a separate class responsible for fetching the content for a given URL. Let's call this class WebPageFetcher. Here's how you can implement it:
WebPageFetcher.java:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class WebPageFetcher {

    public static void fetchAndPrintPageContent(String url) {
        Document document;
        try {
            // Get the HTML document for the provided URL
            document = Jsoup.connect(url).get();

            // Get the title of the webpage
            String title = document.title();
            System.out.println("Title: " + title);

            // Get all links on the page
            Elements links = document.select("a[href]");

            // Print the links on the page
            System.out.println("Links on the page:");
            for (Element link : links) {
                System.out.println(link.attr("href"));
            }

            // Print other content from the page, if required
            // For example, you can extract text or other elements here.

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Now, in your main class, you can use the WebPageFetcher class to fetch the contents for each URL stored in the linkList ArrayList:
Main.java:
import java.util.ArrayList;

public class Main {

    public static void main(String[] args) {
        // ... (your code to fetch and store URLs into linkList)

        // Sample ArrayList of URLs (replace this with your actual ArrayList)
        ArrayList<String> linkList = new ArrayList<>();
        linkList.add("http://example.com");
        linkList.add("http://example2.com");

        // Repeat the page content acquisition process for each URL in the ArrayList
        for (String url : linkList) {
            System.out.println("Fetching content for URL: " + url);
            WebPageFetcher.fetchAndPrintPageContent(url);
        }
    }
}

In the Main class, we have a sample ArrayList called linkList containing URLs. You should replace this sample data with your actual linkList that you obtained earlier. We then use a for loop to iterate through each URL in the linkList and call the WebPageFetcher.fetchAndPrintPageContent(url) method to fetch and print the content for each URL.
By doing this, the Main class repeats the page content acquisition process for each URL in the linkList, and the WebPageFetcher class is responsible for fetching and printing the content for a given URL. You can customize the WebPageFetcher class further if you need to extract more information from the fetched page content.


Version 2.0 of the Program:

A version of your program that adds a LinkStore class to store the retrieved web links, and a Link class to encapsulate individual link data.
It also provides methods to output the stored links:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WebCrawler {

    public static void main(String[] args) {

        Document document;
        try {
            // Get the HTML document
            document = Jsoup.connect("http://cnn.com").get();

            // Get the title of the webpage
            String title = document.title();
            System.out.println("Title: " + title);

            // Get all links on the page
            Elements links = document.select("a[href]");

            // Store each link (the Link/LinkStore API used here is assumed:
            // a Link holds a URL plus anchor text; LinkStore collects them)
            LinkStore store = new LinkStore();
            for (Element link : links) {
                store.add(new Link(link.attr("abs:href"), link.text()));
            }

            // Output the stored links
            store.printLinks();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
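The Link and LinkStore classes themselves are not shown in this excerpt. A minimal sketch of what they might look like follows; the class, field, and method names here are assumptions for illustration, not the course's definitions, so adjust them to match your own design.

```java
import java.util.ArrayList;
import java.util.List;

// Encapsulates the data for one retrieved link
class Link {
    private final String url;
    private final String text;

    Link(String url, String text) {
        this.url = url;
        this.text = text;
    }

    public String getUrl() { return url; }
    public String getText() { return text; }

    @Override
    public String toString() {
        return text + " -> " + url;
    }
}

// Stores retrieved links and provides methods to output them
class LinkStore {
    private final List<Link> links = new ArrayList<>();

    public void add(Link link) { links.add(link); }

    public List<Link> getLinks() { return links; }

    public void printLinks() {
        for (Link link : links) {
            System.out.println(link);
        }
    }
}
```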