Extract Images from Documents and Filter Resources with GroupDocs.Parser Java

Extracting images from documents is a common requirement when building document‑processing pipelines. In this tutorial you’ll learn how to extract images from documents with GroupDocs.Parser for Java and how to filter resources so that only the files you need are loaded. We’ll walk through setting up the library, creating a custom ExternalResourceHandler, and applying filtering logic to keep your application fast and secure.

Quick Answers

  • What does GroupDocs.Parser do? It parses a wide range of document formats and gives you access to text, images, and other embedded resources.
  • Can I skip unwanted images? Yes—by implementing a custom ExternalResourceHandler you can decide which resources to load.
  • Which library version is required? Use GroupDocs.Parser for Java 25.5 or newer, added as a Maven dependency.
  • Do I need a license? A free trial works for evaluation; a permanent license is required for production.
  • Is this approach thread‑safe? Parser instances should not be shared across threads; create a new Parser instance per thread.

What is “extract images from documents”?

When a document contains embedded pictures, charts, or other media, “extract images from documents” means programmatically retrieving those binary files so you can store, display, or further process them outside the original file.

Why filter resources while extracting images?

Filtering resources helps you:

  • Reduce memory consumption by ignoring large or irrelevant files.
  • Improve security by preventing the loading of potentially unsafe content.
  • Speed up processing, especially with huge documents that contain many embedded objects.

Prerequisites

  • Java Development Kit (JDK) – version 8 or higher.
  • Maven – for dependency management.
  • Basic familiarity with Java I/O and exception handling.

Setting Up GroupDocs.Parser for Java

Add the GroupDocs repository and the parser dependency to your pom.xml:

<repositories>
   <repository>
      <id>repository.groupdocs.com</id>
      <name>GroupDocs Repository</name>
      <url>https://releases.groupdocs.com/parser/java/</url>
   </repository>
</repositories>

<dependencies>
   <dependency>
      <groupId>com.groupdocs</groupId>
      <artifactId>groupdocs-parser</artifactId>
      <version>25.5</version>
   </dependency>
</dependencies>

Alternatively, download the latest version from the GroupDocs.Parser for Java releases page.

License Acquisition

  • Free Trial – explore core features without cost.
  • Temporary License – unlock full functionality during evaluation.
  • Purchased License – required for commercial deployment.

How to filter resources while extracting images

Step 1: Create a custom handler

Define a class that extends ExternalResourceHandler. Inside the onLoading method you decide which resources to keep.

import com.groupdocs.parser.options.ExternalResourceHandler;
import com.groupdocs.parser.data.ExternalResourceLoadingArgs;

class Handler extends ExternalResourceHandler {
    @Override
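    public void onLoading(ExternalResourceLoadingArgs args) {
        // Skip every resource except the one whose URI ends with "installation.png"
        if (!args.getUri().endsWith("installation.png")) {
            args.setSkipped(true);
        }
        // Let the base class finish its own handling
        super.onLoading(args);
    }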
}
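
One caveat with a plain endsWith check: it also accepts names such as not-installation.png, and it is case-sensitive. As a rough sketch, not part of the GroupDocs API, the hypothetical helper below (UriNames is an illustrative name) extracts the file name from a URI with standard Java classes so the handler can compare it exactly. It assumes resource URIs are ordinary URLs, relative paths, or file paths.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration; not part of GroupDocs.Parser.
final class UriNames {
    private UriNames() {}

    /** Returns the lower-cased last path segment of a URI, without query or fragment. */
    static String fileName(String uri) {
        if (uri == null) {
            return "";
        }
        String path = null;
        try {
            // getPath() returns null for opaque URIs such as "data:..."
            path = new URI(uri).getPath();
        } catch (URISyntaxException e) {
            // Not a valid URI (for example a Windows path); parse it manually below
        }
        if (path == null) {
            path = uri;
            int query = path.indexOf('?');
            if (query >= 0) path = path.substring(0, query);
            int fragment = path.indexOf('#');
            if (fragment >= 0) path = path.substring(0, fragment);
        }
        // Accept both forward and backward slashes as separators
        int separator = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(separator + 1).toLowerCase(Locale.ROOT);
    }

    /** True when the URI's file name equals expectedName, ignoring case. */
    static boolean isNamed(String uri, String expectedName) {
        return fileName(uri).equals(expectedName.toLowerCase(Locale.ROOT));
    }
}
```

Inside onLoading, the condition could then read: if (!UriNames.isNamed(args.getUri(), "installation.png")) args.setSkipped(true);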

Step 2: Configure ParserSettings with the handler

Pass your Handler instance to ParserSettings and use it when opening a document.

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.exceptions.IOException;
import com.groupdocs.parser.options.ParserSettings;

public class LoadExternalResources {
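    public static void run() throws IOException {
        // Register the custom handler so it is consulted for every external resource
        ParserSettings settings = new ParserSettings(new Handler());

        // Replace the placeholder with the path to your document file
        try (Parser parser = new Parser("YOUR_DOCUMENT_DIRECTORY", settings)) {
            Iterable<PageImageArea> images = parser.getImages();

            // getImages() returns null when image extraction isn't supported for the format
            if (images == null) {
                System.out.println("Image extraction isn't supported");
                return;
            }

            // Print the file type of each image that passed the filter
            for (PageImageArea image : images) {
                System.out.println(image.getFileType());
            }
        }
    }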
}

Step 3: Fine‑tune the filtering logic

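If you need more sophisticated rules, such as filtering by format or URI pattern, change the condition inside onLoading. The method below repeats the basic file‑name check from Step 1 as a starting point; replace the condition with your own rule:

@Override
public void onLoading(ExternalResourceLoadingArgs args) {
    // Skip every resource except the one whose URI ends with "installation.png"
    if (!args.getUri().endsWith("installation.png")) {
        args.setSkipped(true);
    }
}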
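
To illustrate format and URI‑pattern rules, here is a minimal sketch in plain Java. ResourceFilter, its extension allow‑list, and the blocked pattern are all hypothetical names introduced for this example; none of them come from the GroupDocs API. The sketch assumes the URI string is the only information the rule needs.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical rule object for illustration; not part of GroupDocs.Parser.
final class ResourceFilter {
    private final Set<String> allowedExtensions; // lower-case, without the dot
    private final Pattern blockedPattern;        // lower-case regex, or null for none

    ResourceFilter(Set<String> allowedExtensions, Pattern blockedPattern) {
        this.allowedExtensions = allowedExtensions;
        this.blockedPattern = blockedPattern;
    }

    /** Returns true when the resource behind this URI should not be loaded. */
    boolean shouldSkip(String uri) {
        if (uri == null || uri.isEmpty()) {
            return true;
        }
        String lower = uri.toLowerCase(Locale.ROOT);

        // Rule 1: reject anything matching the blocked pattern (checked on the full URI)
        if (blockedPattern != null && blockedPattern.matcher(lower).find()) {
            return true;
        }

        // Rule 2: keep only allow-listed extensions, ignoring query string and fragment
        int end = lower.length();
        int query = lower.indexOf('?');
        if (query >= 0) end = Math.min(end, query);
        int fragment = lower.indexOf('#');
        if (fragment >= 0) end = Math.min(end, fragment);
        String path = lower.substring(0, end);
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String extension = dot >= 0 ? name.substring(dot + 1) : "";
        return !allowedExtensions.contains(extension);
    }
}
```

A handler could hold one ResourceFilter and call it from onLoading: if (filter.shouldSkip(args.getUri())) args.setSkipped(true); Keeping the rule in its own class also makes it easy to unit‑test without opening a document.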

Practical Applications

  1. Document Management Systems – Pull only the necessary images from scanned contracts to generate thumbnails.
  2. Data Extraction Services – Skip decorative graphics and focus on charts that contain valuable data.
  3. Web Scraping Tools – Filter out tracking pixels while retrieving meaningful media from HTML‑based documents.

Performance Considerations

  • Filter early: Apply your custom handler before iterating over resources to avoid loading unwanted data into memory.
  • Dispose promptly: Use try‑with‑resources (try (Parser parser = …)) to free native resources.
  • Parallel processing: For large batches, process documents in parallel while keeping each Parser instance confined to a single thread.
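
The one‑Parser‑per‑thread advice can be sketched with a plain ExecutorService. BatchProcessor and the worker function are hypothetical names; in a real pipeline the worker would open its own Parser in a try‑with‑resources block, as in Step 2, so that no instance crosses a thread boundary. This is one possible arrangement under those assumptions, not the only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical batch runner for illustration; not part of GroupDocs.Parser.
final class BatchProcessor {
    private BatchProcessor() {}

    /**
     * Runs worker once per path on a fixed-size pool and returns the results
     * in input order. Each call to worker runs entirely on one pool thread.
     */
    static <R> List<R> processAll(List<String> paths, Function<String, R> worker, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String path : paths) {
                // The worker creates and closes its own resources inside the task
                Callable<R> task = () -> worker.apply(path);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for a document", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("Document task failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting tasks; already submitted ones have finished or failed
        }
    }
}
```

A bounded pool keeps memory use predictable: at most threads documents are open at any moment, which matters when individual files are large.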

Common Issues & Solutions

Issue | Why it happens | Fix
No images returned | The handler inadvertently skips all resources | Verify the if condition and ensure args.setSkipped(true) is called only for unwanted URIs.
OutOfMemoryError on large files | Insufficient heap memory | Increase the JVM heap (for example, -Xmx2g) or process pages in smaller chunks.
License not recognized | The license file was not applied or its path is wrong, so the library runs in trial mode | Apply the license file with new License().setLicense("path/to/license").

Frequently Asked Questions

Q: What is the primary purpose of using a custom ExternalResourceHandler?
A: It lets you control which external resources are loaded, enhancing security and performance by filtering out unnecessary files.

Q: Can I use GroupDocs.Parser for Java without a license?
A: Yes, a free trial is available, but some advanced features may be limited until you obtain a temporary or purchased license.

Q: How do I handle exceptions during parsing with GroupDocs.Parser?
A: Wrap parsing calls in try‑catch blocks for IOException and other specific exceptions to gracefully handle errors.

Q: What are common pitfalls when filtering resources?
A: Incorrect URI checks can skip needed files; use logging or breakpoints to verify your conditions.

Q: Is it possible to parse non‑HTML documents using GroupDocs.Parser for Java?
A: Absolutely—GroupDocs.Parser supports PDFs, Word, Excel, PowerPoint, and many other formats.

Next Steps

Dive deeper into the library by exploring the API Reference or experimenting with other extraction features, such as table extraction.


Last Updated: 2025-12-29
Tested With: GroupDocs.Parser 25.5 for Java
Author: GroupDocs

Resources