Want to pull text out of a heap of PDFs without crying or copying and pasting until your hands fall off? This UiPath tutorial walks you through a reliable RPA workflow for PDF data extraction with a dash of sarcasm and zero judgement. We cover searchable files, scanned pages with OCR, regex field grabs and saving the results to a spreadsheet for later analysis.
What this workflow does
Short version, long benefits. The workflow loops through a folder of PDF files using For Each and extracts text with Read PDF Text for normal PDFs or Read PDF With OCR for images. Then it parses fields with Regex or string methods and writes one row per file to CSV or Excel. Wrap everything in Try Catch and log the messy bits so the bot can fail gracefully.
Step 1 Create project and install packages
Open UiPath Studio and create a new process. Add the UiPath.PDF.Activities package, which provides both Read PDF Text and Read PDF With OCR. If you plan to handle scanned documents, also add the UiPath.OCR.Activities package for the OCR engine activities. Name the workflow something sensible so future you does not go treasure hunting in a spaghetti pile of workflows.
Step 2 Gather PDF files
Use an Assign activity to collect file paths into an array. For example use Directory.GetFiles(folderPath, "*.pdf") and store the result in a variable like pdfFiles. This keeps the For Each clean and avoids accidentally pulling in your resume drafts or random screenshots.
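In UiPath the Assign expression above is all you need. To show what the filter is meant to do, here is a minimal sketch of the same step in Python, used purely for illustration. The function name collect_pdf_files is hypothetical. It checks the extension exactly, which is worth knowing because on .NET Framework a three-character pattern such as "*.pdf" can also return files whose extension merely starts with those characters.

```python
from pathlib import Path


def collect_pdf_files(folder_path):
    """Return the sorted paths of files directly inside folder_path
    whose extension is exactly .pdf (case-insensitive)."""
    folder = Path(folder_path)
    return sorted(
        str(path)
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Sorting is optional, but a stable order makes logs from two runs easier to compare.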
Step 3 Loop through each file
Use a For Each activity with TypeArgument set to String. Each iteration gives you one file path such as currentFile. Pass that file path to your extractor activity and to your logger so you always know which file caused trouble.
Step 4 Choose the right extractor
Use Read PDF Text for searchable documents. For scanned pages use Read PDF With OCR and pick an OCR engine and a scale that makes sense for your documents. Common engines you can use include Tesseract OCR, Microsoft OCR and Google Cloud Vision OCR. Store the output in a variable like extractedText for processing.
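If you do not know in advance which files are scanned, one common approach is to try plain text extraction first and fall back to OCR when almost nothing comes back. The sketch below shows that check in Python for illustration only. The function name needs_ocr and the 20-character threshold are assumptions, not UiPath settings, so tune the threshold against your own documents.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: report True when plain text extraction returned fewer than
    min_chars non-whitespace characters, which suggests a scanned PDF."""
    # "".join(text.split()) removes every whitespace character.
    visible = "".join((extracted_text or "").split())
    return len(visible) < min_chars
```

In a workflow this would map to an If activity on the length of the trimmed extractedText, with Read PDF With OCR in the branch for short results.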
OCR tips
- Increase scale for small fonts to improve accuracy
- Try different OCR engines if the first one smells of failure
- Trim and normalize whitespace before running Regex to avoid brittle matches
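The last tip is the cheapest one to apply. Here is a minimal sketch of whitespace cleanup, written in Python for illustration. The function name normalize_text is hypothetical, and the exact rules (collapse runs of spaces, drop empty lines) are assumptions you may want to relax if your patterns depend on blank lines.

```python
import re


def normalize_text(raw_text):
    """Collapse runs of spaces and tabs, unify line endings,
    trim every line and drop the empty ones."""
    text = raw_text.replace("\u00a0", " ")  # non-breaking space to plain space
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t\f\v]+", " ", text)  # collapse horizontal whitespace
    lines = (line.strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)
```

After this step a pattern can expect single spaces between words instead of guessing how many the OCR engine felt like producing.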
Step 5 Extract fields with Regex or string methods
Use Regex to pull structured values like invoice numbers, dates or amounts. Keep patterns focused and test them before trusting them. UiPath evaluates patterns with the .NET regex engine, so if you use an online tester, pick the .NET flavor. Alternatively use split and index operations for predictable formats. Example targets include invoice number, invoice date and total amount.
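To make the idea concrete, here is a sketch of focused patterns for the three example targets, in Python for illustration. The label wording ("Invoice No", "Invoice Date", "Total Amount"), the ISO date format and the names FIELD_PATTERNS and extract_fields are all assumptions about a hypothetical invoice layout, so adapt them to your documents and retest in the .NET flavor before using them in UiPath.

```python
import re

# One small pattern per field; each captures the value in group 1.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9][A-Z0-9-]*)",
        re.IGNORECASE,
    ),
    "invoice_date": re.compile(
        r"Invoice\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
        re.IGNORECASE,
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total_amount": re.compile(
        r"\bTotal(?:\s*Amount)?\s*:?\s*[$€£]?\s*(\d[\d,]*\.\d{2})",
        re.IGNORECASE,
    ),
}


def extract_fields(text):
    """Return a dict with one entry per field, or None when a field is missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields
```

Returning None for a missing field, rather than raising, lets a later step mark the file for review instead of stopping the run.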
Step 6 Save results
Accumulate results using Append Range when writing to Excel, or use Append To CSV for a simpler flow. A single row per file is usually best for audits and filtering. Include the file name, the extracted fields and a brief status column to show whether extraction succeeded or needs review.
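The row layout matters more than the tool that writes it. The sketch below, in Python for illustration, builds one row per file with a status column and appends it to a CSV. The column names, the status values OK and NEEDS_REVIEW, and the helpers build_row and append_row are hypothetical choices, not something UiPath prescribes.

```python
import csv
import os

FIELD_COLUMNS = ["invoice_number", "invoice_date", "total_amount"]
COLUMNS = ["file_name"] + FIELD_COLUMNS + ["status"]


def build_row(file_path, fields):
    """Build one output row; the status is OK only when every field was found."""
    row = {"file_name": os.path.basename(file_path)}
    for column in FIELD_COLUMNS:
        row[column] = fields.get(column) or ""
    complete = all(row[column] for column in FIELD_COLUMNS)
    row["status"] = "OK" if complete else "NEEDS_REVIEW"
    return row


def append_row(csv_path, row):
    """Append a row to csv_path, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Filtering the finished file on the status column then gives you the review queue without opening a single PDF.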
Step 7 Error handling and logging
Wrap extraction and parsing in Try Catch. In the Catch block use Log Message to record the file path and exception message so troubleshooting is less mystical. Optionally write failed file paths to a separate log CSV for manual review.
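The point of the Try Catch is that one bad file should not take the whole batch down with it. Here is a minimal sketch of that per-file pattern in Python, for illustration. The function process_files and its extract_text and parse_fields parameters are hypothetical stand-ins for your extractor and parser steps.

```python
import logging

logger = logging.getLogger("pdf_extraction")


def process_files(pdf_files, extract_text, parse_fields):
    """Process every file, logging failures instead of stopping.

    Returns (results, failed): results holds (file_path, fields) pairs,
    failed holds (file_path, error_message) pairs for manual review.
    """
    results, failed = [], []
    for file_path in pdf_files:
        try:
            text = extract_text(file_path)
            results.append((file_path, parse_fields(text)))
        except Exception as exc:  # broad on purpose: any failure is logged, then skipped
            logger.error("Extraction failed for %s: %s", file_path, exc)
            failed.append((file_path, str(exc)))
    return results, failed
```

The failed list is what you would write to the separate log CSV mentioned above.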
Quick checklist for production
- Install UiPath.PDF.Activities and UiPath.OCR.Activities
- Collect files with Directory.GetFiles and filter by pdf extension
- Use For Each with TypeArgument String
- Choose Read PDF Text or Read PDF With OCR based on file type
- Parse with Regex and sanitize the text first
- Append results to Excel or CSV, one row per file
- Wrap in Try Catch and log failures
There you go. A practical UiPath automation for PDF data extraction that survives real-world files and occasional chaos. If something fails, do not panic. Check the logs, tweak the OCR settings or regex patterns and let the bot do the repetitive work while you enjoy slightly fewer headaches.