Automated Document Archive Processing Solution

paypal56_ab6mk6y7
Site Admin
Posts: 47
Joined: Sat Oct 26, 2024 3:05 pm

Automated Document Archive Processing Solution

Post by paypal56_ab6mk6y7 »

Development of an Automated Archive Processing Program for Documentation

#### Description
The goal is to create a program that processes archives containing documentation, automatically extracts text data from documents, matches it with reference data from directories, generates a description, saves the result in a specific format (Word template, .doc), renames the extracted files, and adds a document number to each file.

#### Functional Requirements

1. **Archive Processing**:
- The program should support working with archives in the following formats: `.zip`, `.rar`, `.7z`.
- It should extract the contents of the archive to a specified directory and verify successful extraction.

2. **Text Extraction from Documents**:
- Supported file formats include: `.pdf`, `.docx`, `.xlsx`, `.txt`.
- Extract text from PDF documents, including OCR-based text recognition from images if necessary.
- Extract data from MS Word and Excel documents, with the ability to process tabular data.

3. **Data Matching**:
- Match extracted data with data from a directory of document notations.
- Allow user-uploaded directories (e.g., notation directory) for matching purposes.

4. **Description Generation**:
- Generate a description according to a specified structure.
- The description data should be automatically populated based on extracted and matched data, including document name, notation, sheet count, and format.

5. **File Renaming**:
- Rename extracted files according to a specific algorithm.

6. **Adding Number to File**:
- Each file should have a specific number added in the corner. Numbers are loaded separately from an Excel file.
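
The matching step in requirement 3 could be sketched as follows. This is only a minimal sketch under stated assumptions: the notation directory is taken to be an in-memory dictionary mapping a notation code to a document name, and the notation shape (four digits, a hyphen, three capital letters) is borrowed from the sample value `1234-ABC` that appears in the code posted in this thread. The real directory format is not specified here, and `NotationMatcher` and `FindNotations` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: finds notation codes in extracted text and looks them up
// in a user-supplied directory (notation code -> document name).
public static class NotationMatcher
{
    // Assumed notation shape: four digits, a hyphen, three capital letters (e.g. 1234-ABC).
    private static readonly Regex NotationPattern =
        new Regex(@"\b\d{4}-[A-Z]{3}\b", RegexOptions.Compiled);

    public static List<KeyValuePair<string, string>> FindNotations(
        string extractedText, IReadOnlyDictionary<string, string> directory)
    {
        var matches = new List<KeyValuePair<string, string>>();
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (Match m in NotationPattern.Matches(extractedText ?? string.Empty))
        {
            // Keep each code once, and only if the directory contains it.
            if (seen.Add(m.Value) && directory.TryGetValue(m.Value, out string name))
            {
                matches.Add(new KeyValuePair<string, string>(m.Value, name));
            }
        }
        return matches;
    }
}
```

Codes that look like notations but are missing from the directory are silently skipped in this sketch; a real tool would probably report them to the user instead.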

#### Expected Outcome
The program should provide automated processing of documentation archives, including extraction, text data processing, matching with reference directories, creating a structured description, saving results in a Word template, renaming files, and adding a number to each document.
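
The renaming algorithm itself is not specified in the requirements. As one possible illustration, the sketch below builds a rename plan without touching the file system, so the plan can be reviewed or checked for collisions before any file is moved. The naming scheme (prefix, underscore, zero-padded counter, lower-cased extension) and the names `RenamePlanner` and `Plan` are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: computes (oldPath, newPath) pairs for a set of files.
public static class RenamePlanner
{
    public static List<KeyValuePair<string, string>> Plan(IEnumerable<string> filePaths, string prefix)
    {
        // Sort first so the numbering does not depend on file-system enumeration order.
        List<string> ordered = filePaths.OrderBy(p => p, StringComparer.OrdinalIgnoreCase).ToList();

        // Pad the counter to at least three digits so names sort correctly.
        int width = Math.Max(3, ordered.Count.ToString().Length);

        var plan = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < ordered.Count; i++)
        {
            string oldPath = ordered[i];
            string directory = Path.GetDirectoryName(oldPath) ?? string.Empty;
            string newName = prefix + "_" + (i + 1).ToString().PadLeft(width, '0')
                             + Path.GetExtension(oldPath).ToLowerInvariant();
            plan.Add(new KeyValuePair<string, string>(oldPath, Path.Combine(directory, newName)));
        }
        return plan;
    }
}
```

Applying the plan would then be a loop over the pairs calling `File.Move`, after confirming that no target path already exists.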

Re: Automated Document Archive Processing Solution

Post by paypal56_ab6mk6y7 »

### Code of the Program

```csharp

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Windows.Forms;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO; // PdfReader and PdfDocumentOpenMode are defined here
using PdfSharp.Drawing;
using Spire.Doc;
using Spire.Doc.Documents;
using Spire.Doc.Fields;

namespace DocumentProcessingApp
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent();
        }

        private void ProcessButton_Click(object sender, EventArgs e)
        {
            string archivePath = SelectArchiveFile();
            if (string.IsNullOrEmpty(archivePath)) return;

            string extractPath = @"C:\ExtractedFiles";
            Directory.CreateDirectory(extractPath);

            if (!ExtractArchive(archivePath, extractPath))
            {
                MessageBox.Show("Error unpacking the archive!");
                return;
            }

            RenameFiles(extractPath, "Document");

            int documentNumber = 1; // Document number to be added

            foreach (string filePath in Directory.GetFiles(extractPath))
            {
                string textData = ExtractText(filePath);

                if (!string.IsNullOrEmpty(textData))
                {
                    // Matching with reference data
                    Dictionary<string, string> data = MatchWithDirectory(textData);

                    // Description generation
                    GenerateInventory(@"C:\Template.docx", Path.Combine(extractPath, $"Inventory_{documentNumber}.docx"), data);

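                    // Adding number to the document (the check ignores case, so ".PDF" also matches)
                    if (filePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                    {
                        AddNumberToPDF(filePath, documentNumber);
                    }
                    else if (filePath.EndsWith(".docx", StringComparison.OrdinalIgnoreCase))
                    {
                        AddNumberToWord(filePath, documentNumber);
                    }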

                    documentNumber++;
                }
            }

            MessageBox.Show("Processing completed!");
        }

        private string SelectArchiveFile()
        {
            OpenFileDialog openFileDialog = new OpenFileDialog
            {
                Filter = "Archive Files (*.zip;*.rar;*.7z)|*.zip;*.rar;*.7z",
                Title = "Select an archive with documents"
            };
            return openFileDialog.ShowDialog() == DialogResult.OK ? openFileDialog.FileName : string.Empty;
        }

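        private bool ExtractArchive(string archivePath, string extractPath)
        {
            // ZipFile (System.IO.Compression) reads only .zip archives. The .rar and .7z
            // formats offered in the file dialog need a third-party library.
            if (!string.Equals(Path.GetExtension(archivePath), ".zip", StringComparison.OrdinalIgnoreCase))
            {
                MessageBox.Show("Only .zip archives can be unpacked by this version.");
                return false;
            }

            try
            {
                ZipFile.ExtractToDirectory(archivePath, extractPath);
                return true;
            }
            catch (Exception ex)
            {
                MessageBox.Show($"Error unpacking the archive: {ex.Message}");
                return false;
            }
        }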

        private void RenameFiles(string directoryPath, string prefix)
        {
            int counter = 1;
            foreach (var filePath in Directory.GetFiles(directoryPath))
            {
                string extension = Path.GetExtension(filePath);
                string newFileName = $"{prefix}_{counter}{extension}";
                string newFilePath = Path.Combine(directoryPath, newFileName);

                File.Move(filePath, newFilePath);
                counter++;
            }
        }

        private string ExtractText(string filePath)
        {
            // Simplified text extraction function. Expand to support OCR and other formats.
            // The extension is lower-cased so that ".PDF" and ".pdf" are treated alike.
            switch (Path.GetExtension(filePath).ToLowerInvariant())
            {
                case ".txt":
                    return File.ReadAllText(filePath);
                case ".pdf":
                    return ExtractTextFromPDF(filePath);
                case ".docx":
                    return ExtractTextFromWord(filePath);
                case ".xlsx":
                    return ExtractTextFromExcel(filePath);
                default:
                    return string.Empty;
            }
        }

        private string ExtractTextFromPDF(string filePath)
        {
            // Implement function for PDF (using iTextSharp or other libraries)
            return "Text from PDF";
        }

        private string ExtractTextFromWord(string filePath)
        {
            Document doc = new Document();
            doc.LoadFromFile(filePath);
            return doc.GetText();
        }

        private string ExtractTextFromExcel(string filePath)
        {
            // Implement function for Excel
            return "Text from Excel";
        }

        private Dictionary<string, string> MatchWithDirectory(string extractedText)
        {
            // Load reference data from a file, database, or other source
            // Implementation of text comparison with reference data
            Dictionary<string, string> data = new Dictionary<string, string>
            {
                {"DocumentName", "Sample Document"},
                {"DocumentCode", "1234-ABC"},
                {"SheetCount", "10"},
                {"Format", "A4"}
            };
            return data;
        }

        private void GenerateInventory(string templatePath, string outputPath, Dictionary<string, string> data)
        {
            Document doc = new Document();
            doc.LoadFromFile(templatePath);

            foreach (var item in data)
            {
                doc.Replace($"{{{item.Key}}}", item.Value, true, true);
            }

            doc.SaveToFile(outputPath, FileFormat.Docx);
        }

        private void AddNumberToPDF(string filePath, int number)
        {
            using (PdfDocument document = PdfReader.Open(filePath, PdfDocumentOpenMode.Modify))
            {
                // One font instance is enough for all pages.
                // XFontStyle is the PDFsharp 1.5x name; PDFsharp 6.x renamed it to XFontStyleEx.
                XFont font = new XFont("Arial", 10, XFontStyle.Bold);

                foreach (PdfPage page in document.Pages)
                {
                    // Dispose the graphics object once the page has been drawn on.
                    using (XGraphics gfx = XGraphics.FromPdfPage(page))
                    {
                        // Position selection for the number (bottom-right corner)
                        gfx.DrawString(number.ToString(), font, XBrushes.Black,
                                       new XRect(page.Width - 50, page.Height - 30, 50, 20),
                                       XStringFormats.BottomRight);
                    }
                }
                document.Save(filePath);
            }
        }

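        private void AddNumberToWord(string filePath, int number)
        {
            Document doc = new Document();
            doc.LoadFromFile(filePath);

            foreach (Section section in doc.Sections)
            {
                // Put the number in the section header so it appears once per page,
                // instead of appending it to every paragraph of the body.
                Paragraph paragraph = section.HeadersFooters.Header.AddParagraph();
                TextRange text = paragraph.AppendText($"Number: {number}");
                text.CharacterFormat.FontSize = 10;
                text.CharacterFormat.Bold = true;
                // Fully qualified because System.Windows.Forms also defines HorizontalAlignment.
                paragraph.Format.HorizontalAlignment = Spire.Doc.Documents.HorizontalAlignment.Right;
            }

            doc.SaveToFile(filePath, FileFormat.Docx);
        }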
    }
}
```

### Description of the Complete Code

1. **Main Method `ProcessButton_Click`**: starts the archive processing, calling each function in the required order.
2. **`SelectArchiveFile`**: opens a dialog for selecting the archive.
3. **`ExtractArchive`**: unpacks the archive to the specified directory.
4. **`RenameFiles`**: renames files in the specified directory with a prefix and number.
5. **`ExtractText`**: extracts text from files of various formats (.txt, .pdf, .docx, .xlsx).
6. **`MatchWithDirectory`**: compares extracted data with reference values.
7. **`GenerateInventory`**: fills a Word document template with data and saves it as a description.
8. **`AddNumberToPDF`**: adds a number to each page of the PDF document.
9. **`AddNumberToWord`**: adds a number to each page of the Word document.

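This code is a working skeleton rather than a complete implementation: archive extraction (for `.zip`), renaming, Word text extraction, template filling, and numbering are implemented, while PDF and Excel text extraction and the directory matching still return placeholder values.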
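
`ExtractArchive` relies on `ZipFile`, which reads only `.zip`, while the file dialog also offers `.rar` and `.7z`. Before dispatching to a format-specific extractor, the archive type can be detected from the file's leading bytes rather than trusted from its extension. The following is a minimal sketch of that idea; `ArchiveKind` and `ArchiveSniffer` are hypothetical names, and the actual `.rar` and `.7z` extraction (which needs a third-party library) is deliberately left out.

```csharp
using System;
using System.IO;

public enum ArchiveKind { Unknown, Zip, Rar, SevenZip }

// Hypothetical helper: identifies an archive by its signature bytes.
public static class ArchiveSniffer
{
    private static readonly byte[] ZipSig = { 0x50, 0x4B, 0x03, 0x04 };                  // "PK" 03 04
    private static readonly byte[] RarSig = { 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07 };      // "Rar!" 1A 07
    private static readonly byte[] SevenZipSig = { 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C }; // "7z" BC AF 27 1C

    public static ArchiveKind Detect(byte[] header)
    {
        if (StartsWith(header, ZipSig)) return ArchiveKind.Zip;
        if (StartsWith(header, RarSig)) return ArchiveKind.Rar;
        if (StartsWith(header, SevenZipSig)) return ArchiveKind.SevenZip;
        return ArchiveKind.Unknown;
    }

    public static ArchiveKind DetectFile(string path)
    {
        var buffer = new byte[8];
        using (FileStream stream = File.OpenRead(path))
        {
            // Only the first few bytes are needed; shrink the buffer if the file is shorter.
            int read = stream.Read(buffer, 0, buffer.Length);
            Array.Resize(ref buffer, read);
        }
        return Detect(buffer);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

An empty `.zip` file starts with a different signature (`PK` 05 06), so this sketch reports it as `Unknown`; for a document archive that is arguably the desired result.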

Re: Automated Document Archive Processing Solution

Post by paypal56_ab6mk6y7 »

To implement a function for extracting text from PDF files using iTextSharp in C#, you’ll need to install the iTextSharp library, which provides functionality for reading PDF files. Here is how you can implement this function:

### Step 1: Install iTextSharp

1. Open your project in Visual Studio.
2. Go to **Tools > NuGet Package Manager > Manage NuGet Packages for Solution**.
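3. Search for **iTextSharp** and install it. The code below targets the iTextSharp 5.x API; iText 7 (NuGet package `itext7`) is its successor but uses different namespaces and classes, so this code will not compile against it unchanged.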

### Step 2: Implement the `ExtractTextFromPDF` Function

Here’s the code that uses iTextSharp to extract text from a PDF file:

```csharp

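using System;
using System.Text;           // StringBuilder
using System.Windows.Forms;  // MessageBox
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
// Note: iTextSharp's PdfReader clashes with PdfSharp.Pdf.IO.PdfReader if both
// namespaces are imported in the same file; qualify or alias one of them.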

private string ExtractTextFromPDF(string filePath)
{
    StringBuilder text = new StringBuilder();

    try
    {
        using (PdfReader reader = new PdfReader(filePath))
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Extract text from each page; AppendLine keeps pages from running together
                string pageText = PdfTextExtractor.GetTextFromPage(reader, i);
                text.AppendLine(pageText);
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show($"Error reading PDF file: {ex.Message}");
        return string.Empty;
    }

    return text.ToString();
}
```

### Explanation of the Code

1. **`PdfReader`**: Opens the PDF file for reading.
2. **`PdfTextExtractor.GetTextFromPage`**: Extracts text from each page of the PDF.
3. **Loop through Pages**: A loop goes through each page in the PDF and appends the text to a `StringBuilder` to accumulate text from all pages.
4. **Error Handling**: Any issues with reading the PDF will show an error message and return an empty string.

This function will return the extracted text from the PDF file as a single string, which you can then process as needed. If you need further customization, you can modify this code to, for instance, only extract text from certain pages or specific areas.
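
`PdfTextExtractor` reads only the text layer of a PDF, so scanned pages come back empty or nearly empty. One way to decide which pages to hand to an OCR step is to look at how much usable text each page produced. The sketch below assumes the per-page strings have already been collected (for example, one `GetTextFromPage` result per page); `OcrPlanner`, `PagesNeedingOcr`, and the default threshold of 10 characters are hypothetical choices, and the OCR step itself is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: flags pages whose text layer is too thin to be useful.
public static class OcrPlanner
{
    // Returns 1-based page numbers whose extracted text contains fewer than
    // minChars letters or digits. The threshold is an assumption to tune.
    public static List<int> PagesNeedingOcr(IReadOnlyList<string> pageTexts, int minChars = 10)
    {
        var pages = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            string text = pageTexts[i] ?? string.Empty;
            int usable = text.Count(c => char.IsLetterOrDigit(c));
            if (usable < minChars)
            {
                pages.Add(i + 1);
            }
        }
        return pages;
    }
}
```

Counting letters and digits rather than raw length keeps pages that contain only whitespace or stray punctuation from being mistaken for real text.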

Re: Automated Document Archive Processing Solution

Post by paypal56_ab6mk6y7 »

To extract text or data from Excel files in C#, you can use libraries such as **EPPlus** or **ClosedXML**, which are commonly used for working with Excel files. Below is the implementation using **EPPlus**, which is straightforward and efficient for reading Excel files.

### Step 1: Install EPPlus

1. Open your project in Visual Studio.
2. Go to **Tools > NuGet Package Manager > Manage NuGet Packages for Solution**.
3. Search for **EPPlus** and install it.

### Step 2: Implement the `ExtractTextFromExcel` Function

Here’s a function that uses EPPlus to extract text from an Excel file:

```csharp

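using System;
using System.IO;             // FileInfo
using System.Text;           // StringBuilder
using System.Windows.Forms;  // MessageBox
using OfficeOpenXml;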

private string ExtractTextFromExcel(string filePath)
{
    StringBuilder text = new StringBuilder();

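    try
    {
        // EPPlus 5 and later refuses to run until a license is declared. This line assumes
        // noncommercial use and the 5.x-7.x API; remove it on EPPlus 4.x, and check the
        // licensing API of newer major versions, which may differ.
        ExcelPackage.LicenseContext = LicenseContext.NonCommercial;

        // Load the Excel file
        using (var package = new ExcelPackage(new FileInfo(filePath)))
        {
            foreach (var worksheet in package.Workbook.Worksheets)
            {
                // Go through each worksheet in the workbook
                text.AppendLine($"Worksheet: {worksheet.Name}");

                // Dimension is null for an empty worksheet, so skip it
                if (worksheet.Dimension == null) continue;

                // Loop through all cells with value in the worksheet
                for (int row = worksheet.Dimension.Start.Row; row <= worksheet.Dimension.End.Row; row++)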
                {
                    for (int col = worksheet.Dimension.Start.Column; col <= worksheet.Dimension.End.Column; col++)
                    {
                        var cellValue = worksheet.Cells[row, col].Text;
                        if (!string.IsNullOrEmpty(cellValue))
                        {
                            text.AppendLine($"Row {row}, Col {col}: {cellValue}");
                        }
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show($"Error reading Excel file: {ex.Message}");
        return string.Empty;
    }

    return text.ToString();
}
```

### Explanation of the Code

1. **EPPlus Initialization**: We create a new `ExcelPackage` object by loading the Excel file from the specified `filePath`.
2. **Worksheet Loop**: We loop through each worksheet in the Excel file.
3. **Cell Loop**: For each worksheet, we iterate through all cells within the defined range (`worksheet.Dimension`) and extract text from each cell.
4. **Appending Text**: If a cell has a value, we append it to the `StringBuilder`, along with its row and column information.
5. **Error Handling**: If there’s an error reading the file, an error message is shown, and an empty string is returned.

### Notes

- **EPPlus License**: Starting with version 5, EPPlus is released under the Polyform Noncommercial license: noncommercial use is free, and commercial use requires a paid license. Versions 4.5.x and earlier were released under the LGPL.
- **ClosedXML Alternative**: If you prefer an alternative, **ClosedXML** is another library for handling Excel files in C#.

This function will return all extracted text from the Excel file, including cell positions, as a single string. You can modify it further based on specific needs, such as extracting only from certain sheets or specific cell ranges.
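
For the tabular-data requirement, the per-cell `Row r, Col c:` lines can be awkward to match against a directory. One alternative is to emit one line per spreadsheet row. The sketch below assumes the cell texts have already been read into rows of strings (for example from `worksheet.Cells[row, col].Text`); `RowFormatter` and `ToTabSeparated` are hypothetical names, and tab-separated output is just one possible format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: turns rows of cell texts into tab-separated lines.
public static class RowFormatter
{
    public static string ToTabSeparated(IEnumerable<IEnumerable<string>> rows)
    {
        var sb = new StringBuilder();
        foreach (IEnumerable<string> row in rows)
        {
            List<string> cells = row.Select(c => Clean(c)).ToList();

            // Skip rows in which every cell is empty.
            if (cells.All(c => string.IsNullOrEmpty(c))) continue;

            sb.Append(string.Join("\t", cells)).Append('\n');
        }
        return sb.ToString();
    }

    // Tabs and line breaks inside a cell would break the row layout, so replace them.
    private static string Clean(string cell)
    {
        return (cell ?? string.Empty)
            .Replace('\t', ' ')
            .Replace('\r', ' ')
            .Replace('\n', ' ')
            .Trim();
    }
}
```

Keeping a whole row on one line means a notation and its document name stay together, which makes row-level matching simpler than reassembling cells from their coordinates.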