Page 1 of 1

Automating Data Integration from Multiple Excel Files for Enhanced Workflow Efficiency

Posted: Sun Nov 10, 2024 6:39 am
by paypal56_ab6mk6y7
The task is to develop a system for automatically integrating and merging data from two Excel files into one. The process involves extracting specific information, such as unique request and ticket IDs, from both files, and seamlessly merging them based on these identifiers. The merged data must be saved in a final Excel sheet, with the ability to update the file as new entries are added. Additionally, the system should handle missing data appropriately, ensuring consistency between records and flagging any discrepancies. This will streamline data processing, reduce manual input, and improve overall efficiency in handling customer requests and service processing.

Re: Automating Data Integration from Multiple Excel Files for Enhanced Workflow Efficiency

Posted: Sun Nov 10, 2024 6:41 am
by paypal56_ab6mk6y7
Here is the full code for automatically transferring data from two Excel files into one, considering your requirements and improvements:

### Full Code:

```python

Code: Select all

import pandas as pd
import logging
from openpyxl import load_workbook

# Set up logging
logging.basicConfig(filename='data_transfer.log', level=logging.INFO, format='%(asctime)s - %(message)s')

# Function to load and clean data
def load_and_clean_data(requests_file, services_file):
    # Load data from Excel
    try:
        requests_df = pd.read_excel(requests_file)
        services_df = pd.read_excel(services_file)
        logging.info(f'Data successfully loaded from {requests_file} and {services_file}.')
    except FileNotFoundError as e:
        logging.error(f'Error loading files: {e}')
        return None, None

    # Clean up empty rows
    requests_df.dropna(how='all', inplace=True)
    services_df.dropna(how='all', inplace=True)

    # Check for missing values in important columns
    requests_df = requests_df[requests_df['request_id'].notnull()]
    services_df = services_df[services_df['ticket_id'].notnull()]

    logging.info(f'After cleaning: {len(requests_df)} requests and {len(services_df)} services.')

    return requests_df, services_df

# Function to merge data
def merge_data(requests_df, services_df):
    # Merge data on unique identifiers
    final_df = pd.merge(requests_df, services_df, left_on='request_id', right_on='ticket_id', how='outer')

    # Fill empty values with "Not Provided"
    final_df.fillna('Not Provided', inplace=True)

    logging.info(f'Merged {len(final_df)} records.')
    return final_df

# Function to save final data
def save_final_data(final_df, final_file):
    # Save merged data to a new Excel file
    try:
        final_df.to_excel(final_file, index=False)
        logging.info(f'Final file successfully saved to {final_file}.')
    except Exception as e:
        logging.error(f'Error saving file: {e}')

# Function to automatically update the final file
def update_final_data(requests_file, services_file, final_file):
    # Load data
    requests_df, services_df = load_and_clean_data(requests_file, services_file)
    if requests_df is None or services_df is None:
        logging.error('Failed to load data. Terminating process.')
        return

    # Merge data
    final_df = merge_data(requests_df, services_df)

    # Try adding new data to the existing final file
    try:
        wb = load_workbook(final_file)
        sheet = wb.active
        for row in final_df.values:
            sheet.append(row)
        wb.save(final_file)
        logging.info(f'Final file successfully updated.')
    except FileNotFoundError:
        # If file not found, save as new
        save_final_data(final_df, final_file)

# Function to check data integrity
def check_data_integrity(requests_df, services_df):
    # Check for matches between request_id and ticket_id
    missing_in_services = requests_df[~requests_df['request_id'].isin(services_df['ticket_id'])]
    missing_in_requests = services_df[~services_df['ticket_id'].isin(requests_df['request_id'])]

    if not missing_in_services.empty:
        logging.warning(f'Found {len(missing_in_services)} request records without corresponding services.')
    if not missing_in_requests.empty:
        logging.warning(f'Found {len(missing_in_requests)} service records without corresponding requests.')

# Main function to run the process
def main():
    requests_file = 'requests.xlsx'  # Path to the requests file
    services_file = 'services.xlsx'  # Path to the services file
    final_file = 'final_output.xlsx'  # Path to the final output file

    # Update final file
    update_final_data(requests_file, services_file, final_file)

    # Check data integrity
    requests_df, services_df = load_and_clean_data(requests_file, services_file)
    if requests_df is not None and services_df is not None:
        check_data_integrity(requests_df, services_df)

if __name__ == "__main__":
    main()
```

### Code Explanation:
1. **Loading and Cleaning Data**:
- The `load_and_clean_data` function loads the Excel files (`requests.xlsx` and `services.xlsx`), removes empty rows, and ensures that essential fields like `request_id` and `ticket_id` are not missing.

2. **Merging Data**:
- The `merge_data` function merges the two dataframes based on unique identifiers (`request_id` and `ticket_id`) and fills any missing values with "Not Provided".

3. **Saving Final Data**:
- The `save_final_data` function saves the merged data into a new Excel file (`final_output.xlsx`), or updates the existing file with new data.

4. **Checking Data Integrity**:
- The `check_data_integrity` function checks whether each request has a corresponding service and vice versa, and logs any discrepancies.

5. **Logging**:
- The script uses Python's built-in `logging` module to log each step of the process, including successes and errors. Logs are saved in `data_transfer.log`.

### How to Use:
1. Place the files `requests.xlsx` and `services.xlsx` in the appropriate paths.
2. Run the script, and it will automatically update or create the `final_output.xlsx` file.
3. Any mismatches between the request and service records will be logged in the `data_transfer.log` file.

This solution automates the process of transferring and merging data from two sources into one, reducing errors and improving workflow efficiency.