Automating Data Integration from Multiple Excel Files for Enhanced Workflow Efficiency
- paypal56_ab6mk6y7
- Site Admin
- Posts: 47
- Joined: Sat Oct 26, 2024 3:05 pm
Automating Data Integration from Multiple Excel Files for Enhanced Workflow Efficiency
The task is to develop a system for automatically integrating and merging data from two Excel files into one. The process involves extracting specific information, such as unique request and ticket IDs, from both files, and seamlessly merging them based on these identifiers. The merged data must be saved in a final Excel sheet, with the ability to update the file as new entries are added. Additionally, the system should handle missing data appropriately, ensuring consistency between records and flagging any discrepancies. This will streamline data processing, reduce manual input, and improve overall efficiency in handling customer requests and service processing.
- paypal56_ab6mk6y7
- Site Admin
- Posts: 47
- Joined: Sat Oct 26, 2024 3:05 pm
Re: Automating Data Integration from Multiple Excel Files for Enhanced Workflow Efficiency
Here is the full code for automatically transferring data from two Excel files into one, considering your requirements and improvements:
### Full Code:
```python
```
### Code Explanation:
1. **Loading and Cleaning Data**:
- The `load_and_clean_data` function loads the Excel files (`requests.xlsx` and `services.xlsx`), removes empty rows, and ensures that essential fields like `request_id` and `ticket_id` are not missing.
2. **Merging Data**:
- The `merge_data` function merges the two dataframes based on unique identifiers (`request_id` and `ticket_id`) and fills any missing values with "Not Provided".
3. **Saving Final Data**:
- The `save_final_data` function saves the merged data into a new Excel file (`final_output.xlsx`), or updates the existing file with new data.
4. **Checking Data Integrity**:
- The `check_data_integrity` function checks whether each request has a corresponding service and vice versa, and logs any discrepancies.
5. **Logging**:
- The script uses Python's built-in `logging` module to log each step of the process, including successes and errors. Logs are saved in `data_transfer.log`.
### How to Use:
1. Place the files `requests.xlsx` and `services.xlsx` in the appropriate paths.
2. Run the script, and it will automatically update or create the `final_output.xlsx` file.
3. Any mismatches between the request and service records will be logged in the `data_transfer.log` file.
This solution automates the process of transferring and merging data from two sources into one, reducing errors and improving workflow efficiency.
### Full Code:
```python
Code: Select all
import pandas as pd
import logging
from openpyxl import load_workbook
# Set up logging
logging.basicConfig(filename='data_transfer.log', level=logging.INFO, format='%(asctime)s - %(message)s')
# Function to load and clean data
def load_and_clean_data(requests_file, services_file):
# Load data from Excel
try:
requests_df = pd.read_excel(requests_file)
services_df = pd.read_excel(services_file)
logging.info(f'Data successfully loaded from {requests_file} and {services_file}.')
except FileNotFoundError as e:
logging.error(f'Error loading files: {e}')
return None, None
# Clean up empty rows
requests_df.dropna(how='all', inplace=True)
services_df.dropna(how='all', inplace=True)
# Check for missing values in important columns
requests_df = requests_df[requests_df['request_id'].notnull()]
services_df = services_df[services_df['ticket_id'].notnull()]
logging.info(f'After cleaning: {len(requests_df)} requests and {len(services_df)} services.')
return requests_df, services_df
# Function to merge data
def merge_data(requests_df, services_df):
# Merge data on unique identifiers
final_df = pd.merge(requests_df, services_df, left_on='request_id', right_on='ticket_id', how='outer')
# Fill empty values with "Not Provided"
final_df.fillna('Not Provided', inplace=True)
logging.info(f'Merged {len(final_df)} records.')
return final_df
# Function to save final data
def save_final_data(final_df, final_file):
# Save merged data to a new Excel file
try:
final_df.to_excel(final_file, index=False)
logging.info(f'Final file successfully saved to {final_file}.')
except Exception as e:
logging.error(f'Error saving file: {e}')
# Function to automatically update the final file
def update_final_data(requests_file, services_file, final_file):
# Load data
requests_df, services_df = load_and_clean_data(requests_file, services_file)
if requests_df is None or services_df is None:
logging.error('Failed to load data. Terminating process.')
return
# Merge data
final_df = merge_data(requests_df, services_df)
# Try adding new data to the existing final file
try:
wb = load_workbook(final_file)
sheet = wb.active
for row in final_df.values:
sheet.append(row)
wb.save(final_file)
logging.info(f'Final file successfully updated.')
except FileNotFoundError:
# If file not found, save as new
save_final_data(final_df, final_file)
# Function to check data integrity
def check_data_integrity(requests_df, services_df):
# Check for matches between request_id and ticket_id
missing_in_services = requests_df[~requests_df['request_id'].isin(services_df['ticket_id'])]
missing_in_requests = services_df[~services_df['ticket_id'].isin(requests_df['request_id'])]
if not missing_in_services.empty:
logging.warning(f'Found {len(missing_in_services)} request records without corresponding services.')
if not missing_in_requests.empty:
logging.warning(f'Found {len(missing_in_requests)} service records without corresponding requests.')
# Main function to run the process
def main():
requests_file = 'requests.xlsx' # Path to the requests file
services_file = 'services.xlsx' # Path to the services file
final_file = 'final_output.xlsx' # Path to the final output file
# Update final file
update_final_data(requests_file, services_file, final_file)
# Check data integrity
requests_df, services_df = load_and_clean_data(requests_file, services_file)
if requests_df is not None and services_df is not None:
check_data_integrity(requests_df, services_df)
if __name__ == "__main__":
main()
### Code Explanation:
1. **Loading and Cleaning Data**:
- The `load_and_clean_data` function loads the Excel files (`requests.xlsx` and `services.xlsx`), removes empty rows, and ensures that essential fields like `request_id` and `ticket_id` are not missing.
2. **Merging Data**:
- The `merge_data` function merges the two dataframes based on unique identifiers (`request_id` and `ticket_id`) and fills any missing values with "Not Provided".
3. **Saving Final Data**:
- The `save_final_data` function saves the merged data into a new Excel file (`final_output.xlsx`), or updates the existing file with new data.
4. **Checking Data Integrity**:
- The `check_data_integrity` function checks whether each request has a corresponding service and vice versa, and logs any discrepancies.
5. **Logging**:
- The script uses Python's built-in `logging` module to log each step of the process, including successes and errors. Logs are saved in `data_transfer.log`.
### How to Use:
1. Place the files `requests.xlsx` and `services.xlsx` in the appropriate paths.
2. Run the script, and it will automatically update or create the `final_output.xlsx` file.
3. Any mismatches between the request and service records will be logged in the `data_transfer.log` file.
This solution automates the process of transferring and merging data from two sources into one, reducing errors and improving workflow efficiency.