Here is the full code for automatically transferring data from two Excel files into one, considering your requirements and improvements:
### Full Code:
```python
Code: Select all
import pandas as pd
import logging
from openpyxl import load_workbook
# Set up logging
logging.basicConfig(filename='data_transfer.log', level=logging.INFO, format='%(asctime)s - %(message)s')
# Function to load and clean data
def load_and_clean_data(requests_file, services_file):
# Load data from Excel
try:
requests_df = pd.read_excel(requests_file)
services_df = pd.read_excel(services_file)
logging.info(f'Data successfully loaded from {requests_file} and {services_file}.')
except FileNotFoundError as e:
logging.error(f'Error loading files: {e}')
return None, None
# Clean up empty rows
requests_df.dropna(how='all', inplace=True)
services_df.dropna(how='all', inplace=True)
# Check for missing values in important columns
requests_df = requests_df[requests_df['request_id'].notnull()]
services_df = services_df[services_df['ticket_id'].notnull()]
logging.info(f'After cleaning: {len(requests_df)} requests and {len(services_df)} services.')
return requests_df, services_df
# Function to merge data
def merge_data(requests_df, services_df):
# Merge data on unique identifiers
final_df = pd.merge(requests_df, services_df, left_on='request_id', right_on='ticket_id', how='outer')
# Fill empty values with "Not Provided"
final_df.fillna('Not Provided', inplace=True)
logging.info(f'Merged {len(final_df)} records.')
return final_df
# Function to save final data
def save_final_data(final_df, final_file):
# Save merged data to a new Excel file
try:
final_df.to_excel(final_file, index=False)
logging.info(f'Final file successfully saved to {final_file}.')
except Exception as e:
logging.error(f'Error saving file: {e}')
# Function to automatically update the final file
def update_final_data(requests_file, services_file, final_file):
# Load data
requests_df, services_df = load_and_clean_data(requests_file, services_file)
if requests_df is None or services_df is None:
logging.error('Failed to load data. Terminating process.')
return
# Merge data
final_df = merge_data(requests_df, services_df)
# Try adding new data to the existing final file
try:
wb = load_workbook(final_file)
sheet = wb.active
for row in final_df.values:
sheet.append(row)
wb.save(final_file)
logging.info(f'Final file successfully updated.')
except FileNotFoundError:
# If file not found, save as new
save_final_data(final_df, final_file)
# Function to check data integrity
def check_data_integrity(requests_df, services_df):
# Check for matches between request_id and ticket_id
missing_in_services = requests_df[~requests_df['request_id'].isin(services_df['ticket_id'])]
missing_in_requests = services_df[~services_df['ticket_id'].isin(requests_df['request_id'])]
if not missing_in_services.empty:
logging.warning(f'Found {len(missing_in_services)} request records without corresponding services.')
if not missing_in_requests.empty:
logging.warning(f'Found {len(missing_in_requests)} service records without corresponding requests.')
# Main function to run the process
def main():
requests_file = 'requests.xlsx' # Path to the requests file
services_file = 'services.xlsx' # Path to the services file
final_file = 'final_output.xlsx' # Path to the final output file
# Update final file
update_final_data(requests_file, services_file, final_file)
# Check data integrity
requests_df, services_df = load_and_clean_data(requests_file, services_file)
if requests_df is not None and services_df is not None:
check_data_integrity(requests_df, services_df)
if __name__ == "__main__":
main()
```
### Code Explanation:
1. **Loading and Cleaning Data**:
- The `load_and_clean_data` function loads the Excel files (`requests.xlsx` and `services.xlsx`), removes empty rows, and ensures that essential fields like `request_id` and `ticket_id` are not missing.
2. **Merging Data**:
- The `merge_data` function merges the two dataframes based on unique identifiers (`request_id` and `ticket_id`) and fills any missing values with "Not Provided".
3. **Saving Final Data**:
- The `save_final_data` function saves the merged data into a new Excel file (`final_output.xlsx`), or updates the existing file with new data.
4. **Checking Data Integrity**:
- The `check_data_integrity` function checks whether each request has a corresponding service and vice versa, and logs any discrepancies.
5. **Logging**:
- The script uses Python's built-in `logging` module to log each step of the process, including successes and errors. Logs are saved in `data_transfer.log`.
### How to Use:
1. Place the files `requests.xlsx` and `services.xlsx` in the appropriate paths.
2. Run the script, and it will automatically update or create the `final_output.xlsx` file.
3. Any mismatches between the request and service records will be logged in the `data_transfer.log` file.
This solution automates the process of transferring and merging data from two sources into one, reducing errors and improving workflow efficiency.