1. Introduction
This study note explains how to combine multiple Excel and CSV files into a single dataset using pandas, with a focus on speed and efficiency. It also covers the best engines for reading and writing files, and how to save the combined result in different formats for future use. Combining data from different files into one dataset is a common task in data analytics. Efficiently handling such operations helps improve workflow speed, especially when dealing with large datasets or when preparing data for exploratory analysis and model training.
2. Choosing the Right Engine
When working with multiple file types such as .xlsx and .csv, pandas offers different engines to handle file reading and writing operations. The choice of engine directly affects speed, memory usage, and compatibility.
Below is a comparison of the commonly used engines.
| Engine | File Type | Speed | Description |
| --- | --- | --- | --- |
| pyarrow | CSV | Very Fast | Apache Arrow’s vectorized, multithreaded parser, selected with engine="pyarrow" in pd.read_csv (pandas 1.4+). Arrow-backed dtypes can also be requested from other readers with dtype_backend="pyarrow" (pandas 2.0+). |
| openpyxl | Excel (.xlsx) | Medium | Default engine for Excel files. Reliable and widely supported, but slower than Arrow-based CSV reading. |
| xlsxwriter | Excel (.xlsx), writing only | Medium | Used only for writing Excel files. Produces lightweight, compatible output files. |
| odf | ODS files | Slow | XML-based engine for OpenDocument spreadsheets. Generally slower than Excel formats. |
| Built-in CSV engine | CSV files | Fast | pandas’ default optimized C parser. Handles large CSVs efficiently, though the pyarrow engine is usually faster. |
Conclusion:
The pyarrow engine is the fastest option for reading CSV files and integrates well with pandas’ Arrow-backed dtypes. Excel files are read through openpyxl; combining it with dtype_backend="pyarrow" keeps the resulting DataFrame memory-efficient.
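As a quick illustration of how the engine is selected, the snippet below reads a placeholder sample.csv with the Arrow-based parser and a placeholder sample.xlsx through openpyxl with Arrow-backed dtypes (the file names are only examples):
import pandas as pd

# CSV: the multithreaded Arrow parser (pandas 1.4+)
df_csv = pd.read_csv("sample.csv", engine="pyarrow")

# Excel: openpyxl parses the file; dtype_backend="pyarrow" (pandas 2.0+)
# stores the result in Arrow-backed columns for lower memory use
df_xlsx = pd.read_excel("sample.xlsx", engine="openpyxl", dtype_backend="pyarrow")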
3. Installation of Required Packages
Before working with these engines, install the necessary libraries.
pip install pyarrow openpyxl xlsxwriter
- pyarrow enables the fast Arrow-based CSV parser and Parquet support.
- openpyxl reads and writes .xlsx files.
- xlsxwriter is used later for writing the combined .xlsx output.
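As an optional sanity check (assuming the installation succeeded), printing the installed versions confirms that the Arrow-backed features used below are available; pandas 2.0+ is needed for dtype_backend="pyarrow":
import pandas as pd
import pyarrow
import openpyxl

print(pd.__version__, pyarrow.__version__, openpyxl.__version__)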
4. Combining All Excel and CSV Files into a Single Sheet
This section demonstrates how to merge all Excel and CSV files located in the same folder into one combined dataset using pandas.
Step 1: Import Required Libraries
import pandas as pd
import glob
import os
These libraries handle:
- pandas: data manipulation and combination.
- glob: locating multiple files by pattern (e.g., all .csv or .xlsx files).
- os: file path operations.
Step 2: Collect All Excel and CSV Files
files = glob.glob("*.xlsx") + glob.glob("*.csv")
This line searches the current directory and lists all files ending with .xlsx or .csv. The resulting list, files, contains the paths of all identified files.
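If the source files live in a dedicated folder rather than the current directory, the same pattern works with a path prefix; the folder name below is only an example:
folder = "reports"  # hypothetical folder containing the source files
files = glob.glob(os.path.join(folder, "*.xlsx")) + glob.glob(os.path.join(folder, "*.csv"))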
Step 3: Read Each File and Store in a List
combined_data = []

for file in files:
    ext = os.path.splitext(file)[1].lower()
    if ext == ".csv":
        df = pd.read_csv(file, engine="pyarrow")     # Fast Arrow-based CSV reader
    elif ext == ".xlsx":
        df = pd.read_excel(file, engine="openpyxl")  # Excel files are parsed by openpyxl
    else:
        continue

    # Add filename for traceability
    df["source_file"] = os.path.basename(file)
    combined_data.append(df)
Explanation:
- The code loops through each file, checking its extension.
- CSV files are read with pd.read_csv() and the pyarrow engine; Excel files are read with pd.read_excel() and the openpyxl engine.
- A new column source_file is added to record the file name, which is helpful for tracing data origins.
- Each dataframe is appended to the combined_data list.
- A more defensive variant of the loop, which skips unreadable files, is sketched below.
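The sketch below is not part of the original code: it wraps each read in a try/except so that a corrupted or unreadable file is reported and skipped instead of stopping the whole run.
combined_data = []

for file in files:
    ext = os.path.splitext(file)[1].lower()
    try:
        if ext == ".csv":
            df = pd.read_csv(file, engine="pyarrow")
        elif ext == ".xlsx":
            df = pd.read_excel(file, engine="openpyxl")
        else:
            continue
    except (ValueError, OSError) as exc:
        # Report and skip unreadable or corrupted files
        print(f"Skipping {file}: {exc}")
        continue
    df["source_file"] = os.path.basename(file)
    combined_data.append(df)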
Step 4: Combine All Dataframes
final_df = pd.concat(combined_data, ignore_index=True)
This concatenates all dataframes in combined_data into a single dataframe named final_df.
The parameter ignore_index=True resets the row indices to create a continuous index across all merged data.
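To see what ignore_index=True changes, a tiny self-contained example (toy data, unrelated to the files above) compares the two behaviours:
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

print(pd.concat([a, b]).index.tolist())                     # [0, 1, 0, 1] - original indices repeat
print(pd.concat([a, b], ignore_index=True).index.tolist())  # [0, 1, 2, 3] - one continuous index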
Step 5: Save Combined Data to Excel
final_df.to_excel("all_combined.xlsx", sheet_name="Merged_Data", index=False, engine="xlsxwriter")
Explanation:
- The final merged dataset is written to a single Excel file named all_combined.xlsx.
- The parameter sheet_name="Merged_Data" specifies the worksheet name.
- The xlsxwriter engine provides fast, reliable writing (it requires the xlsxwriter package installed earlier).
- index=False prevents pandas from writing row indices to the file.
Result:
- All .csv and .xlsx files are merged into one Excel sheet.
- A new column, source_file, identifies the origin of each record.
- The process executes quickly, with CSV reading accelerated by the pyarrow engine.
5. Converting Combined Data to Parquet Format for Faster Access
While Excel files are human-readable, they are not the most efficient for large-scale analytical work.
For better performance, the combined dataset can be stored in Parquet format, which is optimized for both speed and compression.
Step 1: Save Data as Parquet
final_df.to_parquet("all_combined.parquet", engine="pyarrow")
This saves the dataframe in a binary, columnar format using Apache Arrow.
Parquet format reduces file size and allows faster read/write operations compared to CSV or Excel.
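As a rough, data-dependent check of the compression claim, the sizes of the two output files can be compared directly (the exact numbers will vary with your data):
import os

for path in ["all_combined.xlsx", "all_combined.parquet"]:
    print(path, os.path.getsize(path) // 1024, "KB")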
Step 2: Load Parquet File for Future Use
df = pd.read_parquet("all_combined.parquet", engine="pyarrow")
This loads the saved Parquet file back into pandas almost instantly.
It’s the recommended method when repeatedly accessing large datasets during analysis or model building.
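Because Parquet is a columnar format, read_parquet can also load just the columns needed for a given analysis, which is often much faster than reading the full file. For example, loading only the source_file column created earlier:
subset = pd.read_parquet("all_combined.parquet", engine="pyarrow", columns=["source_file"])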
6. Summary
| Task | Recommended Engine | Reason |
| --- | --- | --- |
| Read Excel files | openpyxl (optionally with dtype_backend="pyarrow") | Default, reliable Excel reader; Arrow-backed dtypes keep memory use low |
| Write Excel files | xlsxwriter | Reliable and lightweight for writing |
| Read CSV files | pyarrow | Uses the vectorized Arrow backend |
| Save for future use | pyarrow (Parquet) | Fastest and most space-efficient storage |
7. Key Notes
- The pyarrow engine provides the fastest reading performance for CSV files; Excel reading goes through openpyxl.
- The xlsxwriter engine is used for writing Excel files but cannot read them.
- Combining datasets into a single sheet is useful for quick access and unified analysis during pandas practice.
- For continuous use and faster loading, converting to Parquet format is recommended.
- Parquet files are particularly suitable for analytical workloads, machine learning pipelines, and cloud storage.
8. Practical Use Case Example
This workflow is ideal for practice scenarios where:
- Multiple departmental reports exist as separate CSV and Excel files.
- The analyst wants to unify them into a single dataset for aggregation, transformation, and exploratory data analysis in pandas.
- After processing, the result can be saved as either an Excel file for reporting or a Parquet file for performance optimization.
9. Conclusion
Efficiently combining Excel and CSV files into one dataset allows faster experimentation and cleaner data management.
Using the pyarrow engine for CSV reading ensures optimal speed, while xlsxwriter provides flexibility for exporting results.
For long-term or repeated use, Parquet offers the best balance between performance and storage efficiency.
This setup serves as a foundational workflow for advanced pandas operations and future data engineering tasks.
How to Add Another File to an Existing Parquet Dataset
1. Read the Existing Parquet File
Start by loading your current Parquet dataset into memory.
import pandas as pd
df = pd.read_parquet("all_combined.parquet", engine="pyarrow")
This retrieves the existing data into a pandas DataFrame called df.
2. Read the New File You Want to Add
You can add a new file of any format — CSV or Excel.
Simply read it using the same Arrow engine for consistency.
# Example: new CSV file
new_data = pd.read_csv("new_file.csv", engine="pyarrow")
# OR, if it’s an Excel file
# new_data = pd.read_excel("new_file.xlsx", engine="openpyxl")
Optionally, you can add a column identifying the file name for traceability:
new_data["source_file"] = "new_file.csv"
3. Combine the New Data with the Existing Dataset
Use pd.concat() to append the new rows.
updated_df = pd.concat([df, new_data], ignore_index=True)
- ignore_index=True resets the index after concatenation.
- Both DataFrames should have matching column names; if not, missing columns will be filled with NaN.
4. Save the Updated Data Back to Parquet
Finally, overwrite or version your Parquet file with the new combined data.
updated_df.to_parquet("all_combined.parquet", engine="pyarrow")
✅ This overwrites the existing file with the updated version.
If you prefer version control (keeping backups), save with a new name:
updated_df.to_parquet("all_combined_v2.parquet", engine="pyarrow")
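One simple versioning convention (a sketch; any naming scheme works) is to stamp the output file name with the current date:
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d")
updated_df.to_parquet(f"all_combined_{stamp}.parquet", engine="pyarrow")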
5. (Optional) Append Without Loading All Data (Efficient Method)
If your dataset is very large, instead of loading everything into memory, you can use the partitioned Parquet dataset approach.
Step 1: Save Each Batch Separately
new_data.to_parquet("parquet_dataset/new_file.parquet", engine="pyarrow")  # the parquet_dataset/ folder must already exist
Step 2: Load All Partitions Together
import pyarrow.dataset as ds
dataset = ds.dataset("parquet_dataset", format="parquet")
df = dataset.to_table().to_pandas()
This automatically merges all .parquet files inside the folder parquet_dataset/ without concatenating manually.
It’s faster and more memory-efficient for incremental additions.
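A useful side effect of the dataset API is that column selections and row filters are pushed down to the Parquet files, so only the matching data is read. The sketch below reuses the source_file column from earlier; any other column works the same way:
import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset", format="parquet")

# Read only one column, and only the rows that came from a specific file
table = dataset.to_table(
    columns=["source_file"],
    filter=ds.field("source_file") == "new_file.csv",
)
df_small = table.to_pandas()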
6. Summary
| Step | Description | Command |
| --- | --- | --- |
| 1 | Read existing Parquet file | pd.read_parquet("all_combined.parquet") |
| 2 | Read new CSV or Excel file | pd.read_csv() or pd.read_excel() |
| 3 | Combine old and new data | pd.concat([df, new_data]) |
| 4 | Save combined data | to_parquet("all_combined.parquet") |
| 5 | For large datasets | Use the partitioned folder approach with pyarrow.dataset |
7. Example Workflow
import pandas as pd
# Step 1: Read the existing Parquet file
df = pd.read_parquet("all_combined.parquet", engine="pyarrow")
# Step 2: Read a new CSV file
new_data = pd.read_csv("sales_update.csv", engine="pyarrow")
new_data["source_file"] = "sales_update.csv"
# Step 3: Combine the two DataFrames
updated_df = pd.concat([df, new_data], ignore_index=True)
# Step 4: Save the updated data
updated_df.to_parquet("all_combined.parquet", engine="pyarrow")
8. Notes
- Always use the same column names and data types to ensure smooth merging.
- If you plan to update data frequently, use partitioned Parquet storage in a folder instead of one big file.
- Parquet supports schema evolution, but it’s better to maintain consistent structure across files.
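To catch column or dtype mismatches before they silently turn into NaN columns, a small pre-concatenation check along these lines can help (a sketch; adapt it to your own schema):
# Compare the columns of the existing and the new data before concatenating
missing_in_new = set(df.columns) - set(new_data.columns)
extra_in_new = set(new_data.columns) - set(df.columns)
if missing_in_new or extra_in_new:
    print("Column mismatch:", missing_in_new, extra_in_new)

# Align the dtypes of shared columns to the existing dataset
for col in df.columns.intersection(new_data.columns):
    new_data[col] = new_data[col].astype(df[col].dtype)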