Python | The fastest way to read CSV (csv/NumPy/pandas/VBA)

Python

This article describes which is the fastest way to read CSV among csv (Python standard module) , NumPy, pandas, and Excel VBA.

Conclusion

pandas is the fastest. csv (Python standard module) is twice as slow. NumPy and VBA are 20x slower.

Experimental method

  • Measure the time 10 times to read a CSV file with Python standard module csv and put it in a variable. Measure this five times.
  • Same with NumPy
  • Same with pandas
  • Measure the time 10 times to read a CSV file with VBA and put it in a EXCEL. Measure this five times.

CSV file to read

  • 200,000 rows of data 0,1,2,3,4,5,6,7,8,9
  • File size 4.0MB

This is what the file looks like when opened in Excel.

Code

Python standard module csv

#!/usr/bin/python

from time import time
import csv


file_name = r"C:\Users\yoshihiko\Desktop\yoshihiko.csv"
iteration = 10


def load_csv_BY_csv_reader(file):
    result = []
    for n in range(iteration):
        with open(file, "r", encoding="utf-8") as f:
            reader = csv.reader(f)
            for row in reader:
                result.append(row)
    return result

start_time = time()
result = load_csv_BY_csv_reader(file_name)
elapsed_time = time() - start_time
print("{:.3f}".format(elapsed_time), "秒")

NumPy

#!/usr/bin/python

from time import time
import numpy as np


file_name = r"C:\Users\yoshihiko\Desktop\yoshihiko.csv"
iteration = 10


def load_csv_BY_numpy(file):
    for n in range(iteration):
        csv = np.loadtxt(file, delimiter=',')
    return csv

start_time = time()
result = load_csv_BY_numpy(file_name)
elapsed_time = time() - start_time
print("{:.3f}".format(elapsed_time), "秒")

pandas

#!/usr/bin/python

from time import time
import pandas as pd


file_name = r"C:\Users\yoshihiko\Desktop\yoshihiko.csv"
iteration = 10


def load_csv_BY_pandas(file):
    for n in range(iteration):
        csv = pd.read_csv(file)
    return csv

start_time = time()
result = load_csv_BY_pandas(file_name)
elapsed_time = time() - start_time
print("{:.3f}".format(elapsed_time), "秒")

VBA

Option Explicit

Private Sub read_csv()
    Dim csv_path As String: csv_path = "C:\Users\yoshihiko\Desktop\yoshihiko.csv"
    Dim csv_file_name As String: csv_file_name = "yoshihiko.csv"
    Dim csv_sheet_name As String: csv_sheet_name = "yoshihiko"
    Dim self_file_name As String: self_file_name = Application.ActiveWorkbook.Name
    Dim i As Integer
    Dim iterate As Integer: iterate = 10
    Dim start_time As Variant
    Dim elapsed_time As Variant

    Application.ScreenUpdating = False
    start_time = Timer

    For i = 1 To iterate
        Workbooks(self_file_name).Activate
        Workbooks.Open csv_path
        Workbooks(self_file_name).Worksheets("Sheet1").Range("A1:J20000") = _
            Workbooks(csv_file_name).Worksheets(csv_sheet_name).Range("A1:J20000").Value
        Workbooks(csv_file_name).Close
    Next i

    elapsed_time = Timer - start_time
    Application.ScreenUpdating = True
    MsgBox elapsed_time & "sec"

End Sub

PC

CPU : CORE i7 7th Gen ( i7-7500U 2.70GHz up to 3.50 GHz )

Result

  • pandas is the fastest with an average of 1.001 seconds.
  • Based on pandas, csv is 2x slower, NumPy and VBA 20x slower.
Trial NumbercsvNumPypandasVBA
12.44520.8611.15724.836
22.14019.7841.01222.086
32.12419.4600.94320.719
42.09420.0800.95426.750
52.10020.4840.94021.180
average2.18120.1341.00123.114

unit : second

Summary

pandas is the fastest.

コメント