Converting a Large CSV File to a Relational Database Using Pandas and MySQL

Margret Azuma
May 10, 2023

Data transfer from a CSV file to SQL databases using pandas

After scraping a large amount of data, connecting to a MySQL database and inserting the data is often the next logical step. The benefits include improved data management, faster data retrieval, better scalability, and efficient backup and recovery. As a beginner in the data space, converting and saving a CSV file into a MySQL database also gives me practice that improves my SQL skill set.

Install Relevant Dependencies

To convert a large CSV file, you need the pandas and MySQL Connector libraries. Install them using pip. I assume you already have pip installed; if you do not, this article should be guide enough. In your VS Code terminal (or any other terminal of your choice), type the following installation commands.

# Install mysql connector and pandas
pip install mysql-connector-python
pip install pandas

Now you can import the libraries into your Python environment.

# Import dependencies 
import mysql.connector
import pandas as pd

Establish Database Connection

Next, connect to an existing MySQL database using the mysql.connector.connect() method. Substitute the placeholders below with your own credentials. Information about how to set up a MySQL database can be found here.

# Establish a connection to the database server
conn = mysql.connector.connect(
    user='username', password='password', host='host name', database='name of database'
)

Create a SQL Table

Create a table in the database using the SQL CREATE TABLE command. The SQL syntax, wrapped in triple quotes """ """, is simple to understand: it asks for a table named EU_cities_hotel_data to be created if it does not already exist. Within the parentheses, each line declares a column name and its data type.

# Create new table headings
def create_table():
    create_table_query = """
    CREATE TABLE IF NOT EXISTS EU_cities_hotel_data(
        name VARCHAR(255),
        url VARCHAR(8000),
        distance VARCHAR(255),
        price VARCHAR(255),
        original_price VARCHAR(255),
        review VARCHAR(255),
        metro VARCHAR(255),
        checkin DATE,
        checkout DATE,
        city VARCHAR(255)
    );
    """

    # Create a cursor object and execute the table creation
    cursor = conn.cursor()
    cursor.execute(create_table_query)

# Create a cursor object to interact with the MySQL database
cursor = conn.cursor()

# Create the table if it doesn't exist
create_table()

The cursor = conn.cursor() line uses the cursor() method of the database connection object conn to create a cursor object, which allows Python to interact with the SQL database.

The cursor.execute(create_table_query) line uses the execute() method to send the query to the database server and run it.

Insert Data in SQL Table in Batches

If you have a sufficiently large table (mine is over 5 GB), it is worthwhile to insert the rows in batches. I initially tried filling in the table all at once but received a “max_allowed_packet exceeded” error. max_allowed_packet is a MySQL system variable that specifies the maximum size of a single packet that can be sent between the server and the client. The error occurs when you attempt to insert more data into a MySQL table than the maximum allowed packet size permits.

To correct this error, you need to either increase the packet size or run the inserts in batches. Depending on which database host you are connected to, the first option may not be possible. In my case, for instance, the maximum permissible max_allowed_packet size on AWS is 1073741824 bytes, which was insufficient.
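If you are not sure what limit your own server enforces, you can check it over the same connection before deciding which route to take. A quick sketch, reusing the conn object created earlier:

# Check the server's current max_allowed_packet value (in bytes)
check_cursor = conn.cursor()
check_cursor.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
variable, value = check_cursor.fetchone()
print(f"{variable} = {value} bytes")
check_cursor.close()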

The first step is to declare a suitable batch size and read the CSV file in chunks using the chunksize argument of pandas.read_csv(). Note that the result is not a DataFrame. Instead, a TextFileReader object is returned, which you iterate over in steps of the batch size; each chunk is returned as a DataFrame object.

# Create batch size 
batch_size = 10000

# Read csv in batches
df = pd.read_csv('/Users/oghenekarooray/Desktop/hotelsEU/data/combined_data.csv', chunksize=batch_size)

Afterwards, loop through the individual chunks and replace all NaN values with a value of your choice using chunk.fillna(value='value', inplace=True). This step is particularly important because NaN values are not recognised by MySQL and cannot be stored in columns with data types such as INT or VARCHAR.

Next, create a subset of the DataFrame columns and convert its values to a nested list. This prepares the data for insertion into the MySQL table and corrects for any data type differences. The df.to_sql() method can also be used, but only when the data types of the DataFrame columns match those of the SQL table columns (see the sketch at the end of this article).
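The loop below references an insert_query variable that is not shown in the snippets above. A parameterized INSERT statement matching the columns of the EU_cities_hotel_data table created earlier would look roughly like this:

# Parameterized insert query matching the table columns (placeholders use %s)
insert_query = """
INSERT INTO EU_cities_hotel_data
(name, url, distance, price, original_price, review, metro, checkin, checkout, city)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
"""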

# Loop through each chunk and execute the insert query
for chunk in df:
    # Replace any NaN values in the DataFrame with a suitable value
    chunk.fillna(value='NA', inplace=True)
    # Create a subset of the DataFrame and convert its values to a nested list
    data = chunk[['name', 'url', 'distance', 'price', 'original_price', 'review', 'metro', 'checkin', 'checkout', 'city']].values.tolist()
    cursor.executemany(insert_query, data)
    conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()

The executemany() method inserts multiple rows of data into the MySQL table in a single call. The conn.commit() method then commits the changes made to the database. As good practice, close the cursor and connection objects with the close() method once you have finished working with the database.
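As mentioned earlier, pandas' to_sql() is a shorter alternative when the DataFrame column types already line up with the SQL table. A minimal sketch of that approach, assuming SQLAlchemy is installed and substituting your own credentials and CSV path:

# Alternative: append each chunk with pandas' to_sql() via an SQLAlchemy engine
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://username:password@host/database')

for chunk in pd.read_csv('combined_data.csv', chunksize=batch_size):
    chunk.fillna(value='NA', inplace=True)
    chunk.to_sql('EU_cities_hotel_data', engine, if_exists='append', index=False)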
