<aside>
😁 tl;dr: use pandas.read_csv, pandas.DataFrame.convert_dtypes and pandas.DataFrame.to_sql
</aside>
For a project at work, my team and I have been in charge of writing scripts to continuously extract data from a legacy application into a new system, with the plan of eventually phasing out the legacy one. For several (valid) reasons, we were not allowed direct access to the legacy application's database or even a clean dump of it; we were only given a bunch of exported CSV files. We have been using Pandas to load the CSVs into memory and do data cleaning, joining, etc. But I'm relatively new to Pandas, and it has quite a large API surface. Meanwhile, I'm already comfortable with a great data wrangling tool: SQL. So naturally I wanted to import the CSVs into a SQL database and query to my heart's content. Something small, lite, local like SQLite would be perfect.
A straightforward way to do this would be to manually write down CREATE TABLE statements and then write some kind of Python or Ruby script to load the rows into the newly created tables:
```python
import csv
import sqlite3

conn = sqlite3.connect('my_database.db')

conn.execute('''
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER
    );
''')

with open('my_table.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=';')
    for row in reader:
        # sqlite3 uses ? placeholders and takes the values as a sequence
        conn.execute(
            'INSERT INTO my_table (id, name, age) VALUES (?, ?, ?)',
            row)

conn.commit()
```
Imagine doing this for 10+ files. So much repetition, so much manual work. I'm lazy...
Most articles on the Internet, however, will point out that the sqlite3 command-line shell has a built-in .import command that can load a CSV file directly and automatically create a table matching all the columns in the file. Sweet!
Except that all the columns will have TEXT
as the type and all NULL
values will be represented as the string 'NULL'
. You can run an UPDATE
statement to convert the NULL
values but you will have to issue a statement for each column. Manually, of course. Same for the data type conversion. Also, in SQLite you can't ALTER COLUMN
so you'll have to copy the data into a new table with the correct column types.
Urggghhh!
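To make the pain concrete, here is a rough sketch of the cleanup you would need after a plain .import, assuming a hypothetical my_table with id, name and age columns: one UPDATE per column to turn the literal 'NULL' strings into real NULLs, then a copy into a freshly created table because the column types can't be changed in place.

```python
import sqlite3

conn = sqlite3.connect('my_database.db')  # hypothetical database file

# One UPDATE per imported column to turn the literal string 'NULL' into a real NULL
for column in ('id', 'name', 'age'):  # hypothetical column names
    conn.execute(f"UPDATE my_table SET {column} = NULL WHERE {column} = 'NULL'")

# SQLite has no ALTER COLUMN, so fixing the types means creating a new table
# with the schema you actually want and copying everything over.
conn.executescript('''
    CREATE TABLE my_table_typed (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER
    );
    INSERT INTO my_table_typed SELECT id, name, age FROM my_table;
    DROP TABLE my_table;
    ALTER TABLE my_table_typed RENAME TO my_table;
''')
conn.commit()
```

And that's for a single table.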
Fortunately, Pandas has a read_csv function that, with the help of DataFrame.convert_dtypes, tries its best to detect the column types and automatically recognizes NULL values. There is also a DataFrame.to_sql method that conveniently creates the table with the correct schema and inserts all the rows for you. Can't ask for more.
The code now becomes a 3-liner:
```python
pandas.read_csv('my_table.csv', sep=';') \
    .convert_dtypes() \
    .to_sql('my_table', conn, index=False)
```
There you have it. You can now add indexes and make any custom changes you need. Also check out the Pandas documentation for options to control the import/conversion process.
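Scaled up to the whole pile of exports, the idea looks roughly like this. The exports/ directory, the legacy.db file, the semicolon delimiter and the index at the end are all assumptions for illustration, not part of my actual setup:

```python
import sqlite3
from pathlib import Path

import pandas

conn = sqlite3.connect('legacy.db')  # hypothetical target database

# Import every exported CSV into a table named after the file
for csv_path in Path('exports').glob('*.csv'):  # hypothetical export directory
    pandas.read_csv(csv_path, sep=';') \
        .convert_dtypes() \
        .to_sql(csv_path.stem, conn, if_exists='replace', index=False)

# Indexes (and any other tweaks) can then be added on top
conn.execute('CREATE INDEX IF NOT EXISTS idx_my_table_name ON my_table (name)')
conn.commit()
```

Passing if_exists='replace' to to_sql drops and recreates each table, which keeps the script safe to re-run whenever a fresh batch of exports arrives.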
<aside> 👌 Side note: If you can afford it, IntelliJ/DataGrip's SQL integration is just MIND-BOGGLINGLY AMAZING!
</aside>