<aside>
😁 tl;dr: use pandas.read_csv, pandas.DataFrame.convert_dtypes and pandas.DataFrame.to_sql
</aside>
For a project at work, my team and I have been in charge of writing scripts to continuously extract data from a legacy application into a new system, with the plan of eventually phasing out the legacy one. For several (valid) reasons, we were not allowed direct access to the legacy application's database or even a clean dump of it; we were only given a bunch of exported CSV files. We have been using Pandas to load the CSVs into memory and do data cleaning, joining, etc. But I'm relatively new to Pandas, and it has quite a large API surface. Meanwhile, I'm already comfortable with a great data wrangling tool: SQL. So naturally I wanted to import the CSVs into a SQL database and query to my heart's content. Something small, lite, local like SQLite would be perfect.
A straightforward way to do this would be to manually write down CREATE TABLE statements and then write some kind of Python or Ruby script to load the rows into the newly created tables:
```python
import csv
import sqlite3

conn = sqlite3.connect('my_database.db')

conn.execute('''
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER
    );
''')

with open('my_table.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=';')
    for row in reader:
        # sqlite3 uses ? placeholders and takes the values as a sequence
        conn.execute(
            'INSERT INTO my_table (id, name, age) VALUES (?, ?, ?)',
            row)

conn.commit()
```
Imagine doing this for 10+ files. So much repetition, so much manual work. I'm lazy...
Most articles on the Internet, however, will point out that the sqlite3 command-line shell has a built-in .import command that can load a CSV file directly and automatically create a table matching all the columns in the file. Sweet!
Except that all the columns will have TEXT
as the type and all NULL
values will be represented as the string 'NULL'
. You can run an UPDATE
statement to convert the NULL
values but you will have to issue a statement for each column. Manually, of course. Same for the data type conversion. Also, in SQLite you can't ALTER COLUMN
so you'll have to copy the data into a new table with the correct column types.
Urggghhh!
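To make the pain concrete, here is a rough sketch of the cleanup you would need after a plain .import, assuming a hypothetical my_table with id, name and age columns: one UPDATE per column to turn the literal 'NULL' strings into real NULLs, then a copy into a freshly created table because the column types can't be changed in place.

```python
import sqlite3

conn = sqlite3.connect('my_database.db')  # hypothetical database file

# One UPDATE per imported column to turn the literal string 'NULL' into a real NULL
for column in ('id', 'name', 'age'):  # hypothetical column names
    conn.execute(f"UPDATE my_table SET {column} = NULL WHERE {column} = 'NULL'")

# SQLite has no ALTER COLUMN, so fixing the types means creating a new table
# with the schema you actually want and copying everything over.
conn.executescript('''
    CREATE TABLE my_table_typed (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER
    );
    INSERT INTO my_table_typed SELECT id, name, age FROM my_table;
    DROP TABLE my_table;
    ALTER TABLE my_table_typed RENAME TO my_table;
''')
conn.commit()
```

And that's for a single table.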
Fortunately, Pandas has a read_csv function that, with the help of DataFrame.convert_dtypes, tries its best to detect the column types and automatically recognizes NULL values. There is also a DataFrame.to_sql method that conveniently creates the table with the correct schema and inserts all the rows for you. Can't ask for more.
The code now becomes a 3-liner:
```python
pandas.read_csv('my_table.csv', sep=';') \
    .convert_dtypes() \
    .to_sql('my_table', conn, index=False)
```
There you have it. You can now add indexes and make any custom changes you need. Also check out the Pandas documentation for options to control the import/conversion process.
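Scaled up to the whole pile of exports, the idea looks roughly like this. The exports/ directory, the legacy.db file, the semicolon delimiter and the index at the end are all assumptions for illustration, not part of my actual setup:

```python
import sqlite3
from pathlib import Path

import pandas

conn = sqlite3.connect('legacy.db')  # hypothetical target database

# Import every exported CSV into a table named after the file
for csv_path in Path('exports').glob('*.csv'):  # hypothetical export directory
    pandas.read_csv(csv_path, sep=';') \
        .convert_dtypes() \
        .to_sql(csv_path.stem, conn, if_exists='replace', index=False)

# Indexes (and any other tweaks) can then be added on top
conn.execute('CREATE INDEX IF NOT EXISTS idx_my_table_name ON my_table (name)')
conn.commit()
```

Passing if_exists='replace' to to_sql drops and recreates each table, which keeps the script safe to re-run whenever a fresh batch of exports arrives.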
<aside> 👌 Side note: If you can afford it, IntelliJ/DataGrip's SQL integration is just MIND-BOGGLINGLY AMAZING!
</aside>