Text file import done right


One thing I have seen time and time again over the years is data being loaded from some external source into a database, only to see the import either fail or blow up in flight, corrupting data in the process.

Importing data is simplicity itself, but only if the source data is perfect, and the world can and will conspire against you to make your import fail. Rather than taking the optimistic assumption that there will be no problems, your import file needs to be parsed and checked, and only when you are satisfied that the data in the file is correct should you start inserting it into your database.

The Basics

In the beginning there was a file…

The import process starts, the contents of the target database are purged, and the database is instructed to load the new import file… except that the file in question was never generated or transferred, and the import process has nothing to do. Do you have a database backup?

Before you do anything definitive, at least make sure that you actually have a file to process, and that the file actually has contents!

And that file should have correct contents

Don’t just check that the file exists; check that there is actually something inside it. Did the file get truncated during a transfer, leaving it incomplete? If you blindly import what you have, you will end up with an inconsistent database. You need to validate the contents before loading them.
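
A minimal sketch of those two checks in Python, with an invented file name; refuse to go any further if either test fails:

    import os
    import sys

    IMPORT_FILE = "customers.csv"   # hypothetical import file name
    MIN_SIZE = 1                    # at least one byte; raise this if you know the expected size

    # Refuse to touch the database if the file is missing or empty.
    if not os.path.isfile(IMPORT_FILE):
        sys.exit(f"Import aborted: {IMPORT_FILE} was not found")
    if os.path.getsize(IMPORT_FILE) < MIN_SIZE:
        sys.exit(f"Import aborted: {IMPORT_FILE} is empty or truncated")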

I also recommend that if you are using plain text files, you include a header line containing the column names the file should contain. This allows explicit validation of the number of fields supplied, and any change to that list alerts you that the file structure has changed and something unexpected is happening.
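
A sketch of that header check, assuming a comma-separated file and a column list invented for illustration:

    import csv

    EXPECTED_HEADERS = ["id", "name", "email", "created_at"]  # hypothetical column list

    with open("customers.csv", newline="") as f:
        header = next(csv.reader(f))
        if header != EXPECTED_HEADERS:
            # A changed column list means the file structure changed: stop and investigate.
            raise ValueError(f"Unexpected header {header!r}, expected {EXPECTED_HEADERS!r}")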

Your import procedure must have a list of the fields expected per file, the database field each one maps to, and the expected data type and size of the corresponding column in the database. If you also wish to accept files with the columns out of order, then you need to track which column header appears in which position, so that each value can be checked against the rules for the column it actually belongs to.
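
One way to express that mapping is a small specification table, keyed by header name so that the column order no longer matters. The field names, types and sizes below are invented for illustration:

    # Hypothetical specification: header name -> (database column, type parser, max size)
    FIELD_SPEC = {
        "id":    ("customer_id", int, 10),
        "name":  ("full_name",   str, 100),
        "email": ("email",       str, 255),
    }

    def check_row(headers, row):
        """Validate one row against the spec; returns a list of error messages."""
        errors = []
        # headers are assumed to have already been validated against FIELD_SPEC
        for header, value in zip(headers, row):
            db_column, parser, max_size = FIELD_SPEC[header]
            if len(value) > max_size:
                errors.append(f"{header}: value longer than {max_size} characters")
            try:
                parser(value)  # wrong type, e.g. a string where a number is expected
            except ValueError:
                errors.append(f"{header}: expected {parser.__name__}, got {value!r}")
        return errors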

And the import must succeed

If the source data is clean, then it can be imported. If you are writing into a database, it is a very good idea to use a transaction, so that if there is any outage during the import, the partial data is never committed and you are not left with inconsistent data.
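
A sketch of a transactional import using Python’s built-in sqlite3 module (the database, table and file names are invented); the same commit-or-rollback pattern applies to any driver. The parameterised query also takes care of escaping the data:

    import csv
    import sqlite3

    conn = sqlite3.connect("app.db")   # hypothetical database
    try:
        with open("customers.csv", newline="") as f:
            reader = csv.reader(f)
            next(reader)               # skip the (already validated) header line
            # The driver opens a transaction implicitly; "?" placeholders escape the data.
            conn.executemany(
                "INSERT INTO customers (customer_id, full_name, email) VALUES (?, ?, ?)",
                reader,
            )
        conn.commit()                  # nothing is visible until this succeeds
    except Exception:
        conn.rollback()                # any failure leaves the table untouched
        raise
    finally:
        conn.close()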

If you cannot use a transaction due to system constraints, I recommend using a temporary table and a completion flag. This allows you to transfer a complete dataset from the temporary import tables to the production tables, while recording where you are and whether the process has completed. This is in essence a “manual” transaction: it will be slower, but it limits corruption and inconsistencies as much as possible.
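
A sketch of that “manual” transaction, again with invented table names; the process log records how far each working table has progressed, so that a restart knows where to pick up:

    import sqlite3

    conn = sqlite3.connect("app.db")   # hypothetical database and table names throughout
    conn.execute("CREATE TABLE IF NOT EXISTS process_log (tablename TEXT, state TEXT)")
    conn.execute("DELETE FROM process_log")

    # Stage the data in a working table so production is untouched until the end.
    conn.execute("INSERT INTO process_log VALUES ('import_customers', 'loading')")
    conn.commit()
    # ... insert the validated rows into import_customers here ...
    conn.execute("UPDATE process_log SET state = 'loaded' WHERE tablename = 'import_customers'")
    conn.commit()

    # Only when every working table is marked 'loaded' do we touch production.
    conn.execute("INSERT INTO customers SELECT * FROM import_customers")
    conn.execute("UPDATE process_log SET state = 'done' WHERE tablename = 'import_customers'")
    conn.commit()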

Status and errors must be logged

There is no excuse for creating a black-box import that provides no information about what was happening in the case of a crash or error.

The minimum information would be the names and sizes of the files being processed; an MD5 hash of each file would be a bonus. When parsing the file and checking the format and values, you also need to log where you are in the file in case of an error: which line (if any) is giving problems, and why (expected a number but got a string, or a mandatory value not provided, for example). Everything should also have a timestamp, so you can confirm when your process ran and how long it took; if a slow process has other processes depending on it down the line, that runtime information lets you make an informed decision about changing the time of your import run.
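
As a sketch, Python’s standard hashlib and logging modules cover that minimum (the file name is invented); the logging configuration adds a timestamp to every line for free:

    import hashlib
    import logging
    import os

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    path = "customers.csv"             # hypothetical import file
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    logging.info("processing %s (%d bytes, md5 %s)", path, os.path.getsize(path), digest)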

Detailed Steps

  • Do not write or delete anything in your target database until all steps are validated
  • Check that all the files needed are present
  • For each file:
    • Check that the first line of the file contains the column headers, and that they are what you expected to receive.
    • For each row of the current file, check that each individual field matches the data type AND the size of what is expected in the database.
  • Decide how to manage an incoherence: Do you fail the line and carry on? Do you fail the complete import? In either case you must log an error somewhere to aid in correcting it.
  • Once you have validated all of the files, confirming they have the correct number of columns and that each column value is of the expected type and length, you can then start importing into the database (a sketch tying these validation steps together follows this list).
  • If you can start a transaction before importing into the database:
    • Start the transaction
    • Run your import in the transaction. If importing via SQL inserts or updates, remember to escape your data!
    • Check that there have been no errors when importing the data
    • If there has been an error, log the error and, if at all possible, the affected file, line, data and the SQL query executed, to aid in correcting the error
    • In case of error, roll back the transaction, otherwise commit. Log the final status of the import.
  • If transactions are not possible:
    • Create or clear a process log working table that records which temporary table you are updating and the status of the procedure. It needs 2 columns, tablename and state.
    • Create or clear a pre-existing temporary import table for each file to import, to ensure that no production data can be changed until you are ready.
    • If importing via SQL inserts or updates, remember to escape your data!
    • Mark in the process log table which table you are inserting data into
    • Check that there have been no errors when importing the data into the working table. If there has been an error, log the error and, if at all possible, the affected file, line, data and the SQL query executed, to aid in correcting the error
    • Mark in the process log table that the insert has completed. If there is a database failure or a disconnection mid-process, this tells you where you were and where to restart.
    • If all processes are complete, insert or update your main data from the working tables. For each temp table processed, mark again in the process log table that the data has been sent across. In case of a database outage or disconnection, you will know what dataset you were working on, and you can replay that data again or check for inconsistencies.
    • When all processes are complete, delete or clear both the temp tables and the process log.
  • When the database import part is complete, log a successful import along with the date and time
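
To tie the steps together, here is a sketch of the validate-everything-first driver; the file list and column names are invented, and check_row() is the per-field validator sketched earlier. This version fails the complete import on any incoherence, the stricter of the two options above:

    import csv
    import logging

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    FILES = ["customers.csv"]                    # hypothetical file list
    EXPECTED_HEADERS = ["id", "name", "email"]   # hypothetical column list

    def validate_file(path):
        """Return True only if the header and every row of the file pass their checks."""
        ok = True
        with open(path, newline="") as f:
            reader = csv.reader(f)
            headers = next(reader)
            if headers != EXPECTED_HEADERS:
                logging.error("%s: unexpected headers %r", path, headers)
                return False
            for line_no, row in enumerate(reader, start=2):
                # check_row() is the per-field validator sketched earlier
                for message in check_row(headers, row):
                    logging.error("%s line %d: %s", path, line_no, message)
                    ok = False
        return ok

    # Validate every file, and log every problem, before a single row is written.
    results = [validate_file(path) for path in FILES]
    if all(results):
        logging.info("validation passed, starting database import")
        # ... run the transactional (or working-table) import shown earlier ...
    else:
        logging.error("validation failed, nothing was written to the database")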
