Convert a file full of "INSERT INTO xxx VALUES" into something BULK INSERT can parse



This is a follow-up to my first question, "Porting “SQL” export to T-SQL".

I am working with a 3rd-party program that I have no control over and cannot change. This program exports its internal database into a set of .sql files, each with the format:

INSERT INTO [ExampleDB] ( [IntField] , [VarcharField], [BinaryField])
VALUES
(1 , 'Some Text' , 0x123456),
(2 , 'B' , NULL),
--(SNIP, it does this for 1000 records)
(999, 'E' , null);
(1000 , 'F' , null);

INSERT INTO [ExampleDB] ( [IntField] ,  [VarcharField] , BinaryField)
VALUES
(1001 , 'asdg', null),
(1002 , 'asdf' , 0xdeadbeef),
(1003 , 'dfghdfhg' , null),
(1004 , 'sfdhsdhdshd' , null),
--(SNIP 1000 more lines)

This pattern continues until the .sql file reaches a file size set during the export. The export files are grouped as EXPORT_PATH\%Table_Name%\Export#.sql, where # is a counter starting at 1.

Currently I have about 1.3 GB of data, and I have it exporting in 1 MB chunks (1407 files across 26 tables; all but 5 tables have only one file, and the largest table has 207 files).

Right now I just have a simple C# program that reads each file into RAM and then calls ExecuteNonQuery. The problem is that I am averaging 60 sec/file, which means the entire import will take about 23 hours.

I assume that if I could somehow format the files to be loaded with BULK INSERT instead of INSERT INTO, it would go much faster. Is there an easy way to do this, or do I have to write some kind of find & replace and keep my fingers crossed that it does not fail on some corner case and blow up my data?

Any other suggestions on how to speed up the inserts would also be appreciated.


UPDATE:

I ended up going with the parse-and-SqlBulkCopy method. Throughput went from 1 file/min to 1 file/sec.
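For anyone following the same route, here is a minimal sketch of such a parse-and-SqlBulkCopy loader. The connection string, table name, and column layout are illustrative placeholders, and the line parsing is deliberately simplified (it assumes no commas inside string values and no NULLs in the varchar column):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

public static class BulkLoader
{
    // Reads one Export#.sql file, collects the value rows into a DataTable,
    // and pushes them to the server in a single bulk operation.
    public static void LoadFile(string path, string tableName, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("IntField", typeof(int));
        table.Columns.Add("VarcharField", typeof(string));
        table.Columns.Add("BinaryField", typeof(byte[]));

        foreach (var line in File.ReadLines(path))
        {
            if (!line.StartsWith("(")) continue;                 // skip the INSERT INTO headers
            var inner = line.Substring(1, line.LastIndexOf(')') - 1);
            var parts = inner.Split(',');                        // simplified: breaks on commas inside strings
            table.Rows.Add(
                int.Parse(parts[0]),
                parts[1].Trim().Trim('\''),
                ParseBinary(parts[2].Trim()));
        }

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = tableName;
            bulk.BatchSize = 5000;                               // tune to taste
            bulk.WriteToServer(table);
        }
    }

    // "0x1234..." -> byte[], "null"/"NULL" -> DBNull
    public static object ParseBinary(string token)
    {
        if (token.Equals("null", StringComparison.OrdinalIgnoreCase)) return DBNull.Value;
        var hex = token.Substring(2);
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return bytes;
    }
}
```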

2012-04-03 22:20
by Scott Chamberlain
Ensure transactions are being used -- I am assuming there is only one INSERT INTO per file, but regardless: first verify that the slowness is not simply caused by running without explicit transactions. It might be easiest to turn the data into CSV first, as most bulk-loading tools understand CSV. Also ensure the chosen clustered index is not pathological, thrashing IO on inserts - NoName 2012-04-03 22:26
@pst There is more than one INSERT INTO per file: there is one per 1000 rows, because if you attempt to insert more than that you get the error "The number of row value expressions in the INSERT statement exceeds the maximum allowed number of 1000 row values." My distilled question is: is there an easy way to convert to CSV, or do I have to write some kind of find & replace and keep my fingers crossed that it does not fail on some corner case and blow up my data? - Scott Chamberlain 2012-04-03 22:31
@pst Can you elaborate on how transactions would help speed it up? Should I do one transaction per file, or keep one transaction open and commit it when all of the files have been parsed? Also, how would I check for IO thrashing? - Scott Chamberlain 2012-04-03 22:34
Just make sure the clustered index doesn't have to be continuously reorganized (e.g. the backing keys are "generally increasing" and not random). I would just write the "to CSV" converter already; SQL is a relatively simple syntax. The basic cases for values are: a number (starts with a digit, and might be hex), null (in any case), or a string (starts with ', terminated by a ' not followed by another '). It should take about 10 minutes to write - NoName 2012-04-03 22:55
As far as transactions go, they don't sound like the issue: they would be if each insert were a single row, but at batches of 1000 that overhead is minimized. You might not want to let a transaction get too big, though -- I am not sure what the ultimate considerations for transaction size are, as my "big" inserts are only about 50k records at a time - NoName 2012-04-03 22:58
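To make the per-file transaction idea from this thread concrete, here is a hedged sketch. Splitting statements on ";" line endings is a simplifying assumption of mine; real export files may need sturdier batching:

```csharp
using System;
using System.Data.SqlClient;
using System.IO;

public static class TransactionalRunner
{
    // Runs every INSERT batch in a file inside one transaction, so the
    // ~1000-row statements commit together instead of paying a log flush
    // per statement.
    public static void RunFileInTransaction(string path, string connectionString)
    {
        var statements = File.ReadAllText(path)
            .Split(new[] { ";\r\n", ";\n" }, StringSplitOptions.RemoveEmptyEntries);

        using (var con = new SqlConnection(connectionString))
        {
            con.Open();
            using (var tx = con.BeginTransaction())
            {
                foreach (var statement in statements)
                {
                    using (var cmd = new SqlCommand(statement, con, tx))
                        cmd.ExecuteNonQuery();
                }
                tx.Commit();
            }
        }
    }
}
```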



Well, here is my "solution" for helping convert the data into a DataTable or otherwise (run it in LINQPad):

var i = "(null, 1 , 'Some''\n Text' , 0x123.456)";
var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";
Regex.Matches(i, pat,
      RegexOptions.IgnoreCase | RegexOptions.Singleline).Dump();

The match should be run once per value group (e.g. (a, b, etc.)). Parsing the results (e.g. type conversion) is left to the caller, and I have not tested it [much]. I would recommend creating the correctly-typed DataTable first -- although it may be possible to pass everything "as a string" to the database -- and then using the column information to help with the extraction process (possibly via type converters). For the captures: n is null, w is a word (e.g. a number), and s is a string.
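For example, running the pattern over one value group and reading the captures might look like this (the same snippet outside LINQPad, with Dump() swapped for Console.WriteLine):

```csharp
using System;
using System.Text.RegularExpressions;

class RegexDemo
{
    static void Main()
    {
        var input = "(null, 1 , 'Some''\n Text' , 0x123456)";
        var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";

        foreach (Match m in Regex.Matches(input, pat,
                 RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            if (m.Groups["n"].Success)
                Console.WriteLine("null value");
            else if (m.Groups["w"].Success)
                Console.WriteLine("word:   " + m.Groups["w"].Value);                  // number or hex literal
            else
                Console.WriteLine("string: " + m.Groups["s"].Value.Replace("''", "'")); // un-double the quotes
        }
    }
}
```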

Happy coding.

2012-04-03 23:19
by NoName
Thanks, that snippet gets me on the right track. I am actually generating the destination DataTables in SQL from an accompanying XML file, so creating the DataTables in C# will not be a problem either - Scott Chamberlain 2012-04-03 23:26
What is .Dump()? - Scott Chamberlain 2012-04-03 23:31
@ScottChamberlain It's an extension method added by LINQPad to show the result. (This should run as "C# Statements" context in LINQPad). I have added the link to the main answer - NoName 2012-04-04 00:32



Apparently your data is always wrapped in parentheses and starts with a left parenthesis. You could use that rule to split each of those lines (with StringSplitOptions.RemoveEmptyEntries) and load the values into a DataTable. Then you can use SqlBulkCopy to copy everything into the database at once.

This approach would not necessarily be fail-safe, but it would certainly be faster.

Edit: Here's how you could get the schema for every table:

private static DataTable extractSchemaTable(IEnumerable<String> lines)
{
    DataTable schema = null;
    var insertLine = lines.SkipWhile(l => !l.StartsWith("INSERT INTO [")).First();
    var startIndex = insertLine.IndexOf("INSERT INTO [") + "INSERT INTO [".Length;
    var endIndex = insertLine.IndexOf("]", startIndex);
    var tableName = insertLine.Substring(startIndex, endIndex - startIndex);
    using (var con = new SqlConnection("CONNECTION"))
    {
        using (var schemaCommand = new SqlCommand("SELECT * FROM [" + tableName + "]", con))
        {
            con.Open();
            // SchemaOnly returns no rows; DataTable.Load then yields an empty
            // table with the correct, typed columns ready for Rows.Add.
            using (var reader = schemaCommand.ExecuteReader(CommandBehavior.SchemaOnly))
            {
                schema = new DataTable(tableName);
                schema.Load(reader);
            }
        }
    }
    return schema;
}

Then you simply need to iterate over each line in the file, check whether it starts with (, and split it with Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries). Then you can add the resulting array to the created schema table.

Something like this:

var allLines = System.IO.File.ReadAllLines(path);
DataTable result = extractSchemaTable(allLines);
foreach (String line in allLines)
{
    if (line.StartsWith("("))
    {
        // take everything between the outer parentheses
        String data = line.Substring(1, line.LastIndexOf(")") - 1);
        // note: a naive comma split breaks if a string value contains a comma
        var fields = data.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
        // you might need to parse the values to the correct DataColumn.DataType
        result.Rows.Add(fields);
    }
}
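The final SqlBulkCopy step the answer mentions could then be as simple as the following sketch (the connection string is a placeholder, and the destination table name is assumed to match the source table):

```csharp
using System.Data;
using System.Data.SqlClient;

public static class BulkWriter
{
    // Pushes the filled DataTable from the snippet above to the server.
    public static void WriteTable(DataTable result, string tableName, string connectionString)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = tableName;
            bulkCopy.BulkCopyTimeout = 0;   // disable the 30s default for large tables
            bulkCopy.WriteToServer(result);
        }
    }
}
```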
2012-04-03 22:43
by Rango
What is the best way you'd recommend to get the data into a DataTable object? I tried earlier, but I was having trouble figuring out the correct way to go from a line of text to a DataRow - Scott Chamberlain 2012-04-03 23:05
@ScottChamberlain: Edited my answer. I just saw that you can simply get the table name from the file name, so you can skip that part. The tricky part is creating the DataRow from the String[]; maybe you need some helper methods for the conversion, or I've missed a simple way here - Rango 2012-04-04 00:14