SSIS Lessons Learned, Or Not, And Mistakes Beginners Make

It’s been a while since I’ve used SSIS for anything of substantial complexity, and a lot longer than that since I went through the SSIS class that Brian Knight taught back when we were in the training business together. Recently I had a chance to take on a small project with some twists and it’s reminded me, again, of the difference between knowledge and experience, and how beginners often get trapped in the details of syntax in a way that obscures higher level thinking. It has also been a case study on the merits of tenacity (which sounds better than stubbornness) and the challenge of doing what you know versus venturing into the unknown.

The basics were to process a bunch of Excel files, most of which had more than one sheet, ranging from a few thousand rows up to 900k rows. Files could contain unused sheets, and each sheet was supposed to have – but not guaranteed of course – to have the same columns. From there I needed to validate several columns that would match to foreign keys, do a more complicated lookup based on multiple values, and a few other things, and then insert the rows, omitting duplicates and with a hierarchy to which row was kept if there was a dupe. What made it more interesting was that I had to return the Excel files with a results column added and add a result for every row, including the duplicates. Along the way we discovered a few new requirements, some easy, some of the head scratching type.

My goal was to try to do it 100% in SSIS. An easier approach (for me anyway) would be to use SSIS to import the data, run a series of queries to validate/tag the data, do the insert, and then export it all back to a new Excel file. Not much learning doing it the easy way, and it didn’t look too terrible in SSIS, so that made me stick to the 100% SSIS goal for a while, maybe too long!

With that long intro out of the way, some notes/thoughts:

Excel imports as unicode, no easy way to change it short of a follow up cast, ugly
OpenSchema to check sheet header seemed very slow on large wookbooks
Union ALL gets ugly when you have 10 inputs and you change metadata. Union late in design if you can
I started off using the full files so I wouldn’t have a design that failed on large files. That took too long. I switched to small files and yes, later, had a problem with big files.
Suffered from lack of good test files with obvious problems
Renewed my appreciation for how much easier to explain process in SSIS vs a pure code solution
Didn’t see a better way, so tagged dupes using script and a hashet. SORT will remove them, I needed to keep the entire row
My plan was to do a series of tests, at at each test take the bad rows and direct to a catch-all union, adding a result column and reason in flight. Worked ok.
I tested first column of each sheet and if it matched, good enough. The few odd failures after that weren’t a big deal, and weren’t worth more effort right then
Working with script is both great and horrible, waiting on VS to load each time is excruciating, 5, 10, 15 seconds each time. What the heck.
Layout is a pain, as much as I see the value in the flowchart model.
Lookup perf is very very good if you cache all
I needed to add a column with a static value. I did it early because it was easy, should have done late so I didn’t have to carry it through the whole pipeline
For the final export I had to convert back to unicode, was easier to do there because I could do the cast in the SELECT

I finally got stuck when the only way I could see to solve one problem was to use a SORT transform and it just stalled on the biggest pages. Running out of time, patience, I did some minor changes to the package to use most of the work, then land it it in a table near the end, do the final ops, and then pick it back up with a final data flow for the export.

The final lesson? Hard to say. It’s always a challenge to use the right tool for the job, yet know when you can substitute a chainsaw for the circular saw to speed things up, or vice versa. Learned a lot, think I could do it faster if I started over, and remembered that knowing when to stay the course/change continues to be the hardest thing to know. I know that there’s nothing like new challenges to drive learning.

Share this: