Notes From SQLSaturday #248

For once the drive from Orlando to Tampa went smoothly, no delays due to accidents or construction or UFO’s, a good start to the weekend! We checked in at the Hilton Garden Inn about 3 miles from the event in what seems to be one of those ‘hotel areas’ that seems to support both the local hospitals and the University of South Florida. The lobby was full of people in wheel chairs when we arrived, in town for a basketball event. Maybe the most unusual conversation of the weekend was during the elevator ride up, one of the people going up at the same time had an artificial leg and someone else asked how it was holding up. He answered that he ran, played basketball with it, and had been using it for several years. The person asking turns out to be in the business of actually building artificial legs. The asker exits and then, in a moment we can all appreciate, says “I should have asked for his card!”.

Dinner was at Tia’s Tex Mex in their outdoor patio. Good location near the event hotel and the event site. Plenty of room. Food was very slow coming out, about 40 minutes late, but was very good – basically they set up a taco bar so you could make your own. I stayed until about 9, it was getting cool outside and I was – as you might expect – in shorts and a Hawaiian shirt. My only real complaint thinking back is that the music was just a little too loud.

Saturday morning was leisurely, breakfast with the family and reading the paper, arriving about 9:15 in time to catch most of John Welch talking about Hive and Pig.

Hive supports structs, arrays, collections, so you can have complex data in a way that we’re not used to in SQL Server.
Supports indexes and partitions, but not quite the way it works in SQL
Multiple replicas of data on other nodes so you don’t have to worry about a single node failing
Permissions are mostly either read or read/write
Working on distributed node replication as DR solution, bandwidth will be key issue due to volume of data
Sqoop for import/export to Hadoop, not a lot of transforms, not super fast, creates a MapReduce job
PIG is rough equivalent of SSIS, no direct access to SQL
Pull data out of nodes, run through SSIS, or…do it using PIG across all the nodes to produce the results, and then retrieve the results (into SSIS or whatever)
Can use linked server, but not great/good performance

Next I sat through about half of Hadoop on Premise by Jon Bloom. Interesting comments about “who is going to manage/use Hadoop”. Is it the DBA, the ETL person, who? Doesn’t quite fit, today at least, into the world of the SQL DBA, especially if you run Hadoop on Linux. Have to admit I’m trying to see where that goes as well – do I invest time in learning it, or enough to know how to consume it?

I then moved to watch Kevin Boles talk about column stores:

Three different encoding types for compression
Dictionary has information about each million row segment – that way we store the key and not the value
Still cost based, for small numbers of rows still better to do seek/lookup
Hint to force it to NOT use column store index for any part of query (IGNORE_NONCLUSTERED_COLUMN_STORE_INDEX)
No batch mode if MAXDOP 1 (can be huge gain even with MAXDOP 2, far more than cutting time in half)
No bath mode with left join, but can be creative about writing the query to get it for part of it (need to look at deck for link to columnstore tuning guide)
Lots of improvements coming in SQL 2014

At 11 am I did my presentation on PCI for about 20 people, good attendance for a BI centric event.

Lunch was tacos again (same place I think), still good. Lingered over lunch talking shop with people at the table, then checked in with some of the sponsors and even found a new contact based in Orlando, heard about some possible work, and caught with Tim Mitchell for a few minutes.

Next was Brian Mitchell presenting on HDInsight:

Hive useful, but not ANSI compliant, gets translated to MapReduce
Mahout – machine learning
RHadoop – stats processing
Pegasus – graph
Oozie – workflow
Flume – event pipeline
RDP off by default, can be enabled with an expiration date, but ideas is to NOT do this
Can build cluster from Powershell, only create when/as long as needed to run jobs
HDInsight = exact same as Hadoop, code for one will run the other without any changes
Speed advantage comes with more data/more nodes
Advantage that data stored separate from clusters, can point multiple clusters at same storage and can spin down cluster when not being used
Tools better on Linux
Can run locally via HortonWorks code, supported on server OS only

Next up, 5 TSQL commands I’ve Been Missing by Jason Carter, 25 attendees, I’m in for a few minutes to see how it goes. Covering CTE, Output, Merge, Apply, Windowing. I like that there is such good attendance at a beginner session. I then moved over to watch another Big Data presentation by Jason Virtue.

My pick for 3 pm cancelled, so I took a break and caught up on my notes, admiring the building that Jose found on the USF Campus. Great open indoor area for lunch and hanging out, rooms are good, plenty of easy/free parking.

I saw a good handful of people from Orlando at the event and a couple people that drove up from Miami. Have to love the dedication of anyone that will drive 3-1/2 hours to spend a day learning.

I’ve got some thoughts on big data, BI, and Azure that the event helped both clarify and confuse, I’ll try to write more on that later this week. Maybe the biggest is that it feels like it’s time to think about a BI Edition in Orlando. Not sure we need/want one every year, but I think given the rate of change that there is a gap we should try to fill.

Jose Chincilla and his team did a nice job on the event and a very nice job finding a new and better venue. We can and often do make SQLSaturday work anywhere, but a better facility always makes it easier and better.

Share this: