Yesterday was quite a day. I spent most of it working with a tech from a very large shipping company. They provide several tools we use, and we've been working with them for years. We keep seeing occasional transactions where the shipping cost on the order is $900, but when the order is transferred to invoice it becomes $50. That's not a good thing.
Spending the entire day with him was kind of painful, but there was a big payoff for me at the end. Here's what I learned:
This guy has too much access to our systems. He's showing up in the warehouse and implementing upgrades, logging, and who knows what else, and we in IT don't know about it. I'm going to have to find or create a vendor-access policy.
Not all views are prefixed with a "v." I know, that's pretty silly, but I spent five minutes puzzling over why "SELECT TOP 10 * FROM SHIPWS_GOODS" returned results when I couldn't find any such table in SSMS. I should have queried sysobjects the moment I wondered what that object was.
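The catalog lookup is easy to sketch. Here it is in miniature using Python's sqlite3 and SQLite's sqlite_master as a stand-in for SQL Server's sysobjects/sys.objects (the table and view names are invented for illustration; in SSMS you'd run something like `SELECT name, type_desc FROM sys.objects WHERE name = 'SHIPWS_GOODS'`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE goods (id INTEGER, cost REAL)")
conn.execute("CREATE VIEW SHIPWS_GOODS AS SELECT * FROM goods")

# The SELECT works fine even though no such *table* exists...
conn.execute("SELECT * FROM SHIPWS_GOODS").fetchall()

# ...and the catalog tells you why: the object is a view, not a table.
row = conn.execute(
    "SELECT type FROM sqlite_master WHERE name = 'SHIPWS_GOODS'"
).fetchone()
print(row[0])  # view
```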
A poorly defined table is asking for trouble. The history table that logs the tracking number, invoice number, and more for every shipment had no primary key. In the shipper's world the tracking number IS the primary, unique identifier, and there should never have been more than one row per tracking number.
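This is exactly what a primary key constraint buys you. A minimal illustration, again with SQLite standing in for SQL Server (the column names and values here are invented): once TrackingNo is the primary key, a second row for the same tracking number is rejected instead of silently accumulating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE History (
        TrackingNo TEXT PRIMARY KEY,
        InvoiceNo  TEXT,
        Cost       REAL
    )
""")
conn.execute("INSERT INTO History VALUES ('1Z999', 'INV-1', 900.00)")

# A duplicate tracking number now violates the constraint.
rejected = False
try:
    conn.execute("INSERT INTO History VALUES ('1Z999', 'INV-2', 50.00)")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```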
Which brings me to the part I actually liked…how do we fix this mess?
The problem: a table with ~50K rows actually held only ~27K distinct tracking numbers. Many tracking numbers appeared two or three times, and, to make it a little more challenging, some of those "duplicates" carried different data in each row; they were nearly identical, but not quite. So how do I purge the duplicates while keeping as much of the data as possible? It was a mental exercise that I'm sure many DBAs have faced before. I was initially making it too hard, and I owe much thanks and credit to an article on SimpleTalk that crystallized the entire problem and solution.
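Spotting the mismatch is a one-liner: compare COUNT(*) against COUNT(DISTINCT TrackingNo), then GROUP BY/HAVING to list the offenders. A toy sketch (SQLite via Python here; the same queries run unchanged in SQL Server, and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE History (TrackingNo TEXT, InvoiceNo TEXT)")
conn.executemany(
    "INSERT INTO History VALUES (?, ?)",
    [("A1", "INV-1"), ("A1", "INV-9"),   # duplicate with conflicting data
     ("B2", "INV-2"),
     ("C3", "INV-3"), ("C3", "INV-3"), ("C3", "INV-7")],
)

# Total rows vs. distinct tracking numbers (my real case: ~50K vs ~27K).
total, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT TrackingNo) FROM History"
).fetchone()
print(total, distinct)  # 6 3

# Which tracking numbers are duplicated, and how many times.
dupes = conn.execute(
    "SELECT TrackingNo, COUNT(*) FROM History "
    "GROUP BY TrackingNo HAVING COUNT(*) > 1 ORDER BY TrackingNo"
).fetchall()
print(dupes)  # [('A1', 2), ('C3', 3)]
```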
After reading the entire article I got to the section entitled "New Techniques for Removing Duplicate Rows in SQL Server 2005," and that was my aha! moment. I'd seen, but never understood, the OVER and PARTITION BY functionality. Now it makes perfect sense. This is what I ended up doing:
-- 1. Park everything in a quick temp copy
SELECT * INTO HistoryTEMP FROM History;

-- 2. Empty the original and give it the primary key it always needed
TRUNCATE TABLE History;
ALTER TABLE History ADD PRIMARY KEY (TrackingNo);

-- 3. Number each tracking number's rows and re-insert only the first
WITH DeDuplication AS (
    SELECT [CompanyCode], -- this is an abbreviated list of columns
           ROW_NUMBER() OVER (PARTITION BY TrackingNo
                              ORDER BY TrackingNo) AS nr
    FROM HistoryTEMP
)
INSERT INTO History ([CompanyCode]) -- abbreviated list of columns
SELECT [CompanyCode]
FROM DeDuplication
WHERE nr = 1;
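The same pattern works end to end in miniature. This sketch uses Python's sqlite3 (it needs SQLite 3.25 or newer for window functions); the table names mirror mine, but the columns and rows are invented for illustration. ROW_NUMBER() numbers the rows within each TrackingNo partition, so keeping nr = 1 leaves exactly one row per tracking number:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The scratch copy, still full of duplicates.
conn.execute("CREATE TABLE HistoryTEMP (TrackingNo TEXT, InvoiceNo TEXT)")
conn.executemany(
    "INSERT INTO HistoryTEMP VALUES (?, ?)",
    [("A1", "INV-1"), ("A1", "INV-9"),
     ("B2", "INV-2"),
     ("C3", "INV-3"), ("C3", "INV-7")],
)

# The rebuilt History, now with the primary key it always needed.
conn.execute(
    "CREATE TABLE History (TrackingNo TEXT PRIMARY KEY, InvoiceNo TEXT)"
)

# Partition by tracking number, keep only the first row of each group.
conn.execute("""
    INSERT INTO History (TrackingNo, InvoiceNo)
    SELECT TrackingNo, InvoiceNo FROM (
        SELECT TrackingNo, InvoiceNo,
               ROW_NUMBER() OVER (PARTITION BY TrackingNo
                                  ORDER BY TrackingNo) AS nr
        FROM HistoryTEMP
    ) WHERE nr = 1
""")

rows = conn.execute(
    "SELECT TrackingNo FROM History ORDER BY TrackingNo"
).fetchall()
print(rows)  # [('A1',), ('B2',), ('C3',)]
```

Which of the near-duplicate rows survives depends on the ORDER BY inside the OVER clause; ordering by TrackingNo alone (as I did) makes the choice arbitrary among ties, which is why I still had some updating to do afterward.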
The insert happened in a flash, and all that remained was some updating from my HistoryTEMP table to pull back data from the discarded near-duplicates. I've essentially merged and de-duplicated the data, and learned a lot to boot.
Next steps will include tightening the controls for consultants doing work for us!