Follow Us on Twitter

Eliminating duplicates

by Martin Schapendonk on March 3, 2011 · 3 comments

Today I was confronted with the following situation: a table with two columns that contained IDs. The content was more or less something like this:

ID ALIAS
1  2
2  1
1  3
2  2

I had to eliminate the records that pointed to themselves (i.e. ID = ALIAS), because it’s a bit silly to say that you’re an alias of yourself.

I also had to eliminate the records that were essentially duplicates. In this example, the record “2, 1″ is functionally the same record as “1, 2″ and thus I didn’t need it.

Could this be solved using SQL only? I suspected it, but couldn’t find a solution fast enough. I called in the help of my colleagues and within minutes Martin Schaap helped me out:

select id, alias
from t
where id <> alias -- that was the easy part
and not exists ( -- here comes the hard part
  select 1
  from t
  where id = t.alias
  and alias = t.id
  and rowid > t.rowid
);

Thanks Martin!

UPDATE: one day later, Frank Dorst (yes, that’s right, our “I once was a developer” CTO) also sent me an elegant, working solution:

select distinct least(id, alias), greatest(id, alias)
from t
where id <> alias;

Kudos to you too, Frank (you still got it ;-))!


Ratings:
VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)
Tags: , ,

{ 3 comments… read them below or add one }

Scott Wesley March 9, 2011 at 3:27 am

As soon as I saw “eliminating duplicates” I thought of my favoured analytics solution – but it’s bugging me that I can’t (yet) figure out how to use it to solve your particular problem.

This is what I typically use if I have duplicate values in the ID field, disregarding any other fields:

create table t(id number, doesnt_matter number);
insert into t values (1,2);
insert into t values (2,1);
insert into t values (1,3);
insert into t values (3,2);

select * from
(select t.*, row_number() over (partition by idorder by null) rn from t)
where rn= 1;

Reply

Martin Schapendonk March 9, 2011 at 12:42 pm

Hi Scott,

Thanks for sharing your solution to eliminate duplicate ID values. It is very efficient to use analytics to recognize/eliminate duplicate keys, but unfortunately it doesn’t work in this case (as you already noticed). Each ID/ALIAS pair has to be transformed first to determine if it really is a duplicate.

But… your way of thinking optimizes Frank’s solution a bit further as it eliminates the need for the rather heavy DISTINCT:

select *
from (
select t.*
, row_number() over (partition by least(id, alias), greatest(id, alias) order by null) rn
from t
where id <> alias
)
where rn = 1;

However, Martin Schaap’s solution is still the most efficient, as it only involves index scanning and no extra function executions or whatsoever.

Reply

Scott Wesley March 9, 2011 at 3:29 am

And that’s the best c-level to have – one who knows what it takes to write code!

onya Frank

Reply

Leave a Comment

 

Previous post:

Next post:

About Whitehorses
Company profile
Services
Technology

Whitehorses website

Home page
Whitebooks
Jobs

Follow us
Blog post RSS
Comment RSS
Twitter