Follow Us on Twitter

Eliminating duplicates

by Martin Schapendonk on March 3, 2011 · 3 comments

Today I was confronted with the following situation: a table with two columns that contained IDs. The content was more or less something like this:

ID ALIAS
1  2
2  1
1  3
2  2

I had to eliminate the records that pointed to themselves (i.e. ID = ALIAS), because it’s a bit silly to say that you’re an alias of yourself.

I also had to eliminate the records that were essentially duplicates. In this example, the record “2, 1″ is functionally the same record as “1, 2″ and thus I didn’t need it.

Could this be solved using SQL only? I suspected it, but couldn’t find a solution fast enough. I called in the help of my colleagues and within minutes Martin Schaap helped me out:

select id, alias
from t
where id <> alias -- that was the easy part
and not exists ( -- here comes the hard part
  select 1
  from t
  where id = t.alias
  and alias = t.id
  and rowid > t.rowid
);

Thanks Martin!

UPDATE: one day later, Frank Dorst (yes, that’s right, our “I once was a developer” CTO) also sent me an elegant, working solution:

select distinct least(id, alias), greatest(id, alias)
from t
where id <> alias;

Kudos to you too, Frank (you still got it ;-))!

Ratings:
VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)
Tags: , ,

3 comments on “Eliminating duplicates

  1. As soon as I saw “eliminating duplicates” I thought of my favoured analytics solution – but it’s bugging me that I can’t (yet) figure out how to use it to solve your particular problem.

    This is what I typically use if I have duplicate values in the ID field, disregarding any other fields:

    create table t(id number, doesnt_matter number);
    insert into t values (1,2);
    insert into t values (2,1);
    insert into t values (1,3);
    insert into t values (3,2);

    select * from
    (select t.*, row_number() over (partition by idorder by null) rn from t)
    where rn= 1;

    VA:F [1.9.22_1171]
    Rating: 0 (from 0 votes)
    • Hi Scott,

      Thanks for sharing your solution to eliminate duplicate ID values. It is very efficient to use analytics to recognize/eliminate duplicate keys, but unfortunately it doesn’t work in this case (as you already noticed). Each ID/ALIAS pair has to be transformed first to determine if it really is a duplicate.

      But… your way of thinking optimizes Frank’s solution a bit further as it eliminates the need for the rather heavy DISTINCT:

      select *
      from (
      select t.*
      , row_number() over (partition by least(id, alias), greatest(id, alias) order by null) rn
      from t
      where id <> alias
      )
      where rn = 1;

      However, Martin Schaap’s solution is still the most efficient, as it only involves index scanning and no extra function executions or whatsoever.

      VN:F [1.9.22_1171]
      Rating: 0 (from 0 votes)
  2. And that’s the best c-level to have – one who knows what it takes to write code!

    onya Frank

    VA:F [1.9.22_1171]
    Rating: 0 (from 0 votes)

Leave a Reply

Your email address will not be published. Required fields are marked *

*

* Copy This Password *

* Type Or Paste Password Here *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

 

Previous post:

Next post:

About Whitehorses
Company profile
Services
Technology

Whitehorses website

Home page
Whitebooks
Jobs

Follow us
Blog post RSS
Comment RSS
Twitter