How to Remove Duplicates from a List

When working through a CSV or any kind of log file, you may run into lists with a lot of duplicates. Here is an example of the simplest kind.

Say you have a file, duplicates.txt, that contains:

one
two
three
one
four
two
four
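
If you want to follow along, the sample file can be recreated with a single printf. This is only a convenience sketch; the file name matches the one used in this post, and it writes into the current directory.

```shell
# Recreate the sample input used in this post, one entry per line.
printf '%s\n' one two three one four two four > duplicates.txt

# Show the contents: seven lines, four of them distinct.
cat duplicates.txt
```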

So how do you remove the duplicates from a list like the one shown above? If you use a command such as

sort -u < duplicates.txt or cat duplicates.txt | sort | uniq

you end up with a list that has the duplicates stripped out but is sorted alphabetically instead of keeping the original order:

four
one
three
two
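
The sort in the second command is not optional, because uniq only collapses repeated lines that sit next to each other. A quick check, feeding the same seven entries in through printf (an assumption made here so the snippet does not depend on the file existing):

```shell
# uniq alone: no two equal lines are adjacent, so nothing is removed.
printf '%s\n' one two three one four two four | uniq

# sort first: equal lines become adjacent and uniq can collapse them.
printf '%s\n' one two three one four two four | sort | uniq
```

This is why any uniq-based approach has to reorder the list first, and why a separate trick is needed to get the original order back.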

There is, however, a way to remove duplicates from a list and still keep the original order:

nl duplicates.txt | sort -k2 -u | sort -n | cut -f2-

Step 1. First, number the entries in duplicates.txt using nl:

nl duplicates.txt

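This will give you the list below. By default nl pads each number with leading spaces and separates it from the text with a tab; the padding is left out here for readability.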

1 one
2 two
3 three
4 one
5 four
6 two
7 four
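
One thing worth checking before going further: by default nl numbers only non-empty lines, so an input containing blank lines would leave some lines without a number. A minimal sketch, assuming your input may contain blank lines (the three-line sample below is made up for illustration); the -b a option asks nl to number every line:

```shell
# Default behaviour: the empty line in the middle gets no number.
printf 'one\n\ntwo\n' | nl

# With -b a every line is numbered, including the empty one.
printf 'one\n\ntwo\n' | nl -b a
```

The sample duplicates.txt has no blank lines, so plain nl is enough for the rest of this walkthrough.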

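Step 2. Sort the list on the second field (the text), so that identical entries end up next to each other. Lines with the same text appear in line-number order, because sort falls back to comparing the whole line when the keys are equal.

nl duplicates.txt | sort -k2

5 four
7 four
1 one
4 one
3 three
2 two
6 two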

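Step 3. Now remove the lines whose second field is a duplicate by adding -u. GNU sort keeps the first line of each group of equal keys, so the earliest occurrence survives; POSIX does not specify which one is kept, so other implementations may differ.

nl duplicates.txt | sort -k2 -u

5 four
1 one
3 three
2 two

Step 4. Restore the original order by sorting numerically on the line numbers:

nl duplicates.txt | sort -k2 -u | sort -n

1 one
2 two
3 three
5 four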

Step 5. Remove the numbering we inserted in Step 1. cut -f2- keeps everything after the first tab, which is the separator nl inserted:

nl duplicates.txt | sort -k2 -u | sort -n | cut -f2-

one
two
three
four
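
The five steps above can be wrapped into a small shell function. This is only a sketch: the name dedupe_keep_order is made up for illustration and is not a standard command, and the awk line at the end is a different, widely used idiom (not the method from this post) included purely as a cross-check.

```shell
# Hypothetical helper: print the lines of a file with duplicates removed,
# in the order the surviving lines appeared in the input.
dedupe_keep_order() {
    nl "$1" |          # Step 1: prefix each line with its number and a tab
        sort -k2 -u |  # Steps 2-3: sort by text, drop repeated text
        sort -n |      # Step 4: sort by line number to restore input order
        cut -f2-       # Step 5: strip the line number again
}

# Recreate the sample file and run the helper on it.
printf '%s\n' one two three one four two four > duplicates.txt
dedupe_keep_order duplicates.txt

# Cross-check with awk: print a line only the first time it is seen.
awk '!seen[$0]++' duplicates.txt
```

Both commands should print the same four lines for the sample file. The awk version does not depend on which duplicate sort -u decides to keep, which may matter if your sort is not GNU sort.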
