How to Remove Duplicates from a List

When working through a CSV or a log file, you may run into lists with many duplicate entries. Here is a simple example.

Say you have a file, duplicates.txt, that contains:

one
two
three
one
four
two
four

How do you remove the duplicates from a list like the one above? If you use a command such as

sort -u < duplicates.txt or cat duplicates.txt | sort | uniq

you end up with a list that has the duplicates stripped out but is sorted alphabetically, so the original order is lost:

four
one
three
two
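
To reproduce this, the following sketch recreates the sample file and runs the first of the two commands. LC_ALL=C is an addition that is not part of the commands above; it is set only so that the sort order does not depend on your locale.

```shell
# Recreate the sample file from the listing above.
printf '%s\n' one two three one four two four > duplicates.txt

# Sort and drop duplicates; the result is alphabetical,
# not the order of first appearance.
LC_ALL=C sort -u duplicates.txt
# four
# one
# three
# two
```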

There is, however, a way to remove duplicates from a list while still keeping the original order:

nl duplicates.txt | sort -k2 -u | sort -n | cut -f2-

Step 1. First, number the entries in duplicates.txt using nl:

nl duplicates.txt

This will give you the list:

1 one
2 two
3 three
4 one
5 four
6 two
7 four
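
The listing above is simplified. As far as I know, nl with default options right-aligns each number in a six-character column and separates it from the text with a tab, and that tab is what the cut in the last step relies on. A quick way to check this on your own system is to pipe the output through sed -n l, which prints tabs as \t and marks each line end with $. The sample file is recreated here so the snippet runs on its own.

```shell
printf '%s\n' one two three one four two four > duplicates.txt

# Show the first two numbered lines with whitespace made visible.
nl duplicates.txt | head -n 2 | sed -n l
#      1\tone$
#      2\ttwo$
```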

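Step 2. Next, sort the list on everything from the second field onward (the entry itself), so that duplicates end up next to each other:

nl duplicates.txt | sort -k2

5 four
7 four
1 one
4 one
3 three
2 two
6 two

Step 3. Now remove the lines with duplicate entries. With -u, GNU sort keeps the first line of each group of equal keys, so the earliest occurrence of each entry survives:

nl duplicates.txt | sort -k2 -u

5 four
1 one
3 three
2 two

Step 4. Restore the original order by sorting numerically on the line numbers:

nl duplicates.txt | sort -k2 -u | sort -n

1 one
2 two
3 three
5 four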

Step 5. Remove the numbering we inserted in Step 1:

nl duplicates.txt | sort -k2 -u | sort -n | cut -f2-

one
two
three
four
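
The five steps can be wrapped in a small shell function. This is a minimal sketch rather than a hardened tool: the name dedupe_keep_order is made up for illustration, -ba is added so that nl also numbers blank lines (by default it skips them), and LC_ALL=C is added so that keys are compared byte by byte. It also assumes that sort -u keeps the first line of each group of equal keys, which GNU sort documents; other implementations may differ.

```shell
# Hypothetical helper: print the lines of a file with duplicates removed,
# keeping the order of first appearance.
dedupe_keep_order() {
    nl -ba "$1" |               # number every line, including blank ones
        LC_ALL=C sort -k2 -u |  # group by content, keep one line per group
        sort -n |               # back to original order by line number
        cut -f2-                # drop the number and the tab after it
}

printf '%s\n' one two three one four two four > duplicates.txt
dedupe_keep_order duplicates.txt
# one
# two
# three
# four
```

Entries that contain tabs themselves should still come through intact, since cut -f2- keeps every field after the first.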

Author: Ajit Gaddam

Ajit Gaddam is an accomplished technology executive and is currently the Head of Security Engineering at Visa, where he is responsible for building large scale AI driven cybersecurity products, leading engineering programs, and providing expert guidance on cybersecurity matters. He has presented at conferences worldwide, including USENIX Enigma, RSA, Black Hat, Strata Data Hadoop, COSO Dublin, and GCS Ukraine. Ajit has been quoted by major media organizations and his work has been showcased in academic journals, security publications, and in two published books. He is an active participant in various open source and standards bodies, is a prolific inventor of disruptive technologies (over 100+ global patents), and moonlights as an instructor (SANS, community colleges).
