Sometimes when working through a CSV or a log file, you will run into lists with a lot of duplicate entries. Here is a simple example.
Say you have a file duplicates.txt containing:
one
two
three
one
four
two
four
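If you want to follow along, you can recreate this sample file yourself; a minimal sketch (the filename duplicates.txt is the one used throughout this article):

```shell
# Write the seven sample lines, one per line, to duplicates.txt
printf '%s\n' one two three one four two four > duplicates.txt
cat duplicates.txt
```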
Now, how do you remove the duplicates from a list like the one shown above? If you use a command such as
sort -u < duplicates.txt or cat duplicates.txt | sort | uniq
you end up with a list that strips out the duplicates but does not keep the original order:
four
one
three
two
There is, however, a way to remove duplicates from a list while still keeping the original order:
nl duplicates.txt | sort -k2 -u | sort -n | cut -f2-
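Before walking through each step, here is a quick end-to-end check. Assuming GNU coreutils, running the pipeline against the sample file prints the four unique lines in their original order:

```shell
# Recreate the sample file from the article
printf '%s\n' one two three one four two four > duplicates.txt

# Number the lines, dedupe on the text field, restore numeric order,
# then strip the line numbers again.
nl duplicates.txt | sort -k2 -u | sort -n | cut -f2-
# prints: one, two, three, four (each on its own line)
```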
Step 1. First, number the entries in duplicates.txt using nl:
nl duplicates.txt
This will give you the list:
1 one
2 two
3 three
4 one
5 four
6 two
7 four
Step 2. Next, sort the list on the second field (the original text):
nl duplicates.txt | sort -k2
5 four
7 four
1 one
4 one
3 three
2 two
6 two
Step 3. Now remove the lines with duplicate second fields by adding sort's -u option:
nl duplicates.txt | sort -k2 -u
5 four
1 one
3 three
2 two
Step 4. Restore the original order by sorting numerically on the line numbers:
nl duplicates.txt | sort -k2 -u | sort -n
1 one
2 two
3 three
5 four
Step 5. Finally, remove the numbering inserted in Step 1. nl separates the number from the text with a tab, so cut -f2- keeps everything from the second tab-delimited field onward:
nl duplicates.txt | sort -k2 -u | sort -n | cut -f2-
one
two
three
four
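For completeness, the same result can be had in a single pass with awk. This is a different technique from the nl/sort/cut pipeline above, shown here only as a common alternative:

```shell
# Sample file from the article
printf '%s\n' one two three one four two four > duplicates.txt

# Print each line only the first time it is seen.
# seen[$0]++ evaluates to 0 (false) on a line's first occurrence,
# so !seen[$0]++ is true exactly once per distinct line.
awk '!seen[$0]++' duplicates.txt
# prints: one, two, three, four (each on its own line)
```

This keeps the original order automatically, because awk reads the file top to bottom and each line's first occurrence is printed in place.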