One-liner Linux command to extract URLs from text files

I often need to extract URLs from text files, and this is the easiest Linux command-line one-liner I have found for the job.

cat filename | grep http | grep -shoP 'http.*?[" >]' > outfilename

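The first grep throws away lines that contain no "http" before the regular expression runs, which helps reduce CPU load. The second grep uses -P (Perl-compatible regular expressions) to make the match non-greedy: each match stops at the first quote, space, or > after "http", so you get several URLs out of a single line of HTML, each cut as short as possible, instead of one long match. The -o flag prints only the matched text, -h leaves out file names, and -s silences error messages.
With the above, each URL will usually still end in that trailing character, most often a quote. You can easily delete it in your favorite text editor by replacing every quote with nothing.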
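If you would rather skip the text editor, the trailing character can be dropped inside the same pipeline. This is only a minimal sketch, assuming GNU grep (for -P) and a standard sed; the function name extract_urls is made up for illustration and is not part of the original one-liner.

```shell
# extract_urls is a hypothetical helper name used only for this sketch.
# Assumes GNU grep (needed for -P) and a POSIX sed.
extract_urls() {
    # $1: the text or HTML file to read
    grep http "$1" \
        | grep -shoP 'http.*?[" >]' \
        | sed 's/[" >]$//'    # remove the trailing quote, space, or '>'
}

# Usage: extract_urls filename > outfilename
```

Like the original command, this only finds a URL that is followed by a quote, a space, or a >, so a URL sitting at the very end of a line is skipped.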

Simple and short and works well.

This one is a lot more powerful: it searches every file under the current directory for links and writes them to a file one directory level above this one. If you write the links to a file in the same directory, find will pick up the output file as well, and you can go into an infinite loop that fills your hard drive.

find * -type f -exec cat {} \; | grep http | grep -shoP 'http.*?[" >]' > ../outfilename
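The same recursive search can also be sketched with grep's own -r flag instead of find and cat. Treat this as an alternative under assumptions, not as the command above: it needs GNU grep, it skips the "grep http" pre-filter, and it adds -I so binary files are ignored. The function name and its two arguments are made up for illustration.

```shell
# extract_urls_recursive is a hypothetical helper name used only for this sketch.
# Assumes GNU grep (needed for -r, -I and -P).
extract_urls_recursive() {
    # $1: directory to search
    # $2: output file; keep it outside $1 so grep never reads its own output
    grep -rIshoP 'http.*?[" >]' "$1" > "$2"
}

# Usage: extract_urls_recursive . ../outfilename
```

Passing the output path as a separate argument makes it harder to forget that it has to live outside the directory being searched.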

