Linux Part 7
Searching and Extracting Data from Files, and Archiving.
Command Line Pipes.
Sometimes we want to send the output of one command into the input of another command. To do that we use a pipe, written as a vertical bar (|). Pipes are useful, for example, when a cat command produces more output than fits on the screen: if we pipe it into the less command, it shows one screenful (one page) at a time until you page down to the next page.
cat sample | less
You can also pipe the cat command into pg or more in the same way.
Check out the man pages for less, pg and more to get more information about these commands.
Pipes allow complex tasks to be built quickly out of simple commands. The commands in a pipeline run at the same time, and each one processes the output of the previous one as it arrives.
One of the most common commands used with pipes is the grep command. The grep command searches its input for lines that match a keyword or pattern.
The general form of the grep command is:
grep [options] regular-expression [filename ...]
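As a small sketch of a pipe feeding grep, the following uses printf to produce three lines and lets grep keep only the lines that contain a keyword. The sample text is invented purely for illustration.

```shell
# Sample text is made up for illustration only.
matches=$(printf 'alpha error\nbeta ok\ngamma error\n' | grep 'error')
echo "$matches"

# -c makes grep count the matching lines instead of printing them.
count=$(printf 'alpha error\nbeta ok\ngamma error\n' | grep -c 'error')
echo "$count"
```

The same pattern works with any command on the left of the pipe, for example cat sample | grep 'error'.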
I/O Redirection
Input-output redirection
When you need to save a program’s output for future reference you can redirect it to an output file. If a program needs to take its input from a file, you can redirect that file into it.
Some programs rely on input redirection to process data, such as raw text files that are fed through a program to search them for patterns. In addition to redirecting output to files, a program’s output can be used as the input of another program through piping.
Xargs
The xargs command builds command-line arguments from files or from another program’s output; this is helpful when you want to create scripts.
Let’s take a look at some commonly used redirection operators.
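A minimal sketch of xargs turning lines of input into command-line arguments. The file names and the names.list file are hypothetical, and everything happens in a throwaway directory.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# names.list is a hypothetical list of file names, one per line.
printf 'one.txt\ntwo.txt\nthree.txt\n' > names.list

# xargs reads the names from standard input and appends them as
# arguments, so this runs: touch one.txt two.txt three.txt
xargs touch < names.list

# Count the .txt files that now exist.
created=$(ls | grep -c '\.txt$')
echo "$created"
```

The input can just as well come from a pipe, with another command producing the list of names.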
> The greater-than sign. This creates a new file containing standard output. If the specified file exists, it is overwritten with whatever the program outputs.
>> Two greater-than signs. This appends standard output to the file, so it does not overwrite the existing file, it just adds to it. If the file doesn’t exist, it is created. This is very useful for log files, where you want to add data to the end of the file while keeping the previous data.
2> This creates a new file containing standard error, that is, any error messages the program produces. If the specified file exists, it is overwritten as well.
2>> This appends standard error to the file. If the file doesn’t exist, it is created.
&> This creates a new file containing both standard output and standard error (in bash). If the specified file exists, it is overwritten.
< The less-than sign. This sends the contents of the specified file to the program’s standard input. What is standard input on a machine? Normally your keyboard: whatever you are typing is standard input. In this case we use a file as the input instead of typing it ourselves.
<< Two less-than signs (a here document). This accepts the text on the following lines as standard input, so instead of reading from a file the program reads what we have typed in, up to a closing marker line.
<> This causes the specified file to be used for both standard input and standard output, meaning the program can read from and write to that file.
So remember, we have 2 standard output streams
Standard output – Normal program messages
Standard error – Error messages, which we can use as part of our log files.
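The operators above can be tried out safely in a temporary directory. This is only a sketch: the file names (log.txt, err.txt, heredoc_count.txt) are illustrative, and the || true simply stops the deliberate ls failure from ending a script that runs with error checking switched on.

```shell
# Work in a throwaway directory (all file names here are illustrative).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "first"  > log.txt     # >  creates log.txt, or overwrites it if it exists
echo "second" >> log.txt    # >> appends, keeping the earlier line

# ls complains about a missing file; 2> saves that complaint in err.txt.
ls no-such-file 2> err.txt || true

# <  feeds log.txt to wc on standard input.
lines=$(wc -l < log.txt | tr -d ' ')

# << feeds the lines up to the EOF marker to wc on standard input.
wc -l <<EOF > heredoc_count.txt
line one
line two
line three
EOF
heredoc_lines=$(tr -d ' ' < heredoc_count.txt)

echo "log.txt: $lines lines, here document: $heredoc_lines lines"
```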
Using the grep command in the /etc directory and outputting to a file
grep rwelsh /etc/* > userfile.txt
This uses grep to search the files in the /etc directory for the user name rwelsh and writes all the matching lines to a text file.
Another example
grep rwelsh /etc/*
This displays every line where the user name appears, together with the name of the file it was found in. However, if you are a normal user and not the root user, a lot of error messages will appear because you do not have permission to read every file in the /etc directory.
If you don’t care about the errors, or they are not useful, you can redirect them to the null device:
/dev/null
This is a device file that serves as a trash can for all the data we don’t find useful or care about.
Example
grep welsh /etc/* 2> /dev/null
All the error messages are discarded into the null device, so only the matching lines are shown.
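To see the effect without touching /etc, here is a small sketch that searches one file that exists and one that does not. The file names and the sample line are assumptions made for the demonstration.

```shell
workdir=$(mktemp -d)

# A made-up, readable file containing the user name.
printf 'rwelsh:x:1000\n' > "$workdir/readable"

# grep finds the match in the first file and fails on the second one.
# The error message goes to /dev/null, so only the match is captured.
# grep reports a non-zero status because of the error, hence "|| true".
result=$(grep rwelsh "$workdir/readable" "$workdir/missing" 2> /dev/null) || true
echo "$result"
```

Because two file names were given, grep prefixes the matching line with the name of the file it came from.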
Redirecting and Piping
Outputting a directory list to a file
Create a folder named test in the Documents folder
mkdir test
Create the following files in the test directory
~/Documents/test$ touch barry.txt
~/Documents/test$ touch bob
~/Documents/test$ touch example.png
~/Documents/test$ touch firstfile
~/Documents/test$ touch fool
~/Documents/test$ touch video.mpeg
~/Documents/test$ ls
To output the directory listing to a file we use the following
ls > myoutput
Nothing appears on the display, but a file named myoutput is created in the test directory.
To see the contents that were added to the myoutput file we use the following command
cat myoutput
barry.txt
bob
example.png
firstfile
fool
myoutput
video.mpeg
How to count how many lines exist in a file
wc -l myoutput
7 myoutput
Now we will try to overwrite the myoutput file with its own line count, using wc -l
wc -l myoutput > myoutput
:~/Documents/test$ cat myoutput
0 myoutput
The result is 0 because the shell empties (truncates) myoutput when it sets up the redirection, before wc gets a chance to read it.
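One common way around this, sketched here with a hypothetical temporary file name (myoutput.tmp), is to write the result to a different file first and then move it over the original.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A three-line stand-in for the myoutput file.
printf 'a\nb\nc\n' > myoutput

# Write the count to a temporary file, so myoutput is still intact
# while wc reads it, then replace the original with the result.
wc -l < myoutput > myoutput.tmp
mv myoutput.tmp myoutput

cat myoutput
```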
Appending output to the same file
First we write the directory listing into barry.txt, which gives it a line count of 7
ls > barry.txt
cat barry.txt
barry.txt
bob
example.png
firstfile
fool
myoutput
video.mpeg
When we run the ls command again, this time with the append operator >>, the 7 file names are added to the end of barry.txt
ls >> barry.txt
cat barry.txt
barry.txt
bob
example.png
firstfile
fool
myoutput
video.mpeg
barry.txt
bob
example.png
firstfile
fool
myoutput
video.mpeg
Then if we run a line count
wc -l barry.txt
14 barry.txt
Inputting data from a file. When we redirect input from a file, it is as if the data had been typed at the keyboard. To do this we use the less-than operator
wc -l < barry.txt
The result will be
14
Inputting and outputting at the same time using the wc -l command. In this example we take the input from the barry.txt file and send the output to myoutput
wc -l < barry.txt > myoutput
cat myoutput
Result 14
Redirecting errors
If we run the following
~/Documents/test$ ls video.mpg blahahah.foo
ls: cannot access ‘video.mpg’: No such file or directory
ls: cannot access ‘blahahah.foo’: No such file or directory
The errors are printed on the screen.
But if we wish to send the errors to a file we can run this command
ls video.mpeg blahahah.foo 2> errors.txt
The error message is written to the errors.txt file instead of the screen, while normal output (here video.mpeg, which does exist) still appears on the screen. The same thing with a slightly different file name:
:~/Documents/test$ ls video.mpeg blahaha 2> errors.txt
video.mpeg
:~/Documents/test$ cat errors.txt
ls: cannot access ‘blahaha’: No such file or directory
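Sending standard output and standard error to the same file
rwelsh@Ubuntu01:~/Documents/test$ ls video.mpeg blahaha > myoutput 2>&1
2>&1 redirects standard error (2) to wherever standard output (1) is currently going, so both the file listing and the error message end up in myoutput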
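The order of the two redirections matters. The following sketch recreates the situation in a temporary directory (both.txt and stdout_only.txt are names chosen for the example) and compares the two orderings.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch video.mpeg

# stdout is sent to the file first, then stderr is pointed at the same place.
ls video.mpeg blahaha > both.txt 2>&1 || true

# Reversed order: stderr is pointed at the current stdout (the screen),
# and only afterwards is stdout sent to the file.
ls video.mpeg blahaha 2>&1 > stdout_only.txt || true

both=$(wc -l < both.txt | tr -d ' ')
only=$(wc -l < stdout_only.txt | tr -d ' ')
echo "both.txt: $both lines, stdout_only.txt: $only lines"
```

With the reversed order the error message still appears on the screen, and the file holds only the normal output.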
Piping – Running multiple commands together
Example 1
ls | head -4
This will list the current directory and show the first 4 items/files
Example 2
ls | tail -2
This will show the last 2 files in the directory
ls | head -5 | tail -2
List the first 5 files and, from those, show the last 2
ls | head -5 | tail -2 > outputtail.txt
cat outputtail.txt
Best practice is to save the output to a directory other than the one you are listing. The shell creates the output file before ls runs, so if the new file name sorts within the range selected by head and tail, the file itself will appear in the output of the piped commands.
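A sketch of that practice, using two temporary directories: one holding the test files from earlier and a separate one (outdir, a name chosen here) for the result. The head -n 5 and tail -n 2 spellings are the portable forms of head -5 and tail -2.

```shell
workdir=$(mktemp -d)   # the directory being listed
outdir=$(mktemp -d)    # a separate place for the output file
cd "$workdir" || exit 1
touch barry.txt bob example.png firstfile fool video.mpeg

# The output file lives outside the listed directory,
# so it can never show up in its own listing.
ls | head -n 5 | tail -n 2 > "$outdir/outputtail.txt"
cat "$outdir/outputtail.txt"
```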
Basic regular expressions
Regular expressions are a way to describe patterns that a user might want to look for in data files.
Regular expressions are similar to wildcards: where wildcards are used to match filenames, regular expressions are used to search for patterns of text within a file.
There are 2 forms of regular expressions
Basic
Extended
Which one we use depends on the program we use to do our searches; some programs let you use both and some will only let you use the basic form.
The simplest type of regular expression is an alphabetic or alphanumeric string, for example
HWaddr or Linux3
When, for example, we search for HWaddr (hardware address) we might get results such as
HWaddr
This is the HWaddr
The HWaddress is unknown
More advanced expressions
[ ] Brackets
Searching using brackets, as in b[aeiou]g, could result in the following words being found.
bag beg big bog bug
The b tells the search program to look for a “b” first.
[aeiou] specifies that the next character must be one of the vowels a, e, i, o or u, and the g is the last letter to look for.
[-] Range expressions
A range includes a start point and an endpoint. For example
a[2-4]z
This specifies to search for a and z with one of the numbers 2, 3 or 4 between them
a2z
a3z
a4z
. (dot) matches any single character in the position where the . is
a.z could be
a1z a2z aaz abz atz a7z
When we are looking through log files we may want to match text only at the start of a line or only at the end of a line
^ indicates the start of a line
$ indicates the end of a line
This helps to break apart the logs and find information more easily
Repetition
You can follow a full or partial regular expression with a special symbol to denote repetition of the matched item. An asterisk (*) denotes zero or more occurrences of the item before it. Combined with the dot as .* it matches any substring, including an empty one.
What if you are trying to find a . (dot) within a string? In this case we have to escape it with a backslash
file1.txt would be expressed as file1\.txt
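The expressions above can be checked quickly with grep. This sketch counts matches in a short word list that is made up for the purpose.

```shell
# words is a made-up sample list, one entry per line.
words=$(printf 'bag\nbig\nbzg\na3z\na7z\nfile1.txt\nfile1Xtxt\n')

bracket=$(printf '%s\n' "$words" | grep -c 'b[aeiou]g')     # bag, big
range=$(printf '%s\n' "$words" | grep -c 'a[2-4]z')         # a3z
dot=$(printf '%s\n' "$words" | grep -c '^a.z$')             # a3z, a7z
escaped=$(printf '%s\n' "$words" | grep -c 'file1\.txt')    # file1.txt only
unescaped=$(printf '%s\n' "$words" | grep -c 'file1.txt')   # file1.txt and file1Xtxt

echo "$bracket $range $dot $escaped $unescaped"
```

The last two counts differ because an unescaped dot matches any character, including the X in file1Xtxt.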
Archiving files
The most popular archiving formats for files are tar and zip.
Tar (tape archiver)
Used to combine various data files into a single file (an archive file) while the original files remain on the disk. It’s a popular way to back up or archive your data. Archive files can be very large, so we can compress them, and this can be done from within the tar program itself. The resulting archive file is called a tarball. Tarballs are often used to distribute files to many computers at once.
When you use tar you must give it exactly one command (such as create, extract or list), along with any optional qualifiers. Check man tar for the options.
Zip
The following programs are used for compression:
gzip
bzip2
xz
When using these programs to compress or uncompress files, the files will have these extensions
gzip = .gz can be uncompressed with the gunzip program
bzip2 = .bz2 can be uncompressed with the bunzip2 program
xz = .xz can be uncompressed with the unxz program
The tar program provides explicit support for all 3 of these compression formats.
When tar has compressed an archive into one of the 3 formats, you can tell which one by the extension
gzip (.tgz)
bzip2 (.tbz, .tbz2, .tb2)
xz (.txz)
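A minimal sketch of creating, listing and extracting a gzip-compressed tarball. The directory and file names are reused from the earlier test directory, and restore is a name chosen here for the extraction target.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir test restore
touch test/barry.txt test/bob

# c = create, z = compress with gzip, f = name of the archive file
tar -czf test.tgz test

# t = list the contents of the archive without extracting it
listing=$(tar -tzf test.tgz)
echo "$listing"

# x = extract; -C chooses the directory to extract into
tar -xzf test.tgz -C restore
ls restore/test
```

The original files in test are left in place; the archive holds a copy of them. Replacing z with j or J selects bzip2 or xz compression instead.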
Data Search and Extraction
Using grep to search for data within a file
Create a file fruitstand.txt
nano fruitstand.txt
Add the following data
Peter RockMellons $300
Jane GreenMellons $145
Paul Oranges $230
Stefan Grapes $100$
Robert Bananas $100
Patrick Carrots $120
John Tomatoes $100
Sandra Strawberries $320
Witney Kiwis $100
Sarah Peaches $58
Peter apples $2
Press Ctrl + S to save the data (Ctrl + O in older versions of nano), then Ctrl + X to exit
Run cat fruitstand.txt to make sure the data is saved.
Now to search for specific words within a file the command is:
grep 'Oranges' fruitstand.txt
Result
~/Documents$ grep 'Oranges' fruitstand.txt
Paul Oranges $230
Adding -n shows which line each result is on
grep -n 'Oranges' fruitstand.txt
3:Paul Oranges $230
Using grep with Mellons will show 2 lines because it exists more than once
grep -n 'Mellons' fruitstand.txt
1:Peter RockMellons $300
2:Jane GreenMellons $145
Be aware that searches are case sensitive
grep also supports extended regular expressions, so if you wish to slice and dice this data file a bunch of different ways you can do that using grep -E; the E stands for extended.
Example
This grep looks for 2 or more vowels appearing together in the fruitstand.txt file.
grep -E '[aeiou]{2,}' fruitstand.txt
Jane GreenMellons $145
Paul Oranges $230
John Tomatoes $100
Sandra Strawberries $320
Sarah Peaches $58
Searching for a 2 that is not at the end of the line
grep -E '2.+' fruitstand.txt
Paul Oranges $230
Patrick Carrots $120
Sandra Strawberries $320
The pattern 2.+ matches a 2 followed by at least one more character. A line where the 2 is the last character is not listed; only lines where the 2 has something after it are shown.
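Searching for a number at the end of the line
grep -E '2$' fruitstand.txt
Peter apples $2
In a regular expression the $ anchor is written after the text it anchors, so 2$ means a 2 at the end of the line. Written as $2, the $ is not treated as a literal dollar sign; to search for a literal dollar sign it must be escaped, as in \$2.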
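The difference between anchoring at the end of the line and matching a literal dollar sign can be checked with a short sketch. The three sample lines are in the style of fruitstand.txt, with values chosen for illustration.

```shell
# Three sample lines in the style of fruitstand.txt (values are illustrative).
sample=$(printf 'Paul Oranges $230\nPatrick Carrots $102\nPeter apples $2\n')

# 2$ : a 2 that is the last character on the line.
ends_with_2=$(printf '%s\n' "$sample" | grep -E '2$')

# \$2 : a literal dollar sign followed by a 2, anywhere on the line.
literal_dollar_2=$(printf '%s\n' "$sample" | grep -E '\$2')

echo "Lines ending in 2:"
echo "$ends_with_2"
echo "Lines containing \$2:"
echo "$literal_dollar_2"
```

Single quotes around the pattern matter here: they stop the shell from treating $2 as one of its own variables before grep ever sees it.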
Using grep to search for multiple letter combinations
grep -E 'is|or|go|an' fruitstand.txt
Jane GreenMellons $145
Paul Oranges $230
Stefan Grapes $100$
Robert Bananas $100
Sandra Strawberries $320
Witney Kiwis $100
The examples we have used may be silly, but they show how to slice and dice and search for specific terms. These methods can be applied in just the same way when searching huge log files.
Using grep to search for names beginning with E to K
grep -E '^[E-K]' fruitstand.txt
Jane GreenMellons $145
John Tomatoes $100
$ grep -E '^[A-Z]' fruitstand.txt
Peter RockMellons $300
Jane GreenMellons $145
Paul Oranges $230
Stefan Grapes $100$
Robert Bananas $100
Patrick Carrots $120
John Tomatoes $100
Sandra Strawberries $320
Witney Kiwis $100
Sarah Peaches $58
Peter apples $2
Outputting to a file
$ grep -E '^[A-Z]' fruitstand.txt > A-Znames.txt
cat A-Znames.txt
Peter RockMellons $300
Jane GreenMellons $145
Paul Oranges $230
Stefan Grapes $100$
Robert Bananas $100
Patrick Carrots $120
John Tomatoes $100
Sandra Strawberries $320
Witney Kiwis $100
Sarah Peaches $58
Peter apples $2
When you don’t want certain characters
grep -E '^[^JP]' fruitstand.txt
This filters out the lines that begin with J or P. Inside brackets, a leading ^ means “not”.
Result
Stefan Grapes $100$
Robert Bananas $100
Sandra Strawberries $320
Witney Kiwis $100
Sarah Peaches $58
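The searches in this section can be reproduced without an editor. The sketch below assumes fruitstand.txt holds the eleven entries that appear in the outputs above, and it creates the file with a here document. Quoting the EOF marker is the important detail: without the quotes the shell would try to expand $300, $2 and so on, and the dollar amounts would be mangled.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Quoting the marker ('EOF') keeps the shell from expanding the $ amounts.
cat > fruitstand.txt <<'EOF'
Peter RockMellons $300
Jane GreenMellons $145
Paul Oranges $230
Stefan Grapes $100$
Robert Bananas $100
Patrick Carrots $120
John Tomatoes $100
Sandra Strawberries $320
Witney Kiwis $100
Sarah Peaches $58
Peter apples $2
EOF

# -c counts matching lines; -E enables extended regular expressions.
vowel_pairs=$(grep -cE '[aeiou]{2,}' fruitstand.txt)   # two vowels together
e_to_k=$(grep -cE '^[E-K]' fruitstand.txt)             # names starting E to K
not_j_or_p=$(grep -cE '^[^JP]' fruitstand.txt)         # names not starting J or P

echo "$vowel_pairs $e_to_k $not_j_or_p"
```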