Linux Part 7

Searching and Extracting Data from Files, and Archiving

Command Line Pipes.
Sometimes we want to feed the output of one command into the input of another. To do that we use pipes. A pipe is created with the vertical bar character (|). Pipes are useful, for example, when a cat command produces a lengthy output that would scroll past on the screen: if we pipe it into the less command, it shows one screenful (one page) at a time until you page down to the next page.

cat sample | less

You can also pipe the output of cat into pg or more.

Check the man pages for less, pg and more to get more information about these commands.

Pipes let you chain simple commands together to perform complex tasks. The commands in a pipeline run at the same time, each one processing data as soon as the previous command produces it, so no command has to wait for the one before it to finish.

One of the most common commands used with pipes is the grep command. The grep command searches its input for lines that contain a keyword or pattern.

The general form of the grep command is

grep [options] regular-expression [filename ...]
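
As a minimal sketch of the form above, the following pipes a few made-up lines (generated with printf, not taken from any real system) into grep. The -c option used in the second command is a standard grep option that prints the number of matching lines instead of the lines themselves.

```shell
# Hypothetical sample lines, generated with printf so the example is self-contained.
matches=$(printf 'alpha\nbeta\ngamma\nalphabet\n' | grep 'alpha')
echo "$matches"

# -c prints how many lines matched rather than the lines themselves.
count=$(printf 'alpha\nbeta\ngamma\nalphabet\n' | grep -c 'alpha')
echo "$count"
```

No filename is given here, so grep reads whatever arrives through the pipe.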

I/O Redirection

Input-output redirection

When you need to save a program’s output for future reference, you can redirect it to an output file. If a program needs to take data as input, you can redirect that input from a file.

Some programs rely on input redirection to process data, such as raw text files that are fed through a program to search them for patterns. Besides redirecting to and from files, one program's output can be used as another program's input through piping.

Xargs

The xargs command builds command-line arguments from files or from the output of other programs; this is helpful when you write scripts.
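
A minimal sketch of xargs in use, assuming a scratch directory created with mktemp so nothing real is touched. The file names (names.list, one.txt and so on) are made up for illustration.

```shell
workdir=$(mktemp -d)          # scratch directory for the example
cd "$workdir" || exit 1

# names.list is a hypothetical file holding one file name per line.
printf 'one.txt\ntwo.txt\nthree.txt\n' > names.list

# xargs reads the names from standard input and runs: touch one.txt two.txt three.txt
xargs touch < names.list

# The same idea with a pipe: the names that start with t become the arguments of rm.
ls | grep '^t' | xargs rm

ls
```

After this runs, only names.list and one.txt are left, because two.txt and three.txt were handed to rm by xargs.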

Let’s take a look at some commonly used redirection operators.

> The greater-than sign creates a new file containing standard output. If the specified file already exists, it is overwritten with whatever the program outputs.

>> Two greater-than signs append standard output to the file instead of overwriting it. If the file doesn't exist, it is created. This is very useful for log files, where you want to add data to the end of the file while keeping the previous data.

2> Creates a new file containing the standard error output of the program. If the specified file exists, it is overwritten as well.

2>> Appends standard error to the file. If the file doesn't exist, it is created.

&> Creates a new file containing both standard output and standard error. If the specified file exists, it is overwritten.

< The less-than sign sends the contents of the specified file to the program as standard input. What is standard input on a machine? Normally your keyboard: whatever you type is standard input. With < a file is used as the input instead of typing it yourself.

<< Two less-than signs (a here document) accept the text on the following lines as standard input, so instead of reading from a file the program reads what we type, up to an end marker.

<> Opens the specified file for both reading and writing (on standard input by default), meaning the program can read from and write to that file.

So remember, we have 2 standard output streams

Standard output – normal program messages
Standard error – error messages, which we can capture as part of our log files.
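
The operators above can be tried out safely in a scratch directory. The following is a minimal sketch; the file names (out.txt, err.txt, heredoc_count.txt) and the END marker are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "first"  >  out.txt      # > creates out.txt
echo "second" >  out.txt      # > again overwrites it: only "second" remains
echo "third"  >> out.txt      # >> appends: out.txt now holds "second" and "third"

# 2> captures the error message; ls fails on purpose here, so ignore its exit status.
ls no-such-file 2> err.txt || true

lines=$(wc -l < out.txt)      # < feeds out.txt to wc as standard input

# << starts a here document: the lines up to END become standard input for wc.
wc -l << END > heredoc_count.txt
line one
line two
line three
END

echo "$lines"
cat heredoc_count.txt
```

The two counts printed at the end are 2 (the lines in out.txt) and 3 (the lines in the here document).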

Using the grep command on the /etc directory and outputting to a file

grep rwelsh /etc/* > userfile.txt

This uses grep to search the files in the /etc directory for rwelsh and writes all the matching lines to a text file.

Another example

grep rwelsh /etc/*

This displays every line, in every file, where the user name appears. However, if you are a normal user and not the root user, a lot of error messages will appear, because you do not have permission to read all of the files in the /etc directory.

If you don’t care about the errors, or they are not useful, you can redirect them to the null device. There is a device called

/dev/null

This is a device file that serves as a trash can for all the data we don’t find useful or care about.

Example 

grep welsh /etc/* 2> /dev/null

 

Any error messages are discarded into the null device.
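
To see the effect without needing unreadable files in /etc, here is a minimal sketch that provokes an error in a different way: grep is asked to search one file that exists and one that does not. The file names present.txt and missing.txt are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "hello rwelsh" > present.txt

# missing.txt does not exist, so grep writes an error to standard error.
# 2> /dev/null discards that error; the match from present.txt is kept.
# grep reports failure because of the missing file, so ignore its exit status.
found=$(grep rwelsh present.txt missing.txt 2> /dev/null) || true
echo "$found"
```

Only the matching line is printed, prefixed with the name of the file it came from.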

Redirecting and Piping

Outputting a directory list to a file

Create a folder named test in the Documents folder

mkdir test

Create the following files in the test directory

~/Documents/test$ touch barry.txt

~/Documents/test$ touch bob

~/Documents/test$ touch example.png

~/Documents/test$ touch firstfile

~/Documents/test$ touch fool

~/Documents/test$ touch video.mpeg

~/Documents/test$ ls

 

To output the directory listing to a file we use the following

ls > myoutput

Nothing appears on the screen, but a file named myoutput is created in the test directory.

To see the contents that were added to the myoutput file we use the following command

cat myoutput

barry.txt

bob

example.png

firstfile

fool

myoutput

video.mpeg



How to count how many lines exist in a file

wc -l myoutput

7 myoutput

Now we will try to overwrite the myoutput file with its own line count using wc -l

wc -l myoutput > myoutput

:~/Documents/test$ cat myoutput

0 myoutput

The result is 0 because the shell empties (truncates) myoutput when it sets up the redirection, before wc runs, so wc counts the lines of an empty file.
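
One common way to avoid this problem is to write to a temporary file first and rename it afterwards. A minimal sketch, assuming a scratch directory and three lines of placeholder data; myoutput.tmp is a made-up name.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\nb\nc\n' > myoutput       # three lines of placeholder data

# Write the count to a different file, so myoutput is still intact when wc reads it.
wc -l myoutput > myoutput.tmp

# Then replace the original with the result.
mv myoutput.tmp myoutput
cat myoutput
```

This time the file holds the real count (3 myoutput) instead of 0.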

Appending output to the same file

 

First we write the directory listing into barry.txt, which gives it a line count of 7

 

ls > barry.txt

 

barry.txt
bob
example.png
firstfile
fool
myoutput
video.mpeg

 

When we run the ls command again, this time with the append operator >>, the 7 file names are added to the end of barry.txt

 

ls >> barry.txt
cat barry.txt 

 

barry.txt

bob

example.png

firstfile

fool

myoutput

video.mpeg

barry.txt

bob

example.png

firstfile

fool

myoutput

video.mpeg

 

Then if we run a line count

 

wc -l barry.txt

 

14 barry.txt

 

Inputting data from a file. When we redirect input from a file, it is as if the data came from the keyboard. To do this we use the less-than operator

wc -l < barry.txt

The result will be the count on its own, without a file name, because wc only sees standard input

14

 

Inputting and outputting at the same time using the wc -l command. In this example we take the input from barry.txt and send the output to myoutput

 

wc -l < barry.txt > myoutput

 

cat myoutput 

 

Result 14






Redirecting errors

 

If we run the following 

 

~/Documents/test$ ls video.mpg blahahah.foo 

 

ls: cannot access ‘video.mpg’: No such file or directory

ls: cannot access ‘blahahah.foo’: No such file or directory

 

The error is printed on the screen.

But if we wish to send the errors to a file, we can run this command

ls video.mpg blahahah.foo 2> errors.txt

The same errors are written to the errors.txt file

 

cat errors.txt

 

:~/Documents/test$ ls video.mpeg blahaha 2> errors.txt

video.mpeg

:~/Documents/test$ cat errors.txt

ls: cannot access ‘blahaha’: No such file or directory




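Sending errors to the same file as standard output

rwelsh@Ubuntu01:~/Documents/test$ ls video.mpeg blahaha > myoutput 2>&1

2>&1 sends standard error (file descriptor 2) to wherever standard output (file descriptor 1) is currently going, so both the listing and the error end up in myoutput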
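
A minimal sketch of this in a scratch directory, using the same file names as the command above. The directory is created with mktemp and holds only video.mpeg, so blahaha is guaranteed to be missing.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch video.mpeg

# Standard output is pointed at myoutput first, then standard error is sent to the same place.
# ls reports failure because blahaha is missing, so ignore its exit status.
ls video.mpeg blahaha > myoutput 2>&1 || true

both=$(wc -l < myoutput)    # one line for the listing, one for the error
cat myoutput
```

The order matters: written as 2>&1 > myoutput, the error would still go to the screen, because standard error is duplicated before standard output is pointed at the file.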




Piping – Running multiple commands together

 

Example 1 

 

ls | head -4  

 

This lists the current directory and shows the first 4 items/files

 

Example 2 

 

ls | tail -2

This shows the last 2 files in the directory

ls | head -5 | tail -2

This lists the first 5 files and, from those, shows the last 2

 

ls | head -5 | tail -2 > outputtail.txt 

 

cat outputtail.txt

 

Best practice is to save the output to a directory other than the one you are listing. The shell creates the output file before ls runs, so if the new file name falls within the range selected by head and tail, it appears in the output and changes the result of the pipeline.
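
A minimal sketch of that practice, assuming two scratch directories created with mktemp: one that is listed and a separate one (outdir, a made-up name) that receives the result.

```shell
workdir=$(mktemp -d)     # the directory being listed
outdir=$(mktemp -d)      # a separate directory for the result
cd "$workdir" || exit 1
touch barry.txt bob example.png firstfile fool video.mpeg

# First five names, then the last two of those five (head -n 5 is the same as head -5).
# The result is saved outside the listed directory, so it cannot show up in the listing.
ls | head -n 5 | tail -n 2 > "$outdir/outputtail.txt"
cat "$outdir/outputtail.txt"
```

With these six files the saved result is firstfile and fool, and the listing itself is unaffected by the output file.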

 

Basic regular expressions

 

A regular expression is a way to describe patterns that a user might want to look for in data files.

Regular expressions are similar to the wildcards we use to match file names, but they are used to search for patterns within a file.

 

There are 2 types of regular expressions

Basic

Extended

Which one we use depends on the program we use to do our searches; some programs let you use both and some only let you use the basic type.

 

The simplest type of regular expression is a plain alphabetic or alphanumeric string, for example

HWaddr or Linux3

When, for example, we search for HWaddr (hardware address) we might get results such as

HWaddr

This is the HWaddr

The HWaddress is unknown

 

More advanced expressions

 

[ ] Brackets

Searching with a bracket expression such as b[aeiou]g could result in the following words being found.

bag beg big bog bug

The b tells the search program to match the letter “b”.

[aeiou] matches exactly one character that is one of the vowels a, e, i, o or u, and the g is the last letter to match.

This gives bag beg big bog bug as the possible matches
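
A minimal sketch of a bracket expression with grep, using a made-up word list that includes a few words that should not match.

```shell
# byg has no vowel in the middle, bg has nothing in the middle, baag has two vowels.
words=$(printf 'bag\nbeg\nbig\nbog\nbug\nbyg\nbg\nbaag\n')

matched=$(echo "$words" | grep 'b[aeiou]g')
count=$(echo "$words" | grep -c 'b[aeiou]g')
echo "$matched"
echo "$count"
```

Only the five words with exactly one vowel between b and g are printed.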

 

[-] Range expressions

A range includes a start point and an end point. For example

a[2-4]z

This matches an a and a z with one of the digits 2, 3 or 4 between them

a2z

a3z

a4z

 

. (dot) matches any single character in the position where the dot is

a.z could match

a1z a2z aaz abz atz a7z
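
A minimal sketch comparing a range with the dot, run against the same made-up list of strings.

```shell
sample=$(printf 'a1z\na2z\na3z\na4z\na5z\naaz\naz\n')

# Range: only the digits 2, 3 and 4 are accepted between a and z.
range_count=$(echo "$sample" | grep -c 'a[2-4]z')

# Dot: any single character is accepted, but there must be one, so az does not match.
dot_count=$(echo "$sample" | grep -c 'a.z')

echo "$range_count $dot_count"
```

The range matches 3 of the lines and the dot matches 6, everything except az.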

 

When we are looking through log files, we may want to match text only at the start or at the end of a line

^ matches the start of a line

$ matches the end of a line

This helps break the logs apart and find information more easily
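
A minimal sketch of the two anchors, using a few invented log-style lines rather than a real log file.

```shell
log=$(printf 'error: disk full\nno error here\nfinished with error\nerror\n')

starts=$(echo "$log" | grep -c '^error')    # lines that begin with "error"
ends=$(echo "$log" | grep -c 'error$')      # lines that end with "error"
exact=$(echo "$log" | grep -c '^error$')    # lines that contain nothing but "error"

echo "$starts $ends $exact"
```

The line "no error here" is counted by none of the three patterns, because the word is neither at the start nor at the end.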

 

Repetition 

 

You can follow a full or partial regular expression with a special symbol to denote repetition of the matched item. An asterisk (*) denotes 0 or more occurrences of the item before it. Combined with the dot as .* it matches any substring, including an empty one.

What if you are trying to find a literal . (dot) within a string? In this case we have to escape it with a backslash

file1.txt would be expressed as file1\.txt
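
A minimal sketch showing the difference the backslash makes, and how .* behaves. The file names in the list are made up for illustration.

```shell
names=$(printf 'file1.txt\nfile1atxt\nfile12.txt\nfile.txt\n')

# Unescaped dot: matches any character, so file1atxt is matched as well.
unescaped=$(echo "$names" | grep -c 'file1.txt')

# Escaped dot: only a literal dot is accepted after file1.
escaped=$(echo "$names" | grep -c 'file1\.txt')

# .* matches any run of characters, even an empty one, before the literal .txt
star=$(echo "$names" | grep -c 'file.*\.txt')

echo "$unescaped $escaped $star"
```

The counts are 2, 1 and 3: the unescaped pattern is the loosest about the dot, and file1atxt never matches once the dot is escaped.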




Archiving files 

 

The most popular archiving formats for files are tar and zip.

Tar (tape archiver)

Used to archive various data files into a single file (an archive file) while the original files remain on the disk. It’s a popular way to back up or archive your data. Archive files can be large, so we can compress them, and this can be done from within the tar program. An archive created with tar is often called a tarball. Tarballs are often used to distribute files to many computers at once.

When you use tar you must give it exactly one command (for example create, extract or list), optionally followed by qualifiers. Check man tar for the options.
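
A minimal sketch of the three most common tar commands (create, list, extract), assuming a scratch directory. The names backup.tar and restore are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir test restore
touch test/barry.txt test/bob

tar -cf backup.tar test           # c = create an archive, f = name of the archive file
listing=$(tar -tf backup.tar)     # t = list the contents without extracting
tar -xf backup.tar -C restore     # x = extract, -C = directory to extract into

echo "$listing"
ls restore/test
```

The original files in test are still on the disk after the archive is created, and a second copy appears under restore/test after extracting.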

 

Zip 

 

Several programs are used to compress files:

gzip

bzip2

xz

When you use these programs to compress or uncompress, the files have the following extensions

gzip = .gz, can be uncompressed with the gunzip program

bzip2 = .bz2, can be uncompressed with the bunzip2 program

xz = .xz, can be uncompressed with the unxz program

The tar program provides explicit support for all 3 of these compression formats.

The tar program can compress into any of the 3 formats. If a tarball has been compressed with one of them, you can tell by the extension

gzip (.tgz)

bzip2 (.tbz, .tbz2, .tb2)

xz (.txz)
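
A minimal sketch of compression with gzip, both through tar and on a single file. It assumes GNU tar and gzip are installed; the names backup.tgz and notes are made up. The bzip2 and xz variants are only mentioned in a comment, since those programs may not be present on every system.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir test
echo "some data" > test/firstfile

# z tells tar to compress the archive with gzip.
# (With GNU tar, j selects bzip2 and J selects xz in the same way.)
tar -czf backup.tgz test
tar -tzf backup.tgz

# gzip on a single file: notes is replaced by notes.gz, and gunzip restores it.
cp test/firstfile notes
gzip notes
gunzip notes.gz
cat notes
```

Unlike tar, gzip replaces the original file with the compressed one, which is why notes disappears until gunzip brings it back.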

 

Data Search and Extraction

 

Using grep to search for data within a file

Create a file fruitstand.txt

nano fruitstand.txt

 

Add the following data 

 

Peter Melons $300

Paul Oranges $230

Stefan Grapes $100$

Robert Bananas $100

Patrick Carrots $120

John Tomatoes $100

Sandra Strawberries $320

Witney Kiwis $100

Sarah Peaches $58

 

Press Ctrl + S to save the data

Run cat fruitstand.txt to make sure the data is saved.

 

Now to search for specific words within a file the command is:

 

grep 'Oranges' fruitstand.txt

Result

~/Documents$ grep 'Oranges' fruitstand.txt

Paul Oranges $230

grep -n 'Oranges' fruitstand.txt

The -n option shows which line number each result is on

 

3:Paul Oranges $230

 

Searching for Mellons with grep shows 2 lines because the word appears more than once

grep -n 'Mellons' fruitstand.txt

 

1:Peter RockMellons $300

2:Jane GreenMellons $145

 

Be aware that searches are case sensitive.

Grep also supports regular expressions, so if you wish to slice and dice this data file in a bunch of different ways you can do that using grep -E. The E stands for extended regular expressions.
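
A minimal sketch of case sensitivity, using two lines in the style of the fruit stand data. The -i option shown here is a standard grep option that ignores case; it is not used elsewhere in these notes.

```shell
data=$(printf 'Paul Oranges\nSarah Peaches\n')

# Lowercase "oranges" does not match "Oranges"; grep reports failure when nothing
# matches, so ignore its exit status.
sensitive=$(echo "$data" | grep -c 'oranges') || true

# With -i the case of the letters is ignored.
insensitive=$(echo "$data" | grep -ci 'oranges')

echo "$sensitive $insensitive"
```

The first count is 0 and the second is 1.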

 

Example 

This grep looks for 2 or more vowels next to each other in the fruitstand.txt file.

grep -E '[aeiou]{2,}' fruitstand.txt

 

Jane GreenMellons $145

Paul Oranges $230

John Tomatoes $100

Sandra Strawberries $320

Sarah Peaches $58
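
A minimal sketch of the same repetition on a few sample lines, also showing how it would be written as a basic regular expression (without -E), where the braces need backslashes.

```shell
data=$(printf 'Paul Oranges\nJohn Tomatoes\nWitney Kiwis\nRobert Bananas\n')

# Extended syntax: {2,} means two or more of the preceding item.
extended=$(echo "$data" | grep -cE '[aeiou]{2,}')

# Basic syntax: the same repetition is written with escaped braces.
basic=$(echo "$data" | grep -c '[aeiou]\{2,\}')

echo "$extended $basic"
```

Both forms count 2 lines here: Paul (au) and Tomatoes (oe).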

 

Searching for a 2 that is followed by at least one more character

grep -E '2.+' fruitstand.txt

Paul Oranges $230

Patrick Carrots $120

Sandra Strawberries $320

The pattern 2.+ means a 2 followed by one or more of any character. Lines where the 2 is the very last character are not shown; only lines with a 2 and at least one more character after it are listed.

 

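Searching for a number at the end of the line

grep -E '2$' fruitstand.txt

Peter apples $2

The $ anchor goes after the character it applies to: 2$ means a 2 at the very end of the line. Written the other way round, $2 does not do this. A line such as Patrick Carrots $102 is only shown if the 2 really is the last character; a trailing space after the 2 prevents the match.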
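
A minimal sketch of end-of-line matching on three sample lines. The second line deliberately ends with a space after the 2, which is one possible reason a line like this might not be found.

```shell
data=$(printf 'Paul Oranges $230\nPatrick Carrots $102 \nPeter apples $2\n')

# 2$ : the 2 must be the very last character, so the line with the trailing space is missed.
strict=$(echo "$data" | grep -cE '2$')

# 2 *$ : allows any number of spaces between the 2 and the end of the line.
relaxed=$(echo "$data" | grep -cE '2 *$')

# \$2 : a backslash turns $ into a literal dollar sign followed by a 2.
literal=$(echo "$data" | grep -cE '\$2')

echo "$strict $relaxed $literal"
```

The counts are 1, 2 and 2. The literal pattern matches $230 and $2, because in both the dollar sign is followed directly by a 2.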





Using grep to search for multiple letter combinations

grep -E 'is|or|go|an' fruitstand.txt

 

Jane GreenMellons 145

Paul Oranges 30

Stefan Grapes 100$

Robert Bananas 100

Sandra Strawberries 320

Witney Kiwis 100

 

The examples we have used may be silly, but they show how to slice and dice and search for specific terms. These methods apply just the same when searching huge log files.
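
A minimal sketch of the | alternation on four sample lines, to show which ones are picked up and why.

```shell
data=$(printf 'Witney Kiwis\nSarah Peaches\nPaul Oranges\nPeter Melons\n')

# A line matches if it contains any one of the four letter pairs.
hits=$(echo "$data" | grep -E 'is|or|go|an')
echo "$hits"
```

Witney Kiwis matches because of "is" and Paul Oranges because of "an". The "Or" in Oranges does not count as "or", since the search is case sensitive.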



Using grep to search for names beginning with the letters E to K

grep -E '^[E-K]' fruitstand.txt

Jane GreenMellons 145

John Tomatoes $00

$ grep -E '^[A-Z]' fruitstand.txt

Peter RockMellons 300

Jane GreenMellons 145

Paul Oranges 30

Stefan Grapes 100

Robert Bananas 100

Patrick Carrots 102 

John Tomatoes 00

Sandra Strawberries 320

Witney Kiwis 100

Sarah Peaches 58

Peter apples 2

 

Outputting to file

 

$ grep -E '^[A-Z]' fruitstand.txt > A-Znames.txt



cat A-Znames.txt

 

Peter RockMellons 300

Jane GreenMellons 145

Paul Oranges 30

Stefan Grapes 100$

Robert Bananas 100

Patrick Carrots 102 

John Tomatoes $00

Sandra Strawberries 320

Witney Kiwis 100

Sarah Peaches 58

Peter apples 2

 

When you don’t want certain characters

grep -E '^[^JP]' fruitstand.txt

This filters out lines that begin with J or P. Inside the brackets, a leading ^ means any character except the ones listed.

 

Result 

Stefan Grapes 100$

Robert Bananas 100

Sandra Strawberries 320

Witney Kiwis 100

Sarah Peaches 58
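
A minimal sketch that puts the letter range and the negated set side by side on a short list of first names. Setting LC_ALL=C is an assumption added here so that the range E-K means exactly the uppercase letters E to K regardless of the system's language settings.

```shell
data=$(printf 'Peter\nJane\nStefan\nRobert\nJohn\n')

# Names that begin with a letter from E to K.
in_range=$(echo "$data" | LC_ALL=C grep -E '^[E-K]')

# Names that do not begin with J or P.
not_jp=$(echo "$data" | LC_ALL=C grep -E '^[^JP]')

echo "$in_range"
echo "$not_jp"
```

The first search prints Jane and John; the second prints Stefan and Robert.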