Skip to content

Latest commit

 

History

History
242 lines (139 loc) · 8.45 KB

file-manipulation.rst

File metadata and controls

242 lines (139 loc) · 8.45 KB

File manipulation

oxford comma

count words in file

len(open('data/test.txt', 'r').read().split())

Returns the number of words in a text file, test.txt. open('data/test.txt', 'r').read() gets us the text of the file. We get the word using .split(), splits the text by any whitespace and returns a list of words. Empty strings are removed from the results. Then, len gives us the count of the words.

To run it for arbitrary files

$ python -c "import sys; print(len(open(sys.argv[1], 'r').read().split()))" data/test.txt
47

count lines in file

len(open('data/test.txt', 'r').read().split('\n'))

Returns the number of lines in a text file, test.txt open('data/test.txt', 'r').read() gets us the text of the file, we get the line by using .split('\n'), which splits the text at every new line character \n and returns a list. Calling len on the lists gives us the count of lines.

To run it for arbitrary files

$ python -c "import sys;len(open(sys.argv[1], 'r').read().split('\n'))" data/test.txt
54

add spaces after punctuation

def repl(*args): obj=args[0]; return obj.string[obj.start()]+' '+obj.string[obj.start()+1]
import re, string; re.sub('['+string.punctuation+'][a-zA-Z0-9]+', repl, "this'will;be.formatted,with! spaces")

Parse text and add a space after punctuations if its not present. If the space after the punctuation is present it will remain intact. re.sub calls the method repl at each encounter of a punctuation followed by a letter or a number as specified by the regular expression '['+string.punctuation+'][a-zA-Z0-9]+'. repl add a space between the punctuation and the letter/number and return it.

add line numbers to text file

out=open('data/test-out.txt', 'w')
for i, j in enumerate(open('data/test.txt', 'r')): out.write(str(i+1) + j)
out.close()

Creates a new file with line number before each line.

add line numbers to text file, don't number empty lines

out=open('data/test-out.txt', 'w'); i=0
for line in open('data/test.txt', 'r'):
    if line.strip(): out.write(str(i)+line);i+=1
    else: out.write(line)
out.close()

Creates a new file with line number before each line. If the line is empty it will be skipped. line.strip() will remove leading and trailing whitespaces from the line. If its an empty line line.strip() will be an empty string.

delete trailing spaces

python -c "print(' trailing spaces will be removed.                        '.rstrip())"

.rstrip() method is applied on strings to strip all the trailing whitespace on the right side of the string.

delete multiple newlines between paragraphs to keep only one line

out=open('data/out-single-line-gap.txt', 'w')
out.write((re.sub('(\n\n)[\n]*', '\n\n', open('data/test.txt','r').read())))

Delete multiple new lines from a file between paragraphs and save it in a new file.

To run it for arbitrary files

$ python -c "import sys;out=open('data/out-single-line-gap.txt', 'w');out.write((re.sub('(\n\n)[\n]*', '\n\n', open(sys.argv[1],'r').read())))" data/test.txt

first ten lines of file

open('data/100west.txt', 'r').read().split('\n')[:10]

Returns first 10 lines of a file. split('\n') will return the file as a list where each item of the list is a line from file. [:10] will perform a slicing operation on the list to return elements from index 0 to 9.

To run it for arbitrary files

$ python -c "import sys; open(sys.argv[1], 'r').read().split('\n')[:10]" data/test.txt

last ten lines of file

open('data/100west.txt', 'r').read().split('\n')[-10:]

Returns last 10 lines of a file split('\n') will return the file as a list where each item of the list is a line from file. [-10:] will perform a slicing operation on the list. Negative indexing, [-n], denotes nth element from the last. Here, the slicing operation will return last 10 items of the list i.e. last 10 lines of the file.

To run it for arbitrary files

$ python -c "import sys; open(sys.argv, 'r').read().split('\n')[-10:]" data/test.txt

Reverse a file line by line

python -c "import sys;c=open('reverse.txt','w');c.write('\n'.join([x for x in open(sys.argv[1], 'r').read().split('\n')[::-1]]));c.close()" data/100west.txt

Creates a file with line in reverse order from that of the input file.

split('\n') will return the file as a list where each item of the list is a line from file. [::-1] will reverse the list.

Get alternate lines from files starting from the top

python -c "import sys;c=open('alternate.txt','w');c.write('\n'.join([x for x in open(sys.argv[1], 'r').read().split('\n')[::2]]));c.close()" data/100west.txt

Creates a new file with alternate lines of the input file. Here, slicing operation, [::2], is defined with a step=2. Which will extract every 2nd item of the list. Thus, creating a list of every alternate line from a file.

Find the most common words in a file

python -c "import string,collections,sys,re;z=re.sub('[\n{}]'.format(string.punctuation), '',  open(sys.argv[1],'r').read().lower());x=collections.Counter(z.split(' '));del x[''];print(sorted(x.items(), key=lambda kv: kv[1], reverse=True)[:15])"

Returns the 15 most used words in a text file. re.sub('[\n{}]'.format(string.punctuation), '', open(sys.argv[1],'r').read().lower()) this code will strip all the punctuations and newline characters from the text and .lower() will convert the text into lowercase.

z.split(' ') will return a list of all the words in the given file.

collections.Counter will take the list and creates a list of dictionary as [{<item>: <frequency of item in the list>}, ...]

The list of dictionary will be sorted by value with sorted(x.items(), key=lambda kv: kv[1], reverse=True) and slicing [:15] will return top most common words in the file.

Find the lines which match a specified text

python -c "import sys;search_text=input('Enter text to search:  ');print([x for x in open(sys.argv[1], 'r').read().split('\n') if search_text.lower() in x.lower()])" data/100west.txt

Returns all the lines containing an input text.

Convert file permissions to octal

rw-r--r-- = 644 rwxrwxrwx = 777

python -c "_input=input('Enter file permissions: ');print(''.join([{'r--': '4', 'rw-': '6', 'r-x':'5', 'rwx':'7'}[x] for x in [_input[i:i+3] for i in range(0, len(_input), 3)]]))"

Program to convert unix file permissions rwxr--r-- to octal system. r, w and x represents read, write and execute permissions. The permissions are divided into 3 types, owner permission, group permission and others permission. Example, if the file has these permissions, rw-r--r--, the owner can read/write that file but can not execute it. Member of a group to which that file belongs can only read the file. Similarly, others (who is neither the owner nor a member of the group to which the file belongs) also can only read that particular file.

Convert octal file permissions to rwx format

644 = rw-r--r-- 777 = rwxrwxrwx

python -c "_input=input('Enter file permissions: ');print(''.join([{'4':'r--', '6':'rw-','5':'r-x','7':'rwx'}[x] for x in str(_input)]))"

Program to convert octal file permissions 644 to unix permissions. 4, 5 and 6 and 7 represents r--, r-x, rw- and rwx permissions respectively.