I have tons of Word documents, and recently I wanted to grep
through them. Poking around, I could not
find a good tool to get the job done.
Then I found a post that suggested converting the Word documents to text
files and then grepping those. I thought that
was a good idea, so I decided to go that route.
One tool I found that converts .docx files to text is
docx2txt (http://docx2txt.sourceforge.net/) [1].
Ubuntu Install
Installing it on Ubuntu is pretty easy:
> sudo apt-get install docx2txt
Cygwin Install
Installing it on Cygwin takes a few more steps. Here are the commands I ran to get it
installed.
> curl -L "http://downloads.sourceforge.net/docx2txt/docx2txt-1.4.tgz?download" -o docx2txt-1.4.tgz
> tar -xvzf docx2txt-1.4.tgz
> cp docx2txt-1.4/docx2txt.pl /usr/bin/docx2txt
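After either install, it can be worth confirming that the command is actually on the PATH before batch-converting anything. This is only a minimal sketch; have_command is a helper name made up for illustration and is not part of docx2txt.

```shell
# Succeeds only if the named command can be found on the PATH.
# have_command is a hypothetical helper name, not something docx2txt provides.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command docx2txt; then
    echo "docx2txt is installed"
else
    echo "docx2txt was not found on the PATH"
fi
```

If the copied script is found on Cygwin but will not run, one possible cause (an assumption, not something I hit) is a missing execute bit, which chmod +x /usr/bin/docx2txt would fix.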
Convert them all in place
If you want to simply batch convert a ton of .docx files and
create each .txt file in the same folder as the original .docx file, run
this simple command.
> find `pwd` -iname "*.docx" | xargs -I{} docx2txt {}
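One caveat with the xargs form: xargs still treats quote characters and backslashes in its input specially, so a file name containing an apostrophe can make it stop with an error. A sketch of an alternative that lets find run the converter itself is below. The function name convert_in_place and the CONVERTER variable are made up here for illustration; CONVERTER only exists so the function can be tried with a stand-in command.

```shell
# Convert every .docx under a directory, leaving each .txt next to its source.
# convert_in_place and CONVERTER are hypothetical names used for this sketch;
# CONVERTER defaults to docx2txt.
convert_in_place() {
    local root="${1:-.}"
    local converter="${CONVERTER:-docx2txt}"
    # -exec hands each path to the converter as a single argument, so spaces
    # and quotes in file names need no extra escaping.
    find "$root" -type f -iname '*.docx' -exec "$converter" {} \;
}
```

Running convert_in_place with no argument works on the current directory, much like the one-liner above.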
Now test
Convert them all in another directory
What if you do not want to place all the .txt files
in the same location as the .docx files, but instead mirror the folder structure in a new folder?
Well, I spent a while trying to make a neat one-liner… but eventually gave up on that idea and made
this script.
#!/bin/bash
#
#  Simple script to convert docx files to txt files
#  and put them in a new folder.
#  To change the folder, edit the NEW_FOLDER variable.
#
##################################################
NEW_FOLDER="txtFiles"
BASE_FOLDER=$(pwd)

find "$BASE_FOLDER" -iname "*.docx" |
while IFS= read -r docxfile
do
    # Path relative to the base folder
    rel="${docxfile#"$BASE_FOLDER"/}"
    # Same relative path under NEW_FOLDER, with the extension swapped for .txt
    # (also handles upper-case .DOCX, which -iname matches)
    TXT_FILE="$BASE_FOLDER/$NEW_FOLDER/${rel%.*}.txt"
    DIR=$(dirname "$TXT_FILE")
    echo "$TXT_FILE"
    mkdir -p "$DIR"
    docx2txt "$docxfile" "$TXT_FILE"
done
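The heart of the script is the path mapping: take the .docx path, swap the extension for .txt, and splice the new folder in right after the base folder. Here is that step in isolation as a small sketch. It uses bash parameter expansion rather than sed, and mirror_path and the example paths are made up for illustration.

```shell
# Map a .docx path under a base folder to the matching .txt path under
# base/new_folder. mirror_path is a hypothetical helper name.
mirror_path() {
    local base="$1" new_folder="$2" docxfile="$3"
    local rel="${docxfile#"$base"/}"    # path relative to the base folder
    # ${rel%.*} drops the extension, whatever its case
    printf '%s\n' "$base/$new_folder/${rel%.*}.txt"
}

mirror_path /home/me/docs txtFiles "/home/me/docs/reports/Q1 plan.DOCX"
# prints /home/me/docs/txtFiles/reports/Q1 plan.txt
```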
This script places all the converted text files into a
folder called txtFiles. Now you can just
grep there.
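For the grep step itself, a recursive, case-insensitive search that prints only the matching file names is usually enough to find the right document. A minimal sketch follows; grep_converted is a made-up helper name, and it assumes a grep that supports -r, which the GNU grep on Ubuntu and Cygwin does.

```shell
# List the converted files under a folder that mention a search term.
# grep_converted is a hypothetical helper name; the folder defaults to txtFiles.
grep_converted() {
    local term="$1"
    local folder="${2:-txtFiles}"
    # -r recurses into subfolders, -i ignores case, -l prints only file names
    grep -ril -- "$term" "$folder"
}
```

For example, grep_converted budget lists every .txt under txtFiles that mentions "budget" in any letter case. Each hit sits at the same relative path as the original .docx, so it is easy to trace back.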
References
[1] docx2txt home page, http://docx2txt.sourceforge.net/ (accessed 04/2016)