Grep Word docx files

Posted on Wednesday, April 20, 2016

I have tons of Word documents and recently I wanted to grep through them. Poking around, I could not find a good tool to get the job done. Then I found a post that suggested converting the Word documents to text files and then grepping those. I thought that was a good idea, so I decided to go that route.

One tool I found that converts docx files to text is docx2txt [1].

Ubuntu Install

Installing it on Ubuntu is pretty easy:

 > sudo apt-get install docx2txt

Cygwin Install

Installing it on Cygwin takes a few more steps. Here are the commands I ran to get it installed.

 > curl -L -o docx2txt-1.4.tgz
 > tar -xvzf docx2txt-1.4.tgz
 > cp docx2txt-1.4/ /usr/bin/docx2txt
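
On either system it is worth checking that the shell can actually find the tool before converting anything. This is only a minimal sketch: have_cmd is a made-up helper name, not something that ships with docx2txt, and it only checks the PATH, not whether the tool runs correctly.

```shell
# Report whether a command can be found on the PATH.
# have_cmd is a hypothetical helper name used only for this example.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd docx2txt; then
  echo "docx2txt found at $(command -v docx2txt)"
else
  echo "docx2txt is not on the PATH yet"
fi
```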

Convert them all in place

If you simply want to batch convert a ton of .docx files and create each .txt file in the same folder as its original .docx file, run this command. The -print0 and -0 options keep file names with spaces or quotes in one piece.

 > find "$(pwd)" -iname "*.docx" -print0 | xargs -0 -I{} docx2txt {}

Now test
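
A quick way to test is to grep the freshly created .txt files. This is a minimal sketch, assuming GNU grep (for the --include option). search_txt is a made-up helper name and "invoice" is only a placeholder search term.

```shell
# search_txt TERM [DIR]: list the .txt files under DIR (default: the
# current directory) whose contents mention TERM, ignoring case.
# search_txt is a hypothetical helper name used only for this example.
search_txt() {
  grep -ril --include='*.txt' -- "$1" "${2:-.}"
}

# Example, run from the folder that holds the converted files:
#   search_txt "invoice"
```

If nothing is printed, either no document mentions the term or the conversion step did not produce any .txt files.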


Convert them all in another directory

What if you do not want the .txt files placed next to the .docx files, but mirrored into a new folder instead?

Well, I spent a while trying to make a neat one-liner, but eventually gave up on that idea and wrote this script.

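#!/bin/bash
# Simple script to convert docx files to txt files
# and put them in a new folder that mirrors the
# original layout. To change the output folder,
# edit the NEW_FOLDER variable.

BASE_FOLDER=$(pwd)
NEW_FOLDER=txtFiles

find "$BASE_FOLDER" -iname "*.docx" |
while IFS= read -r docxfile
do
  # Path relative to BASE_FOLDER, e.g. reports/q1.docx
  REL_PATH=${docxfile#"$BASE_FOLDER"/}
  # Same relative path under NEW_FOLDER, with the extension swapped
  TXT_FILE="$BASE_FOLDER/$NEW_FOLDER/${REL_PATH%.*}.txt"
  DIR=$(dirname "$TXT_FILE")

  echo "$TXT_FILE"

  mkdir -p "$DIR"
  docx2txt "$docxfile" "$TXT_FILE"
done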

This script places all the converted text files into a folder called txtFiles, mirroring the layout of the original folders. Now you can just grep there.
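
Since a match in txtFiles is only useful if you can find the Word document it came from, here is a rough sketch that greps the mirrored folder and prints the matching files as .docx paths. docx_matches is a made-up helper name; it assumes GNU grep, and it assumes the originals use a lowercase .docx extension.

```shell
# docx_matches TERM [TXT_DIR]: search the mirrored text folder (default:
# txtFiles) and print each matching file as a .docx path relative to the
# original tree. docx_matches is a hypothetical helper name.
docx_matches() {
  # Search from inside the folder so results come back as ./relative/paths
  (cd "${2:-txtFiles}" && grep -ril --include='*.txt' -- "$1" .) |
    sed -e 's|^\./||' -e 's|\.txt$|.docx|'
}

# Example, run from the folder that contains txtFiles:
#   docx_matches "budget"
```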


[1] docx2txt home page. Accessed 04/2016.
