It seems that some of our constituients have not been paying overly much attention to the settings on their scanners. We have over 40Gb of black-and-white, text-only documents scanned at 24 BPP, uncompressed, consuming 10 Mb each!
This happened once before. My colleague Warren licensed a product called “2TIFF” to shrink the files in question. This works well, except in his case ALL of the images in an Application folder needed to be compressed. I only need to shrink SOME of them.
After much fooling around and wasting of time, I was able to use a win32 port of the UNIX “find” command to hunt down all of the large files, dump the list to a file, and then use this file as a source for 2TIFF. The big mess of images now occupies only about 30 Mb of space.
Here are the sommand syntax details:
> find.exe “I:\OBJECTS\PURCHASE_ORDERS” -size +3M -fprint bigfiles.txt
(searches the PURCHASE_ORDERS document tree for all files larger then 3 Mb, dumps results to the text file “bigfiles.txt)
> FOR /f %F in (bigfiles.txt) DO ( “C:\Program Files\2TIFF\2tiff” s=%F d=%~dF\shrink%~pF -namegen=”[name].[srcext]” -quantize8 -ct4 -cd4 -keepexif)
(Perform a loop operation. For each loop, set the next line in bigfiles.txt to the variable %F. Run the 2Tiff program using %F as the source file. Use \shrink as the output directory (example: when %F=”c:\objects\procurement\1\163.bin, the output directory will be “c:\shrinkProcurement\1\”).)
Here is what the 2Tiff arguments mean:
-namegen=”[name].[srcext]” -> The name of the destination file is the same as that of the source ([name] is a built in variable equal to the source file name. [srcext] equals the source file’s extention)
-quantize=8 -> sets the “quantization” level of the TIFF. This value effects the “sampling rate” and affects image quality. Eight is the maximum value, for best quality.
-ct4 -> Compression type “LZW” is used. This is the default type for color scans. We are using LZW rather than the standard “type 3” for B/W documents because tests showed that reducing these images to monochrome yielded very low quality in some cases. We are keeping some color information to allow anti-aliasing and thus better letter quality.
-cd4 -> Sets the color depth down to 4 BPP from the source 24 BPP. CD1 would be better, but as mentioned above, this results in poor readability of the destination TIFF.
-keepexif -> preserves EXIF tags in the destination file from the source. Probably there is no EXIF info in these files, but I thought we would keep it in case I am wrong.
Warren had used the “dither” switch, but IMNSHO this makes the target document look worse and also results in larger files.