I would suggest using an extra tool such as a good text editor (NoteTab,
UltraEdit, TextPad): copy the content of the askSam doc (that is, the
content of the pdf file) into the editor and clean up the text
with "search & replace". With the option "regular expressions" activated,
this converts any double, triple or longer run of spaces into just one:
Search for: "  +"
(two spaces and the + sign, without the quotes)
Replace by: " "
(a single space, without the quotes)
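
If you would rather script this step than do it in an editor, the same replacement can be sketched in Python. This is only an illustration: the function name collapse_spaces is made up for the example, and I am assuming the editor's regular expression syntax behaves like Python's re module for this simple pattern.

```python
import re

def collapse_spaces(text):
    # "  +" means: one space followed by one or more further spaces,
    # i.e. any run of two or more spaces. Each run becomes one space.
    return re.sub(r"  +", " ", text)

print(collapse_spaces("PDF   text  with    gaps"))
```

Single spaces are left alone, so running it twice changes nothing further.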
Reformatting broken lines (deleting line feeds at the end of each line):
highlight the text you want to reformat and press Ctrl-J.
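
Joining broken lines can also be sketched in Python. This is not what the editor does internally, just a rough equivalent under my own assumptions: the text uses plain "\n" line feeds, a blank line marks a paragraph break that should be kept, and the name join_broken_lines is invented for the example.

```python
import re

def join_broken_lines(text):
    # Replace a line feed with a space only when it sits between two
    # non-empty lines. Line feeds that belong to a blank line (paragraph
    # break) and a trailing line feed at the very end are left untouched.
    return re.sub(r"(?<!\n)\n(?=[^\n])", " ", text)

sample = "This line was\nbroken by the PDF\n\nNew paragraph\nhere\n"
print(join_broken_lines(sample))
```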
Converting many blank lines into one (without "regular expressions"
activated) would be:
Search for: ^p^p^p
Replace by: ^p^p
Repeat this operation if necessary. (askSam itself is able to do
this, but I have not had much luck with its global operator: it is not
very stable. Sometimes it works, sometimes not.)
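
With a regular expression the repetition is not needed, because one pattern can match a run of any length. A minimal Python sketch, assuming plain "\n" line feeds and using a made-up function name, collapse_blank_lines:

```python
import re

def collapse_blank_lines(text):
    # Three or more line feeds in a row (that is, two or more blank
    # lines) are reduced to exactly two line feeds (one blank line).
    return re.sub(r"\n{3,}", "\n\n", text)

print(collapse_blank_lines("first\n\n\n\n\nsecond\n\nthird"))
```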
PS: I know that this is only an awkward workaround, but as
someone said, PDF files are very difficult to export/reformat with
almost any external tool. The only exception I know of is dtSearch
(version 5.xx or higher), where you can toggle the option "show search
results in the internal viewer" and what you get is a simple but
clean text output (extracted from the original pdf file, no matter how
complex the original layout may be).
On Wed, 12 Feb 2003 17:01:53 -0500, Phil Schnyder wrote:
>TextPipe only works during import (and only on Text and HTML files).
>But it does have pre-set functions to remove blank lines, remove
>spaces at the beginning of lines, remove HTML tags, etc.
>So although you can create filters for specific types of files,
>there are also generic functions that will work on any type of file.
>Hope this helps.
>Perry, FL (The Silicon Swamp)
>askSam: Turn Information into Searchable
>Try it at: http://www.askSam.com
>On Wed, 12 Feb 2003 10:46:05 -0800, Bonnie Britt wrote:
>>This is a characteristic of PDF files, and there's no obvious way
>>to "make it pretty" in askSam or any other software outside of
>>its native PDF format without wasting amazing amounts of time. If
>>it is a text-based PDF (and not a graphic) I import the PDF file,
>>warts and all, into askSam, then link to the original on the hard
>>drive as a backup. I don't bother to clean up the pdf import
>>since the text is there only to be searched. If and when it does
>>come up of interest in a search, then it makes sense to look at
>>the original for context, tables and images. If you wanted to
>>clean it up a bit to save space etc., you could search through
>>that imported document with ^p^p^p and replace that with ^p^p to
>>eliminate some excess blank lines. That takes only a second.
>>Is anyone using the new Textpipe thingie, and if so, does it save
>>cleanup is desirable? Is it useful only when importing documents
>>in the same manner? Or, does it require new sets of instructions
>>of documents formatted every which way?
>>From: "Frank Thomas" <[log in to unmask]>
>>| But there is a nother reason that made me hesitate to import pdf-
>>| You need to post-treat the text quite a lot as there are
>>| lines, change of character size or style etc. And of course the
>>| question of tables and of images.
JulianFlor, 13.02.2003