I would suggest using an extra tool such as a good text editor (NoteTab,
UltraEdit, TextPad): copy the content of an askSam doc (that is, the
content of the PDF file) into the editor and clean up the text there
with search & replace. With the "regular expressions" option activated,
you can convert any double, triple, or longer run of spaces into just one:
Search for: "  +"
(two spaces and the + sign, without the quotes)
Replace by: " "
(ONE space)
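
For anyone who would rather script this than use an editor, here is a minimal Python sketch of the same search & replace. The function name collapse_spaces is my own invention, and I am assuming the text has already been copied out of askSam into a plain string:

```python
import re

def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    # Pattern is two literal spaces followed by "+": the second space
    # may repeat, so it matches 2, 3, 4, ... spaces in a row.
    return re.sub(r"  +", " ", text)

if __name__ == "__main__":
    print(collapse_spaces("PDF   text  with     gaps"))
```

Single spaces are left alone, so running it twice does no harm.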
Reformatting broken lines (deleting line feeds at the end of each line)
in NoteTab: highlight the text you want to reformat and press Ctrl-J.
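
Outside NoteTab, something roughly similar can be sketched in Python. This is only my approximation of what joining lines does, not NoteTab's actual behaviour: it assumes paragraphs are separated by blank lines and joins the lines inside each paragraph with one space. The name join_broken_lines is hypothetical:

```python
import re

def join_broken_lines(text: str) -> str:
    """Join the lines of each paragraph into one long line."""
    # Assumption: a blank line (possibly containing whitespace) marks a
    # paragraph boundary. Several blank lines in a row count as one boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for paragraph in paragraphs:
        # Drop empty fragments and rejoin the remaining lines with one space.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)

if __name__ == "__main__":
    print(join_broken_lines("first line\nsecond line\n\nnext paragraph\ncontinues"))
```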
To convert many blank lines into one (without regular expressions
activated):
Search: ^p^p^p
Replace by: ^p^p
Repeat this operation if necessary. (askSam itself is able to do
this, but I have not had much luck with its global operator: it is not
very stable; sometimes it works, sometimes not.)
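
The blank-line cleanup can also be sketched in Python, where a regular expression does in one pass what the repeated ^p^p^p replacement does in several. This assumes the text uses plain "\n" line endings (Windows "\r\n" would need converting first), and collapse_blank_lines is just a name I made up:

```python
import re

def collapse_blank_lines(text: str) -> str:
    """Reduce any run of blank lines to a single blank line."""
    # Three or more consecutive newlines mean two or more blank lines;
    # replace them with exactly two newlines (one blank line).
    return re.sub(r"\n{3,}", "\n\n", text)

if __name__ == "__main__":
    print(collapse_blank_lines("line one\n\n\n\n\nline two\n\nline three"))
```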
Julian
PS: I know that this is only an awkward workaround, but as
someone said, PDF files are very difficult to export/reformat with
almost any external tool. The only exception I know of is dtSearch
(version 5.xx or higher), where you can toggle the option "show search
results in the internal viewer" and what you get is a simple but
clean text output (extracted from the original PDF file, no matter how
complex the original layout may be).
On Wed, 12 Feb 2003 17:01:53 -0500, Phil Schnyder wrote:
>Bonnie,
>
>TextPipe only works during import (and only on Text and HTML files).
>But it does have pre-set functions to remove blank lines, remove
>spaces at the beginning of lines, remove HTML tags, etc.
>
>So although you can create filters for specific types of files,
>there are also generic functions that will work on any type of file.
>
>Hope this helps.
>
>Phil
>
>
>askSam Systems
>Perry, FL (The Silicon Swamp)
>http://www.askSam.com/
>850-584-6590
>__________________________________________________________________
>askSam: Turn Information into Searchable
>Databases
>Try it at: http://www.askSam.com
>
>
>
>On Wed, 12 Feb 2003 10:46:05 -0800, Bonnie Britt wrote:
>>This is a characteristic of PDF files, and there's no obvious way
>>to "make it pretty" in askSam or any other software outside of
>>its native PDF format without wasting amazing amounts of time. If
>>it is a text-based PDF (and not a graphic) I import the PDF file,
>>warts and all, into askSam, then link to the original on the hard
>>drive as a backup. I don't bother to clean up the pdf import
>>since the text is there only to be searched. If and when it does
>>come up of interest in a search, then it makes sense to look at
>>the original for context, tables and images. If you wanted to
>>clean it up a bit to save space etc., you could search through
>>that imported document with ^p^p^p and replace that with ^p^p to
>>eliminate some excess blank lines. That takes only a second.
>>
>>Is anyone using the new Textpipe thingie, and if so, does it save
>>time when cleanup is desirable? Is it useful only when importing
>>documents formatted in the same manner? Or, does it require new
>>sets of instructions for new kinds of documents formatted every
>>which way?
>>Bonnie Britt
>>
>>
>>From: "Frank Thomas" <[log in to unmask]>
>>
>>| But there is another reason that made me hesitate to import
>>| pdf-files: You need to post-treat the text quite a lot as there
>>| are numerous empty lines, change of character size or style etc.
>>| And of course the question of tables and of images.
>>| Thanks
>>| /Frank
>>|
--
JulianFlor, [log in to unmask] on 13.02.2003