Mechanism for working with vision-based models #245

daedsidog · 2024-03-10T14:27:38Z

daedsidog
Mar 10, 2024

A lot of times I need to pass information to ChatGPT that I can't copy, such as a snippet from an old, scanned document or formatted mathematics.

Right now I have a manual process where I query a vision model (through a website) so that it tells me what it sees. E.g., when I paste in a snippet of mathematics, it gives me LaTeX code, which I then pass on to ChatGPT.

Would be very nice to have something like this.

karthink · 2024-03-10T19:24:16Z

karthink
Mar 10, 2024
Maintainer

I forget, do you use org-mode or markdown with gptel?


daedsidog · 2024-03-10T20:30:33Z

daedsidog
Mar 10, 2024
Author

I used to use Org mode but I switched to Markdown because I was tired of gptel sometimes doing weird things (like removing underscores). I think it's fixed in the latest version, but I haven't switched back.

Why is that relevant, though?


karthink · 2024-03-10T21:17:46Z

karthink
Mar 10, 2024
Maintainer

> I used to use Org mode but I switched to Markdown because I was tired of gptel sometimes doing weird things (like removing underscores). I think it's fixed in the latest version, but I haven't switched back.

It should be fixed now, yeah.

> Why is that relevant, though?

It's easier to support vision models in Org mode. That said, please see the discussion in #231.


daedsidog · 2024-03-10T23:36:03Z

daedsidog
Mar 10, 2024
Author

> It's easier to support vision models in Org mode. That said, please see the discussion in #231.

Interesting.

I honestly think the most power from gptel comes with just the abstraction layer it provides when interacting with various models. I, for one, have completely eliminated the process of manually typing code by implementing context generation. Below is a demonstration of me constructing a context buffer, and with a keypress I use gptel's replace-in-place with it. Works exceedingly well. This is kind of its own separate thing from gptel, but I was wondering if the scope of gptel should include this sort of thing.

What I want now is just a way, totally unrelated to Org mode or MD, which will allow me to "send" ChatGPT queries with images (i.e., send current image saved on clip) and get an input in place.


karthink · 2024-03-11T00:34:23Z

karthink
Mar 11, 2024
Maintainer

> I honestly think the most power from gptel comes with just the abstraction layer it provides when interacting with various models

I see.

> Below is a demonstration

Sorry, I had trouble following your demo. My best guess is that the buffer on the right is sent as the context (or system message), and you're asking it to do something with those functions.

I don't understand how it knows what "special function" means, or what "process" means, or what the utility of this approach is in more general situations.

> What I want now is just a way, totally unrelated to Org mode or MD, which will allow me to "send" ChatGPT queries with images (i.e., send current image saved on clip) and get an input in place.

I'm not sure gptel is set up to do that -- it's a very buffer-oriented system. At minimum it will need to distinguish between text as text and text that represents a file path and act on the file instead. A common way to do this would be to define a gptel-send-image command, but I'm not interested in growing the command surface area of gptel.

Basically, handling images is not ruled out, but right now I don't know the best way of doing so that conforms to a simple mental model like the chat usage does.


karthink · 2024-03-11T00:55:38Z

karthink
Mar 11, 2024
Maintainer

> I was wondering if the scope of gptel should include this sort of thing.

I'm interested to understand what you mean here -- I just had trouble following the demo.


daedsidog · 2024-03-11T02:20:12Z

daedsidog
Mar 11, 2024
Author

> > I was wondering if the scope of gptel should include this sort of thing.
>
> I'm interested to understand what you mean here -- I just had trouble following the demo.

My apologies, my explanation was terrible.

The demo showcases a way to mark areas in different buffers, and aggregate them in their own dedicated buffer. That buffer can then be copied and handed to gptel as context. This is much easier than manually copy pasting sections of context into the dedicated chat buffer/external ChatGPT website, and also has the added bonus of minimizing the context by collapsing code that doesn't contribute to the context.

You can manually remove context snippets from the context buffer.

In a nutshell, it's a glorified yanker, but I found it incredibly useful.
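The add-on itself is not shown in this thread, so the following is only a minimal sketch of the "mark regions, collect them in one buffer" idea described above. Every `my/context-...` name and the buffer name are hypothetical; none of them come from gptel or from the demo.

```elisp
;; Hypothetical sketch: these names are illustrative only and are not
;; part of gptel or of the add-on discussed in this thread.
(defvar my/context-buffer-name "*context-sketch*"
  "Name of the buffer that collects marked snippets.")

(defun my/context-add-region (beg end)
  "Append the text between BEG and END to the context buffer.
Each snippet is preceded by the name of the buffer it came from."
  (interactive "r")
  (let ((snippet (buffer-substring-no-properties beg end))
        (source (buffer-name)))
    (with-current-buffer (get-buffer-create my/context-buffer-name)
      (goto-char (point-max))
      (insert (format "From %s:\n%s\n\n" source snippet)))))

(defun my/context-string ()
  "Return the collected context as a string, or nil if there is none."
  (let ((buf (get-buffer my/context-buffer-name)))
    (when buf
      (with-current-buffer buf
        (buffer-substring-no-properties (point-min) (point-max))))))
```

With something like this, `M-x my/context-add-region` on a selection in any buffer appends it to the shared buffer, and snippets can be deleted from that buffer by hand before it is passed on as context.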

> I'm interested to understand what you mean here

I am wondering if you would be open for this to be integrated into gptel, or should this remain its own separate package. It's pretty useless outside of gptel, though.


karthink · 2024-03-12T00:32:36Z

karthink
Mar 12, 2024
Maintainer

> I am wondering if you would be open for this to be integrated into gptel, or should this remain its own separate package. It's pretty useless outside of gptel, though.

I like the idea! I'll have to think about how to integrate it into gptel though. Right now the best idea I have is "Add an option to the transient menu to append a selected region to the system prompt". This won't work well across buffers since each buffer has its own system prompt.

You've developed a more sophisticated UI for this style of usage, it's interesting.


karthink · 2024-03-12T00:34:09Z

karthink
Mar 12, 2024
Maintainer

Converting to a discussion since there's nothing to fix in gptel right now.


daedsidog · 2024-03-12T09:26:39Z

daedsidog
Mar 12, 2024
Author

> > I am wondering if you would be open for this to be integrated into gptel, or should this remain its own separate package. It's pretty useless outside of gptel, though.
>
> I like the idea! I'll have to think about how to integrate it into gptel though. Right now the best idea I have is "Add an option to the transient menu to append a selected region to the system prompt". This won't work well across buffers since each buffer has its own system prompt.
>
> You've developed a more sophisticated UI for this style of usage, it's interesting.

Just mind you, it's not a system prompt. I lexically set a system prompt that tells it how to treat the text it's supposed to replace, then I insert the user prompt in place with what is supposed to be replaced, and then I use the gptel refactor to handle everything.


doctorguile · 2024-03-13T20:11:35Z

doctorguile
Mar 13, 2024

@daedsidog your demo is very interesting

If you don't mind, could you share your gptel add-on so we can try it out?

Thanks


daedsidog · 2024-03-13T23:45:33Z

daedsidog
Mar 13, 2024
Author

@doctorguile I'll add all the things to my fork sometime soon.


karthink · 2024-09-26T06:54:23Z

karthink
Sep 26, 2024
Maintainer

I have added vision support to gptel in the feature-capabilities branch. Currently all backends except Gemini are supported. This is not a proof of concept hack like my previous attempt, I did it the "right" (and painful) way this time -- please test if you're interested. It's a pretty big change so there are sure to be bugs.

It's actually a little more general than vision support -- a lot of the changes are about specifying per-model capabilities, to pave the way to add function calling, JSON output and image output (DALL-E etc) uniformly to gptel-request.

To set it up correctly:

- Ensure that your gptel-model is a symbol, not a string (i.e. gpt-4o-mini instead of "gpt-4o-mini"), and set to a model that supports vision (except Gemini models).
- See the documentation of gptel-make-*, where * is openai, anthropic or ollama, and the documentation of the new user option gptel-track-media.

There are two ways to use it:

1. Set gptel-track-media to t and type in a link to an image file, on a line by itself, in an Org or Markdown chat buffer. Then gptel-send will find and send the image along with your text.
2. Add an image file to the context using gptel-add-file, or gptel-add from an image-mode buffer.
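As a configuration fragment, the setup described above might look like the following. This is a sketch that assumes a gptel version with image support is already installed and that an OpenAI backend is in use; the model name is the one used as an example in this post.

```elisp
;; Assumes a gptel version that includes the vision support described here.
(require 'gptel)

;; The model must be a symbol, not a string.
(setq gptel-model 'gpt-4o-mini)

;; Let gptel-send pick up standalone image links in Org or Markdown
;; chat buffers and send the linked files along with the text.
(setq gptel-track-media t)
```

The second way needs no configuration beyond a vision-capable model: run `M-x gptel-add-file` and pick the image, or call `gptel-add` from a buffer that is visiting the image.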


wlauppe · 2024-09-30T15:09:32Z

wlauppe
Sep 30, 2024

Hello Karthik, what wonderful news! I have been hoping for a long time that you would implement vision capabilities.
Here are my test results:

Method 1.

This method works perfectly. I added the image as context via the -f option in gptel-menu, asked "Describe the picture", and got back a satisfactory answer:

The image shows a simple, line-drawn outline of a house. The house has a basic structure with a peaked roof on the left side and a flat roof extension on the right. There's a rectangular door in the center and two square windows with cross-frames on the right side. A chimney is visible on the roof. The drawing is entirely in black and white, with no shading or detail beyond the basic outlines. This style of drawing is often used in coloring books or as a simple representation of a home in diagrams or children's illustrations.

Method 2:

I marked the region and pressed my gptel-send shortcut, but it was not successful.

I read the documentation of the gptel-track-media variable and the README of the branch, where you wrote that at the moment this method is only supported in chat mode. But in chat mode it didn't work either.
My image link is of this form:

[[file:orgdownload/20240930_083921_screenshot_20240930_083918.png]]

It conforms to the Org mode defaults for image links:
"An image is a link to an image file that does not have a description part"
https://orgmode.org/manual/Images.html

But this doesn't work.
Your documentation says it has to be a different format. Can you give an example of an image link that works?

14 replies

wlauppe Oct 8, 2024

Running your latest version, I get a new error:
gptel-org--link-standalone-p: Symbol’s function definition is void: org-element-begin

wlauppe Oct 8, 2024

That's the backtrace:
backtrace2.txt

wlauppe Oct 8, 2024

I rebuilt Emacs from the master branch of git://git.sv.gnu.org/emacs.git; emacs --version returns GNU Emacs 31.0.50.

There everything works perfectly, but I'm happy to keep testing gptel with the default Ubuntu Emacs 29.4 for as long as you need.

karthink Oct 8, 2024
Maintainer

> running your latest version i get a new error: gptel-org--link-standalone-p: Symbol’s function definition is void: org-element-begin

You're on a very old version of Org. Anyway, I pushed a fix working around this issue. It should now work in the default Ubuntu Emacs 29.4 version.
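The actual fix is not shown in this thread. One common way to work around a missing accessor like this is a small compatibility wrapper; the sketch below is an assumption about the general approach, not the code that was pushed, and `my/org-element-begin` is a hypothetical name.

```elisp
(require 'org-element)

(defun my/org-element-begin (node)
  "Return the buffer position where Org element NODE begins.
Hypothetical helper, not part of gptel or Org."
  (if (fboundp 'org-element-begin)
      ;; Newer Org releases provide this accessor.
      (org-element-begin node)
    ;; Older Org releases only expose the position as a property.
    (org-element-property :begin node)))
```

Because the `fboundp` check happens at run time, the same code can be loaded on both old and new Org versions.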

wlauppe Oct 8, 2024

Your latest version also seems to work with Emacs 29.4, so that seems to be fixed now. Great 👍

karthink · 2024-10-12T08:07:14Z

karthink
Oct 12, 2024
Maintainer

Image support has been merged, and is available in gptel 0.9.5.


daedsidog · 2024-11-06T22:26:24Z

daedsidog
Nov 6, 2024
Author

There is no support for Markdown mode yet, right? I tried multiple link variations and none of them were detected. Looking in the code, only org mode has link parsing.

4 replies

karthink Nov 7, 2024
Maintainer

Image or document links in markdown are supported. The documentation for gptel-track-media explains.

Plus you can add a media file to the context with gptel-add irrespective of major-mode.

daedsidog Nov 7, 2024
Author

I don't know why it did not occur to me to read its documentation.

Nevertheless, the angle bracket format still does not seem to work, while the other format (i.e. [name](link/to/media)) does:

Incidentally, this was the format I tried, which led me to my assumptions.

This in my opinion is actually a very nice integration of vision. Great work.

karthink Nov 7, 2024
Maintainer

Angle bracket links in markdown syntax need to have the full URI to be recognized as a link. i.e. you need

<file:///tmp/clipboard52.png>

What do you see

daedsidog Nov 7, 2024
Author

Thanks, that worked. I tried <file:/tmp/etc.png> before, like in org, and that's why it didn't work.
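To avoid mistyping the URI, the angle bracket form can be generated from a path. This is a minimal sketch with a hypothetical helper name; it assumes a Unix-style absolute path and does not percent-encode spaces or other special characters.

```elisp
(defun my/markdown-file-link (path)
  "Return an angle bracket Markdown link with a full file URI for PATH.
Hypothetical helper, not part of gptel."
  ;; `expand-file-name' makes PATH absolute, so on Unix-like systems the
  ;; result has three slashes after \"file:\", as required above.
  (format "<file://%s>" (expand-file-name path)))

;; On GNU/Linux this returns "<file:///tmp/clipboard52.png>".
(my/markdown-file-link "/tmp/clipboard52.png")
```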

Cheers!
