Discussion: Proposal to change default handling for content-encoded responses
Kartikaya Gupta
2018-10-30 00:34:28 UTC
Permalink
Hello curl users,

I was recently writing a script to download a JSON file with curl, and discovered that the server was sending the file with 'Content-Encoding: gzip'. The downloaded file therefore had to be gunzip'd before it was usable. Other similar JSON files from the same server were *not* being similarly encoded, so I couldn't just pipe the result through gunzip unconditionally. After some searching online, I found [1] which said to use the --compressed argument, and sure enough that resolved my problem.

The documentation for --compressed says that it makes curl *request* a compressed response, which is not quite the same as just decompressing the received response. So --compressed actually does both - request a compressed response, and automatically decompress the response if needed. I only need the latter, so using the --compressed flag seems semantically incorrect for my purpose, even though it works.
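
To illustrate the difference (made-up URL; output paraphrased from memory, so treat this as a sketch rather than exact behaviour):

    # Default: curl sends no Accept-Encoding, but if the server replies
    # with 'Content-Encoding: gzip' anyway, the raw gzip bytes get saved.
    curl -s https://example.com/data.json -o data.json
    file data.json     # reports gzip compressed data, not JSON

    # With --compressed: curl both asks for a compressed response and
    # decompresses whatever encoded response it receives.
    curl -s --compressed https://example.com/data.json -o data.json
    file data.json     # reports plain text (the JSON I actually wanted)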

I also looked at the relevant HTTP spec [2], which says (paraphrasing) that a request without any Accept-Encoding headers means the server can send any Content-Encoding in response. Personally, I think that if the client side is capable of decoding the encoding, it should attempt to do so, as that provides the most useful default. Otherwise it's up to the user of curl to check for encodings and explicitly decompress them. It just seems like a not-so-great pitfall.
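
Without --compressed, the conditional handling a script has to do looks something like this (a sketch; the header matching is simplistic and the filenames are placeholders):

    # Save the response headers next to the body, then decompress only if
    # the server actually declared a gzip Content-Encoding.
    curl -s -D headers.txt -o data.json https://example.com/data.json
    if grep -qi '^content-encoding:.*gzip' headers.txt; then
        mv data.json data.json.gz
        gunzip data.json.gz
    fi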

Does anybody have examples where turning on automatic content-decoding would adversely impact the use case? Any other comments on changing the default behaviour here? I'm curious to know also if other people have run into this problem before. According to Daniel [3] the current behaviour is something that has just been inherited down through the ages, and so it's possible there's no strong argument for keeping it the way it is.

[1] https://stackoverflow.com/questions/8364640/how-to-properly-handle-a-gzipped-page-when-using-curl
[2] https://tools.ietf.org/html/rfc7231#section-5.3.4
[3] https://github.com/curl/curl/issues/3192#issuecomment-434124116
Daniel Stenberg
2018-10-30 08:51:30 UTC
Permalink
Post by Kartikaya Gupta
Does anybody have examples where turning on automatic content-decoding would
adversely impact the use case? Any other comments on changing the default
behaviour here?
I'm in favor of this change unless someone can present a use case indicating a
risk for serious user discomfort.

Changing the default to '--compressed' might also improve life
for a bunch of users since it might save bandwidth and can reduce transfer
waits (since it'll need to transfer less data to get the job done).

It could also be noted that --no-compressed would then be necessary to use in
order to *not* ask for a compressed (and auto-decompress) resource.
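
To spell out what that would look like (hypothetical future behaviour, URL made up):

    # With the changed default, this would implicitly do what --compressed
    # does today:
    curl https://example.com/file.json -o file.json

    # and getting today's behaviour (no Accept-Encoding sent, body saved
    # exactly as received) would need the explicit opt-out:
    curl --no-compressed https://example.com/file.json -o file.json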

I can't think of any middle-way approach for this change that would make the
transition easier. It seems to be rather binary.
--
/ daniel.haxx.se
Dan Fandrich
2018-10-30 09:14:06 UTC
Permalink
Post by Daniel Stenberg
Post by Kartikaya Gupta
Does anybody have examples where turning on automatic content-decoding
would adversely impact the use case? Any other comments on changing the
default behaviour here?
I'm in favor of this change unless someone can present a use case indicating
a risk for serious user discomfort.
Any script that downloads compressed tar balls will suddenly start breaking on
many servers if curl starts silently decompressing them. Many servers mark
these with Content-Encoding: gzip so this is a very common case, I would guess.
I'd say that having it on by default would probably make more sense if it
weren't for this.
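
A trivial example of the kind of script I mean (URL made up):

    # Today this saves the .tar.gz exactly as stored on the server, even
    # if the server also slaps 'Content-Encoding: gzip' on it.
    curl -sO https://example.com/release-1.0.tar.gz
    tar xzf release-1.0.tar.gz

    # If curl decompressed such responses by default, the saved file would
    # silently become a plain tar, and 'tar xzf' (or any checksum or
    # signature verification of the download) would start failing.
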
Dan
Kartikaya Gupta
2018-10-30 12:06:15 UTC
Permalink
Post by Dan Fandrich
Any script that downloads compressed tar balls will suddenly start breaking on
many servers if curl starts silently decompressing them. Many servers mark
these with Content-Encoding: gzip so this is a very common case, I would guess.
I'd say that having it on by default would probably make more sense if it
weren't for this.
Are you referring to tarballs that are already compressed (e.g. foo.tgz
or foo.tar.gz)? If so, the server should only be marking it
'Content-Encoding: gzip' if the tarball is double-compressed. If it
is sending the raw foo.tgz bytes with a 'Content-Encoding: gzip' that
sounds like a server bug.

Or do you mean that the uncompressed tarball (foo.tar) is compressed by
the server and sent down with 'Content-Encoding: gzip'? If that's the
case I would expect the client side to be expecting the uncompressed
tarball, not the compressed one.
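
To make the two cases concrete (URLs made up; this is roughly what I would expect the response headers to look like in each case):

    # Case 1: the stored foo.tgz is served byte-for-byte. No
    # Content-Encoding should appear; the type describes the bytes.
    $ curl -sI https://example.com/foo.tgz | grep -i '^content-'
    Content-Type: application/gzip

    # Case 2: the server stores foo.tar and gzips it on the fly. Here
    # 'Content-Encoding: gzip' is correct, and the client should end up
    # with the plain tar after decoding.
    $ curl -sI https://example.com/foo.tar | grep -i '^content-'
    Content-Type: application/x-tar
    Content-Encoding: gzip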

In other words, the content-encoding should be transparent to the end
user (who/whatever is invoking curl). It's something to be negotiated
between curl and the server, although of course it can be requested via
curl options.
Dan Fandrich
2018-10-30 12:20:34 UTC
Permalink
Post by Kartikaya Gupta
Post by Dan Fandrich
Any script that downloads compressed tar balls will suddenly start breaking on
many servers if curl starts silently decompressing them. Many servers mark
these with Content-Encoding: gzip so this is a very common case, I would guess.
I'd say that having it on by default would probably make more sense if it
weren't for this.
Are you referring to tarballs that are already compressed (e.g. foo.tgz
Yes.
Post by Kartikaya Gupta
or foo.tar.gz)? If so, the server should only be marking it
'Content-Encoding: gzip' if the tarball is double-compressed. If it
is sending the raw foo.tgz bytes with a 'Content-Encoding: gzip' that
sounds like a server bug.
I wouldn't necessarily call it a bug: the file is gzip compressed and the
contents need to be uncompressed to make use of it, so Content-Encoding: gzip
isn't prima facie wrong. It's just that many clients would prefer to keep the
file compressed in order to save it to disk intact. But consider a web
application that, for example, lets you list the contents of tarballs given
the URL of one on the web. For such a client, Content-Encoding: gzip makes
sense as the data it needs is in tar format, not gzip format.
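
As a sketch of such a client (URL made up): a "show me what's inside this remote tarball" tool only cares about the tar stream, so having the HTTP layer undo the gzip is exactly what it wants.

    # List the members of a remote tarball that the server delivers with
    # 'Content-Encoding: gzip'; curl decodes the encoding and hands tar
    # the plain tar stream.
    curl -s --compressed https://example.com/foo.tgz | tar -tf -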

Regardless, many servers do this and we need to keep in mind the ramifications
of changing the default curl behaviour.
Kartikaya Gupta
2018-10-31 00:56:37 UTC
Permalink
Post by Dan Fandrich
Regardless, many servers do this and we need to keep in mind the ramifications
of changing the default curl behaviour.
If there are many servers that do this, then silent decompression is likely
to break things. But maybe sending 'Accept-Encoding: identity' with the
default request (i.e. when --compressed is not specified) would work
here?
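
(For what it's worth, that can already be approximated today by setting the header manually, e.g.:)

    # Explicitly advertise that we only want the unencoded representation;
    # a well-behaved server should then not apply any Content-Encoding.
    curl -s -H 'Accept-Encoding: identity' https://example.com/data.json -o data.json
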
Dan Fandrich
2018-10-31 08:56:01 UTC
Permalink
Post by Kartikaya Gupta
If there are many servers that do this, then silent decompression is likely
to break things. But maybe sending 'Accept-Encoding: identity' with the
default request (i.e. when --compressed is not specified) would work
here?
I doubt that would change anything. Most such servers have a dumb rule to just
apply Content-Encoding: gzip to any file ending in .gz, and nothing the client
sends would change that.
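
For example, with Apache httpd a single directive of this sort is enough, and some setups have it; I'm quoting from memory, so treat it as approximate:

    # Any file whose name ends in .gz or .tgz is served with
    # 'Content-Encoding: gzip', regardless of what Accept-Encoding said.
    AddEncoding x-gzip .gz .tgz
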
Roman Neuhauser
2018-11-01 15:53:52 UTC
Permalink
Post by Kartikaya Gupta
the server should only be marking it 'Content-Encoding: gzip' if the
tarball is double-compressed. If it is sending the raw foo.tgz
bytes with a 'Content-Encoding: gzip' that sounds like a server bug.
Post by Dan Fandrich
I wouldn't call it necessarily a bug: the file is gzip compressed and the
contents need to be uncompressed to make use of it, so Content-Encoding: gzip
isn't prima facie wrong. It's just that many clients would prefer to keep the
file compressed in order to save it to disk intact.
i would call it a violation of http/1.1: if the request is for /foo.tgz,
the server can only send Content-Encoding: gzip if gunzip is necessary
to arrive at the actual foo.tgz. it's perfectly ok for the server to
serve /foo as well backed by on-disk foo.tgz with the C-E: gzip header,
but we're not discussing that.
Post by Dan Fandrich
But consider a web application that, for example, lets you list the
directories of tar balls given the URL of one on the web. For such a
client, Content-Encoding: gzip makes sense as the data it needs is in
tar format, not gzip format.
what would be the Content-Type response header in this case?
Post by Dan Fandrich
Regardless, many servers do this and we need to keep in mind the ramifications
of changing the default curl behaviour.
this does not match my experience. i do remember seeing this many years
ago: downloading a something.tar.gz with a browser, i would end up with
something.tar instead. maddening.
--
roman
Kartikaya Gupta
2018-10-30 12:01:51 UTC
Permalink
Post by Daniel Stenberg
Changing the default to use '--compressed' by default might also improve life
for a bunch of users since it might save bandwidth and can reduce transfer
waits (since it'll need to transfer less data to get the job done).
For the record, I was originally thinking that curl should just do the
decompression part by default, not necessarily *request* the compression
by default. (i.e. it would keep the current default of not sending an
Accept-Encoding header, but would decode any responses that were
content-encoded anyway).
Post by Daniel Stenberg
I can't think of any middle-way approach for this change that would make the
transition easier. It seems to be rather binary.
I thought about this a bit more, and another option is to have the
default request send 'Accept-Encoding: identity' to discourage servers
from sending the response encoded. However as you say many users might
benefit from the compression in terms of bandwidth savings and so in
that case defaulting to --compressed might be a better option.

In any case it sounds (from your comment on the GitHub issue) like
it's possible to build curl without support for --compressed, and in
that case it probably does make sense to send 'Accept-Encoding:
identity' explicitly. In that scenario use of --compressed is not an
option, and setting this header would be an accurate representation of
what the client side can handle in terms of content-encodings.
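
(For reference, whether a given curl build can decode compressed content-encodings at all shows up in its feature list, something like:)

    # If 'libz' (and brotli, where supported) is missing from the Features
    # line, this build cannot decompress, and advertising
    # 'Accept-Encoding: identity' would describe it accurately.
    curl -V | grep -i features
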
Jan Stary
2018-10-31 11:11:46 UTC
Permalink
(Please wrap your lines; these long paragraphs make it
hard to reply to a specific portion of your text.)
Post by Kartikaya Gupta
Hello curl users,
I was recently writing a script to download a JSON file with curl, and discovered that the server was sending the file with 'Content-Encoding: gzip'. The downloaded file therefore had to be gunzip'd before it was usable. Other similar JSON files from the same server were *not* being similarly encoded, so I couldn't just pipe the result through gunzip unconditionally. After some searching online, I found [1] which said to use the --compressed argument, and sure enough that resolved my problem.
The documentation for --compressed says that it makes curl *request* a compressed response, which is not quite the same as just decompressing the received response. So --compressed actually does both - request a compressed response, and automatically decompress the response if needed.
Yes, and the manpage says so:

--compressed

(HTTP) Request a compressed response using one of the
algorithms curl supports, and save the uncompressed document.
If this option is used and the server sends an unsupported
encoding, curl will report an error.
Post by Kartikaya Gupta
I only need the latter,
That is, you only need to uncompress, IF the response is compressed,
without asking for the compression yourself. Right?
That seems like a sensible default behaviour.
Post by Kartikaya Gupta
I also looked at the relevant HTTP spec [2], which says (paraphrasing) that a request without any Accept-Encoding headers means the server can send any Content-Encoding in response. Personally, I think that if the client side is capable of decoding the encoding, it should attempt to do so, as that provides the most useful default.
Yes.
Post by Kartikaya Gupta
Otherwise it's up to the user of curl to check for encodings and explicitly decompress them. It just seems like a not-so-great pitfall.
Does anybody have examples where turning on automatic content-decoding would adversely impact the use case? Any other comments on changing the default behaviour here? I'm curious to know also if other people have run into this problem before.
Post by Daniel Stenberg
Post by Kartikaya Gupta
Does anybody have examples where turning on automatic content-decoding
would adversely impact the use case? Any other comments on changing the
default behaviour here?
I'm in favor of this change unless someone can present a use case indicating
a risk for serious user discomfort.
Changing the default to use '--compressed' by default might also improve
life for a bunch of users since it might save bandwidth and can reduce
transfer waits (since it'll need to transfer less data to get the job done).
Please don't change the default to --compressed, as that also *asks for*
compression. For a slow machine with a fast connection (not that rare
when serving big static files), this would actually be counterproductive.
For example, my slow home server will happily serve a 700MB *.iso file
down its fast line, but it would take ages if it were to compress it.
Post by Daniel Stenberg
It could also be noted that --no-compressed would then be necessary to use
in order to *not* ask for a compressed (and auto-decompress) resource.
Not asking for anything extra should imho be the default.
If the user wants the server to compress the content, let him say so.
Post by Dan Fandrich
Post by Daniel Stenberg
Post by Kartikaya Gupta
Does anybody have examples where turning on automatic content-decoding
would adversely impact the use case? Any other comments on changing the
default behaviour here?
I'm in favor of this change unless someone can present a use case indicating
a risk for serious user discomfort.
Any script that downloads compressed tar balls will suddenly start breaking on
many servers if curl starts silently decompressing them. Many servers mark
these with Content-Encoding: gzip so this is a very common case
Can you please give an example?
I haven't found any server that does that.
Post by Kartikaya Gupta
Post by Dan Fandrich
Any script that downloads compressed tar balls will suddenly start breaking on
many servers if curl starts silently decompressing them. Many servers mark
these with Content-Encoding: gzip so this is a very common case, I would guess.
I'd say that having it on by default would probably make more sense if it
weren't for this.
Are you referring to tarballs that are already compressed (e.g. foo.tgz
or foo.tar.gz)? If so, the server should only be marking it
'Content-Encoding: gzip' if the tarball is double-compressed.
Exactly. Responding with 'Content-Encoding: gzip' means that the content
has been compressed. The fact that the content itself is a gzipped
tarball is none of the server's business.
Post by Kartikaya Gupta
If it is sending the raw foo.tgz bytes with a 'Content-Encoding: gzip' that
sounds like a server bug.
Yes.
Post by Dan Fandrich
Post by Kartikaya Gupta
Post by Dan Fandrich
Any script that downloads compressed tar balls will suddenly start breaking on
many servers if curl starts silently decompressing them. Many servers mark
these with Content-Encoding: gzip so this is a very common case, I would guess.
I'd say that having it on by default would probably make more sense if it
weren't for this.
Are you referring to tarballs that are already compressed (e.g. foo.tgz
Yes.
Post by Kartikaya Gupta
or foo.tar.gz)? If so, the server should only be marking it
'Content-Encoding: gzip' if the tarball is double-compressed. If it
is sending the raw foo.tgz bytes with a 'Content-Encoding: gzip' that
sounds like a server bug.
I wouldn't call it necessarily a bug—the file is gzip compressed and the
contents need to be uncompressed to make use of it,
That's none of the server's business. Nor is it its job to speculate
on what the content is and what the client wants to do with it.
Post by Dan Fandrich
so Content-Encoding: gzip isn't prima facie wrong.
Yes it is. If the server sends the file raw, as is,
it's plainly wrong to say "Content-Encoding: gzip".

If there is any need to advise the client about what the content is,
there is the Content-Type header.
Post by Dan Fandrich
It's just that many clients would prefer to keep the
file compressed in order to save it to disk intact. But consider a web
application that, for example, lets you list the directories of tar balls
given the URL of one on the web.
I don't know what you mean: what application?
Do you mean an FTP directory listing?
Post by Dan Fandrich
For such a client,
What client? curl is the client here.
(And the server doesn't care who the client is.)
Post by Dan Fandrich
Content-Encoding: gzip makes
sense as the data it needs is in tar format, not gzip format.
Regardless, many servers do this
Please show a URL of a gzipped tarball on some such server.
Post by Dan Fandrich
Post by Kartikaya Gupta
If there are many servers that do this, then silent decompression is likely
to break things. But maybe sending 'Accept-Encoding: identity' with the
default request (i.e. when --compressed is not specified) would work
here?
I doubt that would change anything. Most such servers have a dumb rule to just
apply Content-Encoding: gzip to any file ending in .gz, and nothing the client
sends would change that.
Which HTTP server does that, for example?
Post by Kartikaya Gupta
For the record, I was originally thinking that curl should just do the
decompression part by default, not necessarily *request* the compression
by default. (i.e. it would keep the current default of not sending an
Accept-Encoding header, but would decode any responses that were
content-encoded anyway).
That seems like the obvious thing to do.

Jan

Daniel Stenberg
2018-10-31 12:12:24 UTC
Permalink
Post by Kartikaya Gupta
I also looked at the relevant HTTP spec [2], which says (paraphrasing) that
a request without any Accept-Encoding headers means the server can send any
Content-Encoding in response. Personally, I think that if the client side
is capable of decoding the encoding, it should attempt to do so, as that
provides the most useful default.
Yes.
Strictly speaking and protocol-wise, the header just tells the client that the
content is compressed. It doesn't mean "please uncompress this". Even if
that's what some clients do, and I'm sure it's what some servers mean to
imply when sending that header...
--
/ daniel.haxx.se
Jan Stary
2018-10-31 15:17:24 UTC
Permalink
Post by Daniel Stenberg
Post by Kartikaya Gupta
I also looked at the relevant HTTP spec [2], which says
(paraphrasing) that a request without any Accept-Encoding headers
means the server can send any Content-Encoding in response.
Personally, I think that if the client side is capable of decoding
the encoding, it should attempt to do so, as that provides the most
useful default.
Yes.
Strictly speaking and protocol-wise, the header just tells the client that
the content is compressed.
At the risk of splitting hairs, it tells the client
that the server _has_compressed_ the content it is sending, right?
Which seems to be exactly the difference when sending a file.tar.gz
which the server is not compressing.
Daniel Stenberg
2018-10-31 21:19:59 UTC
Permalink
Post by Jan Stary
Post by Daniel Stenberg
Strictly speaking and protocol-wise, the header just tells the client that
the content is compressed.
At the risk of splitting hair, it tells the client
that the server _has_compressed_ the content it is sending, right?
No. Content-Encoding describes the content; it says nothing about who did the
encoding, when it was done or even for what purpose. Automatic
decompression of this for transfer purposes was always a bit of an abuse of
HTTP since the original intent was to use transfer-encoding for that. But
nobody (except curl and some rare clients) implements that and everyone just
sticks to the Content-Encoding header...
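
For completeness: curl can ask for that with --tr-encoding, for the rare
servers that support it. Something like:

    # Request a compressed Transfer-Encoding (TE: gzip) instead of a
    # compressed representation; curl always undoes transfer encodings
    # before handing the body over, which matches the original intent.
    curl --tr-encoding https://example.com/data.json -o data.json
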
--
/ daniel.haxx.se