Discussion: sort of tangent to curl -- web browser/extension
bruce via curl-users
2021-06-01 11:14:09 UTC
Hi.

Researching browser extensions and thought I'd post here as well.

Curl is great for most of my needs to date. However, I'm wondering if
anyone has run across any kind of browser extension that would let the
user, while viewing a URL/page, capture the underlying "real" content.
I'm not referring to the basic "view source". I'm looking to capture
the underlying JavaScript-generated/dynamic content that a page
produces.

And of course free/open source would be useful!

thanks

ps. My real-world use case: a site I'm looking at has implemented
PerimeterX as blocking tech. The constraints of that tool are onerous,
causing major issues. If I'm manually viewing the site, I can see the
actual page text by opening the underlying "GET" URL, which displays
the text in a separate tab.

I'm wondering if anyone has run across an extension that would
capture/save the underlying content of the "current/viewed" page. Once
I have the "raw" content, parsing is trivial.
Petr Pisar via curl-users
2021-06-01 12:56:35 UTC
Post by bruce via curl-users
I'm looking to capture the underlying JavaScript-generated/dynamic
content that a page produces.
The Firefox web browser comes with that built in. When the web page is
open, press F12 to open the "Tools for web developers" window. There
you click on the Network tab, then invoke the context menu on any of
the HTTP request-response lines and select "Save All as HAR". That will
dump the complete network traffic of the web browser to a local file.
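
If you want a quick programmatic look at what was captured, here is a
minimal Node/TypeScript sketch (untested; "list-har.ts" and
"network.har" are just example names) that prints one line per
recorded request:

// list-har.ts -- print a one-line summary of every request in a HAR file.
// Usage: npx ts-node list-har.ts network.har
import { readFileSync } from "fs";

const har = JSON.parse(readFileSync(process.argv[2], "utf8"));

// A HAR file is JSON: a top-level "log" object holding an "entries"
// array, one entry per request/response pair.
for (const entry of har.log.entries) {
  console.log(`${entry.request.method} ${entry.response.status} ${entry.request.url}`);
}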

-- Petr
Pierre Brico via curl-users
2021-06-01 15:12:01 UTC
I'll just add that the "Tools for web developers" are also present in
Chrome and Edge (also via F12).

Pierre

bruce via curl-users
2021-06-01 16:43:17 UTC
Hi Petr/Pierre

I took a quick look at the HAR. I opened Dev Tools/Network, went to the
selected URL in the browser's URL bar, selected the GET for the
content, and then chose "Save All as HAR" from the right-click menu.

The HAR content lists the "network" data as well as the GET URL, but
not the actual content of the page.

Did I miss something?

thanks

Petr Pisar via curl-users
2021-06-01 17:33:06 UTC
Post by bruce via curl-users
The HAR content lists the "network" data as well as the GET URL, but
not the actual content of the page.
The HAR file is in JSON format. If you go to the "entries" list and
pick an item whose request/url value matches your desired URL, then the
response/content/text string should contain the body of the HTTP
response.
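
As a rough Node/TypeScript sketch (untested; the file name and the URL
prefix are placeholders for whatever you captured):

// extract-body.ts -- print the response body of entries matching a URL prefix.
import { readFileSync } from "fs";

const har = JSON.parse(readFileSync("network.har", "utf8"));
const wanted = "https://example.com/api/"; // placeholder URL prefix

for (const entry of har.log.entries) {
  if (!entry.request.url.startsWith(wanted)) continue;
  const content = entry.response.content;
  if (!content || !content.text) continue; // body was not saved in the HAR
  // Per the HAR spec, a body may be stored base64-encoded; "encoding" says so.
  const body = content.encoding === "base64"
    ? Buffer.from(content.text, "base64").toString("utf8")
    : content.text;
  console.log(body);
}

The same loop without the URL check would dump the body of every
document the page pulled in.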

Or what do you mean by "actual content of the page"? There is no such
thing as a page in HTTP. There are only documents returned for
requests. The page is rendered from those documents by the web browser.
There won't be any image of the "actual content".

Maybe if you provided us with the URL, we could help you.

-- Petr
bruce via curl-users
2021-06-01 18:34:33 UTC
Hi.

The test page/URL I'm looking at for my use case would be:
https://www.bkstr.com/efollettstore/home

Selecting the dev tools/network, and then doing an inspection of the
GET - svc.bkstr.com - https://svc.bkstr.com/store/byName?storeType=FMS&catalogId=10001&langId=-1&schoolName=*
(might need to do a refresh to generate the list of functions/calls for the page)

It's the 4th or 5th "json"-type entry after the group of "plain"-type entries.

I can easily select "display in a different tab" and generate the content for that URL.

However, I'm trying to figure out if there's an extension or a method
to generate/save the content of the page while I'm viewing the
top-level URL. The process would then have to generate the content of
the associated URLs which make up the page.

I recall seeing some extensions that managed to "track" the item a
person was looking at (selecting) on a page and then generate
comparison pricing data. I would assume that these extensions were
extracting the data from the content and therefore accessing the
content of the page.

Hope this gives a bit more clarity to what I'm attempting to find.

Thanks

Petr Pisar via curl-users
2021-06-02 18:54:29 UTC
Post by bruce via curl-users
https://www.bkstr.com/efollettstore/home
Selecting the dev tools/network, and then doing an inspection of the
GET - svc.bkstr.com - https://svc.bkstr.com/store/byName?storeType=FMS&catalogId=10001&langId=-1&schoolName=*
(might need to do a refresh to generate the list of functions/calls for the page)
I don't get that address (.../byName) loaded anywhere when visiting the
(.../home) page. But that may be because I block various addresses,
cookies, scripts, etc. by default. Or maybe the scam-like web page is
simply broken, because it randomly returns an "under maintenance" error
message to me.
Post by bruce via curl-users
However, I'm trying to figure out if there's an extension or a method
to generate/save the content of the page while I'm viewing the
top-level URL. The process would then have to generate the content of
the associated URLs which make up the page.
I'm sorry, I cannot help you.

-- Petr
bruce via curl-users
2021-06-02 20:26:33 UTC
Hi Petr.

Thanks so much for your time on this.

As far as I can tell, the overall site/process uses "blocking" tech
from PerimeterX. This is combined with reCAPTCHA and other measures
that restrict the ability to access the data. Even switching browser
tabs too fast will cause issues.

I'm thinking a possible solution might be to go in a "different"
direction via some sort of browser extension which would access the
content and then send it off to an external server for
parsing/processing.
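
Something along these lines, as an extension content script (a rough,
untested sketch; the collection endpoint is a placeholder, and the
extension manifest would still have to register the script for the
matching pages and permit the cross-origin POST):

// content-script.ts -- runs inside the page, so it sees the rendered
// DOM (including script-generated content), then ships it to a server.
window.addEventListener("load", () => {
  const html = document.documentElement.outerHTML; // rendered DOM, not "view source"
  fetch("https://example.invalid/collect", { // placeholder endpoint
    method: "POST",
    headers: { "Content-Type": "text/html" },
    body: html,
  });
});

For content that only arrives after the load event, a delay or a
MutationObserver would be needed before grabbing the DOM.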

Thanks

Bill Mercer via curl-users
2021-06-02 21:54:46 UTC
-----Original Message-----
From: bruce via curl-users
Sent: Wednesday, June 2, 2021 4:27 PM
Subject: Re: sort of tangent to curl -- web browser/extension
Hi Petr.
Thanks so much for your time on this.
As far as I can tell, the overall site/process uses "blocking" tech
from PerimeterX. This is combined with reCAPTCHA and other measures
that restrict the ability to access the data. Even switching browser
tabs too fast will cause issues.
Honestly, if the site owners are going to these lengths to prevent you from doing what you're trying to do, you may want to reevaluate whether it's worth the effort.
I'm thinking a possible solution might be to go in a "different"
direction via some sort of browser extension which would access the
content and then send it off to an external server for
parsing/processing.
This sounds a lot like capturing a HAR and replaying it.
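
Very roughly, something like this could re-issue what the HAR recorded
(an untested sketch assuming Node 18+; a PerimeterX-protected site will
likely still reject replayed requests):

// replay-har.ts -- re-issue the GET requests recorded in a HAR file.
import { readFileSync } from "fs";

async function replay(file: string): Promise<void> {
  const har = JSON.parse(readFileSync(file, "utf8"));
  for (const { request } of har.log.entries) {
    if (request.method !== "GET") continue;
    // Reuse the recorded headers (cookies, user-agent) so the request
    // looks like the browser's, but drop HTTP/2 pseudo-headers such as
    // ":authority", which fetch rejects.
    const headers = Object.fromEntries(
      request.headers
        .filter((h: { name: string }) => !h.name.startsWith(":"))
        .map((h: { name: string; value: string }) => [h.name, h.value]),
    );
    const res = await fetch(request.url, { headers });
    console.log(res.status, request.url);
  }
}

replay("network.har").catch(console.error); // example file name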

