Discussion:
Scrape text from the screen
Mike Lambert
2018-06-18 13:00:40 UTC
Hi,

I have a web page where the text is sent from a server and displayed
directly on the screen, so the data is not found in the web page source code.

How can I use curl to scrape the text from the screen buffer? The
displayed data can span multiple screens.

Thanks,

Mike
Daniel Lublin
2018-06-18 13:24:51 UTC
Post by Mike Lambert
I have a web page where the text is sent from a server and displayed
directly on the screen, so the data is not found in the web page source code.
How can I use curl to scrape the text from the screen buffer? The
displayed data can span multiple screens.
Hi Mike,

It sounds like you want to extract words, letters and digits from an image.
This is not something that curl does. Simply put, curl downloads
documents, such as text or images, from a location (URL).

The process of extracting words from an image is called optical character
recognition (OCR). You might want to search for a tool that does OCR on a
screenshot, or on an image of text rendered for a screen, as opposed to OCR
on a scanned sheet of physical paper. I think the workflow and processes may
differ somewhat.

If the web page in question just displays large images, you might end up
using curl to download these images and feeding them into your OCR tool, but
that's it.
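
As a rough sketch of that download-then-OCR workflow, here it is in Python
rather than as curl commands. It assumes the Tesseract command-line program
is installed; that is only one possible OCR tool, picked for illustration.
The function names are made up for this example.

```python
import shutil
import subprocess
import urllib.request


def download(url, destination):
    """Save the resource at `url` to `destination` (what `curl -o` would do)."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


def ocr_command(image_path):
    """Command line that asks Tesseract to print the recognized text to stdout."""
    # Assumption: the `tesseract` executable is on PATH.
    return ["tesseract", image_path, "stdout"]


def ocr(image_path):
    """Run the OCR tool on a downloaded image and return the recognized text."""
    result = subprocess.run(
        ocr_command(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Usage would be along the lines of ocr(download(image_url, "screen.png")),
repeated for each image if the data spans several screens.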
-----------------------------------------------------------
Unsubscribe: https://cool.haxx.se/list/listinfo/curl-users
Etiquette:
Georg Horn
2018-06-19 10:51:19 UTC
Hello,
Date: Mon, 18 Jun 2018 15:24:51 +0200
Subject: Re: Scrape text from the screen
Post by Mike Lambert
I have a web page where the text is sent from a server and displayed
directly on the screen, so the data is not found in the web page source code.
How can I use curl to scrape the text from the screen buffer? The
displayed data can span multiple screens.
It sounds like you want to extract words, letters and digits from an image.
This is not something that curl does. Simply put, curl downloads
documents, such as text or images, from a location (URL).
I'd rather believe that the web page uses JavaScript/Ajax to load its
content, as many modern web pages do. You can use a browser add-on like
Live HTTP Headers, or the browser's built-in developer tools, to record all
the HTTP requests that the browser makes while loading the page, and then
try to mimic that behaviour with curl. Unfortunately, those requests
often contain dynamically generated parts, so your test script has to
emulate the behaviour of the JavaScript that runs in the browser.
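
To make the "mimic that behaviour" step a bit more concrete, here is a
minimal sketch in Python. The endpoint, the `page` parameter and the header
values are made-up placeholders; the real ones have to be read from the
recorded requests for the page in question. With curl, the same headers
would be passed with -H.

```python
import json
import urllib.parse
import urllib.request


def build_ajax_request(endpoint, params, referer):
    """Build a GET request resembling the browser's background (Ajax) call."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{endpoint}?{query}",
        headers={
            # Hypothetical headers: copy whatever the recorded request sent.
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
            "Referer": referer,
        },
    )


def fetch_json(request, timeout=30):
    """Send the request and decode the body, assuming the server replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

If the data spans several screens, the page probably issues one such
request per screen, so calling build_ajax_request() with page=1, page=2
and so on and passing each result to fetch_json() would be the next step.
Any token or timestamp the JavaScript adds would have to be reproduced
in `params` as well.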

I have worked on website testing and monitoring for many years now, and
curl was and is a very handy tool for the job, but these days I also use the
Selenium framework. With Selenium you can remote-control a real browser, so
the page is loaded just as if a real user had opened it.
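
A minimal sketch of the Selenium route, using its Python bindings. It
assumes the third-party selenium package and Firefox are installed;
Firefox is only an example, and the function names are made up. The
polling loop is a simple stand-in for Selenium's own wait helpers.

```python
import time


def rendered_text(driver, url, timeout=10.0, poll=0.5):
    """Load `url` in the browser and return the text visible in <body>."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "tag name" is the locator string that Selenium exposes as By.TAG_NAME.
        text = driver.find_element("tag name", "body").text
        # Return once the scripts have put some text on the page, or on timeout.
        if text.strip() or time.monotonic() >= deadline:
            return text
        time.sleep(poll)


def scrape(url):
    """Start Firefox through Selenium, scrape one page, then close the browser."""
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()
    try:
        return rendered_text(driver, url)
    finally:
        driver.quit()
```

Because the browser runs the page's JavaScript itself, there is no need
to reverse-engineer the individual requests; the cost is that a full
browser has to be started for every run.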

Regards,
Georg