Scripting browser-like tasks
curl can do almost every HTTP operation and transfer your favorite browser can. It can actually do a lot more than that as well, but in this chapter we focus on the fact that you can use curl to reproduce, or script, what you would otherwise have to do manually with a browser.
Here are some tricks and advice on how to proceed when doing this.
Figure out what the browser does
This is really a necessary first step. Second-guessing what it does risks having you chase the wrong problem down a rat-hole. The scientific approach to this problem pretty much requires that you first understand what the browser does.
To learn what the browser does to perform a certain task, you can read the HTML pages that you operate on. With deep enough knowledge, you can see what a browser would do to accomplish the task and then start trying to do the same with curl.
The Copy as curl section describes how you can record a browser's request and easily convert that to a curl command line.
Those copied curl command lines are often not good enough though, since they tend to copy exactly that request, while you probably want to be a bit more dynamic so that you can reproduce the same operation and not just resend the verbatim request.
A lot of the web today works with a user name and password login prompt somewhere. In many cases you even logged in a while ago with your browser but it has kept the state and keeps you logged in.
The logged-in state is almost always maintained using cookies. A common operation is to first log in and save the returned cookies in a file, and then let the site update the cookies in the subsequent command lines when you traverse the site with curl.
Web logins and sessions
Although the login page is "visible" (if you use a browser) at https://example.com/, the HTML form tag on that page tells you exactly which URL to send the POST to, through its "action" attribute.
In our imaginary case, the form tag looks like this:
<form action="login.cgi" method="POST">
  <input type="text" name="user">
  <input type="password" name="secret">
  <input type="hidden" name="id" value="bc76">
</form>
There are three fields of importance: user, secret and id. The last one, the id, is marked "hidden", which means that it does not show up in the browser and it is not a field that a user fills in. It is generated by the site itself, and for your curl login to succeed, you need to extract that value and use it in your POST submission together with the rest of the data.
Send the correct contents for the fields to the correct destination URL:
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi -o out
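The URL in the command above comes from combining the page URL with the form's action value. As a rough sketch of that step, the hypothetical helper below (resolve_action is not part of curl) handles the three common cases: an absolute URL, an absolute path and a relative path. It assumes a non-empty action and ignores query strings, fragments and "../" segments.

```shell
#!/bin/sh
# resolve_action is a hypothetical helper, not a curl feature.
# Usage: resolve_action PAGE_URL ACTION
# Assumes a non-empty ACTION; query strings, fragments and "../" segments
# are not handled.
resolve_action() {
  base=$1
  action=$2

  # Make sure the page URL has a path component to work with.
  case $base in
    *://*/*) ;;
    *) base=$base/ ;;
  esac

  case $action in
    http://*|https://*)
      # Already an absolute URL: use it as it is.
      printf '%s\n' "$action"
      ;;
    /*)
      # Absolute path: keep only scheme and host from the page URL.
      origin=$(printf '%s\n' "$base" | sed 's|^\([a-zA-Z]*://[^/]*\).*|\1|')
      printf '%s%s\n' "$origin" "$action"
      ;;
    *)
      # Relative path: replace everything after the last slash.
      printf '%s/%s\n' "${base%/*}" "$action"
      ;;
  esac
}

resolve_action "https://example.com/" "login.cgi"
# prints https://example.com/login.cgi
```

A real browser follows the full URL resolution rules, so treat this only as a starting point for simple forms.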
Many login pages even send you a session cookie already when presenting the
login, and since you often need to extract the hidden fields from the
<form> tag anyway, you could do something like this first:
curl -c cookies https://example.com/ -o loginform
You would often need an HTML parser or some scripting language to extract the "id" field from there. You can then proceed and log in as mentioned above, but with the added cookie loading (I'm splitting the line into two lines to make it more readable):
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
  -b cookies -c cookies -o out
You can see that it uses both
-b for reading cookies from the file and
-c to store cookies again, for when the server sends back updated cookies.
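The extraction of the hidden "id" field mentioned above could be sketched with plain grep and sed. This is a minimal sketch, not a real HTML parser: extract_field_value is a hypothetical helper, and it assumes that the input tag is not split across lines and that the attributes use double quotes. The temporary file stands in for the loginform file saved by the earlier command.

```shell
#!/bin/sh
# extract_field_value is a hypothetical helper, not a curl feature.
# Usage: extract_field_value FILE NAME
# Prints the value attribute of the first <input> tag with the given name.
# Assumes the tag sits on one line and uses double-quoted attributes.
extract_field_value() {
  # Isolate the matching <input ...> tag, then pull out value="...".
  grep -o "<input[^>]*name=\"$2\"[^>]*>" "$1" |
    sed -n 's/.*value="\([^"]*\)".*/\1/p' |
    head -n 1
}

# Stand-in for the "loginform" file that the earlier curl command saves.
form=$(mktemp)
cat > "$form" <<'EOF'
<form action="login.cgi" method="POST">
<input type="text" name="user">
<input type="password" name="secret">
<input type="hidden" name="id" value="bc76">
</form>
EOF

id=$(extract_field_value "$form" id)
echo "id=$id"
# prints id=bc76
rm -f "$form"
```

The extracted value can then be passed to the login command as -d "id=$id" instead of a hard-coded string.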
Always, always, add
-v to the command lines when working out the
details. See also the verbose section for more on this.
It is common for servers to use redirects when responding to a login POST. It is so common I would probably say it is rare that it is not solved with a redirect.
You then just need to remember that curl does not follow redirects
automatically. You need to instruct it to do this by adding the
-L command line option. Adding that to the previous command line then makes the full one:
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
  -b cookies -c cookies -L -o out
In the above example command lines, we save the login response output in a file named 'out'. In your script, you should probably verify that it contains some text that confirms the login was successful.
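One way to do that verification is to search the saved response for a string that only appears after a successful login. The sketch below assumes such a marker exists; both the helper name login_succeeded and the "Welcome back" text are made up for illustration and must be replaced with whatever the real site returns. The temporary file stands in for the 'out' file.

```shell
#!/bin/sh
# login_succeeded is a hypothetical helper; the marker text is an assumption
# and has to be replaced with something the real site shows after login.
# Usage: login_succeeded FILE MARKER
login_succeeded() {
  # -q: no output, only the exit status; -F: treat the marker as a fixed string.
  grep -qF -- "$2" "$1"
}

# Stand-in for the "out" file written by the login command.
out=$(mktemp)
echo '<p>Welcome back, daniel</p>' > "$out"

if login_succeeded "$out" "Welcome back"; then
  echo "login ok"
else
  echo "login failed" >&2
fi
rm -f "$out"
```

Checking for a positive marker tends to be more robust than checking for the absence of an error message, since an unexpected error page would otherwise pass as a success.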
Once successfully logged in, get the files or perform the HTTP operations you
need and remember to keep using both
-b and -c on the command lines to use
and update the cookies.
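To avoid forgetting the cookie options on later requests, a script could wrap curl in a small shell function. This is only a sketch: session_curl and the COOKIE_JAR variable are hypothetical names, and the jar defaults to the 'cookies' file used in the examples.

```shell
#!/bin/sh
# session_curl is a hypothetical wrapper, not a curl feature.
# COOKIE_JAR names the cookie file; it defaults to "cookies".
COOKIE_JAR=${COOKIE_JAR:-cookies}

session_curl() {
  # -b reads cookies from the jar, -c writes the updated set back to it.
  # All other arguments are passed through to curl unchanged.
  curl -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$@"
}
```

After logging in, a later request would then look like session_curl -L https://example.com/ -o page, with the cookie handling added automatically.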
Some sites check that the
Referer: header actually identifies the
legitimate "parent" URL when you request something, log in or
similar. You can then tell the server which URL you arrived from by using
-e https://example.com/ or similar. Appending that to the previous login attempt
then makes it:
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
  -b cookies -c cookies -L -e "https://example.com/" -o out