Jan 8, 2023 2:33 AM
#1
Jan 2015
10
Hi, I am back again with a new question.

Since the API's limitations are too restrictive, I am thinking of scraping most of the data from the HTML myself.

At the moment I only scrape the "later" season page once a month to get the anime IDs (since I can't get them from the API).
I believe this isn't a problem, since it is only one HTML request per month.

What I need to do now is get the ANN ID (and maybe more data) for every anime on MAL. This wouldn't be a problem for older anime, since I could do it once, slowly, over time. The problem is newly released anime, since every month I would need to retrieve the ANN ID for all of them.

This could be many anime per month, so I would need to access many HTML pages per month.

What rate should I use to retrieve that amount of information every month? Is this even allowed, or could I get IP banned?

I could do it with a cron job for the anime whose information is still "null" in my DB, but I am really not sure whether I can request one HTML page every second, every minute, or every hour.
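To make the cron job idea concrete, here is a minimal sketch of the "pick up whatever is still null" step using SQLite. The table `anime` and the columns `mal_id` and `ann_id` are hypothetical names chosen for illustration; the real schema may look different.

```python
import sqlite3
from typing import List


def pending_anime_ids(conn: sqlite3.Connection, limit: int = 50) -> List[int]:
    """Return up to `limit` MAL ids whose ANN id has not been scraped yet."""
    rows = conn.execute(
        # Hypothetical schema: anime(mal_id INTEGER PRIMARY KEY, ann_id INTEGER)
        "SELECT mal_id FROM anime WHERE ann_id IS NULL ORDER BY mal_id LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def save_ann_id(conn: sqlite3.Connection, mal_id: int, ann_id: int) -> None:
    """Store a scraped ANN id so the next cron run skips this anime."""
    conn.execute("UPDATE anime SET ann_id = ? WHERE mal_id = ?", (ann_id, mal_id))
    conn.commit()
```

Each cron run would call `pending_anime_ids`, scrape those pages slowly, and call `save_ann_id` for each result, so an interrupted run simply resumes on the next schedule.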

Thank you very much.
Jan 8, 2023 8:16 AM
#2
四十二

Mar 2016
479
@Davenzo: Hi! There is no precise guidance on how often you can send new HTTP requests. My suggestion is to add a delay of 700 milliseconds (or longer) before sending each request. That way, you shouldn't get rate limited or temporarily blocked.
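A minimal sketch of that fixed delay in Python. The `fetch` callable is a placeholder for whatever performs the actual HTTP request, and the `sleep` parameter exists only so the loop can be tested without waiting; both are assumptions of this sketch, not anything MAL prescribes.

```python
import time
from typing import Callable, Iterable, List

MIN_DELAY_SECONDS = 0.7  # the 700 ms suggested above; raise it to be safer


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = MIN_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, sleeping `delay` seconds before each request."""
    pages = []
    for url in urls:
        sleep(delay)  # wait before every request, including the first one
        pages.append(fetch(url))
    return pages
```

In a real script, `fetch` could be a small wrapper around `urllib.request.urlopen` that returns the decoded page body.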
HTCPCP/1.0  ★ MetaMAL  ★ Picture credits: Kieed & 1041uuu
Jan 8, 2023 2:01 PM
#3
Jan 2015
10
Thank you as always, ZeroCrystal.
I'll go with one second between requests and see what happens.

I still have a few questions:

To let the script run by itself, I would need to run some tests that get me temporarily blocked, so I can detect how the HTML page behaves and set a sleep that's much longer than one second for that specific case.

Could this lead me to get a permanent ban of some sort? I can't control the IP of my hosting so this would be a major problem if it could happen.
Is being "temporarily blocked" the same as getting a 403 error from the API? (That happened to me many times while I was testing, but it would go away in a few minutes.)
Could I, at a rate of one anime per second, scrape about 1000 anime in one run (once a month)?
Jan 8, 2023 2:57 PM
#4
四十二

Mar 2016
479
Davenzo said:
To let the script run by itself, I would need to run some tests that get me temporarily blocked, so I can detect how the HTML page behaves and set a sleep that's much longer than one second for that specific case.

Could this lead me to get a permanent ban of some sort? I can't control the IP of my hosting so this would be a major problem if it could happen.

I have never heard of an IP getting permanently banned for sending too many requests.

MAL should be using WAF and CloudFront to rate limit the requests received from a specific IP address. I was sure I had saved the page that is returned when you trigger the rate limit, but I can't find it at the moment...
If I remember correctly, it was a plain HTML page from CloudFront with a 4XX status code. You get temporarily blocked for about 5 minutes, and then everything goes back to normal. Take it with a pinch of salt, though.
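A rough sketch of how a script could react to that block, assuming it shows up as a 4XX status and clears after about 5 minutes (both recalled from memory above, so verify them before relying on this). `fetch` is a hypothetical callable returning `(status, body)`, and treating 404 as a missing page instead of a block is an assumption of this sketch.

```python
import time
from typing import Callable, Tuple

COOLDOWN_SECONDS = 300  # assumption: the block lasts "about 5 minutes"


def looks_blocked(status: int) -> bool:
    """Treat any 4XX except 404 as a temporary block (assumption, see above)."""
    return 400 <= status < 500 and status != 404


def fetch_with_cooldown(
    url: str,
    fetch: Callable[[str], Tuple[int, str]],
    max_attempts: int = 3,
    cooldown: float = COOLDOWN_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Fetch a page; while the response looks blocked, wait and retry.

    Returns the last (status, body) seen, so the caller can still tell
    whether the final attempt succeeded.
    """
    status, body = fetch(url)
    for _ in range(max_attempts - 1):
        if not looks_blocked(status):
            break
        sleep(cooldown)  # sit out the block before trying again
        status, body = fetch(url)
    return status, body
```

If the final status still looks blocked, the safest move is probably to stop the run and let the next scheduled run continue.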


Davenzo said:
Is being "temporarily blocked" the same as getting a 403 error from the API? (That happened to me many times while I was testing, but it would go away in a few minutes.)

Yes, I think both the API and the website rely on the same mechanism. I suggest you double-check it.


Davenzo said:
Could I, at a rate of one anime per second, scrape about 1000 anime in one run (once a month)?

Yes, I don't see any issue with that.
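For a sense of scale, a quick back-of-the-envelope calculation. The helper name and the 0.5-second latency figure are illustrative assumptions, not measurements.

```python
def estimated_minutes(pages: int, delay_seconds: float, latency_seconds: float = 0.0) -> float:
    """Rough duration, in minutes, of a run that sleeps before each page."""
    return pages * (delay_seconds + latency_seconds) / 60


print(round(estimated_minutes(1000, 1.0), 1))       # 16.7
print(round(estimated_minutes(1000, 1.0, 0.5), 1))  # 25.0
```

So 1000 pages at one request per second is roughly a 17-minute run before counting the time each request itself takes.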
Jun 23, 6:35 AM
#5
Dec 2020
19
I’ve done scraping for hobby projects and rate limiting has always been a big concern. When I scraped MAL in the past, I started with a delay of 2 seconds between requests. I did get the occasional temporary block (the kind that lasts a few minutes, like you mentioned) but never faced a permanent ban. Most of the time, going slow and spreading it out kept me out of trouble. Scraping 1000 anime in one go with a second or two delay is doable, just expect it to take a little while. I always monitored the responses for 4XX codes so I could pause or back off if needed.
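One way to sketch the "monitor for 4XX and back off" idea is a delay that grows while responses look blocked and resets once they are fine again. The class name, the doubling factor, and the 10-minute cap are assumptions picked for illustration.

```python
class AdaptiveDelay:
    """Delay between requests that doubles on 4XX responses and resets otherwise."""

    def __init__(self, base: float = 2.0, factor: float = 2.0, maximum: float = 600.0):
        self.base = base        # normal delay in seconds (2 s, as described above)
        self.factor = factor    # assumption: double the delay after each 4XX
        self.maximum = maximum  # assumption: never wait longer than 10 minutes
        self.current = base

    def update(self, status: int) -> float:
        """Record the latest status code and return the delay to use next."""
        if 400 <= status < 500:
            self.current = min(self.current * self.factor, self.maximum)
        else:
            self.current = self.base
        return self.current
```

The scraping loop would call `time.sleep(delay.update(status))` after every response.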

If you want something that handles temporary blocks and potential bans more smoothly, there are services built for this purpose. I’ve read about tools that rotate proxies, auto-handle captchas, and make large crawls less risky. One site has details on all of this, especially for when you need to scrape a lot without worrying about IPs or getting shut out. Here’s where I found some background info: https://crawlbase.com/
7k72, Jun 27, 4:33 AM