@tao I am no DNS expert but it looks like the rupload.facebook.com alias and star.c10r.facebook.com have very volatile and diverging TTLs. Maybe I am doing something wrong.
$ dig +ttlunits rupload.facebook.com
;; ANSWER SECTION:
rupload.facebook.com. 11m55s IN CNAME star.c10r.facebook.com.
star.c10r.facebook.com. 40s IN A 157.240.223.17
;; Query time: 32 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Fri Mar 07 14:16:37 UTC 2025
;; MSG SIZE rcvd: 89
$ dig +ttlunits rupload.facebook.com
;; ANSWER SECTION:
rupload.facebook.com. 10m50s IN CNAME star.c10r.facebook.com.
star.c10r.facebook.com. 13s IN A 157.240.27.18
;; Query time: 35 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Fri Mar 07 14:17:43 UTC 2025
;; MSG SIZE rcvd: 89
$ dig +ttlunits rupload.facebook.com
;; ANSWER SECTION:
rupload.facebook.com. 58m49s IN CNAME star.c10r.facebook.com.
star.c10r.facebook.com. 45s IN A 157.240.27.18
;; Query time: 73 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Fri Mar 07 14:28:50 UTC 2025
;; MSG SIZE rcvd: 89
When I set my primary DNS server to Google’s DNS (8.8.8.8), the TTLs are also volatile. So it looks like the issue is at the source, since, if I understand things correctly, the TTL should be propagated.
$ dig +ttlunits rupload.facebook.com
;; ANSWER SECTION:
rupload.facebook.com. 51m8s IN CNAME star.c10r.facebook.com.
star.c10r.facebook.com. 50s IN A 157.240.252.10
;; Query time: 21 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)
;; WHEN: Fri Mar 07 14:34:08 UTC 2025
;; MSG SIZE rcvd: 89
$ dig +ttlunits rupload.facebook.com
;; ANSWER SECTION:
rupload.facebook.com. 45s IN CNAME star.c10r.facebook.com.
star.c10r.facebook.com. 45s IN A 157.240.252.10
;; Query time: 1 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)
;; WHEN: Fri Mar 07 14:34:12 UTC 2025
;; MSG SIZE rcvd: 89
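For anyone who wants to watch this over time, here is a minimal sketch using dnspython (my choice of library, nothing the tools themselves use) that polls the same records and prints each TTL:

```python
# Minimal sketch, assuming the dnspython package (pip install dnspython).
# Polls rupload.facebook.com against 8.8.8.8 and prints each record's TTL,
# so the volatility shown in the dig runs above can be logged over time.
import time

import dns.resolver

resolver = dns.resolver.Resolver()
resolver.nameservers = ["8.8.8.8"]  # same public resolver as above

for _ in range(5):
    answer = resolver.resolve("rupload.facebook.com", "A")
    for rrset in answer.response.answer:  # the CNAME and A RRsets
        print(rrset.name, rrset.ttl, rrset[0])
    print("---")
    time.sleep(60)
```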
AFAIK once DNS resolves you an IP, the subsequent HTTP/TLS/TCP session keeps using this IP address despite DNS changes (i.e. once the connection is established, it won’t suddenly switch to a new IP), so DNS is unlikely the cause IMO. Also, busy sites usually use 1h or even shorter TTLs for better load balancing.
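A quick sketch of what I mean, using only the standard library:

```python
# Once the TCP/TLS connection is up, the peer IP is fixed for that
# connection's lifetime, even if a fresh DNS lookup returns another A record.
import socket
import ssl

ctx = ssl.create_default_context()
raw = socket.create_connection(("rupload.facebook.com", 443))
tls = ctx.wrap_socket(raw, server_hostname="rupload.facebook.com")
print("session pinned to:", tls.getpeername())  # constant for this session
print("fresh resolution:", socket.gethostbyname("rupload.facebook.com"))
tls.close()
```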
I’m trying to reproduce the offset reset issue with different network settings (VPN, proxies). I think that’s the key. Let’s see!
Generally, that’s true. However, when the TTL expires the resolver is forced to make a new DNS query, AFAIK. Otherwise, the TTL would be pointless.
Right, but most of them only push data and do not expect long-running uploads. For an upload server you would want the TTL to be indefinite and only have the host name resolve to different IPs depending on load. You can have dynamic TTLs on an upload server, but then you also have to make sure that upload sessions migrate between IPs, which makes the whole concept of upload load balancing more complex than actually needed. As a compromise, you could also use a very long TTL, like a week (though conceptually it would not make much of a difference).
@tao graph.mapillary.com also maps to star.c10r.facebook.com, which has a dynamic IP address and the same volatile TTL behavior as rupload.facebook.com.
Hence, an upload can hit 100% on one IP address but ultimately fail because the upload-finished request can go to a different IP address than the one the data was uploaded to. And upload sessions do not migrate. This is really messy and confusing.
Hmm, if everything maps to star.c10r.facebook.com, why the different aliases?
Uploading ZIP mly_tools_90f4e91938f32803fe70e8c82c5b8669.zip (1/1): 100%|█████████████████████████████████████████████████████████████████████████████| 85.9G/85.9G [20:57:55<00:00, 1.22MB/s]
2025-03-09 01:05:53,363 - INFO - 1 ZIP files uploaded
2025-03-09 01:05:53,386 - INFO - 88010.9M data in total
2025-03-09 01:05:53,420 - INFO - 88010.9M data uploaded
2025-03-09 01:05:53,421 - INFO - 79334.1s upload time
@tao Since I mapped rupload.facebook.com, graph.mapillary.com, and star.c10r.facebook.com to the same static IP address, everything works flawlessly, free of SSLErrors! Plus, I can pause the upload session at any time and as often as I need or like, and the upload resumes reliably. Finally!
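For the record, the mapping is just static entries in the hosts file; the address below is one of those dig returned earlier and is only an example, any currently valid one should do:

```
# /etc/hosts (or %SystemRoot%\System32\drivers\etc\hosts on Windows)
157.240.27.18	rupload.facebook.com
157.240.27.18	graph.mapillary.com
157.240.27.18	star.c10r.facebook.com
```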
I am not sure, but I think that mapping the aliases alone may not be enough. Again, I am not a DNS expert, but something tells me that resolution may go host name→CNAME→A record→IP address.
Next, I am going to comment out the static IP address mapping from the hosts file and substitute the host name of both endpoint URLs with star.c10r.facebook.com to see whether I get SSLErrors and whether resuming uploads works properly. My expectation is that I should get SSLErrors again and that resuming uploads should break too.
Hey @GITNE, very happy to see you got a workaround here.
I can only reproduce the offset reset issue by switching VPNs, and I can confirm that resumable uploads do not work across data centers (i.e. dc in the response): if you connect to a new data center, it’s likely the offset will be reset. What affects the data center selection is likely your IP, which I assume is always routed to the nearest data center. By using a static IP for rupload.facebook.com I guess it also fixes which dc to connect to, so you don’t see any offset issues. I can’t reproduce any SSLError, so I can’t find more information here.
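For reference, a rough sketch of how I check the dc: query the session’s offset endpoint and look at the response. The URL shape, handle, and token below are placeholders; only the offset and dc fields are the ones discussed here:

```python
# Rough sketch with placeholder URL/handle/token; only "offset" and "dc"
# in the JSON response are the fields discussed above.
import requests

session_url = "https://rupload.facebook.com/mapillary_public_uploads/<handle>"
headers = {"Authorization": "OAuth <user-access-token>"}

info = requests.get(session_url, headers=headers).json()
print("offset:", info.get("offset"), "dc:", info.get("dc"))
# If "dc" changes between calls (e.g. after switching VPNs), the offset is
# likely to come back as 0, matching the resets described above.
```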
BTW the mapillary_tools repo provides a neat test CLI to test upload without affecting your uploads:
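For example, from a checkout of the repo (exact flags may vary by version):

```
python3 -m tests.cli.upload_api_v4 --help
```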
Did you throttle your upload speed? How large were your files? Try uploading a large file, like a few GBs, and throttle down to 64 kbps to make things perhaps a bit more extreme (but maybe not that unrealistic in some scenarios) to provoke an SSLError. Additionally, since graph.mapillary.com is shared between uploading and the web app, try surfing the web app at the same time over the same VPN connection. Maybe this has some impact too? Oh, and please flush your DNS cache first, then make sure that host names are resolved over the VPN connection, not by your local DNS server. Try using a non-facebook.com DNS server.
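Something along these lines, as a sketch (the names and numbers are mine, just to illustrate the 64 kbps pacing):

```python
# Sketch of a crude ~64 kbit/s throttle: feed requests a paced generator
# instead of the raw file object. Chunk size and rate are illustrative.
import time

def throttled_reader(path, rate_bytes_per_s=8_000, chunk_size=1_024):
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
            time.sleep(len(chunk) / rate_bytes_per_s)  # pace to ~64 kbps

# e.g. requests.post(upload_url, data=throttled_reader("big.zip"), headers=...)
```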
I have also seen an SSL error a couple of times now; it just re-uploads the file, no problem there. Just to note that it happens more often. Upload seems to be capped at 100 Mbit/s (I have a 1 Gbit/s fiber connection). Would it be an option to implement multiple parallel uploads?
… it is actually a real problem. The upload infrastructure design is broken. Throwing brute force or throughput at this problem is no solution at all. No amount of throughput is going to solve it.
And that broken design affects all upload routes, including the Mapillary mobile apps, which is especially annoying since many contributors still pay for metered mobile connections. Imho it is a disgrace for a multi-billion dollar tech company to expect contributors not only to capture imagery for free but also to incur additional costs (contributors being their second most valuable asset) because the company is unable to do the simplest homework, like uploading files.
With v0.13.3 I think the egress can be saturated with a single upload process; I also tried running multiple MT processes to upload and didn’t notice any speed improvement compared to a single upload process.
I was indeed on the previous version. I tested it now with v0.13.3 and the upload speeds are greatly improved! There is now more fluctuation between the different uploads; it’s not related to my computer, as CPU time is at 20%. I presume it’s related to the server or interconnects. But I love those speeds way more than before!
@tao When building the latest mapillary_tools==0.13.3 from source with pip, the pynmea2==1.19 dependency fails to build. All other dependencies build flawlessly.
❯ python3 --version
Python 3.12.9
❯ pip --version
pip 25.0.1 from /usr/lib/python3.12/site-packages/pip (python 3.12)
❯ pip install --no-binary :all: mapillary_tools
Collecting pynmea2<2.0.0,>=1.12.0 (from mapillary_tools)
Downloading pynmea2-1.19.0.tar.gz (36 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-wk7g44vb/pynmea2_da86abe101874af696200f318c659b8c/setup.py", line 3, in <module>
import imp
ModuleNotFoundError: No module named 'imp'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
Looks like this has already been fixed upstream (the imp module was removed in Python 3.12), but the source dist has not been updated on PyPI yet.
I have made a handful of larger uploads with the tests.cli.upload_api_v4 module and a legit dummy upload with v0.13.3 into the actual upload feed, both over a direct connection without static IP address mapping (a sort of default net config). I have also played with different chunk sizes. None of the uploads caused an SSLError, even when surfing the Mapillary web app at the same time. All uploads also resumed correctly. I am not sure what you have changed, but things look stable for now. The DNS TTLs continue to be volatile and generally quite short, usually around one minute. I will continue to monitor the situation on upcoming uploads.
@tao I still occasionally get SSLErrors and progress resets, even with static IP address mapping:
Uploading ZIP mly_tools_687a1686021c47aa8fb831fba230de15.zip (1/1): 34%|█████████████████████████▉ | 12.5G/36.5G [2:56:50<5:38:48, 1.27MB/s]
2025-03-25 02:49:25,859 - WARNING - Error uploading chunk_size 5242880 at begin_offset 0: SSLError: HTTPSConnectionPool(host='rupload.facebook.com', port=443): Max retries exceeded with url: /mapillary_public_uploads/mly_tools_687a1686021c47aa8fb831fba230de15.zip (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2427)')))
2025-03-25 02:49:25,860 - INFO - Retrying in 2 seconds (1/200)
Uploading ZIP mly_tools_687a1686021c47aa8fb831fba230de15.zip (1/1): 34%|█████████████████████████▉ | 12.5G/36.5G [3:13:11<6:12:30, 1.15MB/s]
Given its sporadic nature, maybe this also happens when the server is briefly overloaded? The server would then just close the oldest connection(s) to free resources for new connections?
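The retry pattern in the log above amounts to something like this sketch; the endpoint and headers are placeholders, and only the offset re-query, 5 MiB chunk size, and 2 s backoff come from the log itself:

```python
# Sketch of resume-after-SSLError: re-fetch the server-side offset, seek,
# and continue. URL/headers are placeholders, not the verified API surface.
import time

import requests

def upload_with_resume(session_url, path, headers,
                       chunk_size=5 * 1024 * 1024, max_retries=200):
    for _ in range(max_retries):
        try:
            offset = requests.get(session_url, headers=headers).json()["offset"]
            with open(path, "rb") as f:
                f.seek(offset)
                while chunk := f.read(chunk_size):
                    requests.post(session_url, data=chunk,
                                  headers=headers).raise_for_status()
            return  # reached EOF, upload complete
        except requests.exceptions.SSLError:
            time.sleep(2)  # matches the "Retrying in 2 seconds" above
    raise RuntimeError("gave up after max retries")
```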
Using PUT seems to work too. But you cannot resume from a PUT upload, i.e. the offset is always returned as 0 if you use PUT, which makes sense from its semantics perspective (replace a resource; see PUT - HTTP | MDN).
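Roughly like this, with placeholder URL and auth; the only observed behavior is the offset reading 0 after a PUT:

```python
# Sketch contrasting the two verbs against the same session URL; names are
# placeholders. POST appends at the stored offset, PUT replaces the resource,
# after which the queried offset reads 0 again (observed behavior).
import requests

url = "https://rupload.facebook.com/mapillary_public_uploads/<handle>"
headers = {"Authorization": "OAuth <token>"}

requests.post(url, data=b"chunk", headers=headers)       # resumable append
requests.put(url, data=b"whole-file", headers=headers)   # full replace
print(requests.get(url, headers=headers).json()["offset"])  # -> 0 after PUT
```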