I am about to start using a “new” camera in the vehicle that requires some time to process each day’s take. After some local processing I run the job in separate “process” and “upload” steps.
For a 200-300 km travel day I might have 40,000 images, and the process step can take 4-5 hours on an i5 laptop. This is in a vehicle, so there is an energy budget as well. The images have fully populated EXIF, so there is no geotagging or rate/distance calculation to worry about.
I know most of the Linux OS tricks to optimise this, but would appreciate any general ideas. I note too that --num_processes would be set to the number of CPUs and wonder if an increase would help; the job currently isn’t very SSD-I/O- or CPU-intensive. I should point out that 40 GBytes/day over a USB2 plug-in drive is probably a bottleneck, so I’ll likely increase the read-ahead buffer and write delay. From memory there are also hard references to the physical file locations, so running “process” on the SSD and “upload” from a USB2-attached drive (after moving the files) can be problematic.
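Aside from system-wide read-ahead tuning, one per-process option on Linux is to hint sequential access when reading the big files, which lets the kernel ramp up read-ahead for that descriptor. A minimal sketch (the helper name and chunk size are my own choices, not anything from mapillary_tools):

```python
import os

def read_sequential(path, chunk=4 * 1024 * 1024, consume=None):
    """Read a file start-to-finish after hinting sequential access
    to the kernel (Linux posix_fadvise); returns total bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Tell the kernel this descriptor will be read sequentially,
        # so it can increase read-ahead for this file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        total = 0
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
            if consume:
                consume(buf)
        return total
    finally:
        os.close(fd)
```

This only helps code you control (your own ffmpeg/exiftool wrappers), not mapillary_tools internally, but it is a cheap experiment on a USB2 drive.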
When planning longer recordings, you should be aware that, in my experience, Mapillary and GSV can only process file uploads with a maximum size of 40 GB each.
This means you might have to edit longer recordings. Another method is to reduce the number of GPX points accordingly. In the Insta 360 universe, I edit sequences with a maximum length of 5 minutes each and reduce the GPX track output by a factor of 10. A DJI Osmo produces sequences of 15 minutes each, which cannot be edited. I reduce the file size during export from a nominal 50 GB to 40 GB and reduce the GPX track by a factor of 28.
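Thinning a GPX track by a fixed factor is easy to script. A hedged sketch in Python that keeps every Nth trackpoint (plus the last) in each segment, assuming a standard GPX 1.1 file; the function name is my own:

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"  # GPX 1.1 namespace

def thin_gpx(path_in, path_out, factor):
    """Keep every `factor`-th <trkpt> (plus the last one) in each <trkseg>."""
    ET.register_namespace("", GPX_NS)
    tree = ET.parse(path_in)
    for seg in tree.getroot().iter(f"{{{GPX_NS}}}trkseg"):
        pts = [p for p in seg if p.tag == f"{{{GPX_NS}}}trkpt"]
        # Always retain the final point so the track still ends where it ended.
        keep = set(range(0, len(pts), factor)) | {len(pts) - 1}
        for i, pt in enumerate(pts):
            if i not in keep:
                seg.remove(pt)
    tree.write(path_out, xml_declaration=True, encoding="UTF-8")
```

For example, `thin_gpx("day.gpx", "day_thin.gpx", 10)` implements the factor-of-10 reduction mentioned above.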
This means that for each new camera, the optimal balance between recording duration, GPX density, and file size must be determined iteratively.
You should also keep in mind that longer journeys will very likely involve a mix of different speeds. Therefore, I recommend using video mode for such recordings and letting Mapillary decide how many frames are actually extracted from the video. The number of GPX nodes isn’t necessarily crucial here; Mapillary interpolates additional GPX points as needed. My experience has shown that a high GPX node density during upload is more of a hindrance than a help for this process.
Those using a GoPro MAX2 are in luck, as Mapillary takes care of everything for them.
Okay, thanks for that, but I don’t know how significant that will be in my case, as the tools are already looking at populated JPG EXIFs (I use exiftool to do the geotagging and interpolation). They don’t look at a GPX at all, and so far the 3 FPS frames have only had a few duplicate removals, which signify deliberately dropped frames. My process command line though does have:
Most of the take is at highway speeds of perhaps 8 m/frame. I’ll probably increase --duplicate_distance to about 2 m (and --duplicate_angle to maybe 20), but have to keep in mind that the images are only 2K and a bit gluggy, so object recognition may benefit from the extra frames.
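The per-frame spacing quoted above is just speed divided by capture rate; a quick sanity check of the ~8 m/frame figure, assuming the 3 FPS capture rate mentioned elsewhere in the thread:

```python
def frame_spacing_m(speed_kmh: float, fps: float) -> float:
    """Metres travelled between consecutive frames at a constant speed."""
    return (speed_kmh / 3.6) / fps

# At roughly 86 km/h and 3 FPS the spacing is about 8 m, matching the
# figure above, so a 2 m --duplicate_distance would mainly trim
# near-stationary frames rather than highway footage.
```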
For reference, my source GPS data is NMEA at 5/sec from a ublox, as a single daily file. The source videos (which I process with ffmpeg) are 1 GByte, 5-minute 2K MOVs at 30 FPS, from which I pull only the (1-in-10) index/key frames. The largest Mapillary sequences are then around 800-900 images.
Uploading to Mapillary and efficiency… I guess we can only dream. As you have already mentioned, a USB 2.0-connected drive is always going to be the main bottleneck in this scenario, and it will not help that the drive is an SSD. I too do my image post-processing on a USB 2.0-connected hard drive (for multiple reasons there is no need to go into here). The HDD is more than fast enough for USB 2.0, so I accept this bottleneck.
However, for me the tightest bottleneck is mapillary_tools itself. It is incredibly slow and inefficient. Most things happen on one CPU core or thread only, because of Python’s global interpreter lock (GIL). In other words, Python does not support true multi-threading to this very day, despite having a threading API module. So do not be fooled by the --num_processes and --max_upload_workers options; they do not do exactly what you think they should. Due to the GIL, Python supports asynchronicity at best. Effectively, this is comparable to Hyper-Threading at the CPU level, or to the cooperative task switching of Windows 3.0-3.11. Only because of I/O’s inherently asynchronous nature on modern operating systems do you get the illusion of multi-threading in Python, although it is not true multi-threading.
But this is not the worst part. mapillary_tools has many horrendous internal bottlenecks. For instance, it validates its own JSON image description files (yes, files it has just output itself!) with an awfully slow single-threaded validator. Then it MD5-hashes each file fully, also on a single thread only. When uploading, it does not even hash ahead all sequences in the upload queue; it only hashes the next sequence right before it commences uploading. All of the above may work okay for a handful of sequences with a couple of hundred images each, but it does not at all scale to hundreds of sequences with thousands of images per sequence. You, as a contributor, cannot fix this easily; this is rather something for Mapillary to do on their end. So, I am afraid I cannot give you any really helpful ideas due to the “elephant in the room”.
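To illustrate what “hashing ahead” could look like, here is a sketch that pre-computes MD5s for an entire queue with a process pool instead of hashing each sequence just before upload. The function names and integration are my own assumptions, not mapillary_tools’ actual API:

```python
import hashlib
from multiprocessing import Pool

def md5_file(path: str):
    """MD5 one file, streaming in 1 MiB chunks to keep memory flat."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return path, h.hexdigest()

def prehash(paths, workers=4):
    """Hash every file in the upload queue ahead of time, in parallel,
    rather than serially right before each sequence uploads."""
    with Pool(workers) as pool:
        return dict(pool.imap_unordered(md5_file, paths))
```

On a USB2 drive the bus, not the CPU, may still dominate, but pre-hashing at least overlaps the work with earlier uploads instead of stalling each sequence.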
I have said it before and will say it again: Python is a rapid prototyping tool, nothing more, nothing less. It should not be used in any serious production code scenario, including running on thousands of end-user machines. The best solution to this problem would have been for Mapillary to publish an upload protocol specification years ago, as I have requested multiple times. That way, people would have innovated, competed, and implemented the best solution(s) for their needs. Instead, we are stuck with mapillary_tools as the GIL-hampered gate of pain and sorrow.
Actually, there is a piece of advice I can share with you, especially for your USB 2.0-connected drive scenario! I consistently get higher throughput on my drive when I set --num_processes 0, which effectively means single-threaded I/O.
I guess some questions then. Unfortunately I am stuck with what is available! I personally would have liked big uploaders to have a shell account on the Mapillary server and just use rsync to transfer batches of videos or images.
About multiple instances generally: I seem to remember you commented some time ago that you regularly ran the tools this way, and that some recent patch/upgrade (SQLite?) helped with the collisions that occasionally occurred. I also quite often run the tools on 4 camera-view jobs concurrently, and these collisions/abends haven’t happened for a while. Looking at the process table during uploading, I more or less figured that the --max_upload_workers functionality was handled that way.
In your opinion, then, can the tools be run concurrently reliably? That may even sit well alongside your --num_processes 0 setting. I have the vague impression that processing time is not linear in the number of images, i.e. 5,000 may take 20x longer than 1,000. Perhaps split the whole job into as many parts as there are CPUs?
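If one did split the job per CPU, the batching itself is trivial; a sketch that deals a list of sequence directories into one chunk per CPU. The `mapillary_tools process` invocation is left commented out and merely illustrative, since whether concurrent processing is worthwhile is exactly the open question here:

```python
import os

def chunk_round_robin(items, n_chunks):
    """Deal items round-robin into up to n_chunks near-equal lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, item in enumerate(items):
        chunks[i % n_chunks].append(item)
    return [c for c in chunks if c]  # drop empty chunks

# Example: one batch per CPU; each batch could then be handed to a
# separate `mapillary_tools process` instance, e.g. via subprocess.run().
# batches = chunk_round_robin(sequence_dirs, os.cpu_count() or 1)
```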
I am hoping to improve the process step, not the upload step. My last test run was about 5 hours of processing and 2 hours of uploading. The upload would not be hindered by a USB drive, but I wonder if running “process” on an i7/SSD and then transferring to the USB drive for upload on the i5 might help. I guess I’d also have to modify the JSON file for any file-location change, or do some creative symlinking. Can you see any other problem (apart from the log in .cache/mapillary_tools) with doing such a transfer?
For some of my processing I also have to throttle the laptop CPU, but I’m pretty sure running the tools never causes that. This all suggests an I/O bottleneck to me.
Well, this would be one idea, but I am not sure it would be a good solution even for power users. mapillary_tools is like driving with the parking (or hand) brake applied. I have my own set of ideas for how efficiency could best be improved, like publishing an upload protocol spec, among others. But this thread is not about sharing what could be done.
No, I do not run multiple instances, neither for processing nor uploading, and I have always advised against it for uploading due to the way mapillary_tools is implemented, especially the way it creates and handles the upload history and runs sequence upload sessions. However, AFAIK @TheWizard and @osmplus_org did, or continue to, run multiple instances. You can safely run the process sub-command concurrently, so there you can do as you like. However, you should not run the upload sub-command concurrently. The collisions you are referring to were race conditions when multiple upload workers (--num_upload_workers) had to write the upload history. This was an internal mapillary_tools bug, which was fixed quite some time ago. Generally speaking, for the sake of comparability, elimination of bugs, and thus an easier conversation, try to always run the latest mapillary_tools version, even though you usually cannot expect much of a performance improvement from each new version.
This is true, mostly because of the sleepy JSON validator.
Like you say, this should work, as long as you update the absolute file paths in the JSON image description file. These absolute image file paths are another unfortunate design choice, in my opinion. There may have been a rational cause, working around another issue that eludes me right now, but it is nevertheless unfortunate.
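Updating those paths is easy to script. A minimal sketch, assuming the description file is a JSON array of objects whose `filename` field holds the absolute image path; check the key name in your actual file before relying on this:

```python
import json

def rewrite_paths(desc_in, desc_out, old_prefix, new_prefix, key="filename"):
    """Re-point absolute image paths in a JSON image description file,
    e.g. after moving processed images from the SSD to a USB drive."""
    with open(desc_in) as f:
        entries = json.load(f)
    for entry in entries:
        # Only rewrite entries that actually start with the old prefix.
        if key in entry and entry[key].startswith(old_prefix):
            entry[key] = new_prefix + entry[key][len(old_prefix):]
    with open(desc_out, "w") as f:
        json.dump(entries, f, indent=2)
```

For example, `rewrite_paths("desc.json", "desc.json", "/home/me/ssd/", "/media/usb/")` would re-point a day’s take after the move; the prefixes shown are placeholders.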