How Many Images per Sequence?

In other words: is it better to upload a single long sequence or multiple shorter sequences for reconstruction? Which improves existing reconstruction results most?

Mapillary help says that sequences should be limited to 1,000 images. However, imho this limit is rather low and quite arbitrary, especially if you want to group your sequences into areas, like streets or neighborhoods. Things get even more limiting if you capture in video mode and want to make use of every frame: at 25 fps, 1,000 images makes for sequences only 40 s long, which is oddly short. I think I know the technical reason for this 1k-image soft limit (it has to do with OpenSfM’s RAM requirements), but imho it should not actually exist. In fact, if you think about it for a bit longer, there should be no image-count or storage-size limits on sequences at all. OpenSfM should probably process images in a stream with multiple passes rather than as one huge in-RAM dataset, which with the advent of SSDs is totally acceptable (though a hardware JPEG decoding step would surely help massively in this scenario).
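To make the streaming idea concrete, here is a toy sketch (this is an illustration only, not OpenSfM’s actual pipeline; `process` stands in for whatever per-chunk work a pass would do, e.g. feature extraction on one pass and matching on the next):

```python
from typing import Callable, Iterator, List

def chunks(paths: List[str], size: int) -> Iterator[List[str]]:
    """Yield image paths in fixed-size chunks, so only one chunk's
    worth of decoded data needs to be resident in RAM at a time."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]

def multipass(paths: List[str], size: int, passes: int,
              process: Callable[[List[str], int], None]) -> None:
    """Stream the whole dataset several times, rereading from disk
    on each pass instead of keeping everything in memory."""
    for pass_no in range(passes):
        for chunk in chunks(paths, size):
            process(chunk, pass_no)
```

With SSDs, rereading the data per pass trades cheap I/O for a RAM footprint bounded by the chunk size rather than the sequence length.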

Furthermore, it looks to me like OpenSfM chops up a sequence into bundles of 30 or so chronologically ordered images, reconstructs each bundle on its own, and then tries to align the bundles to each other. It works, but it seems to me like this may be the RAM-limiting factor. Why not reconstruct following a binary tree: search for the best-overlapping (reconstructed) image pairs (tuple bundles), then reconstruct pairs into quadruples (pairs of pairs), and so on?
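The binary-tree idea amounts to a bottom-up pairwise merge. A minimal sketch, assuming a hypothetical `merge` function that would align two partial reconstructions (here it is just a callback):

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def hierarchical_merge(bundles: List[T], merge: Callable[[T, T], T]) -> T:
    """Merge reconstructions pairwise, bottom-up: pairs, then pairs
    of pairs, until a single reconstruction remains. With n leaf
    bundles this performs exactly n - 1 merges over log2(n) levels."""
    level = list(bundles)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge(level[i], level[i + 1]))
        if len(level) % 2:          # odd bundle out carries up a level
            next_level.append(level[-1])
        level = next_level
    return level[0]
```

Each level halves the number of partial reconstructions, so only the current level needs to be held at once.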

I am asking because something tells me that as things are right now it would be best to upload 1 (or maybe 2) image sequences for best results.

Anyhow, maybe I am completely confused and do not know what I am talking about. :face_with_hand_over_mouth: So, it would be nice if you could share some insight into all of this. Thanks!

I don’t give any thought to the end use when creating sequence splits.

I upload 2 fps BlackVue vids and a more or less continuous stream of (left-pointing) images from two phones, only at 3–38 km/h. They work out at around 0.5 to 2 fps and are not synchronous. (One uses Wi-Fi and is a lot slower than the other’s USB connection.)

The BlackVue vids are always 1 sequence per minute/file, so they end up with about 120 images each. (They used to miss 5-odd images due to the NMEA data in the file being offset by 2–3 seconds. Most of my older uploads are 117/118 frames.)

The phone-based images can breach the 500-frame sequence limit and split, but not often. I manually remove a stack of bad frames (e.g. sun in lens or blurred) or private frames (e.g. fronts of houses/homes where no other useful mapping features are in view) before running the tools.

For me the biggest factors are the time and cost of upload. Australia has a low population density compared to the rest of the world, making for weaker cell signals and higher cost. The average cell phone system design really wants to see a base site within maybe 3–5 km, yet I regularly upload at 10–12 km, consuming a lot of battery/solar energy to do so.

This is all not very relevant to your post, of course, but my view is that once the server could adjust lat/lon/direction/FOV for errors and break for date/time shifts, it would be possible to do anything in reconstruction. I.e. a sequence is only handy as an error-reduction method and a way to reduce processing/CPU time. Multiple frames (sequence-agnostic) will eventually be used. Might take a year to get there, though!

I often deliberately capture the same view twice (two sequences) on the side cameras, so as to fit between pedestrians and parked cars. I am, of course, looking for business names/types/contacts for OSM entry. If traffic permits I will also vary my road speed. Splitting captures over a few hours is also helpful for better lighting angles and for avoiding view obstructions.

My view… Cheers.

In theory, a sequence is just a data structure for grouping images, nothing more and nothing less. Reconstruction complexity should scale linearly, i.e. O(n). Reconstructing even hundreds of millions of images (with a couple of thousand features per image) per sector or batch should not be limited much by RAM these days. Sure, you can run out of RAM quickly if you massively crank up the number of features per image, but that is a function of diminishing returns, so you usually want to tune the process to as few features per image as possible while still getting acceptable reconstruction results in some finite time. :thinking: Maybe I am just missing something.

I suspect it is not linear but more a square or cubed thing (think processing areas and possibly volumes). I am not at all across the method, but it will be resource intensive: not just RAM but time and grunt. Complex things just take longer. If the maths resource used is large compared to the I/O resource, then the process may benefit from a widely distributed processing approach.

And one asks: are we only defining/recognizing objects, or building a 3D view (with other uses) first? Kind of like ISAR (inverse synthetic aperture radar), which uses a moving sensor platform not unlike the Mapillary sequence approach. I.e. “using a reasonably straight sensor line”, lock down as many variables as possible and process until a defined quality is achieved. If the quality isn’t achieved with one or a few sequences, then bring in more nearby ones until it is, or wait/prompt for more image input. Meaning: drive the need for more imagery not so much on what doesn’t exist, but on whether the quality of an area of interest is too low.

This is all just waffling and talking through my hat stuff though. I am sure the Mapillary people have sat down and figured the best course already.

You are right, I was overly optimistic. The naive implementation demands quadratic complexity. However, a binary-tree optimization would push it below quadratic, yet won’t make it linear. Nevertheless, I don’t think that RAM could be a limiting factor.
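To put rough numbers on the quadratic-vs-chunked difference, here is a back-of-the-envelope pair count (a toy model only; OpenSfM’s real matcher selects candidate pairs differently):

```python
def naive_pairs(n: int) -> int:
    """All-pairs matching: every image against every other."""
    return n * (n - 1) // 2

def chunked_pairs(n: int, chunk: int = 30) -> int:
    """Rough count when matching only within fixed-size chunks,
    plus one alignment link between neighbouring chunks."""
    full, rest = divmod(n, chunk)
    within = full * naive_pairs(chunk) + naive_pairs(rest)
    between = max(full + (1 if rest else 0) - 1, 0)
    return within + between
```

Under this model a 1,000-image sequence needs 499,500 naive pairs but only about 14,000 chunked ones, which hints at why chunking keeps memory and matching cost manageable.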

I am sure they have too, but this 1k figure makes no sense to me. :wink:

Quick update on that point: I think the 1,000 number was historical and no longer accurate. We have removed it from the help center. Generally, you can capture and upload as many images/videos as you’d like; the more the merrier!


From a reconstruction perspective, there is no big difference between uploading a long sequence as a single one or as two. Internally, we align different sequences to each other anyway.

The only thing that is not advisable is very short sequences. We use the fact that images are in a sequence to group their processing. Having very short sequences can limit the quality of the reconstructions.

That said, after say 50 images, it does not make a big difference. As @GITNE guesses, we currently process long sequences in chunks and therefore the total length of the sequence has little impact on the result.

Keep in mind though, that the way images are processed could change in the future. So the choice of the sequence length should be independent of it. If you are capturing images as a sequence, upload them as a sequence and don’t worry about how long or short it is.


@boris @paulinus Thank you for the clarification! It was really bugging me because some of my 1k+ sequences get stuck while others are processed rapidly. Hence, I was really confused, thinking maybe I was doing something wrong. :smile:
