First of all, I have not characterised the MB webcam, so I cannot guarantee it is diffraction limited. And I am not involved in small-lens development, so everything I say is based on general optical knowledge. However, if the lens were not at least close to the diffraction limit, the resolution would be much worse. On the other hand, there is little point in engineering past the diffraction limit, so that is where the effort stops.
"Real" camera lenses are a bit different story when it comes to their dimensions and tolerances. There it really holds true that diffraction limits apply only when stopped down to maybe f/4 or even f/8. Actually, the same applies to the human eye, it is diffraction limited to approximately 3 mm pupil size, larger apertures are limited by optical aberrations mainly in the cornea.
But even with DSLRs the story differs between crop sensors and full-frame sensors: it is more difficult to make a diffraction-limited full-frame lens than a diffraction-limited crop-sensor lens. One way to think of this is to consider the diffraction limit size on the sensor. The diffraction limit on the image plane (sensor surface) depends only on the f-number, not on the focal length of the lens. It is easier to make a lens which creates a diffraction-limited 1.3 mm x 1.7 mm image (where a diffraction limit of 2.5 um is on the order of 1/500 of the image size) than a diffraction-limited 36 mm x 24 mm image (where the same 2.5 um is on the order of 1/15,000 of the image size).
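To make the f-number point concrete, here is a minimal sketch using the standard Airy-disk diameter 2.44 * lambda * N (the 0.55 um wavelength is my assumption):

```python
def airy_diameter_um(f_number, wavelength_um=0.55):
    """Airy-disk diameter on the image plane; note there is no focal length here."""
    return 2.44 * wavelength_um * f_number

for n in (1.8, 2.8, 4.0, 8.0):
    print(f"f/{n}: spot diameter ~ {airy_diameter_um(n):.1f} um")
```

The 2.5 um figure above thus corresponds to roughly an f/1.8 lens at this wavelength.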
It would actually be a fun, and not very difficult, experiment to characterise the transfer function of the webcam. Print a spoke target (google "Siemens star") and hold it at a suitable constant distance from the camera. Take a snapshot of the well-illuminated target and look at the softening of the star towards its centre. In this case you can just print a target with a laser printer (there are PDFs available). If you vary the distance of the target, you can also see the effect of the lack of autofocus. (A spoke target is great because it is scale-invariant, i.e., you do not need to take the exact distance into account.)
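If you do not find a ready-made PDF, generating a star yourself takes only a few lines; here is a quick sketch (the 36-spoke count and the 2000 x 2000 canvas are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

n_spokes = 36
y, x = np.mgrid[-1:1:2000j, -1:1:2000j]
angle = np.arctan2(y, x)
star = np.sign(np.sin(n_spokes * angle))   # alternating black/white sectors
star[np.hypot(x, y) > 1] = 1               # white outside the unit circle

plt.imshow(star, cmap="gray")
plt.axis("off")
plt.savefig("siemens_star.pdf", bbox_inches="tight")
```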
And, of course, the resolution may be quite different in different areas of the image. The center may be (close to) diffraction-limited, but the corners may be much softer.
The image size per se is not a problem. It is entirely possible to make much smaller pixels than we have at the moment. The smallest pixels available are typically around 1.0 um x 1.0 um, and semiconductor processes would allow much smaller photosites. (There is another limit, though, and it relates to the maximum dynamic range of a pixel.)
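To give a feel for that dynamic-range limit: the full-well capacity of a photosite scales roughly with its area, so halving the pixel pitch cuts the highlight headroom by about a factor of four. A toy model, where both the well-capacity density and the read noise are invented round numbers rather than datasheet values:

```python
import math

def dynamic_range_db(pixel_pitch_um, well_density_e_per_um2=10_000, read_noise_e=2.0):
    # Full well scales with photodiode area; dynamic range is full well / noise floor.
    full_well = well_density_e_per_um2 * pixel_pitch_um ** 2
    return 20 * math.log10(full_well / read_noise_e)

for pitch in (0.7, 1.0, 2.0, 4.0):
    print(f"{pitch} um pixel: ~{dynamic_range_db(pitch):.0f} dB")
```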
The main problem is really the physical aperture of the lens, because that is where the laws of physics set a hard limit. Engineering limitations then dictate the distance from the lens to the image plane. So, the engineering logic goes like this: need a larger physical aperture (more light, fewer diffraction effects) -> need a longer focal length (to avoid impossible numerical apertures) -> need more distance to the sensor and a larger sensor.
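This chain can be put into numbers. Below is a minimal sketch, assuming the lens cannot be made much faster than f/1.8 (my assumption, not a hard figure) and using the FOV formula derived just below:

```python
import math

def lens_stack(aperture_mm, fov_deg=78, min_f_number=1.8):
    focal_length = min_f_number * aperture_mm                        # f = N * D
    sensor_width = 2 * focal_length * math.tan(math.radians(fov_deg) / 2)
    return focal_length, sensor_width

for d in (1.0, 2.0, 5.0):
    f, w = lens_stack(d)
    print(f"aperture {d} mm -> focal length {f:.1f} mm, sensor width {w:.1f} mm")
```

Doubling the physical aperture doubles everything downstream, which is why thin devices are stuck with small apertures and small sensors.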
The formulae for image and pixel size are quite simple in geometric terms. Let us use the following quantities:
f — focal length of the lens
θ — horizontal (or vertical) FOV angle
n — number of pixels horizontally (or vertically)
The physical size of the sensor is then (the distance from the lens to the image plane is f):
d = f * 2 * tan(θ / 2)
(Draw it and revise high school trigonometry!) And pixel size:
p = d / n
So, for a 78° (horizontal) FOV webcam with 3.7 mm focal length and 1920x1080 image:
d = 3.7 mm * 2 * tan(39°) ≈ 6 mm
p ≈ 6 mm / 1920 ≈ 3.1 um
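The same computation in code, convenient for playing with other focal lengths, FOVs, and resolutions (a direct transcription of the two formulas above):

```python
import math

def pixel_size_um(focal_length_mm, fov_deg, n_pixels):
    d_mm = focal_length_mm * 2 * math.tan(math.radians(fov_deg) / 2)  # sensor size
    return d_mm * 1000 / n_pixels                                     # pixel pitch

print(pixel_size_um(3.7, 78, 1920))   # ~3.1 (um), matching the result above
```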
Reasonable pixel sizes are between 1 um (small phone/webcam) and 10 um (full-frame sensor).