Become a MacRumors Supporter for $50/year with no ads, ability to filter front page stories, and private forums.

SrWebDeveloper

macrumors 68000
Original poster
Dec 7, 2007
1,871
4
Alexandria, VA, USA
I've got MySQL 4.1 and some legacy software that allows users to rate the quality of an item in a poll, rated from lowest to highest (1-5).

I need to create a SQL query that on its own generates a top 10 list of highest rated items which returns as columns the ranking, pollID and average rating per poll (minumum 6 votes). I am seeking help on the average rating column.

The simplified structure of the table that stores the poll results:

Code:
  PollID    Rating   Votes
  -----     ------   -----
  1         5        100
  1         4        80
  1         3        20
  1         2        18
  1         1        2
  ...and so on...
I did a little research and I think what I need (or close) is a Bayesian estimate such as the commonly used formula documented here.

Problem is, I'm not sure what to plug in and where. Once I see that I can go from there (set sort order and limit, add additional select fields that exist in the real table, etc.) Needs to be a query as the software uses the query results to display the top 10 report. The real report is more detailed than this, of course.

Thank you!

-jim
 
SQL Without the Ranking

Hello there,

This should provide the top 10, without the ranking (ordered by rating first, PollID second to ensure consistency):

Code:
Select
	PollID,
	sum(Results.Rating*Results.Votes)/sum(Results.Votes)
			as Simple_Average,
	( sum(Results.Votes )/( Sum(Results.Votes) + 6) ) * 
		sum(Results.Rating*Results.Votes)/sum(Results.Votes)
	+ (6 / ( Sum(Results.Votes) + 6) ) ) * Mean.mean_vote_across_dim
			as Bayesian_Estimate_Formula
From
	TableName Results,
	(
		Select
			sum(Rating*Votes)/sum(Votes)
				as mean_vote_across_dim
		From
			TableName
	) Mean
ORDER BY
	( sum(Results.Votes )/( Sum(Results.Votes) + 6) ) * 
		sum(Results.Rating*Results.Votes)/sum(Results.Votes)
	+ (6 / ( Sum(Results.Votes) + 6) ) ) * Mean.mean_vote_across_dim DESC,
	PollID
limit 10

I've recently started a job with mySQL 5.1 and I haven't got access to an editor at home, so this may not work. Thanks for the link to the Bayesian formula; I've never seen that before.

Regards,

Joshua

Edit: I should add that your minimum value of 6 is hardcoded into the formula; if you wish to make the query easier to edit you could add in another scalar subquery in the From clause {e.g. (Select 6 as Min_Value From TableName limit 1) Min_Value}. Also, TableName is, obviously, the name of table.
 
While I'm at it...

Hello again,

This should do the whole caboodle, including returning those in the top 10 regardless of how many there are.

Code:
Select
	count(Distinct B.PollID) + 1 as Ranking,
	A.PollID,
	A. Bayesian_Estimate_Formula
From
(
Select
	PollID,
	( sum(Results.Votes )/( Sum(Results.Votes) + 6) ) * 
		sum(Results.Rating*Results.Votes)/sum(Results.Votes)
	+ (6 / ( Sum(Results.Votes) + 6) ) ) * Mean.mean_vote_across_dim
			as Bayesian_Estimate_Formula
From
	TableName Results,
	(
		Select
			sum(Rating*Votes)/sum(Votes)
				as mean_vote_across_dim
		From
			TableName
	) Mean
) A
LEFT OUTER JOIN
(
Select
	PollID,
	( sum(Results.Votes )/( Sum(Results.Votes) + 6) ) * 
		sum(Results.Rating*Results.Votes)/sum(Results.Votes)
	+ (6 / ( Sum(Results.Votes) + 6) ) ) * Mean.mean_vote_across_dim
			as Bayesian_Estimate_Formula
From
	TableName Results,
	(
		Select
			sum(Rating*Votes)/sum(Votes)
				as mean_vote_across_dim
		From
			TableName
	) Mean
) B
ON	A.Bayesian_Estimate_Formula < B.Bayesian_Estimate_Formula
GROUP BY
	A.PollID,
	A. Bayesian_Estimate_Formula
HAVING
	count(Distinct B.PollID) + 1 <= 10
ORDER BY
	A. Bayesian_Estimate_Formula DESC,
	A.PollID

It's not pretty, but it'll do unless you have many poll IDs. I know you weren't looking for the whole solution, but I find it's easier for me to hammer out the entire thing than to try and work out which part of something wasn't understood.

Regards,

Joshua
 
:cool:

Thank you very, very much, Joshua. I could not ask for better advice than a complete working sample. I see what you did with the handling of the formula using subqueries as well as the recent changes and it is VERY much appreciated. I can now take relevant parts and optimize for my needs since your query shows me the overall context.

I'm going to work on it today on my dev server and I will post a followup to let you know how it came out or if any questions along the way. Again, thank you for the above and beyond effort, you've inspired me.

Cheers.

-jim

ps: Congratulations on your new project! :)
 
After a few minor adjustments here is your query tested and working perfectly on my dev server using actual data:

Code:
Select
    count(Distinct B.PollID) + 1 as Ranking,
    A.PollID,
    round(A.Bayesian_Estimate_Formula,2) as Rating,
    A.Total_Votes as 'Total Votes'
From
(
Select
    PollID, sum(Results.votes) as Total_Votes,
    ( sum(Results.votes )/( Sum(Results.votes) + 6) ) * 
        sum(Results.answers*Results.votes)/sum(Results.votes)
    + (6 / ( Sum(Results.votes) + 6) )  * Mean.mean_vote_across_dim 
            as Bayesian_Estimate_Formula
From
    poll_answers as Results,
    (
        Select
            sum(answers*votes)/sum(votes)
                as mean_vote_across_dim
        From
            poll_answers as pa
    ) as Mean
    group by PollID
) as A
LEFT OUTER JOIN
(
Select
    PollID,
    ( sum(Results.votes )/( Sum(Results.votes) + 6) ) * 
        sum(Results.answers*Results.votes)/sum(Results.votes)
    + (6 / ( Sum(Results.votes) + 6) )  * Mean.mean_vote_across_dim
            as Bayesian_Estimate_Formula
From
    poll_answers Results,
    (
        Select
            sum(answers*votes)/sum(votes)
                as mean_vote_across_dim
        From
            poll_answers
    ) as Mean
    group by PollID
) as B
ON    A.Bayesian_Estimate_Formula < B.Bayesian_Estimate_Formula
GROUP BY
    A.PollID,
    A.Bayesian_Estimate_Formula
HAVING
    count(Distinct B.PollID) + 1 <= 10 
    and A.Total_Votes > 1
ORDER BY
    A.Bayesian_Estimate_Formula DESC,
    A.Total_Votes desc,
    A.PollID
LIMIT 10
I had to adjust a few minor syntax errors with the parenthesis, add group by statements in A and B sub queries and customized a tad, i.e. sort order, limit and rounding off of the rating. In the production version more information from other tables to create the report plus most static settings (min votes in Bayesian, min total votes and limit, sort order based on top/bottom report) will be replaced by variables.

I'm sure others will find this a great starting point for their projects, so once again - thank you. :cool:

Below is example result of a top 10 report (rating 1-5, 5=best) -- lookin' good...
 

Attachments

  • sql.gif
    sql.gif
    3.3 KB · Views: 4,904
You're very welcome. Looking at that list of my minor mistakes I see that IDEs are more helpful than I realised.

Regards,

Joshua

P.S. Thanks for the congratulations; it's more of a full time job than a new project. There are a team of three of us in a race against time; we have to replace the data warehouse before we go mad maintaining their legacy systems.
 
Register on MacRumors! This sidebar will go away, and you'll see fewer ads.