2012--Super-Bit Locality-Sensitive Hashing
Super-BitLocality-SensitiveHashing
JianqiuJi ,JianminLi ,ShuichengYan ,BoZhang ,QiTian
StateKeyLaboratoryofIntelligentTechnologyandSystems,
TsinghuaNationalLaboratoryforInformationScienceandTechnology(TNList),
DepartmentofComputerScienceandTechnology,
TsinghuaUniversity,Beijing100084,China
jijq10@,
{lijianmin,dcszb}@
DepartmentofElectricalandComputerEngineering,
NationalUniversityofSingapore,Singapore,117576
eleyans@nus.edu.sg
DepartmentofComputerScience,UniversityofTexasatSanAntonio,
OneUTSACircle,UniversityofTexasatSanAntonio,SanAntonio,TX78249-1644
qitian@cs.utsa.edu
Abstract
Sign-random-projectionlocality-sensitivehashing(SRP-LSH)isaprobabilisticdimensionreductionmethodwhichprovidesanunbiasedestimateofangularsim-ilarity,yetsuffersfromthelargevarianceofitsestimation.Inthiswork,wepro-posetheSuper-Bitlocality-sensitivehashing(SBLSH).Itiseasytoimplement,whichorthogonalizestherandomprojectionvectorsinbatches,anditistheoreti-callyguaranteedthatSBLSHalsoprovidesanunbiasedestimateofangularsim-ilarity,yetwithasmallervariancewhentheangletoestimateiswithin(0,π/2].Theextensiveexperimentsonrealdatawellvalidatethatgiventhesamelengthofbinarycode,SBLSHmayachievesigni cantmeansquarederrorreductioninestimatingpairwiseangularsimilarity.Moreover,SBLSHshowsthesuperiorityoverSRP-LSHinapproximatenearestneighbor(ANN)retrievalexperiments.1Introduction
Locality-sensitivehashing(LSH)methodaimstohashsimilardatasamplestothesamehashcodewithhighprobability[7,9].ThereexistvariouskindsofLSHforapproximatingdifferentdistancesorsimilarities,e.g.,bit-samplingLSH[9,7]forHammingdistanceand 1-distance,min-hash[2,5]forJaccardcoef cient.AmongthemaresomebinaryLSHschemes,whichgeneratebinarycodes.BinaryLSHapproximatesacertaindistanceorsimilarityoftwodatasamplesbycomputingtheHammingdistancebetweenthecorrespondingcompactbinarycodes.SincecomputingHammingdistanceinvolvesmainlybitwiseoperations,itismuchfasterthandirectlycomputingotherdis-tances,e.g.Euclidean,cosine,whichrequiremanyarithmeticoperations.Ontheotherhand,thestorageissubstantiallyreducedduetotheuseofcompactbinarycodes.Inlarge-scaleapplications
[20,11,5,17],e.g.near-duplicateimagedetection,objectandscenerecognition,etc.,weareoftenconfrontedwiththeintensivecomputingofdistancesorsimilaritiesbetweensamples,thenbinaryLSHmayactasascalablesolution.
1.1Locality-SensitiveHashingforAngularSimilarity
Formanydatarepresentations,thenaturalpairwisesimilarityisonlyrelatedwiththeanglebetweenthedata,e.g.,thenormalizedbag-of-wordsrepresentationfordocuments,images,andvideos,andthenormalizedhistogram-basedlocalfeatureslikeSIFT[18].Inthesecases,angularsimilarity
a,b canserveasasimilaritymeasurement,whichisde nedassim(a,b)=1 cos 1( )/π.Here
a,b denotestheinnerproductofaandb,and · denotesthe 2-normofavector.
OnepopularLSHforapproximatingangularsimilarityisthesign-random-projectionLSH(SRP-LSH)[3],whichprovidesanunbiasedestimateofangularsimilarityandisabinaryLSHmethod.Formally,inad-dimensionaldataspace,letvdenotearandomvectorsampledfromthenormaldistributionN(0,Id),andxdenoteadatasample,thenanSRP-LSHfunctionisde nedashv(x)=sgn(vTx),wherethesignfunctionsgn(·)isde nedas
1,z≥0sgn(z)=0,z<0
a,b Giventwodatasamplesa,b,letθa,b=cos 1( ),thenitcanbeproventhat[3]
Pr(hv(a)=hv(b))=θa,b
Thispropertywellexplainstheessenceoflocality-sensitive,andalsorevealstherelationbetweenHammingdistanceandangularsimilarity.
ByindependentlysamplingKd-dimensionalvectorsv1,...,vKfromthenormaldistributionN(0,Id),wemayde neafunctionh(x)=(hv1(x),hv2(x),...,hvK(x)),whichconsistsofKSRP-LSHfunctionsandthusproducesK-bitcodes.Thenitiseasytoprovethat
E[dHamming(h(a),h(b))]=Kθa,b
=Cθa,b
Thatis,theexpectationoftheHammingdistancebetweenthebinaryhashcodesoftwogivendatasamplesaandbisanunbiasedestimateoftheirangleθa,b,uptoaconstantscalefactorC=K/π.ThusSRP-LSHprovidesanunbiasedestimateofangularsimilarity.
SincedHamming(h(a),h(b))followsabinomialdistribution,i.e.dHamming(h(a),h(b))~
Kθa,bθa,bθitsvarianceisThisimpliesthatthevarianceofB(K,a,b
),(1 ).
dHamming(h(a),h(b))/K,i.e.Var[dHamming(h(a),h(b))/K],satis es
Var[dHamming(h(a),h(b))/K]=θa,b
(1 θa,b
)
Thoughbeingwidelyused,SRP-LSHsuffersfromthelargevarianceofitsestimation,whichleadstolargeestimationerror.Generallyweneedasubstantiallylongcodetoaccuratelyapproximatetheangularsimilarity[22,12,21].Thereasonisthatanytwooftherandomvectorsmaybeclosetobeinglinearlydependent.Thustheresultingbinarycodemaybelessinformativeasitseems,andevencontainsmanyredundantbits.Anintuitiveideawouldbetoorthogonalizetherandomvectors.However,oncebeingorthogonalized,therandomvectorscannolongerbeviewedasindependentlysampled.Moreover,itremainsunclearwhethertheresultingHammingdistanceisstillanunbiasedestimateoftheangleθa,bmultipliedbyaconstant,terwewillgiveanswerswiththeoreticaljusti cationstothesetwoquestions.
Inthenextsection,basedontheaboveintuitiveidea,weproposetheso-calledSuper-Bitlocality-sensitivehashing(SBLSH)method.Weprovidetheoreticalguaranteesthatafterorthogonalizingtherandomprojectionvectorsinbatches,westillgetanunbiasedestimateofangularsimilarity,yetwithasmallervariancewhenθa,b∈(0,π/2],andthustheresultingbinarycodeismoreinformative.Ex-perimentsonrealdatashowtheeffectivenessofSBLSH,whichwiththesamelengthofbinarycodemayachieveasmuchas30%meansquarederror(MSE)reductioncomparedwiththeSRP-LSHinestimatingangularsimilarityonrealdata.Moreover,SBLSHperformsbestamongseveralwidelyuseddata-independentLSHmethodsinapproximatenearestneighbor(ANN)retrievalexperiments.2Super-BitLocality-SensitiveHashing
TheproposedSBLSHisfoundedonSRP-LSH.WhenthecodelengthKsatis es1<K≤d,wheredisthedimensionofdataspace,wecanorthogonalizeN(1≤N≤min(K,d)=K)oftherandomvectorssampledfromthenormaldistributionN(0,Id).Theorthogonalizationprocedure
v12
相关推荐:
- [资格考试]石油钻采专业设备项目可行性研究报告编
- [资格考试]2012-2013学年度第二学期麻风病防治知
- [资格考试]道路勘测设计 绪论
- [资格考试]控烟戒烟知识培训资料
- [资格考试]建设工程安全生产管理(三类人员安全员
- [资格考试]photoshop制作茶叶包装盒步骤平面效果
- [资格考试]授课进度计划表封面(09-10下施工)
- [资格考试]麦肯锡卓越工作方法读后感
- [资格考试]2007年广西区农村信用社招聘考试试题
- [资格考试]软件实施工程师笔试题
- [资格考试]2014年初三数学复习专练第一章 数与式(
- [资格考试]中国糯玉米汁饮料市场发展概况及投资战
- [资格考试]塑钢门窗安装((专项方案)15)
- [资格考试]初中数学答题卡模板2
- [资格考试]2015-2020年中国效率手册行业市场调查
- [资格考试]华北电力大学学习实践活动领导小组办公
- [资格考试]溃疡性结肠炎研究的新进展
- [资格考试]人教版高中语文1—5册(必修)背诵篇目名
- [资格考试]ISO9001-2018质量管理体系最新版标准
- [资格考试]论文之希尔顿酒店集团进入中国的战略研
- 全国中小学生转学申请表
- 《奇迹暖暖》17-支2文学少女小满(9)公
- 2019-2020学年八年级地理下册 第六章
- 2005年高考试题——英语(天津卷)
- 无纺布耐磨测试方法及标准
- 建筑工程施工劳动力安排计划
- (目录)中国中央空调行业市场深度调研分
- 中国期货价格期限结构模型实证分析
- AutoCAD 2016基础教程第2章 AutoCAD基
- 2014-2015学年西城初三期末数学试题及
- 机械加工工艺基础(完整版)
- 归因理论在管理中的应用[1]0
- 突破瓶颈 实现医院可持续发展
- 2014年南京师范大学商学院决策学招生目
- 现浇箱梁支架预压报告
- Excel_2010函数图表入门与实战
- 人教版新课标初中数学 13.1 轴对称 (
- Visual Basic 6.0程序设计教程电子教案
- 2010北京助理工程师考试复习《建筑施工
- 国外5大医疗互联网模式分析




